Numpy - multidimensional data arrays¶
Based on lectures from http://github.com/jrjohansson/scientific-python-lectures
The numpy
package (module) is used in almost all numerical computation using Python. It is a package that provide high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran so when calculations are vectorized (formulated with vectors and matrices), performance is very good.
from numpy import *
Creating numpy
arrays¶
There are a number of ways to initialize new numpy arrays, for example from
- a Python list or tuples
- using functions that are dedicated to generating numpy arrays, such as
arange
,linspace
,zeros
,ones
, etc. - reading data from files
M = array([[1,2],[3,4],[5,6.+1j]],order='F') # from list
M.shape, M.size, M.dtype, M.strides#, M.order, M.strides
#help(M)
((3, 2), 6, dtype('complex128'), (16, 48))
automatic typecasting:
array([[1,2],[3,4],[5,6]])
array(((1,2),(3,5.0+0j)))
array([[1.+0.j, 2.+0.j], [3.+0.j, 5.+0.j]])
forcing array dtype
M = array([[1,2],[3,4]], dtype=complex128) # from list
print(M)
print(M.shape, M.size, M.dtype,'\n')
N = array([[1,2],[3,4]])
print(N)
print(N.shape,N.size,N.dtype,'\n')
N = array([[1.0,2.0],[3.0,4.5]],dtype=int)
print(N)
print(N.shape,N.size,N.dtype,'\n')
[[1.+0.j 2.+0.j] [3.+0.j 4.+0.j]] (2, 2) 4 complex128 [[1 2] [3 4]] (2, 2) 4 int64 [[1 2] [3 4]] (2, 2) 4 int64
arange
is just like range
but for floating point numbers
arange(1,10,0.1)
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9])
array(range(10,100,1))/10.
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9])
linspace
is very similar, except the boundary of the interval are included, and we need to give number of points, rather than step size.
linspace(1,10,91)
array([ 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10. ])
logspace
gives logarithmic mesh, but notice that it starts at $10^{n_0}$ and stops at $10^{n_1}$, with $n_2$ points.
logspace(-3,2,20)
array([1.00000000e-03, 1.83298071e-03, 3.35981829e-03, 6.15848211e-03, 1.12883789e-02, 2.06913808e-02, 3.79269019e-02, 6.95192796e-02, 1.27427499e-01, 2.33572147e-01, 4.28133240e-01, 7.84759970e-01, 1.43844989e+00, 2.63665090e+00, 4.83293024e+00, 8.85866790e+00, 1.62377674e+01, 2.97635144e+01, 5.45559478e+01, 1.00000000e+02])
Most common initialization of array with zeros
:
Z=zeros((3,3,2),dtype=complex)
Z.shape, Z.size, Z.dtype
((3, 3, 2), 18, dtype('complex128'))
Z=ones((3,3),dtype=float)*1j
print(Z)
Z.shape, Z.size, Z.dtype
[[0.+1.j 0.+1.j 0.+1.j] [0.+1.j 0.+1.j 0.+1.j] [0.+1.j 0.+1.j 0.+1.j]]
((3, 3), 9, dtype('complex128'))
automatic typecasting
works in a peculiar way (for efficiency)
Z=ones((3,3), dtype=int)
print(Z.dtype,'\n')
# Z[0]=1j gives an error
Z[0,0]=3.5 # works but does not produce what one would expect
print(Z)
print(Z.dtype,'\n')
Z = Z*1.0 # here we convert Z to floating point array
Z[0,0]=3.5
print(Z)
print(Z.dtype,'\n')
int64 [[3 1 1] [1 1 1] [1 1 1]] int64 [[3.5 1. 1. ] [1. 1. 1. ] [1. 1. 1. ]] float64
more on typecasting
of arrays
Z=ones((3,3))
#Z *= 1j # issues an error, but would not work
Z = Z*1j
Z.shape, Z.size, Z.dtype
((3, 3), 9, dtype('complex128'))
Random number generators¶
from numpy import random
print( log(1-random.rand()) ) # uniformly distributed random number between [0,1]. Just one number
print( random.rand(5,5) ) # uniform distributed numbers, but now entire matrix (5x5). This is more efficient
print( random.randn(5,5) ) # standard normal distribution random matrix
-0.17337513784243797 [[0.56881674 0.36756604 0.92727077 0.92041113 0.7611761 ] [0.9857488 0.10278334 0.16468976 0.9570681 0.82738266] [0.04235813 0.79173131 0.90325538 0.7884061 0.56916988] [0.7154723 0.93741704 0.74372263 0.45958584 0.11067024] [0.82480486 0.6314464 0.1675178 0.86774818 0.65639374]] [[-0.24886888 0.39185147 0.32858017 0.18511517 0.51619927] [ 0.38582314 1.44976021 1.56871706 -0.11409876 0.80015954] [-0.33282485 1.27906909 -0.35642492 1.89100805 -0.83537094] [ 1.02517392 -0.23555296 -0.75769643 -0.12255801 -0.87359381] [-0.48332535 0.90830106 1.33670182 0.35061041 -1.95512178]]
File I/O¶
Very common form is comma-separated values (CSV) or tab-separated values (TSV). To read data from such files into Numpy arrays we can use the numpy.loadtxt
or numpy.genfromtxt
File stockholm_td_adj.dat.txt
contains Stockholm temperature over the years. The columns are [$year$,$month$,$day$,$T_{average}^{day}$,$T_{min}^{day}$,$T_{max}^{day}$]
## Check if file exists
!tail stockholm_td_adj.dat.txt
2011 12 22 -0.4 -1.0 -1.0 1 2011 12 23 3.7 3.1 3.1 1 2011 12 24 3.2 2.6 2.6 1 2011 12 25 4.2 3.5 3.5 1 2011 12 26 8.2 7.5 7.5 1 2011 12 27 8.3 7.6 7.6 1 2011 12 28 2.6 1.9 1.9 1 2011 12 29 4.9 4.2 4.2 1 2011 12 30 0.6 -0.1 -0.1 1 2011 12 31 -2.6 -3.3 -3.3 1
data = loadtxt('stockholm_td_adj.dat.txt')
data.shape
(77431, 7)
# inline figures from matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import pylab as plt
# time in years when we have year/month/day
t = data[:,0]+data[:,1]/12.+data[:,2]/365
print(t[:5])
print(t[-5:])
[1800.08607306 1800.08881279 1800.09155251 1800.09429224 1800.09703196] [2012.0739726 2012.07671233 2012.07945205 2012.08219178 2012.08493151]
Plot average daily temperature versus time
plt.plot(t, data[:,4])
plt.show()
# a bit more extended in x-direction
plt.figure(figsize=(14,4))
plt.plot(t, data[:,4])
plt.title('temperature in Stockholm')
plt.xlabel('year')
plt.ylabel('temperature $[\degree C]$')
plt.show()
Lets save the data in the form [t,$T_{average}$]
a=[1,2,3]
b=[4,5,6]
vstack((a,b)).T
array([[1, 4], [2, 5], [3, 6]])
vstack((t,data[:,4])).T.shape
(77431, 2)
data.shape
(77431, 7)
savetxt('StockholmT.dat', vstack((t,data[:,4])).T)
!tail StockholmT.dat
2.012060273972602772e+03 -1.000000000000000000e+00 2.012063013698630130e+03 3.100000000000000089e+00 2.012065753424657487e+03 2.600000000000000089e+00 2.012068493150684844e+03 3.500000000000000000e+00 2.012071232876712429e+03 7.500000000000000000e+00 2.012073972602739786e+03 7.599999999999999645e+00 2.012076712328767144e+03 1.899999999999999911e+00 2.012079452054794501e+03 4.200000000000000178e+00 2.012082191780821859e+03 -1.000000000000000056e-01 2.012084931506849216e+03 -3.299999999999999822e+00
More efficient binary storage of data to the disc¶
save('ST_data.npy',data)
!ls -lt
total 87056 -rw-r--r-- 1 haule staff 4336264 Oct 1 12:50 ST_data.npy -rw-r--r-- 1 haule staff 3891176 Oct 1 12:49 StockholmT.dat -rw-r--r-- 1 haule staff 353931 Oct 1 12:46 02_Numpy.ipynb -rw-r--r-- 1 haule staff 135957 Oct 1 11:47 02_Numpy_with_solution.ipynb -rw-r--r--@ 1 haule staff 2650749 Oct 1 11:35 stockholm_daily_mean_temperature.csv -rw-r--r-- 1 haule staff 121276 Sep 26 15:28 01_Basic_Python_with_solution2.ipynb -rw-r--r-- 1 haule staff 752 Sep 26 15:12 mymodule.py drwxr-xr-x 6 haule staff 192 Sep 26 00:17 __pycache__ -rw-r--r-- 1 haule staff 57228 Sep 24 12:27 00_Introduction.ipynb -rw-r--r--@ 1 haule staff 484607 Sep 24 01:10 01_Basic_Python.html -rw-r--r-- 1 haule staff 116394 Sep 24 01:08 01_Basic_Python.ipynb drwxr-xr-x 3 haule staff 96 Sep 24 01:06 old2 -rw-r--r-- 1 haule staff 873384 Sep 24 01:02 01_Basic_Python_with_solution.ipynb drwxr-xr-x 3 haule staff 96 Sep 24 00:19 anaconda_projects -rw-r--r-- 1 haule staff 249995 Nov 26 2024 05_Atom_in_LDA.ipynb -rw-r--r--@ 1 haule staff 551848 Nov 18 2024 05_Atom_in_LDA.html -rw-r--r-- 1 haule staff 963711 Nov 11 2024 04_Scipy_Hydrogen_atom_with_solution.ipynb -rw-r--r-- 1 haule staff 656013 Nov 10 2024 04_Scipy_Hydrogen_atom.ipynb -rw-r--r-- 1 haule staff 21328528 Nov 4 2024 03_Scipy.ipynb -rw-r--r-- 1 haule staff 2591 Oct 24 2024 pendulum.py -rw-r--r--@ 1 haule staff 733718 Oct 24 2024 double_pendulum.mp4 -rw-r--r-- 1 haule staff 63 Oct 7 2024 my_out.txt -rw-r--r--@ 1 haule staff 2864946 Sep 3 2024 stockholm_td_adj.dat drwxr-xr-x 6 haule staff 192 Sep 3 2024 old drwxr-xr-x 6 haule staff 192 Mar 18 2024 img drwxr-xr-x 4 haule staff 128 Mar 4 2021 backup drwxr-xr-x 9 haule staff 288 Mar 3 2021 tmp -rw-r--r-- 1 haule staff 3327 Mar 3 2021 excor.py -rw-r--r--@ 1 haule staff 2864984 Feb 22 2021 stockholm_td_adj.dat.txt
data2=load('ST_data.npy')
allclose(data,data2) # how close are the two sets of data
True
import numpy as np
np.max(abs(data-data2))
0.0
amax(abs(data-data2))
0.0
help(ravel)
Help on _ArrayFunctionDispatcher in module numpy: ravel(a, order='C') Return a contiguous flattened array. A 1-D array, containing the elements of the input, is returned. A copy is made only if needed. As of NumPy 1.10, the returned array will have the same type as the input array. (for example, a masked array will be returned for a masked array input) Parameters ---------- a : array_like Input array. The elements in `a` are read in the order specified by `order`, and packed as a 1-D array. order : {'C','F', 'A', 'K'}, optional The elements of `a` are read using this index order. 'C' means to index the elements in row-major, C-style order, with the last axis index changing fastest, back to the first axis index changing slowest. 'F' means to index the elements in column-major, Fortran-style order, with the first index changing fastest, and the last index changing slowest. Note that the 'C' and 'F' options take no account of the memory layout of the underlying array, and only refer to the order of axis indexing. 'A' means to read the elements in Fortran-like index order if `a` is Fortran *contiguous* in memory, C-like order otherwise. 'K' means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, 'C' index order is used. Returns ------- y : array_like y is a contiguous 1-D array of the same subtype as `a`, with shape ``(a.size,)``. Note that matrices are special cased for backward compatibility, if `a` is a matrix, then y is a 1-D ndarray. See Also -------- ndarray.flat : 1-D iterator over an array. ndarray.flatten : 1-D array copy of the elements of an array in row-major order. ndarray.reshape : Change the shape of an array without changing its data. Notes ----- In row-major, C-style order, in two dimensions, the row index varies the slowest, and the column index the quickest. This can be generalized to multiple dimensions, where row-major order implies that the index along the first axis varies slowest, and the index along the last quickest. The opposite holds for column-major, Fortran-style index ordering. When a view is desired in as many cases as possible, ``arr.reshape(-1)`` may be preferable. However, ``ravel`` supports ``K`` in the optional ``order`` argument while ``reshape`` does not. Examples -------- It is equivalent to ``reshape(-1, order=order)``. >>> x = np.array([[1, 2, 3], [4, 5, 6]]) >>> np.ravel(x) array([1, 2, 3, 4, 5, 6]) >>> x.reshape(-1) array([1, 2, 3, 4, 5, 6]) >>> np.ravel(x, order='F') array([1, 4, 2, 5, 3, 6]) When ``order`` is 'A', it will preserve the array's 'C' or 'F' ordering: >>> np.ravel(x.T) array([1, 4, 2, 5, 3, 6]) >>> np.ravel(x.T, order='A') array([1, 2, 3, 4, 5, 6]) When ``order`` is 'K', it will preserve orderings that are neither 'C' nor 'F', but won't reverse axes: >>> a = np.arange(3)[::-1]; a array([2, 1, 0]) >>> a.ravel(order='C') array([2, 1, 0]) >>> a.ravel(order='K') array([2, 1, 0]) >>> a = np.arange(12).reshape(2,3,2).swapaxes(1,2); a array([[[ 0, 2, 4], [ 1, 3, 5]], [[ 6, 8, 10], [ 7, 9, 11]]]) >>> a.ravel(order='C') array([ 0, 2, 4, 1, 3, 5, 6, 8, 10, 7, 9, 11]) >>> a.ravel(order='K') array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Display which years are included in the datafile
#print(set(data[:,0])) # display year for all datapoints
print(set(array(data[:,0],dtype=int))) # display year for all datapoints
array(data[100::365,0],dtype=int) # the years with 365 spacings, and then last years
{1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011}
array([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011])
Fancy indexing Index is itself an array of integer numbers, i.e, which rows or columns?
help(loadtxt)
Help on function loadtxt in module numpy: loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None, *, quotechar=None, like=None) Load data from a text file. Parameters ---------- fname : file, str, pathlib.Path, list of str, generator File, filename, list, or generator to read. If the filename extension is ``.gz`` or ``.bz2``, the file is first decompressed. Note that generators must return bytes or strings. The strings in a list or produced by a generator are treated as lines. dtype : data-type, optional Data-type of the resulting array; default: float. If this is a structured data-type, the resulting array will be 1-dimensional, and each row will be interpreted as an element of the array. In this case, the number of columns used must match the number of fields in the data-type. comments : str or sequence of str or None, optional The characters or list of characters used to indicate the start of a comment. None implies no comments. For backwards compatibility, byte strings will be decoded as 'latin1'. The default is '#'. delimiter : str, optional The character used to separate the values. For backwards compatibility, byte strings will be decoded as 'latin1'. The default is whitespace. .. versionchanged:: 1.23.0 Only single character delimiters are supported. Newline characters cannot be used as the delimiter. converters : dict or callable, optional Converter functions to customize value parsing. If `converters` is callable, the function is applied to all columns, else it must be a dict that maps column number to a parser function. See examples for further details. Default: None. .. versionchanged:: 1.23.0 The ability to pass a single callable to be applied to all columns was added. skiprows : int, optional Skip the first `skiprows` lines, including comments; default: 0. usecols : int or sequence, optional Which columns to read, with 0 being the first. For example, ``usecols = (1,4,5)`` will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read. .. versionchanged:: 1.11.0 When a single column has to be read it is possible to use an integer instead of a tuple. E.g ``usecols = 3`` reads the fourth column the same way as ``usecols = (3,)`` would. unpack : bool, optional If True, the returned array is transposed, so that arguments may be unpacked using ``x, y, z = loadtxt(...)``. When used with a structured data-type, arrays are returned for each field. Default is False. ndmin : int, optional The returned array will have at least `ndmin` dimensions. Otherwise mono-dimensional axes will be squeezed. Legal values: 0 (default), 1 or 2. .. versionadded:: 1.6.0 encoding : str, optional Encoding used to decode the inputfile. Does not apply to input streams. The special value 'bytes' enables backward compatibility workarounds that ensures you receive byte arrays as results if possible and passes 'latin1' encoded strings to converters. Override this value to receive unicode arrays and pass strings as input to converters. If set to None the system default is used. The default value is 'bytes'. .. versionadded:: 1.14.0 max_rows : int, optional Read `max_rows` rows of content after `skiprows` lines. The default is to read all the rows. Note that empty rows containing no data such as empty lines and comment lines are not counted towards `max_rows`, while such lines are counted in `skiprows`. .. versionadded:: 1.16.0 .. versionchanged:: 1.23.0 Lines containing no data, including comment lines (e.g., lines starting with '#' or as specified via `comments`) are not counted towards `max_rows`. quotechar : unicode character or None, optional The character used to denote the start and end of a quoted item. Occurrences of the delimiter or comment characters are ignored within a quoted item. The default value is ``quotechar=None``, which means quoting support is disabled. If two consecutive instances of `quotechar` are found within a quoted field, the first is treated as an escape character. See examples. .. versionadded:: 1.23.0 like : array_like, optional Reference object to allow the creation of arrays which are not NumPy arrays. If an array-like passed in as ``like`` supports the ``__array_function__`` protocol, the result will be defined by it. In this case, it ensures the creation of an array object compatible with that passed in via this argument. .. versionadded:: 1.20.0 Returns ------- out : ndarray Data read from the text file. See Also -------- load, fromstring, fromregex genfromtxt : Load data with missing values handled as specified. scipy.io.loadmat : reads MATLAB data files Notes ----- This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values. Each row in the input text file must have the same number of values to be able to read all values. If all rows do not have same number of values, a subset of up to n columns (where n is the least number of values present in all rows) can be read by specifying the columns via `usecols`. .. versionadded:: 1.10.0 The strings produced by the Python float.hex method can be used as input for floats. Examples -------- >>> from io import StringIO # StringIO behaves like a file object >>> c = StringIO("0 1\n2 3") >>> np.loadtxt(c) array([[0., 1.], [2., 3.]]) >>> d = StringIO("M 21 72\nF 35 58") >>> np.loadtxt(d, dtype={'names': ('gender', 'age', 'weight'), ... 'formats': ('S1', 'i4', 'f4')}) array([(b'M', 21, 72.), (b'F', 35, 58.)], dtype=[('gender', 'S1'), ('age', '<i4'), ('weight', '<f4')]) >>> c = StringIO("1,0,2\n3,0,4") >>> x, y = np.loadtxt(c, delimiter=',', usecols=(0, 2), unpack=True) >>> x array([1., 3.]) >>> y array([2., 4.]) The `converters` argument is used to specify functions to preprocess the text prior to parsing. `converters` can be a dictionary that maps preprocessing functions to each column: >>> s = StringIO("1.618, 2.296\n3.141, 4.669\n") >>> conv = { ... 0: lambda x: np.floor(float(x)), # conversion fn for column 0 ... 1: lambda x: np.ceil(float(x)), # conversion fn for column 1 ... } >>> np.loadtxt(s, delimiter=",", converters=conv) array([[1., 3.], [3., 5.]]) `converters` can be a callable instead of a dictionary, in which case it is applied to all columns: >>> s = StringIO("0xDE 0xAD\n0xC0 0xDE") >>> import functools >>> conv = functools.partial(int, base=16) >>> np.loadtxt(s, converters=conv) array([[222., 173.], [192., 222.]]) This example shows how `converters` can be used to convert a field with a trailing minus sign into a negative number. >>> s = StringIO('10.01 31.25-\n19.22 64.31\n17.57- 63.94') >>> def conv(fld): ... return -float(fld[:-1]) if fld.endswith(b'-') else float(fld) ... >>> np.loadtxt(s, converters=conv) array([[ 10.01, -31.25], [ 19.22, 64.31], [-17.57, 63.94]]) Using a callable as the converter can be particularly useful for handling values with different formatting, e.g. floats with underscores: >>> s = StringIO("1 2.7 100_000") >>> np.loadtxt(s, converters=float) array([1.e+00, 2.7e+00, 1.e+05]) This idea can be extended to automatically handle values specified in many different formats: >>> def conv(val): ... try: ... return float(val) ... except ValueError: ... return float.fromhex(val) >>> s = StringIO("1, 2.5, 3_000, 0b4, 0x1.4000000000000p+2") >>> np.loadtxt(s, delimiter=",", converters=conv, encoding=None) array([1.0e+00, 2.5e+00, 3.0e+03, 1.8e+02, 5.0e+00]) Note that with the default ``encoding="bytes"``, the inputs to the converter function are latin-1 encoded byte strings. To deactivate the implicit encoding prior to conversion, use ``encoding=None`` >>> s = StringIO('10.01 31.25-\n19.22 64.31\n17.57- 63.94') >>> conv = lambda x: -float(x[:-1]) if x.endswith('-') else float(x) >>> np.loadtxt(s, converters=conv, encoding=None) array([[ 10.01, -31.25], [ 19.22, 64.31], [-17.57, 63.94]]) Support for quoted fields is enabled with the `quotechar` parameter. Comment and delimiter characters are ignored when they appear within a quoted item delineated by `quotechar`: >>> s = StringIO('"alpha, #42", 10.0\n"beta, #64", 2.0\n') >>> dtype = np.dtype([("label", "U12"), ("value", float)]) >>> np.loadtxt(s, dtype=dtype, delimiter=",", quotechar='"') array([('alpha, #42', 10.), ('beta, #64', 2.)], dtype=[('label', '<U12'), ('value', '<f8')]) Quoted fields can be separated by multiple whitespace characters: >>> s = StringIO('"alpha, #42" 10.0\n"beta, #64" 2.0\n') >>> dtype = np.dtype([("label", "U12"), ("value", float)]) >>> np.loadtxt(s, dtype=dtype, delimiter=None, quotechar='"') array([('alpha, #42', 10.), ('beta, #64', 2.)], dtype=[('label', '<U12'), ('value', '<f8')]) Two consecutive quote characters within a quoted field are treated as a single escaped character: >>> s = StringIO('"Hello, my name is ""Monty""!"') >>> np.loadtxt(s, dtype="U", delimiter=",", quotechar='"') array('Hello, my name is "Monty"!', dtype='<U26') Read subset of columns when all rows do not contain equal number of values: >>> d = StringIO("1 2\n2 4\n3 9 12\n4 16 20") >>> np.loadtxt(d, usecols=(0, 1)) array([[ 1., 2.], [ 2., 4.], [ 3., 9.], [ 4., 16.]])
x=arange(1,100)
print(x[:10:2])
print(x[ [0,2,4,6,8] ]) # fancy indexing. Does not exists in C++
[1 3 5 7 9] [1 3 5 7 9]
data[[0,365,2*365,3*365,4*365,5*365+1,6*365+1],0] # fancy indexing for multidimensional array
array([1800., 1801., 1802., 1803., 1804., 1805., 1806.])
data[range(1,212*365,365),0] # leap year makes trouble
array([1800., 1801., 1802., 1803., 1804., 1805., 1806., 1807., 1808., 1808., 1809., 1810., 1811., 1812., 1813., 1814., 1815., 1816., 1817., 1818., 1819., 1820., 1821., 1822., 1823., 1824., 1825., 1826., 1827., 1828., 1829., 1830., 1831., 1832., 1833., 1834., 1835., 1836., 1837., 1838., 1839., 1840., 1841., 1842., 1843., 1844., 1845., 1846., 1847., 1848., 1849., 1850., 1851., 1852., 1853., 1854., 1855., 1856., 1857., 1858., 1859., 1860., 1861., 1862., 1863., 1864., 1865., 1866., 1867., 1868., 1869., 1870., 1871., 1872., 1873., 1874., 1875., 1876., 1877., 1878., 1879., 1880., 1881., 1882., 1883., 1884., 1885., 1886., 1887., 1888., 1889., 1890., 1891., 1892., 1893., 1894., 1895., 1896., 1897., 1898., 1899., 1900., 1901., 1902., 1903., 1904., 1905., 1906., 1907., 1908., 1909., 1910., 1911., 1912., 1913., 1914., 1915., 1916., 1917., 1918., 1919., 1920., 1921., 1922., 1923., 1924., 1925., 1926., 1927., 1928., 1929., 1930., 1931., 1932., 1933., 1934., 1935., 1936., 1937., 1938., 1939., 1940., 1941., 1942., 1943., 1944., 1945., 1946., 1947., 1948., 1949., 1950., 1951., 1952., 1953., 1954., 1955., 1956., 1957., 1958., 1959., 1960., 1961., 1962., 1963., 1964., 1965., 1966., 1967., 1968., 1969., 1970., 1971., 1972., 1973., 1974., 1975., 1976., 1977., 1978., 1979., 1980., 1981., 1982., 1983., 1984., 1985., 1986., 1987., 1988., 1989., 1990., 1991., 1992., 1993., 1994., 1995., 1996., 1997., 1998., 1999., 2000., 2001., 2002., 2003., 2004., 2005., 2006., 2007., 2008., 2009., 2010.])
Using mask to pick data*
In addition to fancy indexing, Python allows yet another way to pick the data.
Create a mask of [True,....False....]
values, and pick from the array only columns/rows where True
.
How to compute average temperature in the year of 1973?
data[:,0] >= 1973
data[:,0] < 1974
array([ True, True, True, ..., False, False, False])
# Create mask for the year 1973
mask = logical_and(1973<=data[:,0], data[:,0]<1974)
mask
array([False, False, False, ..., False, False, False])
data[mask,0] # All should have 1973
array([1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.])
T1973 = data[mask,4]
print('Average temperature in 1973=', sum(T1973)/len(T1973))
Average temperature in 1973= 6.646027397260275
where(mask)
(array([63187, 63188, 63189, 63190, 63191, 63192, 63193, 63194, 63195, 63196, 63197, 63198, 63199, 63200, 63201, 63202, 63203, 63204, 63205, 63206, 63207, 63208, 63209, 63210, 63211, 63212, 63213, 63214, 63215, 63216, 63217, 63218, 63219, 63220, 63221, 63222, 63223, 63224, 63225, 63226, 63227, 63228, 63229, 63230, 63231, 63232, 63233, 63234, 63235, 63236, 63237, 63238, 63239, 63240, 63241, 63242, 63243, 63244, 63245, 63246, 63247, 63248, 63249, 63250, 63251, 63252, 63253, 63254, 63255, 63256, 63257, 63258, 63259, 63260, 63261, 63262, 63263, 63264, 63265, 63266, 63267, 63268, 63269, 63270, 63271, 63272, 63273, 63274, 63275, 63276, 63277, 63278, 63279, 63280, 63281, 63282, 63283, 63284, 63285, 63286, 63287, 63288, 63289, 63290, 63291, 63292, 63293, 63294, 63295, 63296, 63297, 63298, 63299, 63300, 63301, 63302, 63303, 63304, 63305, 63306, 63307, 63308, 63309, 63310, 63311, 63312, 63313, 63314, 63315, 63316, 63317, 63318, 63319, 63320, 63321, 63322, 63323, 63324, 63325, 63326, 63327, 63328, 63329, 63330, 63331, 63332, 63333, 63334, 63335, 63336, 63337, 63338, 63339, 63340, 63341, 63342, 63343, 63344, 63345, 63346, 63347, 63348, 63349, 63350, 63351, 63352, 63353, 63354, 63355, 63356, 63357, 63358, 63359, 63360, 63361, 63362, 63363, 63364, 63365, 63366, 63367, 63368, 63369, 63370, 63371, 63372, 63373, 63374, 63375, 63376, 63377, 63378, 63379, 63380, 63381, 63382, 63383, 63384, 63385, 63386, 63387, 63388, 63389, 63390, 63391, 63392, 63393, 63394, 63395, 63396, 63397, 63398, 63399, 63400, 63401, 63402, 63403, 63404, 63405, 63406, 63407, 63408, 63409, 63410, 63411, 63412, 63413, 63414, 63415, 63416, 63417, 63418, 63419, 63420, 63421, 63422, 63423, 63424, 63425, 63426, 63427, 63428, 63429, 63430, 63431, 63432, 63433, 63434, 63435, 63436, 63437, 63438, 63439, 63440, 63441, 63442, 63443, 63444, 63445, 63446, 63447, 63448, 63449, 63450, 63451, 63452, 63453, 63454, 63455, 63456, 63457, 63458, 63459, 63460, 63461, 63462, 63463, 63464, 63465, 63466, 63467, 63468, 63469, 63470, 63471, 63472, 63473, 63474, 63475, 63476, 63477, 63478, 63479, 63480, 63481, 63482, 63483, 63484, 63485, 63486, 63487, 63488, 63489, 63490, 63491, 63492, 63493, 63494, 63495, 63496, 63497, 63498, 63499, 63500, 63501, 63502, 63503, 63504, 63505, 63506, 63507, 63508, 63509, 63510, 63511, 63512, 63513, 63514, 63515, 63516, 63517, 63518, 63519, 63520, 63521, 63522, 63523, 63524, 63525, 63526, 63527, 63528, 63529, 63530, 63531, 63532, 63533, 63534, 63535, 63536, 63537, 63538, 63539, 63540, 63541, 63542, 63543, 63544, 63545, 63546, 63547, 63548, 63549, 63550, 63551]),)
# where tells you the index where True
indices = where(mask)
X1973 = data[indices,4] # This gives similar data in 1973, but not identical
print(T1973.shape, X1973.shape, X1973[0].shape)
print('Average temperature in 1973=', sum(X1973[0,:])/len(X1973[0,:]))
print('Min/Max temperature in 1973=', min(X1973[0,:]),'/',max(X1973[0,:]))
(365,) (1, 365) (365,) Average temperature in 1973= 6.646027397260275 Min/Max temperature in 1973= -13.4 / 24.6
What is the mean monthly temperatures for each month of the year?
Let's do Ferbrurary first
Febr=data[:,1]==2
mean(data[Febr,4])
-3.6189076332052785
Now loop for all months
monthly_mean=[mean(data[data[:,1]==month,4]) for month in range(1,13)]
print(monthly_mean)
[-3.435696895922094, -3.6189076332052785, -1.2675593426658551, 3.467468553459119, 8.956664637857576, 14.134669811320755, 16.85421485088253, 15.774163116250763, 11.661477987421383, 6.635316494217895, 1.725896226415094, -1.495815581253804]
plt.bar(range(1,13),monthly_mean);
plt.xlabel('month')
plt.ylabel('average temperature')
plt.show()
Linear Algebra¶
It is implemented in low level fortran/C code, and is much more efficient than code written in Python.
A = random.rand(3,3)
print(A * A) # It is not matrix-matrix product, but element-wise product
print()
print(multiply(A,A))
print()
print(A @ A) # It is a matrix product
print()
print(matmul(A,A))
print()
print(dot(A,A))
[[0.36093564 0.49112738 0.3454516 ] [0.0326742 0.15172822 0.07930084] [0.38569598 0.13538831 0.06647908]] [[0.36093564 0.49112738 0.3454516 ] [0.0326742 0.15172822 0.07930084] [0.38569598 0.13538831 0.06647908]] [[0.85263275 0.91027256 0.70200131] [0.35389569 0.38202238 0.28854073] [0.5997486 0.67342732 0.53511529]] [[0.85263275 0.91027256 0.70200131] [0.35389569 0.38202238 0.28854073] [0.5997486 0.67342732 0.53511529]] [[0.85263275 0.91027256 0.70200131] [0.35389569 0.38202238 0.28854073] [0.5997486 0.67342732 0.53511529]]
Matrix product or matrix-vector product can be performed by dot
command
v1 = random.rand(3)
print(v1)
print( dot(A,v1) ) # matrix.vector product
print( dot(v1,v1) ) # length of vector^2
[0.90384217 0.11632482 0.88106607] [1.1423783 0.45680155 0.83129789] 1.6067395569997096
print(A@v1)
print(v1 @ v1)
[1.1423783 0.45680155 0.83129789] 1.6067395569997096
# still works, but risky code.
w=A*v1 # w[i,j]=A[i,j]*v1[j]
sum(w,axis=1) # sum_j w[i,j]
array([1.1423783 , 0.45680155, 0.83129789])
Slightly less efficient, but nicer code can be obtained by matrix
clas
A = random.rand(3,3)
v1 = random.rand(3)
M = matrix(A)
v = matrix(v1).T # create a column vector
M*M # this is now matrix-matrix product
matrix([[1.03557775, 0.57695675, 0.45415059], [1.39786258, 0.84282684, 0.56473361], [1.64605465, 0.83554251, 0.79854281]])
print(M)
print(M.T)
[[0.68090218 0.34319693 0.28130346] [0.84773167 0.3296405 0.54175668] [0.99896053 0.81812572 0.27259252]] [[0.68090218 0.84773167 0.99896053] [0.34319693 0.3296405 0.81812572] [0.28130346 0.54175668 0.27259252]]
M*v # this is matrix*vector product
matrix([[0.62384864], [0.83362863], [1.07026551]])
v.T * M # vector*matrix product
matrix([[1.54101839, 0.92098224, 0.69212031]])
v1 @ A
array([1.54101839, 0.92098224, 0.69212031])
v.T * v # inner-product
matrix([[1.20375134]])
v1 @ v1
1.2037513419762569
Array/Matrix transformations
.T
ortranspose(M)
transposes matrix.H
hermitian conjugateconjugate(M)
,M.conjugate()
, orM.conj()
conjugates the matrixreal(M)
andimag(M)
takes real and imaginary part of the matrix
print(M.H)
print()
print(M.T)
[[0.68090218 0.84773167 0.99896053] [0.34319693 0.3296405 0.81812572] [0.28130346 0.54175668 0.27259252]] [[0.68090218 0.84773167 0.99896053] [0.34319693 0.3296405 0.81812572] [0.28130346 0.54175668 0.27259252]]
A.T.conj()
array([[0.68090218, 0.84773167, 0.99896053], [0.34319693, 0.3296405 , 0.81812572], [0.28130346, 0.54175668, 0.27259252]])
More advanced linear algebra operations¶
Library linalg
:
linalg.det(A)
linalg.inv(A)
or justM.I
linalg.eig
,linalg.eigvals
,linalg.eigh
linalg.svd()
linalg.solve()
linalg.cholesky()
from numpy import linalg
print( linalg.det(A) )
linalg.inv(A)
-0.01168067780018246
array([[ -2.80442504, -3.94824272, 5.77661937], [ 16.77305126, 34.85703472, -35.79937377], [ -9.5978337 , -23.33691224, 23.57726314]])
M.I
matrix([[ -2.80442504, -3.94824272, 5.77661937], [ 16.77305126, 34.85703472, -35.79937377], [ -9.5978337 , -23.33691224, 23.57726314]])
linalg.inv(A)
array([[ -2.80442504, -3.94824272, 5.77661937], [ 16.77305126, 34.85703472, -35.79937377], [ -9.5978337 , -23.33691224, 23.57726314]])
The eigenvalue problem for a matrix $A$:
$\displaystyle A v_n = \lambda_n v_n$
where $v_n$ is the $n$th eigenvector and $\lambda_n$ is the $n$th eigenvalue.
Recall: For complex matrix, symmetric means A.H == A
, and for real means A.T==A
Symmetric matrix has real eigenvalues, and eigenvectors are unitary (U.H @ U = 1
). These are very useful properties because eigenvalues and eigenvectors can be obtained by higher accuracy in symmetric case, and also more efficiently.
To calculate eigenvalues of a matrix, use the eigvals
(symmetric/hermitian eigvalsh
) and for calculating both eigenvalues and eigenvectors, use the function eig
(or eigh
):
AH=0.5*(A+A.T)
linalg.eigvals(AH)
array([ 1.19430407, -0.58615652, 0.04387795])
linalg.eigvalsh(AH)
array([-0.58615652, 0.04387795, 1.19430407])
linalg.eig(AH)
EigResult(eigenvalues=array([ 1.19430407, -0.58615652, 0.04387795]), eigenvectors=array([[ 0.64441319, 0.75863323, -0.0959545 ], [ 0.43585536, -0.46750729, -0.76906895], [ 0.62830069, -0.45377589, 0.63192222]]))
linalg.eigh(AH)
EighResult(eigenvalues=array([-0.58615652, 0.04387795, 1.19430407]), eigenvectors=array([[-0.75863323, -0.0959545 , -0.64441319], [ 0.46750729, -0.76906895, -0.43585536], [ 0.45377589, 0.63192222, -0.62830069]]))
# Checking that diagonalization works:
w, v = linalg.eig(A)
# \sum_j A[i,j]*v[j,k] = v[i,k] * w[k]
allclose(A @ v , v * w)
True
# Checking that diagonalization works:
w, v = linalg.eigh(AH)
# \sum_j A[i,j]*v[j,k] = v[i,k] * w[k]
allclose(AH @ v , v * w)
True
sum(abs(AH@v-v*w))
1.4181364416110398e-15
#A = array([[1,2,3], [4,5,6], [7,8,9]])
A = array([[1,0.1,0.3],[0.4,1,0.9],[-0.3,-0.4,1]])
b = array([1,2,3])
x = linalg.solve(A,b) # A @ x==b
print(x)
allclose(A @ x, b)
[ 0.21722846 -0.62172285 2.8164794 ]
True
# solve is more efficient, unless the RHS is very very large.
# But result should be equivalent, when the matrix is non-singular
Ai = linalg.inv(A)
x = Ai @ b
print(x)
allclose(A @ x, b)
[ 0.21722846 -0.62172285 2.8164794 ]
True
linalg.det(A)
1.335
SVD is the workhorse behind tensor network methods (DMRG, MPS, PEPS).
The idea is to truncate the Hilbert space, keeping only a few largest singular values. Because physics is usually local, and long range entanglement is small, this is very successful in 1D and even 2D.
Example: given two subsystems, we try to write the vawe function as product of two independent wave functions. This would correspond to taking a single singular value. In such settings, singular values encode entanglement spectrum between subsystems.
D = np.random.randn(5, 6)
# Perform SVD
U, S, Vh = np.linalg.svd(D, full_matrices=False)
print("U shape:", U.shape)
print("S shape:", S.shape)
print("Vh shape:", Vh.shape)
print("singular values=", S)
# Reconstruct D
D_reconstructed = U @ np.diag(S) @ Vh
# Check accuracy
print("Reconstruction error:", np.linalg.norm(D - D_reconstructed))
print("Allclose?", np.allclose(D, D_reconstructed))
U shape: (5, 5) S shape: (5,) Vh shape: (5, 6) singular values= [3.28859941 2.73183573 1.70949554 1.1226227 0.47027764] Reconstruction error: 7.477371771289063e-15 Allclose? True
sum, cumsum, trace, diag¶
v1
array([0.26246902, 0.22844303, 0.44831921])
print( sum(v1) )
print( cumsum(v1) )
print( trace(A) )
print( diag(A) )
print( sum(diag(A)) )
0.9392312628310859 [0.26246902 0.49091205 0.93923126] 3.0 [1. 1. 1.] 3.0
diag([1,2,3])
array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
diag(A)
array([1., 1., 1.])
Reshaping, resizing, and stacking arrays¶
A = array([[1,2,3], [4,5,6], [7,8,9]])
print(A.shape)
Ag = reshape(A, (9,1)) # this is not new data
print(Ag.shape)
(3, 3) (9, 1)
Ag[0]=10
A # we change A when we change Ag
array([[10, 2, 3], [ 4, 5, 6], [ 7, 8, 9]])
Ax = A.flatten() # flatten creates 1D array of all data, but creates a copy
print(Ax)
Ax[8]=100 # changing a copy
print(A)
[ 1. 0.1 0.3 0.4 1. 0.9 -0.3 -0.4 1. ] [[ 1. 0.1 0.3] [ 0.4 1. 0.9] [-0.3 -0.4 1. ]]
Ay=A.ravel()
print(Ay)
Ay[8]=100 # changing original data, hence ravel does not copy, and is more efficient.
print(A)
[ 1. 0.1 0.3 0.4 1. 0.9 -0.3 -0.4 1. ] [[ 1. 0.1 0.3] [ 0.4 1. 0.9] [ -0.3 -0.4 100. ]]
Vectorizing functions¶
Every function written in Python is very slow. However numpy type operations are fast, because they are written in fortran/C
from numpy import *
data = loadtxt('stockholm_td_adj.dat.txt')
Temp = data[:,3]
Temp**2 # this is fast, written in C
array([ 37.21, 237.16, 225. , ..., 24.01, 0.36, 6.76])
array([Temp[i]**2 for i in range(len(Temp))]) # This is slow, written in Python
array([ 37.21, 237.16, 225. , ..., 24.01, 0.36, 6.76])
What if we have a function that can not simply work on arrays?
For example, theta function?
def Theta(x):
"For positive numbers returns 1, for negative 0"
if x>=0:
return 1
else:
return 0
# Does not work on array
Theta(Temp)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[90], line 2 1 # Does not work on array ----> 2 Theta(Temp) Cell In[89], line 3, in Theta(x) 1 def Theta(x): 2 "For positive numbers returns 1, for negative 0" ----> 3 if x>=0: 4 return 1 5 else: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
We can vectorize Theta, to make it applicable to arrays.
This is simply achieved by call to numpy function vectorize
, which will create low-level routine from your function
Theta_vec = vectorize(Theta)
# This is very fast now, and creates 0 or ones
positive_temperatures=Theta_vec(Temp)
positive_temperatures
array([0, 0, 0, ..., 1, 1, 0])
How to calculate number of days in a year with positive temperatures?
# Boolean array to select data with positive temperatures
positives = array(positive_temperatures, dtype=bool)
# keeps data with positive temperatures only
kept_years = data[positives,0]
# now we just need to check how many of these data are in each year
kept_years
array([1800., 1800., 1800., ..., 2011., 2011., 2011.])
kept_years[0],kept_years[-1]
(1800.0, 2011.0)
years = list(range(1800,2013,1))
print(years)
[1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
help(histogram)
Help on _ArrayFunctionDispatcher in module numpy: histogram(a, bins=10, range=None, density=None, weights=None) Compute the histogram of a dataset. Parameters ---------- a : array_like Input data. The histogram is computed over the flattened array. bins : int or sequence of scalars or str, optional If `bins` is an int, it defines the number of equal-width bins in the given range (10, by default). If `bins` is a sequence, it defines a monotonically increasing array of bin edges, including the rightmost edge, allowing for non-uniform bin widths. .. versionadded:: 1.11.0 If `bins` is a string, it defines the method used to calculate the optimal bin width, as defined by `histogram_bin_edges`. range : (float, float), optional The lower and upper range of the bins. If not provided, range is simply ``(a.min(), a.max())``. Values outside the range are ignored. The first element of the range must be less than or equal to the second. `range` affects the automatic bin computation as well. While bin width is computed to be optimal based on the actual data within `range`, the bin count will fill the entire range including portions containing no data. weights : array_like, optional An array of weights, of the same shape as `a`. Each value in `a` only contributes its associated weight towards the bin count (instead of 1). If `density` is True, the weights are normalized, so that the integral of the density over the range remains 1. density : bool, optional If ``False``, the result will contain the number of samples in each bin. If ``True``, the result is the value of the probability *density* function at the bin, normalized such that the *integral* over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability *mass* function. Returns ------- hist : array The values of the histogram. See `density` and `weights` for a description of the possible semantics. bin_edges : array of dtype float Return the bin edges ``(length(hist)+1)``. See Also -------- histogramdd, bincount, searchsorted, digitize, histogram_bin_edges Notes ----- All but the last (righthand-most) bin is half-open. In other words, if `bins` is:: [1, 2, 3, 4] then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes* 4. Examples -------- >>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3]) (array([0, 2, 1]), array([0, 1, 2, 3])) >>> np.histogram(np.arange(4), bins=np.arange(5), density=True) (array([0.25, 0.25, 0.25, 0.25]), array([0, 1, 2, 3, 4])) >>> np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0,1,2,3]) (array([1, 4, 1]), array([0, 1, 2, 3])) >>> a = np.arange(5) >>> hist, bin_edges = np.histogram(a, density=True) >>> hist array([0.5, 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 0. , 0.5]) >>> hist.sum() 2.4999999999999996 >>> np.sum(hist * np.diff(bin_edges)) 1.0 .. versionadded:: 1.11.0 Automated Bin Selection Methods example, using 2 peak random data with 2000 points: >>> import matplotlib.pyplot as plt >>> rng = np.random.RandomState(10) # deterministic random data >>> a = np.hstack((rng.normal(size=1000), ... rng.normal(loc=5, scale=2, size=1000))) >>> _ = plt.hist(a, bins='auto') # arguments are passed to np.histogram >>> plt.title("Histogram with 'auto' bins") Text(0.5, 1.0, "Histogram with 'auto' bins") >>> plt.show()
hist = histogram(kept_years, bins=years)
shape(hist[0]),shape(hist[1])
((212,), (213,))
# inline figures from matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(hist[1][:-1], hist[0])
plt.show()
negative = array(1-Theta_vec(Temp), dtype=bool)
kept2 = data[negative,0]
hist2 = histogram(kept2, bins=years)
plt.plot(hist2[1][:-1], hist2[0])
plt.show()
Homework 4¶
Download the data file stockholm_daily_mean_temperature.csv
from
which contains data starting in 1756 till this year. And plot the number of days with positive (in Celsius) temperature over the years.
Note that some dates in which data is missing have very negative temperature. These data should be discarted, and average should be computed on the rest of the data.
datT = loadtxt('stockholm_daily_mean_temperature.csv',delimiter=',',skiprows=1,usecols=[1,2,3])
datd = loadtxt('stockholm_daily_mean_temperature.csv',delimiter=',',skiprows=1,usecols=[0],dtype=str)
datd_ =array([[int(x) for x in dt.split('-')] for dt in datd])
datd = datd_