Numpy - multidimensional data arrays¶

Based on lectures from http://github.com/jrjohansson/scientific-python-lectures

The numpy package (module) is used in almost all numerical computation using Python. It is a package that provide high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran so when calculations are vectorized (formulated with vectors and matrices), performance is very good.

In [1]:
from numpy import *

Creating numpy arrays¶

There are a number of ways to initialize new numpy arrays, for example from

  • a Python list or tuples
  • using functions that are dedicated to generating numpy arrays, such as arange, linspace, zeros, ones, etc.
  • reading data from files
In [2]:
M = array([[1,2],[3,4],[5,6]])  # from list
M.shape, M.size, M.dtype
Out[2]:
((3, 2), 6, dtype('int64'))

automatic typecasting:

In [3]:
array([[1,2],[3,4],[5,6]])

array(((1,2),(3,5.0+0j)))
Out[3]:
array([[1.+0.j, 2.+0.j],
       [3.+0.j, 5.+0.j]])

forcing array dtype

In [4]:
M = array([[1,2],[3,4]], dtype=complex)  # from list
print(M)
print(M.shape, M.size, M.dtype,'\n')
N = array([[1,2],[3,4]])
print(N)
print(N.shape,N.size,N.dtype,'\n')

N = array([[1.0,2.0],[3.0,4.0]],dtype=int)
print(N)
print(N.shape,N.size,N.dtype,'\n')
[[1.+0.j 2.+0.j]
 [3.+0.j 4.+0.j]]
(2, 2) 4 complex128 

[[1 2]
 [3 4]]
(2, 2) 4 int64 

[[1 2]
 [3 4]]
(2, 2) 4 int64 

arange is just like range but for floating point numbers

In [5]:
arange(1,10,0.1)
Out[5]:
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2,
       2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5,
       3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8,
       4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1,
       6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4,
       7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7,
       8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9])
In [6]:
array(range(10,100,1))/10.
Out[6]:
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2,
       2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5,
       3.6, 3.7, 3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8,
       4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. , 6.1,
       6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2, 7.3, 7.4,
       7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7,
       8.8, 8.9, 9. , 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9])

linspace is very similar, except the boundary of the interval are included, and we need to give number of points, rather than step size.

In [7]:
linspace(1,10,91)
Out[7]:
array([ 1. ,  1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,
        2.1,  2.2,  2.3,  2.4,  2.5,  2.6,  2.7,  2.8,  2.9,  3. ,  3.1,
        3.2,  3.3,  3.4,  3.5,  3.6,  3.7,  3.8,  3.9,  4. ,  4.1,  4.2,
        4.3,  4.4,  4.5,  4.6,  4.7,  4.8,  4.9,  5. ,  5.1,  5.2,  5.3,
        5.4,  5.5,  5.6,  5.7,  5.8,  5.9,  6. ,  6.1,  6.2,  6.3,  6.4,
        6.5,  6.6,  6.7,  6.8,  6.9,  7. ,  7.1,  7.2,  7.3,  7.4,  7.5,
        7.6,  7.7,  7.8,  7.9,  8. ,  8.1,  8.2,  8.3,  8.4,  8.5,  8.6,
        8.7,  8.8,  8.9,  9. ,  9.1,  9.2,  9.3,  9.4,  9.5,  9.6,  9.7,
        9.8,  9.9, 10. ])

logspace gives logarithmic mesh, but notice that it starts at $10^{n_0}$ and stops at $10^{n_1}$, with $n_2$ points.

In [8]:
logspace(-3,2,20)
Out[8]:
array([1.00000000e-03, 1.83298071e-03, 3.35981829e-03, 6.15848211e-03,
       1.12883789e-02, 2.06913808e-02, 3.79269019e-02, 6.95192796e-02,
       1.27427499e-01, 2.33572147e-01, 4.28133240e-01, 7.84759970e-01,
       1.43844989e+00, 2.63665090e+00, 4.83293024e+00, 8.85866790e+00,
       1.62377674e+01, 2.97635144e+01, 5.45559478e+01, 1.00000000e+02])

Most common initialization of array with zeros:

In [9]:
Z=zeros((3,3,2),dtype=complex)

Z.shape, Z.size, Z.dtype
Out[9]:
((3, 3, 2), 18, dtype('complex128'))
In [10]:
Z=ones((3,3),dtype=float)*1j
print(Z)
Z.shape, Z.size, Z.dtype
[[0.+1.j 0.+1.j 0.+1.j]
 [0.+1.j 0.+1.j 0.+1.j]
 [0.+1.j 0.+1.j 0.+1.j]]
Out[10]:
((3, 3), 9, dtype('complex128'))

automatic typecasting works in a peculiar way (for efficiency)

In [11]:
Z=ones((3,3), dtype=int)
print(Z.dtype,'\n')

# Z[0]=1j gives an error
Z[0,0]=3.5  # works but does not produce what one would expect
print(Z)
print(Z.dtype,'\n')

Z = Z*1.0
print(Z)
print(Z.dtype,'\n')
int64 

[[3 1 1]
 [1 1 1]
 [1 1 1]]
int64 

[[3. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
float64 

more on typecasting of arrays

In [12]:
Z=ones((3,3))
#Z *= 1j   # issues an error, but would not work
Z = Z*1j
Z.shape, Z.size, Z.dtype
Out[12]:
((3, 3), 9, dtype('complex128'))

Random number generators¶

In [13]:
from numpy import random

print( random.rand() )  # uniformly distributed random number between [0,1]. Just one number

print( random.rand(5,5) ) # uniform distributed numbers, but now entire matrix (5x5). This is more efficient

print( random.randn(5,5) ) # standard normal distribution random matrix
0.4424572947069897
[[0.53843306 0.99147563 0.08242172 0.70803025 0.45144027]
 [0.30352543 0.95592698 0.15632569 0.03031062 0.75697192]
 [0.15529967 0.23346317 0.46019387 0.82794052 0.46406339]
 [0.41557102 0.15009526 0.17569651 0.82212964 0.47945423]
 [0.24442611 0.00845499 0.90505182 0.8835122  0.87457195]]
[[ 1.31996955 -0.90868666  0.44797821 -0.12203896 -0.86072223]
 [-0.21062149 -0.44791792  0.49514057  2.01554688  0.11847693]
 [ 2.14604074  1.30075337  1.35743156 -0.10303435 -0.03941411]
 [ 1.22709541 -0.25760708  0.84345144  1.217953   -0.01290997]
 [-3.18835727  0.06320338 -0.6347911   0.8671392   0.02415385]]

File I/O¶

Very common form is comma-separated values (CSV) or tab-separated values (TSV). To read data from such files into Numpy arrays we can use the numpy.loadtxt or numpy.genfromtxt

File stockholm_td_adj.dat.txt contains Stockholm temperature over the years. The columns are [$year$,$month$,$day$,$T_{average}^{day}$,$T_{min}^{day}$,$T_{max}^{day}$]

In [14]:
## Check if file exists
!tail stockholm_td_adj.dat.txt
2011 12 22    -0.4    -1.0    -1.0 1
2011 12 23     3.7     3.1     3.1 1
2011 12 24     3.2     2.6     2.6 1
2011 12 25     4.2     3.5     3.5 1
2011 12 26     8.2     7.5     7.5 1
2011 12 27     8.3     7.6     7.6 1
2011 12 28     2.6     1.9     1.9 1
2011 12 29     4.9     4.2     4.2 1
2011 12 30     0.6    -0.1    -0.1 1
2011 12 31    -2.6    -3.3    -3.3 1
In [15]:
data = loadtxt('stockholm_td_adj.dat.txt')
data.shape
Out[15]:
(77431, 7)
In [16]:
# inline figures from matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
In [17]:
import pylab as plt
In [18]:
# time in years when we have year/month/day
t = data[:,0]+data[:,1]/12.+data[:,2]/365
In [19]:
print(t[:5])
print(t[-5:])
[1800.08607306 1800.08881279 1800.09155251 1800.09429224 1800.09703196]
[2012.0739726  2012.07671233 2012.07945205 2012.08219178 2012.08493151]

Plot average daily temperature versus time

In [20]:
plt.plot(t, data[:,3]);
No description has been provided for this image
In [21]:
# a bit more extended in x-direction
plt.figure(figsize=(14,4))
plt.plot(t, data[:,3])
plt.title('temperature in Stockholm')
plt.xlabel('year')
plt.ylabel('temperature $[\degree C]$');
No description has been provided for this image

Lets save the data in the form [t,$T_{average}$]

In [22]:
a=[1,2,3]
b=[4,5,6]
vstack((a,b)).T
Out[22]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [23]:
vstack((t,data[:,3])).T.shape
Out[23]:
(77431, 2)
In [24]:
data.shape
Out[24]:
(77431, 7)
In [25]:
savetxt('StockholmT.dat', vstack((t,data[:,3])).T)
In [26]:
!tail StockholmT.dat
2.012060273972602772e+03 -4.000000000000000222e-01
2.012063013698630130e+03 3.700000000000000178e+00
2.012065753424657487e+03 3.200000000000000178e+00
2.012068493150684844e+03 4.200000000000000178e+00
2.012071232876712429e+03 8.199999999999999289e+00
2.012073972602739786e+03 8.300000000000000711e+00
2.012076712328767144e+03 2.600000000000000089e+00
2.012079452054794501e+03 4.900000000000000355e+00
2.012082191780821859e+03 5.999999999999999778e-01
2.012084931506849216e+03 -2.600000000000000089e+00

More efficient binary storage of data to the disc¶

In [27]:
save('ST_data.npy',data)
!ls -lt
total 49656
-rw-r--r--  1 haule  staff  4336264 Feb 27 15:28 ST_data.npy
-rw-r--r--  1 haule  staff  3889864 Feb 27 15:28 StockholmT.dat
-rw-r--r--  1 haule  staff   418009 Feb 27 15:28 02_Numpy.ipynb
-rw-r--r--  1 haule  staff    68512 Feb 20 16:06 01_Basic_Python_with_solution.ipynb
-rw-r--r--@ 1 haule  staff  2634919 Feb 20 15:33 stockholm_daily_mean_temperature.csv
-rw-r--r--@ 1 haule  staff  2864946 Feb 20 15:30 stockholm_td_adj.dat
drwxr-xr-x  4 haule  staff      128 Feb 15 17:00 __pycache__
-rw-r--r--  1 haule  staff      752 Feb 15 16:53 mymodule.py
-rw-r--r--  1 haule  staff       60 Feb 15 15:47 to_call_mymodule.py
-rw-r--r--@ 1 haule  staff   406027 Feb 15 14:51 01_Basic_Python.html
-rw-r--r--  1 haule  staff    59332 Feb 15 14:51 01_Basic_Python.ipynb
-rw-r--r--@ 1 haule  staff       64 Feb  6 16:21 my_out.txt
-rw-r--r--  1 haule  staff    48261 Feb  1 16:51 00_Introduction.ipynb
-rw-r--r--  1 haule  staff  1153901 Jan  6 16:18 03_Scipy.ipynb
-rw-r--r--  1 haule  staff      264 Jan  3 17:57 ttt.cc
-rw-r--r--  1 haule  staff  2864984 Jan  3 17:57 stockholm_td_adj.dat.txt
-rw-r--r--  1 haule  staff     2038 Jan  3 17:57 pendulum.py
-rw-r--r--@ 1 haule  staff   430767 Jan  3 17:57 double_pendulum.mp4
-rw-r--r--  1 haule  staff   533268 Jan  3 17:57 04_Scipy_example_Hydrogen_atom.ipynb
-rw-r--r--@ 1 haule  staff  1040661 Mar  4  2021 04_Scipy_example_Hydrogen_atom.html
-rw-r--r--@ 1 haule  staff  1303621 Mar  4  2021 03_Scipy.html
-rw-r--r--@ 1 haule  staff   627779 Jan 20  2021 02_Numpy.html
-rw-r--r--@ 1 haule  staff   611082 Jan 20  2021 00_Introduction.html
-rw-r--r--@ 1 haule  staff  1756407 Dec 18  2020 Scientific-Computing-with-Python.pdf
In [28]:
data2=load('ST_data.npy')
In [29]:
allclose(data,data2)  # how close are the two sets of data
Out[29]:
True
In [30]:
max(abs(data-data2).ravel())
Out[30]:
0.0
In [31]:
amax(abs(data-data2))
Out[31]:
0.0
In [32]:
help(ravel)
Help on _ArrayFunctionDispatcher in module numpy:

ravel(a, order='C')
    Return a contiguous flattened array.
    
    A 1-D array, containing the elements of the input, is returned.  A copy is
    made only if needed.
    
    As of NumPy 1.10, the returned array will have the same type as the input
    array. (for example, a masked array will be returned for a masked array
    input)
    
    Parameters
    ----------
    a : array_like
        Input array.  The elements in `a` are read in the order specified by
        `order`, and packed as a 1-D array.
    order : {'C','F', 'A', 'K'}, optional
    
        The elements of `a` are read using this index order. 'C' means
        to index the elements in row-major, C-style order,
        with the last axis index changing fastest, back to the first
        axis index changing slowest.  'F' means to index the elements
        in column-major, Fortran-style order, with the
        first index changing fastest, and the last index changing
        slowest. Note that the 'C' and 'F' options take no account of
        the memory layout of the underlying array, and only refer to
        the order of axis indexing.  'A' means to read the elements in
        Fortran-like index order if `a` is Fortran *contiguous* in
        memory, C-like order otherwise.  'K' means to read the
        elements in the order they occur in memory, except for
        reversing the data when strides are negative.  By default, 'C'
        index order is used.
    
    Returns
    -------
    y : array_like
        y is a contiguous 1-D array of the same subtype as `a`,
        with shape ``(a.size,)``.
        Note that matrices are special cased for backward compatibility,
        if `a` is a matrix, then y is a 1-D ndarray.
    
    See Also
    --------
    ndarray.flat : 1-D iterator over an array.
    ndarray.flatten : 1-D array copy of the elements of an array
                      in row-major order.
    ndarray.reshape : Change the shape of an array without changing its data.
    
    Notes
    -----
    In row-major, C-style order, in two dimensions, the row index
    varies the slowest, and the column index the quickest.  This can
    be generalized to multiple dimensions, where row-major order
    implies that the index along the first axis varies slowest, and
    the index along the last quickest.  The opposite holds for
    column-major, Fortran-style index ordering.
    
    When a view is desired in as many cases as possible, ``arr.reshape(-1)``
    may be preferable. However, ``ravel`` supports ``K`` in the optional
    ``order`` argument while ``reshape`` does not.
    
    Examples
    --------
    It is equivalent to ``reshape(-1, order=order)``.
    
    >>> x = np.array([[1, 2, 3], [4, 5, 6]])
    >>> np.ravel(x)
    array([1, 2, 3, 4, 5, 6])
    
    >>> x.reshape(-1)
    array([1, 2, 3, 4, 5, 6])
    
    >>> np.ravel(x, order='F')
    array([1, 4, 2, 5, 3, 6])
    
    When ``order`` is 'A', it will preserve the array's 'C' or 'F' ordering:
    
    >>> np.ravel(x.T)
    array([1, 4, 2, 5, 3, 6])
    >>> np.ravel(x.T, order='A')
    array([1, 2, 3, 4, 5, 6])
    
    When ``order`` is 'K', it will preserve orderings that are neither 'C'
    nor 'F', but won't reverse axes:
    
    >>> a = np.arange(3)[::-1]; a
    array([2, 1, 0])
    >>> a.ravel(order='C')
    array([2, 1, 0])
    >>> a.ravel(order='K')
    array([2, 1, 0])
    
    >>> a = np.arange(12).reshape(2,3,2).swapaxes(1,2); a
    array([[[ 0,  2,  4],
            [ 1,  3,  5]],
           [[ 6,  8, 10],
            [ 7,  9, 11]]])
    >>> a.ravel(order='C')
    array([ 0,  2,  4,  1,  3,  5,  6,  8, 10,  7,  9, 11])
    >>> a.ravel(order='K')
    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Manipulating data¶

Indexing and slicing¶

data[lower:upper:step, lower:upper:step]

Display which years are included in the datafile

In [33]:
print(set(data[:,0]))  # display year for all datapoints
array(data[::365,0],dtype=int) # the years with 365 spacings, and then last years
{1800.0, 1801.0, 1802.0, 1803.0, 1804.0, 1805.0, 1806.0, 1807.0, 1808.0, 1809.0, 1810.0, 1811.0, 1812.0, 1813.0, 1814.0, 1815.0, 1816.0, 1817.0, 1818.0, 1819.0, 1820.0, 1821.0, 1822.0, 1823.0, 1824.0, 1825.0, 1826.0, 1827.0, 1828.0, 1829.0, 1830.0, 1831.0, 1832.0, 1833.0, 1834.0, 1835.0, 1836.0, 1837.0, 1838.0, 1839.0, 1840.0, 1841.0, 1842.0, 1843.0, 1844.0, 1845.0, 1846.0, 1847.0, 1848.0, 1849.0, 1850.0, 1851.0, 1852.0, 1853.0, 1854.0, 1855.0, 1856.0, 1857.0, 1858.0, 1859.0, 1860.0, 1861.0, 1862.0, 1863.0, 1864.0, 1865.0, 1866.0, 1867.0, 1868.0, 1869.0, 1870.0, 1871.0, 1872.0, 1873.0, 1874.0, 1875.0, 1876.0, 1877.0, 1878.0, 1879.0, 1880.0, 1881.0, 1882.0, 1883.0, 1884.0, 1885.0, 1886.0, 1887.0, 1888.0, 1889.0, 1890.0, 1891.0, 1892.0, 1893.0, 1894.0, 1895.0, 1896.0, 1897.0, 1898.0, 1899.0, 1900.0, 1901.0, 1902.0, 1903.0, 1904.0, 1905.0, 1906.0, 1907.0, 1908.0, 1909.0, 1910.0, 1911.0, 1912.0, 1913.0, 1914.0, 1915.0, 1916.0, 1917.0, 1918.0, 1919.0, 1920.0, 1921.0, 1922.0, 1923.0, 1924.0, 1925.0, 1926.0, 1927.0, 1928.0, 1929.0, 1930.0, 1931.0, 1932.0, 1933.0, 1934.0, 1935.0, 1936.0, 1937.0, 1938.0, 1939.0, 1940.0, 1941.0, 1942.0, 1943.0, 1944.0, 1945.0, 1946.0, 1947.0, 1948.0, 1949.0, 1950.0, 1951.0, 1952.0, 1953.0, 1954.0, 1955.0, 1956.0, 1957.0, 1958.0, 1959.0, 1960.0, 1961.0, 1962.0, 1963.0, 1964.0, 1965.0, 1966.0, 1967.0, 1968.0, 1969.0, 1970.0, 1971.0, 1972.0, 1973.0, 1974.0, 1975.0, 1976.0, 1977.0, 1978.0, 1979.0, 1980.0, 1981.0, 1982.0, 1983.0, 1984.0, 1985.0, 1986.0, 1987.0, 1988.0, 1989.0, 1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0, 1998.0, 1999.0, 2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0, 2009.0, 2010.0, 2011.0}
Out[33]:
array([1800, 1801, 1802, 1803, 1804, 1804, 1805, 1806, 1807, 1808, 1809,
       1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820,
       1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831,
       1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842,
       1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853,
       1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864,
       1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875,
       1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886,
       1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897,
       1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908,
       1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919,
       1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930,
       1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941,
       1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952,
       1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963,
       1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974,
       1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
       1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
       1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007,
       2008, 2009, 2010, 2011])

Fancy indexing Index is itself an array of integer numbers, i.e, which rows or columns?

In [34]:
x=arange(1,100)
print(x[:10:2])
print(x[ [0,2,4,6,8] ])  # fancy indexing. Does not exists in C++
[1 3 5 7 9]
[1 3 5 7 9]
In [35]:
data[[0,365,2*365,3*365,4*365,5*365+1,6*365+1],0] # fancy indexing for multidimensional array
Out[35]:
array([1800., 1801., 1802., 1803., 1804., 1805., 1806.])
In [36]:
data[range(100,200*365,365),0] # leap year makes trouble
Out[36]:
array([1800., 1801., 1802., 1803., 1804., 1805., 1806., 1807., 1808.,
       1809., 1810., 1811., 1812., 1813., 1814., 1815., 1816., 1817.,
       1818., 1819., 1820., 1821., 1822., 1823., 1824., 1825., 1826.,
       1827., 1828., 1829., 1830., 1831., 1832., 1833., 1834., 1835.,
       1836., 1837., 1838., 1839., 1840., 1841., 1842., 1843., 1844.,
       1845., 1846., 1847., 1848., 1849., 1850., 1851., 1852., 1853.,
       1854., 1855., 1856., 1857., 1858., 1859., 1860., 1861., 1862.,
       1863., 1864., 1865., 1866., 1867., 1868., 1869., 1870., 1871.,
       1872., 1873., 1874., 1875., 1876., 1877., 1878., 1879., 1880.,
       1881., 1882., 1883., 1884., 1885., 1886., 1887., 1888., 1889.,
       1890., 1891., 1892., 1893., 1894., 1895., 1896., 1897., 1898.,
       1899., 1900., 1901., 1902., 1903., 1904., 1905., 1906., 1907.,
       1908., 1909., 1910., 1911., 1912., 1913., 1914., 1915., 1916.,
       1917., 1918., 1919., 1920., 1921., 1922., 1923., 1924., 1925.,
       1926., 1927., 1928., 1929., 1930., 1931., 1932., 1933., 1934.,
       1935., 1936., 1937., 1938., 1939., 1940., 1941., 1942., 1943.,
       1944., 1945., 1946., 1947., 1948., 1949., 1950., 1951., 1952.,
       1953., 1954., 1955., 1956., 1957., 1958., 1959., 1960., 1961.,
       1962., 1963., 1964., 1965., 1966., 1967., 1968., 1969., 1970.,
       1971., 1972., 1973., 1974., 1975., 1976., 1977., 1978., 1979.,
       1980., 1981., 1982., 1983., 1984., 1985., 1986., 1987., 1988.,
       1989., 1990., 1991., 1992., 1993., 1994., 1995., 1996., 1997.,
       1998., 1999.])

Using mask to pick data*

In addition to fancy indexing, Python allows yet another way to pick the data. Create a mask of [True,....False....] values, and pick from the array only columns/rows where True.

How to compute average temperature in the year of 1973?

In [37]:
data[:,0] >= 1973
data[:,0] < 1974
Out[37]:
array([ True,  True,  True, ..., False, False, False])
In [38]:
# Create mask for the year 1973
mask = logical_and(1973<=data[:,0], data[:,0]<1974)
In [39]:
mask
Out[39]:
array([False, False, False, ..., False, False, False])
In [40]:
data[mask,0]  # All should have 1973
Out[40]:
array([1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973., 1973.,
       1973., 1973., 1973., 1973., 1973.])
In [41]:
T1973 = data[mask,3]
print('Average temperature in 1973=', sum(T1973)/len(T1973))
Average temperature in 1973= 7.414794520547944
In [42]:
where(mask)
Out[42]:
(array([63187, 63188, 63189, 63190, 63191, 63192, 63193, 63194, 63195,
        63196, 63197, 63198, 63199, 63200, 63201, 63202, 63203, 63204,
        63205, 63206, 63207, 63208, 63209, 63210, 63211, 63212, 63213,
        63214, 63215, 63216, 63217, 63218, 63219, 63220, 63221, 63222,
        63223, 63224, 63225, 63226, 63227, 63228, 63229, 63230, 63231,
        63232, 63233, 63234, 63235, 63236, 63237, 63238, 63239, 63240,
        63241, 63242, 63243, 63244, 63245, 63246, 63247, 63248, 63249,
        63250, 63251, 63252, 63253, 63254, 63255, 63256, 63257, 63258,
        63259, 63260, 63261, 63262, 63263, 63264, 63265, 63266, 63267,
        63268, 63269, 63270, 63271, 63272, 63273, 63274, 63275, 63276,
        63277, 63278, 63279, 63280, 63281, 63282, 63283, 63284, 63285,
        63286, 63287, 63288, 63289, 63290, 63291, 63292, 63293, 63294,
        63295, 63296, 63297, 63298, 63299, 63300, 63301, 63302, 63303,
        63304, 63305, 63306, 63307, 63308, 63309, 63310, 63311, 63312,
        63313, 63314, 63315, 63316, 63317, 63318, 63319, 63320, 63321,
        63322, 63323, 63324, 63325, 63326, 63327, 63328, 63329, 63330,
        63331, 63332, 63333, 63334, 63335, 63336, 63337, 63338, 63339,
        63340, 63341, 63342, 63343, 63344, 63345, 63346, 63347, 63348,
        63349, 63350, 63351, 63352, 63353, 63354, 63355, 63356, 63357,
        63358, 63359, 63360, 63361, 63362, 63363, 63364, 63365, 63366,
        63367, 63368, 63369, 63370, 63371, 63372, 63373, 63374, 63375,
        63376, 63377, 63378, 63379, 63380, 63381, 63382, 63383, 63384,
        63385, 63386, 63387, 63388, 63389, 63390, 63391, 63392, 63393,
        63394, 63395, 63396, 63397, 63398, 63399, 63400, 63401, 63402,
        63403, 63404, 63405, 63406, 63407, 63408, 63409, 63410, 63411,
        63412, 63413, 63414, 63415, 63416, 63417, 63418, 63419, 63420,
        63421, 63422, 63423, 63424, 63425, 63426, 63427, 63428, 63429,
        63430, 63431, 63432, 63433, 63434, 63435, 63436, 63437, 63438,
        63439, 63440, 63441, 63442, 63443, 63444, 63445, 63446, 63447,
        63448, 63449, 63450, 63451, 63452, 63453, 63454, 63455, 63456,
        63457, 63458, 63459, 63460, 63461, 63462, 63463, 63464, 63465,
        63466, 63467, 63468, 63469, 63470, 63471, 63472, 63473, 63474,
        63475, 63476, 63477, 63478, 63479, 63480, 63481, 63482, 63483,
        63484, 63485, 63486, 63487, 63488, 63489, 63490, 63491, 63492,
        63493, 63494, 63495, 63496, 63497, 63498, 63499, 63500, 63501,
        63502, 63503, 63504, 63505, 63506, 63507, 63508, 63509, 63510,
        63511, 63512, 63513, 63514, 63515, 63516, 63517, 63518, 63519,
        63520, 63521, 63522, 63523, 63524, 63525, 63526, 63527, 63528,
        63529, 63530, 63531, 63532, 63533, 63534, 63535, 63536, 63537,
        63538, 63539, 63540, 63541, 63542, 63543, 63544, 63545, 63546,
        63547, 63548, 63549, 63550, 63551]),)
In [43]:
# where tells you the index where True
indices = where(mask)
X1973 = data[indices,3]; # This gives similar data in 1973, but not identical
In [44]:
print(T1973.shape, X1973.shape, X1973[0].shape)
print('Average temperature in 1973=', sum(X1973[0,:])/len(X1973[0,:]))
print('Min/Max temperature in 1973=', min(X1973[0,:]),'/',max(X1973[0,:]))
(365,) (1, 365) (365,)
Average temperature in 1973= 7.414794520547944
Min/Max temperature in 1973= -12.8 / 25.6

What is the mean monthly temperatures for each month of the year?

Let's do Ferbrurary first

In [45]:
Febr=data[:,1]==2
mean(data[Febr,3])
Out[45]:
-3.212109570736596

Now loop for all months

In [46]:
monthly_mean=[mean(data[data[:,1]==month,3]) for month in range(1,13)]
In [47]:
print(monthly_mean)
[-3.0447656725502132, -3.212109570736596, -0.8321515520389532, 3.888474842767295, 9.560970785149117, 14.659591194968554, 17.318837492391967, 16.117650639074864, 11.81937106918239, 6.760057821059039, 1.9458490566037738, -1.211275106512477]
In [48]:
plt.bar(range(1,13),monthly_mean);
plt.xlabel('month')
plt.ylabel('average temperature')
Out[48]:
Text(0, 0.5, 'average temperature')
No description has been provided for this image

Linear Algebra¶

It is implemented in low level fortran/C code, and is much more efficient than code written in Python.

In [49]:
A = random.rand(3,3)


print(A * A)  # It is not matrix-matrix product, but element-wise product
print()
print(multiply(A,A))
print()
print(A @ A)  # It is a matrix product
print()
print(matmul(A,A))
print()
print(dot(A,A))
[[0.04781561 0.11911882 0.32967908]
 [0.13517962 0.15164628 0.1180888 ]
 [0.22153632 0.0099833  0.25438493]]

[[0.04781561 0.11911882 0.32967908]
 [0.13517962 0.15164628 0.1180888 ]
 [0.22153632 0.0099833  0.25438493]]

[[0.44496253 0.26724194 0.53375191]
 [0.38531703 0.31287699 0.51824665]
 [0.37705099 0.25175105 0.5589718 ]]

[[0.44496253 0.26724194 0.53375191]
 [0.38531703 0.31287699 0.51824665]
 [0.37705099 0.25175105 0.5589718 ]]

[[0.44496253 0.26724194 0.53375191]
 [0.38531703 0.31287699 0.51824665]
 [0.37705099 0.25175105 0.5589718 ]]

Matrix product or matrix-vector product can be performed by dot command

In [50]:
v1 = random.rand(3)
print(v1)
print( dot(A,v1) ) # matrix.vector product
print( dot(v1,v1) ) # length of vector^2
[0.52212272 0.6438025  0.58804297]
[0.67401149 0.6447513  0.60666613]
1.0328883299904315
In [51]:
print(A@v1)
print(v1 @ v1)
[0.67401149 0.6447513  0.60666613]
1.0328883299904315
In [52]:
# still works, but risky code.
w=A*v1        # w[i,j]=A[i,j]*v1[j]
sum(w,axis=1) # sum_j w[i,j]
Out[52]:
array([0.67401149, 0.6447513 , 0.60666613])

Here (Feb 27, 2024):¶

Slightly less efficient, but nicer code can be obtained by matrix clas

In [172]:
A = random.rand(3,3)
v1 = random.rand(3)

M = matrix(A)
v = matrix(v1).T # create a column vector
In [173]:
M*M  # this is now matrix-matrix product
Out[173]:
matrix([[0.46845252, 0.70757384, 0.62198422],
        [0.78928309, 0.99595206, 0.44805275],
        [1.21562026, 1.77059845, 1.17874835]])
In [174]:
M.T
Out[174]:
matrix([[0.25241362, 0.65894841, 0.52189223],
        [0.20370712, 0.8991256 , 0.91255734],
        [0.51832041, 0.05839933, 0.9246341 ]])
In [175]:
M*v # this is matrix*vector product
Out[175]:
matrix([[0.46174393],
        [0.82389859],
        [1.02506369]])
In [176]:
v.T * M # vector*matrix product
Out[176]:
matrix([[0.57815665, 0.73951423, 0.80597931]])
In [178]:
v1 @ A
Out[178]:
array([0.57815665, 0.73951423, 0.80597931])
In [179]:
v.T * v # inner-product
Out[179]:
matrix([[0.96005595]])
In [180]:
v1 @ v1
Out[180]:
0.9600559468923451

Array/Matrix transformations

  • .T or transpose(M) transposes matrix
  • .H hermitian conjugate
  • conjugate(M), M.conjugate(), or M.conj() conjugates the matrix
  • real(M) and imag(M) takes real and imaginary part of the matrix
In [181]:
print(M.H)
print()
print(M.T)
[[0.25241362 0.65894841 0.52189223]
 [0.20370712 0.8991256  0.91255734]
 [0.51832041 0.05839933 0.9246341 ]]

[[0.25241362 0.65894841 0.52189223]
 [0.20370712 0.8991256  0.91255734]
 [0.51832041 0.05839933 0.9246341 ]]
In [184]:
A.T.conj()
Out[184]:
array([[0.25241362, 0.65894841, 0.52189223],
       [0.20370712, 0.8991256 , 0.91255734],
       [0.51832041, 0.05839933, 0.9246341 ]])

More advanced linear algebra operations¶

Library linalg:

  • linalg.det(A)
  • linalg.inv(A) or just M.I
  • linalg.eig, linalg.eigvals, linalg.eigh
  • linalg.svd()
  • linalg.solve()
  • linalg.cholesky()
In [185]:
from numpy import linalg
In [186]:
print( linalg.det(A) )
linalg.inv(A)
0.14694856921662117
Out[186]:
array([[ 5.29484196,  1.93702154, -3.0904608 ],
       [-3.93884753, -0.25258604,  2.2239456 ],
       [ 0.89882834, -0.84402819,  0.63096267]])
In [187]:
M.I
Out[187]:
matrix([[ 5.29484196,  1.93702154, -3.0904608 ],
        [-3.93884753, -0.25258604,  2.2239456 ],
        [ 0.89882834, -0.84402819,  0.63096267]])
In [188]:
linalg.inv(A)
Out[188]:
array([[ 5.29484196,  1.93702154, -3.0904608 ],
       [-3.93884753, -0.25258604,  2.2239456 ],
       [ 0.89882834, -0.84402819,  0.63096267]])

The eigenvalue problem for a matrix $A$:

$\displaystyle A v_n = \lambda_n v_n$

where $v_n$ is the $n$th eigenvector and $\lambda_n$ is the $n$th eigenvalue.

Recall: For complex matrix, symmetric means A.H == A, and for real means A.T==A

Symmetric matrix has real eigenvalues, and eigenvectors are unitary (U.H @ U = 1). These are very useful properties because eigenvalues and eigenvectors can be obtained by higher accuracy in symmetric case, and also more efficiently.

To calculate eigenvalues of a matrix, use the eigvals (symmetric/hermitian eigvalsh) and for calculating both eigenvalues and eigenvectors, use the function eig (or eigh):

In [194]:
linalg.eigvals(0.5*(A+A.T))
Out[194]:
array([ 1.70937488, -0.06392804,  0.43072648])
In [193]:
linalg.eigvalsh(0.5*(A+A.T))
Out[193]:
array([-0.06392804,  0.43072648,  1.70937488])
In [195]:
linalg.eig(A)
Out[195]:
EigResult(eigenvalues=array([1.61674392+0.j       , 0.2297147 +0.1952507j,
       0.2297147 -0.1952507j]), eigenvectors=array([[-0.37666674+0.j        , -0.65988359+0.j        ,
        -0.65988359-0.j        ],
       [-0.41333696+0.j        ,  0.60779456+0.20602859j,
         0.60779456-0.20602859j],
       [-0.82902034+0.j        , -0.20997328-0.32954949j,
        -0.20997328+0.32954949j]]))
In [196]:
# Checking that diagonalization works:

w, v = linalg.eig(A)

# \sum_j A[i,j]*v[j,k] = v[i,k] * w[k]

allclose(A @ v , v * w)
Out[196]:
True
In [201]:
sum(abs(A@v-v*w))
Out[201]:
2.4584446083990795e-15
In [206]:
#A = array([[1,2,3], [4,5,6], [7,8,9]])
A = array([[1,0.1,0.3],[0.4,1,0.9],[-0.3,-0.4,1]])
b = array([1,2,3])
x = linalg.solve(A,b) # A @ x==b
print(x)

allclose(A @ x, b)
[ 0.21722846 -0.62172285  2.8164794 ]
Out[206]:
True
In [207]:
# solve is more efficient, unless the RHS is very very large.
# But result should be equivalent, when the matrix is non-singular

Ai = linalg.inv(A)
x = Ai @ b

print(x)
allclose(A @ x, b)
[ 0.21722846 -0.62172285  2.8164794 ]
Out[207]:
True
In [208]:
linalg.det(A)
Out[208]:
1.335

sum, cumsum, trace, diag¶

In [209]:
v1
Out[209]:
array([0.87285132, 0.25284165, 0.36641183])
In [211]:
print( sum(v1) )
print( cumsum(v1) )
print( trace(A) )
print( diag(A) )
print( sum(diag(A)) )
1.49210479369816
[0.87285132 1.12569297 1.49210479]
3.0
[1. 1. 1.]
3.0
In [212]:
diag([1,2,3])
Out[212]:
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
In [213]:
diag(A)
Out[213]:
array([1., 1., 1.])

Reshaping, resizing, and stacking arrays¶

In [214]:
A = array([[1,2,3], [4,5,6], [7,8,9]])

print(A.shape)
Ag = reshape(A, (9,1))  # this is not new data
print(Ag.shape)
(3, 3)
(9, 1)
In [215]:
Ag[0]=10
A        # we change A when we change Ag
Out[215]:
array([[10,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9]])
In [216]:
Ax = A.flatten()  # flatten creates 1D array of all data, but creates a copy
print(Ax)
Ax[8]=100         # changing a copy
print(A)
[10  2  3  4  5  6  7  8  9]
[[10  2  3]
 [ 4  5  6]
 [ 7  8  9]]
In [217]:
Ay=A.ravel()
print(Ay)
Ay[8]=100   # changing original data, hence ravel does not copy, and is more efficient.
print(A)
[10  2  3  4  5  6  7  8  9]
[[ 10   2   3]
 [  4   5   6]
 [  7   8 100]]

Vectorizing functions¶

Every function written in Python is very slow. However numpy type operations are fast, because they are written in fortran/C

In [123]:
Temp = data[:,3]

Temp**2  #  this is fast, written in C
Out[123]:
array([ 37.21, 237.16, 225.  , ...,  24.01,   0.36,   6.76])
In [124]:
array([Temp[i]**2 for i in range(len(Temp))])  # This is slow, written in Python
Out[124]:
array([ 37.21, 237.16, 225.  , ...,  24.01,   0.36,   6.76])

What if we have a function that can not simply work on arrays?

For example, theta function?

In [125]:
def Theta(x):
    "For positive numbers returns 1, for negative 0"
    if x>=0:
        return 1
    else:
        return 0
In [126]:
# Does not work on array
Theta(Temp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[126], line 2
      1 # Does not work on array
----> 2 Theta(Temp)

Cell In[125], line 3, in Theta(x)
      1 def Theta(x):
      2     "For positive numbers returns 1, for negative 0"
----> 3     if x>=0:
      4         return 1
      5     else:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

We can vectorize Theta, to make it applicable to arrays.

This is simply achieved by call to numpy function vectorize, which will create low-level routine from your function

In [127]:
Theta_vec = vectorize(Theta)
In [128]:
# This is very fast now, and creates 0 or ones
positive_temperatures=Theta_vec(Temp)
positive_temperatures
Out[128]:
array([0, 0, 0, ..., 1, 1, 0])

How to calculate number of days in a year with positive temperatures?

In [129]:
# Boolean array to select data with positive temperatures
positives = array(positive_temperatures, dtype=bool)
# keeps data with positive temperatures only
kept = data[positives,0]
# now we just need to check how many of these data are in each year
In [151]:
years = list(range(1800,2013,1))
print(years)

help(histogram)
[1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
Help on _ArrayFunctionDispatcher in module numpy:

histogram(a, bins=10, range=None, density=None, weights=None)
    Compute the histogram of a dataset.
    
    Parameters
    ----------
    a : array_like
        Input data. The histogram is computed over the flattened array.
    bins : int or sequence of scalars or str, optional
        If `bins` is an int, it defines the number of equal-width
        bins in the given range (10, by default). If `bins` is a
        sequence, it defines a monotonically increasing array of bin edges,
        including the rightmost edge, allowing for non-uniform bin widths.
    
        .. versionadded:: 1.11.0
    
        If `bins` is a string, it defines the method used to calculate the
        optimal bin width, as defined by `histogram_bin_edges`.
    
    range : (float, float), optional
        The lower and upper range of the bins.  If not provided, range
        is simply ``(a.min(), a.max())``.  Values outside the range are
        ignored. The first element of the range must be less than or
        equal to the second. `range` affects the automatic bin
        computation as well. While bin width is computed to be optimal
        based on the actual data within `range`, the bin count will fill
        the entire range including portions containing no data.
    weights : array_like, optional
        An array of weights, of the same shape as `a`.  Each value in
        `a` only contributes its associated weight towards the bin count
        (instead of 1). If `density` is True, the weights are
        normalized, so that the integral of the density over the range
        remains 1.
    density : bool, optional
        If ``False``, the result will contain the number of samples in
        each bin. If ``True``, the result is the value of the
        probability *density* function at the bin, normalized such that
        the *integral* over the range is 1. Note that the sum of the
        histogram values will not be equal to 1 unless bins of unity
        width are chosen; it is not a probability *mass* function.
    
    Returns
    -------
    hist : array
        The values of the histogram. See `density` and `weights` for a
        description of the possible semantics.
    bin_edges : array of dtype float
        Return the bin edges ``(length(hist)+1)``.
    
    
    See Also
    --------
    histogramdd, bincount, searchsorted, digitize, histogram_bin_edges
    
    Notes
    -----
    All but the last (righthand-most) bin is half-open.  In other words,
    if `bins` is::
    
      [1, 2, 3, 4]
    
    then the first bin is ``[1, 2)`` (including 1, but excluding 2) and
    the second ``[2, 3)``.  The last bin, however, is ``[3, 4]``, which
    *includes* 4.
    
    
    Examples
    --------
    >>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
    (array([0, 2, 1]), array([0, 1, 2, 3]))
    >>> np.histogram(np.arange(4), bins=np.arange(5), density=True)
    (array([0.25, 0.25, 0.25, 0.25]), array([0, 1, 2, 3, 4]))
    >>> np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0,1,2,3])
    (array([1, 4, 1]), array([0, 1, 2, 3]))
    
    >>> a = np.arange(5)
    >>> hist, bin_edges = np.histogram(a, density=True)
    >>> hist
    array([0.5, 0. , 0.5, 0. , 0. , 0.5, 0. , 0.5, 0. , 0.5])
    >>> hist.sum()
    2.4999999999999996
    >>> np.sum(hist * np.diff(bin_edges))
    1.0
    
    .. versionadded:: 1.11.0
    
    Automated Bin Selection Methods example, using 2 peak random data
    with 2000 points:
    
    >>> import matplotlib.pyplot as plt
    >>> rng = np.random.RandomState(10)  # deterministic random data
    >>> a = np.hstack((rng.normal(size=1000),
    ...                rng.normal(loc=5, scale=2, size=1000)))
    >>> _ = plt.hist(a, bins='auto')  # arguments are passed to np.histogram
    >>> plt.title("Histogram with 'auto' bins")
    Text(0.5, 1.0, "Histogram with 'auto' bins")
    >>> plt.show()

In [157]:
hist = histogram(kept, bins=years)
In [158]:
shape(hist[0]),shape(hist[1])
Out[158]:
((212,), (213,))
In [159]:
plt.plot(hist[1][:-1], hist[0]);
No description has been provided for this image
In [167]:
negative = array(1-Theta_vec(Temp), dtype=bool)
kept2 = data[negative,0]
hist2 = histogram(kept2, bins=years)
plt.plot(hist2[1][:-1], hist2[0]);
No description has been provided for this image

Homework 4¶

Download the data file stockholm_daily_mean_temperature.csv from

https://bolin.su.se/data/stockholm-historical-daily-temperature-2?n=stockholm-historical-temps-daily-2

which contains data starting in 1756 till this year. And plot the number of days with positive (in Celsius) temperature over the years.

Note that some dates in which data is missing have very negative temperature. These data should be discarted, and average should be computed on the rest of the data.

In [235]:
datT = loadtxt('stockholm_daily_mean_temperature.csv',delimiter=',',skiprows=1,usecols=[1,2,3])
datd = loadtxt('stockholm_daily_mean_temperature.csv',delimiter=',',skiprows=1,usecols=[0],dtype=str)
datd_ =array([[int(x) for x in dt.split('-')] for dt in datd])
datd = datd_