After getting used to how Python works, it is time to start getting our hands dirty with data analysis. We will study two packages: NumPy is the fundamental numerical computing and linear algebra package in Python, which already allows for decent data analysis. We will learn it not only for the data analysis itself, but more importantly because it is a package that will always be present in our import
section as scientists. After NumPy we will move on to Pandas. Pandas is a dedicated data analysis package with many more functionalities than NumPy, making our life much easier in terms of data visualization and manipulation. The full power of Pandas will be unleashed in Section 5, where we will see how to visualize information in plots.
As usual, we begin by importing the necessary packages:
In [1]:
import numpy as np
import pandas as pd
print('NumPy:', np.__version__)
print('Pandas:', pd.__version__)
NumPy is an open-source add-on module to Python that provides common mathematical and numerical routines as pre-compiled, fast functions. It has grown into a highly mature package that provides functionality that meets, or perhaps exceeds, that of common commercial software like MATLAB. The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data. The main object NumPy works with is the homogeneous multidimensional array. Despite its intimidating name, this is nothing but a table of numbers, each element labelled by a tuple of indices.
We will now explore some capabilities of NumPy that will prove very useful not only for data analysis, but throughout our whole life with Python.
In [2]:
mylist = [1, 2, 3]
x = np.array(mylist)
x
Out[2]:
In [3]:
type(x)
Out[3]:
The same applies to multidimensional arrays
In [4]:
m = np.array([[[7, 8, 9], [10, 11, 12]], [[1, 2, 3], [4, 5, 6]]])
m
Out[4]:
There is one restriction with respect to lists: while lists can hold data of different types, all the data in a NumPy array has to be of the same type, and it will be converted automatically if necessary.
In [5]:
lst = [1., 'cat']
print(type(lst[0]))
arr = np.array(lst)
print(type(arr[0]))
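If you want to check or control the resulting type yourself, the dtype attribute and the astype method come in handy; here is a quick sketch (the variable names are just for illustration):
In [ ]:
arr_int = np.array([1, 2, 3])
print(arr_int.dtype)                 # the common type chosen by NumPy for the whole array
arr_float = arr_int.astype(float)    # explicit conversion to floats
print(arr_float.dtype)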
(We will go deeper into indexing in a while)
A NumPy array has a number of dimensions (or axes). To obtain the number of axes and the size of each of them, you use the command shape. For 2-dimensional arrays (matrices), the order corresponds to (rows, columns).
There are two different ways of calling the shape command, either np.shape(arr) or arr.shape. This is not the only command that works in both formats, and we will encounter more along the way.
In [6]:
print(x.shape)
print(np.shape(m))
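Two closely related attributes are ndim and size, which give the number of axes and the total number of elements; a small check on the arrays defined above:
In [ ]:
print(m.ndim)   # number of axes, 3 in this case
print(m.size)   # total number of elements, 2*2*3 = 12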
In [7]:
array_zeros=np.zeros((3, 2))
array_ones=np.ones((3, 2, 4), dtype=np.int8)
In [8]:
array_zeros
Out[8]:
In [9]:
array_ones
Out[9]:
eye(d) returns a 2-D, dimension-$d$ array with ones on the diagonal and zeros elsewhere.
In [10]:
np.eye(3)
Out[10]:
eye can also create arrays with ones in the upper or lower diagonals. To achieve this, call eye(d, d, k), where $k$ denotes the diagonal (positive for above the main diagonal, negative for below), or equivalently eye(d, k=num)
In [11]:
print(np.eye(5, 5, 2))
np.eye(5, k=-1)
Out[11]:
diag, depending on the input, either extracts a diagonal from a matrix (if the input is a 2-D array), or constructs a diagonal array (if the input is a vector).
In [12]:
np.diag(x, 1)
Out[12]:
In [13]:
y = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
print(np.diag(y))
np.diag(np.diag(y))
Out[13]:
arange(begin, end, step) returns evenly spaced values within a given interval. Note that the beginning point is included, but not the end.
In [14]:
n = np.arange(0, 30, 2) # start at 0 count up by 2, stop before 30
n
Out[14]:
In [15]:
len(n)
Out[15]:
Exercise 1: Create an array of the first million odd numbers, both with arange and using loops. Try timing both methods to see which one is faster; for that, use %timeit.
In [16]:
%timeit np.arange(1, 2e6, 2)
%timeit [i for i in range(2000000) if i % 2 == 1]
Similarly, linspace(begin, end, points) returns evenly spaced numbers over a specified interval. Here, instead of specifying the step, you specify the number of points you want. Also, with linspace the end of the interval is included.
In [17]:
o = np.linspace(0, 30, 15)
o
Out[17]:
In [18]:
len(o)
Out[18]:
reshape changes the shape of an array, but not its data. This is another of the commands that can be called either as np.reshape(arr, ...) or as arr.reshape(...).
In [19]:
print(n.reshape(3, 5))
np.reshape(n, (5, 3))
Out[19]:
Note, however, that in order for these changes to be permanent, you need to reassign the variable.
In [20]:
print(n) # After the reshapings above, the original array is still unchanged
n = n.reshape(3, 5)
n # Now that we have reassigned it, the change of shape is permanent
Out[20]:
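A handy trick, in case you only want to specify some of the dimensions, is to pass -1 to reshape and let NumPy infer the remaining one; a short sketch with the array n we just reshaped:
In [ ]:
print(n.reshape(-1))     # back to a flat, 1-dimensional array
print(n.reshape(5, -1))  # the -1 is inferred to be 3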
In [21]:
p = np.ones([2, 2, 2])
p
Out[21]:
In [22]:
np.concatenate([p, 2 * p], 0)
Out[22]:
In [23]:
np.concatenate([p, 2 * p], 1)
Out[23]:
In [24]:
np.concatenate([p, 2 * p], 2)
Out[24]:
Several arrays can be glued together with concatenate, indicating the axis along which to join them, as we just did above. However, for common combinations there exist special commands: use vstack to stack arrays in sequence vertically (row-wise), hstack to stack arrays in sequence horizontally (column-wise), and block to create arrays out of blocks (only available in NumPy 1.13.0+)
In [25]:
q = np.ones((2, 2))
np.vstack([q, 2 * q])
Out[25]:
In [26]:
np.hstack([q, 2 * q])
Out[26]:
In [27]:
np.block([[q, np.zeros((2, 2))], [np.zeros((2, 2)), 2 * q]])
Out[27]:
In [28]:
x = np.array([1, 2, 3])
print(x)
print(x + 10)
print(3 * x)
print(1 / x)
print(x ** (-2 / 3))
print(2 ** x)
As the cell above shows, the usual arithmetic symbols act element-wise when combining an array with a number. These symbols can also be used to operate between two arrays of the same shape, in which case they again act element by element
In [29]:
y = np.arange(4, 7, 1)
print(x + y) # [1+4, 2+5, 3+6]
print(x * y) # [1*4, 2*5, 3*6]
print(x / y) # [1/4, 2/5, 3/6]
print(x ** y) # [1**4, 2**5, 3**6]
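Strictly speaking, the shapes only need to be compatible: NumPy broadcasts the smaller array along the missing dimensions. A minimal sketch (the arrays A and b here are made up for this example):
In [ ]:
A = np.ones((3, 3))
b = np.array([1, 2, 3])
print(A + b)                  # b is added to every row of A
print(A * b[:, np.newaxis])   # as a column vector, b scales each row of A instead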
For vector or matrix multiplication, the command to use is dot
In [30]:
x.dot(y) # 1*4 + 2*5 + 3*6
Out[30]:
With Python 3.5, matrix multiplication got its own operator, @
In [31]:
x@y
Out[31]:
In [32]:
X = np.array([[i + j for i in range(3, 6)] for j in range(3)])
Y = np.diag([1, 1], 1) + np.diag([1], -2)
print('{}\n'.format(X))
print(X * Y)
np.dot(X, Y)
Out[32]:
Exercise 2: Take a 10x2 matrix representing $(x_1, x_2)$ coordinates and transform them into polar coordinates $(r, \theta)$.
Hint 1: the inverse transformation is given by $x_1 = r\cos\theta$, $x_2 = r\sin\theta$
Hint 2: generate the random numbers with the functions in numpy.random
In [33]:
z = np.random.random((10, 2))
x1, x2 = z[:, 0], z[:, 1]
R = np.sqrt(x1 ** 2 + x2 ** 2)
T = np.arctan2(x2, x1)
print(R)
print(T)
In [34]:
Z = np.arange(0, 12, 1).reshape((4, 3))
In [35]:
np.dot(Z, y)
Out[35]:
In [36]:
np.dot(Z, y.T)
Out[36]:
In [37]:
Z.dot(X)
Out[37]:
In [38]:
(Z.T).dot(X)
In [39]:
a = np.array([-4, -2, 1, 3, 5])
print(a.max())
print(a.min())
print(a.sum())
print(a.mean())
print(a.std())
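On multidimensional arrays, these reductions also accept an axis argument to act only along rows or columns; a quick sketch using the matrix Z defined earlier:
In [ ]:
print(Z.sum(axis=0))    # sum along the rows: one value per column
print(Z.mean(axis=1))   # mean along the columns: one value per row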
Some interesting functions are argmax and argmin, which return the index of the maximum and minimum values in the array.
In [40]:
print(a.argmax())
print(a.argmin())
In [41]:
r = [4, 5, 6, 7]
print(r[2])
r[0] = 198
r
Out[41]:
To select a range of elements you can use a colon, :. A second : can be used to indicate the step size, as in array[start:stop:stepsize]. If you leave start (stop) blank, the selection will go from the very beginning (until the very end) of the array
In [42]:
s = np.arange(13)**2
print(s)
print(s[3:9])
print(s[2:10:3])
s[-5::-2]
Out[42]:
The same applies to matrices or higher-dimensional arrays
In [43]:
r = np.arange(36).reshape((6, 6))
r
Out[43]:
In [44]:
r[2:5, 1:3]
Out[44]:
You can also select specific rows and columns, separated by commas
In [45]:
r[[1, 3, 4], 1:3]
Out[45]:
A very useful tool is conditional indexing, where we apply a function, an assignment, etc. only to those elements of an array that satisfy some condition
In [46]:
r[r > 30] = 30
r
Out[46]:
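If you would rather not modify the array in place, np.where builds a new array, choosing between two values depending on the condition; a small sketch (the name capped is just for this example):
In [ ]:
capped = np.where(r > 20, 20, r)   # 20 where the condition holds, the original value elsewhere
capped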
Exercise 3: Create a random 1-dimensional array, and find which element is closest to 0.7
In [47]:
Z = np.random.uniform(0,1,100)
z = 0.7
m = Z[np.abs(Z - z).argmin()]
print(m)
In [48]:
r2 = r[:3,:3]
r2
Out[48]:
And now let's set all its elements to zero
In [49]:
r2[:] = 0
r2
Out[49]:
When looking at r, we see that it has also been changed! This is because slicing does not create a new array: r2 is just a view into the same data as r.
In [50]:
r
Out[50]:
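If in doubt about whether two arrays share their data, you can check it explicitly (np.shares_memory is available in reasonably recent NumPy versions):
In [ ]:
print(r2.base is r)             # True: r2 is a view into r
print(np.shares_memory(r, r2))  # another way to check the same thing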
The proper way of handling selections without modifying the original arrays is through the copy command.
In [51]:
r_copy = r.copy()
r_copy
Out[51]:
Now we can safely modify r_copy without affecting r.
In [52]:
r_copy[:] = 10
print('{}\n'.format(r_copy))
r
Out[52]:
Finally, you can iterate over arrays in the same way as you iterate over lists
In [53]:
test = np.random.randint(0, 10, (4,3))
test
Out[53]:
You can iterate by row:
In [54]:
for row in test:
    print(row)
Or by row index
In [55]:
for i in range(len(test)):
    print(test[i])
Or by row and index:
In [56]:
for i, row in enumerate(test):
    print('Row {} is {}'.format(i, row))
In the same way as with lists, you can use zip to iterate over multiple iterables.
In [57]:
test2 = test**2
test2
Out[57]:
In [58]:
for i, j in zip(test, test2):
    print('{} + {} = {}'.format(i, j, i + j))
Exercise 4: Create a function that iterates over the columns of a 2-dimensional array
In [59]:
def iterate(df):
    for i, row in enumerate(df):      # the rows of df.T are the columns of the original array
        shp = row.shape
        row.shape = shp + (1,)        # add a trailing axis so the row prints as a column
        print('Column {} is {}'.format(i, row))
iterate(test.T)
In [60]:
np.savetxt('numpytest.txt', test)
np.loadtxt('numpytest.txt')
Out[60]:
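The commands savetxt and loadtxt used above store the array as plain text. If you prefer a binary format that preserves the dtype and shape exactly, NumPy also provides save and load; a brief sketch (the file name is just an example):
In [ ]:
np.save('numpytest.npy', test)   # .npy is NumPy's binary format
np.load('numpytest.npy')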
When dealing with numeric matrices and vectors in Python, NumPy makes life a lot easier. For more complex data, however, it leaves a bit to be desired. For those used to working with dedicated languages like R, doing data analysis directly with NumPy feels like a step back. Fortunately, some nice folks have written the Python Data Analysis Library (a.k.a. Pandas). Pandas provides an R-like DataFrame, produces high-quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.
Pandas works with Series of data, which are then arranged into DataFrames. A DataFrame will be the object closest to an Excel spreadsheet that you will see throughout the course (but of course, given that it is integrated in Python and can be combined with so many different packages, DataFrames are much more powerful than Excel spreadsheets). The data in a Series can be either qualitative or quantitative. Creating a Series is as easy as creating a NumPy array from a one-dimensional list.
In [61]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
Out[61]:
In [62]:
numbers = [1, 2, 3]
pd.Series(numbers)
Out[62]:
Notice that the Series is indexed by integers by default. You can change this indexing by using a dictionary instead of a list to create the Series.
In [63]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Out[63]:
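Alternatively, you can keep the data in a list and pass the labels separately through the index argument; a small sketch reproducing the same Series:
In [ ]:
pd.Series(['Bhutan', 'Scotland', 'Japan', 'South Korea'],
          index=['Archery', 'Golf', 'Sumo', 'Taekwondo'])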
On the other hand, DataFrames can be built from two-dimensional arrays, with the ability to label the columns and index the rows
In [64]:
u = pd.DataFrame(np.random.randn(1000, 6), index=np.arange(0, 3000, 3),
                 columns=['A', 'B', 'C', 'D', 'E', 'F'])
u
Out[64]:
As you might have noticed, it is a bit unwieldy to deal with large DataFrames. There are, however, some functions that allow us to get an idea of the data in a frame.
In [65]:
u.head()
Out[65]:
In [66]:
u.tail()
Out[66]:
In [67]:
u.describe()
Out[67]:
One can also change the maximum number of rows that are displayed:
In [68]:
pd.set_option('display.max_rows', 15)
Pandas can also generate random DataFrames for testing:
In [69]:
import pandas.util.testing as tm
tm.makeDataFrame().head()
Out[69]:
In [70]:
u.iloc[125:132,[0, 2, 5]]
Out[70]:
The iloc command used above selects rows and columns by their integer position. However, there are a few other ways of accessing the data in a Pandas DataFrame that typically have a more "direct" connection with the actual content of the DataFrame. Individual columns or sets of columns can also be accessed by their column names. Choosing one single column will give a Series, while two or more will produce a DataFrame
In [71]:
u['A'].head()
Out[71]:
In [72]:
u[['A', 'D']].head()
Out[72]:
Not only that, you can also access a single column without the need for brackets []
In [73]:
u.A.head()
Out[73]:
The usual [] will select specific rows according to the row number
In [74]:
u[0:10][list('BCF')]
Out[74]:
You can also choose specific rows according to their indices with the loc command
In [75]:
u.loc[6:15]
Out[75]:
Or, you can access just the elements that satisfy some condition
In [76]:
u[u.D > 2]
Out[76]:
In [77]:
u[~(u.D > 2)] # For the inverse of u.D > 2
Out[77]:
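Several conditions can be combined with the element-wise operators & (and), | (or) and ~ (not), with each condition wrapped in parentheses; a quick sketch:
In [ ]:
u[(u.A > 0) & (u.D > 1)].head()   # rows where both conditions hold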
Recently, query has been added to DataFrame for the same purpose. While it is less powerful than logical indexing, it is often faster and shorter to write (when the DataFrame name is longer than just u):
In [78]:
u.query('D > 2')
Out[78]:
In [132]:
u.pivot(index='E', columns='G', values='A')
Out[132]:
In [136]:
u.stack()
Out[136]:
In [137]:
u.unstack()
Out[137]:
In [138]:
u.stack().unstack()
Out[138]:
In [79]:
u['F'] = 1 / u['F']
u['F'].head()
Out[79]:
In [80]:
np.mean(u)
Out[80]:
You can apply functions to the whole dataset or to specific columns with the apply command. apply acts on a whole column at a time (i.e. a Pandas Series), so you can compute quantities that depend on several values of the column, for instance the mean value. To apply functions on a true element-by-element basis, use applymap or Series.apply instead.
In [81]:
def mn(col):
return sum(col) / len(col)
u.apply(mn)
Out[81]:
While most common statistics can be calculated directly (including the mean of the example above), apply also works on columns with strings or categorical data, where no mathematical operations are defined. The limit is your imagination.
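As a sketch of the element-wise variants mentioned above, applied to the numeric columns of u (using abs just as an example function):
In [ ]:
print(u['A'].apply(abs).head())      # Series.apply acts element by element
u[['A', 'B']].applymap(abs).head()   # applymap does the same over a whole DataFrame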
Combining DataFrames
Something we will do quite often as scientists is combining data from different sources into a single one. This can be achieved with different commands in Pandas, depending on the actual goal we have.
To begin with, appending new rows of data is achieved with the command append.
In [82]:
newdata = pd.DataFrame(np.ones((5, 6)), index=np.arange(3003, 3018, 3), columns=list('ABCDEF'))
newdata
Out[82]:
In [83]:
unew = u.append(newdata)
unew.tail(10)
Out[83]:
The same result can be obtained with concat.
In [84]:
pd.concat([u, newdata]).tail(10)
Out[84]:
New columns of data can simply be assigned, or added with the command join, as sketched below.
In [85]:
u['G'] = np.random.choice(['a', 'b', 'c'], len(u))
u.tail()
Out[85]:
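The join command mentioned above merges another DataFrame on the index; a minimal sketch with a made-up column (the name 'H' and its random contents are just for illustration):
In [ ]:
extra = pd.DataFrame({'H': np.random.randn(len(u))}, index=u.index)
u.join(extra).head()   # columns of extra are attached to u, matching by index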
In [86]:
for h, group in u.groupby('G'):
    print('{}: {}'.format(h, np.mean(group['F'])))
In [87]:
u.groupby('G').describe()
Out[87]:
In [98]:
u.pivot_table(index='G', aggfunc='mean')
Out[98]:
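groupby can also compute several aggregates at once through agg; a short sketch on the column F:
In [ ]:
u.groupby('G')['F'].agg(['mean', 'std', 'count'])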
In [107]:
u.to_csv('test.csv')
v = pd.read_csv('test.csv', index_col=0)
v.head()
Out[107]:
The commands to_csv and read_csv above save a DataFrame to a csv file and read it back. In addition, Pandas has special commands to load and save Excel spreadsheets (yay!). However, to use them you'll need the openpyxl and xlrd packages.
In [108]:
u.to_excel('test.xlsx', sheet_name='My sheet')
pd.read_excel('test.xlsx', 'My sheet', index_col=0).head()
Out[108]:
Exercise 5: Download this dataset and load it, using the first column as the index. Take a look at it, and do the following things:
In [109]:
df = pd.read_csv('https://raw.githubusercontent.com/ChihChengLiang/pokemongor/master/data-raw/pokemons.csv',
                 index_col=0)
df = df[['Identifier', 'BaseStamina', 'BaseAttack', 'BaseDefense', 'Type1', 'Type2']]
capitalize = lambda st: st.capitalize()
for col in ['Type1', 'Type2']:
    df[col] = df[col].apply(capitalize)   # capitalize the type names
def highstamina(x):
    return True if x > 170 else False
df['HighStamina'] = df.BaseStamina.apply(highstamina)   # flag Pokemon with high stamina
print(df[df['HighStamina'] == True].Identifier)
df.tail(15)
df.tail(15)
Out[109]: