4 NumPy and Pandas

Introduction

Now that we are used to how Python works, it is time to get our hands dirty with data analysis. We will study two packages. NumPy is the fundamental package for numeric computing and linear algebra in Python, and it already allows for decent data analysis. We will learn it not only for data analysis but, more importantly, because it will always be present in our import section as scientists. After NumPy we will move on to Pandas, a dedicated data analysis package with many more functionalities than NumPy, which makes our life much easier in terms of data manipulation and visualization. The full power of Pandas will be unleashed in Section 5, where we will see how to visualize information in plots.

As usual, we begin by importing the necessary packages.


In [1]:
import numpy as np
import pandas as pd
print('NumPy:', np.__version__)
print('Pandas:', pd.__version__)


NumPy: 1.13.3
Pandas: 0.22.0

Numeric Python (NumPy)

NumPy is an open-source add-on module to Python that provides common mathematical and numerical routines as pre-compiled, fast functions. It has grown into a highly mature package whose functionality meets, or perhaps exceeds, that of common commercial software like MATLAB. The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data. The main object NumPy works with is the homogeneous multidimensional array. Despite the intimidating name, such arrays are nothing but tables of numbers, each labelled by a tuple of indices.

We will now explore some capabilities of NumPy that will prove very useful not only for data analysis, but throughout our whole life with Python.

Creating Arrays

As mentioned before, the main object in NumPy is the array. Creating one is as easy as calling the command array


In [2]:
mylist = [1, 2, 3]
x = np.array(mylist)
x


Out[2]:
array([1, 2, 3])

In [3]:
type(x)


Out[3]:
numpy.ndarray

The same applies to multidimensional arrays


In [4]:
m = np.array([[[7, 8, 9], [10, 11, 12]], [[1, 2, 3], [4, 5, 6]]])
m


Out[4]:
array([[[ 7,  8,  9],
        [10, 11, 12]],

       [[ 1,  2,  3],
        [ 4,  5,  6]]])

There is one restriction with respect to lists: while a list can hold data of different types, all the data in an array must be of the same type, and it will be converted automatically if necessary.


In [5]:
lst = [1., 'cat']
print(type(lst[0]))

arr = np.array(lst)
print(type(arr[0]))


<class 'float'>
<class 'numpy.str_'>

(We will go deeper into indexing in a while)
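The automatic conversion follows a simple pattern: NumPy picks one type that can represent every element. The following is a minimal sketch of that behaviour; the variable names are only illustrative.

```python
import numpy as np

# Mixed ints and floats are upcast to a common floating-point type.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)          # float64

# Mixing numbers and strings turns every element into a string.
mixed_text = np.array([1., 'cat'])
print(mixed_text.dtype.kind)        # U (unicode string)
print(mixed_text[0])                # 1.0, now stored as text

# The target type can also be requested explicitly.
as_ints = np.array([1.7, 2.2], dtype=int)   # the fractional part is truncated
print(as_ints)                      # [1 2]
```

Checking arr.dtype after building an array from mixed data is a cheap way to notice that numbers have silently become strings.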


A NumPy array has a number of dimensions (or axes). To obtain the number of axes and the size of each of them, use the command shape. For 2-dimensional arrays (matrices), the order corresponds to (rows, columns).

There are two ways of calling shape: as a function, np.shape(arr), or as an attribute of the array, arr.shape. This is not the only command that works in both forms, and we will find some more along the way.


In [6]:
print(x.shape)
print(np.shape(m))


(3,)
(2, 2, 3)
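Besides shape, arrays carry two closely related attributes, ndim and size. As a small sketch, using an array of the same shape as m above:

```python
import numpy as np

m = np.zeros((2, 2, 3))
print(m.shape)   # (2, 2, 3): size of each axis
print(m.ndim)    # 3: number of axes, equal to len(m.shape)
print(m.size)    # 12: total number of elements, 2 * 2 * 3
```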

Special Arrays

Now we review some built-in functions that create commonly used arrays. ones and zeros return arrays of a given shape and type (the default is float64), filled with ones or zeros, respectively.


In [7]:
array_zeros=np.zeros((3, 2))
array_ones=np.ones((3, 2, 4), dtype=np.int8)

In [8]:
array_zeros


Out[8]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [9]:
array_ones


Out[9]:
array([[[1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int8)

eye(d) returns a 2-D, dimension-$d$ array with ones on the diagonal and zeros elsewhere.


In [10]:
np.eye(3)


Out[10]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

eye can also create arrays with ones in upper and lower diagonals. To achieve this, you must call eye(d, d, k) where $k$ denotes the diagonal (positive for above the center diagonal, negative for below), or eye(d, k=num)


In [11]:
print(np.eye(5, 5, 2))
np.eye(5, k=-1)


[[ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
Out[11]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])

diag, depending on the input, either extracts a diagonal from a matrix (if the input is a 2-D array), or constructs a diagonal array (if the input is a vector).


In [12]:
np.diag(x, 1)


Out[12]:
array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])

In [13]:
y = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
print(np.diag(y))
np.diag(np.diag(y))


[ 1  6 11 16]
Out[13]:
array([[ 1,  0,  0,  0],
       [ 0,  6,  0,  0],
       [ 0,  0, 11,  0],
       [ 0,  0,  0, 16]])

arange(begin, end, step) returns evenly spaced values within a given interval. Note that the beginning point is included, but not the ending.


In [14]:
n = np.arange(0, 30, 2) # start at 0 count up by 2, stop before 30
n


Out[14]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

In [15]:
len(n)


Out[15]:
15

Exercise 1: Create an array of the first million odd numbers, both with arange and using loops. Time both methods with %timeit to see which one is faster.


In [16]:
%timeit np.arange(1, 2e6, 2)    # odd numbers 1, 3, ..., 1999999

%timeit [i for i in range(2000000) if i % 2 == 1]


966 µs ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
207 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Similarly, linspace(begin, end, points) returns evenly spaced numbers over a specified interval. Here, instead of the step, you specify the number of points you want. Unlike arange, linspace includes the end of the interval.


In [17]:
o = np.linspace(0, 30, 15)
o


Out[17]:
array([  0.        ,   2.14285714,   4.28571429,   6.42857143,
         8.57142857,  10.71428571,  12.85714286,  15.        ,
        17.14285714,  19.28571429,  21.42857143,  23.57142857,
        25.71428571,  27.85714286,  30.        ])

In [18]:
len(o)


Out[18]:
15
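To make the difference between the two functions concrete, here is a short sketch on the interval from 0 to 1. The keyword arguments endpoint and retstep belong to linspace; the variable names are just for illustration.

```python
import numpy as np

# arange: fixed step, end point excluded.
a = np.arange(0, 1, 0.25)
print(a)                     # [0.   0.25 0.5  0.75]

# linspace: fixed number of points, end point included by default.
b = np.linspace(0, 1, 5)
print(b)                     # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False reproduces the arange grid; retstep=True also returns the step.
c, step = np.linspace(0, 1, 4, endpoint=False, retstep=True)
print(c, step)
```

With non-integer steps, linspace is usually the safer choice, because the number of points returned by arange can be affected by floating-point rounding.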

reshape changes the shape of an array, but not its data. This is another command that can be called either as a method of the array or as a NumPy function.


In [19]:
print(n.reshape(3, 5))
np.reshape(n, (5, 3))


[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]]
Out[19]:
array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16],
       [18, 20, 22],
       [24, 26, 28]])

Note, however, that for these changes to be permanent you must reassign the variable.


In [20]:
print(n)  # After the reshapes above, the original array is unchanged

n = n.reshape(3, 5)
n   # Only after the reassignment does n have the new shape


[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28]
Out[20]:
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])
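Two details of reshape are worth a quick sketch: one axis can be given as -1, in which case NumPy infers its size, and the total number of elements must stay the same. The names rows3 and flat are only illustrative.

```python
import numpy as np

n = np.arange(0, 30, 2)          # 15 elements
rows3 = n.reshape(3, -1)         # -1 lets NumPy infer the second axis (5)
print(rows3.shape)               # (3, 5)

flat = rows3.reshape(-1)         # back to one dimension
print(flat.shape)                # (15,)

# The number of elements must match, otherwise reshape raises a ValueError.
try:
    n.reshape(4, 4)
except ValueError as err:
    print('cannot reshape:', err)
```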

Combining Arrays

The most general command for combining arrays is concatenate(arrs, d). It takes a list of arrays and concatenates them along axis $d$


In [21]:
p = np.ones([2, 2, 2])
p


Out[21]:
array([[[ 1.,  1.],
        [ 1.,  1.]],

       [[ 1.,  1.],
        [ 1.,  1.]]])

In [22]:
np.concatenate([p, 2 * p], 0)


Out[22]:
array([[[ 1.,  1.],
        [ 1.,  1.]],

       [[ 1.,  1.],
        [ 1.,  1.]],

       [[ 2.,  2.],
        [ 2.,  2.]],

       [[ 2.,  2.],
        [ 2.,  2.]]])

In [23]:
np.concatenate([p, 2 * p], 1)


Out[23]:
array([[[ 1.,  1.],
        [ 1.,  1.],
        [ 2.,  2.],
        [ 2.,  2.]],

       [[ 1.,  1.],
        [ 1.,  1.],
        [ 2.,  2.],
        [ 2.,  2.]]])

In [24]:
np.concatenate([p, 2 * p], 2)


Out[24]:
array([[[ 1.,  1.,  2.,  2.],
        [ 1.,  1.,  2.,  2.]],

       [[ 1.,  1.,  2.,  2.],
        [ 1.,  1.,  2.,  2.]]])

However, for common combinations there exist special commands. Use vstack to stack arrays in sequence vertically (row wise), hstack to stack arrays in sequence horizontally (column wise), and block to create arrays out of blocks (only available in versions 1.13.0+)


In [25]:
q = np.ones((2, 2))
np.vstack([q, 2 * q])


Out[25]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 2.,  2.]])

In [26]:
np.hstack([q, 2 * q])


Out[26]:
array([[ 1.,  1.,  2.,  2.],
       [ 1.,  1.,  2.,  2.]])

In [27]:
np.block([[q, np.zeros((2, 2))], [np.zeros((2, 2)), 2 * q]])


Out[27]:
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 0.,  0.,  2.,  2.],
       [ 0.,  0.,  2.,  2.]])
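For 2-dimensional arrays, vstack and hstack are shorthands for concatenate along the first and second axis. A small check, reusing the array q from above:

```python
import numpy as np

q = np.ones((2, 2))

v = np.vstack([q, 2 * q])
h = np.hstack([q, 2 * q])

# For 2-D inputs these are shorthands for concatenate along axis 0 and 1.
print(np.array_equal(v, np.concatenate([q, 2 * q], axis=0)))   # True
print(np.array_equal(h, np.concatenate([q, 2 * q], axis=1)))   # True
print(v.shape, h.shape)                                        # (4, 2) (2, 4)
```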

Operations

You can easily perform element-wise operations on arrays of any shape. Use the usual symbols +, -, *, / and ** for element-wise addition, subtraction, multiplication, division and power.


In [28]:
x = np.array([1, 2, 3])
print(x)
print(x + 10)
print(3 * x)
print(1 / x)
print(x ** (-2 / 3))
print(2 ** x)


[1 2 3]
[11 12 13]
[3 6 9]
[ 1.          0.5         0.33333333]
[ 1.          0.62996052  0.48074986]
[2 4 8]

These symbols can also be used to operate between two arrays, which must have the same shape (or, more generally, shapes that are compatible under NumPy's broadcasting rules). In that case the operations are again element-wise.


In [29]:
y = np.arange(4, 7, 1)
print(x + y)     # [1+4, 2+5, 3+6]
print(x * y)     # [1*4, 2*5, 3*6]
print(x / y)     # [1/4, 2/5, 3/6]
print(x ** y)    # [1**4, 2**5, 3**6]


[5 7 9]
[ 4 10 18]
[ 0.25  0.4   0.5 ]
[  1  32 729]
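The element-wise operators also accept arrays of different but compatible shapes, a mechanism NumPy calls broadcasting. The following sketch is not part of the original exercise set; the arrays col and table are made up for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])            # shape (3,)
col = np.array([[10], [20]])       # shape (2, 1)

# Shapes (2, 1) and (3,) are broadcast to a common shape (2, 3).
table = col + x
print(table)
# [[11 12 13]
#  [21 22 23]]

# Incompatible shapes raise a ValueError instead of being silently combined.
try:
    x + np.array([1, 2])
except ValueError as err:
    print('shape mismatch:', err)
```

The operations with scalars seen before, such as x + 10, are the simplest case of the same rule.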

For doing vector or matrix multiplication, the command to be used is dot


In [30]:
x.dot(y) # 1*4 + 2*5 + 3*6


Out[30]:
32

Since Python 3.5, matrix multiplication has its own operator, @.


In [31]:
x@y


Out[31]:
32

In [32]:
X = np.array([[i + j for i in range(3, 6)] for j in range(3)])
Y = np.diag([1, 1], 1) + np.diag([1], -2)

print('{}\n'.format(X))
print(X * Y)
np.dot(X, Y)


[[3 4 5]
 [4 5 6]
 [5 6 7]]

[[0 4 0]
 [0 0 6]
 [5 0 0]]
Out[32]:
array([[5, 3, 4],
       [6, 4, 5],
       [7, 5, 6]])
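It is easy to confuse * and @ when working with matrices. The sketch below uses the same values as X and Y above (written out explicitly) to show that they are different operations and that the matrix product depends on the order of the factors.

```python
import numpy as np

X = np.array([[3, 4, 5], [4, 5, 6], [5, 6, 7]])
Y = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])   # cyclic permutation matrix

elementwise = X * Y        # multiplies entries position by position
matrix_prod = X @ Y        # rows of X times columns of Y, same as np.dot(X, Y)

print(np.array_equal(matrix_prod, np.dot(X, Y)))   # True
print(np.array_equal(matrix_prod, elementwise))    # False
print(np.array_equal(X @ Y, Y @ X))                # False: order matters
```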

Exercise 2: Take a 10x2 matrix representing $(x_1, x_2)$ coordinates and transform them into polar coordinates $(r, \theta)$.

Hint 1: the inverse transformation is given by $x_1 = r\cos\theta$, $x_2 = r\sin\theta$

Hint 2: generate random numbers with the functions in numpy.random


In [33]:
z = np.random.random((10, 2))
x1, x2 = z[:, 0], z[:, 1]
R = np.sqrt(x1 ** 2 + x2 ** 2)
T = np.arctan2(x2, x1)
print(R)
print(T)


[ 0.88919595  0.67700298  0.43657171  0.99695062  0.09369465  0.42234775
  0.9362286   0.56539581  1.01339843  0.72450275]
[ 0.08591593  1.31038962  1.40461825  0.53314706  1.36699826  1.2306954
  0.49694134  0.68438617  0.83524534  1.54853991]
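A good habit is to verify a transformation by applying its inverse. The sketch below repeats the solution with a fixed random seed (an arbitrary choice, so that the run is reproducible) and checks that the formulas of Hint 1 recover the original coordinates.

```python
import numpy as np

rng = np.random.RandomState(0)           # fixed seed, chosen arbitrarily
z = rng.random_sample((10, 2))
x1, x2 = z[:, 0], z[:, 1]

R = np.sqrt(x1 ** 2 + x2 ** 2)           # np.hypot(x1, x2) is equivalent
T = np.arctan2(x2, x1)                   # picks the correct quadrant

# Apply the inverse transformation from Hint 1 and compare with the input.
back = np.column_stack([R * np.cos(T), R * np.sin(T)])
print(np.allclose(back, z))              # True
```

arctan2 is used instead of arctan(x2 / x1) because it handles x1 = 0 and distinguishes all four quadrants.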

Transposing

Transposition is a very important operation in linear algebra. A 1-dimensional NumPy array has no row or column orientation (transposing it has no effect), so matrix-vector products work regardless of how you think of the vector. This is not the case for products of matrices, whose shapes must be aligned.


In [34]:
Z = np.arange(0, 12, 1).reshape((4, 3))

In [35]:
np.dot(Z, y)


Out[35]:
array([ 17,  62, 107, 152])

In [36]:
np.dot(Z, y.T)


Out[36]:
array([ 17,  62, 107, 152])

In [37]:
Z.dot(X)


Out[37]:
array([[ 14,  17,  20],
       [ 50,  62,  74],
       [ 86, 107, 128],
       [122, 152, 182]])

In [38]:
(Z.T).dot(X)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-2e11f30e177d> in <module>()
----> 1 (Z.T).dot(X)

ValueError: shapes (3,4) and (3,3) not aligned: 4 (dim 1) != 3 (dim 0)
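The behaviour above comes down to shapes. The following sketch shows that transposing a 1-dimensional array leaves it unchanged, and how explicit row and column vectors can be built when the orientation matters. The names col and row are illustrative.

```python
import numpy as np

y = np.arange(4, 7)                 # 1-D array, shape (3,)
print(y.T.shape)                    # (3,): transposing a 1-D array does nothing

col = y.reshape(-1, 1)              # explicit column vector, shape (3, 1)
row = y[np.newaxis, :]              # explicit row vector, shape (1, 3)

Z = np.arange(12).reshape(4, 3)
print(Z.dot(y).shape)               # (4,)
print(Z.dot(col).shape)             # (4, 1)
print(row.dot(Z.T).shape)           # (1, 4)
```

When a product fails with a "not aligned" error, printing the shapes of both factors is usually the fastest way to find the problem.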

Array Methods

To avoid going back and forth between NumPy arrays and lists, NumPy arrays come with methods that compute their properties. In fact, there are more of these than in standard Python.


In [39]:
a = np.array([-4, -2, 1, 3, 5])
print(a.max())
print(a.min())
print(a.sum())
print(a.mean())
print(a.std())


5
-4
3
0.6
3.26190128606

Some interesting functions are argmax and argmin, which return the index of the maximum and minimum values in the array.


In [40]:
print(a.argmax())
print(a.argmin())


4
0
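On multidimensional arrays most of these methods accept an axis argument, which selects the axis along which the operation is carried out. A minimal sketch with a made-up 2x3 array:

```python
import numpy as np

A = np.array([[1, 5, 2],
              [7, 0, 3]])

print(A.sum())            # 18: all elements
print(A.sum(axis=0))      # [8 5 5]: one value per column
print(A.max(axis=1))      # [5 7]: one value per row
print(A.argmax())         # 3: index into the flattened array

# unravel_index converts the flat index into (row, column) coordinates.
row, col = np.unravel_index(A.argmax(), A.shape)
print(int(row), int(col))  # 1 0
```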

Indexing/Slicing

We have already seen briefly that individual elements are accessed with bracket notation, array[ax_0, ax_1, ...], where ax_i denotes the coordinate along the i-th axis. The same notation can be used to assign new values to elements, exactly as with lists.


In [41]:
r = [4, 5, 6, 7]
print(r[2])
r[0] = 198
r


6
Out[41]:
[198, 5, 6, 7]

To select a range of rows or columns, use a colon, and a second colon to indicate the step size: array[start:stop:stepsize]. If you leave start (stop) blank, the selection goes from the very beginning (until the very end) of the array. A negative step size traverses the array backwards.


In [42]:
s = np.arange(13)**2
print(s)
print(s[3:9])
print(s[2:10:3])
s[-5::-2]


[  0   1   4   9  16  25  36  49  64  81 100 121 144]
[ 9 16 25 36 49 64]
[ 4 25 64]
Out[42]:
array([64, 36, 16,  4,  0])

The same applies to matrices or higher-dimensional arrays


In [43]:
r = np.arange(36).reshape((6, 6))
r


Out[43]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

In [44]:
r[2:5, 1:3]


Out[44]:
array([[13, 14],
       [19, 20],
       [25, 26]])

You can also select specific rows and columns, separated by commas


In [45]:
r[[1, 3, 4], 1:3]


Out[45]:
array([[ 7,  8],
       [19, 20],
       [25, 26]])

A very useful tool is conditional indexing, where an operation (a function, an assignment, ...) is applied only to those elements of an array that satisfy some condition.


In [46]:
r[r > 30] = 30
r


Out[46]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])
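Behind conditional indexing there is a boolean array, often called a mask. The sketch below, on a small made-up array, shows the mask explicitly, how to combine conditions, and np.where as a way of replacing values without modifying the original.

```python
import numpy as np

r = np.arange(12).reshape(3, 4)

mask = r % 2 == 0                 # boolean array with the same shape as r
print(r[mask])                    # 1-D array of the even entries

# Conditions are combined with & (and), | (or), ~ (not); parentheses are required.
print(r[(r > 2) & (r < 8)])       # [3 4 5 6 7]

# np.where builds a new array instead of modifying r in place.
clipped = np.where(r > 8, 8, r)
print(clipped)
```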

Exercise 3: Create a random 1-dimensional array, and find which element is closest to 0.7


In [47]:
Z = np.random.uniform(0,1,100)
z = 0.7
m = Z[np.abs(Z - z).argmin()]
print(m)


0.705902515629

Copying Data

Be very careful when copying and modifying arrays in NumPy! You will see the reason right now. Let's begin by defining r2 as a slice of r.


In [48]:
r2 = r[:3,:3]
r2


Out[48]:
array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])

And now let's set all its elements to zero


In [49]:
r2[:] = 0
r2


Out[49]:
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

When looking at r, we see that it has also been changed! The reason is that slicing an array returns a view on the same data, not a new array.


In [50]:
r


Out[50]:
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

The proper way of handling selections without modifying the original arrays is through the copy command.


In [51]:
r_copy = r.copy()
r_copy


Out[51]:
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])

Now we can safely modify r_copy without affecting r.


In [52]:
r_copy[:] = 10
print('{}\n'.format(r_copy))
r


[[10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]
 [10 10 10 10 10 10]]

Out[52]:
array([[ 0,  0,  0,  3,  4,  5],
       [ 0,  0,  0,  9, 10, 11],
       [ 0,  0,  0, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 30, 30, 30, 30, 30]])
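If you are unsure whether two arrays share their data, np.shares_memory can tell. The sketch below compares a slice, a selection with a list of indices, and an explicit copy; the variable names are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]                  # basic slicing: a view on r's data
fancy = r[[0, 1, 2], :3]          # indexing with a list: a new array
copy = r[:3, :3].copy()           # explicit copy

print(np.shares_memory(r, view))  # True
print(np.shares_memory(r, fancy)) # False
print(np.shares_memory(r, copy))  # False

view[0, 0] = -1                   # visible through r
copy[0, 1] = -1                   # r is untouched
print(r[0, :2])                   # [-1  1]
```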

Iterating Over Arrays

Finally, you can iterate over arrays in the same way as you iterate over lists


In [53]:
test = np.random.randint(0, 10, (4,3))
test


Out[53]:
array([[1, 6, 0],
       [4, 8, 0],
       [2, 1, 6],
       [1, 7, 4]])

You can iterate by row:


In [54]:
for row in test:
    print(row)


[1 6 0]
[4 8 0]
[2 1 6]
[1 7 4]

Or by row index


In [55]:
for i in range(len(test)):
    print(test[i])


[1 6 0]
[4 8 0]
[2 1 6]
[1 7 4]

Or by row and index:


In [56]:
for i, row in enumerate(test):
    print('Row {} is {}'.format(i, row))


Row 0 is [1 6 0]
Row 1 is [4 8 0]
Row 2 is [2 1 6]
Row 3 is [1 7 4]

In the same way as with lists, you can use zip to iterate over multiple iterables.


In [57]:
test2 = test**2
test2


Out[57]:
array([[ 1, 36,  0],
       [16, 64,  0],
       [ 4,  1, 36],
       [ 1, 49, 16]])

In [58]:
for i, j in zip(test, test2):
    print('{} + {} = {}'.format(i, j, i + j))


[1 6 0] + [ 1 36  0] = [ 2 42  0]
[4 8 0] + [16 64  0] = [20 72  0]
[2 1 6] + [ 4  1 36] = [ 6  2 42]
[1 7 4] + [ 1 49 16] = [ 2 56 20]

Exercise 4: Create a function that iterates over the columns of a 2-dimensional array


In [59]:
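def iterate(arr):
    # Iterating over a 2-D array yields its rows, so the caller passes the
    # transpose: each row of arr is then a column of the original array.
    for i, col in enumerate(arr):
        print('Column {} is {}'.format(i, col.reshape(-1, 1)))

iterate(test.T)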


Column 0 is [[1]
 [4]
 [2]
 [1]]
Column 1 is [[6]
 [8]
 [1]
 [7]]
Column 2 is [[0]
 [0]
 [6]
 [4]]

Loading and Saving Data

To load and save data, NumPy has the loadtxt and savetxt commands. However, they only work for one- and two-dimensional arrays, and loadtxt reads the values back as floats by default.


In [60]:
np.savetxt('numpytest.txt', test)
np.loadtxt('numpytest.txt')


Out[60]:
array([[ 1.,  6.,  0.],
       [ 4.,  8.,  0.],
       [ 2.,  1.,  6.],
       [ 1.,  7.,  4.]])
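The integers of test came back as floats in the output above. As a sketch of two alternatives: the fmt and dtype arguments keep integers as integers in text files, and the binary pair np.save / np.load handles arrays with any number of dimensions. The file names and the temporary directory are arbitrary choices for this example.

```python
import os
import tempfile

import numpy as np

test = np.array([[1, 6, 0], [4, 8, 0]])
cube = np.arange(24).reshape(2, 3, 4)          # 3-D array

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, 'ints.txt')
    npy_path = os.path.join(tmp, 'cube.npy')

    # Text format: write integers as integers and read them back as such.
    np.savetxt(txt_path, test, fmt='%d')
    loaded = np.loadtxt(txt_path, dtype=int)

    # Binary .npy format: keeps dtype and shape for any number of dimensions.
    np.save(npy_path, cube)
    cube_loaded = np.load(npy_path)

print(loaded.dtype.kind, np.array_equal(loaded, test))
print(cube_loaded.shape, np.array_equal(cube_loaded, cube))
```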

Pandas

When dealing with numeric matrices and vectors in Python, NumPy makes life a lot easier. For more complex data, however, it leaves a bit to be desired. For those used to working with dedicated languages like R, doing data analysis directly with numpy feels like a step back. Fortunately, some nice folks have written the Python Data Analysis Library (a.k.a. pandas). Pandas provides an R-like DataFrame, produces high quality plots with matplotlib, and integrates nicely with other libraries that expect NumPy arrays.

Pandas works with Series of data, which are then arranged in DataFrames. A DataFrame is the closest object to an Excel spreadsheet that you will see throughout the course (although, being integrated in Python and combinable with so many different packages, DataFrames are much more powerful than Excel spreadsheets). The data in a Series can be either qualitative or quantitative. Creating a Series is as easy as creating a NumPy array from a one-dimensional list.


In [61]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)


Out[61]:
0    Tiger
1     Bear
2    Moose
dtype: object

In [62]:
numbers = [1, 2, 3]
pd.Series(numbers)


Out[62]:
0    1
1    2
2    3
dtype: int64

Notice that the series is indexed by default by integers. You can change this indexing by using a dictionary instead of a list for creating the series.


In [63]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s


Out[63]:
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object
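A dictionary is not the only way to obtain a labelled Series: an index can also be passed explicitly. The sketch below, with made-up values, shows access by label and by position, and the fact that arithmetic between two Series aligns the labels.

```python
import pandas as pd

# An explicit index can be passed together with a list of values.
s = pd.Series([4.5, 7.0, 1.2], index=['a', 'b', 'c'])

print(s['b'])          # access by label
print(s.iloc[0])       # access by position
print(s.index.tolist())
print(s.values)        # underlying NumPy array

# Operations between series align on the index labels, not on position.
t = pd.Series([10, 20], index=['c', 'a'])
print(s + t)           # 'b' has no partner and becomes NaN
```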

DataFrames, on the other hand, can be built from two-dimensional arrays, with the ability to label the columns and index the rows.


In [64]:
u = pd.DataFrame(np.random.randn(1000,6), index=np.arange(0, 3000, 3), 
                 columns=['A', 'B', 'C', 'D', 'E', 'F'])
u


Out[64]:
A B C D E F
0 -0.761486 -1.856963 -0.259506 -0.965839 0.595335 -1.393989
3 0.730582 -1.933338 0.159036 -0.303658 0.285985 -0.015358
6 1.181648 -0.308067 0.194845 -0.110363 -0.377039 -0.563322
9 -0.207188 0.887225 -0.089879 1.535221 -0.178550 -0.119188
12 -1.452087 -1.678056 0.442135 0.815065 -0.788070 0.865199
15 -0.312236 0.405971 -0.916694 -1.570857 -1.808676 -0.601334
18 -0.520655 0.520683 -0.420893 0.866483 0.707625 -0.457042
21 -0.763168 1.941782 -0.345903 -1.078514 -2.348245 -0.507215
24 0.286458 -0.203965 2.202701 0.305345 -0.373617 -0.516685
27 0.483357 0.582012 0.151425 -0.280406 -1.193848 0.074661
30 0.251333 -0.984670 -1.126600 1.437455 1.144455 0.916344
33 0.501224 -1.196463 -0.890707 -0.330306 1.701601 0.195782
36 -0.493956 -1.001679 -0.944730 0.740723 0.359925 1.290009
39 -0.616836 0.040601 0.839057 0.233386 0.572832 -0.672458
42 -1.017777 0.913523 -2.204982 -0.398876 1.027856 -0.112541
45 1.037946 1.087577 -0.024363 -1.386254 -0.700079 -0.451928
48 0.590588 1.912149 0.382943 1.103702 -0.174166 -0.475542
51 0.123948 -1.953421 -0.829696 0.385270 -1.386090 0.290291
54 -0.775569 2.293993 -0.889644 -0.786743 0.471794 0.610733
57 0.237778 -2.610199 -0.629762 0.555422 0.843687 -2.165088
60 -0.971872 1.174113 0.521214 0.634707 0.373459 1.213645
63 1.167321 0.057173 0.485950 0.605712 -0.307385 -1.319829
66 0.159974 -1.433897 0.605769 0.010578 -2.978682 -0.352156
69 -0.782439 1.053819 0.533252 0.878424 -0.269466 0.495181
72 0.891193 0.364128 0.178855 -1.640823 1.809076 -0.263593
75 0.051243 -0.263917 0.640288 -2.158057 -0.508163 1.132559
78 -1.096922 -1.401145 0.833753 -1.807646 -1.168890 1.390710
81 -0.988110 -0.369140 -0.237166 -1.303915 0.888377 1.142403
84 -0.865377 0.323423 0.492383 1.370745 -0.021754 -1.719541
87 0.321882 1.381522 -0.674172 -1.083496 1.914568 1.028375
... ... ... ... ... ... ...
2910 0.314307 -0.431866 1.527405 0.601484 -0.160716 1.314302
2913 0.257940 -0.099765 1.157054 -0.060946 -0.494843 0.341643
2916 -0.428302 1.507066 0.611729 -0.332719 -0.196556 0.423443
2919 0.349743 -0.282030 0.196447 0.002695 0.136590 0.401846
2922 -0.307028 -0.791913 -0.265240 0.125383 -1.003613 -1.160399
2925 0.219177 0.940110 0.781867 0.126816 0.161123 1.089183
2928 -1.196400 -0.184507 1.962129 -0.343204 1.436841 -1.038041
2931 -0.272810 0.410523 0.541178 -0.793986 0.742207 2.590795
2934 -0.439342 0.136602 -0.225461 0.364491 1.802063 -0.582919
2937 0.225935 1.015986 -1.177095 -0.971220 0.236631 0.891794
2940 -0.252398 -0.012515 -0.207346 -0.493030 -0.983243 -0.657357
2943 0.868827 -0.624654 0.365266 -1.029895 -0.904947 0.933805
2946 -1.673025 -0.421382 2.676548 -1.360665 -0.999639 -0.409515
2949 -0.629402 -0.742095 0.559815 -1.556713 -2.251803 -1.368577
2952 0.025487 -0.217591 0.182265 -1.496190 0.499133 -0.494414
2955 -1.311549 -1.553367 -0.267299 1.937489 -1.462930 0.015008
2958 -0.629710 -1.414470 0.165272 -0.123763 -0.774348 2.061785
2961 -0.018643 -2.001437 -0.692074 2.882349 -0.027989 -0.864119
2964 -1.199184 -2.128532 -1.322724 -0.951057 -0.370693 0.145449
2967 0.720639 -1.539494 0.966309 1.557915 0.100207 0.188154
2970 1.483731 0.821352 0.872577 -1.694163 1.452975 1.234305
2973 -0.203272 -1.026092 -0.303214 -0.634275 0.454736 0.541063
2976 -0.519535 -1.405036 -0.218642 0.372721 0.585361 0.018672
2979 0.419742 0.763683 -0.111488 -0.397157 -1.451003 1.081199
2982 1.542049 -0.929883 -2.279710 0.352081 -0.345532 0.163429
2985 -0.779326 0.424736 0.548876 -2.316646 1.518206 0.038710
2988 1.245536 1.010383 -1.802921 -2.585215 0.362162 -0.180893
2991 0.752105 -1.444161 -0.507737 0.018064 1.158738 -0.461061
2994 -0.688970 0.508954 -0.110612 -0.657601 -1.061930 0.050378
2997 0.208066 1.819959 -1.461952 1.949032 -0.814686 -0.153806

1000 rows × 6 columns

As you might have noticed, it is a bit ugly to deal with large DataFrames. There are, however, some functions that give an idea of the data in a frame.


In [65]:
u.head()


Out[65]:
           A         B         C         D         E         F
0  -0.761486 -1.856963 -0.259506 -0.965839  0.595335 -1.393989
3   0.730582 -1.933338  0.159036 -0.303658  0.285985 -0.015358
6   1.181648 -0.308067  0.194845 -0.110363 -0.377039 -0.563322
9  -0.207188  0.887225 -0.089879  1.535221 -0.178550 -0.119188
12 -1.452087 -1.678056  0.442135  0.815065 -0.788070  0.865199

In [66]:
u.tail()


Out[66]:
             A         B         C         D         E         F
2985 -0.779326  0.424736  0.548876 -2.316646  1.518206  0.038710
2988  1.245536  1.010383 -1.802921 -2.585215  0.362162 -0.180893
2991  0.752105 -1.444161 -0.507737  0.018064  1.158738 -0.461061
2994 -0.688970  0.508954 -0.110612 -0.657601 -1.061930  0.050378
2997  0.208066  1.819959 -1.461952  1.949032 -0.814686 -0.153806

In [67]:
u.describe()


Out[67]:
                 A            B            C            D            E            F
count  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.050794     0.061913    -0.025466     0.045786     0.022753    -0.060224
std       1.020710     1.005581     0.993343     1.005364     1.025786     1.005457
min      -3.205216    -3.110890    -3.052732    -2.773758    -3.470618    -3.867963
25%      -0.751474    -0.631540    -0.695884    -0.640073    -0.654438    -0.764999
50%      -0.061007     0.098579    -0.029510     0.047236     0.037100    -0.037634
75%       0.650328     0.757897     0.647189     0.720574     0.731039     0.604459
max       3.796625     3.126197     2.889348     2.913861     3.159995     4.216350

One can also change the maximal number of rows that is displayed:


In [68]:
pd.set_option('display.max_rows', 15)

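Pandas can also generate random DataFrames for testing (the pandas.util.testing module used here is available in the version shown at the top, but it was removed in pandas 2.0):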


In [69]:
import pandas.util.testing as tm

tm.makeDataFrame().head()


Out[69]:
                   A         B         C         D
P08BTc9mdE -0.527469 -0.273173 -0.636832 -0.156654
pxObI8gZh6 -0.993787 -1.260044  0.165435  0.716617
S81GD7fAO0  0.649611 -0.345573 -0.476357 -0.430727
r2d2zMxxmv -0.426132 -0.282060 -0.055078  0.638711
jaKW4OkQ0V -0.669180 -0.251879  0.949884 -1.117098

Indexing/Slicing in Pandas

The easiest way of accessing information in a Pandas DataFrame, equivalent to the way used in NumPy, is the iloc indexer. With it you can also set specific values, do conditional indexing, and everything else we have seen in the NumPy Indexing/Slicing section above.


In [70]:
u.iloc[125:132,[0, 2, 5]]


Out[70]:
            A         C         F
375 -0.636038  0.116078 -0.488360
378 -0.950089 -2.232674 -0.784635
381  1.168677 -0.428499 -0.069833
384 -1.268749  1.368481  0.612919
387  0.008765 -0.606459  0.171234
390 -0.212263 -0.327441 -0.592471
393  0.801593 -1.103829 -1.240144

However, there are a few other ways of accessing the data in a Pandas DataFrame, which typically have a more "direct" connection with the actual content of the DataFrame. Individual columns or sets of columns can be accessed by their column names. Choosing a single column gives a Series, while two or more produce a DataFrame.


In [71]:
u['A'].head()


Out[71]:
0    -0.761486
3     0.730582
6     1.181648
9    -0.207188
12   -1.452087
Name: A, dtype: float64

In [72]:
u[['A', 'D']].head()


Out[72]:
           A         D
0  -0.761486 -0.965839
3   0.730582 -0.303658
6   1.181648 -0.110363
9  -0.207188  1.535221
12 -1.452087  0.815065

Not only that, you can also access a single column as an attribute, without the need for brackets []


In [73]:
u.A.head()


Out[73]:
0    -0.761486
3     0.730582
6     1.181648
9    -0.207188
12   -1.452087
Name: A, dtype: float64

Slicing with the usual [] selects rows according to their position (row number), not their index label


In [74]:
u[0:10][list('BCF')]


Out[74]:
           B         C         F
0  -1.856963 -0.259506 -1.393989
3  -1.933338  0.159036 -0.015358
6  -0.308067  0.194845 -0.563322
9   0.887225 -0.089879 -0.119188
12 -1.678056  0.442135  0.865199
15  0.405971 -0.916694 -0.601334
18  0.520683 -0.420893 -0.457042
21  1.941782 -0.345903 -0.507215
24 -0.203965  2.202701 -0.516685
27  0.582012  0.151425  0.074661

You can also choose specific rows according to their index labels with the loc indexer. Note that, unlike positional slices, label slices include the end point.


In [75]:
u.loc[6:15]


Out[75]:
           A         B         C         D         E         F
6   1.181648 -0.308067  0.194845 -0.110363 -0.377039 -0.563322
9  -0.207188  0.887225 -0.089879  1.535221 -0.178550 -0.119188
12 -1.452087 -1.678056  0.442135  0.815065 -0.788070  0.865199
15 -0.312236  0.405971 -0.916694 -1.570857 -1.808676 -0.601334
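Since the index of u is made of integers, it is easy to mix up positions and labels. The following sketch uses a small made-up DataFrame with the same kind of index (multiples of 3) to put iloc and loc side by side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[0, 3, 6, 9], columns=['A', 'B', 'C'])

by_position = df.iloc[1:3]     # positions 1 and 2, end excluded -> labels 3, 6
by_label = df.loc[3:9]         # labels 3 to 9, end included -> labels 3, 6, 9

print(by_position.index.tolist())
print(by_label.index.tolist())

# Both accept a column selection as second argument.
print(df.loc[3:6, ['A', 'C']])
print(df.iloc[1:3, [0, 2]])    # same selection by position
```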

Or, you can access just the elements that satisfy some condition


In [76]:
u[u.D > 2]


Out[76]:
A B C D E F
96 -1.062619 -0.326839 0.698481 2.260445 -0.480326 -0.030888
231 -0.062075 1.700715 0.004098 2.265953 0.971236 -1.165149
243 -1.548629 -0.682061 -0.988246 2.101512 -0.281365 0.165648
294 1.536812 -0.065230 0.089026 2.173031 -0.229206 3.471096
717 1.043050 -0.596884 -2.254242 2.728187 0.597010 1.541664
984 0.697663 -0.625542 -0.445807 2.276466 -0.918990 -0.303464
1092 0.038387 -0.155476 -0.253734 2.684383 0.774636 0.283066
... ... ... ... ... ... ...
2313 -0.205846 -1.757420 -1.705877 2.100689 -2.208538 -1.132553
2343 -1.878500 -1.878065 0.689921 2.751148 -0.482695 -1.619270
2361 -3.205216 1.287279 -0.571579 2.167375 -1.650644 0.358632
2430 0.580814 -0.215444 -1.704029 2.505435 2.415872 0.450578
2589 -1.093291 -0.069399 -0.168713 2.109153 0.723736 0.105442
2751 2.553238 0.060056 -0.290016 2.663297 1.081204 -0.416994
2961 -0.018643 -2.001437 -0.692074 2.882349 -0.027989 -0.864119

29 rows × 6 columns


In [77]:
u[~(u.D > 2)]  # For the inverse of u.D > 2


Out[77]:
A B C D E F
0 -0.761486 -1.856963 -0.259506 -0.965839 0.595335 -1.393989
3 0.730582 -1.933338 0.159036 -0.303658 0.285985 -0.015358
6 1.181648 -0.308067 0.194845 -0.110363 -0.377039 -0.563322
9 -0.207188 0.887225 -0.089879 1.535221 -0.178550 -0.119188
12 -1.452087 -1.678056 0.442135 0.815065 -0.788070 0.865199
15 -0.312236 0.405971 -0.916694 -1.570857 -1.808676 -0.601334
18 -0.520655 0.520683 -0.420893 0.866483 0.707625 -0.457042
... ... ... ... ... ... ...
2979 0.419742 0.763683 -0.111488 -0.397157 -1.451003 1.081199
2982 1.542049 -0.929883 -2.279710 0.352081 -0.345532 0.163429
2985 -0.779326 0.424736 0.548876 -2.316646 1.518206 0.038710
2988 1.245536 1.010383 -1.802921 -2.585215 0.362162 -0.180893
2991 0.752105 -1.444161 -0.507737 0.018064 1.158738 -0.461061
2994 -0.688970 0.508954 -0.110612 -0.657601 -1.061930 0.050378
2997 0.208066 1.819959 -1.461952 1.949032 -0.814686 -0.153806

971 rows × 6 columns

More recently, the query method was added to DataFrame for the same purpose. While it is less powerful than logical indexing, it is often faster and shorter (especially when the DataFrame has a longer name than just u):


In [78]:
u.query('D > 2')


Out[78]:
A B C D E F
96 -1.062619 -0.326839 0.698481 2.260445 -0.480326 -0.030888
231 -0.062075 1.700715 0.004098 2.265953 0.971236 -1.165149
243 -1.548629 -0.682061 -0.988246 2.101512 -0.281365 0.165648
294 1.536812 -0.065230 0.089026 2.173031 -0.229206 3.471096
717 1.043050 -0.596884 -2.254242 2.728187 0.597010 1.541664
984 0.697663 -0.625542 -0.445807 2.276466 -0.918990 -0.303464
1092 0.038387 -0.155476 -0.253734 2.684383 0.774636 0.283066
... ... ... ... ... ... ...
2313 -0.205846 -1.757420 -1.705877 2.100689 -2.208538 -1.132553
2343 -1.878500 -1.878065 0.689921 2.751148 -0.482695 -1.619270
2361 -3.205216 1.287279 -0.571579 2.167375 -1.650644 0.358632
2430 0.580814 -0.215444 -1.704029 2.505435 2.415872 0.450578
2589 -1.093291 -0.069399 -0.168713 2.109153 0.723736 0.105442
2751 2.553238 0.060056 -0.290016 2.663297 1.081204 -0.416994
2961 -0.018643 -2.001437 -0.692074 2.882349 -0.027989 -0.864119

29 rows × 6 columns
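Both styles can combine several conditions. As a sketch on a small random DataFrame (the seed, the column names and the threshold variable are arbitrary choices for this example), the two selections below are equivalent; inside a query string, @ refers to a Python variable.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)                      # arbitrary fixed seed
df = pd.DataFrame(rng.randn(200, 3), columns=['A', 'B', 'D'])
threshold = 1.0                                     # hypothetical cut-off

masked = df[(df.D > threshold) & (df.A < 0)]
queried = df.query('D > @threshold and A < 0')      # @ refers to a Python variable

print(masked.equals(queried))                       # True
```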

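Reshaping DataFrames

The methods pivot, stack and unstack rearrange the same data into a different layout. pivot builds a table whose rows and columns are taken from the values of two columns, stack moves the column labels into an inner level of the row index, and unstack goes in the opposite direction, moving row labels into the columns. Note that the cells below were run at a later point of the session (hence their numbering): by then column F had been replaced by its inverse, as done in the next section, and a column G containing the labels a, b and c had been added to u.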


In [132]:
u.pivot(index='E', columns='G', values='A')


Out[132]:
G a b c
E
-3.470618 -0.525781 NaN NaN
-2.978682 NaN 0.159974 NaN
-2.668974 NaN -1.407538 NaN
-2.587076 NaN 0.782683 NaN
-2.485586 -1.196322 NaN NaN
-2.457113 NaN NaN 0.591406
-2.439570 NaN NaN -0.315309
... ... ... ...
2.471763 NaN NaN 0.588655
2.523968 0.968783 NaN NaN
2.528503 NaN NaN 0.737061
2.686937 NaN -0.392155 NaN
2.829819 NaN NaN 1.453957
3.060963 NaN 1.390720 NaN
3.159995 0.190599 NaN NaN

1000 rows × 3 columns


In [136]:
u.stack()


Out[136]:
0     A   -0.761486
      B    -1.85696
      C   -0.259506
      D   -0.965839
      E    0.595335
      F   -0.717366
      G           b
             ...   
2997  A    0.208066
      B     1.81996
      C    -1.46195
      D     1.94903
      E   -0.814686
      F    -6.50169
      G           b
Length: 7000, dtype: object

In [137]:
u.unstack()


Out[137]:
A  0      -0.761486
   3       0.730582
   6        1.18165
   9      -0.207188
   12      -1.45209
   15     -0.312236
   18     -0.520655
             ...   
G  2979           a
   2982           a
   2985           a
   2988           c
   2991           b
   2994           a
   2997           b
Length: 7000, dtype: object

In [138]:
u.stack().unstack()


Out[138]:
A B C D E F G
0 -0.761486 -1.85696 -0.259506 -0.965839 0.595335 -0.717366 b
3 0.730582 -1.93334 0.159036 -0.303658 0.285985 -65.1111 c
6 1.18165 -0.308067 0.194845 -0.110363 -0.377039 -1.77518 c
9 -0.207188 0.887225 -0.0898789 1.53522 -0.17855 -8.39008 b
12 -1.45209 -1.67806 0.442135 0.815065 -0.78807 1.1558 a
15 -0.312236 0.405971 -0.916694 -1.57086 -1.80868 -1.66297 b
18 -0.520655 0.520683 -0.420893 0.866483 0.707625 -2.18798 a
... ... ... ... ... ... ... ...
2979 0.419742 0.763683 -0.111488 -0.397157 -1.451 0.924899 a
2982 1.54205 -0.929883 -2.27971 0.352081 -0.345532 6.11887 a
2985 -0.779326 0.424736 0.548876 -2.31665 1.51821 25.8329 a
2988 1.24554 1.01038 -1.80292 -2.58522 0.362162 -5.52813 c
2991 0.752105 -1.44416 -0.507737 0.018064 1.15874 -2.16891 b
2994 -0.68897 0.508954 -0.110612 -0.657601 -1.06193 19.8499 a
2997 0.208066 1.81996 -1.46195 1.94903 -0.814686 -6.50169 b

1000 rows × 7 columns
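
The outputs above are easier to follow on a tiny table. The following is only a sketch on made-up data (the names toy, long_form and wide_form are hypothetical and not part of the dataset used in this section). It shows that stack moves the column labels into an inner index level, that unstack reverses this, and that calling unstack directly on a plain DataFrame yields a Series ordered column by column, as in Out[137].

```python
import pandas as pd

# Hypothetical toy data, unrelated to the DataFrame u used above
toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r0', 'r1'])

# stack: column labels become an inner index level (row by row order)
long_form = toy.stack()
print(long_form)

# unstack: the inner index level goes back to the columns
wide_form = long_form.unstack()
print(wide_form)

# unstack on a DataFrame with a simple index gives a Series
# ordered column by column
by_column = toy.unstack()
print(by_column)
```

In this sketch wide_form holds the same values as toy, which is why u.stack().unstack() above gives back the original table (although the columns come back with dtype object there, because u mixes numbers and strings).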

Computing With DataFrames

You can calculate with DataFrames or their columns (which are Series) the same way you could with arrays.


In [79]:
u['F'] = 1 / u['F']
u['F'].head()


Out[79]:
0     -0.717366
3    -65.111125
6     -1.775183
9     -8.390081
12     1.155803
Name: F, dtype: float64

In [80]:
np.mean(u)


Out[80]:
A   -0.050794
B    0.061913
C   -0.025466
D    0.045786
E    0.022753
F   -4.038254
dtype: float64
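
As a small self-contained illustration of this array-like arithmetic, here is a sketch on made-up data (toy and the column names ratio and logA are hypothetical). Operations between columns act element by element, NumPy functions accept a Series directly, and the mean method reduces each column to a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, unrelated to the DataFrame u used above
toy = pd.DataFrame({'A': [1.0, 2.0, 4.0], 'B': [10.0, 20.0, 40.0]})

# Arithmetic between columns is element-wise, as with NumPy arrays
toy['ratio'] = toy['B'] / toy['A']

# NumPy universal functions take a Series and return a Series
toy['logA'] = np.log2(toy['A'])

# One mean value per column
means = toy.mean()
print(toy)
print(means)
```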

You can apply functions to the whole dataset or to specific columns with the apply command. apply acts on a whole column at a time (i.e. on a Pandas Series), so you can compute quantities that depend on several values of the column, for instance the mean. To apply a function on a truly element-by-element basis, use applymap or Series.apply instead.


In [81]:
def mn(col):
    return sum(col) / len(col)

u.apply(mn)


Out[81]:
A   -0.050794
B    0.061913
C   -0.025466
D    0.045786
E    0.022753
F   -4.038254
dtype: float64

While most statistics can be computed directly with built-in methods (the mean in the example above is one of them), apply also works on columns with strings or categorical data, where no mathematical operations are defined. The only limit is your imagination.
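
The difference between column-wise and element-wise application can be sketched on a made-up table (toy, spread, lengths and initials are hypothetical names chosen for illustration). The first and third calls receive a whole column, the second receives one element at a time.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one string column
toy = pd.DataFrame({'value': [1.0, 2.0, 3.0],
                    'label': ['x', 'yy', 'zzz']})

# DataFrame.apply passes each column (a Series) to the function,
# so the function can use all the values of the column at once
spread = toy[['value']].apply(lambda col: col.max() - col.min())

# Series.apply calls the function once per element
lengths = toy['label'].apply(len)

# A column-wise function on strings: join the upper-cased first letters
initials = toy[['label']].apply(lambda col: col.str[0].str.upper().str.cat())

print(spread)
print(lengths)
print(initials)
```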

Combining DataFrames

Something we will often do as scientists is combine data from different sources into a single dataset. Pandas offers several commands for this, depending on what we want to achieve.

To begin with, new rows of data are appended with the command append.


In [82]:
newdata = pd.DataFrame(np.ones((5, 6)), index=np.arange(3003, 3018, 3), columns=list('ABCDEF'))
newdata


Out[82]:
A B C D E F
3003 1.0 1.0 1.0 1.0 1.0 1.0
3006 1.0 1.0 1.0 1.0 1.0 1.0
3009 1.0 1.0 1.0 1.0 1.0 1.0
3012 1.0 1.0 1.0 1.0 1.0 1.0
3015 1.0 1.0 1.0 1.0 1.0 1.0

In [83]:
unew = u.append(newdata)
unew.tail(10)


Out[83]:
A B C D E F
2985 -0.779326 0.424736 0.548876 -2.316646 1.518206 25.832926
2988 1.245536 1.010383 -1.802921 -2.585215 0.362162 -5.528130
2991 0.752105 -1.444161 -0.507737 0.018064 1.158738 -2.168912
2994 -0.688970 0.508954 -0.110612 -0.657601 -1.061930 19.849865
2997 0.208066 1.819959 -1.461952 1.949032 -0.814686 -6.501688
3003 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3006 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3009 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3012 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3015 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

The same result can be obtained with concat, which is also the route to take in Pandas 2.0 and later, where DataFrame.append no longer exists.


In [84]:
pd.concat([u, newdata]).tail(10)


Out[84]:
A B C D E F
2985 -0.779326 0.424736 0.548876 -2.316646 1.518206 25.832926
2988 1.245536 1.010383 -1.802921 -2.585215 0.362162 -5.528130
2991 0.752105 -1.444161 -0.507737 0.018064 1.158738 -2.168912
2994 -0.688970 0.508954 -0.110612 -0.657601 -1.061930 19.849865
2997 0.208066 1.819959 -1.461952 1.949032 -0.814686 -6.501688
3003 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3006 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3009 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3012 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
3015 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

New columns of data can simply be assigned, or added with the command join.


In [85]:
u['G'] = np.random.choice(['a', 'b', 'c'], len(u))
u.tail()


Out[85]:
A B C D E F G
2985 -0.779326 0.424736 0.548876 -2.316646 1.518206 25.832926 a
2988 1.245536 1.010383 -1.802921 -2.585215 0.362162 -5.528130 c
2991 0.752105 -1.444161 -0.507737 0.018064 1.158738 -2.168912 b
2994 -0.688970 0.508954 -0.110612 -0.657601 -1.061930 19.849865 a
2997 0.208066 1.819959 -1.461952 1.949032 -0.814686 -6.501688 b
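
The cell above only shows direct assignment. For join, a minimal sketch on made-up data could look as follows (left, extra and joined are hypothetical names). It assumes that the new columns live in a second DataFrame that shares index labels with the first one.

```python
import pandas as pd

# Hypothetical toy data: two DataFrames sharing part of their index
left = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=[0, 3, 6])
extra = pd.DataFrame({'H': ['p', 'q']}, index=[0, 6])

# join aligns the rows on the index; index labels of left that are
# missing in extra get NaN in the new column
joined = left.join(extra)
print(joined)
```

By default join keeps every row of the DataFrame it is called on, so the row labelled 3 survives with a missing value in column H.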

Grouping Data


In [86]:
for h, group in u.groupby('G'):
    print('{}: {}'.format(h, np.mean(group['F'])))


a: -0.0008421855752116789
b: -13.980448225745274
c: 1.5318088670645276

In [87]:
u.groupby('G').describe()


Out[87]:
A B ... E F
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
G
a 355.0 -0.054040 1.056119 -3.205216 -0.762618 -0.024805 0.642713 3.796625 355.0 0.137639 ... 0.660793 3.159995 355.0 -0.000842 15.675441 -223.960885 -1.462605 -0.504428 1.587306 77.513602
b 324.0 -0.016923 0.979483 -2.487953 -0.717918 -0.057716 0.681434 3.356327 324.0 0.040709 ... 0.662474 3.060963 324.0 -13.980448 264.328321 -4752.187932 -1.289143 -0.438322 1.574061 110.350988
c 321.0 -0.081392 1.023785 -2.690009 -0.809986 -0.109089 0.596933 2.553238 321.0 -0.000431 ... 0.885457 2.829819 321.0 1.531809 44.547577 -142.745170 -1.451198 -0.483571 1.372699 762.072499

3 rows × 48 columns


In [98]:
u.pivot_table(index='G', aggfunc='mean')


Out[98]:
           A          B          C          D          E           F
G
a  -0.054040   0.137639  -0.028253  -0.024943  -0.038171   -0.000842
b  -0.016923   0.040709  -0.090043   0.116645  -0.014294  -13.980448
c  -0.081392  -0.000431   0.042796   0.052485   0.127525    1.531809
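
The three cells above all follow the same idea: split the rows by the values of one column, then reduce each group. A sketch on made-up data (toy, by_group, table and summary are hypothetical names) suggests how the loop, groupby and pivot_table relate to each other.

```python
import pandas as pd

# Hypothetical toy data with a grouping column G and a value column F
toy = pd.DataFrame({'G': ['a', 'b', 'a', 'b'],
                    'F': [1.0, 2.0, 3.0, 6.0]})

# Split by G, then reduce the F values of each group to their mean
by_group = toy.groupby('G')['F'].mean()

# The same numbers obtained through pivot_table
table = toy.pivot_table(index='G', values='F', aggfunc='mean')

# Several statistics at once with agg
summary = toy.groupby('G')['F'].agg(['mean', 'max'])

print(by_group)
print(table)
print(summary)
```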

Loading and Saving DataFrames

To save and load Pandas DataFrames we will use the to_csv and read_csv commands.


In [107]:
u.to_csv('test.csv')
v = pd.read_csv('test.csv', index_col=0)
v.head()


Out[107]:
A B C D E F G
0 -0.761486 -1.856963 -0.259506 -0.965839 0.595335 -0.717366 b
3 0.730582 -1.933338 0.159036 -0.303658 0.285985 -65.111125 c
6 1.181648 -0.308067 0.194845 -0.110363 -0.377039 -1.775183 c
9 -0.207188 0.887225 -0.089879 1.535221 -0.178550 -8.390081 b
12 -1.452087 -1.678056 0.442135 0.815065 -0.788070 1.155803 a
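
The role of index_col=0 can be seen in a short sketch that writes to an in-memory buffer instead of a file (io.StringIO is from the standard library; toy, plain and restored are hypothetical names). to_csv stores the index as the first column, and read_csv only turns it back into the index when asked to.

```python
import io
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame({'A': [0.5, 1.5]}, index=[0, 3])

# Write the CSV text to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
text = buffer.getvalue()
print(text)

# Without index_col the saved index comes back as an ordinary column
plain = pd.read_csv(io.StringIO(text))

# With index_col=0 the first column is restored as the index
restored = pd.read_csv(io.StringIO(text), index_col=0)

print(plain)
print(restored)
```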

In addition, Pandas has special commands to load and save Excel spreadsheets (yay!). However, to use them you'll need the openpyxl and xlrd packages.


In [108]:
u.to_excel('test.xlsx', sheet_name='My sheet')
pd.read_excel('test.xlsx', sheet_name='My sheet', index_col=0).head()


Out[108]:
A B C D E F G
0 -0.761486 -1.856963 -0.259506 -0.965839 0.595335 -0.717366 b
3 0.730582 -1.933338 0.159036 -0.303658 0.285985 -65.111125 c
6 1.181648 -0.308067 0.194845 -0.110363 -0.377039 -1.775183 c
9 -0.207188 0.887225 -0.089879 1.535221 -0.178550 -8.390081 b
12 -1.452087 -1.678056 0.442135 0.815065 -0.788070 1.155803 a

Exercise 5: Download this dataset and load it, using the first column as the index. Take a look at it, and do the following things:

  • Choose the columns 'Identifier', 'BaseStamina', 'BaseAttack', 'BaseDefense', 'Type1' and 'Type2'
  • Create a function that lowercases strings and apply it to 'Type1' and 'Type2' (Extra: just capitalize the strings, i.e., leave the first letter uppercase and lowercase the rest)
  • Create a function that returns a Boolean value (don't be afraid of this, it is just a function that returns either True or False) that tells whether a Pokémon has high stamina (BaseStamina>170) or not. Store this information in a new column and show the list of Pokémon with high stamina
  • Show the instructor the last 15 rows of your dataset

In [109]:
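df = pd.read_csv('https://raw.githubusercontent.com/ChihChengLiang/pokemongor/master/data-raw/pokemons.csv',
                 index_col=0)

# Keep only the requested columns; copy() gives an independent DataFrame to modify
df = df[['Identifier', 'BaseStamina', 'BaseAttack', 'BaseDefense', 'Type1', 'Type2']].copy()

# Capitalize a string: first letter uppercase, the rest lowercase
def capitalize(st):
    return st.capitalize()

for col in ['Type1', 'Type2']:
    df[col] = df[col].apply(capitalize)

# Return True for high stamina (above 170), False otherwise
def highstamina(x):
    return x > 170

df['HighStamina'] = df.BaseStamina.apply(highstamina)

# A Boolean column can be used directly as a row mask
print(df[df['HighStamina']].Identifier)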

df.tail(15)


PkMn
31      Nidoqueen
36       Clefable
39     Jigglypuff
40     Wigglytuff
59       Arcanine
62      Poliwrath
68        Machamp
          ...    
143       Snorlax
144      Articuno
145        Zapdos
146       Moltres
149     Dragonite
150        Mewtwo
151           Mew
Name: Identifier, Length: 26, dtype: object
Out[109]:
      Identifier  BaseStamina  BaseAttack  BaseDefense     Type1   Type2  HighStamina
PkMn
137      Porygon          130         156          158    Normal    None        False
138      Omanyte           70         132          160      Rock   Water        False
139      Omastar          140         180          202      Rock   Water        False
140       Kabuto           60         148          142      Rock   Water        False
141     Kabutops          120         190          190      Rock   Water        False
142   Aerodactyl          160         182          162      Rock  Flying        False
143      Snorlax          320         180          180    Normal    None         True
144     Articuno          180         198          242       Ice  Flying         True
145       Zapdos          180         232          194  Electric  Flying         True
146      Moltres          180         242          194      Fire  Flying         True
147      Dratini           82         128          110    Dragon    None        False
148    Dragonair          122         170          152    Dragon    None        False
149    Dragonite          182         250          212    Dragon  Flying         True
150       Mewtwo          212         284          202   Psychic    None         True
151          Mew          200         220          220   Psychic    None         True
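
The solution above follows the exercise and writes explicit functions. As an alternative, the same steps can be sketched with vectorized operations on a three-row excerpt (the stamina values are copied from the table above; the upper-case type strings are an assumption about the raw file, and toy and high are hypothetical names).

```python
import pandas as pd

# Three rows from the table above; the upper-case types are assumed
toy = pd.DataFrame({'Identifier': ['Aerodactyl', 'Snorlax', 'Dratini'],
                    'BaseStamina': [160, 320, 82],
                    'Type1': ['ROCK', 'NORMAL', 'DRAGON']})

# The .str accessor applies string methods to every element
toy['Type1'] = toy['Type1'].str.capitalize()

# A comparison on a column already returns a Boolean Series
toy['HighStamina'] = toy['BaseStamina'] > 170

# Select the identifiers of the rows where the mask is True
high = toy.loc[toy['HighStamina'], 'Identifier']
print(high)
```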