A ten-minute tour of Pandas


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Two Important Data Structures

  • Series - a one-dimensional labeled array. Like a single column in a spreadsheet
  • DataFrame - a two-dimensional labeled table. Like the whole spreadsheet
  • Index - OK, there are THREE important data structures. The Index holds the labels
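
  • A minimal sketch of how the three pieces relate. The names prices and table, and the fruit data, are made up for illustration

```python
import pandas as pd

# A Series: one labeled column of values
prices = pd.Series([1.5, 2.0, 3.25], index=['apple', 'pear', 'plum'])

# A DataFrame: several columns that share one row index
table = pd.DataFrame({'price': prices, 'in_stock': [True, False, True]})

# The Index: the labels themselves, available on both structures
print(prices.index)
print(table.index)
print(table.columns)  # column labels are stored in an Index too
```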

Built on top of NumPy

  • NumPy is a lower-level library for manipulating numerical data
  • Its core is implemented in C, which means it is wicked fast
  • Pandas adds labels, columns of mixed types, and missing-data handling on top of NumPy arrays
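
  • A small sketch of the relationship, assuming a plain numeric Series: the values sit in a NumPy array, and NumPy functions can be applied to the Series directly. The names s, arr and roots are illustrative

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# The values of a numeric Series are available as a NumPy array
arr = s.to_numpy()
print(type(arr))

# NumPy functions accept a Series and the result keeps its labels
roots = np.sqrt(s)
print(roots)
```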

Series

  • s = pd.Series(data, index=index)
  • data can be
    • A Python Dictionary
    • A list
    • A single value
  • index is a list that will become the item labels (like the keys in a dictionary)
  • Think of a Series as an ordered dictionary: values are looked up by label, but they also keep a fixed order, as in a list
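
  • A short sketch of the three kinds of data listed above. The variable names are illustrative; the single-value case repeats the value once per label

```python
import pandas as pd

index = ['a', 'b', 'c']

from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})  # keys become the labels
from_list = pd.Series([1, 2, 3], index=index)    # labels supplied separately
from_scalar = pd.Series(7, index=index)          # the value is repeated for every label

print(from_dict)
print(from_list)
print(from_scalar)
```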

In [6]:
# make a list of random numbers
a_python_list = list(np.random.randn(5))
a_python_list


Out[6]:
[-0.13321098075371843,
 1.1805918025675413,
 0.6952629518297494,
 0.38619173384279104,
 -0.2162508212191328]

In [8]:
a_pandas_series = pd.Series(data=a_python_list)
a_pandas_series


Out[8]:
0   -0.133211
1    1.180592
2    0.695263
3    0.386192
4   -0.216251
dtype: float64
  • You can see the list of random numbers
  • There is also a list of integers, the index
  • And there is a data type (dtype) associated with the items in the Series

In [9]:
a_simple_index = ['a', 'b', 'c', 'd', 'e']
a_pandas_series = pd.Series(data=a_python_list, index=a_simple_index)
a_pandas_series


Out[9]:
a   -0.133211
b    1.180592
c    0.695263
d    0.386192
e   -0.216251
dtype: float64
  • Notice how the index has changed from numbers to letters
  • We can use the index to extract specific values from the Series

In [10]:
# index by label
a_pandas_series['a']


Out[10]:
-0.13321098075371843

In [11]:
# index by location (.iloc is explicit; a plain [1] on a letter-labeled Series is deprecated in recent pandas)
a_pandas_series.iloc[1]


Out[11]:
1.1805918025675413
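
  • The explicit accessors .loc (labels) and .iloc (positions) avoid any ambiguity between the two kinds of lookup. A minimal sketch with a made-up Series s:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s.loc['b'])       # by label
print(s.iloc[1])        # by position
print(s.loc['b':'c'])   # label slices include the end label
print(s.iloc[1:3])      # position slices exclude the end, like Python lists
```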
  • You can also create a Series from a Python dictionary

In [12]:
a_python_dictionary = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(a_python_dictionary)


Out[12]:
a    0.0
b    1.0
c    2.0
dtype: float64
  • Series are useful for doing fast, "vectorized" operations, where one expression is applied to every item at once
  • This is much faster than looping over Python data structures
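
  • A small sketch of what "vectorized" means, using made-up names: the single expression and the explicit loop produce the same numbers, but the first hands the whole job to compiled code

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(5, dtype=float))

# Vectorized: one expression applied to the whole Series at once
doubled = values * 2

# The equivalent pure-Python loop visits one item at a time
doubled_loop = [v * 2 for v in values]

print(doubled.tolist() == doubled_loop)
```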

In [14]:
a_big_series = pd.Series(np.random.randn(1000))
a_big_series


Out[14]:
0      0.469852
1      0.175082
2     -0.018170
3      2.781493
4     -0.280631
5      0.419199
6     -0.588726
7     -1.689237
8     -0.374918
9      0.043309
10     0.231049
11     0.314572
12     1.303032
13     0.041015
14    -0.438821
15     0.266051
16    -0.481886
17    -0.152868
18     0.863440
19     0.231818
20     0.851909
21     0.159365
22     1.109669
23     0.627472
24    -0.092200
25    -1.385732
26     0.188567
27     1.939975
28     1.685535
29     0.344431
         ...   
970   -0.875709
971   -0.461561
972   -0.494858
973   -0.482895
974    0.398280
975   -1.197531
976    0.246262
977   -0.576767
978    1.134142
979    0.759695
980    0.728189
981   -1.953974
982   -0.326915
983   -1.528382
984    0.443522
985   -0.709981
986   -2.630291
987   -0.741079
988   -0.206771
989   -0.739157
990   -0.463475
991   -0.420967
992   -1.907424
993   -0.871503
994    0.020252
995   -0.414407
996    0.963306
997    0.014267
998   -0.603454
999    0.926399
dtype: float64

In [15]:
a_big_series * 2


Out[15]:
0      0.939704
1      0.350164
2     -0.036341
3      5.562987
4     -0.561262
5      0.838399
6     -1.177452
7     -3.378473
8     -0.749836
9      0.086618
10     0.462097
11     0.629144
12     2.606063
13     0.082030
14    -0.877642
15     0.532102
16    -0.963771
17    -0.305737
18     1.726880
19     0.463635
20     1.703818
21     0.318729
22     2.219338
23     1.254945
24    -0.184400
25    -2.771465
26     0.377133
27     3.879950
28     3.371071
29     0.688863
         ...   
970   -1.751417
971   -0.923122
972   -0.989717
973   -0.965791
974    0.796560
975   -2.395063
976    0.492523
977   -1.153534
978    2.268284
979    1.519391
980    1.456377
981   -3.907949
982   -0.653830
983   -3.056764
984    0.887044
985   -1.419963
986   -5.260581
987   -1.482158
988   -0.413541
989   -1.478315
990   -0.926950
991   -0.841934
992   -3.814849
993   -1.743005
994    0.040504
995   -0.828814
996    1.926612
997    0.028535
998   -1.206909
999    1.852799
dtype: float64

In [18]:
# the mean computed by hand: total divided by count
a_big_series.sum() / len(a_big_series)


Out[18]:
-0.036257500273335394

In [16]:
# the built-in mean gives the same result
a_big_series.mean()


Out[16]:
-0.036257500273335394

In [17]:
a_big_series.describe()


Out[17]:
count    1000.000000
mean       -0.036258
std         0.994357
min        -3.434910
25%        -0.695222
50%        -0.018045
75%         0.608289
max         3.193684
dtype: float64

DataFrames

  • DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
  • You can think of it like a spreadsheet or an R data frame
  • It is the most commonly used Pandas data structure

In [26]:
a_dictionary = {'one' : [1., 2., 3., 4.],
                'two' : [4., 3., 2., 1.]}

a_dataframe = pd.DataFrame(a_dictionary)
a_dataframe


Out[26]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [27]:
a_dataframe = pd.DataFrame(a_dictionary,
                           index=['a', 'b', 'c', 'd'])
a_dataframe


Out[27]:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
  • It is also common to create a DataFrame from a list of dictionaries, one dictionary per row

In [29]:
a_list_of_dictionaries = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
a_dataframe = pd.DataFrame(a_list_of_dictionaries)
a_dataframe


Out[29]:
   a   b     c
0  1   2   NaN
1  5  10  20.0
  • We can select a column by putting its label in square brackets

In [30]:
a_dataframe['a']


Out[30]:
0    1
1    5
Name: a, dtype: int64
  • This gives us column 'a' as a Series data structure
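
  • A short sketch of the difference between selecting one column and several; df rebuilds the small table above, and the other names are illustrative

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])

one_column = df['a']           # a single label returns a Series
two_columns = df[['a', 'c']]   # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(df['c'].isna())          # the key missing from the first dictionary shows up as NaN
```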

In [33]:
a_dataframe = pd.DataFrame({'a': np.random.randn(1000),
                            'b': np.random.randn(1000),
                            'c': np.random.randn(1000),
                            'd': np.random.randn(1000),
                            'e': 'hi'})
a_dataframe


Out[33]:
a b c d e
0 0.195200 1.471513 0.447872 0.742578 hi
1 -1.467380 1.005069 -0.679448 -1.296169 hi
2 -0.126771 -0.448476 -0.007398 -1.293716 hi
3 -0.829169 0.287930 -0.329248 -0.391619 hi
4 1.894362 0.507544 -2.194929 2.227772 hi
5 -0.455285 0.721201 -0.024484 -1.249271 hi
6 -1.868823 -0.023902 -1.070258 -1.057628 hi
7 0.481177 1.726554 -0.459909 -0.348522 hi
8 0.461946 -0.272053 0.555865 1.455070 hi
9 1.469342 0.440828 -1.469929 -1.147367 hi
10 1.318531 -1.184896 0.108213 -2.652809 hi
11 -0.715273 -1.771749 -0.476682 1.074689 hi
12 0.340681 -2.238441 1.723926 -2.261482 hi
13 0.948599 1.145019 0.642341 -0.192459 hi
14 -1.725399 0.208998 0.332862 0.277811 hi
15 1.911594 0.500968 0.386674 -0.680614 hi
16 0.269013 0.597904 -0.239381 -0.945871 hi
17 0.609643 1.000152 1.075684 0.789990 hi
18 0.909984 -0.299750 -1.886791 -0.230973 hi
19 -0.834842 -0.870673 -0.629568 -0.912954 hi
20 0.108223 -0.742483 0.870948 0.272827 hi
21 0.310667 -0.220494 -0.733112 1.008447 hi
22 -1.334094 -1.455676 0.076898 -1.603281 hi
23 1.215530 -2.415782 -1.821654 -0.365152 hi
24 0.708383 0.411112 0.356446 1.447221 hi
25 -1.024141 -0.440211 0.569519 0.992510 hi
26 0.547769 0.576499 0.390148 0.023413 hi
27 -1.776340 0.957517 0.098811 0.259768 hi
28 -0.837603 -0.254053 -0.090065 -1.251397 hi
29 -0.078810 0.647083 -0.722462 -0.265362 hi
... ... ... ... ... ...
970 -1.687111 1.588347 -1.150845 -0.294375 hi
971 -0.544362 2.200697 0.827082 -0.345998 hi
972 0.375467 0.024828 1.132318 0.977846 hi
973 -0.904338 0.513799 0.052034 -0.030120 hi
974 -0.520909 -1.277849 -0.641681 -0.939304 hi
975 0.257926 -0.476056 0.382637 -0.343628 hi
976 -0.022015 -0.168963 0.948342 -0.193978 hi
977 -0.115801 -2.135518 0.349617 0.697843 hi
978 0.668141 -0.967671 0.919919 -0.264464 hi
979 -0.575074 0.993120 0.339274 -0.418490 hi
980 -0.525728 -2.021995 -0.933688 -0.690072 hi
981 -1.704445 -0.369670 -0.938018 0.793153 hi
982 0.531911 0.167518 -0.423615 -0.096458 hi
983 -1.242685 0.167854 1.018922 0.896711 hi
984 -2.674783 -1.203283 0.778616 -0.637059 hi
985 -1.306915 0.507325 -0.151521 -0.432589 hi
986 -1.322193 0.298746 0.685674 -0.477047 hi
987 0.106706 0.419202 0.854542 0.168844 hi
988 1.090841 2.572600 1.759922 0.846667 hi
989 -0.271730 0.851365 -0.636409 0.107462 hi
990 0.836010 0.060212 0.374984 -0.238871 hi
991 -1.492225 0.102794 0.557444 -0.221696 hi
992 -0.768700 -1.053390 -0.913775 -0.580076 hi
993 -0.947071 -2.341502 -1.494057 1.161260 hi
994 -0.151204 -1.976072 1.908078 -1.330443 hi
995 -0.201867 1.303376 -0.276134 -0.399333 hi
996 -0.233172 0.159232 1.398382 0.490431 hi
997 0.194740 -0.044352 -1.175685 0.610438 hi
998 1.003858 0.053377 -0.992018 0.300915 hi
999 0.126112 0.400421 -0.553286 -0.322495 hi

1000 rows × 5 columns


In [34]:
a_dataframe.dtypes


Out[34]:
a    float64
b    float64
c    float64
d    float64
e     object
dtype: object

In [35]:
a_dataframe.describe()


Out[35]:
                 a            b            c            d
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.015121     0.001493    -0.015727    -0.018310
std       1.014903     1.025325     0.987438     1.007048
min      -3.232754    -2.761861    -3.243911    -2.652809
25%      -0.665784    -0.668076    -0.665368    -0.725810
50%       0.022373    -0.026524    -0.029161    -0.074956
75%       0.717300     0.712870     0.646439     0.693972
max       2.874075     2.959302     2.656383     3.596201
  • Dataframes are very useful for reading and writing tabular data to disk
  • We can save this dataframe as a CSV or an Excel spreadsheet

In [36]:
a_dataframe.to_csv("random-data.csv")
  • Now go back to the directory tree
  • You should see a new file called "random-data.csv"

In [38]:
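# recent pandas versions no longer write the old .xls format; .xlsx needs an Excel writer such as openpyxl installed
a_dataframe.to_excel("random-data.xlsx")
  • Now you can open your data in Excel, how fun!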

In [39]:
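# save again without the row index, so it does not come back as an extra column when the file is read
a_dataframe.to_excel("random-data.xlsx", index=False)
a_dataframe.to_csv("random-data.csv", index=False)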
  • Reading data is also super easy

In [41]:
a_new_dataframe = pd.read_csv("random-data.csv")
a_new_dataframe


Out[41]:
a b c d e
0 0.195200 1.471513 0.447872 0.742578 hi
1 -1.467380 1.005069 -0.679448 -1.296169 hi
2 -0.126771 -0.448476 -0.007398 -1.293716 hi
3 -0.829169 0.287930 -0.329248 -0.391619 hi
4 1.894362 0.507544 -2.194929 2.227772 hi
5 -0.455285 0.721201 -0.024484 -1.249271 hi
6 -1.868823 -0.023902 -1.070258 -1.057628 hi
7 0.481177 1.726554 -0.459909 -0.348522 hi
8 0.461946 -0.272053 0.555865 1.455070 hi
9 1.469342 0.440828 -1.469929 -1.147367 hi
10 1.318531 -1.184896 0.108213 -2.652809 hi
11 -0.715273 -1.771749 -0.476682 1.074689 hi
12 0.340681 -2.238441 1.723926 -2.261482 hi
13 0.948599 1.145019 0.642341 -0.192459 hi
14 -1.725399 0.208998 0.332862 0.277811 hi
15 1.911594 0.500968 0.386674 -0.680614 hi
16 0.269013 0.597904 -0.239381 -0.945871 hi
17 0.609643 1.000152 1.075684 0.789990 hi
18 0.909984 -0.299750 -1.886791 -0.230973 hi
19 -0.834842 -0.870673 -0.629568 -0.912954 hi
20 0.108223 -0.742483 0.870948 0.272827 hi
21 0.310667 -0.220494 -0.733112 1.008447 hi
22 -1.334094 -1.455676 0.076898 -1.603281 hi
23 1.215530 -2.415782 -1.821654 -0.365152 hi
24 0.708383 0.411112 0.356446 1.447221 hi
25 -1.024141 -0.440211 0.569519 0.992510 hi
26 0.547769 0.576499 0.390148 0.023413 hi
27 -1.776340 0.957517 0.098811 0.259768 hi
28 -0.837603 -0.254053 -0.090065 -1.251397 hi
29 -0.078810 0.647083 -0.722462 -0.265362 hi
... ... ... ... ... ...
970 -1.687111 1.588347 -1.150845 -0.294375 hi
971 -0.544362 2.200697 0.827082 -0.345998 hi
972 0.375467 0.024828 1.132318 0.977846 hi
973 -0.904338 0.513799 0.052034 -0.030120 hi
974 -0.520909 -1.277849 -0.641681 -0.939304 hi
975 0.257926 -0.476056 0.382637 -0.343628 hi
976 -0.022015 -0.168963 0.948342 -0.193978 hi
977 -0.115801 -2.135518 0.349617 0.697843 hi
978 0.668141 -0.967671 0.919919 -0.264464 hi
979 -0.575074 0.993120 0.339274 -0.418490 hi
980 -0.525728 -2.021995 -0.933688 -0.690072 hi
981 -1.704445 -0.369670 -0.938018 0.793153 hi
982 0.531911 0.167518 -0.423615 -0.096458 hi
983 -1.242685 0.167854 1.018922 0.896711 hi
984 -2.674783 -1.203283 0.778616 -0.637059 hi
985 -1.306915 0.507325 -0.151521 -0.432589 hi
986 -1.322193 0.298746 0.685674 -0.477047 hi
987 0.106706 0.419202 0.854542 0.168844 hi
988 1.090841 2.572600 1.759922 0.846667 hi
989 -0.271730 0.851365 -0.636409 0.107462 hi
990 0.836010 0.060212 0.374984 -0.238871 hi
991 -1.492225 0.102794 0.557444 -0.221696 hi
992 -0.768700 -1.053390 -0.913775 -0.580076 hi
993 -0.947071 -2.341502 -1.494057 1.161260 hi
994 -0.151204 -1.976072 1.908078 -1.330443 hi
995 -0.201867 1.303376 -0.276134 -0.399333 hi
996 -0.233172 0.159232 1.398382 0.490431 hi
997 0.194740 -0.044352 -1.175685 0.610438 hi
998 1.003858 0.053377 -0.992018 0.300915 hi
999 0.126112 0.400421 -0.553286 -0.322495 hi

1000 rows × 5 columns


In [42]:
a_new_dataframe.dtypes


Out[42]:
a    float64
b    float64
c    float64
d    float64
e     object
dtype: object
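
  • Why index=False matters when saving can be seen with an in-memory round trip. This is a minimal sketch that uses io.StringIO instead of a file on disk; the tiny df is made up for illustration

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [0.5, 1.5], 'e': 'hi'})

# Default: the row labels are written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```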

Plotting with Pandas

  • Another nice feature of Pandas is how easy it is to create charts and graphs from Series and DataFrames
  • This is because of tight integration with Matplotlib, which is the most popular plotting library for Python
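
  • One way to see the integration: .plot() draws onto a Matplotlib Axes, so the usual Matplotlib calls still apply afterwards. A minimal sketch with made-up data and labels:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) ** 2})

# Create a Matplotlib figure and let pandas draw onto its Axes
fig, ax = plt.subplots()
df.plot(ax=ax)

# Ordinary Matplotlib methods customize the chart that pandas drew
ax.set_title('Two columns, one line each')
ax.set_xlabel('row index')
```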

In [43]:
a_new_dataframe.plot()


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x10765f438>

In [45]:
a_new_dataframe['a'].plot()


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x107973c88>

In [47]:
a_new_dataframe.plot(kind="box")


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x10755d240>

In [49]:
a_new_dataframe.plot(kind="density")


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c6e2e10>

In [44]:
a_new_dataframe.hist()


Out[44]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1075d75c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x103976da0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1077062e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1078d79e8>]], dtype=object)
