Pandas:

the primary Python toolkit for tabular data analysis and manipulation

Presented by Michael Ripperger

What is Pandas?

  • Python package
  • Functional toolkit
    • Organizes data
    • Manipulates data
    • Analyzes data

In [1]:
import pandas as pd
import numpy as np

Speed!

  • Built on NumPy
  • Core routines implemented in Cython
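The NumPy/Cython foundation is what makes vectorized operations fast: the arithmetic runs in compiled code rather than the Python interpreter. A minimal sketch (illustrative, not from the original notebook) contrasting a loop with a vectorized Series operation:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000))

# Pure-Python loop: one interpreter round trip per element
looped = pd.Series([x * 2 for x in s])

# Vectorized: the work happens in compiled NumPy/Cython code
vectorized = s * 2
```

Both produce the same Series; the vectorized form is dramatically faster on large data.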

First data structure: Series


In [2]:
s = pd.Series([1,2,3,3,4,5,4], index=['a','b','c','d','e','f','g'])
print(s)


a    1
b    2
c    3
d    3
e    4
f    5
g    4
dtype: int64

Slicing


In [3]:
s[:3]


Out[3]:
a    1
b    2
c    3
dtype: int64

In [4]:
(s + s) / 4


Out[4]:
a    0.5
b    1.0
c    1.5
d    1.5
e    2.0
f    2.5
g    2.0
dtype: float64
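Arithmetic like `(s + s) / 4` is element-wise, and when the operands' indexes differ, pandas aligns on labels and fills non-overlapping labels with NaN. A small sketch of that alignment behavior (toy data, not part of the original notebook):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

# Only 'y' is present in both operands, so 'x' and 'z' come out NaN
total = a + b
```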

In [5]:
s.get('f')


Out[5]:
5

In [6]:
s.get('y', np.nan)


Out[6]:
nan

Second data structure: DataFrame

  • Similar to R's data frame
  • Does everything Series can do
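A DataFrame behaves like a dict of Series that share one index; a toy construction sketch (column names invented for illustration):

```python
import pandas as pd

# Each dict key becomes a column; all columns share the row index
frame = pd.DataFrame({'length': [191, 195, 181],
                      'breadth': [155, 149, 148]},
                     index=['a', 'b', 'c'])

# Selecting one column gives back a Series
lengths = frame['length']
```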

In [7]:
df = pd.read_csv('brothers.csv', index_col=0)

In [8]:
print(df)


    length_1  breadth_1  length_2  breadth_2
1        191        155       179        145
2        195        149       201        152
3        181        148       185        149
4        183        153       188        149
5        176        144       171        142
6        208        157       192        152
7        189        150       190        149
8        197        159       189        152
9        188        152       197        159
10       192        150       187        151
11       179        158       186        148
12       183        147       174        147
13       174        150       185        152
14       190        159       195        157
15       188        151       187        158
16       163        137       161        130
17       195        155       183        158
18       186        153       173        148
19       181        145       182        146
20       175        140       165        137
21       192        154       185        152
22       174        143       178        147
23       176        139       176        143
24       197        167       200        158
25       190        163       187        150

Metadata


In [9]:
df.shape


Out[9]:
(25, 4)

In [10]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 1 to 25
Data columns (total 4 columns):
length_1     25 non-null int64
breadth_1    25 non-null int64
length_2     25 non-null int64
breadth_2    25 non-null int64
dtypes: int64(4)
memory usage: 1000.0 bytes

In [11]:
df.head()


Out[11]:
length_1 breadth_1 length_2 breadth_2
1 191 155 179 145
2 195 149 201 152
3 181 148 185 149
4 183 153 188 149
5 176 144 171 142

In [12]:
df = pd.DataFrame(
    [df['length_%s' % x] * df['breadth_%s' % x] for x in [1, 2]],
    index=['brother 1', 'brother 2'])
df


Out[12]:
1 2 3 4 5 6 7 8 9 10 ... 16 17 18 19 20 21 22 23 24 25
brother 1 29605 29055 26788 27999 25344 32656 28350 31323 28576 28800 ... 22331 30225 28458 26245 24500 29568 24882 24464 32899 30970
brother 2 25955 30552 27565 28012 24282 29184 28310 28728 31323 28237 ... 20930 28914 25604 26572 22605 28120 26166 25168 31600 28050

2 rows × 25 columns


In [13]:
df = df.T
df.head()


Out[13]:
brother 1 brother 2
1 29605 25955
2 29055 30552
3 26788 27565
4 27999 28012
5 25344 24282

In [14]:
df = df / 1000
df.head()


Out[14]:
brother 1 brother 2
1 29.605 25.955
2 29.055 30.552
3 26.788 27.565
4 27.999 28.012
5 25.344 24.282

Boolean indexing


In [15]:
df[df['brother 1'] < df['brother 2']]


Out[15]:
brother 1 brother 2
2 29.055 30.552
3 26.788 27.565
4 27.999 28.012
9 28.576 31.323
13 26.100 28.120
14 30.210 30.615
15 28.388 29.546
19 26.245 26.572
22 24.882 26.166
23 24.464 25.168

In [16]:
df.loc[:,'brother 2 greater?'] = df['brother 1'] < df['brother 2']
df.head()


Out[16]:
brother 1 brother 2 brother 2 greater?
1 29.605 25.955 False
2 29.055 30.552 True
3 26.788 27.565 True
4 27.999 28.012 True
5 25.344 24.282 False

Sorting


In [17]:
df.sort_values(by='brother 1').head()


Out[17]:
brother 1 brother 2 brother 2 greater?
16 22.331 20.930 False
23 24.464 25.168 True
20 24.500 22.605 False
22 24.882 26.166 True
5 25.344 24.282 False

In [18]:
df.sort_index(ascending=False).head()


Out[18]:
brother 1 brother 2 brother 2 greater?
25 30.970 28.050 False
24 32.899 31.600 False
23 24.464 25.168 True
22 24.882 26.166 True
21 29.568 28.120 False

Iteration


In [19]:
for index, data_series in df.iterrows():
    if index in [1,2]:
        print(index)
        print(data_series)


1
brother 1             29.605
brother 2             25.955
brother 2 greater?     False
Name: 1, dtype: object
2
brother 1             29.055
brother 2             30.552
brother 2 greater?      True
Name: 2, dtype: object
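`iterrows` materializes a Series for every row, which is slow on large frames; `itertuples` returns lightweight namedtuples instead. A toy sketch (not the brothers data):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=[1, 2])

# Each row arrives as a namedtuple; .Index carries the row label
rows = [(row.Index, row.a + row.b) for row in toy.itertuples()]
```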

SQL-like groupby operations


In [20]:
df1 = pd.read_csv('mtcars.csv', index_col=0)
df1


Out[20]:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

In [21]:
grouped = df1.groupby('cyl')

In [22]:
for group_name, group in grouped:
    print(group_name)
    print(group)


4
                 mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Datsun 710      22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Merc 240D       24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
Merc 230        22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
Fiat 128        32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
Honda Civic     30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
Toyota Corona   21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
Fiat X1-9       27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
Porsche 914-2   26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
Lotus Europa    30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
Volvo 142E      21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2
6
                 mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4       21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag   21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Valiant         18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
Merc 280        19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
Merc 280C       17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
Ferrari Dino    19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
8
                      mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  \
Hornet Sportabout    18.7    8  360.0  175  3.15  3.440  17.02   0   0     3   
Duster 360           14.3    8  360.0  245  3.21  3.570  15.84   0   0     3   
Merc 450SE           16.4    8  275.8  180  3.07  4.070  17.40   0   0     3   
Merc 450SL           17.3    8  275.8  180  3.07  3.730  17.60   0   0     3   
Merc 450SLC          15.2    8  275.8  180  3.07  3.780  18.00   0   0     3   
Cadillac Fleetwood   10.4    8  472.0  205  2.93  5.250  17.98   0   0     3   
Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3   
Chrysler Imperial    14.7    8  440.0  230  3.23  5.345  17.42   0   0     3   
Dodge Challenger     15.5    8  318.0  150  2.76  3.520  16.87   0   0     3   
AMC Javelin          15.2    8  304.0  150  3.15  3.435  17.30   0   0     3   
Camaro Z28           13.3    8  350.0  245  3.73  3.840  15.41   0   0     3   
Pontiac Firebird     19.2    8  400.0  175  3.08  3.845  17.05   0   0     3   
Ford Pantera L       15.8    8  351.0  264  4.22  3.170  14.50   0   1     5   
Maserati Bora        15.0    8  301.0  335  3.54  3.570  14.60   0   1     5   

                     carb  
Hornet Sportabout       2  
Duster 360              4  
Merc 450SE              3  
Merc 450SL              3  
Merc 450SLC             3  
Cadillac Fleetwood      4  
Lincoln Continental     4  
Chrysler Imperial       4  
Dodge Challenger        2  
AMC Javelin             2  
Camaro Z28              4  
Pontiac Firebird        2  
Ford Pantera L          4  
Maserati Bora           8  
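Beyond iterating over groups, a GroupBy object can aggregate directly, much like SQL's `GROUP BY` with an aggregate function. A self-contained sketch with toy car data (not the full mtcars table):

```python
import pandas as pd

cars = pd.DataFrame({'cyl': [4, 4, 6, 6, 8],
                     'mpg': [30.0, 32.0, 20.0, 18.0, 15.0]})

# One value per group, indexed by the grouping key
means = cars.groupby('cyl')['mpg'].mean()

# Several aggregates at once
summary = cars.groupby('cyl')['mpg'].agg(['mean', 'min', 'max'])
```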

Plotting with matplotlib


In [23]:
%matplotlib inline

In [24]:
df1.plot(kind='scatter', x='cyl', y='hp')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bd3ab7b38>

Function application


In [25]:
df2 = pd.read_csv('data.csv')
df2


Out[25]:
ACCESSION AGE_Years SEX AA_Interp Alanine_P Arginine_P Aspartic_acid_P Citrulline_P Glutamine_P Glutamic_acid_P ... Ornithine_P Phenylalanine_P Proline_P Serine_P Taurine_P Threonine_P Tyrosine_P Valine_P Homocysteine_P Cysteine_P
0 9020101689 0.000000 F Normal 332 79 9 18 612 41 ... 74 62 167 177 127 317 57 161 0 53
1 9058109237 0.000000 F Normal 293 44 7 9 467 63 ... 49 46 133 138 138 300 74 152 0 26
2 9161128247 0.000000 M Normal 437 30 7 18 671 85 ... 57 56 231 164 135 205 56 116 0 32
3 9175120582 0.000000 M Normal 361 32 9 27 487 62 ... 100 52 118 136 92 228 55 123 0 37
4 9181544801 0.000000 F Normal 555 86 10 37 594 90 ... 72 47 256 140 91 204 92 263 0 36
5 9224120862 0.000000 M Normal 251 33 5 13 578 45 ... 43 45 205 168 216 232 101 160 0 29
6 9274552153 0.000000 M Normal 480 15 14 17 640 74 ... 61 45 257 159 248 183 61 113 0 43
7 9365109752 0.000000 F Normal 177 32 7 9 556 58 ... 33 67 207 216 305 92 52 107 0 23
8 10054500609 0.000000 M Normal 247 69 7 27 445 41 ... 39 43 81 123 38 99 56 191 0 20
9 10065108843 0.000000 F Normal 373 41 23 35 610 98 ... 59 69 263 179 128 187 81 189 0 41
10 10099121224 0.000000 F Normal 525 62 10 28 674 48 ... 101 60 202 214 121 151 86 123 0 61
11 10223134328 0.000000 F Normal 130 80 28 21 332 181 ... 145 80 193 122 205 115 37 165 0 32
12 10266107979 0.000000 F Normal 598 36 7 30 582 92 ... 112 49 275 139 166 189 106 116 0 29
13 10310300418 0.000000 M Normal 321 22 4 8 437 49 ... 28 48 305 80 90 96 62 69 0 21
14 10321125435 0.000000 M Normal 442 27 14 16 624 69 ... 82 48 178 204 193 204 53 137 0 45
15 10351127644 0.000000 F Normal 190 28 25 10 487 103 ... 100 58 128 173 127 130 86 132 0 32
16 10354112499 0.000000 M Normal 231 32 8 16 666 44 ... 48 53 382 228 96 162 58 99 0 17
17 11090118522 0.000000 F Normal 403 87 11 9 633 67 ... 49 97 216 151 283 261 112 122 0 28
18 11174117272 0.000000 F Normal 264 31 13 14 633 56 ... 83 53 135 236 164 168 58 102 0 34
19 11185105907 0.000000 M Normal 331 41 9 20 570 81 ... 53 51 138 107 299 209 70 119 0 40
20 11245124770 0.000000 M Normal 260 50 7 14 388 77 ... 39 51 105 129 228 232 53 117 0 23
21 11315126467 0.000000 F Normal 327 56 8 11 853 22 ... 58 97 204 184 55 159 98 184 0 47
22 11331104296 0.000000 F Normal 278 23 6 11 509 66 ... 36 49 142 121 118 149 54 78 0 39
23 12079124994 0.000000 M Normal 350 41 6 20 457 36 ... 95 49 248 112 203 160 80 118 0 59
24 12123122305 0.000000 M Normal 358 72 4 35 671 17 ... 30 47 197 121 67 93 59 161 0 39
25 12200131746 0.000000 M Normal 430 98 12 12 687 56 ... 110 88 258 258 102 239 67 211 0 39
26 12230124486 0.000000 M Normal 388 52 17 16 737 79 ... 106 89 201 226 230 238 93 124 0 41
27 12236125128 0.000000 M Normal 337 60 10 12 437 69 ... 54 80 180 123 316 289 120 170 0 25
28 12265105031 0.000000 M Normal 322 157 11 12 572 57 ... 106 62 199 169 51 116 133 174 0 27
29 13043544813 0.000000 F Normal 205 77 8 16 584 53 ... 54 52 161 115 46 105 54 220 0 31
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13843 13352109849 3.969883 F Normal 418 80 11 21 587 28 ... 42 48 246 123 67 90 72 183 0 29
13844 9315122955 3.972621 M Normal 280 74 7 31 533 44 ... 81 56 160 145 54 113 79 227 0 41
13845 10048106347 3.972621 M Normal 217 55 4 26 455 25 ... 45 45 142 113 48 70 62 168 0 24
13846 13348108612 3.972621 M Normal 239 94 14 34 604 33 ... 57 41 115 169 55 98 60 155 0 22
13847 12192131427 3.975359 M Normal 324 59 5 36 445 25 ... 70 64 207 125 69 119 98 204 0 22
13848 9117125753 3.978097 F Normal 267 65 11 49 484 81 ... 58 56 205 131 150 145 77 240 0 20
13849 12058131812 3.978097 M Normal 352 84 8 32 462 93 ... 89 62 366 154 132 157 122 279 0 31
13850 13294132313 3.978097 F Normal 282 75 11 33 578 16 ... 55 59 268 104 50 81 72 239 0 17
13851 10160101208 3.980835 M Normal 385 76 3 42 561 25 ... 49 41 156 121 55 129 49 174 0 34
13852 11193117260 3.980835 F Normal 438 40 8 17 562 41 ... 61 45 106 132 78 100 53 157 0 24
13853 9260113090 3.986310 F Normal 419 59 3 32 518 16 ... 52 80 207 125 38 93 73 234 0 28
13854 10134107903 3.986310 F Normal 386 44 4 17 460 22 ... 37 38 228 141 57 135 50 215 0 34
13855 10315131040 3.986310 F Normal 239 81 4 30 470 49 ... 71 57 177 151 52 131 75 282 0 46
13856 11361301446 3.986310 U Normal 256 44 10 44 352 192 ... 69 46 229 98 135 88 46 211 0 2
13857 11362300162 3.989048 U Normal 478 90 7 23 422 95 ... 53 55 203 140 79 108 55 150 0 1
13858 12004133840 3.989048 M Normal 260 79 4 31 500 25 ... 47 38 152 126 47 92 47 190 0 20
13859 11286115881 3.991786 M Normal 270 73 5 27 580 29 ... 30 47 80 102 64 76 37 155 0 36
13860 13091127554 3.991786 M Normal 126 40 6 14 327 10 ... 26 72 200 86 35 88 75 288 0 26
13861 13150104063 3.991786 M Normal 335 74 9 30 573 43 ... 67 62 178 130 44 101 81 203 0 30
13862 13162104789 3.991786 M Normal 227 52 6 19 561 48 ... 34 43 106 165 36 110 43 180 0 28
13863 13323113589 3.991786 M Normal 241 48 13 26 528 24 ... 44 58 380 105 47 96 77 288 0 24
13864 10211127309 3.994524 M Normal 200 29 5 37 474 39 ... 64 43 74 114 46 72 47 192 0 33
13865 12047121447 3.994524 F Normal 209 64 3 29 457 24 ... 41 47 152 84 36 107 48 201 0 22
13866 9095103307 3.997262 F Normal 193 39 8 4 438 57 ... 37 63 106 116 80 65 48 136 0 18
13867 10172115731 3.997262 F Normal 317 38 6 24 472 39 ... 44 59 271 119 33 68 53 173 0 21
13868 10263121559 3.997262 F Normal 280 69 4 31 564 20 ... 41 59 99 102 47 84 56 213 0 25
13869 12086300135 3.997262 U Normal 489 69 11 31 524 83 ... 67 56 318 142 129 106 71 165 0 10
13870 13166105842 3.997262 M Normal 238 56 7 37 512 28 ... 32 59 237 107 58 77 60 195 0 21
13871 13217102895 3.997262 M Normal 244 63 13 18 426 51 ... 34 50 163 83 43 72 44 201 0 25
13872 12031133330 4.000000 M Normal 479 49 8 28 406 33 ... 33 61 239 121 51 101 53 139 0 13

13873 rows × 27 columns


In [26]:
df2.columns


Out[26]:
Index(['ACCESSION', 'AGE_Years', 'SEX', 'AA_Interp', 'Alanine_P', 'Arginine_P',
       'Aspartic_acid_P', 'Citrulline_P', 'Glutamine_P', 'Glutamic_acid_P',
       'Glycine_P', 'Histidine_P', 'Hydroxyproline_P', 'Isoleucine_P',
       'Leucine_P', 'Lysine_P', 'Methionine_P', 'Ornithine_P',
       'Phenylalanine_P', 'Proline_P', 'Serine_P', 'Taurine_P', 'Threonine_P',
       'Tyrosine_P', 'Valine_P', 'Homocysteine_P', 'Cysteine_P'],
      dtype='object')

In [27]:
df3 = pd.read_csv('amino_acids.txt', header=None)
print(df3.head())


                                       0
0                Ala     A       Alanine
1               Arg     R       Arginine
2             Asn     N       Asparagine
3          Asp     D       Aspartic acid
4             Cit     C       Citrulline

In [28]:
import re
df3 = df3.loc[:,0].apply(lambda x: pd.Series(list(filter(None, re.compile('  +').split(x)))))
df3.head()


Out[28]:
0 1 2
0 Ala A Alanine
1 Arg R Arginine
2 Asn N Asparagine
3 Asp D Aspartic acid
4 Cit C Citrulline

In [29]:
df3.tail()


Out[29]:
0 1 2
22 Trp W Tryptophan
23 Tyr Y Tyrosine
24 Val V Valine
25 Xaa X Any amino acid
26 TERM termination codon NaN

In [30]:
df3 = df3.iloc[:-1,:]

In [31]:
df3.columns = ['Abbreviation', 'First letter', 'Amino acid']

In [32]:
df3.set_index('Amino acid', inplace=True)

In [33]:
df3.head()


Out[33]:
Abbreviation First letter
Amino acid
Alanine Ala A
Arginine Arg R
Asparagine Asn N
Aspartic acid Asp D
Citrulline Cit C

In [34]:
print(df3.head())


              Abbreviation First letter
Amino acid                             
Alanine                Ala            A
Arginine               Arg            R
Asparagine             Asn            N
Aspartic acid          Asp            D
Citrulline             Cit            C

In [35]:
def change_names(name):
    if '_P' in name:
        name = df3.loc[name[:-2].replace('_', ' '), 'Abbreviation']
    return name

df2.columns = [change_names(name) for name in df2.columns]

In [36]:
df2.set_index('ACCESSION', inplace=True)
df2.head()


Out[36]:
AGE_Years SEX AA_Interp Ala Arg Asp Cit Gln Glu Gly ... Orn Phe Pro Ser Tau Thr Tyr Val Hcy Cys
ACCESSION
9020101689 0 F Normal 332 79 9 18 612 41 347 ... 74 62 167 177 127 317 57 161 0 53
9058109237 0 F Normal 293 44 7 9 467 63 274 ... 49 46 133 138 138 300 74 152 0 26
9161128247 0 M Normal 437 30 7 18 671 85 302 ... 57 56 231 164 135 205 56 116 0 32
9175120582 0 M Normal 361 32 9 27 487 62 356 ... 100 52 118 136 92 228 55 123 0 37
9181544801 0 F Normal 555 86 10 37 594 90 201 ... 72 47 256 140 91 204 92 263 0 36

5 rows × 26 columns

DataFrame columns are Series


In [37]:
hcy_s = df2.loc[:,'Hcy']
hcy_s.head()


Out[37]:
ACCESSION
9020101689    0
9058109237    0
9161128247    0
9175120582    0
9181544801    0
Name: Hcy, dtype: int64

In [38]:
hcy_s.unique()


Out[38]:
array([0])

In [39]:
df2 = df2.drop('Hcy', axis=1)
df2.head()


Out[39]:
AGE_Years SEX AA_Interp Ala Arg Asp Cit Gln Glu Gly ... Met Orn Phe Pro Ser Tau Thr Tyr Val Cys
ACCESSION
9020101689 0 F Normal 332 79 9 18 612 41 347 ... 33 74 62 167 177 127 317 57 161 53
9058109237 0 F Normal 293 44 7 9 467 63 274 ... 21 49 46 133 138 138 300 74 152 26
9161128247 0 M Normal 437 30 7 18 671 85 302 ... 27 57 56 231 164 135 205 56 116 32
9175120582 0 M Normal 361 32 9 27 487 62 356 ... 26 100 52 118 136 92 228 55 123 37
9181544801 0 F Normal 555 86 10 37 594 90 201 ... 36 72 47 256 140 91 204 92 263 36

5 rows × 25 columns


In [40]:
df2.loc[:,'Ala'].plot(kind='hist')


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bd37940f0>

In [41]:
df2.describe()


Out[41]:
AGE_Years Ala Arg Asp Cit Gln Glu Gly His Hyp ... Met Orn Phe Pro Ser Tau Thr Tyr Val Cys
count 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 13873.000000 ... 13873.000000 13873.000000 13870.000000 13873.000000 13873.000000 13873.000000 13873.000000 13870.000000 13873.000000 13873.000000
mean 1.105445 326.839977 70.525337 10.814171 22.640957 554.153968 70.019174 243.505658 93.974915 28.016939 ... 26.217184 68.269228 52.744845 193.419952 141.426800 76.562676 136.559720 69.575198 194.570605 26.937865
std 1.115121 96.904859 25.643201 7.195245 9.284156 107.989384 43.717879 68.677978 24.330960 22.161957 ... 9.445003 29.802001 12.840800 66.666766 39.858176 38.963828 60.388284 23.746333 61.780048 9.311415
min 0.000000 92.000000 0.000000 0.000000 0.000000 30.000000 8.000000 76.000000 17.000000 0.000000 ... 6.000000 15.000000 8.000000 0.000000 47.000000 2.000000 30.000000 7.000000 29.000000 0.000000
25% 0.172484 254.000000 52.000000 6.000000 16.000000 480.000000 42.000000 194.000000 76.000000 13.000000 ... 19.000000 47.000000 43.000000 146.000000 116.000000 53.000000 95.000000 52.000000 150.000000 21.000000
50% 0.684462 316.000000 68.000000 9.000000 22.000000 545.000000 59.000000 233.000000 90.000000 25.000000 ... 24.000000 62.000000 51.000000 184.000000 134.000000 67.000000 122.000000 65.000000 186.000000 26.000000
75% 1.804243 391.000000 86.000000 13.000000 28.000000 618.000000 85.000000 282.000000 109.000000 41.000000 ... 32.000000 83.000000 61.000000 229.000000 157.000000 89.000000 163.000000 84.000000 230.000000 32.000000
max 4.000000 624.000000 190.000000 78.000000 68.000000 959.000000 563.000000 537.000000 271.000000 338.000000 ... 65.000000 285.000000 113.000000 751.000000 569.000000 588.000000 687.000000 151.000000 594.000000 94.000000

8 rows × 23 columns


In [42]:
age_s = df2.loc[:,'AGE_Years']
age_s.max()


Out[42]:
4.0

In [43]:
len(list(df2.loc[:,'AGE_Years'].unique()))


Out[43]:
1448

In [44]:
df2.loc[:,'AGE_Years'] = age_s.round()
grouped = df2.groupby('AGE_Years')
grouped.count()


Out[44]:
SEX AA_Interp Ala Arg Asp Cit Gln Glu Gly His ... Met Orn Phe Pro Ser Tau Thr Tyr Val Cys
AGE_Years
0 5922 5922 5922 5922 5922 5922 5922 5922 5922 5922 ... 5922 5922 5921 5922 5922 5922 5922 5920 5922 5922
1 3772 3772 3772 3772 3772 3772 3772 3772 3772 3772 ... 3772 3772 3772 3772 3772 3772 3772 3772 3772 3772
2 2027 2027 2027 2027 2027 2027 2027 2027 2027 2027 ... 2027 2027 2025 2027 2027 2027 2027 2027 2027 2027
3 1543 1543 1543 1543 1543 1543 1543 1543 1543 1543 ... 1543 1543 1543 1543 1543 1543 1543 1542 1543 1543
4 609 609 609 609 609 609 609 609 609 609 ... 609 609 609 609 609 609 609 609 609 609

5 rows × 24 columns


In [45]:
def standard_score(df):
    return (df.mean() - df2.mean()) / df2.std()

df4 = grouped.get_group(4).pipe(standard_score) - \
    grouped.get_group(0).pipe(standard_score)
df4.drop('AGE_Years').plot(kind='bar')


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bd33c7438>

In [46]:
df2.drop(['AGE_Years', 'SEX', 'AA_Interp'], axis=1, inplace=True)
df2.apply(lambda x: x.sum())


Out[46]:
Ala    4534251
Arg     978398
Asp     150025
Cit     314098
Gln    7687778
Glu     971376
Gly    3378154
His    1303714
Hyp     388679
Ile     808715
Leu    1541011
Lys    2165406
Met     363711
Orn     947099
Phe     731571
Pro    2683315
Ser    1962014
Tau    1062154
Thr    1894493
Tyr     965008
Val    2699278
Cys     373709
dtype: float64

In [47]:
df2.apply(lambda x: x.sum(), axis=1)


Out[47]:
ACCESSION
9020101689     3182
9058109237     2646
9161128247     2975
9175120582     2731
9181544801     3440
9224120862     2748
9274552153     3133
9365109752     2597
10054500609    2135
10065108843    3244
10099121224    3315
10223134328    2678
10266107979    3341
10310300418    2218
10321125435    3075
10351127644    2650
10354112499    2814
11090118522    3335
11174117272    2845
11185105907    2890
11245124770    2402
11315126467    3218
11331104296    2322
12079124994    2600
12123122305    2632
12200131746    3721
12230124486    3615
12236125128    3120
12265105031    3012
13043544813    2597
               ... 
13352109849    2742
9315122955     2702
10048106347    2033
13348108612    2500
12192131427    2523
9117125753     2791
12058131812    3226
13294132313    2523
10160101208    2533
11193117260    2467
9260113090     2622
10134107903    2644
10315131040    2618
11361301446    2382
11362300162    2541
12004133840    2274
11286115881    2183
13091127554    1974
13150104063    2675
13162104789    2313
13323113589    2661
10211127309    2113
12047121447    2101
9095103307     1997
10172115731    2352
10263121559    2360
12086300135    2965
13166105842    2378
13217102895    2050
12031133330    2455
dtype: float64

In [48]:
df2.applymap(lambda x: np.exp(x)).head()


Out[48]:
Ala Arg Asp Cit Gln Glu Gly His Hyp Ile ... Met Orn Phe Pro Ser Tau Thr Tyr Val Cys
ACCESSION
9020101689 1.533797e+144 2.038281e+34 8103.083928 6.565997e+07 6.140771e+265 6.398435e+17 5.014010e+150 1.084464e+46 9.744803e+09 4.727839e+18 ... 2.146436e+14 1.373383e+32 8.438357e+26 3.366499e+72 7.415207e+76 1.430208e+55 4.691922e+137 5.685720e+24 8.344716e+69 1.041376e+23
9058109237 1.771264e+127 1.285160e+19 1096.633158 8.103084e+03 6.539176e+202 2.293783e+27 9.924029e+118 3.637971e+42 3.269017e+06 4.311232e+15 ... 1.318816e+09 1.907347e+21 9.496119e+19 5.769871e+57 8.563248e+59 8.563248e+59 1.942426e+130 1.373383e+32 1.029820e+66 1.957296e+11
9161128247 6.119115e+189 1.068647e+13 1096.633158 6.565997e+07 2.579867e+291 8.223013e+36 1.435270e+131 6.663176e+40 2.146436e+14 1.784823e+08 ... 5.320482e+11 5.685720e+24 2.091659e+24 2.099062e+100 1.676081e+71 4.263390e+58 1.072430e+89 2.091659e+24 2.388691e+50 7.896296e+13
9175120582 6.029870e+156 7.896296e+13 8103.083928 5.320482e+11 3.172581e+211 8.438357e+26 4.062895e+154 1.506097e+35 3.931334e+12 9.744803e+09 ... 1.957296e+11 2.688117e+43 3.831008e+22 1.765017e+51 1.158910e+59 9.017628e+39 1.045062e+99 7.694785e+23 2.619517e+53 1.171914e+16
9181544801 1.080034e+241 2.235247e+37 22026.465795 1.171914e+16 9.352382e+257 1.220403e+39 1.964223e+87 1.935576e+54 4.311232e+15 1.220403e+39 ... 4.311232e+15 1.858672e+31 2.581313e+20 1.511428e+111 6.327432e+60 3.317400e+39 3.945248e+88 9.017628e+39 1.657482e+114 4.311232e+15

5 rows × 22 columns

Concatenation


In [49]:
data = pd.read_csv('HairEyeColor.csv', index_col=0)
data.head()


Out[49]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11

In [50]:
df5 = data.loc[:,['Eye', 'Sex', 'Freq']][2:6]
df6 = data.loc[:,['Hair', 'Eye']][4:8]
df5


Out[50]:
Eye Sex Freq
3 Brown Male 10
4 Brown Male 3
5 Blue Male 11
6 Blue Male 50

In [51]:
df6


Out[51]:
Hair Eye
5 Black Blue
6 Brown Blue
7 Red Blue
8 Blond Blue

In [52]:
df5 = pd.concat([df5,df6])
df5


Out[52]:
Eye Freq Hair Sex
3 Brown 10 NaN Male
4 Brown 3 NaN Male
5 Blue 11 NaN Male
6 Blue 50 NaN Male
5 Blue NaN Black NaN
6 Blue NaN Brown NaN
7 Blue NaN Red NaN
8 Blue NaN Blond NaN
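`concat` stacks rows by default; with `axis=1` it aligns on the index and places the frames side by side, again producing NaN where labels don't overlap. A toy sketch:

```python
import pandas as pd

left = pd.DataFrame({'Eye': ['Brown', 'Blue']}, index=[1, 2])
right = pd.DataFrame({'Hair': ['Black', 'Red']}, index=[2, 3])

# Outer-join on the index; labels 1 and 3 get NaN in the missing column
side = pd.concat([left, right], axis=1)
```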

Handling missing data


In [53]:
df5['Freq'].sum()


Out[53]:
74.0

In [54]:
df5.dropna(how='all', subset=['Freq', 'Sex'])


Out[54]:
Eye Freq Hair Sex
3 Brown 10 NaN Male
4 Brown 3 NaN Male
5 Blue 11 NaN Male
6 Blue 50 NaN Male
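The complement of dropping incomplete rows is filling them in; `fillna` accepts a scalar or a per-column dict. A sketch (column names echo the example above, but the data is made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Freq': [10.0, np.nan, 11.0],
                    'Hair': [np.nan, 'Black', np.nan]})

# A dict fills each column with its own default
filled = toy.fillna({'Freq': 0, 'Hair': 'unknown'})
```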

Database-style merging

  • Built for speed

In [55]:
database = pd.read_csv('database.csv')
df7 = database.loc[:,['first_name', 'last_name', 'city', 'state']][2:6].reset_index(drop=True)
df8 = database.loc[:,['first_name', 'last_name', 'zip']][4:8].reset_index(drop=True)
df7


Out[55]:
first_name last_name city state
0 Art Venere Bridgeport NJ
1 Lenna Paprocki Anchorage AK
2 Donette Foller Hamilton OH
3 Simona Morasca Ashland OH

In [56]:
df8


Out[56]:
first_name last_name zip
0 Donette Foller 45011
1 Simona Morasca 44805
2 Mitsue Tollner 60632
3 Leota Dilliard 95111

In [57]:
df7.merge(df8, how='outer', on=['first_name', 'last_name'])


Out[57]:
first_name last_name city state zip
0 Art Venere Bridgeport NJ NaN
1 Lenna Paprocki Anchorage AK NaN
2 Donette Foller Hamilton OH 45011
3 Simona Morasca Ashland OH 44805
4 Mitsue Tollner NaN NaN 60632
5 Leota Dilliard NaN NaN 95111

In [58]:
df7.merge(df8, how='left', on=['first_name', 'last_name'])


Out[58]:
first_name last_name city state zip
0 Art Venere Bridgeport NJ NaN
1 Lenna Paprocki Anchorage AK NaN
2 Donette Foller Hamilton OH 45011
3 Simona Morasca Ashland OH 44805

In [59]:
df7.merge(df8, how='right', on=['first_name', 'last_name'])


Out[59]:
first_name last_name city state zip
0 Donette Foller Hamilton OH 45011
1 Simona Morasca Ashland OH 44805
2 Mitsue Tollner NaN NaN 60632
3 Leota Dilliard NaN NaN 95111

In [60]:
df7.merge(df8, how='inner', on=['first_name', 'last_name'])


Out[60]:
first_name last_name city state zip
0 Donette Foller Hamilton OH 45011
1 Simona Morasca Ashland OH 44805

Pandas vs. R

R's c

df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]

In [61]:
s_df = pd.DataFrame(np.random.randn(4, 3), columns=list('abc'))
s_df[['a', 'c']]


Out[61]:
a c
0 -0.335175 0.092877
1 -0.122592 -0.336803
2 1.262740 -2.169056
3 0.638697 0.761669

Aggregation

df <- data.frame(
    v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
    v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
    by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
    by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN=mean)

In [62]:
s_df = pd.DataFrame({
    'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
    'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
    'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
    'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
           np.nan]
    })
s_df


Out[62]:
by1 by2 v1 v2
0 red wet 1 11
1 blue dry 3 33
2 1 99 5 55
3 2 95 7 77
4 NaN NaN 8 88
5 big damp 3 33
6 1 95 5 55
7 2 99 NaN NaN
8 red red 4 44
9 1 99 5 55
10 NaN NaN 7 77
11 12 NaN 9 99

In [63]:
s_df.groupby(['by1','by2']).mean()


Out[63]:
v1 v2
by1 by2
1 95 5 55
99 5 55
2 95 7 77
99 NaN NaN
big damp 3 33
blue dry 3 33
red red 4 44
wet 1 11

R's %in%

s <- 0:4
s %in% c(2,4)

In [64]:
s = pd.Series(np.arange(5), dtype=np.float32)
s.isin([2, 4])


Out[64]:
0    False
1    False
2     True
3    False
4     True
dtype: bool

R's match

s <- 0:4
match(s, c(2,4))

In [65]:
pd.Series(pd.match(s, [2,4], np.nan))


Out[65]:
0   NaN
1   NaN
2     0
3   NaN
4     1
dtype: float64
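`pd.match` has since been removed from pandas. An equivalent under current APIs uses `Index.get_indexer`, which returns -1 for misses; the NaN substitution step below is my own workaround to reproduce R's `nomatch=NA` behavior, not an official recipe:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), dtype=np.float32)

# Position of each element of s within [2, 4]; -1 marks "not found"
pos = pd.Index([2, 4]).get_indexer(s)

# Swap the -1 sentinel for NaN to mimic match(s, c(2,4))
matched = pd.Series(np.where(pos == -1, np.nan, pos))
```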

R's tapply

baseball <- data.frame(
    team = gl(5, 5, labels = paste("Team", LETTERS[1:5])),
    player = sample(letters, 25),
    batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team, max)

In [66]:
import random, string
baseball = pd.DataFrame({
    'team': ["team %d" % (x+1) for x in range(5)]*5,
    'player': random.sample(list(string.ascii_lowercase),25),
    'batting avg': np.random.uniform(.200, .400, 25)
    })
baseball.head()


Out[66]:
batting avg player team
0 0.328316 p team 1
1 0.218720 k team 2
2 0.371300 b team 3
3 0.289187 v team 4
4 0.242803 w team 5

In [67]:
baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)


Out[67]:
team
team 1    0.356478
team 2    0.399268
team 3    0.385549
team 4    0.300849
team 5    0.389789
Name: batting avg, dtype: float64

R's subset

df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]

In [68]:
s_df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
#s_df.query('a <= b')
s_df[s_df.a <= s_df.b]


Out[68]:
a b
1 -1.193466 0.703511
4 -1.208578 1.187422
6 -0.741790 -0.125572
9 -3.327805 0.599588

R's with

df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)

In [69]:
#s_df.eval('a + b')
s_df.a + s_df.b


Out[69]:
0    1.576161
1   -0.489955
2    1.321139
3   -0.435031
4   -0.021156
5   -1.244591
6   -0.867362
7    0.622398
8    1.777159
9   -2.728216
dtype: float64

rpy2: R data frame support


In [70]:
from rpy2.robjects import r, pandas2ri
df_iris = pandas2ri.ri2py(r['iris'])
df_iris.head()


Out[70]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Final comparison

  • Pandas + Python
    • General-purpose programming language
    • Strong object-oriented capabilities
    • SciPy - library of scientific computing routines
    • scikit-learn - machine learning toolkit
  • R
    • Very large library of statistical functions
    • Large statistics-oriented support base
    • Older, with a more mature statistical ecosystem