Data Manipulation with Numpy and Pandas

Handling with large data is easy in Python. In the simplest way using arrays. However, they are pretty slow. Numpy and Panda are two great libraries for dealing with datasets. Numpy isused for homogenous n-dimensional data (matrices). Pandas is used for heterogenous tables (CSV, MS Excel tables). Pandas is internally based on Numpy, too. See http://scipy-lectures.github.io/ for a more detailed lesson.


In [16]:
import numpy as np

In [17]:
# Generating a random array
X = np.random.random((3, 5))  # a 3 x 5 array

print(X)


[[ 0.77347902  0.45273892  0.15401666  0.4357073   0.74741831]
 [ 0.19892103  0.8583207   0.89480927  0.5322447   0.94306179]
 [ 0.48248829  0.10666105  0.30380058  0.74718615  0.25991219]]

Accessing elements


In [18]:
# get a single element
X[0, 0]


Out[18]:
0.77347901713819489

In [19]:
# get a row
X[1]


Out[19]:
array([ 0.19892103,  0.8583207 ,  0.89480927,  0.5322447 ,  0.94306179])

In [20]:
# get a column
X[:, 1]


Out[20]:
array([ 0.45273892,  0.8583207 ,  0.10666105])

In [21]:
# Transposing an array
X.T


Out[21]:
array([[ 0.77347902,  0.19892103,  0.48248829],
       [ 0.45273892,  0.8583207 ,  0.10666105],
       [ 0.15401666,  0.89480927,  0.30380058],
       [ 0.4357073 ,  0.5322447 ,  0.74718615],
       [ 0.74741831,  0.94306179,  0.25991219]])

In [22]:
print(X.shape)
print(X.reshape(5, 3)) #change the layout of the matrix


(3, 5)
[[ 0.77347902  0.45273892  0.15401666]
 [ 0.4357073   0.74741831  0.19892103]
 [ 0.8583207   0.89480927  0.5322447 ]
 [ 0.94306179  0.48248829  0.10666105]
 [ 0.30380058  0.74718615  0.25991219]]

In [23]:
# indexing by an array of integers (fancy indexing)
indices = np.array([3, 1, 0])
print(indices)
X[:, indices]


[3 1 0]
Out[23]:
array([[ 0.4357073 ,  0.45273892,  0.77347902],
       [ 0.5322447 ,  0.8583207 ,  0.19892103],
       [ 0.74718615,  0.10666105,  0.48248829]])

Operations along an axis


In [24]:
X


Out[24]:
array([[ 0.77347902,  0.45273892,  0.15401666,  0.4357073 ,  0.74741831],
       [ 0.19892103,  0.8583207 ,  0.89480927,  0.5322447 ,  0.94306179],
       [ 0.48248829,  0.10666105,  0.30380058,  0.74718615,  0.25991219]])

In [25]:
X.shape


Out[25]:
(3, 5)

In [26]:
np.sum(X, axis=1) # 1...columns


Out[26]:
array([ 2.56336021,  3.42735748,  1.90004825])

In [27]:
np.max(X, axis=0) # 0...rows


Out[27]:
array([ 0.77347902,  0.8583207 ,  0.89480927,  0.74718615,  0.94306179])

A quick-ish introduction to Pandas

based on http://pandas.pydata.org/pandas-docs/stable/10min.html


In [28]:
import numpy as np
import pandas as pd

In [29]:
#use a standard dataset of heterogenous data
cars = pd.read_csv('data/mtcars.csv')
cars.head()


Out[29]:
car mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

In [30]:
#list all columns
cars.columns


Out[30]:
Index(['car', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

In [31]:
#we want to use the car as the "primary key" of a row
cars.index = cars.pop('car')
cars.head()


Out[31]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

In [32]:
#describe our dataset
cars.describe()


Out[32]:
mpg cyl disp hp drat wt qsec vs am gear carb
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000 32.0000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750 0.437500 0.406250 3.687500 2.8125
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457 1.786943 0.504016 0.498991 0.737804 1.6152
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000 14.500000 0.000000 0.000000 3.000000 1.0000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250 16.892500 0.000000 0.000000 3.000000 2.0000
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000 17.710000 0.000000 0.000000 4.000000 2.0000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000 18.900000 1.000000 1.000000 4.000000 4.0000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000 22.900000 1.000000 1.000000 5.000000 8.0000

In [33]:
cars.sort_index(inplace=True)
cars.head()


Out[33]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
AMC Javelin 15.2 8 304 150 3.15 3.435 17.30 0 0 3 2
Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

In [34]:
cars.sort_values('mpg').head(15)


Out[34]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

In [35]:
cars.sort_values('hp', ascending=False).head()


Out[35]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4

Selection

Note While many of the NumPy access methods work on DataFrames, use the pandas-specific data access methods, .at, .iat, .loc, .iloc and .ix.

See the Indexing section and below.


In [36]:
#single column
cars['mpg']
#depending on the name also cars.mpg works


Out[36]:
car
AMC Javelin            15.2
Cadillac Fleetwood     10.4
Camaro Z28             13.3
Chrysler Imperial      14.7
Datsun 710             22.8
Dodge Challenger       15.5
Duster 360             14.3
Ferrari Dino           19.7
Fiat 128               32.4
Fiat X1-9              27.3
Ford Pantera L         15.8
Honda Civic            30.4
Hornet 4 Drive         21.4
Hornet Sportabout      18.7
Lincoln Continental    10.4
Lotus Europa           30.4
Maserati Bora          15.0
Mazda RX4              21.0
Mazda RX4 Wag          21.0
Merc 230               22.8
Merc 240D              24.4
Merc 280               19.2
Merc 280C              17.8
Merc 450SE             16.4
Merc 450SL             17.3
Merc 450SLC            15.2
Pontiac Firebird       19.2
Porsche 914-2          26.0
Toyota Corolla         33.9
Toyota Corona          21.5
Valiant                18.1
Volvo 142E             21.4
Name: mpg, dtype: float64

In [37]:
#or a slice of rows
cars[2:5]


Out[37]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1

In [38]:
#by label = primary key
cars.loc['Fiat 128':'Lotus Europa']


Out[38]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

In [39]:
#selection by position
cars.iloc[3]


Out[39]:
mpg      14.700
cyl       8.000
disp    440.000
hp      230.000
drat      3.230
wt        5.345
qsec     17.420
vs        0.000
am        0.000
gear      3.000
carb      4.000
Name: Chrysler Imperial, dtype: float64

In [40]:
cars.iloc[3:5, 0:2]


Out[40]:
mpg cyl
car
Chrysler Imperial 14.7 8
Datsun 710 22.8 4

In [41]:
cars[cars.cyl > 6] # more than 6 cylinders


Out[41]:
mpg cyl disp hp drat wt qsec vs am gear carb
car
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

Missing Data


In [42]:
cars_na = pd.read_csv('data/mtcars_with_nas.csv')

In [43]:
cars_na.isnull().head(4)


Out[43]:
car mpg cyl disp hp drat wt qsec vs am gear carb
0 False False False False False False False False False False False False
1 False False False True False False False False False False False False
2 False False False False False False False False False False False False
3 False False False False False False False False False False False False

In [44]:
#fill with a default value
cars_na.fillna(0).head(4)


Out[44]:
car mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

In [45]:
#or drop the rows

print(cars_na.shape)
#drop rows with na values
print(cars_na.dropna().shape)
#drop columns with na values
print(cars_na.dropna(axis=1).shape)

#see also http://pandas.pydata.org/pandas-docs/stable/missing_data.html


(32, 12)
(29, 12)
(32, 10)

Statistics

besides .describe() there are plenty of other staticial measures and aggregation methods in Pandas/Numpy


In [46]:
#stats
cars.mean()


Out[46]:
mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596562
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64

In [47]:
cars.mean(axis=1)


Out[47]:
car
AMC Javelin            46.007727
Cadillac Fleetwood     66.232727
Camaro Z28             58.752727
Chrysler Imperial      65.972273
Datsun 710             23.598182
Dodge Challenger       47.240909
Duster 360             59.720000
Ferrari Dino           34.508182
Fiat 128               19.440909
Fiat X1-9              18.928636
Ford Pantera L         60.971818
Honda Civic            17.742273
Hornet 4 Drive         38.739545
Hornet Sportabout      53.664545
Lincoln Continental    66.058545
Lotus Europa           24.880273
Maserati Bora          63.155455
Mazda RX4              29.907273
Mazda RX4 Wag          29.981364
Merc 230               27.233636
Merc 240D              24.634545
Merc 280               31.860000
Merc 280C              31.787273
Merc 450SE             46.430909
Merc 450SL             46.500000
Merc 450SLC            46.350000
Pontiac Firebird       57.379545
Porsche 914-2          24.779091
Toyota Corolla         18.814091
Toyota Corona          24.888636
Valiant                35.049091
Volvo 142E             26.262727
dtype: float64

In [48]:
#grouping
cars.groupby('cyl').mean()


Out[48]:
mpg disp hp drat wt qsec vs am gear carb
cyl
4 26.663636 105.136364 82.636364 4.070909 2.285727 19.137273 0.909091 0.727273 4.090909 1.545455
6 19.742857 183.314286 122.285714 3.585714 3.117143 17.977143 0.571429 0.428571 3.857143 3.428571
8 15.100000 353.100000 209.214286 3.229286 3.999214 16.772143 0.000000 0.142857 3.285714 3.500000

In [49]:
#grouping different aggregation methods
cars.groupby('cyl').agg({ 'mpg': 'mean', 'qsec': 'min'})


Out[49]:
qsec mpg
cyl
4 16.7 26.663636
6 15.5 19.742857
8 14.5 15.100000

TASKS

Now there are a series of simple tasks


In [50]:
#loading gapminder data (taken from https://github.com/jennybc/gapminder)
# file located at 'data/gapminder-unfiltered.tsv' it uses tabular character as separator
# use the first column as index
gap = pd.read_csv('data/gapminder-unfiltered.tsv',index_col=0, sep='\t')

In [51]:
#what are the columns of this dataset?
gap.head()


Out[51]:
continent year lifeExp pop gdpPercap
country
Afghanistan Asia 1952 28.801 8425333 779.445314
Afghanistan Asia 1957 30.332 9240934 820.853030
Afghanistan Asia 1962 31.997 10267083 853.100710
Afghanistan Asia 1967 34.020 11537966 836.197138
Afghanistan Asia 1972 36.088 13079460 739.981106

In [52]:
#what is the maximal year contained?
gap['year'].max()


Out[52]:
2007

In [53]:
#just select all data of the year 2007
gap2007 = gap[gap.year == 2007]

In [54]:
#locate Austria and print it
gap2007.loc['Austria']


Out[54]:
continent         Europe
year                2007
lifeExp           79.829
pop          8.19978e+06
gdpPercap        36126.5
Name: Austria, dtype: object

In [55]:
#list the top 10 countries by life expectancy (lifeExp)
gap2007.sort_values('lifeExp',ascending=False).head(10)


Out[55]:
continent year lifeExp pop gdpPercap
country
Japan Asia 2007 82.603 127467972 31656.06806
Hong Kong, China Asia 2007 82.208 6980412 39724.97867
Iceland Europe 2007 81.757 301931 36180.78919
Switzerland Europe 2007 81.701 7554661 37506.41907
Australia Oceania 2007 81.235 20434176 34435.36744
Spain Europe 2007 80.941 40448191 28821.06370
Sweden Europe 2007 80.884 9031088 33859.74835
Israel Asia 2007 80.745 6426679 25523.27710
Macao, China Asia 2007 80.718 456989 54589.82024
France Europe 2007 80.657 61083916 30470.01670

In [56]:
#what is the total population (pop) per continent
gap2007.groupby('continent').agg({ 'pop': 'sum'})


Out[56]:
pop
continent
Africa 929963305
Americas 901346796
Asia 3840766818
Europe 593085338
FSU 255612291
Oceania 32945296