In [1]:
%pylab inline
Let's say hello to pandas. The convention is to import it as pd.
In [2]:
import pandas as pd
A Series is a one-dimensional, array-like object. The easiest way to create one is to initialize it with a list.
In [3]:
cities = ['London', 'New York', 'Berlin', 'Toronto']
In [4]:
s = pd.Series(cities)
s
Out[4]:
We see that we have the values from the list, but also an index, in this case going from 0 to 3. That is the default index.
In [5]:
s.index
Out[5]:
In [6]:
s.values
Out[6]:
We can reference the values by using the appropriate index. In this case there is not much difference from a NumPy array.
In [7]:
s[1]
Out[7]:
You can assign your own index, which is what gives the structure its power. Later it will be clear why.
In [8]:
s = pd.Series(cities, index = ['A', 'B', 'C', 'D'])
s
Out[8]:
In this case we have used the labels A to D, and we can now reference the data in the Series by label.
In [9]:
s['D']
Out[9]:
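As a small sketch of the two access styles (the variable name `labelled` is only for illustration): `.loc` selects by label and `.iloc` by position, which avoids ambiguity once the labels themselves are integers.

```python
import pandas as pd

cities = ['London', 'New York', 'Berlin', 'Toronto']
labelled = pd.Series(cities, index=['A', 'B', 'C', 'D'])

# .loc looks up by label, .iloc by integer position
by_label = labelled.loc['D']
by_position = labelled.iloc[3]

# Label slices include both endpoints, unlike positional slices
middle = labelled.loc['B':'C']
print(by_label, by_position, list(middle))
```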
Let us look at a more meaningful series: the population of Bogotá (the capital of Colombia) over time, with data taken from Wikipedia. Here it makes sense for the label to be the year and the value to be the corresponding population, so that we can easily retrieve by year. Previously we used a list, but we can also initialize a Series object from a dictionary. The keys become the index, and the values become the population in each year:
In [41]:
population_bogota = {1800:21964,
1912:121257,
1951:715250,
1964:1697311,
1973:2855065,
1985:4236490,
1999:6276428,
2012:7571345}
In [42]:
series_bogota = pd.Series(population_bogota)
series_bogota
Out[42]:
When you are working with several series, it may be useful to attach some metadata to each one, such as the name of the series itself and the name of its index.
In [43]:
series_bogota.name = 'Bogota population'
series_bogota.index.name = 'year'
In [44]:
series_bogota
Out[44]:
In [45]:
series_bogota.index
Out[45]:
In [46]:
series_bogota.values
Out[46]:
We can, of course, reference specific values by using the index.
In [47]:
series_bogota[1800]
Out[47]:
Note that the values of the series are stored in a NumPy array. This is part of what makes pandas fast, and it can be very useful when working with other libraries.
In [48]:
type(series_bogota.values)
Out[48]:
The index, on the other hand, is a pandas object. There is a hierarchy of Index classes that includes specific types for time indexes, hierarchical indexes and other kinds of index.
In [49]:
type(series_bogota.index)
Out[49]:
We can query information based on values. When did the population of Bogotá go above 1,000,000?
In [50]:
series_bogota[series_bogota > 1000000]
Out[50]:
Now suppose we need the population from the 1960s on.
In [51]:
series_bogota[series_bogota.index > 1960]
Out[51]:
What is actually going on with this way of querying information? Let us see what the bit inside the brackets yields.
In [52]:
series_bogota.index > 1960
Out[52]:
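A minimal sketch of what happens step by step, using a shortened copy of the data (the names `population`, `mask`, `recent` and `window` are illustrative only):

```python
import pandas as pd

population = pd.Series({1951: 715250, 1964: 1697311, 1973: 2855065})

# Comparing the index with a scalar gives one boolean per row
mask = population.index > 1960
print(mask)

# Indexing with the mask keeps only the rows where it is True
recent = population[mask]

# Conditions combine with & (and) and | (or); each needs its own parentheses
window = population[(population < 2000000) & (population.index > 1960)]
print(window)
```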
This means that we can also query using an array of booleans, which is handy if we have complex programmatic conditions. The array needs one entry per element of the series.
In [53]:
series_bogota[[True, False, True, False, True, False, True, False]]
Out[53]:
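You can also query with a list of index labels. Recent pandas versions raise a KeyError if a list passed inside the brackets contains a label that is missing from the index, so reindex is used here to allow the missing label 2011.
In [54]:
series_bogota.reindex([1973, 2012, 2011])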
Out[54]:
Note that we get NaN for 2011 because this label is not in the original index. This is part of the automatic handling of missing values, and it will turn out to be extremely valuable when working with real data.
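A short sketch of how missing labels can be detected and dropped afterwards, assuming a two-entry copy of the data (the names `subset`, `missing_years` and `complete` are illustrative):

```python
import pandas as pd

population = pd.Series({1973: 2855065, 2012: 7571345})

# reindex returns a new Series; labels missing from the original become NaN
subset = population.reindex([1973, 2012, 2011])
print(subset)

# isna() flags the missing entries, dropna() discards them
missing_years = list(subset[subset.isna()].index)
complete = subset.dropna()
```

Because NaN is a floating-point value, the reindexed series is stored as floats even though the original held integers.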
Let's add a fictitious value for 2011.
In [55]:
series_bogota.loc[2011] = 6500000  # assigning to a new label appends it to the series
series_bogota
Out[55]:
You can apply a function to each element of the series. Remember, this is close to NumPy.
In [56]:
millions = lambda x: x/1000000.0
series_bogota.apply(millions)
Out[56]:
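For simple arithmetic like this, plain operators on the whole series give the same result without calling a Python function per element. A minimal comparison, with illustrative variable names:

```python
import pandas as pd

population = pd.Series({1999: 6276428, 2012: 7571345})

# Element-wise function call
via_apply = population.apply(lambda x: x / 1000000.0)

# Equivalent vectorized arithmetic on the whole series at once
via_division = population / 1000000.0

print(via_apply)
print(via_division)
```

`apply` remains useful when the per-element logic cannot be written as array arithmetic.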
Finally, sometimes we need a quick idea of what the data looks like. For this we can use the describe method.
In [57]:
series_bogota.describe()
Out[57]:
Or, even cooler, plots:
In [58]:
series_bogota.plot(kind='bar')
Out[58]:
In [59]:
series_bogota = series_bogota.sort_values()  # sort by population; here this is also chronological order
(series_bogota/1000000.0).plot(kind='bar')
Out[59]:
What about population change?
In [60]:
series_bogota.pct_change()
Out[60]:
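To make explicit what `pct_change` computes, here is a sketch on made-up numbers (not real population figures) that reproduces it by hand with `shift`:

```python
import pandas as pd

# Illustrative values only
population = pd.Series({1985: 100.0, 1999: 150.0, 2012: 120.0})

# pct_change is (current / previous) - 1, expressed as a fraction
manual = population / population.shift(1) - 1
builtin = population.pct_change()

# The first entry has no predecessor, so it is NaN in both
print(builtin)
```

The change is taken between consecutive entries, regardless of how many years separate them, which is one reason the irregular spacing of the raw data matters.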
In [61]:
series_bogota.pct_change().plot(kind='bar')
plt.ylabel('percentage change')
Out[61]:
Note that there is a lot of missing data. Is there a way to solve this quickly?
Reindex + Interpolation
In [62]:
series_bogota = series_bogota.reindex(range(1800, 2014))
In [63]:
series_bogota
Out[63]:
In [64]:
series_bogota = series_bogota.interpolate('values')
series_bogota
Out[64]:
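The same two steps on a tiny made-up series show the mechanics more clearly (the names `sparse`, `yearly` and `filled` are illustrative):

```python
import pandas as pd

# Illustrative values only
sparse = pd.Series({2000: 10.0, 2004: 30.0})

# Reindexing onto every year introduces NaN for the years without data
yearly = sparse.reindex(range(2000, 2005))

# 'values' interpolates linearly, using the numeric index as the x-axis
filled = yearly.interpolate(method='values')
print(filled)
```

Linear interpolation is only one possible assumption about how the population evolved between known data points.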
In [65]:
(series_bogota/1000000.0).plot()
Out[65]:
In [66]:
(series_bogota).pct_change().plot()
Out[66]:
Finally, let us look at adding series.
In [67]:
population_cali = {1809: 7546, 1938:101883, 1973:991549, 1985:1429026, 2013:2319684}
In [68]:
series_cali = pd.Series(population_cali)
series_cali.name = 'Cali (Colombia) Population'
series_cali.index.name = 'year'
series_cali
Out[68]:
In [69]:
(series_cali + series_bogota)
Out[69]:
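Addition aligns the two series on their labels, which is where the NaN entries come from. A minimal sketch with made-up values:

```python
import pandas as pd

# Illustrative values only
a = pd.Series({1973: 1.0, 1985: 2.0})
b = pd.Series({1985: 10.0, 2013: 20.0})

# A label present in only one operand gives NaN in the sum
total = a + b

# Series.add with fill_value substitutes the given value for the missing side
total_filled = a.add(b, fill_value=0)
print(total)
print(total_filled)
```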
Adding the series is most meaningful when they share an index, so we can reindex Cali with the index of Bogotá.
In [70]:
series_cali
Out[70]:
In [71]:
series_cali = series_cali.reindex(series_bogota.index)
series_cali
Out[71]:
In [72]:
len(series_bogota) == len(series_cali)
Out[72]:
In [73]:
np.all(series_bogota.index == series_cali.index)
Out[73]:
In [74]:
series_cali = series_cali.interpolate('values')
series_cali
Out[74]:
In [75]:
series_cali = series_cali.fillna(0.0)
series_cali
Out[75]:
Now, finally, add the two series.
In [76]:
series_bogota + series_cali
Out[76]:
In [77]:
series_bogota.plot(label='Bogota population')
series_cali.plot(label='Cali population')
plt.legend(loc='best')
Out[77]:
I do not vouch for the statistics here; this is just an example of the tool.
In [78]:
df = pd.DataFrame({'bogotá':series_bogota, 'cali':series_cali})
In [79]:
df
Out[79]:
In [80]:
df.head()
Out[80]:
In [81]:
pd.options.display.float_format = '{:20,.2f}'.format
df.index.name = 'year'
In [82]:
df.tail()
Out[82]:
Elements of a DataFrame
In [83]:
df.index
Out[83]:
In [84]:
df.columns
Out[84]:
In [85]:
np.shape(df.values)
Out[85]:
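A compact sketch of how these pieces fit together, built from two short series (the name `frame` is illustrative):

```python
import pandas as pd

bogota = pd.Series({1973: 2855065, 1985: 4236490})
cali = pd.Series({1973: 991549, 1985: 1429026})

# Each Series becomes a column; the shared index becomes the row labels
frame = pd.DataFrame({'bogotá': bogota, 'cali': cali})

print(frame.index.tolist())     # row labels
print(frame.columns.tolist())   # column labels
print(frame.values.shape)       # shape of the underlying 2-D NumPy array

# Selecting one column returns a Series again
column = frame['cali']
```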
In [86]:
df['population difference'] = df['bogotá'] - df['cali']
In [88]:
df.tail()
Out[88]:
In [87]:
df.describe()
Out[87]:
In [89]:
df['population difference'].tail()
Out[89]:
In [90]:
df[df.index > 1990]
Out[90]:
In [91]:
df.plot()
Out[91]:
In [92]:
!ls *.csv
In [93]:
df.to_csv('cali_and_bogota.csv')
In [94]:
!head cali_and_bogota.csv
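The round trip can also be sketched without touching the file system, assuming an in-memory buffer from the standard library's `io` module stands in for the file:

```python
import io
import pandas as pd

frame = pd.DataFrame({'cali': [991549.0, 1429026.0]},
                     index=pd.Index([1973, 1985], name='year'))

# With no path argument, to_csv returns the CSV text as a string
csv_text = frame.to_csv()
print(csv_text)

# Reading it back; index_col restores 'year' as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col='year')
```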
There are many ways to initialize DataFrames
In [95]:
!cat metropolitan.csv
In [96]:
df = pd.read_csv('metropolitan.csv')
df
Out[96]:
In [97]:
df = pd.read_csv('metropolitan.csv', index_col='Metropolitan area')
df
Out[97]:
In [98]:
df.describe()
Out[98]:
In [99]:
df.dtypes
Out[99]:
In [100]:
df['Population'] = df['Population'].astype(float)
df['Area'] = df['Area'].astype(float)
In [101]:
df.dtypes
Out[101]:
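When a column arrives as text, there are two common conversion routes. The rows below are hypothetical and only stand in for a file whose columns were read as strings:

```python
import pandas as pd

# Hypothetical rows, not taken from metropolitan.csv
raw = pd.DataFrame({'Population': ['1000', '2500'], 'Area': ['10', 'n/a']})

# astype converts a whole column at once when every value is well formed
raw['Population'] = raw['Population'].astype(float)

# to_numeric with errors='coerce' turns unparseable entries into NaN
raw['Area'] = pd.to_numeric(raw['Area'], errors='coerce')
print(raw.dtypes)
```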
In [102]:
df.sort_values('Population').head(3)
Out[102]:
Areas larger than 10,000 km²
In [103]:
df[df['Area'] > 10000]
Out[103]:
In [104]:
df[(df['Area'] > 10000) & (df['Population'] > 20000)]
Out[104]:
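A sketch of the same kind of combined filter on hypothetical figures, together with the equivalent `DataFrame.query` form:

```python
import pandas as pd

# Hypothetical figures, only to illustrate the filtering syntax
areas = pd.DataFrame({'Area': [5000.0, 12000.0, 15000.0],
                      'Population': [30000.0, 10000.0, 25000.0]},
                     index=['X', 'Y', 'Z'])

# Each comparison needs its own parentheses because & binds tighter than >
both = areas[(areas['Area'] > 10000) & (areas['Population'] > 20000)]

# DataFrame.query expresses the same filter as a string
same = areas.query('Area > 10000 and Population > 20000')
print(both.index.tolist())
```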
In [105]:
df['Density'] = df['Population']/df['Area']
df.sort_values('Density', ascending=False)['Density'].plot(kind='bar')
Out[105]:
In [106]:
df.groupby('Country').head()
Out[106]:
In [107]:
type(df.groupby('Country'))
Out[107]:
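The groupby object on its own holds no results; it only records how the rows are split. A minimal sketch on hypothetical rows, whose column names mirror the ones used here:

```python
import pandas as pd

# Hypothetical rows, not taken from metropolitan.csv
metro = pd.DataFrame({'Country': ['A', 'A', 'B'],
                      'Population': [10.0, 20.0, 5.0],
                      'Area': [2.0, 3.0, 1.0]})

# groupby only records the grouping; an aggregation produces the result
grouped = metro.groupby('Country')
totals = grouped[['Population', 'Area']].sum()

# Density of each country as a whole: total population over total area
totals['Density'] = totals['Population'] / totals['Area']
print(totals)
```

Computing the density from the summed columns is a design choice in this sketch: adding up per-row densities would not give the density of the group.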
In [108]:
print(df.groupby('Country').sum(numeric_only=True).sort_values('Density', ascending=False).head())
All the data in this notebook was taken from Wikipedia.