Data

Resources



In [1]:

    
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:

    
plt.style.use('ggplot')

Working with Series



In [3]:

    
x = Series(range(5,10))



In [4]:

    
x









    Out[4]:





0    5
1    6
2    7
3    8
4    9
dtype: int64

We cna treat Series objects much like numpy vectors



In [5]:

    
x.sum(), x.mean(), x.std()









    Out[5]:





(35, 7.0, 1.5811388300841898)



In [6]:

    
x**2









    Out[6]:





0    25
1    36
2    49
3    64
4    81
dtype: int64



In [7]:

    
x[x >= 8]









    Out[7]:





3    8
4    9
dtype: int64

Series can also contain more information than numpy vectors

You can always use standard positional indexing



In [8]:

    
x[1:4]









    Out[8]:





1    6
2    7
3    8
dtype: int64

Series index

But you can also assign labeled indexes.



In [9]:

    
x.index = list('abcde')
x









    Out[9]:





a    5
b    6
c    7
d    8
e    9
dtype: int64

Note that with labels, the end index is included



In [10]:

    
x['a':'c']









    Out[10]:





a    5
b    6
c    7
dtype: int64

Even when you have a labeled index, positional arguments still work



In [11]:

    
x[1:4]









    Out[11]:





b    6
c    7
d    8
dtype: int64



In [12]:

    
x.a, x.c, x.e









    Out[12]:





(5, 7, 9)

Working with missing data

Missing data is indicated with NaN (not a number).



In [13]:

    
y = Series([10, np.nan, np.nan, 13, 14])
y









    Out[13]:





0    10.0
1     NaN
2     NaN
3    13.0
4    14.0
dtype: float64

Concatenating two series



In [14]:

    
z = pd.concat([x, y])
z









    Out[14]:





a     5.0
b     6.0
c     7.0
d     8.0
e     9.0
0    10.0
1     NaN
2     NaN
3    13.0
4    14.0
dtype: float64

Reset index to default



In [15]:

    
z = z.reset_index(drop=True)
z









    Out[15]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     NaN
7     NaN
8    13.0
9    14.0
dtype: float64

`pandas` aggregate functions ignore missing data



In [16]:

    
z.sum(), z.mean(), z.std()









    Out[16]:





(72.0, 9.0, 3.2071349029490928)

Selecting missing values



In [17]:

    
z[z.isnull()]









    Out[17]:





6   NaN
7   NaN
dtype: float64

Selecting non-missing values



In [18]:

    
z[z.notnull()]









    Out[18]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
8    13.0
9    14.0
dtype: float64

Replacement of missing values



In [19]:

    
z.fillna(0)









    Out[19]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     0.0
7     0.0
8    13.0
9    14.0
dtype: float64



In [20]:

    
z.fillna(method='ffill')









    Out[20]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6    10.0
7    10.0
8    13.0
9    14.0
dtype: float64



In [21]:

    
z.fillna(method='bfill')









    Out[21]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6    13.0
7    13.0
8    13.0
9    14.0
dtype: float64



In [22]:

    
z.fillna(z.mean())









    Out[22]:





0     5.0
1     6.0
2     7.0
3     8.0
4     9.0
5    10.0
6     9.0
7     9.0
8    13.0
9    14.0
dtype: float64

Working with dates / times

We will see more date/time handling in the DataFrame section.



In [23]:

    
z.index = pd.date_range('01-Jan-2016', periods=len(z))



In [24]:

    
z









    Out[24]:





2016-01-01     5.0
2016-01-02     6.0
2016-01-03     7.0
2016-01-04     8.0
2016-01-05     9.0
2016-01-06    10.0
2016-01-07     NaN
2016-01-08     NaN
2016-01-09    13.0
2016-01-10    14.0
Freq: D, dtype: float64

Intelligent aggregation over datetime ranges



In [25]:

    
z.resample('W').sum()









    Out[25]:





2016-01-03    18.0
2016-01-10    54.0
Freq: W-SUN, dtype: float64

Formatting datetime objects (see http://strftime.org)



In [26]:

    
z.index.strftime('%b %d, %Y')









    Out[26]:





array(['Jan 01, 2016', 'Jan 02, 2016', 'Jan 03, 2016', 'Jan 04, 2016',
       'Jan 05, 2016', 'Jan 06, 2016', 'Jan 07, 2016', 'Jan 08, 2016',
       'Jan 09, 2016', 'Jan 10, 2016'], 
      dtype='<U12')

DataFrame

Similar to R.

Titanic data



In [27]:

    
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)



In [28]:

    
titanic.shape









    Out[28]:





(891, 15)



In [29]:

    
titanic.columns









    Out[29]:





Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')



In [30]:

    
# For display purposes, we will drop some columns
titanic = titanic[['survived', 'sex', 'age', 'fare',
                   'embarked', 'class', 'who', 'deck', 'embark_town',]]



In [31]:

    
titanic.dtypes









    Out[31]:





survived         int64
sex             object
age            float64
fare           float64
embarked        object
class           object
who             object
deck            object
embark_town     object
dtype: object

Summarizing a data frame



In [32]:

    
titanic.ix[0]









    Out[32]:





survived                 0
sex                   male
age                     22
fare                  7.25
embarked                 S
class                Third
who                    man
deck                   NaN
embark_town    Southampton
Name: 0, dtype: object



In [33]:

    
titanic.describe()



In [34]:

    
titanic.head()









    Out[34]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      NaN
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg
    
    
      2
      1
      female
      26.0
      7.9250
      S
      Third
      woman
      NaN
      Southampton
    
    
      3
      1
      female
      35.0
      53.1000
      S
      First
      woman
      C
      Southampton
    
    
      4
      0
      male
      35.0
      8.0500
      S
      Third
      man
      NaN
      Southampton



In [35]:

    
titanic.tail()









    Out[35]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      886
      0
      male
      27.0
      13.00
      S
      Second
      man
      NaN
      Southampton
    
    
      887
      1
      female
      19.0
      30.00
      S
      First
      woman
      B
      Southampton
    
    
      888
      0
      female
      NaN
      23.45
      S
      Third
      woman
      NaN
      Southampton
    
    
      889
      1
      male
      26.0
      30.00
      C
      First
      man
      C
      Cherbourg
    
    
      890
      0
      male
      32.0
      7.75
      Q
      Third
      man
      NaN
      Queenstown



In [36]:

    
titanic.columns









    Out[36]:





Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')



In [37]:

    
titanic.index









    Out[37]:





RangeIndex(start=0, stop=891, step=1)

Indexing



In [38]:

    
titanic[['sex', 'age', 'class']].head()









    Out[38]:






  
    
      
      sex
      age
      class
    
  
  
    
      0
      male
      22.0
      Third
    
    
      1
      female
      38.0
      First
    
    
      2
      female
      26.0
      Third
    
    
      3
      female
      35.0
      First
    
    
      4
      male
      35.0
      Third



In [39]:

    
titanic[10:15]









    Out[39]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      10
      1
      female
      4.0
      16.7000
      S
      Third
      child
      G
      Southampton
    
    
      11
      1
      female
      58.0
      26.5500
      S
      First
      woman
      C
      Southampton
    
    
      12
      0
      male
      20.0
      8.0500
      S
      Third
      man
      NaN
      Southampton
    
    
      13
      0
      male
      39.0
      31.2750
      S
      Third
      man
      NaN
      Southampton
    
    
      14
      0
      female
      14.0
      7.8542
      S
      Third
      child
      NaN
      Southampton

Using the `ix` helper for indexing



In [40]:

    
titanic.ix[10:15, 'age':'fare']



In [41]:

    
titanic.ix[10:15, [1,3,5]]









    Out[41]:






  
    
      
      sex
      fare
      class
    
  
  
    
      10
      female
      16.7000
      Third
    
    
      11
      female
      26.5500
      First
    
    
      12
      male
      8.0500
      Third
    
    
      13
      male
      31.2750
      Third
    
    
      14
      female
      7.8542
      Third
    
    
      15
      female
      16.0000
      Second



In [42]:

    
titanic[titanic.age < 2]









    Out[42]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      78
      1
      male
      0.83
      29.0000
      S
      Second
      child
      NaN
      Southampton
    
    
      164
      0
      male
      1.00
      39.6875
      S
      Third
      child
      NaN
      Southampton
    
    
      172
      1
      female
      1.00
      11.1333
      S
      Third
      child
      NaN
      Southampton
    
    
      183
      1
      male
      1.00
      39.0000
      S
      Second
      child
      F
      Southampton
    
    
      305
      1
      male
      0.92
      151.5500
      S
      First
      child
      C
      Southampton
    
    
      381
      1
      female
      1.00
      15.7417
      C
      Third
      child
      NaN
      Cherbourg
    
    
      386
      0
      male
      1.00
      46.9000
      S
      Third
      child
      NaN
      Southampton
    
    
      469
      1
      female
      0.75
      19.2583
      C
      Third
      child
      NaN
      Cherbourg
    
    
      644
      1
      female
      0.75
      19.2583
      C
      Third
      child
      NaN
      Cherbourg
    
    
      755
      1
      male
      0.67
      14.5000
      S
      Second
      child
      NaN
      Southampton
    
    
      788
      1
      male
      1.00
      20.5750
      S
      Third
      child
      NaN
      Southampton
    
    
      803
      1
      male
      0.42
      8.5167
      C
      Third
      child
      NaN
      Cherbourg
    
    
      827
      1
      male
      1.00
      37.0042
      C
      Second
      child
      NaN
      Cherbourg
    
    
      831
      1
      male
      0.83
      18.7500
      S
      Second
      child
      NaN
      Southampton

Sorting and ordering data



In [43]:

    
titanic.sort_index().head()









    Out[43]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      NaN
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg
    
    
      2
      1
      female
      26.0
      7.9250
      S
      Third
      woman
      NaN
      Southampton
    
    
      3
      1
      female
      35.0
      53.1000
      S
      First
      woman
      C
      Southampton
    
    
      4
      0
      male
      35.0
      8.0500
      S
      Third
      man
      NaN
      Southampton



In [44]:

    
titanic.sort_values('age', ascending=True).head()









    Out[44]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      803
      1
      male
      0.42
      8.5167
      C
      Third
      child
      NaN
      Cherbourg
    
    
      755
      1
      male
      0.67
      14.5000
      S
      Second
      child
      NaN
      Southampton
    
    
      644
      1
      female
      0.75
      19.2583
      C
      Third
      child
      NaN
      Cherbourg
    
    
      469
      1
      female
      0.75
      19.2583
      C
      Third
      child
      NaN
      Cherbourg
    
    
      78
      1
      male
      0.83
      29.0000
      S
      Second
      child
      NaN
      Southampton



In [45]:

    
titanic.sort_values(['survived', 'age'], ascending=[True, True]).head()









    Out[45]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      164
      0
      male
      1.0
      39.6875
      S
      Third
      child
      NaN
      Southampton
    
    
      386
      0
      male
      1.0
      46.9000
      S
      Third
      child
      NaN
      Southampton
    
    
      7
      0
      male
      2.0
      21.0750
      S
      Third
      child
      NaN
      Southampton
    
    
      16
      0
      male
      2.0
      29.1250
      Q
      Third
      child
      NaN
      Queenstown
    
    
      119
      0
      female
      2.0
      31.2750
      S
      Third
      child
      NaN
      Southampton

Grouping data



In [46]:

    
sex_class = titanic.groupby(['sex', 'class'])



In [47]:

    
sex_class.count()









    Out[47]:






  
    
      
      
      survived
      age
      fare
      embarked
      who
      deck
      embark_town
    
    
      sex
      class
      
      
      
      
      
      
      
    
  
  
    
      female
      First
      94
      85
      94
      92
      94
      81
      92
    
    
      Second
      76
      74
      76
      76
      76
      10
      76
    
    
      Third
      144
      102
      144
      144
      144
      6
      144
    
    
      male
      First
      122
      101
      122
      122
      122
      94
      122
    
    
      Second
      108
      99
      108
      108
      108
      6
      108
    
    
      Third
      347
      253
      347
      347
      347
      6
      347

Why Kate Winslett survived and Leonardo DiCaprio didn't



In [48]:

    
df = sex_class.mean()
df['survived']









    Out[48]:





sex     class 
female  First     0.968085
        Second    0.921053
        Third     0.500000
male    First     0.368852
        Second    0.157407
        Third     0.135447
Name: survived, dtype: float64

Of the females who were in first class, count the number from each embarking town



In [49]:

    
sex_class.get_group(('female', 'First')).groupby('embark_town').count()









    Out[49]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
    
    
      embark_town
      
      
      
      
      
      
      
      
    
  
  
    
      Cherbourg
      43
      43
      38
      43
      43
      43
      43
      35
    
    
      Queenstown
      1
      1
      1
      1
      1
      1
      1
      1
    
    
      Southampton
      48
      48
      44
      48
      48
      48
      48
      43

Cross-tabulation



In [50]:

    
pd.crosstab(titanic.survived, titanic['class'])

We can also get multiple summaries at the same time



In [51]:

    
def my_func(x):
    return np.max(x)



In [52]:

    
mapped_funcs = {'embarked': 'count', 'age': ('mean', 'median', my_func), 'survived': sum}
sex_class.get_group(('female', 'First')).groupby('embark_town').agg(mapped_funcs)









    Out[52]:






  
    
      
      survived
      embarked
      age
    
    
      
      sum
      count
      mean
      median
      my_func
    
    
      embark_town
      
      
      
      
      
    
  
  
    
      Cherbourg
      42
      43
      36.052632
      37.0
      60.0
    
    
      Queenstown
      1
      1
      33.000000
      33.0
      33.0
    
    
      Southampton
      46
      48
      32.704545
      33.0
      63.0



In [53]:

    
titanic.columns









    Out[53]:





Index(['survived', 'sex', 'age', 'fare', 'embarked', 'class', 'who', 'deck',
       'embark_town'],
      dtype='object')

Visualizing tables

See more examples in the Graphics notebook.



In [54]:

    
import seaborn as sns
sns.set_context(font_scale=4)
sns.factorplot(x='sex', y='age', hue='survived', col='class', kind='box', data=titanic)
pass

Making plots with `pandas`



In [55]:

    
from pandas_datareader import data as web
import datetime



In [56]:

    
apple = web.DataReader('AAPL', 'google', 
                        start = datetime.datetime(2015, 1, 1),
                        end = datetime.datetime(2015, 12, 31))



In [57]:

    
apple.head()



In [58]:

    
apple.plot.line(y='Close', marker='o', markersize=3, linewidth=0.5)
pass



In [59]:

    
# Zoom in on large drop in August
aug = apple['2015-08-01':'2015-08-30']
aug.plot.line(y=['High', 'Low', 'Open', 'Close'], marker='o', markersize=10, linewidth=1)
pass

Data conversions

One of the nicest features of pandas is the ease of converting tabular data across different storage formats. We will illustrate by converting the titanic dataframe into multiple formats.



In [60]:

    
titanic.head(2)









    Out[60]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      NaN
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg

CSV



In [61]:

    
titanic.to_csv('titanic.csv', index=False)



In [62]:

    
t1 = pd.read_csv('titanic.csv')
t1.head(2)









    Out[62]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      NaN
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg

Excel



In [63]:

    
!pip install openpyxl
t1.to_excel('titanic.xlsx')









    



Requirement already satisfied: openpyxl in /Users/cliburn/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: jdcal in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)
Requirement already satisfied: et_xmlfile in /Users/cliburn/anaconda2/lib/python2.7/site-packages (from openpyxl)



In [64]:

    
t2 = pd.read_excel('titanic.xlsx')
t2.head(2)









    Out[64]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      NaN
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg

Relational Database



In [65]:

    
import sqlite3

con = sqlite3.connect('titanic.db')
t2.to_sql('titanic', con, index=False, if_exists='replace')



In [66]:

    
t3 = pd.read_sql('select * from titanic', con)
t3.head(2)









    Out[66]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      None
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg

JSON



In [67]:

    
t3.to_json('titanic.json')



In [68]:

    
t4 = pd.read_json('titanic.json')
t4.head(2)









    Out[68]:






  
    
      
      age
      class
      deck
      embark_town
      embarked
      fare
      sex
      survived
      who
    
  
  
    
      0
      22.0
      Third
      None
      Southampton
      S
      7.2500
      male
      0
      man
    
    
      1
      38.0
      First
      C
      Cherbourg
      C
      71.2833
      female
      1
      woman



In [69]:

    
t4 = t4[t3.columns]
t4.head(2)









    Out[69]:






  
    
      
      survived
      sex
      age
      fare
      embarked
      class
      who
      deck
      embark_town
    
  
  
    
      0
      0
      male
      22.0
      7.2500
      S
      Third
      man
      None
      Southampton
    
    
      1
      1
      female
      38.0
      71.2833
      C
      First
      woman
      C
      Cherbourg

	survived	age	fare
count	891.000000	714.000000	891.000000
mean	0.383838	29.699118	32.204208
std	0.486592	14.526497	49.693429
min	0.000000	0.420000	0.000000
25%	0.000000	20.125000	7.910400
50%	0.000000	28.000000	14.454200
75%	1.000000	38.000000	31.000000
max	1.000000	80.000000	512.329200

	Open	High	Low	Close	Volume
Date
2015-01-02	111.39	111.44	107.35	109.33	53204626
2015-01-05	108.29	108.65	105.41	106.25	64285491
2015-01-06	106.54	107.43	104.63	106.26	65797116
2015-01-07	107.20	108.20	106.70	107.75	40105934
2015-01-08	109.23	112.15	108.70	111.89	59364547

	survived	sex	age	fare	embarked	class	who	deck	embark_town
0	0	male	22.0	7.2500	S	Third	man	NaN	Southampton
1	1	female	38.0	71.2833	C	First	woman	C	Cherbourg
2	1	female	26.0	7.9250	S	Third	woman	NaN	Southampton
3	1	female	35.0	53.1000	S	First	woman	C	Southampton
4	0	male	35.0	8.0500	S	Third	man	NaN	Southampton

	survived	sex	age	fare	embarked	class	who	deck	embark_town
886	0	male	27.0	13.00	S	Second	man	NaN	Southampton
887	1	female	19.0	30.00	S	First	woman	B	Southampton
888	0	female	NaN	23.45	S	Third	woman	NaN	Southampton
889	1	male	26.0	30.00	C	First	man	C	Cherbourg
890	0	male	32.0	7.75	Q	Third	man	NaN	Queenstown

	survived	sex	age	fare	embarked	class	who	deck	embark_town
10	1	female	4.0	16.7000	S	Third	child	G	Southampton
11	1	female	58.0	26.5500	S	First	woman	C	Southampton
12	0	male	20.0	8.0500	S	Third	man	NaN	Southampton
13	0	male	39.0	31.2750	S	Third	man	NaN	Southampton
14	0	female	14.0	7.8542	S	Third	child	NaN	Southampton

	survived	sex	age	fare	embarked	class	who	deck	embark_town
78	1	male	0.83	29.0000	S	Second	child	NaN	Southampton
164	0	male	1.00	39.6875	S	Third	child	NaN	Southampton
172	1	female	1.00	11.1333	S	Third	child	NaN	Southampton
183	1	male	1.00	39.0000	S	Second	child	F	Southampton
305	1	male	0.92	151.5500	S	First	child	C	Southampton
381	1	female	1.00	15.7417	C	Third	child	NaN	Cherbourg
386	0	male	1.00	46.9000	S	Third	child	NaN	Southampton
469	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
644	1	female	0.75	19.2583	C	Third	child	NaN	Cherbourg
755	1	male	0.67	14.5000	S	Second	child	NaN	Southampton
788	1	male	1.00	20.5750	S	Third	child	NaN	Southampton
803	1	male	0.42	8.5167	C	Third	child	NaN	Cherbourg
827	1	male	1.00	37.0042	C	Second	child	NaN	Cherbourg
831	1	male	0.83	18.7500	S	Second	child	NaN	Southampton

		survived	age	fare	embarked	who	deck	embark_town
sex	class
female	First	94	85	94	92	94	81	92
	Second	76	74	76	76	76	10	76
	Third	144	102	144	144	144	6	144
male	First	122	101	122	122	122	94	122
	Second	108	99	108	108	108	6	108
	Third	347	253	347	347	347	6	347

	survived	sex	age	fare	embarked	class	who	deck
embark_town
Cherbourg	43	43	38	43	43	43	43	35
Queenstown	1	1	1	1	1	1	1	1
Southampton	48	48	44	48	48	48	48	43

	survived	embarked	age
	sum	count	mean	median	my_func
embark_town
Cherbourg	42	43	36.052632	37.0	60.0
Queenstown	1	1	33.000000	33.0	33.0
Southampton	46	48	32.704545	33.0	63.0

	age	class	deck	embark_town	embarked	fare	sex	survived	who
0	22.0	Third	None	Southampton	S	7.2500	male	0	man
1	38.0	First	C	Cherbourg	C	71.2833	female	1	woman