In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize']=(8,5) # optional
plt.style.use('bmh') # optional
We're going to investigate the Titanic passenger dataset. The data come from http://www.kaggle.com/c/titanic; you can download them there, though you'll need to sign up and agree to the rules of the competition.
In [2]:
#change the paths as needed
train = pd.read_csv('../data/titanic_train.csv')
test = pd.read_csv('../data/titanic_test.csv')
In [3]:
print(train.info()) # overview of the training data
In [4]:
print(test.info()) # note no 'Survived' column
In [5]:
train.head() # first few rows
Out[5]:
In [6]:
print(train.shape) # 891 rows, 12 columns
print(test.shape) # 418 rows, 11 columns
In [7]:
train.shape[0]
Out[7]:
In [8]:
train.describe(include='all') # summary statistics
Out[8]:
Note that the above summary gives you information about missing values indirectly: for instance, we have 891 observations in the data, but the count for 'Age' is only 714, implying that we don't have the age for 177 passengers. A direct way to get this is by combining the command isnull() with sum():
In [9]:
print(train.isnull().sum()) # the sum is taken columnwise
Let's head towards developing a simple predictive model for who survives the shipwreck. For now we'll explore and clean the data; later on we'll implement the model. Let's look at the distributions of some of the data.
In [10]:
train.hist(column='Age');
In [11]:
train.hist(column=['Age', 'Fare'], figsize=(16, 6));
In [12]:
pd.crosstab(train['Sex'], train['Survived'], margins=True) # 'margins' gives total counts
Out[12]:
This will be easier to analyze via proportions. Below, 'axis=1' applies the function to each row ('axis=0' would apply it to each column).
In [13]:
# saving the table for plotting
table = pd.crosstab(train.Sex, train.Survived).apply(lambda x: x/x.sum(), axis=1) # no use for margins here
table
Out[13]:
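As an aside, more recent versions of pandas let crosstab do this normalization directly, without the apply/lambda idiom. A quick sketch on a tiny made-up frame (hypothetical values, just to show the call):

```python
import pandas as pd

# hypothetical stand-in for the Titanic columns
df = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                   'Survived': [0, 1, 1, 1]})

# normalize='index' divides each row by its row total,
# equivalent to .apply(lambda x: x/x.sum(), axis=1)
table = pd.crosstab(df['Sex'], df['Survived'], normalize='index')
print(table)
```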
In [14]:
train['Survived'].sum() / train.shape[0]
Out[14]:
In [15]:
table.plot(kind='bar', stacked=True); # another plot call directly from a Pandas object
In [16]:
table.plot(kind='bar'); # without stacking
At this point, we're going to start addressing the missing values in Age, Embarked, Cabin, etc. It is important that any changes we make to 'train' are also made to 'test,' otherwise any predictive model we build will be flawed. Let's make a combined data frame.
In [17]:
test['Survived'] = 0 # create a new 'Survived' column in test, set all values to 0
In [18]:
test.head()
Out[18]:
In [19]:
alldata = pd.concat([train,test], keys=['train', 'test']) # tag the training and test data with a multiindex
In [20]:
alldata.tail() #note the multiindex on the rows
Out[20]:
In [21]:
alldata.shape[0] == 891 + 418
Out[21]:
Below are some examples of how we could use the multi-index on the rows.
In [22]:
print(alldata.loc['train', 'Name']) # .ix is removed in modern pandas; use .loc for label-based selection
In [23]:
print(alldata.iloc[1]['Name']) # purely positional selection
print(alldata.iloc[892]['Name']) # position 892 falls in the test set
In [24]:
print(alldata.loc['test', 'Name'][:5]) # note the numerical indices start at 0
With .loc you can also select the 'outer' index first, then pick out an 'inner' index from the result.
In [25]:
print(alldata.loc['test', 'Name'][1])
In [26]:
sum(alldata.Age.isnull())
Out[26]:
In [27]:
alldata.Age.var() # the variance of the Age column (missing values excluded)
Out[27]:
In [28]:
alldata['Age'] = alldata['Age'].interpolate() # linear interpolation between neighboring values; note this is not mean imputation
In [29]:
sum(alldata.Age.isnull())
Out[29]:
This isn't a flawless fix: simple imputation schemes like interpolation can shift the mean, reduce the variance, and distort any model built on the data. On the other hand, no imputation method is perfect, and you may get better results than if you left the missing values alone. Later we'll look at more advanced methods for imputing missing values.
In [30]:
alldata.Age.var()
Out[30]:
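As a preview of a slightly smarter approach (a sketch only, not what we did above): impute Age using the median within groups that correlate with age, such as Sex and Pclass. Here on a hypothetical mini-frame standing in for alldata:

```python
import numpy as np
import pandas as pd

# made-up rows standing in for alldata
df = pd.DataFrame({'Sex': ['male', 'male', 'female', 'female'],
                   'Pclass': [1, 1, 3, 3],
                   'Age': [40.0, np.nan, 18.0, np.nan]})

# fill each missing Age with the median Age of its (Sex, Pclass) group
df['Age'] = df.groupby(['Sex', 'Pclass'])['Age'].transform(
    lambda s: s.fillna(s.median()))
print(df)
```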
In [31]:
# your code here
print(alldata.Embarked.isnull().sum())
print(alldata.Embarked.value_counts())
There are a lot of missing Cabin values:
In [32]:
alldata.Cabin.isnull().sum()
Out[32]:
The simplest answer is to not include this column in the model for now. Three-quarters of the information in this column is missing, which makes it difficult to say much. However, it may be possible to infer some of the missing values based on implied relationships between passengers, perhaps by looking at last names, number of siblings on board, and so on. It's worth further investigation, but for now let's leave it.
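As a sketch of the last-name idea (using made-up names in the Titanic 'Last, Title. First' format), you could split Name on the comma and count shared surnames to find likely family groups:

```python
import pandas as pd

# hypothetical names in the Titanic format: 'Last, Title. First'
df = pd.DataFrame({'Name': ['Smith, Mr. John', 'Smith, Mrs. Jane',
                            'Jones, Miss. Amy']})

# everything before the first comma is the surname
df['LastName'] = df['Name'].str.split(',').str[0]
print(df['LastName'].value_counts())
```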
In [33]:
alldata.drop('Cabin', inplace=True, axis=1)
In [34]:
alldata.Embarked.isnull().sum()
Out[34]:
In [35]:
alldata.Embarked.value_counts()
Out[35]:
We have only two missing values, and the majority of the passengers embarked in Southampton (UK). Absent any other information, I'd guess that the two passengers embarked in Southampton as well.
In [36]:
# use the 'fillna' method
alldata['Embarked'] = alldata['Embarked'].fillna('S') # assigning back avoids chained-assignment pitfalls
In [37]:
alldata.Embarked.isnull().sum()
Out[37]:
In [38]:
cleaned_train = alldata.loc['train', :] #first 891 rows, use multi-index to select
cleaned_test = alldata.loc['test',:] #last 418 rows, same
print(cleaned_train.shape)
print(cleaned_test.shape)
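One caveat before saving: cleaned_test still carries the placeholder 'Survived' column we filled with zeros earlier, and those fake labels shouldn't leak into a model or submission. A sketch of dropping it (using a hypothetical stand-in frame):

```python
import pandas as pd

# stand-in for cleaned_test: the real test data has no true 'Survived' values
cleaned_test = pd.DataFrame({'PassengerId': [892, 893], 'Survived': [0, 0]})

# drop the placeholder column before modeling
cleaned_test = cleaned_test.drop('Survived', axis=1)
print(cleaned_test.columns.tolist())
```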
Saving your work as csv:
In [39]:
cleaned_train.to_csv('../data/cleaned_train.csv', index=False)
cleaned_test.to_csv('../data/cleaned_test.csv', index=False)
or as a pickle:
In [40]:
cleaned_train.to_pickle('../data/cleaned_train.p') # index is not an option for a pickle
cleaned_test.to_pickle('../data/cleaned_test.p')
In [41]:
del cleaned_train
In [42]:
whos
In [43]:
# verify that the pickle looks like the same object
ct = pd.read_pickle('../data/cleaned_train.p')
ct.describe(include='all')
Out[43]:
In [ ]:
# other options - autocomplete below:
ct.to
Pandas plays well with most databases. Form a connection to the database, then write ordinary SQL queries to bring data into Python for analysis. Here is an example with a sqlite3 database (this database is too large to host on github as is, so just follow along for this example).
In [44]:
import sqlite3
In [45]:
con = sqlite3.connect('/Volumes/data/taxis_old/taxis.sqlite3')
In [46]:
pd.read_sql_query('select * from sqlite_master', con)
Out[46]:
In [47]:
# metadata
pd.read_sql_query("select name from sqlite_master where type = 'table'", con)
Out[47]:
In [48]:
names = pd.read_sql_query("select sql from sqlite_master where tbl_name = 'trip_data'", con)
names.iloc[0, 0]
Out[48]:
In [49]:
pd.read_sql_query("select count(*) from trip_data", con) # takes a while
Out[49]:
In [50]:
tips = pd.read_sql_query('select tip_amount from fare_data where tip_amount > 0', con) # takes a while
In [51]:
tips.tip_amount.max()
Out[51]:
In [52]:
tips.describe()
Out[52]:
In [53]:
tips.hist(bins=500)
plt.xlim(0, 20);
In [54]:
whos
In [56]:
tips.info()
In [57]:
tips[tips.tip_amount > 800].shape
Out[57]:
In [ ]:
names = pd.read_sql_query("select sql from sqlite_master where tbl_name = 'fare_data'", con)
names.iloc[0, 0]
In [ ]:
# how much time do you have?
# big_fares = pd.read_sql_query('select * from fare_data ' \
#                               'inner join trip_data on trip_data.pickup_datetime = fare_data.pickup_datetime ' \
#                               'and trip_data.medallion = fare_data.medallion ' \
#                               'and trip_data.hack_license = fare_data.hack_license ' \
#                               'and fare_data.fare_amount > 100 limit 500', con)
In [58]:
con.close() # don't forget to close the connection!
In [59]:
pd.read_sql_query("select count(*) from trip_data", con) # raises an error: the connection is closed
In [60]:
whos
In [ ]: