This is the reference component to the "Creating, reading, and writing" section of the tutorial.
For the workbook section, click here.
The very first step in any data analytics project will probably be reading the data out of a file somewhere, so that's where we'll begin.
In this section, we'll look at exercises on creating pandas Series and DataFrame objects, both by hand and by reading data from storage.
The IO Tools section of the official pandas docs provides a comprehensive overview on this subject.


In [1]:
import pandas as pd

Creating data

There are two core objects in pandas:
the DataFrame and the Series.

A DataFrame is a table.
It contains an array of individual entries, each of which has a certain value.
Each entry corresponds with a row (or record) and a column.
For example, consider the following simple DataFrame:


In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})


Out[2]:
No Yes
0 131 50
1 2 21

DataFrame entries aren't limited to integers.
Here's a DataFrame with strings as values:


In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})


Out[3]:
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.

We are using the pd.DataFrame constructor to generate these DataFrame objects.
The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are a list of entries.
This is the standard way of constructing a new DataFrame, and the one you will usually encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.
Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index.
We can assign values to it by using an index parameter in our constructor:


In [4]:
pd.DataFrame({'Bob': ['I loved it.', 'I hated it.'],
              'Sue': ['That was okay.', 'That was not okay.']},
            index=['Product A', 'Product B'])


Out[4]:
Bob Sue
Product A I loved it. That was okay.
Product B I hated it. That was not okay.

A Series is a sequence of values.
If a DataFrame is a table (or a matrix), then a Series is a list (or a vector).
In fact, you can create a Series with nothing more than a list:


In [5]:
pd.Series([1, 2, 3, 4, 5])


Out[5]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Think of a Series as a single column in a DataFrame.
You can assign column values to the Series (same as above) using the index parameter.
In a Series, the columns aren't named; the Series itself has one overall name that can be assigned:


In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')


Out[6]:
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

Now that you can see the relationship between the Series and the DataFrame, it's time to move on.

Reading common file formats

It's important to understand how DataFrames and Series work, but you won't usually be creating your own data to work with.
Now it's time to learn how to deal with existing data and various formats.
Let's begin with the very common .csv (comma separated value) file format.
We can use the read_csv function to read the data into a DataFrame.


In [7]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv')
wine_reviews[:1]


Out[7]:
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia

We can use the .shape attribute to see how large the resulting DataFrame is:


In [8]:
wine_reviews.shape


Out[8]:
(129971, 14)

It appears that our wine_reviews DataFrame has ~130,000 records with 14 different columns, giving us just under 2 million entries to work with.
We can look at the first few rows using the .head() command:


In [9]:
wine_reviews.head()


Out[9]:
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

In [10]:
print(wine_reviews.head())


   Unnamed: 0   country                                        description  \
0           0     Italy  Aromas include tropical fruit, broom, brimston...   
1           1  Portugal  This is ripe and fruity, a wine that is smooth...   
2           2        US  Tart and snappy, the flavors of lime flesh and...   
3           3        US  Pineapple rind, lemon pith and orange blossom ...   
4           4        US  Much like the regular bottling from 2012, this...   

                          designation  points  price           province  \
0                        Vulkà Bianco      87    NaN  Sicily & Sardinia   
1                            Avidagos      87   15.0              Douro   
2                                 NaN      87   14.0             Oregon   
3                Reserve Late Harvest      87   13.0           Michigan   
4  Vintner's Reserve Wild Child Block      87   65.0             Oregon   

              region_1           region_2         taster_name  \
0                 Etna                NaN       Kerin O’Keefe   
1                  NaN                NaN          Roger Voss   
2    Willamette Valley  Willamette Valley        Paul Gregutt   
3  Lake Michigan Shore                NaN  Alexander Peartree   
4    Willamette Valley  Willamette Valley        Paul Gregutt   

  taster_twitter_handle                                              title  \
0          @kerinokeefe                  Nicosia 2013 Vulkà Bianco  (Etna)   
1            @vossroger      Quinta dos Avidagos 2011 Avidagos Red (Douro)   
2           @paulgwine       Rainstorm 2013 Pinot Gris (Willamette Valley)   
3                   NaN  St. Julian 2013 Reserve Late Harvest Riesling ...   
4           @paulgwine   Sweet Cheeks 2012 Vintner's Reserve Wild Child...   

          variety               winery  
0     White Blend              Nicosia  
1  Portuguese Red  Quinta dos Avidagos  
2      Pinot Gris            Rainstorm  
3        Riesling           St. Julian  
4      Pinot Noir         Sweet Cheeks  

The pandas read_csv() function is quite versatile, with over 30 optional parameters you can specify.
For example, you can see in this dataset that the csv file has a in-built index, which pandas did not pick up on automatically.
To make pandas use that column for the index (instead of creating a new one from scratch), we may specify and use an index_col:


In [11]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()


Out[11]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Another file format that you will see regularly is the Excel spreadsheet (.xls or .xlst).
An Excel file is organized as a sequence of named sheets, and each sheet is basically a table.
When we load data into pandas using read_excel(), we need to specify which sheet(s) we are interested in.


In [12]:
wic = pd.read_excel('inputs/publicassistance/xls_files_all/WICAgencies2013ytd.xls', sheet_name='Total Women')
wic.head()


Out[12]:
Note on WIC Agency Level Monthly Spreadsheets
0 This file contains monthly data for the curren...
1 currently 90 WIC State agencies: the 50 geogr...
2 Guam, the Virgin Islands, American Samoa, Nort...
3 Sixteen spreadsheets are included in the follo...
4 Pregnant Women

As you can see in this example, Excel files are often not formatted as well as CSV files are.
Spreadsheets allow (and encourage) creating notes and fields which are human-readable, but not machine-readable.
So before we can use this particular dataset, we will need to clean it up a bit.
We will see how to do so in the next section.
For now, let's move on to another common data format: SQL files.

SQL databases are where most of the data on the web ultimately gets stored.
They can be used to store data on things as simple as recipes to things as complicated as "almost everything on the Kaggle website".
Connecting to a SQL database requires a lot more thought than reading from an Excel file.
For one, you need to create a connector, something that will handle siphoning data from the database.
pandas won't do this for you automatically because there are many, many different types of SQL databases out there, each with its own connector.
So for a SQLite database (the only kind supported on Kaggle), you would need to first do the following (using the sqlite3 library that comes with Python):


In [13]:
import sqlite3
conn = sqlite3.connect('inputs/FPA_FOD_20170508.sqlite')

Now we are ready to write a SQL statement.
Internally, SQL databases all operate very differently.
Externally, however, they all provide the same API, the "Structured Query Language" (or...SQL...for short). For the purposes of analysis however we can usually just think of a SQL database as a set of tables with names, and SQL as a minor inconvenience in getting that data out of said tables.
So, without further ado, here is all the SQL you have to know to get the data out of SQLite and into pandas:


In [14]:
fires = pd.read_sql_query("SELECT * FROM fires", conn)

In [15]:
fires.head()


Out[15]:
OBJECTID FOD_ID FPA_ID SOURCE_SYSTEM_TYPE SOURCE_SYSTEM NWCG_REPORTING_AGENCY NWCG_REPORTING_UNIT_ID NWCG_REPORTING_UNIT_NAME SOURCE_REPORTING_UNIT SOURCE_REPORTING_UNIT_NAME ... FIRE_SIZE_CLASS LATITUDE LONGITUDE OWNER_CODE OWNER_DESCR STATE COUNTY FIPS_CODE FIPS_NAME Shape
0 1 1 FS-1418826 FED FS-FIRESTAT FS USCAPNF Plumas National Forest 0511 Plumas National Forest ... A 40.036944 -121.005833 5.0 USFS CA 63 063 Plumas b'\x00\x01\xad\x10\x00\x00\xe8d\xc2\x92_@^\xc0...
1 2 2 FS-1418827 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest ... A 38.933056 -120.404444 5.0 USFS CA 61 061 Placer b'\x00\x01\xad\x10\x00\x00T\xb6\xeej\xe2\x19^\...
2 3 3 FS-1418835 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest ... A 38.984167 -120.735556 13.0 STATE OR PRIVATE CA 17 017 El Dorado b'\x00\x01\xad\x10\x00\x00\xd0\xa5\xa0W\x13/^\...
3 4 4 FS-1418845 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest ... A 38.559167 -119.913333 5.0 USFS CA 3 003 Alpine b'\x00\x01\xad\x10\x00\x00\x94\xac\xa3\rt\xfa]...
4 5 5 FS-1418847 FED FS-FIRESTAT FS USCAENF Eldorado National Forest 0503 Eldorado National Forest ... A 38.559167 -119.933056 5.0 USFS CA 3 003 Alpine b'\x00\x01\xad\x10\x00\x00@\xe3\xaa.\xb7\xfb]\...

5 rows × 39 columns

Isn't that just magical?

Writing common file formats

Another thing to love about pandas is how it takes care of conversions when you write file.
This example uses to_csv to write data to a file:


In [16]:
wine_reviews.head().to_csv('inputs/wine-reviews/wine_reviews.csv')

To write an Excel file you can use to_excel and the sheet_name again:


In [17]:
wic.to_excel('inputs/publicassistance/wic.xlsx', sheet_name='Total Women')

And for SQL database entries, supply the name of the table and specify a connecter:


In [21]:
conn = sqlite3.connect('inputs/fires.sqlite')
fires.head(10).to_sql('fires', conn)