This is the reference component of the "Creating, reading, and writing" section of the tutorial.
For the workbook section, click here.
The very first step in any data analytics project will probably be reading the data out of a file somewhere, so that's where we'll begin.
In this section, we'll look at exercises on creating pandas Series and DataFrame objects, both by hand and by reading data from storage.
The IO Tools section of the official pandas docs provides a comprehensive overview of this subject.
In [1]:
import pandas as pd
There are two core objects in pandas: the DataFrame and the Series.
A DataFrame is a table.
It contains an array of individual entries, each of which has a certain value.
Each entry corresponds with a row (or record) and a column.
For example, consider the following simple DataFrame:
In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out[2]:
DataFrame entries aren't limited to integers.
Here's a DataFrame with strings as values:
In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out[3]:
We are using the pd.DataFrame constructor to generate these DataFrame objects.
The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example) and whose values are lists of entries.
This is the standard way of constructing a new DataFrame, and the one you will usually encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.
Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index.
We can assign values to it by using an index parameter in our constructor:
In [4]:
pd.DataFrame({'Bob': ['I loved it.', 'I hated it.'],
              'Sue': ['That was okay.', 'That was not okay.']},
             index=['Product A', 'Product B'])
Out[4]:
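As a side note, the same table can be built in other ways; the sketch below (not part of the original notebook) passes the rows as a list of lists and supplies the column and row labels explicitly:
# A sketch of an equivalent construction: rows as a list of lists,
# with column names and the index passed as separate arguments.
pd.DataFrame([['I loved it.', 'That was okay.'],
              ['I hated it.', 'That was not okay.']],
             columns=['Bob', 'Sue'],
             index=['Product A', 'Product B'])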
A Series is a sequence of values.
If a DataFrame is a table (or a matrix), then a Series is a list (or a vector).
In fact, you can create a Series with nothing more than a list:
In [5]:
pd.Series([1, 2, 3, 4, 5])
Out[5]:
Think of a Series as a single column in a DataFrame.
You can assign row labels to the Series (same as above) using the index parameter.
A Series doesn't have a column name, though; it has just one overall name that can be assigned:
In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out[6]:
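To make the column analogy concrete, here is a small illustrative sketch (not from the original notebook): selecting a single column of a DataFrame gives you back a Series, and a dict of Series can be fed straight into the DataFrame constructor.
# Pulling one column out of a DataFrame yields a Series named after that column;
# wrapping Series back up in a dict rebuilds a DataFrame.
reviews = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                        'Sue': ['Pretty good.', 'Bland.']})
bob = reviews['Bob']        # a Series with name 'Bob'
pd.DataFrame({'Bob': bob})  # back to a one-column DataFrame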
Now that you can see the relationship between the Series and the DataFrame, it's time to move on.
It's important to understand how DataFrames and Series work, but you won't usually be creating your own data to work with; instead, you'll be reading existing data stored in various formats.
Let's begin with the very common .csv (comma-separated values) file format.
We can use the read_csv function to read the data into a DataFrame.
In [7]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv')
wine_reviews[:1]
Out[7]:
We can use the .shape attribute to see how large the resulting DataFrame is:
In [8]:
wine_reviews.shape
Out[8]:
It appears that our wine_reviews DataFrame has ~130,000 records with 14 different columns, giving us just under 2 million entries to work with.
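If you want to verify that entry count yourself, here is a quick sketch:
# Total entries = rows x columns, taken straight from .shape.
n_rows, n_cols = wine_reviews.shape
n_rows * n_cols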
We can look at the first few rows using the .head() command:
In [9]:
wine_reviews.head()
Out[9]:
In [10]:
print(wine_reviews.head())
The pandas read_csv() function is quite versatile, with over 30 optional parameters you can specify.
For example, you can see in this dataset that the csv file has a built-in index, which pandas did not pick up on automatically.
To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col:
In [11]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()
Out[11]:
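index_col is only one of those optional parameters. Here is a rough sketch of a couple of others (usecols and nrows are real read_csv options, but the column names below are my assumption about this dataset; adjust them to match your file):
# usecols keeps only the listed columns; nrows caps how many rows are read.
preview = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv',
                      usecols=['country', 'points', 'price'],
                      nrows=1000)
preview.shape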
Another file format that you will see regularly is the Excel spreadsheet (.xls or .xlsx).
An Excel file is organized as a sequence of named sheets, and each sheet is basically a table.
When we load data into pandas using read_excel(), we need to specify which sheet(s) we are interested in.
In [12]:
wic = pd.read_excel('inputs/publicassistance/xls_files_all/WICAgencies2013ytd.xls', sheet_name='Total Women')
wic.head()
Out[12]:
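A related note on read_excel: sheet_name also accepts None, in which case every sheet is loaded at once and the result is a dict of DataFrames keyed by sheet name. A short sketch (not from the original notebook):
# sheet_name=None reads all sheets into a dict of DataFrames.
all_sheets = pd.read_excel('inputs/publicassistance/xls_files_all/WICAgencies2013ytd.xls',
                           sheet_name=None)
list(all_sheets.keys())[:5]  # peek at the first few sheet names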
As you can see from the sheet we just loaded, Excel files are often not formatted as well as CSV files are.
Spreadsheets allow (and encourage) creating notes and fields which are human-readable, but not machine-readable.
So before we can use this particular dataset, we will need to clean it up a bit.
We will see how to do so in the next section.
For now, let's move on to another common data format: SQL files.
SQL databases are where most of the data on the web ultimately gets stored.
They can be used to store anything, from data as simple as recipes to data as complicated as "almost everything on the Kaggle website".
Connecting to a SQL database requires a lot more thought than reading from an Excel file.
For one, you need to create a connector, something that will handle siphoning data from the database.
pandas won't do this for you automatically because there are many, many different types of SQL databases out there, each with its own connector.
So for a SQLite database (the only kind supported on Kaggle), you would need to first do the following (using the sqlite3 library that comes with Python):
In [13]:
import sqlite3
conn = sqlite3.connect('inputs/FPA_FOD_20170508.sqlite')
Now we are ready to write a SQL statement.
Internally, SQL databases all operate very differently.
Externally, however, they all provide the same API, the "Structured Query Language" (or...SQL...for short).
For the purposes of analysis, however, we can usually just think of a SQL database as a set of tables with names, and SQL as a minor inconvenience in getting that data out of said tables.
So, without further ado, here is all the SQL you have to know to get the data out of SQLite and into pandas:
In [14]:
fires = pd.read_sql_query("SELECT * FROM fires", conn)
In [15]:
fires.head()
Out[15]:
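If the table is large, you don't have to pull everything at once. Here is a small sketch that uses standard SQLite syntax to grab just a handful of rows:
# LIMIT keeps the query cheap while you are still exploring the table.
sample = pd.read_sql_query("SELECT * FROM fires LIMIT 5", conn)
sample.shape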
Isn't that just magical?
Another thing to love about pandas is how it takes care of conversions when you write a file.
This example uses to_csv to write data to a file:
In [16]:
wine_reviews.head().to_csv('inputs/wine-reviews/wine_reviews.csv')
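By default, to_csv also writes the index as an extra column. A quick sketch of two common variations (index and the path-less call are standard to_csv behaviour; the output filename below is just a hypothetical example):
# index=False drops the row labels from the output file.
wine_reviews.head().to_csv('inputs/wine-reviews/wine_reviews_noindex.csv', index=False)
# Calling to_csv() with no path returns the CSV text as a string instead of writing a file.
csv_text = wine_reviews.head().to_csv()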
To write an Excel file you can use to_excel, again supplying a sheet_name:
In [17]:
wic.to_excel('inputs/publicassistance/wic.xlsx', sheet_name='Total Women')
And to write SQL database entries, supply the name of the table and a connector:
In [21]:
conn = sqlite3.connect('inputs/fires.sqlite')
fires.head(10).to_sql('fires', conn)
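One caveat worth noting: to_sql raises an error if the table already exists, unless you pass if_exists ('fail', 'replace', or 'append' are the standard options). A short sketch, including a read-back to confirm the write:
# if_exists='replace' drops and recreates the table instead of erroring out.
fires.head(10).to_sql('fires', conn, if_exists='replace')
# Read the rows back to confirm they arrived, then tidy up the connection.
pd.read_sql_query("SELECT COUNT(*) AS n FROM fires", conn)
conn.close()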