This is the reference component of the "Creating, reading, and writing" section of the tutorial.
For the workbook section, click here.
The very first step in any data analytics project will probably be reading the data out of a file somewhere, so that's where we'll begin.
In this section, we'll look at exercises on creating pandas Series and DataFrame objects, both by hand and by reading data from storage.
The IO Tools section of the official pandas docs provides a comprehensive overview of this subject.
In [1]:
import pandas as pd
There are two core objects in pandas: the DataFrame and the Series.
A DataFrame is a table.
It contains an array of individual entries, each of which has a certain value.
Each entry corresponds with a row (or record) and a column.
For example, consider the following simple DataFrame:
In [2]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
Out[2]:
DataFrame entries aren't limited to integers.
Here's a DataFrame with strings as values:
In [3]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
Out[3]:
We are using the pd.DataFrame constructor to generate these DataFrame objects.
The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example) and whose values are lists of entries.
This is the standard way of constructing a new DataFrame, and the one you will usually encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.
Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index.
We can assign values to it by using an index parameter in our constructor:
In [4]:
pd.DataFrame({'Bob': ['I loved it.', 'I hated it.'],
              'Sue': ['That was okay.', 'That was not okay.']},
             index=['Product A', 'Product B'])
Out[4]:
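As a side note, the same table can be built in other ways; the sketch below (not part of the original notebook) passes the rows as a list of lists and supplies the column and row labels explicitly:
# A sketch of an equivalent construction: rows as a list of lists,
# with column names and the index passed as separate arguments.
pd.DataFrame([['I loved it.', 'That was okay.'],
              ['I hated it.', 'That was not okay.']],
             columns=['Bob', 'Sue'],
             index=['Product A', 'Product B'])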
A Series is a sequence of values.
If a DataFrame is a table (or a matrix), then a Series is a list (or a vector).
In fact, you can create a Series with nothing more than a list:
In [5]:
pd.Series([1, 2, 3, 4, 5])
Out[5]:
Think of a Series as a single column in a DataFrame.
You can assign row labels to the Series (same as above) using the index parameter.
A Series doesn't have a column name, though; it has just one overall name that can be assigned:
In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Out[6]:
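To make the column analogy concrete, here is a small illustrative sketch (not from the original notebook): selecting a single column of a DataFrame gives you back a Series, and a dict of Series can be fed straight into the DataFrame constructor.
# Pulling one column out of a DataFrame yields a Series named after that column;
# wrapping Series back up in a dict rebuilds a DataFrame.
reviews = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                        'Sue': ['Pretty good.', 'Bland.']})
bob = reviews['Bob']        # a Series with name 'Bob'
pd.DataFrame({'Bob': bob})  # back to a one-column DataFrame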
Now that you can see the relationship between the Series and the DataFrame, it's time to move on.
It's important to understand how DataFrames and Series work, but you won't usually be creating your own data to work with; instead, you'll be reading existing data stored in various formats.
Let's begin with the very common .csv (comma-separated values) file format.
We can use the read_csv function to read the data into a DataFrame.
In [7]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv')
wine_reviews[:1]
Out[7]:
We can use the .shape attribute to see how large the resulting DataFrame is:
In [8]:
wine_reviews.shape
Out[8]:
It appears that our wine_reviews DataFrame has ~130,000 records with 14 different columns, giving us just under 2 million entries to work with.
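If you want to verify that entry count yourself, here is a quick sketch:
# Total entries = rows x columns, taken straight from .shape.
n_rows, n_cols = wine_reviews.shape
n_rows * n_cols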
We can look at the first few rows using the .head() command:
In [9]:
wine_reviews.head()
Out[9]:
In [10]:
print(wine_reviews.head())
The pandas read_csv() function is quite versatile, with over 30 optional parameters you can specify.
For example, you can see in this dataset that the csv file has a built-in index, which pandas did not pick up on automatically.
To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col:
In [11]:
wine_reviews = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()
Out[11]:
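index_col is only one of those optional parameters. Here is a rough sketch of a couple of others (usecols and nrows are real read_csv options, but the column names below are my assumption about this dataset; adjust them to match your file):
# usecols keeps only the listed columns; nrows caps how many rows are read.
preview = pd.read_csv('inputs/wine-reviews/winemag-data-130k-v2.csv',
                      usecols=['country', 'points', 'price'],
                      nrows=1000)
preview.shape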
Another file format that you will see regularly is the Excel spreadsheet (.xls or .xlsx).
An Excel file is organized as a sequence of named sheets, and each sheet is basically a table.
When we load data into pandas using read_excel(), we need to specify which sheet(s) we are interested in.
In [12]:
wic = pd.read_excel('inputs/publicassistance/xls_files_all/WICAgencies2013ytd.xls', sheet_name='Total Women')
wic.head()
Out[12]:
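A related note on read_excel: sheet_name also accepts None, in which case every sheet is loaded at once and the result is a dict of DataFrames keyed by sheet name. A short sketch (not from the original notebook):
# sheet_name=None reads all sheets into a dict of DataFrames.
all_sheets = pd.read_excel('inputs/publicassistance/xls_files_all/WICAgencies2013ytd.xls',
                           sheet_name=None)
list(all_sheets.keys())[:5]  # peek at the first few sheet names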
As you can see from the sheet we just loaded, Excel files are often not formatted as well as CSV files are.
Spreadsheets allow (and encourage) creating notes and fields which are human-readable, but not machine-readable.
So before we can use this particular dataset, we will need to clean it up a bit.
We will see how to do so in the next section.
For now, let's move on to another common data format: SQL files.
SQL databases are where most of the data on the web ultimately gets stored.
They can be used to store anything, from data as simple as recipes to data as complicated as "almost everything on the Kaggle website".
Connecting to a SQL database requires a lot more thought than reading from an Excel file.
For one, you need to create a connector, something that will handle siphoning data from the database.
pandas won't do this for you automatically because there are many, many different types of SQL databases out there, each with its own connector.
So for a SQLite database (the only kind supported on Kaggle), you would need to first do the following (using the sqlite3 library that comes with Python):
In [13]:
import sqlite3
conn = sqlite3.connect('inputs/FPA_FOD_20170508.sqlite')
Now we are ready to write a SQL statement.
Internally, SQL databases all operate very differently.
Externally, however, they all provide the same API, the "Structured Query Language" (or...SQL...for short).
For the purposes of analysis, however, we can usually just think of a SQL database as a set of tables with names, and SQL as a minor inconvenience in getting that data out of said tables.
So, without further ado, here is all the SQL you have to know to get the data out of SQLite and into pandas:
In [14]:
fires = pd.read_sql_query("SELECT * FROM fires", conn)
In [15]:
fires.head()
Out[15]:
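If the table is large, you don't have to pull everything at once. Here is a small sketch that uses standard SQLite syntax to grab just a handful of rows:
# LIMIT keeps the query cheap while you are still exploring the table.
sample = pd.read_sql_query("SELECT * FROM fires LIMIT 5", conn)
sample.shape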
Isn't that just magical?
Another thing to love about pandas is how it takes care of conversions when you write a file.
This example uses to_csv to write data to a file:
In [16]:
wine_reviews.head().to_csv('inputs/wine-reviews/wine_reviews.csv')
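By default, to_csv also writes the index as an extra column. A quick sketch of two common variations (index and the path-less call are standard to_csv behaviour; the output filename below is just a hypothetical example):
# index=False drops the row labels from the output file.
wine_reviews.head().to_csv('inputs/wine-reviews/wine_reviews_noindex.csv', index=False)
# Calling to_csv() with no path returns the CSV text as a string instead of writing a file.
csv_text = wine_reviews.head().to_csv()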
To write an Excel file you can use to_excel, again supplying a sheet_name:
In [17]:
wic.to_excel('inputs/publicassistance/wic.xlsx', sheet_name='Total Women')
And to write SQL database entries, supply the name of the table and a connector:
In [21]:
conn = sqlite3.connect('inputs/fires.sqlite')
fires.head(10).to_sql('fires', conn)
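One caveat worth noting: to_sql raises an error if the table already exists, unless you pass if_exists ('fail', 'replace', or 'append' are the standard options). A short sketch, including a read-back to confirm the write:
# if_exists='replace' drops and recreates the table instead of erroring out.
fires.head(10).to_sql('fires', conn, if_exists='replace')
# Read the rows back to confirm they arrived, then tidy up the connection.
pd.read_sql_query("SELECT COUNT(*) AS n FROM fires", conn)
conn.close()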