In [1]:
%xmode plain
Pandas is a Python library for convenient data analysis, visualization, and formatting.
Pandas is kind of like R combined with SQL, but in Python.
To display plots in the Jupyter/IPython notebook, we must execute the following line first.
In [2]:
%matplotlib inline
To use Pandas, we must import the library:
In [3]:
import pandas
The primary data structure in Pandas is the DataFrame
. It contains data similar to a SQL table, a CSV file, or a spreadsheet.
There are a few methods in which we can create a DataFrame
.
From list_of_dicts
, each dictionary is one row. The dictionary key represents the column name, and the value represents the value for the row and column.
In [4]:
list_of_dicts = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}]
df = pandas.DataFrame(list_of_dicts)
In [5]:
df
Out[5]:
DataFrames consist of rows and columns, just like a SQL table or spreadsheet. The columns are fields, and the rows are records or observations. In the above example, the data frame df
has two rows and four columns.
If a particular dictionary in that list doesn't have a value for that key, then it assigns NaN
("not a number") for that index-column pair. As we see above, column d
was absent in row 0. Therefore, it assigned that cell NaN
.
Another example:
In [6]:
list_of_dicts = [{'a': 1}, {'b': 5}]
df = pandas.DataFrame(list_of_dicts)
In [7]:
df
Out[7]:
From list_of_lists
, each inner list represents one row. column_names
represents the column names.
In [8]:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)
In [9]:
df
Out[9]:
If the number of columns exceeds the number of values with an inner list, Pandas will fill empty values with NaN
:
In [10]:
list_of_lists = [[1, 2], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)
In [11]:
df
Out[11]:
To load data from a CSV file directly to a DataFrame
, where the first row of the CSV is a header row containing column names:
df = pandas.read_csv('my_data.csv')
If there is no header row, you must name each column manually:
df = pandas.read_csv('my_data.csv', header=None, names=['col1', 'col2', ...., 'col12'])
In the following example, the CSV file uses a semicolon as a field delimiter:
In [12]:
df = pandas.read_csv('data/red_wine.csv', delimiter=';')
We will use SQLAlchemy, a Python library that makes it convenient to read SQL databases (SQLite, MySQL, PostgreSQL) in Python.
To install SQLAlchemy, at the prompt (Mac users: the Terminal prompt; Windows users: the Anaconda prompt):
pip install sqlalchemy
If Mac users get a permissions error, then try:
sudo pip install sqlalchemy
Import:
In [13]:
import sqlalchemy
Let's begin with a simple example using SQLite.
engine
is an object that makes connections to the database. We must specify the location of the SQLite database. (A SQLite database is a single file.)
The engine string takes the following format:
engine = create_engine('sqlite:///path_to_database')
In [14]:
engine = sqlalchemy.create_engine('sqlite:///data/red_wine.db')
read_sql
allows us to use a SQL query string to query to database.
In [15]:
df = pandas.read_sql('select * from Wine limit 3', engine)
In [16]:
df
Out[16]:
In [17]:
df = pandas.read_sql('select chlorides, density, alcohol from Wine limit 3', engine)
In [18]:
df
Out[18]:
The engine string takes the following format:
engine = create_engine('postgresql://username:password@host:port/database')
In [19]:
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalityred')
df = pandas.read_sql('select * from "Wine" limit 3', engine)
In [20]:
df
Out[20]:
In [21]:
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalitywhite')
df = pandas.read_sql('select * from "Wine" limit 3', engine)
In [22]:
df
Out[22]:
Load a data frame:
In [23]:
df = pandas.read_csv('data/red_wine.csv', delimiter=';')
The shape
attribute returns a tuple containing the number of rows and columns:
In [24]:
df.shape
Out[24]:
The columns
attribute returns a list of column names:
In [25]:
df.columns
Out[25]:
The describe
method provides a statistical summary of the data frame:
In [26]:
df.describe()
Out[26]:
The head
method shows the first n
rows (by default, n
is 5). This method is good to get an immediate feel for the contents of the data frame:
In [27]:
df.head(2)
Out[27]:
In [28]:
df.head()
Out[28]:
The tail
method shows the last n
rows (by default, n
is 5).
In [29]:
df.tail()
Out[29]: