In [1]:
%xmode plain


Exception reporting mode: Plain

Pandas: Basics, Reading Data

Pandas is a Python library for convenient data analysis, visualization, and formatting.

Pandas is kind of like R combined with SQL, but in Python.

Imports

To display plots in the Jupyter/IPython notebook, we must execute the following line first.


In [2]:
%matplotlib inline

To use Pandas, we must import the library:


In [3]:
import pandas

Creating a DataFrame

The primary data structure in Pandas is the DataFrame. It contains data similar to a SQL table, a CSV file, or a spreadsheet.

There are a few methods in which we can create a DataFrame.

List of Dictionaries

From list_of_dicts, each dictionary is one row. The dictionary key represents the column name, and the value represents the value for the row and column.


In [4]:
list_of_dicts = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6, 'd': 7}]
df = pandas.DataFrame(list_of_dicts)

In [5]:
df


Out[5]:
a b c d
0 1 2 3 NaN
1 4 5 6 7.0

DataFrames consist of rows and columns, just like a SQL table or spreadsheet. The columns are fields, and the rows are records or observations. In the above example, the data frame df has two rows and four columns.

If a particular dictionary in that list doesn't have a value for that key, then it assigns NaN ("not a number") for that index-column pair. As we see above, column d was absent in row 0. Therefore, it assigned that cell NaN.

Another example:


In [6]:
list_of_dicts = [{'a': 1}, {'b': 5}]
df = pandas.DataFrame(list_of_dicts)

In [7]:
df


Out[7]:
a b
0 1.0 NaN
1 NaN 5.0

List of Lists

From list_of_lists, each inner list represents one row. column_names represents the column names.


In [8]:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)

In [9]:
df


Out[9]:
a b c
0 1 2 3
1 4 5 6

If the number of columns exceeds the number of values with an inner list, Pandas will fill empty values with NaN:


In [10]:
list_of_lists = [[1, 2], [4, 5, 6]]
column_names = ['a', 'b', 'c']
df = pandas.DataFrame(data=list_of_lists, columns=column_names)

In [11]:
df


Out[11]:
a b c
0 1 2 NaN
1 4 5 6.0

External Data: CSV

To load data from a CSV file directly to a DataFrame, where the first row of the CSV is a header row containing column names:

df = pandas.read_csv('my_data.csv')

If there is no header row, you must name each column manually:

df = pandas.read_csv('my_data.csv', header=None, names=['col1', 'col2', ...., 'col12'])

In the following example, the CSV file uses a semicolon as a field delimiter:


In [12]:
df = pandas.read_csv('data/red_wine.csv', delimiter=';')

External Data: SQL

We will use SQLAlchemy, a Python library that makes it convenient to read SQL databases (SQLite, MySQL, PostgreSQL) in Python.

To install SQLAlchemy, at the prompt (Mac users: the Terminal prompt; Windows users: the Anaconda prompt):

pip install sqlalchemy

If Mac users get a permissions error, then try:

sudo pip install sqlalchemy

Import:


In [13]:
import sqlalchemy

External Data: SQLite

Let's begin with a simple example using SQLite.

engine is an object that makes connections to the database. We must specify the location of the SQLite database. (A SQLite database is a single file.)

The engine string takes the following format:

engine = create_engine('sqlite:///path_to_database')

In [14]:
engine = sqlalchemy.create_engine('sqlite:///data/red_wine.db')

read_sql allows us to use a SQL query string to query to database.


In [15]:
df = pandas.read_sql('select * from Wine limit 3', engine)

In [16]:
df


Out[16]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

In [17]:
df = pandas.read_sql('select chlorides, density, alcohol from Wine limit 3', engine)

In [18]:
df


Out[18]:
chlorides density alcohol
0 0.076 0.9978 9.4
1 0.098 0.9968 9.8
2 0.092 0.9970 9.8

External Data: PostgreSQL

The engine string takes the following format:

engine = create_engine('postgresql://username:password@host:port/database')

In [19]:
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalityred')
df = pandas.read_sql('select * from "Wine" limit 3', engine)

In [20]:
df


Out[20]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5

In [21]:
engine = sqlalchemy.create_engine('postgresql://steve@localhost:5432/winequalitywhite')
df = pandas.read_sql('select * from "Wine" limit 3', engine)

In [22]:
df


Out[22]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6

DataFrame Properties

Load a data frame:


In [23]:
df = pandas.read_csv('data/red_wine.csv', delimiter=';')

The shape attribute returns a tuple containing the number of rows and columns:


In [24]:
df.shape


Out[24]:
(1599, 12)

The columns attribute returns a list of column names:


In [25]:
df.columns


Out[25]:
Index([u'fixed acidity', u'volatile acidity', u'citric acid',
       u'residual sugar', u'chlorides', u'free sulfur dioxide',
       u'total sulfur dioxide', u'density', u'pH', u'sulphates', u'alcohol',
       u'quality'],
      dtype='object')

DataFrame Methods

The describe method provides a statistical summary of the data frame:


In [26]:
df.describe()


Out[26]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

The head method shows the first n rows (by default, n is 5). This method is good to get an immediate feel for the contents of the data frame:


In [27]:
df.head(2)


Out[27]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5

In [28]:
df.head()


Out[28]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

The tail method shows the last n rows (by default, n is 5).


In [29]:
df.tail()


Out[29]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6