In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
We'll be using the MovieLens dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.
In [2]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
df_users = pd.read_csv('data/MovieLens-100k/u.user', sep='|', names=u_cols)
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_ratings = pd.read_csv('data/MovieLens-100k/u.data', sep='\t', names=r_cols)
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
df_movies = pd.read_csv('data/MovieLens-100k/u.item', sep='|', names=m_cols, usecols=range(5))  # only load the first five columns
Pandas has a variety of functions for getting basic information about your DataFrame.
The most basic is simply calling your DataFrame by name; the output tells us a few things about it.
In [3]:
print df_movies.dtypes,'\n'
print df_users.dtypes,'\n'
print df_ratings.dtypes,'\n'
In [4]:
df_users.describe()
Out[4]:
Notice user_id was included since it's numeric. Since this is an ID value, the stats for it don't really matter.
We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.
In [5]:
print df_users.head()
In [6]:
print df_users.tail(3)
In [7]:
print df_users[20:22]
In [8]:
df_users['occupation'].head()
Out[8]:
Multiple column selection
To select multiple columns, simply pass a list of column names to the DataFrame, the output of which will be a DataFrame.
In [9]:
list_of_cols = ['occupation', 'sex']
print df_users[list_of_cols].head()
Row selection can be done in multiple ways, but boolean indexing and the .ix() method are typically the easiest.
In [10]:
# users older than 25
print df_users[df_users.age > 25].head(3), '\n'
# users aged 40 AND male
print df_users[(df_users.age == 40) & (df_users.sex == 'M')].head(3), '\n'
# users younger than 30 OR female
print df_users[(df_users.sex == 'F') | (df_users.age < 30)].head(3)
.ix() method
By default, pandas assigns a 0-based integer index to each DataFrame. When you change the index of a DataFrame to a specific column, use the .ix() method for row selection based on the new index.
Let's set the index to the user_id using the .set_index() method.
NB: By default, .set_index() returns a new DataFrame, so you'll have to specify if you'd like the changes to occur in place.
In [11]:
# Change index column (new DataFrame)
new_df_users = df_users.set_index('user_id')
print new_df_users.head(3)
# Change index column (inplace)
df_users.set_index('user_id', inplace=True)
print df_users.head(3)
In [12]:
# Select users using their respective user_id
print df_users.ix[99], '\n'
print df_users.ix[[1, 50, 300]]
Use the .reset_index() method to restore the default index (the same rule applies for inplace).
In [13]:
df_users.reset_index(inplace=True)
print df_users.head()
.sort() for DataFrames
Use the .sort() method to sort DataFrames; it returns a new DataFrame unless inplace is set. (See DOC)
- columns : column name(s) to base the sorting on (list for nested sorting, tuple for multi-index sorting)
- ascending (True) : sort ascending vs. descending (specify a list for multiple sort orders)
- inplace (False) : if False, the result is a new DataFrame
In [14]:
# Oldest technicians
df_users.sort('age', ascending=False, inplace=True)
print df_users[df_users.occupation == "technician"][:5]
In [15]:
print df_users.zip_code.order()[:3]
Pandas provides a large number of methods for computing descriptive statistics and other related operations on Series, DataFrame, and Panel objects. For DataFrames these methods take an axis argument:
- axis=0 : compute over indexes
- axis=1 : compute over columns
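For instance, a minimal sketch on a made-up toy frame (df_toy is hypothetical, not part of the MovieLens data):
df_toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
print df_toy.sum(axis=0)  # sums down each column: a -> 3, b -> 30
print df_toy.sum(axis=1)  # sums across each row: 11, 22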
Most methods produce a lower-dimensional result (aka aggregate functions) :
- .count() : number of NOT NULL values
- .nunique() : number of unique NOT NULL values
- .size() : number of values
- .min() : minimum
- .max() : maximum
- .sum() : sum of values
- .prod() : product of values
- .median() : arithmetic median of values
- .quantile() : sample quantile (value at %)
- .mean() : mean of values
- .std() : unbiased standard deviation
- .var() : unbiased variance
- .mad() : mean absolute deviation
- .sem() : unbiased standard error of the mean
- .skew() : unbiased skewness (3rd moment)
- .kurt() : unbiased kurtosis (4th moment)
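As a quick sketch, a few of these aggregates applied to the users' ages (the exact numbers depend on the data):
print df_users.age.count()         # number of NOT NULL ages
print df_users.age.mean()          # average age
print df_users.age.quantile(0.25)  # first quartile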
Some methods produce an object of the same size :
- .rank() : compute data rank (1 through n)
- .mode() : mode
- .abs() : absolute value
- .cumsum() : cumulative sum
- .cumprod() : cumulative product
- .cummax() : cumulative maximum
- .cummin() : cumulative minimum
To apply your own or another library's functions to pandas objects, you should be aware of the three methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.
- .pipe()
- .apply()
- .applymap() or .map()
.pipe()
Use .pipe() for method chaining over a DataFrame. (See DOC)
The following two are equivalent :
- f(g(h(df), arg1=1), arg2=2, arg3=3)
- df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3)
The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the popular (%>%) (read pipe) operator for R.
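As a minimal sketch, here are two hypothetical helper functions (drop_minors and take_top are made up for illustration, not pandas functions) chained with .pipe():
def drop_minors(df):
    return df[df.age >= 18]

def take_top(df, n=5):
    return df.head(n)

# equivalent to take_top(drop_minors(df_users), n=3)
print df_users.pipe(drop_minors).pipe(take_top, n=3)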
.apply()
Use .apply() to apply a function along an axis of a DataFrame, like the descriptive statistics methods. (See DOC)
- df.apply(np.mean, axis=1)
- df.apply(lambda x: x.max() - x.min())
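For instance, a quick sketch computing the spread of each numeric column of the ratings table (the results depend on the data):
# column-wise (axis=0 is the default): max minus min of each column
print df_ratings[['rating', 'unix_timestamp']].apply(lambda x: x.max() - x.min())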
.applymap() / .map()
Use .applymap() on a DataFrame or .map() on a Series to operate elementwise.
The function must take a single value and return a single value. (See DOC)
- df.applymap(lambda x: len(str(x)))
- df['colA'].map(lambda x: len(str(x)))
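For example, a small sketch getting the length of each movie title from the df_movies table loaded earlier:
# elementwise on a Series: title length
print df_movies.title.map(lambda t: len(str(t))).head(3)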
Use the pandas.cut() function to bin numeric values into groups. Useful for discretization. (See DOC)
pandas.cut(x, bins) returns an array of the indices (or labels) of the half-open bins to which each value of x belongs.
- x : array of values to be binned
- bins : sequence defining the bin edges
- right (True) : boolean indicating whether the bins include the rightmost edge or not ([a, b] or [a, b[)
- labels (None) : array used as labels for the resulting bins
In [16]:
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bins = range(0, 81, 10) # [0, 10, 20, 30, 40, 50, 60, 70, 80]
df_users['age_group'] = pd.cut(df_users.age, bins, right=False, labels=labels)
print df_users[27:31] # preview of the age bins
Use the .value_counts() Series method to return the counts of unique values (i.e. frequency). (See DOC)
In [17]:
df_users['occupation'].value_counts().head()
Out[17]:
Use the .groupby() method to execute the split-apply-combine strategy for data analysis.
DataFrameGroupBy/SeriesGroupBy methods (See DOC) :
- .apply() : apply your own or another library's function, or a list of functions
- .agg() : aggregate using an input function or a dict of {column: function}
- .transform() : apply a function whose result has the same shape as (and is like-indexed with) the group
- .filter() : return a copy of the DataFrame excluding elements from groups that fail a boolean criterion
In the apply step, we might wish to do one of the following:
- gby.agg([np.sum, np.mean])
- gby.agg([np.size, np.mean])
- gby.transform(lambda x: (x - x.mean()) / x.std())
- gby.transform(lambda x: x.fillna(x.mean()))
- gby.filter(lambda x: len(x) > 100)
- gby.filter(lambda x: x['A'].sum() + x['B'].sum() > 0)
- gby.apply(lambda x: x.dropna())
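As a sketch of transform and filter on the ratings table (the 20-rating cutoff here is arbitrary, chosen just for illustration):
gby = df_ratings.groupby('movie_id')
# transform: center each rating on its movie's mean (result has the same shape as the input)
df_ratings['rating_centered'] = gby['rating'].transform(lambda x: x - x.mean())
# filter: keep only ratings of movies that received more than 20 ratings
popular = gby.filter(lambda x: len(x) > 20)
print popular.head(3)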
In [18]:
!head -n 3 data/city-of-chicago-salaries.csv
Since the data contains a '$' sign for each salary, pandas will treat the field as a series of strings. We can use the converters parameter to change this when reading in the file.
converters : dict of functions for converting values in certain columns. Keys can either be integers or column labels.
In [19]:
headers = ['name', 'title', 'department', 'salary']
df_chicago = pd.read_csv('data/city-of-chicago-salaries.csv',
header=0,  # the file has its own header row; replace it with our names
names=headers,
converters={'salary': lambda x: float(x.replace('$', ''))})
print df_chicago.head()
In [20]:
print df_chicago.groupby('department').count().head(3), '\n' # NOT NULL records within each column
print df_chicago.groupby('department').size().head(3) # total records for each department
In [21]:
print df_chicago.groupby('department').agg({'salary': [np.size, np.mean]}).head()
In [22]:
print df_chicago.groupby('department').title.nunique().order(ascending=False)[:3]
In [23]:
print df_chicago.groupby('department').mean().sort('salary', ascending=False).head()
In [24]:
print df_chicago.groupby('department').agg({'salary': [np.size, np.mean]}).sort(('salary', 'mean'), ascending=False).head()
.rank() method
(NB: .order() is for Series, .sort() is for DataFrames.)
For the .rank() method, use the arguments:
- ascending=False : to rank high (1) to low (N)
- method='first' : so that equally high-paid people within a department don't get the same rank
In [67]:
df_chicago['dept_rank'] = df_chicago.groupby('department')['salary'].rank(method='first', ascending=False)
df_chicago.sort('salary', ascending=False, inplace=True)
print df_chicago[df_chicago['dept_rank'] == 1].head()
In [49]:
print df_chicago[df_chicago['department'] == 'MAYOR\'S OFFICE'].tail(10)
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Use the .dropna() method to drop rows or columns with missing data (NaN).
ex: df.dropna()
- axis {(0), 1} : drop rows/columns
- subset (None) : list of columns/rows to consider
- inplace (False) : if False, the result is a new DataFrame
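For instance, a quick sketch on the movies table (whether any rows are dropped depends on which release dates are actually missing in the file):
# drop movie rows that are missing a release_date
print len(df_movies)
print len(df_movies.dropna(subset=['release_date']))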
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
Use the .fillna() method to fill missing data (NaN) with a specified value or propagation method.
ex: df['col_A'].fillna(df['col_A'].mean())
- value : value to use to fill holes
- method {'backfill', 'bfill', 'pad', 'ffill', (None)} : method of propagation to use for filling holes
- inplace (False) : if False, the result is a new DataFrame
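A minimal sketch on a small hypothetical Series with a hole:
s = pd.Series([1.0, np.nan, 3.0])
print s.fillna(s.mean())        # NaN replaced by the mean, 2.0
print s.fillna(method='ffill')  # NaN replaced by the previous value, 1.0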