Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures pandas provides are the Series and the DataFrame. After a brief introduction to these two data structures and to data ingestion, this notebook covers the key features of pandas.
Additional Recommended Resources:
Let's get started with our first pandas notebook!
Import Libraries
In [1]:
import pandas as pd
Introduction to pandas Data Structures
pandas Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type.
In [2]:
# First argument: the data values; second argument: the index labels
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
In [3]:
ser
Out[3]:
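Besides an explicit values-plus-labels construction like the one above, a Series can also be built from a Python dict, in which case the keys become the index labels. A minimal sketch (the fruit counts here are just illustrative):
In [ ]:
# Dict keys become the index labels; dict values become the data
pd.Series({'apple': 3, 'banana': 5, 'cherry': 7})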
In [4]:
ser.index
Out[4]:
In [5]:
ser.loc[['nancy','bob']]
Out[5]:
In [6]:
# Integer lists in [] are treated as positions here; newer pandas versions
# deprecate this in favor of ser.iloc[[4, 3, 1]]
ser[[4, 3, 1]]
Out[6]:
In [7]:
ser.iloc[2]
Out[7]:
In [8]:
# Membership tests the index labels, not the values
'bob' in ser
Out[8]:
In [9]:
ser
Out[9]:
In [10]:
ser * 2
Out[10]:
In [11]:
ser[['nancy', 'eric']] ** 2
Out[11]:
pandas DataFrame
A pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types.
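Besides the two constructions shown below, a DataFrame can also be built from a plain dict of equal-length lists. A minimal sketch (the column names are illustrative):
In [ ]:
# Each key becomes a column; all lists must have the same length
pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})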
Create DataFrame from dictionary of Python Series
In [12]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
In [13]:
df = pd.DataFrame(d)
print(df)
In [14]:
df.index
Out[14]:
In [15]:
df.columns
Out[15]:
In [16]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
Out[16]:
In [17]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Out[17]:
Create DataFrame from list of Python dictionaries
In [18]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [19]:
pd.DataFrame(data)
Out[19]:
In [20]:
pd.DataFrame(data, index=['orange', 'red'])
Out[20]:
In [21]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])
Out[21]:
Basic DataFrame operations
In [22]:
df
Out[22]:
In [23]:
df['one']
Out[23]:
In [24]:
# Column arithmetic aligns on the index; labels missing from either column yield NaN
df['three'] = df['one'] * df['two']
df
Out[24]:
In [25]:
df['flag'] = df['one'] > 250
df
Out[25]:
In [26]:
three = df.pop('three')
In [27]:
three
Out[27]:
In [28]:
df
Out[28]:
In [29]:
del df['two']
In [30]:
df
Out[30]:
In [31]:
df.insert(2, 'copy_of_one', df['one'])
df
Out[31]:
In [32]:
# Assigning a partial column aligns on the index; the remaining rows get NaN
df['one_upper_half'] = df['one'][:2]
df
Out[32]:
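In addition to pop, del, and insert, a column can be removed non-destructively: drop returns a new DataFrame and leaves the original untouched. A quick sketch:
In [ ]:
# drop returns a copy; df itself is unchanged
df.drop(columns=['copy_of_one'])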
Case Study: Movie Data Analysis
Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.
Here are the links to the data source and location:
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
In [33]:
# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
In [34]:
!cat ./movielens/movies.csv | wc -l
In [35]:
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
Using the read_csv function in pandas, we will ingest these three files.
In [36]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
In [ ]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
In [ ]:
# Note: parse_dates cannot interpret Unix epoch seconds; we convert the
# timestamps properly with pd.to_datetime(..., unit='s') later on
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
In [ ]:
# For the current analysis, we will remove the timestamp columns (we will come back to them!)
del ratings['timestamp']
del tags['timestamp']
In [ ]:
# Extract the 0th row: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
In [ ]:
print(row_0)
In [ ]:
row_0.index
In [ ]:
row_0['userId']
In [ ]:
'rating' in row_0
In [ ]:
row_0.name
In [ ]:
row_0 = row_0.rename('first_row')
row_0.name
In [ ]:
tags.head()
In [ ]:
tags.index
In [ ]:
tags.columns
In [ ]:
# Extract row 0, 11, 2000 from DataFrame
tags.iloc[ [0,11,2000] ]
Let's look at how the ratings are distributed!
In [ ]:
ratings['rating'].describe()
In [ ]:
ratings.describe()
In [ ]:
ratings['rating'].mean()
In [ ]:
ratings.mean()
In [ ]:
ratings['rating'].min()
In [ ]:
ratings['rating'].max()
In [ ]:
ratings['rating'].std()
In [ ]:
ratings['rating'].mode()
In [ ]:
ratings.corr()
In [ ]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
In [ ]:
filter_2 = ratings['rating'] > 0
filter_2.all()
In [ ]:
movies.shape
In [ ]:
# Are any of the values NULL?
movies.isnull().any()
That's nice! No NULL values!
In [ ]:
ratings.shape
In [ ]:
# Are any of the values NULL?
ratings.isnull().any()
That's nice! No NULL values!
In [ ]:
tags.shape
In [ ]:
# Are any of the values NULL?
tags.isnull().any()
Some of the tags are NULL.
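Before dropping rows, it is worth counting exactly how many values are missing in each column; summing the null flags is a standard idiom:
In [ ]:
# Count the missing values per column
tags.isnull().sum()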
In [ ]:
tags = tags.dropna()
In [ ]:
# Check again: are any of the values NULL?
tags.isnull().any()
In [ ]:
tags.shape
That's nice! No NULL values! Notice that the number of rows has decreased.
In [ ]:
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
In [ ]:
ratings.boxplot(column='rating', figsize=(15,20))
In [ ]:
tags['tag'].head()
In [ ]:
movies[['title','genres']].head()
In [ ]:
ratings[-10:]
In [ ]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
In [ ]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))
In [ ]:
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
In [ ]:
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
In [ ]:
movies[is_animation].head(15)
In [ ]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
In [ ]:
tags.head()
In [ ]:
movies.head()
In [ ]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()
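The how argument controls which keys survive the join: an inner join keeps only movies that have at least one tag, while a left join keeps every movie and fills the tag columns of untagged movies with NaN. A quick comparison:
In [ ]:
# how='left' keeps all movies; untagged movies get NaN in the tag columns
t_left = movies.merge(tags, on='movieId', how='left')
t_left.head()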
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggregation, merging, and filters to get useful analytics
In [ ]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
# Drop userId: the mean of user IDs is meaningless
del avg_ratings['userId']
avg_ratings.head()
In [ ]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
In [ ]:
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
In [ ]:
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
In [ ]:
box_office[is_comedy & is_highly_rated][-5:]
In [ ]:
movies.head()
Split 'genres' into multiple columns
In [ ]:
movie_genres = movies['genres'].str.split('|', expand=True)
In [ ]:
movie_genres[:10]
Add a new column for comedy genre flag
In [ ]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
In [ ]:
movie_genres[:10]
Extract the year from the title, e.g. (1995)
In [ ]:
# Use a raw string so the backslashes in the regex are not escape characters
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
In [ ]:
movies.tail()
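Note that the extracted year is still a string, so the year-by-year grouping later in this notebook sorts lexically. If numeric years are preferred, to_numeric with errors='coerce' (which turns unparseable values into NaN) is one option:
In [ ]:
# Convert the year strings to numbers; values that fail to parse become NaN
movies['year'] = pd.to_numeric(movies['year'], errors='coerce')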
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
In [ ]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')
In [ ]:
tags.dtypes
Unix time (also called POSIX time or epoch time) records time as the number of seconds elapsed since midnight Coordinated Universal Time (UTC) on January 1, 1970.
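As a quick sanity check, a single epoch value can be converted by hand (the number below is just an illustrative timestamp, not from the dataset):
In [ ]:
# One billion seconds after the epoch
pd.to_datetime(1_000_000_000, unit='s')  # Timestamp('2001-09-09 01:46:40')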
In [ ]:
tags.head(5)
In [ ]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the endianness of the hardware.
In [ ]:
tags['parsed_time'].dtype
In [ ]:
tags.head(2)
Selecting rows based on timestamps
In [ ]:
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
Sorting the table using the timestamps
In [ ]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
In [ ]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
In [ ]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
In [ ]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
Do some years look better for box office movies than others?
Does any data point seem like an outlier in some sense?
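One extra check (not part of the original walkthrough) that helps answer the outlier question is to count how many movies back each yearly average; years with very few movies produce noisy means:
In [ ]:
# Number of movies per year: sparse years make the average unreliable
joined[['year', 'rating']].groupby('year').count().tail(10)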