Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning, and transformation operations that are critical when working with data in Python.

pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data-manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and to data ingestion, this notebook covers the following key features of pandas:

  • Generating descriptive statistics on data
  • Data cleaning using built-in pandas functions
  • Frequent data operations for subsetting, filtering, insertion, deletion, and aggregation of data
  • Merging multiple datasets using dataframes
  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries


In [1]:
import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures: *Series* and *DataFrames*.

pandas Series

A pandas Series is a one-dimensional labeled array.
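A Series can also be constructed from a dict, whose keys become the index labels. A small self-contained sketch (the names here are illustrative, not from the dataset):

```python
import pandas as pd

# Dict keys become the index; insertion order is preserved
s = pd.Series({'tom': 100, 'bob': 200, 'nancy': 300})
print(s['bob'])       # 200
print(list(s.index))  # ['tom', 'bob', 'nancy']
```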


In [2]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [3]:
ser


Out[3]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [4]:
ser.index


Out[4]:
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [5]:
ser.loc[['nancy','bob']]


Out[5]:
nancy    300
bob      foo
dtype: object

In [6]:
# Positional integer indexing with [] is deprecated in newer pandas; iloc is explicit
ser.iloc[[4, 3, 1]]


Out[6]:
eric    500
dan     bar
bob     foo
dtype: object

In [7]:
ser.iloc[2]


Out[7]:
300

In [8]:
'bob' in ser


Out[8]:
True
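Note that the `in` operator tests membership in the index, not in the values. A quick self-contained check (toy Series, not the one above):

```python
import pandas as pd

s = pd.Series([100, 'foo'], index=['tom', 'bob'])
print('bob' in s)         # True: 'bob' is an index label
print('foo' in s)         # False: values are not searched by `in`
print('foo' in s.values)  # True: check the values explicitly
```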

In [9]:
ser


Out[9]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [10]:
ser * 2


Out[10]:
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

In [11]:
ser[['nancy', 'eric']] ** 2


Out[11]:
nancy     90000
eric     250000
dtype: object

pandas DataFrame

A pandas DataFrame is a two-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series


In [12]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [13]:
df = pd.DataFrame(d)
print(df)


          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0

In [14]:
df.index


Out[14]:
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [15]:
df.columns


Out[15]:
Index(['one', 'two'], dtype='object')

In [16]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])


Out[16]:
         one     two
dancy    NaN  4444.0
ball   200.0   222.0
apple  100.0   111.0

In [17]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])


Out[17]:
          two five
dancy  4444.0  NaN
ball    222.0  NaN
apple   111.0  NaN
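Besides a dict of Series, a DataFrame can also be built from a dict of equal-length lists, optionally with an explicit index. A minimal sketch (values are illustrative):

```python
import pandas as pd

# Equal-length lists; the index labels the rows
df = pd.DataFrame({'one': [100., 200.], 'two': [111., 222.]},
                  index=['apple', 'ball'])
print(df.shape)  # (2, 2)
```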

Create DataFrame from list of Python dictionaries


In [ ]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [ ]:
pd.DataFrame(data)

In [ ]:
pd.DataFrame(data, index=['orange', 'red'])

In [ ]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])

Basic DataFrame operations


In [ ]:
df

In [ ]:
df['one']

In [ ]:
df['three'] = df['one'] * df['two']
df

In [ ]:
df['flag'] = df['one'] > 250
df

In [ ]:
three = df.pop('three')

In [ ]:
three

In [ ]:
df

In [ ]:
del df['two']

In [ ]:
df

In [ ]:
df.insert(2, 'copy_of_one', df['one'])
df

In [ ]:
df['one_upper_half'] = df['one'][:2]
df
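The column operations above can be summarized on a small synthetic frame (column names here are illustrative): `pop` removes a column and returns it as a Series, `del` removes in place, and `insert` places a column at a given position.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]})
two = df.pop('two')                      # removes 'two' and returns it
df.insert(1, 'copy_of_one', df['one'])   # insert at column position 1
print(list(df.columns))                  # ['one', 'copy_of_one']
print(two.tolist())                      # [3, 4]
```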

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using pandas.

Download the Dataset

Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.


In [21]:
# Note: Adjust the name of the folder to match your local directory

!ls ./movielens


Icon?       README.txt  links.csv   movies.csv  ratings.csv tags.csv

In [22]:
!cat ./movielens/movies.csv | wc -l


    9126

In [27]:
!head -5 ./movielens/ratings.csv


userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • ratings.csv : userId, movieId, rating, timestamp
  • tags.csv : userId, movieId, tag, timestamp
  • movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.


In [24]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)


<class 'pandas.core.frame.DataFrame'>
Out[24]:
    movieId  title                               genres
0         1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1         2  Jumanji (1995)                      Adventure|Children|Fantasy
2         3  Grumpier Old Men (1995)             Comedy|Romance
3         4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4         5  Father of the Bride Part II (1995)  Comedy
5         6  Heat (1995)                         Action|Crime|Thriller
6         7  Sabrina (1995)                      Comedy|Romance
7         8  Tom and Huck (1995)                 Adventure|Children
8         9  Sudden Death (1995)                 Action
9        10  GoldenEye (1995)                    Action|Adventure|Thriller
10       11  American President, The (1995)      Comedy|Drama|Romance
11       12  Dracula: Dead and Loving It (1995)  Comedy|Horror
12       13  Balto (1995)                        Adventure|Animation|Children
13       14  Nixon (1995)                        Drama
14       15  Cutthroat Island (1995)             Action|Adventure|Romance

In [25]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()


Out[25]:
   userId  movieId  tag                      timestamp
0      15      339  sandra 'boring' bullock  1138537770
1      15     1955  dentist                  1193435061
2      15     7478  Cambodia                 1170560997
3      15    32892  Russian                  1170626366
4      15    34162  forgettable              1141391765

In [28]:
# The timestamps are plain integers here; we parse them later with to_datetime
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()


Out[28]:
   userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182
3       1     1129     2.0  1260759185
4       1     1172     4.0  1260759205

In [29]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series


In [30]:
# Extract the 0th row: notice that it is in fact a Series

row_0 = tags.iloc[0]
type(row_0)


Out[30]:
pandas.core.series.Series

In [31]:
print(row_0)


userId                          15
movieId                        339
tag        sandra 'boring' bullock
Name: 0, dtype: object

In [32]:
row_0.index


Out[32]:
Index(['userId', 'movieId', 'tag'], dtype='object')

In [33]:
row_0['userId']


Out[33]:
15

In [34]:
'rating' in row_0


Out[34]:
False

In [35]:
row_0.name


Out[35]:
0

In [36]:
row_0 = row_0.rename('first_row')
row_0.name


Out[36]:
'first_row'

DataFrames


In [37]:
tags.head()


Out[37]:
   userId  movieId  tag
0      15      339  sandra 'boring' bullock
1      15     1955  dentist
2      15     7478  Cambodia
3      15    32892  Russian
4      15    34162  forgettable

In [38]:
tags.index


Out[38]:
RangeIndex(start=0, stop=1296, step=1)

In [39]:
tags.columns


Out[39]:
Index(['userId', 'movieId', 'tag'], dtype='object')

In [41]:
# Extract rows 0, 11, and 1000 from the DataFrame

tags.iloc[ [0,11,1000] ]


Out[41]:
      userId  movieId  tag
0         15      339  sandra 'boring' bullock
11        23      150  Ron Howard
1000     547    44199  toplist06

Descriptive Statistics

Let's look at how the ratings are distributed!


In [ ]:
ratings['rating'].describe()

In [ ]:
ratings.describe()

In [ ]:
ratings['rating'].mean()

In [ ]:
ratings.mean()

In [ ]:
ratings['rating'].min()

In [ ]:
ratings['rating'].max()

In [ ]:
ratings['rating'].std()

In [ ]:
ratings['rating'].mode()

In [ ]:
ratings.corr()

In [ ]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

In [ ]:
filter_2 = ratings['rating'] > 0
filter_2.all()
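The `any`/`all` pattern above can be checked on a small synthetic Series of ratings (toy values, not the MovieLens data): `any` asks whether at least one element passes the filter, `all` whether every element does.

```python
import pandas as pd

r = pd.Series([2.5, 3.0, 4.5, 5.0])
print((r > 5).any())   # False: no rating exceeds 5
print((r > 0).all())   # True: every rating is positive
```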

Data Cleaning: Handling Missing Data


In [ ]:
movies.shape

In [ ]:
#is any row NULL ?

movies.isnull().any()

That's nice! No NULL values!


In [ ]:
ratings.shape

In [ ]:
#is any row NULL ?

ratings.isnull().any()

That's nice! No NULL values!


In [ ]:
tags.shape

In [ ]:
#is any row NULL ?

tags.isnull().any()

We have some tags that are NULL.


In [ ]:
tags = tags.dropna()

In [ ]:
#Check again: is any row NULL ?

tags.isnull().any()

In [ ]:
tags.shape

That's nice! No NULL values! Notice that the number of rows has decreased.
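The same isnull/dropna pattern can be seen on a small synthetic frame, without the MovieLens files (column names mirror the tags table; values are made up):

```python
import pandas as pd

df = pd.DataFrame({'movieId': [1, 2, 3],
                   'tag': ['funny', None, 'dark']})
print(df.isnull().any())  # only the 'tag' column contains a null
clean = df.dropna()
print(clean.shape)        # (2, 2): the row with the null tag is gone
```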

Data Visualization


In [ ]:
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

In [ ]:
ratings.boxplot(column='rating', figsize=(15,20))

Slicing Out Columns


In [ ]:
tags['tag'].head()

In [ ]:
movies[['title','genres']].head()

In [ ]:
ratings[-10:]

In [ ]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

In [ ]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))

Filters for Selecting Rows


In [ ]:
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

In [ ]:
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

In [ ]:
movies[is_animation].head(15)

Group By and Aggregate


In [ ]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
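The group-by cells above can be condensed into a self-contained sketch (toy ratings, not the real data): grouping by `movieId` and then applying `mean` or `count` aggregates each movie's rows into one.

```python
import pandas as pd

r = pd.DataFrame({'movieId': [1, 1, 2],
                  'rating': [4.0, 5.0, 3.0]})
avg = r.groupby('movieId').mean()
cnt = r.groupby('movieId').count()
print(avg.loc[1, 'rating'])  # 4.5: mean of 4.0 and 5.0
print(cnt.loc[2, 'rating'])  # 1: movieId 2 has a single rating
```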

Merge Dataframes


In [ ]:
tags.head()

In [ ]:
movies.head()

In [ ]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()
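What the inner merge does can be illustrated with two tiny frames (made-up rows): only keys present in both sides survive, and rows multiply when a key repeats.

```python
import pandas as pd

movies_toy = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
tags_toy = pd.DataFrame({'movieId': [1, 1, 3], 'tag': ['x', 'y', 'z']})
m = movies_toy.merge(tags_toy, on='movieId', how='inner')
print(len(m))  # 2: movieId 1 matches twice; movieId 2 and 3 are dropped
```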


Combine aggregation, merging, and filters to get useful analytics


In [ ]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [ ]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [ ]:
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

In [ ]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

In [ ]:
box_office[is_comedy & is_highly_rated][-5:]

Vectorized String Operations


In [ ]:
movies.head()


Split 'genres' into multiple columns


In [ ]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [ ]:
movie_genres[:10]
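A quick self-contained illustration of `str.split` with `expand=True` (toy genre strings): each piece lands in its own column, and shorter rows are padded with missing values.

```python
import pandas as pd

g = pd.Series(['Comedy|Drama', 'Action'])
parts = g.str.split('|', expand=True)
print(parts.shape)  # (2, 2): 'Action' has no second piece, so it is padded
```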


Add a new column for comedy genre flag


In [ ]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [ ]:
movie_genres[:10]


Extract year from title e.g. (1995)


In [ ]:
# Use a raw string for the regex to avoid escape-sequence warnings
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)

In [ ]:
movies.tail()
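The extraction can be verified on a couple of standalone titles; the `\(\d{4}\)` variant below is a slightly tightened version of the notebook's pattern that only matches four-digit years:

```python
import pandas as pd

titles = pd.Series(['Toy Story (1995)', 'Heat (1995)'])
years = titles.str.extract(r'\((\d{4})\)', expand=False)
print(years.tolist())  # ['1995', '1995']
```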


More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!


In [ ]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [ ]:
tags.dtypes

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
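As a quick sanity check, epoch second 0 is the Unix epoch itself, and one of the raw timestamps from ratings.csv converts as expected:

```python
import pandas as pd

print(pd.to_datetime(0, unit='s'))           # 1970-01-01 00:00:00
print(pd.to_datetime(1260759144, unit='s'))  # a date in December 2009
```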


In [ ]:
tags.head(5)

In [ ]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the hardware's byte order.


In [ ]:
tags['parsed_time'].dtype

In [ ]:
tags.head(2)

Selecting rows based on timestamps


In [ ]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape
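The string-vs-datetime comparison works because pandas parses the string into a timestamp before comparing. A standalone sketch using two epoch values from the tags and ratings tables:

```python
import pandas as pd

t = pd.Series(pd.to_datetime([1138537770, 1260759144], unit='s'))
mask = t > '2015-02-01'   # the string is parsed into a timestamp
print(mask.tolist())      # both timestamps predate 2015, so all False
mask2 = t > '2009-01-01'
print(mask2.tolist())     # only the 2009 timestamp passes
```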

Sorting the table using the timestamps


In [ ]:
tags.sort_values(by='parsed_time', ascending=True)[:10]

Average Movie Ratings over Time


In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [ ]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()

In [ ]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]

In [ ]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)

Do some years look better for box-office movies than others?

Does any data point seem like an outlier in some sense?

