Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures pandas provides are the Series and the DataFrame. After a brief introduction to these two data structures and to data ingestion, this notebook covers the key features of pandas.
Additional Recommended Resources:
Let's get started with our first pandas notebook!
Import Libraries
In [1]:
import pandas as pd
Introduction to pandas Data Structures
pandas Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type.
In [2]:
# First argument: the data values; second argument: the index labels
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
In [3]:
ser
Out[3]:
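Besides an explicit values-plus-labels construction like the one above, a Series can also be built from a Python dict, in which case the keys become the index labels. A minimal sketch (the fruit counts here are just illustrative):
In [ ]:
# Dict keys become the index labels; dict values become the data
pd.Series({'apple': 3, 'banana': 5, 'cherry': 7})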
In [4]:
ser.index
Out[4]:
In [5]:
ser.loc[['nancy','bob']]
Out[5]:
In [6]:
# Integer lists in [] are treated as positions here; newer pandas versions
# deprecate this in favor of ser.iloc[[4, 3, 1]]
ser[[4, 3, 1]]
Out[6]:
In [7]:
ser.iloc[2]
Out[7]:
In [8]:
# Membership tests the index labels, not the values
'bob' in ser
Out[8]:
In [9]:
ser
Out[9]:
In [10]:
ser * 2
Out[10]:
In [11]:
ser[['nancy', 'eric']] ** 2
Out[11]:
pandas DataFrame
A pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types.
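Besides the two constructions shown below, a DataFrame can also be built from a plain dict of equal-length lists. A minimal sketch (the column names are illustrative):
In [ ]:
# Each key becomes a column; all lists must have the same length
pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})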
Create DataFrame from dictionary of Python Series
In [12]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
In [13]:
df = pd.DataFrame(d)
print(df)
In [14]:
df.index
Out[14]:
In [15]:
df.columns
Out[15]:
In [16]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
Out[16]:
In [17]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Out[17]:
Create DataFrame from list of Python dictionaries
In [18]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [19]:
pd.DataFrame(data)
Out[19]:
In [20]:
pd.DataFrame(data, index=['orange', 'red'])
Out[20]:
In [21]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])
Out[21]:
Basic DataFrame operations
In [22]:
df
Out[22]:
In [23]:
df['one']
Out[23]:
In [24]:
# Column arithmetic aligns on the index; labels missing from either column yield NaN
df['three'] = df['one'] * df['two']
df
Out[24]:
In [25]:
df['flag'] = df['one'] > 250
df
Out[25]:
In [26]:
three = df.pop('three')
In [27]:
three
Out[27]:
In [28]:
df
Out[28]:
In [29]:
del df['two']
In [30]:
df
Out[30]:
In [31]:
df.insert(2, 'copy_of_one', df['one'])
df
Out[31]:
In [32]:
# Assigning a partial column aligns on the index; the remaining rows get NaN
df['one_upper_half'] = df['one'][:2]
df
Out[32]:
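In addition to pop, del, and insert, a column can be removed non-destructively: drop returns a new DataFrame and leaves the original untouched. A quick sketch:
In [ ]:
# drop returns a copy; df itself is unchanged
df.drop(columns=['copy_of_one'])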
Case Study: Movie Data Analysis
Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.
Here are the links to the data source and location:
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
In [33]:
# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
In [34]:
!cat ./movielens/movies.csv | wc -l
In [35]:
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
Using the read_csv function in pandas, we will ingest these three files.
In [36]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)
In [ ]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
In [ ]:
# Note: parse_dates cannot interpret Unix epoch seconds; we convert the
# timestamps properly with pd.to_datetime(..., unit='s') later on
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
In [ ]:
# For the current analysis, we will remove the timestamp columns (we will come back to them!)
del ratings['timestamp']
del tags['timestamp']
In [ ]:
# Extract the 0th row: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
In [ ]:
print(row_0)
In [ ]:
row_0.index
In [ ]:
row_0['userId']
In [ ]:
'rating' in row_0
In [ ]:
row_0.name
In [ ]:
row_0 = row_0.rename('first_row')
row_0.name
In [ ]:
tags.head()
In [ ]:
tags.index
In [ ]:
tags.columns
In [ ]:
# Extract row 0, 11, 2000 from DataFrame
tags.iloc[ [0,11,2000] ]
Let's look at how the ratings are distributed!
In [ ]:
ratings['rating'].describe()
In [ ]:
ratings.describe()
In [ ]:
ratings['rating'].mean()
In [ ]:
ratings.mean()
In [ ]:
ratings['rating'].min()
In [ ]:
ratings['rating'].max()
In [ ]:
ratings['rating'].std()
In [ ]:
ratings['rating'].mode()
In [ ]:
ratings.corr()
In [ ]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()
In [ ]:
filter_2 = ratings['rating'] > 0
filter_2.all()
In [ ]:
movies.shape
In [ ]:
# Are any of the values NULL?
movies.isnull().any()
That's nice! No NULL values!
In [ ]:
ratings.shape
In [ ]:
# Are any of the values NULL?
ratings.isnull().any()
That's nice! No NULL values!
In [ ]:
tags.shape
In [ ]:
# Are any of the values NULL?
tags.isnull().any()
Some of the tags are NULL.
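Before dropping rows, it is worth counting exactly how many values are missing in each column; summing the null flags is a standard idiom:
In [ ]:
# Count the missing values per column
tags.isnull().sum()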
In [ ]:
tags = tags.dropna()
In [ ]:
# Check again: are any of the values NULL?
tags.isnull().any()
In [ ]:
tags.shape
That's nice! No NULL values! Notice that the number of rows has decreased.
In [ ]:
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
In [ ]:
ratings.boxplot(column='rating', figsize=(15,20))
In [ ]:
tags['tag'].head()
In [ ]:
movies[['title','genres']].head()
In [ ]:
ratings[-10:]
In [ ]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
In [ ]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))
In [ ]:
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
In [ ]:
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
In [ ]:
movies[is_animation].head(15)
In [ ]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
In [ ]:
tags.head()
In [ ]:
movies.head()
In [ ]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()
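The how argument controls which keys survive the join: an inner join keeps only movies that have at least one tag, while a left join keeps every movie and fills the tag columns of untagged movies with NaN. A quick comparison:
In [ ]:
# how='left' keeps all movies; untagged movies get NaN in the tag columns
t_left = movies.merge(tags, on='movieId', how='left')
t_left.head()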
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggregation, merging, and filters to get useful analytics
In [ ]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
# Drop userId: the mean of user IDs is meaningless
del avg_ratings['userId']
avg_ratings.head()
In [ ]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
In [ ]:
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
In [ ]:
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
In [ ]:
box_office[is_comedy & is_highly_rated][-5:]
In [ ]:
movies.head()
Split 'genres' into multiple columns
In [ ]:
movie_genres = movies['genres'].str.split('|', expand=True)
In [ ]:
movie_genres[:10]
Add a new column for comedy genre flag
In [ ]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
In [ ]:
movie_genres[:10]
Extract the year from the title, e.g. (1995)
In [ ]:
# Use a raw string so the backslashes in the regex are not escape characters
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)
In [ ]:
movies.tail()
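Note that the extracted year is still a string, so the year-by-year grouping later in this notebook sorts lexically. If numeric years are preferred, to_numeric with errors='coerce' (which turns unparseable values into NaN) is one option:
In [ ]:
# Convert the year strings to numbers; values that fail to parse become NaN
movies['year'] = pd.to_numeric(movies['year'], errors='coerce')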
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
In [ ]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')
In [ ]:
tags.dtypes
Unix time (also called POSIX time or epoch time) records time as the number of seconds elapsed since midnight Coordinated Universal Time (UTC) on January 1, 1970.
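As a quick sanity check, a single epoch value can be converted by hand (the number below is just an illustrative timestamp, not from the dataset):
In [ ]:
# One billion seconds after the epoch
pd.to_datetime(1_000_000_000, unit='s')  # Timestamp('2001-09-09 01:46:40')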
In [ ]:
tags.head(5)
In [ ]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the endianness of the hardware.
In [ ]:
tags['parsed_time'].dtype
In [ ]:
tags.head(2)
Selecting rows based on timestamps
In [ ]:
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
Sorting the table using the timestamps
In [ ]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
In [ ]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
In [ ]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
In [ ]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
Do some years look better for box office movies than others?
Does any data point seem like an outlier in some sense?
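One extra check (not part of the original walkthrough) that helps answer the outlier question is to count how many movies back each yearly average; years with very few movies produce noisy means:
In [ ]:
# Number of movies per year: sparse years make the average unreliable
joined[['year', 'rating']].groupby('year').count().tail(10)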