Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.
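As a quick illustration of that NumPy foundation (a minimal sketch with made-up data), the values behind a Series are stored as a NumPy ndarray, so NumPy functions apply directly:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# The storage behind a Series is a NumPy ndarray
print(type(s.to_numpy()))    # <class 'numpy.ndarray'>

# NumPy functions accept a Series directly
print(np.mean(s))            # 2.0
```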

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

  • Generating descriptive statistics on data
  • Data cleaning using built-in pandas functions
  • Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
  • Merging multiple datasets using dataframes
  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries


In [1]:
import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures: *Series* and *DataFrames*.

pandas Series

A pandas Series is a one-dimensional labeled array.


In [2]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])

In [3]:
ser


Out[3]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [4]:
ser.index


Out[4]:
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [5]:
ser.loc[['nancy','bob']]


Out[5]:
nancy    300
bob      foo
dtype: object

In [6]:
ser[[4, 3, 1]]


Out[6]:
eric    500
dan     bar
bob     foo
dtype: object

In [7]:
ser.iloc[2]


Out[7]:
300

In [8]:
'bob' in ser


Out[8]:
True

In [9]:
ser


Out[9]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [10]:
ser * 2


Out[10]:
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

In [11]:
ser[['nancy', 'eric']] ** 2


Out[11]:
nancy     90000
eric     250000
dtype: object
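One Series behavior worth noting before moving on: arithmetic between two Series aligns on index labels, not on position. A small sketch with made-up data:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# Labels present in only one Series yield NaN in the result
print(a + b)
# x     NaN
# y    12.0
# z    23.0
# dtype: float64
```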

pandas DataFrame

A pandas DataFrame is a 2-dimensional labeled data structure whose columns can hold different data types.

Create DataFrame from dictionary of Python Series


In [12]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [13]:
df = pd.DataFrame(d)
print(df)


          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0

In [14]:
df.index


Out[14]:
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [15]:
df.columns


Out[15]:
Index(['one', 'two'], dtype='object')

In [16]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])


Out[16]:
          one     two
dancy     NaN  4444.0
ball    200.0   222.0
apple   100.0   111.0

In [17]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])


Out[17]:
          two five
dancy  4444.0  NaN
ball    222.0  NaN
apple   111.0  NaN

Create DataFrame from list of Python dictionaries


In [18]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [19]:
pd.DataFrame(data)


Out[19]:
   alex  alice  dora  ema  joe
0   1.0    NaN   NaN  NaN  2.0
1   NaN   20.0  10.0  5.0  NaN

In [20]:
pd.DataFrame(data, index=['orange', 'red'])


Out[20]:
        alex  alice  dora  ema  joe
orange   1.0    NaN   NaN  NaN  2.0
red      NaN   20.0  10.0  5.0  NaN

In [21]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])


Out[21]:
   joe  dora  alice
0  2.0   NaN    NaN
1  NaN  10.0   20.0

Basic DataFrame operations


In [22]:
df


Out[22]:
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0

In [23]:
df['one']


Out[23]:
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [24]:
df['three'] = df['one'] * df['two']
df


Out[24]:
          one     two    three
apple   100.0   111.0  11100.0
ball    200.0   222.0  44400.0
cerill    NaN   333.0      NaN
clock   300.0     NaN      NaN
dancy     NaN  4444.0      NaN

In [25]:
df['flag'] = df['one'] > 250
df


Out[25]:
          one     two    three   flag
apple   100.0   111.0  11100.0  False
ball    200.0   222.0  44400.0  False
cerill    NaN   333.0      NaN  False
clock   300.0     NaN      NaN   True
dancy     NaN  4444.0      NaN  False

In [26]:
three = df.pop('three')

In [27]:
three


Out[27]:
apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [28]:
df


Out[28]:
          one     two   flag
apple   100.0   111.0  False
ball    200.0   222.0  False
cerill    NaN   333.0  False
clock   300.0     NaN   True
dancy     NaN  4444.0  False

In [29]:
del df['two']

In [30]:
df


Out[30]:
          one   flag
apple   100.0  False
ball    200.0  False
cerill    NaN  False
clock   300.0   True
dancy     NaN  False

In [31]:
df.insert(2, 'copy_of_one', df['one'])
df


Out[31]:
          one   flag  copy_of_one
apple   100.0  False        100.0
ball    200.0  False        200.0
cerill    NaN  False          NaN
clock   300.0   True        300.0
dancy     NaN  False          NaN

In [32]:
df['one_upper_half'] = df['one'][:2]
df


Out[32]:
          one   flag  copy_of_one  one_upper_half
apple   100.0  False        100.0           100.0
ball    200.0  False        200.0           200.0
cerill    NaN  False          NaN             NaN
clock   300.0   True        300.0             NaN
dancy     NaN  False          NaN             NaN
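The .loc and .iloc accessors introduced for Series select DataFrame rows in the same way, by label or by position. A short sketch that rebuilds the df used above:

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# Label-based row selection: returns a Series keyed by column name
print(df.loc['ball'])

# Position-based row selection, as with Series.iloc
print(df.iloc[0])        # the 'apple' row
```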

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using pandas.

Download the Dataset

Please note that you will need to download the dataset yourself. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
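If you prefer to fetch the data from code, the sketch below downloads and unpacks an archive and renames the extracted folder. The URL (the GroupLens "ml-latest-small" dataset) and the folder names are assumptions; adjust them to match the links above and your local setup.

```python
import io
import os
import urllib.request
import zipfile

# Assumed location of the small MovieLens archive; it may change over time
URL = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

archive.extractall('.')                        # creates ./ml-latest-small/
if not os.path.exists('movielens'):
    os.rename('ml-latest-small', 'movielens')  # folder name used in this notebook
```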

Let us look at the files in this dataset using the UNIX command ls.


In [33]:
# Note: Adjust the name of the folder to match your local directory

!ls ./movielens


Icon?

In [34]:
!cat ./movielens/movies.csv | wc -l


cat: ./movielens/movies.csv: No such file or directory
       0

In [35]:
!head -5 ./movielens/ratings.csv


head: ./movielens/ratings.csv: No such file or directory

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • ratings.csv : userId, movieId, rating, timestamp
  • tags.csv : userId, movieId, tag, timestamp
  • movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.


In [36]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-36-2dd15c048abf> in <module>()
----> 1 movies = pd.read_csv('./movielens/movies.csv', sep=',')
      2 print(type(movies))
      3 movies.head(15)

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604 
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606 
   1607         # XXX

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4209)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas/_libs/parsers.c:8873)()

FileNotFoundError: File b'./movielens/movies.csv' does not exist

In [ ]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

In [ ]:
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()

In [ ]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series


In [ ]:
# Extract the 0th row: notice that it is in fact a Series

row_0 = tags.iloc[0]
type(row_0)

In [ ]:
print(row_0)

In [ ]:
row_0.index

In [ ]:
row_0['userId']

In [ ]:
'rating' in row_0

In [ ]:
row_0.name

In [ ]:
row_0 = row_0.rename('first_row')
row_0.name

DataFrames


In [ ]:
tags.head()

In [ ]:
tags.index

In [ ]:
tags.columns

In [ ]:
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]

Descriptive Statistics

Let's look at how the ratings are distributed!


In [ ]:
ratings['rating'].describe()

In [ ]:
ratings.describe()

In [ ]:
ratings['rating'].mean()

In [ ]:
ratings.mean()

In [ ]:
ratings['rating'].min()

In [ ]:
ratings['rating'].max()

In [ ]:
ratings['rating'].std()

In [ ]:
ratings['rating'].mode()

In [ ]:
ratings.corr()

In [ ]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

In [ ]:
filter_2 = ratings['rating'] > 0
filter_2.all()

Data Cleaning: Handling Missing Data


In [ ]:
movies.shape

In [ ]:
# Are any values NULL?

movies.isnull().any()

That's nice! No NULL values!


In [ ]:
ratings.shape

In [ ]:
# Are any values NULL?

ratings.isnull().any()

That's nice! No NULL values!


In [ ]:
tags.shape

In [ ]:
# Are any values NULL?

tags.isnull().any()

We have some tags which are NULL.


In [ ]:
tags = tags.dropna()

In [ ]:
# Check again: are any values NULL?

tags.isnull().any()

In [ ]:
tags.shape

That's nice! No NULL values! Notice that the number of rows has gone down.
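Dropping incomplete rows is one option; depending on the analysis, filling in a placeholder with fillna may be preferable. A minimal sketch on toy data (not the MovieLens files):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'tag': ['funny', np.nan, 'dark']})

# fillna replaces NaN with a sentinel instead of discarding the row
filled = toy['tag'].fillna('(no tag)')
print(filled.tolist())   # ['funny', '(no tag)', 'dark']
```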

Data Visualization


In [ ]:
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

In [ ]:
ratings.boxplot(column='rating', figsize=(15,20))

Slicing Out Columns


In [ ]:
tags['tag'].head()

In [ ]:
movies[['title','genres']].head()

In [ ]:
ratings[-10:]

In [ ]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

In [ ]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))

Filters for Selecting Rows


In [ ]:
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

In [ ]:
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

In [ ]:
movies[is_animation].head(15)

Group By and Aggregate


In [ ]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [ ]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()

Merge Dataframes


In [ ]:
tags.head()

In [ ]:
movies.head()

In [ ]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()


Combine aggregation, merging, and filters to get useful analytics


In [ ]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [ ]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [ ]:
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

In [ ]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

In [ ]:
box_office[is_comedy & is_highly_rated][-5:]

Vectorized String Operations


In [ ]:
movies.head()


Split 'genres' into multiple columns


In [ ]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [ ]:
movie_genres[:10]


Add a new column for comedy genre flag


In [ ]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [ ]:
movie_genres[:10]


Extract year from title e.g. (1995)


In [ ]:
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=True)

In [ ]:
movies.tail()


More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!


In [ ]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [ ]:
tags.dtypes

Unix time / POSIX time / epoch time records time in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


In [ ]:
tags.head(5)

In [ ]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

The data type datetime64[ns] maps to either <M8[ns] or >M8[ns], depending on the hardware's byte order.


In [ ]:
tags['parsed_time'].dtype

In [ ]:
tags.head(2)

Selecting rows based on timestamps


In [ ]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

Sorting the table using the timestamps


In [ ]:
tags.sort_values(by='parsed_time', ascending=True)[:10]

Average Movie Ratings over Time


In [ ]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [ ]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()

In [ ]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]

In [ ]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)

Do some years look better for box-office movies than others?

Does any data point seem like an outlier in some sense?
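One quick check for outliers: years whose averages rest on only a handful of ratings are natural suspects, since a few movies can swing the mean. A sketch with made-up data (the column names mirror the joined table above):

```python
import pandas as pd

toy = pd.DataFrame({'year':   ['1920', '1920', '1995', '1995', '1995'],
                    'rating': [5.0,    1.0,    3.5,    4.0,    3.0]})

# Aggregate the mean together with a count: an extreme average backed
# by few ratings is a weaker signal than one backed by many.
summary = toy.groupby('year')['rating'].agg(['mean', 'count'])
print(summary)
```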

