What follows is a fairly thorough introduction to the pandas library.
I broke it into three parts because it felt too long and daunting as a single piece.

  • Part 1: Intro to pandas data structures, covers the basics of the library's two main data structures - Series and DataFrames.
  • Part 2: Working with DataFrames, dives a bit deeper into the functionality of DataFrames. It shows how to inspect, select, filter, merge, combine, and group your data.
  • Part 3: Using pandas with the MovieLens dataset, applies what the first two parts cover to answer a few basic analysis questions about the MovieLens ratings data.

If you'd like to follow along, you can find the necessary CSV files here and the MovieLens dataset download link here.
My goal for this tutorial is to teach the basics of pandas by comparing and contrasting its syntax with SQL.
If you're interested in learning more about the library, pandas author Wes McKinney has written Python for Data Analysis, which covers it in much greater detail.
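
To give a flavor of the SQL comparison, here is a minimal sketch of one query written both ways. The people table and its name and age columns are hypothetical, invented only for this illustration, and the SQL appears as a comment rather than being executed.

```python
import pandas as pd

# Hypothetical table, invented purely for illustration.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cho"],
    "age": [28, 35, 41],
})

# SQL equivalent:
#   SELECT name, age FROM people WHERE age > 30;
# In pandas, the boolean mask plays the role of WHERE and
# the column list plays the role of the SELECT clause.
over_30 = people.loc[people["age"] > 30, ["name", "age"]]
print(over_30)
```

The row filter and the column selection happen in a single .loc call, which is roughly how the WHERE and SELECT clauses combine in SQL.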

Part 1 -- Data Structures

The data structures you will deal with most often in pandas are the Series and the DataFrame, both of which are built on top of NumPy.
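
As a rough illustration of that relationship, the sketch below builds a small Series and DataFrame from made-up values and shows that, with the default numeric dtypes, the data can be pulled out as a NumPy array and passed to NumPy functions. The variable names and values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up values, used only for illustration.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# The values of a Series can be retrieved as a NumPy array.
values = s.to_numpy()
print(type(values))

# Each column of a DataFrame is itself a Series.
print(type(df["x"]))

# NumPy functions apply element-wise and keep the index labels.
roots = np.sqrt(s)
print(roots)
```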


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Show up to 50 columns when displaying a DataFrame
pd.set_option('display.max_columns', 50)
%matplotlib inline

Series


In [ ]: