Python for Data Analysis Lightning Tutorials

Pandas Cookbook Series

Python for Data Analysis Lightning Tutorials is a series of tutorials in Data Analysis, Statistics, and Graphics using Python. The Pandas Cookbook series of tutorials provides recipes for common tasks and moves on to more advanced topics in statistics and time series analysis.

Created by Alfred Essa, May 26th, 2014

Note: IPython Notebook and Data files can be found at my Github Site: http://github/alfredessa

4.1b Introduction

In this tutorial we learn to use "queries" and "filters" in Pandas.

4.11 Preliminaries


In [1]:
# Load pandas and numpy libraries
import pandas as pd
import numpy as np

In [2]:
# Set default option for Pandas
pd.set_option('display.max_rows',10)

In [3]:
# Plot inline in notebook
%pylab inline


Populating the interactive namespace from numpy and matplotlib

4.12 Read Data


In [4]:
# Read dataset
auto = pd.read_csv('data/auto.csv')

In [5]:
# Show first lines of data
auto.head()


Out[5]:
make price mpg repairs weight length foreign
0 AMC 4099 22 3 2930 186 0
1 AMC 4749 17 3 3350 173 0
2 AMC 3799 22 3 2640 168 0
3 Audi 9690 17 5 2830 189 1
4 Audi 6295 23 3 2070 174 1

5 rows × 7 columns

4.13 Apply Filters


In [6]:
# search in column "forum" for value == 1
auto.foreign==1


Out[6]:
0    False
1    False
2    False
...
23    True
24    True
25    True
Name: foreign, Length: 26, dtype: bool

In [7]:
# set variable "mask" to the search filter
mask = auto.foreign==1

In [8]:
mask


Out[8]:
0    False
1    False
2    False
...
23    True
24    True
25    True
Name: foreign, Length: 26, dtype: bool

In [9]:
# apply the filter to the dataset
auto[mask]


Out[9]:
make price mpg repairs weight length foreign
3 Audi 9690 17 5 2830 189 1
4 Audi 6295 23 3 2070 174 1
5 BMW 9735 25 4 2650 177 1
22 Datsun 6229 23 4 2370 170 1
23 Datsun 4589 35 5 2020 165 1
24 Datsun 5079 24 4 2280 170 1
25 Datsun 8129 21 4 2750 184 1

7 rows × 7 columns


In [10]:
# create a new dataframe (subset of original)
foreign = auto[mask]

In [11]:
foreign


Out[11]:
make price mpg repairs weight length foreign
3 Audi 9690 17 5 2830 189 1
4 Audi 6295 23 3 2070 174 1
5 BMW 9735 25 4 2650 177 1
22 Datsun 6229 23 4 2370 170 1
23 Datsun 4589 35 5 2020 165 1
24 Datsun 5079 24 4 2280 170 1
25 Datsun 8129 21 4 2750 184 1

7 rows × 7 columns


In [12]:
# apply inverse filter
domestic = auto[np.invert(mask)]

In [13]:
domestic


Out[13]:
make price mpg repairs weight length foreign
0 AMC 4099 22 3 2930 186 0
1 AMC 4749 17 3 3350 173 0
2 AMC 3799 22 3 2640 168 0
6 Buick 4816 20 3 3250 196 0
7 Buick 7827 15 4 4080 222 0
8 Buick 5788 18 3 3670 218 0
9 Buick 4453 26 3 2230 170 0
10 Buick 5189 20 3 3280 200 0
11 Buick 10372 16 3 3880 207 0
12 Buick 4082 19 3 3400 200 0
... ... ... ... ... ... ...

19 rows × 7 columns


In [17]:
mask2 = ((auto.mpg>20) & (auto.price<5000))

In [18]:
myselection = auto[mask2]

In [19]:
myselection


Out[19]:
make price mpg repairs weight length foreign
0 AMC 4099 22 3 2930 186 0
2 AMC 3799 22 3 2640 168 0
9 Buick 4453 26 3 2230 170 0
16 Chevrolet 3299 29 3 2110 163 0
18 Chevrolet 4504 22 3 3180 193 0
20 Chevrolet 3667 24 2 2750 179 0
23 Datsun 4589 35 5 2020 165 1

7 rows × 7 columns

4.14 Practice using crime data set


In [20]:
# read data
crime = pd.read_csv('data/crime.csv')

In [21]:
# verify data
crime.head()


Out[21]:
State Type of Crime Crime Year Count
0 Alabama Violent Crime Murder and nonnegligent Manslaughter 1960 406
1 Alabama Violent Crime Murder and nonnegligent Manslaughter 1961 427
2 Alabama Violent Crime Murder and nonnegligent Manslaughter 1962 316
3 Alabama Violent Crime Murder and nonnegligent Manslaughter 1963 340
4 Alabama Violent Crime Murder and nonnegligent Manslaughter 1964 316

5 rows × 5 columns


In [30]:
# set mask
mask3 = ((crime.State=='California') & (crime.Crime=='Murder and nonnegligent Manslaughter'))

In [31]:
# apply mask or filter
cal_murder = crime[mask3]

In [32]:
cal_murder


Out[32]:
State Type of Crime Crime Year Count
1288 California Violent Crime Murder and nonnegligent Manslaughter 1960 616
1289 California Violent Crime Murder and nonnegligent Manslaughter 1961 605
1290 California Violent Crime Murder and nonnegligent Manslaughter 1962 657
1291 California Violent Crime Murder and nonnegligent Manslaughter 1963 673
1292 California Violent Crime Murder and nonnegligent Manslaughter 1964 740
1293 California Violent Crime Murder and nonnegligent Manslaughter 1965 880
1294 California Violent Crime Murder and nonnegligent Manslaughter 1966 868
1295 California Violent Crime Murder and nonnegligent Manslaughter 1967 1039
1296 California Violent Crime Murder and nonnegligent Manslaughter 1968 1150
1297 California Violent Crime Murder and nonnegligent Manslaughter 1969 1386
... ... ... ... ...

46 rows × 5 columns


In [33]:
# plot data
cal_murder.plot(x='Year', y='Count')


Out[33]:
<matplotlib.axes.AxesSubplot at 0x11038e710>

In [ ]: