In [11]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from scipy import stats
# import qgrid
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
df = pd.read_csv('buffy.csv')

Your first step should probably be to check whether the DataFrame is formatted the way you expect. Here's the definition of a DataFrame according to Python for Data Analysis:

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
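To make the "different value type per column" part of that definition concrete, here is a minimal sketch. The name `sample` is introduced here for illustration only, and its three rows are copied from the df.head() output shown further down in this notebook, so the block runs without buffy.csv.

```python
import pandas as pd

# Illustrative frame; values copied from the first three rows of df.head()
sample = pd.DataFrame({
    'Character': ['Buffy', 'Xander', 'Willow'],
    'Height (inches)': [64.0, 70.0, 64.5],
    'Number of Episodes': [145, 145, 144],
})

# Each column carries its own dtype: text, float and integer respectively
column_types = sample.dtypes
print(column_types)
```

Running df.dtypes on the full DataFrame is a quick way to confirm that read_csv inferred the types you expected.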


In [24]:
df.columns


Out[24]:
Index([u'Character', u'Species', u'Height (inches)', u'Actor DOB', u'Number of Episodes', u'Ranking', u'Gender'], dtype='object')

In [25]:
# display first five rows
df.head()


Out[25]:
Character Species Height (inches) Actor DOB Number of Episodes Ranking Gender
0 Buffy Human 64.0 4/14/77 145 5 F
1 Xander Human 70.0 4/12/71 145 11 M
2 Willow Human 64.5 3/24/74 144 1 F
3 Giles Human 73.0 2/20/54 123 3 M
4 Cordelia Human 67.0 7/23/70 58 7 F

In [26]:
df.tail()


Out[26]:
Character Species Height (inches) Actor DOB Number of Episodes Ranking Gender
9 Tara Human 64 1/8/77 47 13 F
10 Dawn Human 66 10/11/85 66 50 F
11 Joyce Human 67 4/17/55 58 24 F
12 Faith Human 65 12/30/80 20 4 F
13 Drusilla Vampire 66 3/30/65 17 12 F

In [27]:
df.describe()


Out[27]:
Height (inches) Number of Episodes Ranking
count 14.000000 14.000000 14.000000
mean 66.857143 78.857143 12.785714
std 3.236977 45.195060 12.741082
min 63.500000 17.000000 1.000000
25% 64.125000 49.750000 4.250000
50% 66.000000 62.500000 10.500000
75% 68.500000 116.500000 16.750000
max 73.000000 145.000000 50.000000
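Note that the summary above covers only the three numeric columns; by default describe() skips text columns such as Character and Species when numeric columns are present. A small sketch of how to include them, using an illustrative frame called `sample` (a name assumed here) built from rows of the df.head() output:

```python
import pandas as pd

# Illustrative frame; values copied from the first three rows of df.head()
sample = pd.DataFrame({
    'Character': ['Buffy', 'Xander', 'Willow'],
    'Species': ['Human', 'Human', 'Human'],
    'Number of Episodes': [145, 145, 144],
})

# Default: numeric columns only
numeric_summary = sample.describe()

# include='all' adds the text columns, reporting count, unique, top and freq for them
full_summary = sample.describe(include='all')
print(full_summary)
```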

In [28]:
# check for null values (True means a value is present, False means it is missing)
pd.notnull(df)


Out[28]:
Character Species Height (inches) Actor DOB Number of Episodes Ranking Gender
0 True True True True True True True
1 True True True True True True True
2 True True True True True True True
3 True True True True True True True
4 True True True True True True True
5 True True True True True True True
6 True True True True True True True
7 True True True True True True True
8 True True True True True True True
9 True True True True True True True
10 True True True True True True True
11 True True True True True True True
12 True True True True True True True
13 True True True True True True True
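Scanning a full grid of True values works for 14 rows but gets unwieldy for larger data. As a sketch of a more compact check, the block below counts missing values per column. The frame `sample` is illustrative, and one height has been deliberately replaced with NaN so there is something to find; that missing value is an assumption for the demo and is not in buffy.csv.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the NaN is inserted on purpose and is NOT in the real data
sample = pd.DataFrame({
    'Character': ['Buffy', 'Xander', 'Willow'],
    'Height (inches)': [64.0, np.nan, 64.5],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Single True/False answer: is anything missing anywhere?
any_missing = bool(sample.isnull().any().any())

print(missing_per_column)
print(any_missing)
```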

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:


In [29]:
# attribute
df.Character


Out[29]:
0        Buffy
1       Xander
2       Willow
3        Giles
4     Cordelia
5        Angel
6           Oz
7        Spike
8         Anya
9         Tara
10        Dawn
11       Joyce
12       Faith
13    Drusilla
Name: Character, dtype: object

In [30]:
# dict
df['Character']


Out[30]:
0        Buffy
1       Xander
2       Willow
3        Giles
4     Cordelia
5        Angel
6           Oz
7        Spike
8         Anya
9         Tara
10        Dawn
11       Joyce
12       Faith
13    Drusilla
Name: Character, dtype: object
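The two notations return the same Series, but attribute access only works when the column name is a valid Python identifier. A short sketch with an illustrative frame `sample` (name assumed here, rows copied from df.head()):

```python
import pandas as pd

# Illustrative frame; values copied from the first three rows of df.head()
sample = pd.DataFrame({
    'Character': ['Buffy', 'Xander', 'Willow'],
    'Height (inches)': [64.0, 70.0, 64.5],
})

# Both notations give the same Series for a simple column name
same_result = sample.Character.equals(sample['Character'])

# 'Height (inches)' contains a space and parentheses, so it cannot be written
# as sample.Height (inches); dict-like notation is the only option
heights = sample['Height (inches)']

print(same_result)
print(heights)
```

For that reason dict-like notation is the safer default for columns such as 'Height (inches)', 'Actor DOB' and 'Number of Episodes'.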

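Rows can also be retrieved by position or by label with the indexing attributes loc (label-based) and iloc (position-based). These replace the older ix indexer, which was deprecated and then removed in pandas 1.0 (much more on this later):


In [33]:
df.loc[5]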


Out[33]:
Character               Angel
Species               Vampire
Height (inches)            73
Actor DOB             5/16/69
Number of Episodes         59
Ranking                    19
Gender                      M
Name: 5, dtype: object

In [ ]:
# select one row by label together with a subset of columns
df.loc[5, ['Character', 'Species']]
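With the default integer index, row labels and row positions happen to coincide, which hides the difference between label-based and position-based selection. The sketch below uses an illustrative frame `sample` whose index labels 10, 11 and 12 are hypothetical, chosen only to make the two kinds of lookup visibly different; the row values are copied from df.head().

```python
import pandas as pd

# Illustrative frame; the index labels 10, 11, 12 are hypothetical
sample = pd.DataFrame(
    {
        'Character': ['Buffy', 'Xander', 'Willow'],
        'Species': ['Human', 'Human', 'Human'],
        'Number of Episodes': [145, 145, 144],
    },
    index=[10, 11, 12],
)

# Label-based: the row whose index label is 11
by_label = sample.loc[11]

# Position-based: the second row, whatever its label is
by_position = sample.iloc[1]

# Label-based row selection combined with a list of column names
subset = sample.loc[11, ['Character', 'Species']]

print(by_label)
print(subset)
```

Both lookups return Xander's row here; sample.loc[1] would raise a KeyError because no row carries the label 1.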