In [1]:
import pandas as pd
from pandas import Series, DataFrame
Since Series and DataFrame are used frequently, they should be imported directly by name.
In [2]:
ps = Series([4,2,1,3])
print(ps)
Get the values and the index like this:
In [3]:
print(ps.values)
print(ps.index)
ps[0]
Out[3]:
To use a custom index, do this:
In [4]:
ps2 = Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])
ps2
Out[4]:
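With a custom index in place, values can be selected by label. The following is a minimal sketch, assuming the same data as ps2; the variable names are illustrative only.

```python
import pandas as pd

# Same data as ps2, with the index passed by keyword for clarity.
s = pd.Series([4, 7, -1, 8], index=['a', 'b', 'c', 'd'])

by_label = s['c']         # a single label returns a scalar value
subset = s[['a', 'd']]    # a list of labels returns a smaller Series
positive = s[s > 0]       # a boolean mask keeps the matching index labels
```

Filtering with a mask preserves the link between each value and its label, which is the main difference from a plain list or array.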
Often, you will want to create a Series from a Python dict:
In [5]:
ps3 = Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
ps3
Out[5]:
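When a Series is built from a dict, the dict keys become the index. A short sketch of how an explicit index interacts with those keys is shown here; the 'California' label is a hypothetical entry that is not in the original data.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# The dict keys become the index labels.
from_dict = pd.Series(sdata)

# Passing an index selects and orders the keys. 'California' (hypothetical)
# has no value in the dict, so it becomes NaN; 'Utah' is left out.
states = ['California', 'Ohio', 'Oregon', 'Texas']
aligned = pd.Series(sdata, index=states)
```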
In [6]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = DataFrame(data)
df
Out[6]:
You can control the column order by passing a sequence of column names:
In [7]:
DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]:
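The columns argument does more than reorder. As a small sketch, assuming the same data dict, a name that is not a key of the dict yields a column of missing values; 'debt' is a hypothetical column name used only for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' (hypothetical) is not a key in data, so the column is filled with NaN.
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
```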
In addition to index and values, a DataFrame has a columns attribute:
In [8]:
print(df.index)
print()
print(df.values)
print()
print(df.columns)
You can get a specific column like this:
In [9]:
df['state']
Out[9]:
Rows can be retrieved by label with the loc indexer (the older ix indexer was removed in pandas 1.0):
In [10]:
df.loc[0]
Out[10]:
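Row selection can be label-based or position-based. The sketch below uses the loc and iloc indexers on a small frame; the string index labels ('one', 'two', 'three') are hypothetical and chosen so that labels and positions differ.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]},
                     index=['one', 'two', 'three'])  # hypothetical labels

row_by_label = frame.loc['two']    # select by index label
row_by_position = frame.iloc[1]    # select by integer position (0-based)
```

Both expressions return the same row as a Series whose index is the frame's column names.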
Another common way to create a DataFrame is from a nested dict of dicts or a dict of Series. The outer keys become the columns and the inner keys become the row index:
In [11]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df2 = DataFrame(pop)
df2
Out[11]:
You can pass an explicit index when creating a DataFrame:
In [12]:
df3 = DataFrame(pop, index=[2001, 2002, 2003])
df3
Out[12]:
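An explicit index acts as a selection over the inner dict keys. As a rough sketch with the same pop dict, labels that appear in no inner dict produce rows of missing values, and inner keys not listed in the index are dropped.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame = pd.DataFrame(pop, index=[2001, 2002, 2003])

# 2003 has no data in either inner dict, so the whole row is NaN.
row_2003_missing = frame.loc[2003].isna().all()
# 2000 exists for Ohio but was not requested, so it is not in the result.
has_2000 = 2000 in frame.index
```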
If a DataFrame’s index and columns have their name attributes set, these will also be displayed:
In [13]:
df2.index.name = 'year'
df2.columns.name = 'state'
df2
Out[13]:
A third common input structure is a list of dicts or Series, where each element becomes one row:
In [14]:
films = [{'star': 9.3, 'title': 'The Shawshank Redemption', 'content_rating': 'R'},
         {'star': 9.2, 'title': 'The Godfather', 'content_rating': 'R'},
         {'star': 9.1, 'title': 'The Godfather: Part II', 'content_rating': 'R'}]
df3 = DataFrame(films)
df3
Out[14]:
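The dicts in the list do not need identical keys. In this sketch, which uses a trimmed, illustrative version of the films data, the columns are the union of all keys and a key missing from a record becomes NaN in that row.

```python
import pandas as pd

# Illustrative records: the first one deliberately lacks 'content_rating'.
records = [{'title': 'The Godfather', 'star': 9.2},
           {'title': 'The Godfather: Part II', 'star': 9.1, 'content_rating': 'R'}]

frame = pd.DataFrame(records)
```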
More on DataFrame manipulation will come later.
There are two main functions for reading data from a file into a DataFrame: read_table and read_csv. read_csv is exactly the same as read_table, except that it assumes a comma separator, whereas read_table assumes a tab.
You can read a data set using read_table like so:
In [15]:
orders = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/chipotle.tsv')
orders.head(5)
Out[15]:
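The relationship between the two readers can be checked without a network connection. This is a minimal sketch that parses a small in-memory, tab-separated string; the column names and rows are made-up sample data, not the contents of the file above.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample with a header row.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tChips\n2\t2\tSoda\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv needs the separator spelled out to read the same text.
from_csv = pd.read_csv(io.StringIO(tsv_text), sep='\t')
```

Both calls produce the same DataFrame, so the choice between them is mostly a matter of which default separator fits the file.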
A file does not always have a header row. In this case, you can use default column names or specify column names yourself:
In [16]:
users = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None)
users.head(5)
Out[16]:
In [17]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users2 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols)
users2.head(5)
Out[17]:
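The effect of header=None and names can be seen on a small in-memory sample. The sketch below uses a few made-up rows in the same pipe-separated layout rather than the real file.

```python
import io
import pandas as pd

# Made-up rows with no header line, fields separated by '|'.
raw = "1|30|F|engineer|12345\n2|41|M|writer|67890\n"

# Without names, pandas labels the columns 0, 1, 2, ...
no_names = pd.read_table(io.StringIO(raw), sep='|', header=None)

# With names, the supplied labels are used instead.
cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
named = pd.read_table(io.StringIO(raw), sep='|', header=None, names=cols)
```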
You can choose a specific column to be the index instead of the default integer index generated by pandas:
In [18]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users3 = pd.read_table('https://raw.githubusercontent.com/minhhh/charts/master/pandas/data/u.user', sep='|', header=None, names=user_cols, index_col='user_id')
users3.head(5)
Out[18]:
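Once a column is used as the index, it is removed from the regular columns and rows can be looked up by its values. A short sketch on made-up, in-memory rows (not the real file) illustrates this.

```python
import io
import pandas as pd

# Made-up rows with no header line, fields separated by '|'.
raw = "1|30|F|engineer|12345\n2|41|M|writer|67890\n"
cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = pd.read_table(io.StringIO(raw), sep='|', header=None,
                      names=cols, index_col='user_id')

# user_id is now the index, so rows are selected by user id with loc.
age_of_user_2 = users.loc[2, 'age']
```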