As a personal preference, I believe it is better not to import all the functions into the current namespace.
In [5]:
import numpy as np
import pandas as pd
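One reason to keep the `np.` prefix can be sketched as follows: NumPy exports functions whose names match Python built-ins but whose arguments mean something different. The snippet below is a small illustration with made-up inputs, not a complete argument against star imports.

```python
import numpy as np

# Built-in sum: the second argument is the start value,
# so the result is 0 + 1 + 2 + 3 + 4 + (-1) = 9.
builtin_total = sum(range(5), -1)

# np.sum: the second argument is the axis,
# so this sums along the last axis and gives 10.
numpy_total = np.sum(range(5), -1)

print(builtin_total, numpy_total)
```

With `from numpy import *`, a plain `sum(...)` call would silently switch from the first behaviour to the second.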
Pandas provides three types of data structures.
Data Structure | Dimensions |
---|---|
Series | 1-Dim |
DataFrame | 2-Dim |
Panel | 3-Dim |
We will be dealing with Series and DataFrames. We will not be handling Panel here; it was deprecated and has been removed from later pandas releases.
All data structures have both list-like and dict-like properties.
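A minimal sketch of what "list-like" and "dict-like" can mean for a Series, using made-up labels and values:

```python
import pandas as pd

# Made-up values purely for illustration.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Dict-like behaviour: index labels act as keys.
has_b = 'b' in s
value_b = s['b']
labels = list(s.keys())

# List-like behaviour: length, iteration over values, positional slicing.
length = len(s)
values = list(s)
first_two = s.iloc[:2].tolist()

print(has_b, value_b, labels, length, values, first_two)
```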
A Series in its simplest form can be created from a dict.
In [4]:
data = {'Mon': 'Monday',
        'Tues': 'Tuesday',
        'Wed': 'Wednesday',
        'Thurs': 'Thursday',
        }
s = pd.Series(data)
s
Out[4]:
In [5]:
s.index
Out[5]:
A Series can also be created from a sequence of values and a sequence of index labels.
In [19]:
s = pd.Series(np.random.randint(5, 15, 7),
              index=('Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'),
              name='Temperature')
In [20]:
s.index.name = "Day of the Week"
In [21]:
s
Out[21]:
In [22]:
s['Tues']
Out[22]:
In [23]:
'Mon' in s
Out[23]:
In [24]:
'Son' in s
Out[24]:
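The `in` tests above check the index labels, not the stored values. A short sketch with made-up temperatures shows the difference and two ways to test values instead:

```python
import pandas as pd

# Made-up temperatures for three days.
s = pd.Series([12, 9, 14], index=['Mon', 'Tues', 'Wed'])

label_present = 'Mon' in s            # True: 'Mon' is an index label
value_as_label = 12 in s              # False: 12 is a value, not a label

# To test the values, look at the underlying array or use isin().
value_present = 12 in s.values
any_match = s.isin([12]).any()

print(label_present, value_as_label, value_present, any_match)
```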
The Series can also be sliced by index label.
In [25]:
s['Thur':'Sun']
Out[25]:
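Label slices include the end label, whereas positional slices exclude the end position. A sketch with fixed, made-up temperatures so the result is reproducible:

```python
import pandas as pd

days = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
s = pd.Series([12, 9, 14, 7, 10, 11, 8], index=days)  # made-up values

# Label-based slice: both 'Thur' and 'Sun' are included.
by_label = s['Thur':'Sun']

# Position-based slice: position 6 ('Sun') is excluded.
by_position = s.iloc[3:6]

print(list(by_label.index))
print(list(by_position.index))
```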
In [26]:
s.max()
Out[26]:
In [27]:
s + 2*s #Vectorized operation
Out[27]:
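"Vectorized" here means the arithmetic is applied to every element without an explicit Python loop. The sketch below, with made-up values, compares the vectorized expression with an equivalent list comprehension:

```python
import pandas as pd

s = pd.Series([12, 9, 14], index=['Mon', 'Tues', 'Wed'])  # made-up values

# Vectorized: one expression operates on all elements at once.
vectorized = s + 2 * s

# Equivalent explicit loop, shown only for comparison.
looped = pd.Series([v + 2 * v for v in s], index=s.index)

print(vectorized.tolist(), looped.tolist())
```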
In [28]:
s.iloc[1]  # Accessing a value by position
Out[28]:
In [29]:
s[2:5] #Slicing the Series by position
Out[29]:
In [33]:
s[:1]
Out[33]:
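On a Series with string labels, a bare integer key relies on a positional fallback that newer pandas releases deprecate. A sketch of the explicit indexers, `.iloc` for positions and `.loc` for labels, using made-up values:

```python
import pandas as pd

days = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
s = pd.Series([12, 9, 14, 7, 10, 11, 8], index=days)  # made-up values

second = s.iloc[1]        # value at position 1
middle = s.iloc[2:5]      # positions 2, 3 and 4
first = s.iloc[:1]        # a one-element Series, not a scalar
by_label = s.loc['Tues']  # same element as position 1, fetched by label

print(second, middle.tolist(), first.tolist(), by_label)
```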
In [39]:
s - np.random.randint(5, 15, 7)
Out[39]:
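Subtracting a NumPy array, as above, pairs elements by position. Subtracting another Series pairs elements by index label instead. A small sketch with made-up values illustrates the difference:

```python
import numpy as np
import pandas as pd

a = pd.Series([10, 20, 30], index=['Mon', 'Tues', 'Wed'])
b = pd.Series([1, 2, 3], index=['Wed', 'Tues', 'Mon'])  # same labels, reversed order

# Series minus array: element i of the array is subtracted from element i of the Series.
minus_array = a - np.array([1, 2, 3])

# Series minus Series: values are matched by label, regardless of order.
minus_series = a - b

print(minus_array.tolist())
print(minus_series['Mon'], minus_series['Tues'], minus_series['Wed'])
```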
In [42]:
for x in s: print(x)  # iterating over values
In [43]:
for pos, value in enumerate(s): print(pos, ':', value)
In [44]:
for key, value in s.items(): print(key, ':', value)
A DataFrame is a two-dimensional table, and probably the most used data structure in Pandas. Different columns can have different data types, but all the values within a single column share the same data type.
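The per-column typing can be inspected through `dtypes`. The sketch below uses made-up city data; the column names are only illustrative.

```python
import pandas as pd

# Made-up data: one text column, one integer column, one float column.
df = pd.DataFrame({
    'city': ['Chennai', 'Mumbai', 'Delhi'],
    'temperature': [31, 22, 9],
    'humidity': [0.70, 0.65, 0.40],
})

# Each column reports its own dtype.
print(df.dtypes)

# Mixing ints and floats in one column yields a single common dtype.
mixed = pd.Series([1, 2.5])
print(mixed.dtype)
```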
A DataFrame can be created from several kinds of input, such as a dict of column values.
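A few inputs that `pd.DataFrame` accepts are a dict of columns, a list of row dicts, and a 2-D NumPy array. The sketch below builds the same small table from each; the values are made up, and these are only some of the accepted inputs.

```python
import numpy as np
import pandas as pd

# From a dict mapping column names to sequences of values.
from_dict = pd.DataFrame({'Chennai': [30, 31], 'Delhi': [8, 9]})

# From a list of dicts, one dict per row.
from_records = pd.DataFrame([
    {'Chennai': 30, 'Delhi': 8},
    {'Chennai': 31, 'Delhi': 9},
])

# From a 2-D NumPy array plus explicit column names.
from_array = pd.DataFrame(np.array([[30, 8], [31, 9]]),
                          columns=['Chennai', 'Delhi'])

print(from_dict)
```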
Now let us look at the obligatory Day-Temperature example.
In [1]:
import datetime
In [13]:
base = datetime.datetime.today()
days = 20
date_list = [base - datetime.timedelta(days=x) for x in range(0, days)]
date_list = [datetime.date(x.year, x.month, x.day) for x in date_list]
date_list.reverse()
data = {'date':date_list,
'Chennai':np.random.randint(25,35,days),
'Mumbai':np.random.randint(15,25,days),
'Delhi':np.random.randint(5,15,days)}
df = pd.DataFrame(data)
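As an alternative to building the date list by hand, `pd.date_range` can generate consecutive days directly. This is a sketch of a different technique from the cell above: it produces pandas timestamps rather than `datetime.date` objects, and the fixed end date is an assumption chosen for reproducibility.

```python
import pandas as pd

days = 20

# A fixed, hypothetical end date; pd.Timestamp.today().normalize()
# could be used instead to end on the current day.
dates = pd.date_range(end='2024-01-20', periods=days, freq='D')

print(dates[0], dates[-1], len(dates))
```

The resulting `DatetimeIndex` is already in ascending order, so no `reverse()` step is needed.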
In [14]:
type(df)
Out[14]:
In [15]:
df.head()
Out[15]:
In [18]:
df = df.set_index('date')
In [19]:
df.head()
Out[19]:
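`set_index` returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to `df` above. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2024-01-01', '2024-01-02'],  # made-up rows
                   'Delhi': [8, 9]})

indexed = df.set_index('date')

# The original still has 'date' as an ordinary column.
print(list(df.columns))

# In the new frame, 'date' has moved from the columns into the index.
print(list(indexed.columns), indexed.index.name)

# Rows can now be looked up by date label.
delhi_second_day = indexed.loc['2024-01-02', 'Delhi']
```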
In [20]:
df.median()
Out[20]:
In [21]:
df.mean()
Out[21]:
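By default these aggregations work column by column and return one value per column; passing `axis=1` aggregates across each row instead. A sketch with small made-up numbers that can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({'Chennai': [30, 32, 34],
                   'Delhi': [8, 10, 15]})  # made-up temperatures

col_means = df.mean()        # one mean per city
col_medians = df.median()    # one median per city
row_means = df.mean(axis=1)  # one mean per row, across cities

print(col_means.tolist(), col_medians.tolist(), row_means.tolist())
```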
In [24]:
df.diff().head()
Out[24]:
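`diff()` subtracts the previous row from each row, so the first row has nothing to compare against and becomes NaN. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Delhi': [8, 11, 9, 14]})  # made-up temperatures

changes = df.diff()

# First row is NaN; the remaining rows are day-over-day changes.
print(changes['Delhi'].tolist())
```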
In [25]:
titanic = pd.read_csv('data/titanic.csv')
In [33]:
titanic = titanic.set_index('PassengerId')
In [34]:
titanic.head()
Out[34]:
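Reading the file and setting the index can also be done in one step with the `index_col` argument of `read_csv`. The sketch below uses a small in-memory CSV with made-up rows in place of `data/titanic.csv`, and includes only a few of the columns.

```python
import io
import pandas as pd

# Made-up sample standing in for the real file.
csv_text = (
    "PassengerId,Survived,Fare\n"
    "1,0,7.25\n"
    "2,1,71.28\n"
)

passengers = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(passengers)
```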
In [29]:
len(titanic)
Out[29]:
In [30]:
titanic.Fare.sum()
Out[30]:
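`titanic.Fare` is attribute-style access to a column and is equivalent to `titanic['Fare']`. The bracket form also works for column names that contain spaces or clash with DataFrame method names. A sketch with made-up fares:

```python
import pandas as pd

df = pd.DataFrame({'Fare': [7.25, 71.25, 8.0],       # made-up values
                   'Ticket Class': [3, 1, 3]})       # hypothetical column name with a space

total_attr = df.Fare.sum()        # attribute access
total_item = df['Fare'].sum()     # bracket access, same column

# Only bracket access works for a name containing a space.
classes = df['Ticket Class'].tolist()

print(total_attr, total_item, classes)
```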
In [31]:
titanic.Survived.value_counts()
Out[31]:
In [35]:
titanic.Pclass.value_counts()
Out[35]:
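`value_counts()` tallies how often each distinct value occurs, sorted from most to least frequent; `normalize=True` returns proportions instead of counts. A sketch on a short made-up column:

```python
import pandas as pd

survived = pd.Series([0, 1, 0, 0, 1])  # made-up outcomes

counts = survived.value_counts()                # occurrences of each value
shares = survived.value_counts(normalize=True)  # fraction of each value

print(counts.loc[0], counts.loc[1])
print(shares.loc[0], shares.loc[1])
```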