Getting started

  • Python basiscs
  • Loading data
  • Plotting data

Python basics

For a data scientist learning some basic data structures available in python is very important for efficient usage of the tools available in the python ecosystem.

Variables

Variables can be defined using any alphanumeric string which starts with an alphabet or "_". It can include "_" in it.


In [1]:
a = 10
b = 20
c = "Hello"

print a, b, c


10 20 Hello

Lists

A list is a data structure which can hold a list of items of different types. Think of a shopping list. Items in the list can be accessed using zero based index. You will use these when you want to add more data and access the data based on the position in index.


In [2]:
list_items = ["milk", "cereal", "banana", 22.5, [1,2,3]]  ## A list can contain another list and items of different types
print list_items
print "3rd item in the list: ", list_items[2] # Zero based index starts from 0 so 3rd item will have index 2


['milk', 'cereal', 'banana', 22.5, [1, 2, 3]]
3rd item in the list:  banana

Sets

Like list but only store unique items which are hashable (think basic data types like string, ints and not lists, will explain later). Super useful for checking if an item is already in the list. Items are not indexed. So items can only be added or removed. You will use these when you want to keep track of unique items e.g. feature names in the data.


In [3]:
set_items = set([1,2,3, 1])
print set_items
print "Is 1 in set_items: ", 1 in set_items
print "Is 10 in set_items: ", 10 in set_items


set([1, 2, 3])
Is 1 in set_items:  True
Is 10 in set_items:  False

Dictionaries

Like sets but can also map values to each unique item. Essentially, it stores key-value pairs which are useful for fast lookup of items. Think of telephone directory or shopping catalogue. Keys should be of same time as items in sets, but values can be anything. You will use these when you want to keep unique items and their related values e.g. words in the data and the number of times they occur.


In [4]:
item_details = {
    "milk": {
        "brand": "Amul",
        "quantity": 2.5,
        "cost": 10
    },
    "chocolate": {
        "brand": "Cadbury",
        "quantity": 1,
        "cost": 5
    },
}

print item_details
print "What are is the brand of milk: ", item_details["milk"]["brand"]
print "What are is the cost of chocolate: ", item_details["chocolate"]["cost"]


{'milk': {'brand': 'Amul', 'cost': 10, 'quantity': 2.5}, 'chocolate': {'brand': 'Cadbury', 'cost': 5, 'quantity': 1}}
What are is the brand of milk:  Amul
What are is the cost of chocolate:  5

Functions

Using a function is handy in cases when you need to repeat something over an over again. A function can take arguments and return some variables.

E.g. if you want to fetch tweets using different queries then you can define a function which takes the query and gives you as output the tweets on that query. You can then just call the function with different queries rather than rewriting the whole code for getting the queries.


In [5]:
def get_items_from_file(filename):
    data = []
    with open(filename) as fp:
        for line in fp:
            line = line.strip().split(" ")
            data.append(line)
    return data

In [6]:
print "Data in file data/temp1.txt"
print get_items_from_file("../data/temp1.txt")


Data in file data/temp1.txt
[['milk', '5'], ['chocolate', '10'], ['honey', '20']]

In [7]:
print "Data in file data/temp2.txt"
print get_items_from_file("../data/temp2.txt")


Data in file data/temp2.txt
[['Alex', '222-222-1212', 'Ohio'], ['Shubh', '111-221-3452', 'Illinois'], ['Carlos', '445-123-1231', 'Washington']]

Loading Data


In [8]:
from scipy.io import arff

In [9]:
data, meta = arff.loadarff("../data/iris.arff")

In [10]:
data.shape, meta


Out[10]:
((150,), Dataset: iris
 	sepallength's type is numeric
 	sepalwidth's type is numeric
 	petallength's type is numeric
 	petalwidth's type is numeric
 	class's type is nominal, range is ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'))

In [11]:
data[0]


Out[11]:
(5.1, 3.5, 1.4, 0.2, 'Iris-setosa')

Pandas

Pandas is a wonderful library for working with tabular data in python. It can read csv files easily and represents them as dataframes. Think of it like excel but faster and without a GUI.


In [12]:
import pandas as pd

In [13]:
df_iris = pd.DataFrame(data, columns=meta.names())
df_iris.head()


Out[13]:
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

In [14]:
print "The shape of iris data is: ", df_iris.shape


The shape of iris data is:  (150, 5)

In [15]:
print "Show how many instances are of each class: "
df_iris["class"].value_counts()


Show how many instances are of each class: 
Out[15]:
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64

In [16]:
df_iris["sepallength"].hist(bins=10)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c65b49b10>

Filtering data Filtering parts of the data in pandas is really easy. If you want to filter data for editing it then you need to make a copy of the filtered data.


In [17]:
print "Show data containing with petalwidth > 2.0"
df_iris[df_iris["petalwidth"] > 2.0]


Show data containing with petalwidth > 2.0
Out[17]:
sepallength sepalwidth petallength petalwidth class
100 6.3 3.3 6.0 2.5 Iris-virginica
102 7.1 3.0 5.9 2.1 Iris-virginica
104 6.5 3.0 5.8 2.2 Iris-virginica
105 7.6 3.0 6.6 2.1 Iris-virginica
109 7.2 3.6 6.1 2.5 Iris-virginica
112 6.8 3.0 5.5 2.1 Iris-virginica
114 5.8 2.8 5.1 2.4 Iris-virginica
115 6.4 3.2 5.3 2.3 Iris-virginica
117 7.7 3.8 6.7 2.2 Iris-virginica
118 7.7 2.6 6.9 2.3 Iris-virginica
120 6.9 3.2 5.7 2.3 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica

Titanic data

VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

In [18]:
df = pd.read_csv("../data/titanic.csv")
df.shape


Out[18]:
(891, 15)

In [19]:
df.head()


Out[19]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

Plotting data

Great for visual inspection.

Matplotlib and Seaborn

Matplotlib is a low level python library which gives you complete control over your plots. Seaborn is a library made on top of matplotlib and which adds functionality to create certain types of plots easily. Works great with pandas.


In [20]:
# We need the line below to show plots directly in the notebook.
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

In [21]:
sns.set_style("ticks")
sns.set_context("paper")

In [22]:
colors = {
    "Iris-setosa": "red",
    "Iris-versicolor": "green",
    "Iris-virginica": "blue",
}
plt.scatter(df_iris.petallength, df_iris.petalwidth, c=map(lambda x: colors[x], df_iris["class"]))
plt.xlabel("petallength")
plt.ylabel("petalwidth")


Out[22]:
<matplotlib.text.Text at 0x7f5c61548f90>

In [23]:
sns.lmplot(x="petallength", y="petalwidth", hue="class", data=df_iris, fit_reg=False)


Out[23]:
<seaborn.axisgrid.FacetGrid at 0x7f5c614a3890>

In [24]:
sns.pairplot(df_iris, hue="class")


Out[24]:
<seaborn.axisgrid.PairGrid at 0x7f5c60bcfd50>

In [25]:
sns.countplot(x="sex", data=df)


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5f8a46d0>

In [26]:
sns.countplot(x="class", data=df)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5fbc8d10>

In [27]:
sns.countplot(x="embark_town", data=df)


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5f836c10>

In [28]:
sns.countplot(x="alive", data=df)


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5f82e250>

In [29]:
sns.countplot(x="alone", data=df)


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5d4a3f90>

In [30]:
sns.lmplot(x="age", y="survived", hue="sex", data=df, fit_reg=True, logistic=True)


Out[30]:
<seaborn.axisgrid.FacetGrid at 0x7f5c5d43c410>

In [31]:
sns.barplot(x="sex", y="survived", hue="embark_town", data=df)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5d3bc3d0>

In [32]:
sns.barplot(x="sex", y="survived", hue="class", data=df)


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5d446ed0>

In [33]:
sns.barplot(x="sex", y="survived", hue=pd.cut(df.age, bins=[0,18,30,100]), data=df)


Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5d3c7610>

In [34]:
sns.barplot(x="sex", y="survived", hue="alone", data=df)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5d2bb310>

In [35]:
sns.barplot(x="sex", y="survived", hue=pd.cut(df.sibsp, bins=[0,1,2,3,10]), data=df)


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c5c0f0690>

In [36]:
sns.barplot(x="sex", y="survived", hue=pd.cut(df.parch, bins=[0,1,2,3,10]), data=df)


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c553b8f90>

In [37]:
sns.barplot(x="sex", y="age", hue=pd.cut(df.parch, bins=[0,1,2,3,10]), data=df)


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c55325750>

In [38]:
sns.barplot(x="sex", y="age", hue=pd.cut(df.sibsp, bins=[0,1,2,3,10]), data=df)


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c551c9410>

In [39]:
sns.barplot(x="sex", y="age", hue="embark_town", data=df)


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c550d5f90>

In [40]:
sns.barplot(x="sex", y="age", hue="class", data=df)


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c54fe72d0>

Question: Draw the plot of mean petalwidth of the various categories of Iris-classes. It should show the mean petalwidth for each petallengths in buckets [0, 2.5, 4.5, 6.5, 10]


In [ ]:

ANSWER BELOW
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

In [41]:
sns.barplot(x="class", y="petalwidth", hue=pd.cut(df_iris.petallength, bins=[0, 2.5, 4.5, 6.5, 10]), data=df_iris)


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5c54edf9d0>

In [ ]: