In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
  1. Load iqsize.csv using pd.read_csv

In [7]:
df1 = pd.read_csv("../data/iqsize.csv")
# head(n) returns the first n rows
# n defaults to 5 when omitted
df1.head(10)


Out[7]:
   id      piq   brain  height  weight     sex
0   0    124.0   81.69    64.5     118  Female
1   1    150.0  103.84    73.3     143    Male
2   2    128.0   96.54    68.8     172  Female
3   3    134.0   95.15    65.0     147    Male
4   4    110.0   92.88    69.0     146    Male
5   5    131.0   99.13    64.5     138    Male
6   6     98.0   85.43    66.0     175  Female
7   7     84.0   90.49    66.3     134    Male
8   8  00147.0   95.55    68.8     172  Female
9   9    124.0   83.39    64.5     118    Male
  2. Print both the column and row indexes

In [8]:
print("Columns: {}".format(df1.columns))
print("Rows: {}".format(df1.index))


Columns: Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')
Rows: RangeIndex(start=0, stop=38, step=1)

A common property to check is the shape of the dataframe, i.e. its number of rows and columns


In [8]:
df1.shape


Out[8]:
(38, 6)
  3. Print each column's dtype and think about whether it has the proper variable type

df1.columns is an iterable Index, so we can loop over the column names and check the dtype of each column's Series


In [9]:
for column in df1.columns:
    print("column name: {} - dtype: {}".format(column, df1[column].dtype))


column name: id - dtype: int64
column name: piq - dtype: object
column name: brain - dtype: float64
column name: height - dtype: object
column name: weight - dtype: int64
column name: sex - dtype: object

This is a valid way of checking the variable types, but note:

  • If we want to analyse the results, we have to create new structures to store the values
  • We are iterating with a for loop and writing custom code

We can access the dtypes directly through DataFrame.dtypes


In [17]:
df1.dtypes


Out[17]:
id          int64
piq        object
brain     float64
height     object
weight      int64
sex        object
dtype: object

The result is a Series, so we can apply further transformations to it


In [18]:
type(df1.dtypes)


Out[18]:
pandas.core.series.Series

We can use df1.dtypes to select only integer columns


In [24]:
df1[df1.dtypes[df1.dtypes == np.int64].index].head()


Out[24]:
   id  weight
0   0     118
1   1     143
2   2     172
3   3     147
4   4     146

What happened here?

From inner to outer operations:

  1. note that df1.dtypes is a Series object
  2. we check which dtypes are int64 by comparing every value to np.int64. The equality operation is applied to all elements of df1.dtypes and returns a Boolean Series object

In [27]:
result1 = df1.dtypes == np.int64
display(result1)


id         True
piq       False
brain     False
height    False
weight     True
sex       False
dtype: bool

In [28]:
type(result1)


Out[28]:
pandas.core.series.Series
  3. we slice the df1.dtypes Series using the result1 Series. Note that the result is a Series whose values are dtypes and whose index holds the matching column names

In [32]:
df1.dtypes[result1]


Out[32]:
id        int64
weight    int64
dtype: object

If we want to select the columns instead of the values, we need the indexes


In [36]:
result2 = df1.dtypes[result1].index
display(result2)


Index(['id', 'weight'], dtype='object')

We can get the same result using result1 to slice df1.columns


In [35]:
df1.columns[result1]


Out[35]:
Index(['id', 'weight'], dtype='object')
  4. Once we have the column names, we just need to slice the original dataframe using result2

In [38]:
df1[result2].head()


Out[38]:
   id  weight
0   0     118
1   1     143
2   2     172
3   3     147
4   4     146
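
The manual steps above can also be compared with pandas' built-in DataFrame.select_dtypes. The sketch below is only an illustration: the sample frame and its values are hypothetical stand-ins for df1, kept small so the two routes can be checked side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of numeric dtypes as df1
sample = pd.DataFrame({
    "id": np.array([0, 1, 2], dtype="int64"),
    "brain": [81.69, 103.84, 96.54],
    "weight": np.array([118, 143, 172], dtype="int64"),
})

# Manual route: Boolean Series on dtypes -> index of names -> slice the frame
manual = sample[sample.dtypes[sample.dtypes == np.int64].index]

# Built-in route: let pandas do the dtype filtering
builtin = sample.select_dtypes(include="int64")

print(list(manual.columns))
print(list(builtin.columns))
```

Both routes should select the same columns; the manual one is still useful for understanding how Boolean indexing works on a Series.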

Think about whether each column has the proper variable type: piq and height look numeric in the preview above but are reported as object


In [39]:
df1.dtypes


Out[39]:
id          int64
piq        object
brain     float64
height     object
weight      int64
sex        object
dtype: object
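
One possible way to deal with a numeric column stored as object is pd.to_numeric with errors="coerce". This is a minimal sketch under assumptions: the values below are hypothetical (the entries that actually cause the object dtype in iqsize.csv are not shown here), and coercing to NaN is only one of several cleaning strategies.

```python
import pandas as pd

# Hypothetical raw values, stored as text like an object column would be
raw = pd.Series(["124.0", "150.0", "not available", "00147.0"], name="piq")

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)          # numeric dtype after conversion
print(converted.isna().sum())   # how many entries could not be parsed
```

Counting the NaN values after the conversion is a quick way to see how many entries need manual inspection.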
  4. Use DataFrame.loc and DataFrame.iloc to (in each case print the result and check what data type you obtain as response):

    A. Get the second column

    Using iloc:


In [48]:
df1.iloc[:,1].head()


Out[48]:
0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object

We can also use loc


In [42]:
df1.columns


Out[42]:
Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')

In [43]:
df1.columns[1]


Out[43]:
'piq'

In [47]:
df1.loc[:,'piq'].head()


Out[47]:
0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object

In [45]:
df1.loc[:,df1.columns[1]].head()


Out[45]:
0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object

In [51]:
print(type(df1.loc[:,df1.columns[1]].head()))


<class 'pandas.core.series.Series'>

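B. Get the third row (the calls below use index 3, which is the fourth row; with zero-based positions the third row is df1.iloc[2,:])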

Using iloc


In [49]:
df1.iloc[3,:]


Out[49]:
id            3
piq       134.0
brain     95.15
height     65.0
weight      147
sex        Male
Name: 3, dtype: object

Using loc


In [50]:
df1.loc[3,:]


Out[50]:
id            3
piq       134.0
brain     95.15
height     65.0
weight      147
sex        Male
Name: 3, dtype: object

In [53]:
print(type(df1.loc[3,:]))


<class 'pandas.core.series.Series'>
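
The type checks so far returned a Series because a single label or position was requested. As a small illustration (the sample frame is hypothetical), passing a list instead of a scalar keeps the result as a DataFrame:

```python
import pandas as pd

# Hypothetical two-row sample
sample = pd.DataFrame({"piq": [124.0, 150.0], "brain": [81.69, 103.84]})

col_series = sample.iloc[:, 1]     # scalar position -> Series
col_frame = sample.iloc[:, [1]]    # list with one position -> one-column DataFrame
row_series = sample.loc[0, :]      # scalar label -> Series
row_frame = sample.loc[[0], :]     # list with one label -> one-row DataFrame

print(type(col_series).__name__, type(col_frame).__name__)
print(type(row_series).__name__, type(row_frame).__name__)
```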

C. Get all but last column


In [54]:
df1.columns


Out[54]:
Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')

In [56]:
df1.columns[:-1]


Out[56]:
Index(['id', 'piq', 'brain', 'height', 'weight'], dtype='object')

In [57]:
df1.loc[:,df1.columns[:-1]].head()


Out[57]:
   id    piq   brain  height  weight
0   0  124.0   81.69    64.5     118
1   1  150.0  103.84    73.3     143
2   2  128.0   96.54    68.8     172
3   3  134.0   95.15    65.0     147
4   4  110.0   92.88    69.0     146

In [58]:
print(type(df1.loc[:,df1.columns[:-1]].head()))


<class 'pandas.core.frame.DataFrame'>

D. Get rows from 4 to 10


In [60]:
df1.iloc[4:11,:]


Out[60]:
    id      piq   brain  height  weight     sex
 4   4    110.0   92.88    69.0     146    Male
 5   5    131.0   99.13    64.5     138    Male
 6   6     98.0   85.43    66.0     175  Female
 7   7     84.0   90.49    66.3     134    Male
 8   8  00147.0   95.55    68.8     172  Female
 9   9    124.0   83.39    64.5     118    Male
10  10    128.0  107.95    70.0     151   Woman

In [61]:
print(type(df1.iloc[4:11,:]))


<class 'pandas.core.frame.DataFrame'>

E. Get the values of the third and fourth rows (positions 2 and 3) for the columns at positions 3 and 4 (height and weight)

Using iloc


In [62]:
df1.iloc[2:4,3:5]


Out[62]:
   height  weight
2    68.8     172
3    65.0     147

In [10]:
print(type(df1.iloc[2:4,3:5]))


<class 'pandas.core.frame.DataFrame'>

Using loc


In [63]:
df1.columns


Out[63]:
Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')

In [65]:
df1.columns[3:5]


Out[65]:
Index(['height', 'weight'], dtype='object')

In [67]:
df1.loc[2:3,["height","weight"]]


Out[67]:
   height  weight
2    68.8     172
3    65.0     147

In [73]:
df1.loc[2:3,df1.columns[3:5]]


Out[73]:
   height  weight
2    68.8     172
3    65.0     147
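
Note that the iloc call used 2:4 while the loc calls used 2:3 for the same rows. iloc slices by position and excludes the end, whereas loc slices by label and includes it. The sketch below makes the difference visible with a hypothetical index whose labels differ from the positions:

```python
import pandas as pd

sample = pd.DataFrame(
    {"height": [64.5, 73.3, 68.8, 65.0, 69.0],
     "weight": [118, 143, 172, 147, 146]},
    index=[10, 11, 12, 13, 14],  # hypothetical labels, chosen to differ from positions
)

by_position = sample.iloc[2:4]   # positions 2 and 3; position 4 is excluded
by_label = sample.loc[12:13]     # labels 12 and 13; the end label is included

print(by_position.equals(by_label))
```

With the default RangeIndex of df1 the labels and positions coincide, which is why the two notations are easy to confuse.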

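F. Get all iq values greater than 100 (piq is stored as object and cannot be compared numerically as-is, so the example below filters weight values greater than 136 instead)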


In [83]:
df1.loc[df1.loc[:,"weight"] > 136,"weight"].head()


Out[83]:
1    143
2    172
3    147
4    146
5    138
Name: weight, dtype: int64

G. Divide the previous results by 100


In [84]:
(df1.loc[df1.loc[:,"weight"] > 136,"weight"] / 100).head()


Out[84]:
1    1.43
2    1.72
3    1.47
4    1.46
5    1.38
Name: weight, dtype: float64
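
For reference, the same Boolean-mask pattern could be applied to iq values once they are numeric. This is a sketch under assumptions: the piq values are hypothetical, and the conversion with pd.to_numeric is one possible cleaning step, not something done in the cells above.

```python
import pandas as pd

# Hypothetical piq values stored as text, as in an object column
sample = pd.DataFrame({"piq": ["124.0", "98.0", "84.0", "150.0"]})

piq = pd.to_numeric(sample["piq"], errors="coerce")  # make the column numeric first
mask = piq > 100            # Boolean Series, one entry per row
selected = piq[mask]        # keeps only the rows where the mask is True
scaled = selected / 100     # arithmetic is applied element-wise

print(selected)
print(scaled)
```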

Extra: methods for quantitative variables

  • DataFrame.describe() returns a set of statistical measures of the quantitative variables of the dataset.

In [85]:
df1.describe()


Out[85]:
              id       brain      weight
count  38.000000   38.000000   38.000000
mean   18.473684   86.297895  146.289474
std    11.083801   29.126715   33.471979
min     0.000000  -83.180000    0.000000
25%     9.250000   85.485000  134.250000
50%    18.500000   90.540000  146.000000
75%    27.750000   94.955000  171.750000
max    37.000000  107.950000  192.000000
  • We can also call the method that computes each statistic individually (e.g. max, mean, std)

In [92]:
df1["brain"].max()


Out[92]:
107.95

In [93]:
df1["brain"].mean()


Out[93]:
86.29789473684211

In [94]:
df1["brain"].std()


Out[94]:
29.12671515960254
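
Several statistics can also be requested in one call with Series.agg. A minimal sketch, using a few hypothetical brain values rather than the full column:

```python
import pandas as pd

# Hypothetical subset of values
brain = pd.Series([81.69, 103.84, 96.54, 95.15], name="brain")

# agg with a list of method names returns a Series indexed by those names
summary = brain.agg(["max", "mean", "std"])

print(summary)
```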

Extra: methods for qualitative variables


In [90]:
s = df1["sex"]

In [91]:
s.value_counts()


Out[91]:
Female    19
Male      14
man        1
woman      1
Woman      1
Man        1
Name: sex, dtype: int64
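
The counts show several spellings of what look like the same two categories. One possible clean-up is to normalise the case and map the variants onto two labels. The mapping below is an assumption (that "man" means "Male" and "woman" means "Female"), and the sample values are a hypothetical subset of the column:

```python
import pandas as pd

# Hypothetical subset with one example of each spelling seen above
s = pd.Series(["Female", "Male", "man", "woman", "Woman", "Man"], name="sex")

# Assumed mapping from lower-cased variants to two canonical labels
mapping = {"female": "Female", "woman": "Female", "male": "Male", "man": "Male"}

cleaned = s.str.strip().str.lower().map(mapping)

print(cleaned.value_counts())
```

Any value missing from the mapping would become NaN, so checking cleaned.isna().sum() afterwards helps catch spellings that were not anticipated.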