


In [2]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load iqsize.csv using pd.read_csv



In [7]:

    
df1 = pd.read_csv("../data/iqsize.csv")
# head(n) returns the first n rows of the DataFrame
# n defaults to 5 when it is omitted
df1.head(10)

Print both column and row indexes



In [8]:

    
print("Columns: {}".format(df1.columns))
print("Rows: {}".format(df1.index))









    



Columns: Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')
Rows: RangeIndex(start=0, stop=38, step=1)

One common property of the dataframe to check is the size (in terms of number of rows and columns)



In [8]:

    
df1.shape









    Out[8]:





(38, 6)

Print each column's dtype and think about whether it has the proper variable type

df1.columns is an iterable Index, so we can loop over the column names and check the dtype of each column (each column is a Series)



In [9]:

    
for column in df1.columns:
    print("column name: {} - dtype: {}".format(column, df1[column].dtype))









    



column name: id - dtype: int64
column name: piq - dtype: object
column name: brain - dtype: float64
column name: height - dtype: object
column name: weight - dtype: int64
column name: sex - dtype: object

This is a correct way of checking variable types, but note:

If we want to analyse the results, we have to create new structures to store the values
We are iterating with a for loop and writing custom code

We can access the DataFrame dtypes directly this way



In [17]:

    
df1.dtypes









    Out[17]:





id          int64
piq        object
brain     float64
height     object
weight      int64
sex        object
dtype: object

The result is a series, so we can still do transformations on the result



In [18]:

    
type(df1.dtypes)









    Out[18]:





pandas.core.series.Series

We can use df1.dtypes to select only integer columns



In [24]:

    
df1[df1.dtypes[df1.dtypes == np.int64].index].head()

What happened here?

From inner to outer operations:

Note that df1.dtypes is a Series object
We check which dtypes are int64 by comparing all values to np.int64. The equality operation is applied to every element of df1.dtypes and returns a Boolean Series object



In [27]:

    
result1 = df1.dtypes == np.int64
display(result1)









    





id         True
piq       False
brain     False
height    False
weight     True
sex       False
dtype: bool



In [28]:

    
type(result1)









    Out[28]:





pandas.core.series.Series

We filter the df1.dtypes Series using the Boolean Series result1 (boolean indexing). Note that the result is a Series containing the dtype values of the original dataframe, restricted to the entries where result1 is True



In [32]:

    
df1.dtypes[result1]









    Out[32]:





id        int64
weight    int64
dtype: object

If we want the column names instead of the dtype values, we need the index of that Series



In [36]:

    
result2 = df1.dtypes[result1].index
display(result2)









    





Index(['id', 'weight'], dtype='object')

We can get the same result using result1 to slice df1.columns



In [35]:

    
df1.columns[result1]









    Out[35]:





Index(['id', 'weight'], dtype='object')

Once we have the column names we just need to slice the original dataframe using result2



In [38]:

    
df1[result2].head()
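
As a hedged aside, pandas also offers DataFrame.select_dtypes, which does the same dtype-based column selection in a single call. The sketch below does not read the real CSV: it uses a small hand-built frame named sample (an assumption, copied from the first rows of the dataset) with piq and height stored as text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: a few rows copied from the dataset.
# piq and height are stored as text, so they are not numeric columns.
sample = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "piq": ["124.0", "150.0", "128.0"],
        "brain": [81.69, 103.84, 96.54],
        "height": ["64.5", "73.3", "68.8"],
        "weight": [118, 143, 172],
        "sex": ["Female", "Male", "Female"],
    }
).astype({"id": "int64", "weight": "int64"})

# Step-by-step approach from the text: Boolean mask over dtypes, then index.
manual = sample[sample.dtypes[sample.dtypes == np.int64].index]

# One-call equivalent: keep only the int64 columns.
shortcut = sample.select_dtypes(include="int64")

print(list(shortcut.columns))
```

Both approaches should select the same columns; the step-by-step version is still useful for seeing what happens at each stage.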

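Think about whether the columns have the proper variable type. Note that piq and height hold numbers but are stored as object, which suggests that some of their entries are not cleanly numeric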



In [39]:

    
df1.dtypes









    Out[39]:





id          int64
piq        object
brain     float64
height     object
weight      int64
sex        object
dtype: object
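
One possible way to act on this check is pd.to_numeric with errors="coerce", which converts text to numbers and marks unparseable entries as NaN. This is only a sketch under assumptions: the Series piq_raw and its "unknown" entry are invented for illustration and are not taken from iqsize.csv.

```python
import pandas as pd

# Hypothetical piq values stored as object; "unknown" is an invented bad entry.
piq_raw = pd.Series(["124.0", "150.0", "unknown", "134.0"], name="piq", dtype=object)

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
piq_num = pd.to_numeric(piq_raw, errors="coerce")

print(piq_num.dtype)
print(piq_num.isna().sum())  # how many entries failed to parse
```

Counting the NaN values after the conversion is a quick way to see how many entries need manual inspection.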

Use DataFrame.loc and DataFrame.iloc for the following tasks (in each case print the result and check what data type you obtain as a response):
A. Get the second column
Using iloc:



In [48]:

    
df1.iloc[:,1].head()









    Out[48]:





0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object

We can also use loc



In [42]:

    
df1.columns









    Out[42]:





Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')



In [43]:

    
df1.columns[1]









    Out[43]:





'piq'



In [47]:

    
df1.loc[:,'piq'].head()









    Out[47]:





0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object



In [45]:

    
df1.loc[:,df1.columns[1]].head()









    Out[45]:





0    124.0
1    150.0
2    128.0
3    134.0
4    110.0
Name: piq, dtype: object



In [51]:

    
print(type(df1.loc[:,df1.columns[1]].head()))









    



<class 'pandas.core.series.Series'>

B. Get the row at position 3 (positions are zero-based, so this is the fourth row)

Using iloc



In [49]:

    
df1.iloc[3,:]









    Out[49]:





id            3
piq       134.0
brain     95.15
height     65.0
weight      147
sex        Male
Name: 3, dtype: object

Using loc



In [50]:

    
df1.loc[3,:]









    Out[50]:





id            3
piq       134.0
brain     95.15
height     65.0
weight      147
sex        Male
Name: 3, dtype: object



In [53]:

    
print(type(df1.loc[3,:]))









    



<class 'pandas.core.series.Series'>
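
In the cells above, iloc and loc return the same row only because df1 uses the default RangeIndex, where labels and positions coincide. The following minimal sketch, built on a hypothetical frame whose index labels are deliberately reversed, shows how the two differ.

```python
import pandas as pd

# Hypothetical frame whose index labels do not match the row positions.
sample = pd.DataFrame(
    {"weight": [118, 143, 172, 147]},
    index=[3, 2, 1, 0],
)

by_position = sample.iloc[3]  # row at position 3 (the last row, label 0)
by_label = sample.loc[3]      # row whose index label is 3 (the first row)

print(by_position["weight"], by_label["weight"])
print(type(by_position))
```

Selecting a single row this way returns a Series in both cases, with the column names as its index.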

C. Get all but last column



In [54]:

    
df1.columns









    Out[54]:





Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')



In [56]:

    
df1.columns[:-1]









    Out[56]:





Index(['id', 'piq', 'brain', 'height', 'weight'], dtype='object')



In [57]:

    
df1.loc[:,df1.columns[:-1]].head()



In [58]:

    
print(type(df1.loc[:,df1.columns[:-1]].head()))









    



<class 'pandas.core.frame.DataFrame'>

D. Get rows from position 4 to position 10 (inclusive)



In [60]:

    
df1.iloc[4:11,:]



In [61]:

    
print(type(df1.iloc[4:11,:]))









    



<class 'pandas.core.frame.DataFrame'>

E. Get the values of the columns at positions 3 and 4 (height and weight) for the rows at positions 2 and 3

Using iloc



In [62]:

    
df1.iloc[2:4,3:5]



In [10]:

    
print(type(df1.iloc[2:4,3:5]))









    



<class 'pandas.core.frame.DataFrame'>

Using loc



In [63]:

    
df1.columns









    Out[63]:





Index(['id', 'piq', 'brain', 'height', 'weight', 'sex'], dtype='object')



In [65]:

    
df1.columns[3:5]









    Out[65]:





Index(['height', 'weight'], dtype='object')



In [67]:

    
df1.loc[2:3,["height","weight"]]



In [73]:

    
df1.loc[2:3,df1.columns[3:5]]
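
The slices in sections D and E rely on a difference that is easy to miss: iloc excludes the stop position, while loc includes the stop label. A small sketch on a hypothetical 12-row frame (not the real dataset) makes this explicit.

```python
import pandas as pd

# Hypothetical 12-row frame with the default RangeIndex, like df1.
sample = pd.DataFrame({"value": range(12)})

pos_slice = sample.iloc[4:11]  # positions 4..10; stop position 11 is excluded
label_slice = sample.loc[2:3]  # labels 2 and 3; stop label 3 is included
same_rows = sample.iloc[2:4]   # positional equivalent of loc[2:3] on this index

print(len(pos_slice), len(label_slice), len(same_rows))
```

This is why iloc[2:4] and loc[2:3] select the same two rows when the index is the default RangeIndex.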

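F. Get all weight values greater than 136 (the piq column is stored as object, so it would likely need converting to a number before a comparison such as piq > 100)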



In [83]:

    
df1.loc[df1.loc[:,"weight"] > 136,"weight"].head()









    Out[83]:





1    143
2    172
3    147
4    146
5    138
Name: weight, dtype: int64

G. Divide previous results by 100



In [84]:

    
(df1.loc[df1.loc[:,"weight"] > 136,"weight"] / 100).head()









    Out[84]:





1    1.43
2    1.72
3    1.47
4    1.46
5    1.38
Name: weight, dtype: float64
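
The one-line expressions in F and G can be unpacked into named steps. The sketch below assumes a stand-in frame called sample that holds only the first six weight values of the dataset.

```python
import pandas as pd

# First six weight values of the dataset, used as a stand-in for df1.
sample = pd.DataFrame({"weight": [118, 143, 172, 147, 146, 138]})

mask = sample["weight"] > 136       # Boolean Series, one value per row
heavy = sample.loc[mask, "weight"]  # keep only rows where the mask is True
scaled = heavy / 100                # element-wise division, result is float64

print(heavy)
print(scaled)
```

Naming the mask makes it easier to reuse the same condition for several columns or to inspect how many rows it selects.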

Extra: methods for quantitative variables

DataFrame.describe() returns a set of statistical measures of the quantitative variables of the dataset.



In [85]:

    
df1.describe()

We can also call the method that computes each statistic individually (e.g. max, mean, std)



In [92]:

    
df1["brain"].max()









    Out[92]:





107.95



In [93]:

    
df1["brain"].mean()









    Out[93]:





86.29789473684211



In [94]:

    
df1["brain"].std()









    Out[94]:





29.12671515960254
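
If several statistics are needed at once, Series.agg accepts a list of method names and returns them together. This sketch uses a hypothetical Series named brain containing only the first five brain values of the dataset, so its numbers differ from the outputs above.

```python
import pandas as pd

# First five brain values of the dataset, a stand-in for df1["brain"].
brain = pd.Series([81.69, 103.84, 96.54, 95.15, 92.88], name="brain")

# agg with a list of method names returns a Series indexed by those names.
stats = brain.agg(["max", "mean", "std"])

print(stats)
```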

Extra: methods for qualitative variables



In [90]:

    
s = df1["sex"]



In [91]:

    
s.value_counts()









    Out[91]:





Female    19
Male      14
man        1
woman      1
Woman      1
Man        1
Name: sex, dtype: int64
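
The counts above show six spellings for what appear to be two categories. One possible clean-up, sketched here under the assumption that "man"/"Man" mean Male and "woman"/"Woman" mean Female, is to lowercase the labels and map them to canonical names. The Series sex and the dictionary canonical are illustrative names, not part of the original notebook.

```python
import pandas as pd

# One example of each spelling reported by value_counts.
sex = pd.Series(["Female", "Male", "man", "woman", "Woman", "Man"], name="sex")

# Assumed mapping from lowercase spellings to the two intended categories.
canonical = {"female": "Female", "woman": "Female", "male": "Male", "man": "Male"}

# Lowercase first so that "Woman" and "woman" hit the same dictionary key.
cleaned = sex.str.lower().map(canonical)
counts = cleaned.value_counts()

print(counts)
```

Any spelling missing from the mapping becomes NaN, so checking cleaned.isna().sum() afterwards reveals labels that were not anticipated.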

Output of df1.describe():

              id       brain      weight
count  38.000000   38.000000   38.000000
mean   18.473684   86.297895  146.289474
std    11.083801   29.126715   33.471979
min     0.000000  -83.180000    0.000000
25%     9.250000   85.485000  134.250000
50%    18.500000   90.540000  146.000000
75%    27.750000   94.955000  171.750000
max    37.000000  107.950000  192.000000

Output of df1.head(10):

   id      piq   brain  height  weight     sex
0   0    124.0   81.69    64.5     118  Female
1   1    150.0  103.84    73.3     143    Male
2   2    128.0   96.54    68.8     172  Female
3   3    134.0   95.15    65.0     147    Male
4   4    110.0   92.88    69.0     146    Male
5   5    131.0   99.13    64.5     138    Male
6   6     98.0   85.43    66.0     175  Female
7   7     84.0   90.49    66.3     134    Male
8   8  00147.0   95.55    68.8     172  Female
9   9    124.0   83.39    64.5     118    Male