Labs in Python

All the labs, Python version. (There is no Lab 1.)


In [3]:
# jupyter magic
%matplotlib inline

# python scientific stack
import numpy as np
import pandas as pd
import scipy.stats as scs
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as sms

# fileformat
from simpledbf import Dbf5
from sas7bdat import SAS7BDAT

Lab 2

Import data


In [4]:
# excel
#df = pd.read_excel('data/labo2/SR_Data.xls')

# DBF (Dbase)
dbf = Dbf5('data/labo2/SR_Data.dbf')
df = dbf.to_dataframe()

# SPSS
# not done: savReaderWriter fails to install with pip

# SAS
sas = SAS7BDAT('data/labo2/tableau1.sas7bdat')

Dataframe manipulation

  • show var (columns)
  • delete var
  • rename var
  • create var
  • head

In [ ]:
# show vars
df.columns

# delete var
df = df.drop('Shape_Leng', axis=1) # axis=1 = columns; equivalent to df.drop(columns='Shape_Leng')
# df.drop('Shape_Leng', axis=1, inplace=True) # same effect, but modifies df in place and returns None

# rename var
df = df.rename(columns={'POPTOT_FR':'POPTOT'})

# create var (dividing by 1,000,000 converts m² to km², assuming Shape_Area is in m²)
df['km'] = df['Shape_Area'] / 1000000
df['HabKm2'] = df['POPTOT'] / df['km']

# show data head
df.head()

Normality

Skewness


In [5]:
#scs.skew(df)
df.skew()


Out[5]:
POPTOT_FR     0.460748
FAIBLEREV     1.305112
MONOPCT       0.318384
MENAGE1PCT   -0.195568
IMMREC_PCT    1.889274
TX_CHOM       2.071395
NOECOLEPCT    0.131756
SCO_M9PCT     0.255036
SCO_M13PCT   -0.209523
PARTIELPCT    0.440455
FAIBREVPCT    0.357728
INDICE_PAU    0.294139
Dist_Min      3.526814
N_1000        0.956193
Dist_Moy_3    3.622690
Shape_Leng    3.602587
Shape_Area    8.012661
dtype: float64

Kurtosis


In [6]:
df.kurt()  # or df.kurtosis()


Out[6]:
POPTOT_FR     -0.260861
FAIBLEREV      2.206736
MONOPCT        0.536391
MENAGE1PCT    -0.348104
IMMREC_PCT     3.909355
TX_CHOM       10.961608
NOECOLEPCT    -0.121454
SCO_M9PCT     -0.745866
SCO_M13PCT    -0.958561
PARTIELPCT     0.566064
FAIBREVPCT    -0.123080
INDICE_PAU     0.094835
Dist_Min      19.814668
N_1000         0.816666
Dist_Moy_3    21.412225
Shape_Leng    19.450893
Shape_Area    88.786170
dtype: float64

Kolmogorov-Smirnov

Lilliefors test: a Kolmogorov-Smirnov test for normality in which the mean and variance are estimated from the sample.


In [7]:
# scs.kstest(df['HabKm2'], 'norm')
statsmodels.stats.diagnostic.kstest_normal(df['HabKm2'])


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-7-844ca04d97dc> in <module>()
      1 # scs.kstest(df['HabKm2'], 'norm')
----> 2 statsmodels.stats.diagnostic.kstest_normal(df['HabKm2'])

...

KeyError: 'HabKm2'

The 'HabKm2' column is created in the "Dataframe manipulation" cell, which had not been run in this session (Out[5] still lists POPTOT_FR and Shape_Leng). Run that cell first.

In [ ]:
scs.shapiro(df['HabKm2'])
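
A minimal sketch of how these normality tests differ, run on synthetic data rather than on the lab's df (the variable names and the random sample are assumptions made for illustration). A plain `scs.kstest(x, 'norm')` compares the data to a standard normal N(0, 1), so it rejects any unstandardized variable. Plugging the sample mean and standard deviation into `kstest` gives the same statistic as Lilliefors, but a p-value that ignores the fact that the parameters were estimated.

```python
import numpy as np
import scipy.stats as scs
from statsmodels.stats.diagnostic import lilliefors

# synthetic sample, NOT the lab data
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)

# plain KS against N(0, 1): rejects because x is not standardized
ks_stat, ks_p = scs.kstest(x, 'norm')

# KS against N(mean, std) with parameters estimated from the sample
ks2_stat, ks2_p = scs.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))

# Lilliefors: same statistic, p-value corrected for the estimated parameters
lf_stat, lf_p = lilliefors(x, dist='norm')

# Shapiro-Wilk, for comparison
sw_stat, sw_p = scs.shapiro(x)

print(ks_p, ks2_p, lf_p, sw_p)
```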

Transformations

Square root


In [ ]:
df['SqrtDens'] = np.sqrt(df['HabKm2'])
df['SqrtImg'] = np.sqrt(df['IMMREC_PCT'])

Logarithmic


In [ ]:
# log(0) = -inf (NumPy emits a RuntimeWarning), hence the + 1 where zeros are possible
df['LogDens'] = np.log(df['HabKm2'])
df['LogImg'] = np.log(df['IMMREC_PCT'] + 1)
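
A small sketch of the zero problem, on a made-up Series (the values are illustrative only). `np.log1p(x)` computes log(1 + x) and is a possible alternative to writing `np.log(x + 1)` by hand.

```python
import numpy as np
import pandas as pd

# made-up values containing a zero
s = pd.Series([0.0, 1.0, 9.0])

# np.log(0) does not raise: it returns -inf and emits a RuntimeWarning
with np.errstate(divide='ignore'):
    raw = np.log(s)

# shifting by 1 keeps zeros finite: log(0 + 1) = 0
shifted = np.log1p(s)

print(raw.tolist(), shifted.tolist())
```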

Centering and scaling (z-scores)


In [ ]:
#df[['INDICE_PAU', 'Dist_Min', 'N_1000', 'Dist_Moy_3']]
zscores = df.iloc[:, 11:15]  # .ix was removed in pandas 1.0; .iloc selects by position

In [ ]:
# scaling from machine learning
from sklearn.preprocessing import scale

zscores = pd.DataFrame(scale(zscores), index=zscores.index, columns=zscores.columns)

In [ ]:
zscores.mean()

In [ ]:
zscores.std()
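
One detail worth checking here: sklearn's `scale` divides by the population standard deviation (ddof=0), while pandas' `.std()` defaults to the sample standard deviation (ddof=1), so `zscores.std()` comes out slightly above 1. The sketch below reproduces the same standardization with pandas only, on a made-up frame (the column names and values are assumptions).

```python
import numpy as np
import pandas as pd

# made-up data, NOT the lab data
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 20.0, 30.0, 100.0]})

# center and scale with the population standard deviation, as scale() does
z = (demo - demo.mean()) / demo.std(ddof=0)

means = z.mean()                 # ~0 for every column
std_population = z.std(ddof=0)   # exactly 1 up to rounding
std_sample = z.std()             # sqrt(n / (n - 1)), slightly above 1

print(means, std_population, std_sample, sep='\n')
```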

Descriptive statistics


In [ ]:
df.describe()

In [ ]:
df.mean()
df.std()
df.min()
df.max()
df.median()
# no df.range() in pandas; use df.max() - df.min()
df.quantile(0.75) # param : 0.25, 0.75... default 0.5
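
In a notebook cell only the last expression is displayed, so the calls above show just the final quantile. As a sketch, `agg` can gather several statistics into one table, and `quantile` accepts a list. The frame below is made up for illustration.

```python
import pandas as pd

# made-up data, NOT the lab data
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                     'b': [10.0, 20.0, 30.0, 100.0]})

# one row per statistic, one column per variable
summary = demo.agg(['mean', 'std', 'min', 'max', 'median'])

# several quantiles at once (the index holds the requested probabilities)
quartiles = demo.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```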

Lab 3

Histograms

Classic histograms

Histograms with normal curve
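
The notebook leaves this section empty. One possible way to draw a histogram with a normal curve is sketched below with matplotlib on synthetic data (the sample, bin count and labels are assumptions, not the lab's settings). With `density=True` the bars integrate to 1, so the normal density can be overlaid on the same axis.

```python
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt

# synthetic stand-in for a dataframe column
rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=500)

fig, ax = plt.subplots()

# density=True rescales bar heights so the histogram integrates to 1
counts, edges, _ = ax.hist(x, bins=20, density=True, alpha=0.6)

# normal density using the sample mean and standard deviation
grid = np.linspace(x.min(), x.max(), 200)
ax.plot(grid, scs.norm.pdf(grid, loc=x.mean(), scale=x.std(ddof=1)))

ax.set_xlabel('x')
ax.set_ylabel('density')
```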

Scatterplots

Scatterplots with regression line

Scatterplot matrix

3D scatterplots

Correlation matrix


In [ ]:
df.corr()

Pearson

Spearman
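
These two subsections are empty in the notebook. As a hedged sketch on synthetic data (the columns x and y are made up), `DataFrame.corr` takes a `method` argument for the whole matrix, and scipy returns the coefficient together with a p-value for a single pair.

```python
import numpy as np
import pandas as pd
import scipy.stats as scs

# synthetic, linearly related variables (NOT the lab data)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
demo = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(size=100)})

# full correlation matrices
pearson_matrix = demo.corr(method='pearson')    # the default
spearman_matrix = demo.corr(method='spearman')  # rank-based

# single pair, with p-values
r, p_r = scs.pearsonr(demo['x'], demo['y'])
rho, p_rho = scs.spearmanr(demo['x'], demo['y'])

print(r, p_r, rho, p_rho)
```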

Simple linear regression
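
No code is given for this section. A minimal sketch with the formula API already imported at the top of the notebook (`smf`), fitted on synthetic data with a known slope and intercept (the variable names and coefficients are assumptions chosen for the example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data: y = 1.5 + 0.8 * x + noise (NOT the lab data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
demo = pd.DataFrame({'x': x,
                     'y': 1.5 + 0.8 * x + rng.normal(scale=0.5, size=200)})

# ordinary least squares with an R-style formula
model = smf.ols('y ~ x', data=demo).fit()

intercept = model.params['Intercept']
slope = model.params['x']
r_squared = model.rsquared

print(intercept, slope, r_squared)
# model.summary() displays the full regression table
```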

Contingency table

Categories of nominal variables

Contingency table
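
Both subsections are empty in the notebook. One way to list the categories of a nominal variable and cross two of them is sketched below. The variables tenure and zone are hypothetical, and the table is far too small for a reliable chi-square test; it only shows the calls.

```python
import pandas as pd
import scipy.stats as scs

# hypothetical nominal variables, NOT the lab data
demo = pd.DataFrame({
    'tenure': ['owner', 'renter', 'owner', 'renter',
               'owner', 'renter', 'owner', 'owner'],
    'zone': ['urban', 'urban', 'rural', 'urban',
             'rural', 'urban', 'rural', 'urban'],
})

# categories of a nominal variable, with their frequencies
levels = demo['tenure'].value_counts()

# contingency table: rows = tenure, columns = zone
table = pd.crosstab(demo['tenure'], demo['zone'])

# chi-square test of independence on the table
chi2, p, dof, expected = scs.chi2_contingency(table)

print(levels)
print(table)
```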


Lab 4

T-Test

F Test


In [ ]:
scs.ttest_ind?

Satterthwaite method

Pooled method
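
A sketch of how these pieces could fit together with scipy, on two synthetic groups (the group sizes and parameters are made up). In `scs.ttest_ind`, `equal_var=True` gives the pooled-variance test and `equal_var=False` gives Welch's test, which uses the Satterthwaite approximation for the degrees of freedom. The F test of equal variances is computed by hand here as a folded F (larger variance over smaller), which is an assumption about which F test the lab intends.

```python
import numpy as np
import scipy.stats as scs

# two synthetic groups with different variances (NOT the lab data)
rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=40)
b = rng.normal(loc=12, scale=5, size=60)

# folded F test for equality of variances: larger variance on top
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
if var_a >= var_b:
    f_stat, dfn, dfd = var_a / var_b, len(a) - 1, len(b) - 1
else:
    f_stat, dfn, dfd = var_b / var_a, len(b) - 1, len(a) - 1
f_p = min(1.0, 2 * scs.f.sf(f_stat, dfn, dfd))  # two-sided p-value

# pooled method: assumes equal variances
t_pooled, p_pooled = scs.ttest_ind(a, b, equal_var=True)

# Satterthwaite (Welch) method: does not assume equal variances
t_welch, p_welch = scs.ttest_ind(a, b, equal_var=False)

print(f_stat, f_p)
print(t_pooled, p_pooled)
print(t_welch, p_welch)
```

If the F test rejects equal variances, the Satterthwaite result is the one usually read.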

Results analysis


In [ ]:
df.cov()

In [ ]:
#statsmodels.stats.anova.anova_lm
statsmodels.stats.anova.anova_lm?

Boxplot

ANOVA

F Test

R square

Tukey test
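
The ANOVA subsections are empty in the notebook. A hedged sketch of a one-way ANOVA with its F test, R square and a Tukey HSD comparison, using statsmodels on three synthetic groups (the group labels, means and sizes are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# three synthetic groups; group 'c' has a higher mean (NOT the lab data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'group': np.repeat(['a', 'b', 'c'], 30),
    'y': np.concatenate([rng.normal(10, 2, 30),
                         rng.normal(10, 2, 30),
                         rng.normal(14, 2, 30)]),
})

# one-way ANOVA as a linear model with a categorical predictor
model = smf.ols('y ~ C(group)', data=demo).fit()
anova_table = anova_lm(model, typ=2)  # columns: sum_sq, df, F, PR(>F)

# R square: share of the variance explained by the grouping
r_squared = model.rsquared

# Tukey HSD: all pairwise comparisons of group means
tukey = pairwise_tukeyhsd(demo['y'], demo['group'], alpha=0.05)

print(anova_table)
print(r_squared)
print(tukey.summary())
```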