I recommend installing Anaconda for ease: https://www.continuum.io/downloads. Certain packages, e.g. seaborn, do not come with the standard installation. A simple search for how to 'install seaborn with anaconda' will easily find the installation instructions, in this case http://seaborn.pydata.org/installing.html. In a terminal (Mac): conda install seaborn
Alexandru Agachi
m.a.agachi@gmail.com
Background: pursued degrees and graduate and postgraduate diplomas in international relations, energy science,
surgical robotics, neuroanatomy and imaging, and biomedical innovation
Working with a data driven firm in London that operates on financial markets, and teaching a big data module in the
biomedical innovation degree at Pierre et Marie Curie University in Paris.
Why statistics
Why focus on data
Better to be in an expanding world and not quite in exactly the right field, than to be in a contracting world where people's worst behavior comes out.
Eric Weinstein, Fellow University of Oxford, MD Thiel Capital.
Why healthcare? Ripe for data analysis: 30% of the world's data.
The biomedical sciences have been the pillar of the health care system for a long time now. The new system will have two equal pillars — the biomedical sciences and the data sciences.
Dr Scott Zeger, Johns Hopkins University.
In [3]:
#import the packages we will use
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#use simplest tool available
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import statsmodels.api as sm
import scipy.stats
#train_test_split now lives in sklearn.model_selection (sklearn.cross_validation is deprecated)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
#from sklearn import preprocessing
#from sklearn.linear_model import LassoLarsCV
#from sklearn.cluster import KMeans
%matplotlib inline
sns.set(style="whitegrid", color_codes=True)
#Whenever you are not sure of a function or parameter used, please verify documentation for that function at:
#http://pandas.pydata.org/pandas-docs/stable/
All credit for the data, and our many thanks, go to the principal investigators who collected this data and made it available:
Online repositories for data used and for this notebook:
http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
https://github.com/AlexandruAgachi/introductory_statistics_tutorial/blob/master/Introductory%20statistical%20data%20analysis%20with%20pandas%20PyData%20Berlin%202017.ipynb
Here, specify your own directory for wherever you stored the datasets on your computer.
For me: cd Downloads/Heart disease dataset
Note: when downloading the datasets, some people had to rename them manually with a '.txt' ending. The four data files that interest us should be named:
processed.cleveland.data.txt
processed.hungarian.data.txt
processed.switzerland.data.txt
processed.va.data.txt
Data management is an integral part of your research process
Choices you make here will influence your entire study
In [20]:
#let's load the first dataset in pandas with the read_csv function. This will create pandas dataframe objects per below.
cleveland = pd.read_csv('processed.cleveland.data.txt', header=None)
In [93]:
cleveland.head(10)
Out[93]:
In [25]:
hungary = pd.read_csv('processed.hungarian.data.txt', header=None)
In [26]:
hungary.head(10)
Out[26]:
In [27]:
switzerland = pd.read_csv('processed.switzerland.data.txt', header=None)
In [28]:
switzerland.head(10)
Out[28]:
In [30]:
va = pd.read_csv('processed.va.data.txt', header=None)
In [31]:
va.head(10)
Out[31]:
In [37]:
#ignore_index=True gives the combined dataframe a fresh, unique row index
df = pd.concat([cleveland, hungary, va, switzerland], ignore_index=True)
In [38]:
df
Out[38]:
In [57]:
#we rename all columns to make them more legible
df.columns = ['age', 'sex', 'chest_pain', 'rest_bp', 'cholesterol', 'fasting_bs', 'rest_ecg', 'max_heart_rate', 'exercise_angina', 'st_depression', 'slope', 'fluoroscopy',
'defect', 'diagnosis']
In [603]:
#let's look at our dataframe. the head(10) option tells pandas to return only the first ten rows of our dataframe.
df.head(10)
Out[603]:
A. Observational or experimental
Observational and experimental data
To know whether your data is observational or experimental, ask whether the explanatory variable was manipulated or not.
In a true experiment, 3 conditions:
Quasi experiment:
In an observational study the regression line only describes the data you see; it cannot be used to predict the result of an intervention.
*Randomization works best as your sample size approaches infinity; for small samples, imbalances between the groups can occur.
B. Longitudinal or cross sectional
Then data analysis? Wrong!
Background research. Always starts with background research. Domain knowledge.
We quickly find three studies using this dataset:
Detrano et al.: http://www.ajconline.org/article/0002-9149(89)90524-9/pdf
Sundaram and Kakade: http://www2.rmcil.edu/dataanalytics/v2015/papers/Clinical_Decision_Support_For_Heart_Disease.pdf
Soni et al.: http://www.enggjournals.com/ijcse/doc/IJCSE11-03-06-120.pdf
Value the time of domain experts. For the dataset we'll focus on today, look at the txt file AND at the accompanying explanation.
The first stage in a study is always exploratory data analysis. We typically receive the data without context, in a raw file that we cannot easily interpret. The five steps of exploratory data analysis:
This can be summarized in a notebook or even an initial data report for everyone involved in the data project
In [378]:
print(len(df))
In [379]:
print(len(df.columns))
The ratio of observations to variables largely exceeds the heuristic rule of 5 to 1.
See Victoria Stodden (2006) for a discussion of why this is important:
https://web.stanford.edu/~vcs/thesis.pdf
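As a quick check of this heuristic on our own data (a simple illustration, nothing more):
In [ ]:
#ratio of observations to variables; the heuristic suggests at least 5 observations per variable
print(len(df) / len(df.columns))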
convert all variables to numeric ones
In [60]:
#mark all variables as numeric data, and signify, for the relevant ones, that they are categorical rather than
#quantitative variables
#errors='coerce' tells pandas to return invalid values as NaN rather than as the input values themselves
#this step is crucial: otherwise subsequent analyses will not work properly, since pandas would keep the
#missing-value markers in these columns as strings
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['sex'] = pd.to_numeric(df['sex'], errors='coerce').astype('category')
df['chest_pain'] = pd.to_numeric(df['chest_pain'], errors='coerce').astype('category')
df['rest_bp'] = pd.to_numeric(df['rest_bp'], errors='coerce')
df['cholesterol'] = pd.to_numeric(df['cholesterol'], errors='coerce')
df['fasting_bs'] = pd.to_numeric(df['fasting_bs'], errors='coerce').astype('category')
df['rest_ecg'] = pd.to_numeric(df['rest_ecg'], errors='coerce').astype('category')
df['max_heart_rate'] = pd.to_numeric(df['max_heart_rate'], errors='coerce')
df['exercise_angina'] = pd.to_numeric(df['exercise_angina'], errors='coerce').astype('category')
df['st_depression'] = pd.to_numeric(df['st_depression'], errors='coerce')
df['slope'] = pd.to_numeric(df['slope'], errors='coerce').astype('category')
df['fluoroscopy'] = pd.to_numeric(df['fluoroscopy'], errors='coerce').astype('category')
df['defect'] = pd.to_numeric(df['defect'], errors='coerce').astype('category')
df['diagnosis'] = pd.to_numeric(df['diagnosis'], errors='coerce').astype('category')
Why bother? Nearly all statistical methods assume complete information. In the presence of missing data:
Three types of missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).
If the data are MAR, the missing data mechanism is ignorable and there is no need to model it; if they are NMAR, the mechanism is not ignorable and one must develop a very good understanding of the missing data process in order to model it.
Standard options:
Advanced options:
For a great discussion of missing data: http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf
Statsmodels and Scikit learn imputation functions: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html http://www.statsmodels.org/dev/imputation.html
At this point in time, pandas, scikit-learn and statsmodels offer relatively poor methods of data imputation. You will therefore need to build your own imputation processes manually. But this will be worth your time!
"The only really good solution to the missing data problem is not to have any. So in the design and execution of research projects, it is essential to put great effort into minimizing the occurrence of missing data. Statistical adjustments can never make up for sloppy research." Paul Allison, 2001.
*Bias affects all measurements in the same way, while chance errors vary from measurement to measurement. Bias therefore cannot be detected by looking only at the measurements; we need an external or theoretical benchmark as well. Bias is often discussed in conjunction with variance.
Now we go through each variable in our dataset one by one. We check for missing values: whether there are any, how many, and what percentage of the total values they represent for each variable. For this we use the isnull() function to check for missing values and value_counts() to count them. We also look at what values our variables take, by applying value_counts() directly, without isnull().
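Before going column by column, a compact overview can also be useful. The sketch below (an optional shortcut, not part of the original walkthrough) shows the count and share of missing values for every column at once:
In [ ]:
#count of missing values per column
print(df.isnull().sum())
#share of missing values per column
print(df.isnull().mean())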
In [5]:
df['age'].isnull().value_counts()
In [65]:
df['sex'].isnull().value_counts()
Out[65]:
In [66]:
df['sex'].value_counts()
Out[66]:
In [381]:
#the normalize = True parameter returns for us the % of total rather than the absolute number.
df['sex'].value_counts(normalize=True)
Out[381]:
In [67]:
df['chest_pain'].isnull().value_counts()
Out[67]:
In [68]:
df['chest_pain'].value_counts()
Out[68]:
In [69]:
df['rest_bp'].isnull().value_counts()
Out[69]:
In [608]:
#we divide by len(df), to manually calculate the % of total observations that this represents each time.
df['rest_bp'].isnull().value_counts()/len(df)
Out[608]:
In [71]:
df['cholesterol'].isnull().value_counts()
Out[71]:
In [386]:
df['cholesterol'].isnull().value_counts()/len(df)
Out[386]:
In [72]:
df['fasting_bs'].isnull().value_counts()
Out[72]:
In [387]:
df['fasting_bs'].isnull().value_counts()/len(df)
Out[387]:
In [541]:
#standard value_counts() function drops missing values. To avoid this you can add dropna=False argument to function.
df['fasting_bs'].value_counts()
Out[541]:
In [388]:
df['rest_ecg'].isnull().value_counts()/len(df)
Out[388]:
In [75]:
df['rest_ecg'].value_counts()
Out[75]:
In [76]:
df['max_heart_rate'].isnull().value_counts()
Out[76]:
In [389]:
df['max_heart_rate'].isnull().value_counts()/len(df)
Out[389]:
In [78]:
df['exercise_angina'].isnull().value_counts()
Out[78]:
In [390]:
df['exercise_angina'].isnull().value_counts()/len(df)
Out[390]:
In [79]:
df['exercise_angina'].value_counts()
Out[79]:
In [80]:
df['st_depression'].isnull().value_counts()
Out[80]:
In [391]:
df['st_depression'].isnull().value_counts()/len(df)
Out[391]:
In [82]:
df['slope'].isnull().value_counts()
Out[82]:
In [392]:
df['slope'].isnull().value_counts()/len(df)
Out[392]:
In [83]:
df['slope'].value_counts()
Out[83]:
In [84]:
df['fluoroscopy'].isnull().value_counts()
Out[84]:
In [393]:
df['fluoroscopy'].isnull().value_counts()/len(df)
Out[393]:
In [85]:
df['fluoroscopy'].value_counts()
Out[85]:
In [86]:
df['defect'].isnull().value_counts()
Out[86]:
In [394]:
df['defect'].isnull().value_counts()/len(df)
Out[394]:
In [87]:
df['defect'].value_counts()
Out[87]:
In [88]:
df['diagnosis'].isnull().value_counts()
Out[88]:
In [89]:
df['diagnosis'].value_counts()
Out[89]:
In [395]:
len(df.columns)
Out[395]:
In [90]:
#the variables slope, defect and fluoroscopy have 33-47% of missing values
#context. Why?
#can we correct for this without introducing other biases into our dataset?
#Here we will decide to eliminate these variables.
#Data analysts are mere mortals too. Cannot fix everything.
#.copy() gives us an independent dataframe, so the imputation below does not trigger SettingWithCopyWarning
df_red = df[['age', 'sex', 'chest_pain', 'rest_bp', 'cholesterol', 'fasting_bs', 'rest_ecg', 'max_heart_rate',
             'exercise_angina', 'st_depression', 'diagnosis']].copy()
In [126]:
len(df_red)
Out[126]:
In [ ]:
#rest_bp, cholesterol, fasting_bs, rest_ecg, max_heart_rate, exercise_angina, st_depression
In [173]:
#rest_ecg is a categorical variable with only 2% of missing values.
#we make the choice to impute missing values with a straightforward method: the mode
#we are conscious this may introduce bias, but given the low number of missing values in what is
#a small-range categorical variable (no extreme outliers possible), we feel comfortable doing this
df_red['rest_ecg'].fillna(df_red['rest_ecg'].mode().iloc[0], inplace=True)
In [174]:
df_red['rest_ecg'].isnull().value_counts()
Out[174]:
In [278]:
#let's see if there is any overlap between missing values across observations, and what would happen if we
#were to limit our dataset to observations with no missing values
df_clean = df_red[df_red['rest_bp'].notnull() & df_red['cholesterol'].notnull() & df_red['fasting_bs'].notnull() &
                  df_red['max_heart_rate'].notnull() & df_red['exercise_angina'].notnull() &
                  df_red['st_depression'].notnull()]
In [396]:
len(df_clean)
Out[396]:
In [189]:
df_clean.isnull().any()
Out[189]:
This is not an approach we would use in a real-world study, but for the exploratory purposes of this tutorial we can retain the 85% of observations that were properly recorded on all variables. I will add here an example of conditional mean imputation for one of the variables. Statsmodels is very limited in its options, which means very manual work; from version 0.8.0 onwards it offers a MICE function.
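As promised, here is a minimal sketch of conditional mean imputation for one variable. It is purely illustrative and assumes we fill missing cholesterol values with the mean cholesterol of the corresponding sex group (any other grouping variable could be used); it works on a copy and does not feed into the analyses below.
In [ ]:
#conditional mean imputation: fill each missing cholesterol value with the mean cholesterol of that observation's sex group
df_cond = df_red.copy()
df_cond['cholesterol'] = df_cond.groupby('sex')['cholesterol'].transform(lambda s: s.fillna(s.mean()))
#remaining missing cholesterol values after the conditional imputation
print(df_cond['cholesterol'].isnull().sum())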
Now that we have cleaned the data we can move on to univariate analysis: one variable at a time.
Data reporting tells you what is happening, but data analysis tells you why it is happening.
Descriptive statistics
A parameter is calculated from the population, while a statistic is calculated from a sample.
In studying samples, we assume the Central Limit Theorem holds: if you draw enough samples from a population, and each sample is large enough, the distribution of the sample statistics will be normally distributed. The normal curve was discovered around 1720 by Abraham de Moivre; around 1870, Adolphe Quetelet thought of using it as the curve of an ideal histogram.
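A small simulation can make the Central Limit Theorem concrete. The sketch below is purely illustrative: it draws 1000 samples of size 100 from a skewed exponential population and plots the distribution of the sample means, which comes out roughly normal.
In [ ]:
#distribution of sample means drawn from a skewed (exponential) population
np.random.seed(0)
sample_means = [np.random.exponential(scale=2.0, size=100).mean() for _ in range(1000)]
sns.distplot(sample_means)
plt.suptitle('Distribution of 1000 sample means (approximately normal by the CLT)')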
Center, spread, shape
3 measures of center: mean, median and mode
1 measure of spread: standard deviation
2 attributes of shape: modality and skewness
Now we will aim to get an idea of the shape, center, and spread of these variables. We will analyze the shape visually by checking for modality and skewness, check measures of center such as the mean, median and mode, and check the spread through the standard deviation for quantitative variables.
Standard deviation: how far away observations are from their average. In a normal distribution, roughly 68% of observations are within 1 SD of the mean and 95% within 2 SDs.
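We can check this rule of thumb on our own data. The sketch below computes the share of age observations within 1 and 2 standard deviations of the mean; since age is only approximately normal, the shares will be close to, but not exactly, 68% and 95%.
In [ ]:
#share of age observations within 1 and 2 standard deviations of the mean
age_mean = df_clean['age'].mean()
age_std = df_clean['age'].std()
print(df_clean['age'].between(age_mean - age_std, age_mean + age_std).mean())
print(df_clean['age'].between(age_mean - 2*age_std, age_mean + 2*age_std).mean())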
Quantitative variables: shape, center and spread; histogram
Categorical variables: mode; bar chart or frequency distribution
For visualizing one variable: if it is categorical we use a bar chart, e.g. seaborn's countplot function; if it is quantitative, we can combine a kernel density estimate and a histogram with seaborn's distplot function.
For visualizing two variables:
C-Q: bivariate bar graph with seaborn's factorplot (bin/collapse the explanatory variable), categories on the x axis and the mean of the response variable on the y axis
Q-Q: scatterplot with seaborn's regplot
C-C: you can plot them one at a time; the problem with a bivariate graph is that the mean has no meaning in the context of a categorical variable
for further reading: http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/datapresentation/DataPresentation7.html
In [522]:
#Gender
#given that gender is a categorical variable, we use a countplot to visualize it
#we always plot the variable of interest on the x axis, and the count or frequency on the y axis
sns.countplot(x='sex', data = df_clean)
plt.suptitle('Frequency of observations by gender')
Out[522]:
the results definitely suggest a discussion about class imbalance is in order here.
In [194]:
#Diagnosis
#we check the counts per each value of the variable. sort=False tells pandas not to sort the results by values.
#normalize = True tells it to return the relative frequencies rather than the absolute counts
#if we had not cleaned the data, we could add parameter dropna=False so that value_counts does not drop null values
df_clean['diagnosis'].value_counts(sort=False, normalize=True)
Out[194]:
In [410]:
sns.countplot(x='diagnosis', data=df_clean)
plt.suptitle('Frequency distribution of diagnosis state')
Out[410]:
In [411]:
#Let's look at age now
#the describe request gives us the count, mean, std, min, max, as well as the quartiles for the
#respective value distribution
df_clean['age'].describe()
Out[411]:
In [412]:
#sns distplot function combines matplotlib hist() function with sns kdeplot() function.
sns.distplot(df_clean['age'])
plt.suptitle('Distribution of age')
Out[412]:
In [417]:
df_clean['max_heart_rate'].describe()
Out[417]:
In [420]:
sns.distplot(df_clean['max_heart_rate'])
plt.suptitle('Distribution of maximal heart rate')
Out[420]:
In [433]:
sns.swarmplot(x='diagnosis', data=df_clean)
Out[433]:
Their order does not matter here, hence Q-C and C-Q use the same visualization rule
C-Q: bivariate bar graph with seaborn's factorplot, categories on the x axis and the mean of the response variable on the y axis
Q-Q: scatterplot with seaborn's regplot
C-C: you can plot them one at a time; the problem with a bivariate graph is that the mean has no meaning in the context of a categorical variable
In [611]:
#kind = bar asks for a bar graph and ci=None suppresses error bars
sns.factorplot(x='sex', y='age', kind='bar', data=df_clean, ci=None)
plt.suptitle('Gender vs age')
Out[611]:
In [612]:
#categorical explanatory variable 'sex' and quantitative response variable 'rest_bp'
sns.factorplot(x='sex', y='rest_bp', data=df_clean, kind='bar')
Out[612]:
In [423]:
df_clean['chest_pain'].dtype
Out[423]:
In [613]:
#We use a regplot to plot two quantitative variables, age and cholesterol, while also having a regression line suggesting
#any association present
#we always plot the explanatory variable on the x axis, and the response variable on the y axis
sns.regplot(x='age', y='cholesterol', data=df_clean)
Out[613]:
In [614]:
#We see that several observations have a cholesterol of "0". This will need to be investigated subsequently and treated
#as missing values
In [444]:
#how can we gain a better idea of how two categorical variables interact?
#we divide by len(df_clean) so that the shares refer to the cleaned dataset we are actually analyzing
df_clean.groupby('sex')['diagnosis'].value_counts()/len(df_clean)
Out[444]:
In [615]:
#common statistical boxplot visualization
sns.boxplot(x='cholesterol', y = 'sex', data=df_clean)
plt.suptitle('Cholesterol levels by gender')
Out[615]:
In [617]:
#describing the dataset/the variables of the dataset, after we group the values in the dataset by diagnostic category
#the describe function gives us the key statistics for each quantitative variable in the dataset
df_clean.groupby('diagnosis').describe()
Out[617]:
In [279]:
#before starting to manipulate the dataset itself, we make a copy, and will work on the copy rather than the original
df_clean_copy = df_clean.copy()
We call them inferential statistics because we try to infer the population's parameters from our sample.
The process is always the same and involves hypothesis testing:
Typical H0: there is no relationship between the explanatory and response variable
Typical H1: there is a statistically significant relationship
Bivariate statistical tools: Three main tools: ANOVA; chi-square; correlation coefficient
Type 1 vs Type 2 errors
Type 1 error: the incorrect rejection of a true null hypothesis
Type 2 error: retaining a false null hypothesis
We will test the null hypothesis that age and diagnosis are not related. The type of variables we have (explanatory/response and categorical/quantitative for each) determines the type of statistical tools we will use:
Explanatory categorical and response quantitative: ANOVA
Explanatory categorical and response categorical: chi-square test
Explanatory quantitative and response categorical: classify/bin the explanatory variable and use a chi-square test
Explanatory quantitative and response quantitative: Pearson correlation
a result is statistically significant if it is unlikely to be the result of chance
before performing these analyses, one needs to use the .dropna() function to include only valid data
In [207]:
#Going the wrong way?
#In regression analysis you can swap your predictor and response variables. This is because they may be correlated,
#in which case X is correlated with Y and by definition Y is correlated with X. There is no directionality implied, which
#is also why you cannot talk of causation, but only of correlation.
test1 = smf.ols(formula = 'age ~ C(diagnosis)', data = df_clean_copy).fit()
print(test1.summary())
#*Whenever we have an explanatory categorical variable with more than two levels, we need to explicitly state this for
#statsmodels. Here we state it by wrapping the variable name in C(). By definition, statsmodels then
#converts the first level of the variable into the reference group, therefore showing only n-1 levels in the results.
#Each level shown is, in brief, a comparison to the reference group. Here the reference group is a diagnosis of 0.
What interests us in an ANOVA table:
R squared
The F statistic
The p value (here called 'Prob (F-statistic)')
The coefficients of each explanatory variable and the associated p values (here the 'coef' and 'P>|t|' columns)
The 95% confidence interval for our coefficients (here called '[95.0% Conf. Int.]')
The 'R squared' statistic is a measure of how much of the variability in the response variable, age here, is explained by our model. We see here that it is only 0.134, so diagnostic group only helps us explain 13.4% of the variability in age. This can indicate either that we omit important explanatory variables that we could add to the model, or that we are missing the structure of the association.
The F statistic: an F test is any statistical test in which the test statistic has an F distribution under the null hypothesis. The F statistic = variation among sample means / variation within sample groups. The ANOVA F test asks: are the differences among the sample means due to true differences among the population means, or merely due to sampling variability?
The p value in an ANOVA table is the probability of getting an F value as large or larger if H0 is true, i.e. the probability of finding this value if there is no difference between the sample means. I would like to thank M.N. for sending me the blog below, which contains a couple of great posts on common statistical concepts:
http://blog.minitab.com/blog/adventures-in-statistics-2/how-to-correctly-interpret-p-values
In an ANOVA table, we can decide if the relationship is statistically significant by checking the value of the F statistic against the relevant statistical table (F=28.42 above), or by looking at the p value. The latter is easiest. A long tradition in statistics is to consider a p value below a threshold (alpha) of 0.05 as indicative of a significant relationship. When conducting such tests, you need to decide what alpha value you feel comfortable with, and from then onwards you use a binary framework: either a statistic has an associated p value below the threshold you decided on at the beginning, or not. In this sense, a p value of 0.0001 is not indicative of a 5 times stronger relationship than a p value of 0.0005. In our case a p value of 5.56e-22 means the relationship is statistically significant (5.56e-22 < 0.05).
The coefficients for each variable. Here, for example, our model would be: age = 50.3025 + 2.6438*diagnosis_1 + 8.1279*diagnosis_2 + 8.8770*diagnosis_3 + 8.9248*diagnosis_4. We can also say that having a diagnosis of 1 is associated with an age 2.6438 years higher, assuming we hold all other explanatory variables in our model constant at 0. A negative but significant coefficient indicates that the explanatory variable and the response variable are negatively associated: as one increases, the other decreases. The higher the coefficient, the more impact it has on the value of the response variable under our model. However, these coefficients result from our sample, meaning that the population parameters may differ from them. The confidence intervals give us a range in which these parameters lie for the population, with 95% confidence. For example, we can be 95% confident that the population parameter for diagnosis_1 is between 1.134 and 4.153.
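As a worked example with these coefficients, the model's predicted mean age for the diagnosis = 2 group is 50.3025 + 8.1279 ≈ 58.4 years, compared with roughly 50.3 years for the reference group (diagnosis = 0).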
Whenever the explanatory variable has more than 2 levels, we also need to perform post hoc statistical tests to better understand the relationship between the explanatory and response variables: we know the groups tested differ overall, but not exactly where or how they differ. For explanatory variables with multiple levels, the F test and p value do not tell us why the group means are not equal, or how; there are many ways in which this can be the case.
How are the response and explanatory variables associated per level of the explanatory variable?
post hoc tests aim to protect against type 1 error when explanatory variable is multilevel
In [209]:
#now we examine the means and standard deviations
grouped1_mean = df_clean_copy.groupby('diagnosis').mean()['age']
print(grouped1_mean)
In [210]:
grouped1_std = df_clean_copy.groupby('diagnosis').std()['age']
print(grouped1_std)
In [211]:
#given that we have an explanatory categorical variable with multiple levels, we use the
#Tukey HSD (honestly significant difference) test
#other post hoc tests include:
#Holm's T
#Least Significant Difference
tukey1 = multi.MultiComparison(df_clean_copy['age'], df_clean_copy['diagnosis'])
res1 = tukey1.tukeyhsd()
print(res1.summary())
We will now test another hypothesis:
Hypothesis(0)(a): the presence of chest pain and the diagnosis (0 or 1) are independent
Alternative Hypothesis 1: presence of chest pain and diagnosis are not independent
Paradox: you can get better results with great feature engineering and a poor model than with poor feature engineering but a great model
"A feature is an attribute that is useful to your problem." Dr. Jason Brownlee
"The algorithms we used are very standard for Kagglers... We spent most of our efforts in feature engineering."
Xavier Conort, #1 Kaggler in 2013
Aims to convert data attributes into data features
Aims to optimize data modelling
Requires an understanding of the dataset and the research problem, and of the model you plan on using. It can be domain driven or data driven. E.g. for an SVM with a linear kernel you need to manually construct nonlinear interactions between features and feed them as input to your SVM model, whereas an SVM with a polynomial kernel will capture them naturally. Another example: SVMs are very sensitive to the scale (dimensions) of features, while decision trees and random forests are not. With tabular data you combine, aggregate, split and/or decompose features in order to create new ones. Given an output y and a feature x, you can try the following transforms first: e^x, log(x), x^2, x^3. An indicator of a useful transformation is that the correlation between y and the transformed feature x' is higher than the correlation between y and x; the best way to validate this is to check your model error with and without the transformed feature (see the sketch below).
http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/
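As a minimal, purely illustrative sketch of that correlation check, using two quantitative variables from our own dataset (age and max_heart_rate, chosen only for demonstration):
In [ ]:
#compare the correlation of max_heart_rate with age against simple transforms of age
x = df_clean_copy['age']
y = df_clean_copy['max_heart_rate']
print('raw x:', y.corr(x))
print('e^x:', y.corr(np.exp(x)))
print('log(x):', y.corr(np.log(x)))
print('x^2:', y.corr(x**2))
print('x^3:', y.corr(x**3))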
In [6]:
diagnosis_dic = {0:0, 1:1, 2:1, 3:1, 4:1}
df_clean_copy['diagnosis_binary'] = df_clean_copy['diagnosis'].map(diagnosis_dic)
In [474]:
df_clean_copy['diagnosis_binary'].value_counts()
Out[474]:
In [222]:
#contingency table of observed counts
#the crosstab function allows us to cross one variable with another
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct1 = pd.crosstab(df_clean_copy['diagnosis_binary'], df_clean_copy['chest_pain'])
print(ct1)
In [223]:
#column percentages
colsum = ct1.sum(axis=0)
colpct = ct1/colsum
print(colpct)
In [224]:
#chi square test
#Expected counts: probabilities assuming the events are independent, p(1) * p(2), i.e. column total * row total / table total
#The chi square statistic summarizes the difference between our observations and what we would expect if H0 were true
#We rely on the p value, as the relevant distribution determines whether the chi square statistic itself is large or not
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
In [526]:
#Explanatory variable with multiple levels!
#we would have to do a pairwise comparison between every two groups of the explanatory
#variable vs the response variable
#This calls for a Bonferroni adjustment - we divide the significance threshold by the number of pairwise comparisons,
#and test each pair against the adjusted threshold.
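A minimal sketch of such Bonferroni-adjusted pairwise comparisons (illustrative only: it assumes we compare every pair of chest_pain levels against diagnosis_binary and test each pair at 0.05 divided by the number of comparisons):
In [ ]:
#pairwise chi-square tests between levels of chest_pain, with a Bonferroni-adjusted significance threshold
import itertools
levels = df_clean_copy['chest_pain'].dropna().unique()
pairs = list(itertools.combinations(levels, 2))
adjusted_alpha = 0.05 / len(pairs)
for a, b in pairs:
    subset = df_clean_copy[df_clean_copy['chest_pain'].isin([a, b])]
    ct = pd.crosstab(subset['diagnosis_binary'], subset['chest_pain'].astype(float))
    chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
    print(a, 'vs', b, ': p =', p, '| significant at adjusted alpha', adjusted_alpha, ':', p < adjusted_alpha)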
In [225]:
df_clean_copy.columns
Out[225]:
In [226]:
ct2 = pd.crosstab(df_clean_copy['diagnosis_binary'], df_clean_copy['sex'])
print(ct2)
In [227]:
#column percentages
colsum2 = ct2.sum(axis=0)
colpct2 = ct2/colsum2
print(colpct2)
In [228]:
#chi square test
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)
In [510]:
#Moderators
#a moderator is a third variable that affects the direction and/or strength between your explanatory and response variables
#the question is, is our response variable associated with our explanatory variable, for each level of our third variable?
In [240]:
#let's see if sex is a moderator in the statistically significant relationship between chest pain and diagnosis
#in this dataset sex is coded 1 = male and 0 = female
df_clean_copy_women = df_clean_copy[df_clean_copy['sex'] == 0]
len(df_clean_copy_women)
Out[240]:
In [241]:
df_clean_copy_men = df_clean_copy[df_clean_copy['sex'] == 1]
len(df_clean_copy_men)
Out[241]:
In [242]:
#contingency table of observed counts
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct3 = pd.crosstab(df_clean_copy_men['diagnosis_binary'], df_clean_copy_men['chest_pain'])
print(ct3)
In [243]:
#column percentages
colsum = ct3.sum(axis=0)
colpct = ct3/colsum
print(colpct)
In [244]:
#chi square test
print('chi-square value, p value, expected counts')
cs3 = scipy.stats.chi2_contingency(ct3)
print(cs3)
In [246]:
#contingency table of observed counts
#when creating contingency tables, we put the response variable first (therefore vertical in table),
#and the explanatory variable second, therefore horizontal at the top of the table.
ct4 = pd.crosstab(df_clean_copy_women['diagnosis_binary'], df_clean_copy_women['chest_pain'])
print(ct4)
In [247]:
#column percentages
colsum = ct4.sum(axis=0)
colpct = ct4/colsum
print(colpct)
In [248]:
#chi square test
print('chi-square value, p value, expected counts')
cs4 = scipy.stats.chi2_contingency(ct4)
print(cs4)
In [535]:
df_clean_copy_women.groupby('chest_pain')['diagnosis'].value_counts()
Out[535]:
The relationship between chest pain and the diagnosis holds for both levels of the sex variable, hence sex is not a moderator.
We would test for moderator variables in the same way in the case of a quantitative response variable.
Identifying a confounding variable does not allow us to establish causation, just to get closer to a causal connection. Due to the infinite number of possible lurking variables, observational studies cannot really establish causation. A lurking or confounding variable is a third variable that is associated with both the explanatory and response variables. E.g. x = number of firefighters, y = damage caused by a fire: a plot would suggest that more firefighters cause more fire damage, when in reality there is a third confounding variable that influences both, the seriousness of the fire. In a study we want to demonstrate that our statistical relationship is valid even after controlling for confounders.
Now we will test whether there is a relationship between two quantitative variables, age and cholesterol. For this we use the Pearson correlation test. r, which goes from -1 to 1, only tells us whether the two variables are linearly related; they may be related in nonlinear ways, so it is always important to look at r alongside a scatterplot of the two variables. r squared is a measure of how much variability in one variable can be explained by the other variable. To calculate the Pearson coefficient we need to remove all missing values. Please remember that when two variables are correlated it is possible that: X causes Y, or Y causes X, or Z causes both X and Y, or X and Y are correlated by chance - a spurious correlation.
In [231]:
df_clean_copy.columns
Out[231]:
In [450]:
scat1 = sns.regplot(x='age', y = 'cholesterol', fit_reg=True, data = df_clean_copy)
plt.xlabel('Age')
plt.ylabel('Cholesterol')
plt.suptitle('Relationship between age and cholesterol')
scat1
Out[450]:
In [630]:
#the r coefficient is a measure of association, of how closely points are clustered around a line
#correlations are always between -1 and 1
#r measures solely linear association
#association, not causation
#it is a number without units
#it can mislead in the presence of outliers or a nonlinear association -> always draw a scatter plot as well and check visually
#when you look at a scatter plot, you look at the direction, form and strength of the relationship
#if you identify a nonlinear association, with one or more curves, then you know you ought to add nonlinear explanatory
#variables to your regression model, to aim to capture this nonlinear association as well.
#ecological correlations based on averages can be misleading and overstate strength of associations for individual
#units
print('association between age and cholesterol')
print(scipy.stats.pearsonr(df_clean_copy['age'], df_clean_copy['cholesterol']))
Multivariate linear regression for a quantitative response variable
Logistic regression for a binary categorical response variable
Assumptions of these types of models:
Normality: the residuals from our linear regression model should be normally distributed. If they are not, our model may be misspecified.
Linearity: association between explanatory and response variable is linear
Homoscedasticity (or assumption of constant variance): variability in the response variable is the same at all levels of the explanatory variable. i.e. if residuals are spread around the regression line in a similar manner as you move along the x axis (values of the explanatory variable)
Independence: observations are not correlated with each other. Longitudinal data can violate this assumption, as can hierarchically nested/clustered data, e.g. looking at students grouped by class. This is the most serious assumption to violate, and it cannot be fixed by modifying the variables: the data structure itself is the problem.
We have to contend with:
Multicollinearity: explanatory variables are highly correlated with each other. this can mess up your parameter estimates or make them highly unstable. Signs:
Outliers: can affect your regression line
A multiple regression model allows us to find the relationship between one explanatory variable and the response variable while controlling for (holding constant at 0) all the other variables. For interpretability, each variable needs to include a meaningful value of 0, so as to make it easier to interpret the coefficients (what does it mean to hold cholesterol constant at 0 if its range has no value of 0?). For a categorical variable, we can just recode one of the values to be 0. For a quantitative variable, we have to center it. Centering = subtracting the mean of a variable from its values; we are therefore recoding it so that its mean = 0. If a quantitative explanatory variable already includes a meaningful value of 0, we may not need to center it. In linear regression we only center the explanatory variables, not the response one. In logistic regression we always need to code the binary response variable so that 0 means the outcome did not occur and 1 means it occurred; this is true whether the outcome is positive or negative.
We will create a multiple regression model investigating the relationship between our explanatory variables and our response variable, diagnosis. We will first center the explanatory variables: for categorical variables, one of the categories needs to be 0; for quantitative variables, we subtract the mean from each value.
Notes: do not center the response variable. If using logistic regression, do recode the binary response variable to make sure one class is coded as 0.
In [262]:
df_clean_copy.columns
Out[262]:
In [270]:
#categorical variables: sex, chest_pain, fasting_bs, rest_ecg, exercise_angina
#quantitative variables: age, rest_bp, cholesterol, max_heart_rate, st_depression
In [280]:
df_clean_copy['chest_pain'].value_counts()
Out[280]:
In [281]:
recode_chest_pain = {1:0, 2:1, 3:2, 4:3}
df_clean_copy['chest_pain_p'] = df_clean_copy['chest_pain'].map(recode_chest_pain)
In [282]:
df['fasting_bs'].value_counts()
Out[282]:
In [283]:
df['rest_ecg'].value_counts()
Out[283]:
In [284]:
df['exercise_angina'].value_counts()
Out[284]:
In [285]:
df_clean_copy['age_c'] = df_clean_copy['age'] - df_clean_copy['age'].mean()
In [286]:
df_clean_copy['rest_bp_c'] = df_clean_copy['rest_bp'] - df_clean_copy['rest_bp'].mean()
In [287]:
df_clean_copy['cholesterol_c'] = df_clean_copy['cholesterol'] - df_clean_copy['cholesterol'].mean()
In [288]:
df_clean_copy['max_heart_rate_c'] = df_clean_copy['max_heart_rate'] - df_clean_copy['max_heart_rate'].mean()
In [289]:
df_clean_copy['st_depression_c'] = df_clean_copy['st_depression'] - df_clean_copy['st_depression'].mean()
In [292]:
df_clean_copy.columns
Out[292]:
In [295]:
df_clean_copy_c = df_clean_copy[['age_c', 'sex', 'chest_pain_p', 'rest_bp_c', 'cholesterol_c',
'fasting_bs', 'rest_ecg', 'max_heart_rate_c', 'exercise_angina',
'st_depression_c', 'diagnosis_binary']]
In [296]:
df_clean_copy_c.columns
Out[296]:
In [459]:
model1 = smf.ols(formula = 'age_c ~ sex', data = df_clean_copy_c).fit()
print(model1.summary())
We now add explanatory variables to our model one at a time. In doing this, we keep an eye on what impact adding an explanatory variable has on our model. Sometimes, for example, we add a variable and another variable that was statistically significant (had a p value below 0.05) suddenly becomes insignificant. This means the new variable is a confounding variable in the relationship between that other variable and the response variable.
In [461]:
model2 = smf.ols(formula = 'age_c ~ sex + cholesterol_c', data=df_clean_copy_c).fit()
print(model2.summary())
In [635]:
#here we will skip a couple of steps and create the overall model with all variables.
In [471]:
model3 = smf.ols(formula = 'age_c ~ sex + C(chest_pain_p) + rest_bp_c + cholesterol_c + fasting_bs + C(rest_ecg) + \
                  max_heart_rate_c + exercise_angina + st_depression_c + diagnosis_binary', data = df_clean_copy_c).fit()
print(model3.summary())
We identified some explanatory variables that are associated with age, but our model overall explains barely 27% of the variation in the response variable. Let's run some diagnostics.
In [466]:
#Q-Q plot for normality
fig1 = sm.qqplot(model3.resid, line='r')
#the red line represents the residuals we would expect if the model residuals were normally distributed
#our residuals below deviate somewhat from the red line, especially at the lower and higher quantiles, meaning they do not
#follow a normal distribution. This suggests there is structure in the association that our model does not fully capture.
#We could add more explanatory variables to try to better explain any curvilinear association.
In [632]:
#simple plot of residuals
stdres = pd.DataFrame(model3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
#resid_pearson normalizes our model's residuals
#ls='none' means points will not be connected
#we expect most residuals to fall within 2 standard deviations of the mean. Values beyond 2 indicate outliers, and
#beyond 3 extreme outliers.
#if more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5%
#have one greater than or equal to 2, there is evidence that the fit of the model is poor. The top cause of this is the
#omission of important explanatory variables from our model.
#in linear regression the reference line for the standardized residuals is horizontal at zero
#normalizing or standardizing residuals amounts to making them have a mean of 0 and sd of 1, so that they can be compared
#to a standard normal distribution
#if residuals show a strong pattern (up, down, polynomial) then it is a good indication of nonlinearity
#in the underlying relationship
Out[632]:
In [469]:
#leverage plot
fig4 = sm.graphics.influence_plot(model3, size=8)
print(fig4)
#leverage, always between 0 and 1. At 0, a standardized residual has no influence on our model.
#leverage is a measure of how much influence a specific residual (and therefore observation) has on our model.
#we see that we have extreme outliers, but they are low leverage, meaning they do not have an undue influence on our
#estimation of the regression model.
#we also have an observation that is both high leverage and an outlier (observation 33 in our run). We would need to
#further investigate the corresponding observation.
In [470]:
#now let's focus on our actual response variable in the study
#since it is a binary variable, we will need to use a logistic regression model
In [298]:
lreg1 = smf.logit(formula = 'diagnosis_binary ~ C(chest_pain_p)', data = df_clean_copy_c).fit()
print(lreg1.summary())
In [454]:
#again, normally we would add variables one at a time, but here we will go faster.
lreg2 = smf.logit(formula = 'diagnosis_binary ~ age_c + sex + C(chest_pain_p) + cholesterol_c', data = df_clean_copy_c).fit()
print(lreg2.summary())
In [472]:
lreg3 = smf.logit(formula = 'diagnosis_binary ~ age_c + sex + C(chest_pain_p) + rest_bp_c + \
cholesterol_c + fasting_bs + C(rest_ecg) + max_heart_rate_c + \
exercise_angina + st_depression_c', data = df_clean_copy_c).fit()
print(lreg3.summary())
In [302]:
#however, for logistic regression it makes much more sense to calculate odds ratios
#this is because in binary logistic regression we only deal with probabilities of an outcome (response variable)
#being 0 or 1.
#odds ratios are calculated by exponentiating the coefficients reported in the regression table.
#Odds ratio (OR) for an explanatory variable:
#if OR=1, there is no meaningful association between the explanatory and response variables
#if OR<1, the response variable becomes less likely as the explanatory one increases
#if OR>1, the response variable becomes more likely as the explanatory one increases
print('Odds Ratios')
print(np.exp(lreg3.params))
#Interpretation of OR
#in this dataset sex is coded 1 = male, 0 = female, so here we would say that based on our sample men were roughly
#3.7 times more likely than women to have a diagnosis of 1.
In [303]:
# odd ratios with 95% confidence intervals
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (np.exp(conf))
#we have 95% confidence that the sex odds ratio will be between 2.24 and 6.12 for the population.
In [305]:
#statsmodels offers significantly fewer residual analysis options for logistic regression than for linear regression
#yet there is no reason we cannot get information by studying the residuals
#simple plot of residuals
stdres = pd.DataFrame(lreg3.resid_pearson)
plt.plot(stdres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
#resid_pearson normalizes our model's residuals
#ls='none' means points will not be connected
#we expect most residuals to fall within 2 standard deviations of the mean. Values beyond 2 are outliers, and beyond 3
#extreme outliers.
#if more than 1% of our observations have standardized residuals with an absolute value greater than 2.5, or more than 5%
#have one greater than or equal to 2, there is evidence that the fit of the model is poor. The largest cause of this is the
#omission of important explanatory variables from our model.
#in linear regression the reference line for the standardized residuals is horizontal at zero
#if residuals show a strong pattern (up, down, polynomial) then it is a good indication of nonlinearity
#in the underlying relationship
Out[305]:
In [673]:
#Statsmodels and scikit learn still offer few and poor 'off the shelf' options for imputing missing values in more
#statistically sound ways.
#As of statsmodels 0.8.0 at least, statsmodels offers the MICE imputation function
#a description is available here: http://www.statsmodels.org/dev/imputation.html
#and the details behind the implementation are here:
#http://www.statsmodels.org/dev/_modules/statsmodels/imputation/mice.html
#to be able to use MICE you will need to update the statsmodels version that ships with Anaconda to 0.8.0
#I have yet to find good tutorials/examples of the MICE function
#Let's create here a dataset that keeps the missing values in one of our variables (st_depression)
df_clean_mice = df_red[df_red['rest_bp'].notnull() & df_red['cholesterol'].notnull() & df_red['fasting_bs'].notnull() &
                       df_red['max_heart_rate'].notnull() & df_red['exercise_angina'].notnull()]
In [674]:
df_clean_mice.isnull().any()
Out[674]:
In [497]:
df_clean_mice.describe()
Out[497]:
In [694]:
#we keep the quantitative variables and build the binary diagnosis outcome for the analysis model
df_clean_mice = df_clean_mice[['age', 'rest_bp', 'cholesterol', 'max_heart_rate', 'st_depression', 'diagnosis']].copy()
df_clean_mice['diagnosis_binary'] = (df_clean_mice['diagnosis'].astype(float) > 0).astype(int)
df_clean_mice = df_clean_mice.drop('diagnosis', axis=1)
In [695]:
import statsmodels.imputation.mice as mice
#we wrap our dataframe in a MICEData object; MICEData will impute the missing st_depression values
imp = mice.MICEData(df_clean_mice)
#we specify our analysis model: a logistic regression of the binary diagnosis on the other variables
formula_mice = 'diagnosis_binary ~ age + rest_bp + cholesterol + max_heart_rate + st_depression'
#We now take the MICEData object and the formula specification, and combine the imputation and the analysis
#(sm.Logit is the analysis model class; MICE fits it on each imputed dataset and pools the results)
mice_model = mice.MICE(formula_mice, sm.Logit, imp)
#results = mice_model.fit(10, 10)
#print(results.summary())
#various plots and summary statistics are available and/or being developed with the MICE package.
For me machine learning is simply the branch of statistics that has traditionally focused on working with large, heterogeneous datasets, characterized by numerous variables that interact in nonlinear ways. It is a collection of tools, just like statistics in general.
An amazing article on traditional parametric statistics vs machine learning is: "Statistical Modeling: The Two Cultures," Leo Breiman, 2001.
Machine learning can be used for:
Accuracy and the test error rate: accuracy is the rate at which an algorithm correctly classifies or estimates, and the test error rate is its complement; the goal is to minimize the test error rate. (In linear regression, which we saw before, fit was assessed through the mean squared error; in logistic regression, accuracy = how well the model classifies observations.)
Supervised vs unsupervised learning In supervised learning we work with labeled data In unsupervised learning we aim to find patterns in unlabeled data
In machine learning we will regularly face the bias-variance trade-off. Variance = the change in parameter estimates across different data sets; bias = how far off the model's estimated values are from the true values. Ideally we want low variance and low bias, but they are negatively associated: as one decreases, the other increases. Generally, as the complexity of a model increases, variance increases and bias decreases; simple models will have lower variance, but also be more biased.
We will briefly apply three machine learning algorithms here, which we could use in our exploratory data analysis:
Decision trees Random forests Support vector machines
I intend to expand this section in the future. It is difficult today not to take advantage of some of the machine learning tools available, including when it comes to data exploration. We often forget that key applications of machine learning tools include helping us gain insights faster than with more manual methods, and supporting feature engineering and/or selection.
In the data exploration phase, you would use very raw, off the shelf, machine learning algorithms, simply for exploratory/descriptive purposes
Decision trees
Note: decision trees cannot handle missing data!
In [553]:
df_clean_copy.columns
Out[553]:
In [566]:
#although decision trees and random forests are particularly suitable for dealing with categorical variables, and in
#many statistical packages we can input categorical variables directly, because of how they are implemented in
#scikit-learn we cannot input categorical variables directly in this package. We first have to encode them in a
#process called one-hot encoding. This creates a separate column (variable) for each value of our explanatory
#categorical variable. For this we use pandas' get_dummies function
df_sex = pd.get_dummies(df_clean_copy['sex'], prefix = 'sex')
In [569]:
df_chest_pain = pd.get_dummies(df_clean_copy['chest_pain'], prefix = 'chest_pain')
df_fasting_bs = pd.get_dummies(df_clean_copy['fasting_bs'], prefix = 'fasting_bs')
df_rest_ecg = pd.get_dummies(df_clean_copy['rest_ecg'], prefix = 'rest_ecg')
df_exercise_angina = pd.get_dummies(df_clean_copy['exercise_angina'], prefix='exercise_angina')
In [570]:
df_merged = pd.concat([df_clean_copy, df_sex, df_chest_pain, df_fasting_bs, df_rest_ecg, df_exercise_angina], axis=1)
In [571]:
df_merged.columns
Out[571]:
In [574]:
df_dt = df_merged[['age', 'sex_0.0', 'sex_1.0', 'chest_pain_1.0', 'chest_pain_2.0', 'chest_pain_3.0', \
'chest_pain_4.0', 'rest_bp', 'cholesterol', 'fasting_bs_0.0', 'fasting_bs_1.0', \
'rest_ecg_1.0', 'rest_ecg_2.0', 'max_heart_rate', 'exercise_angina_0.0', 'exercise_angina_1.0',\
'st_depression', 'diagnosis_binary']]
In [575]:
#here we select as predictors all variables in our dataset except for the response one
#.loc is used for label-based selection (.ix has been deprecated in pandas)
predictors = df_dt.loc[:, df_dt.columns != 'diagnosis_binary']
In [577]:
predictors.head(5)
Out[577]:
In [578]:
#we select as the target our response variable
target = df_dt['diagnosis_binary']
In [579]:
#we create our training and testing datasets. Each will have its predictors and its target (response variable).
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size = 0.4)
In [580]:
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)
In [581]:
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
In [582]:
predictions = classifier.predict(pred_test)
In [583]:
sklearn.metrics.confusion_matrix(tar_test, predictions)
Out[583]:
In [664]:
#let's look at which of our variables it considers most important:
print(classifier.feature_importances_)
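The raw array above follows the order of our predictor columns. A small convenience step (not part of the original notebook) pairs each importance with its column name and sorts them:
In [ ]:
#pair each feature importance with its column name and sort in descending order
importances = pd.Series(classifier.feature_importances_, index=predictors.columns)
print(importances.sort_values(ascending=False))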
In [645]:
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[645]:
Random forests
Random forests, an ensemble learning method, are a more sophisticated way of using trees. They were pioneered by Tin Kam Ho, Leo Breiman (who also created 'bagging') and Adele Cutler. For a full discussion, please see:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In [653]:
classifier2 = RandomForestClassifier(n_estimators = 25)
classifier2 = classifier2.fit(pred_train, tar_train)
In [654]:
predictions2 = classifier2.predict(pred_test)
In [655]:
sklearn.metrics.confusion_matrix(tar_test, predictions2)
Out[655]:
In [672]:
#let's look at which of our variables the random forest considers most important:
#(these follow the order in which we input them to the random forest i.e. our dataset)
print(classifier2.feature_importances_)
In [656]:
sklearn.metrics.accuracy_score(tar_test, predictions2)
Out[656]:
Random forests typically outperform decision trees, in particular when our decision tree model exhibits instability. The lower accuracy here is perhaps a result of our implementation of the random forest: we would need to tune its parameters in order to optimize it for the problem at hand.
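A minimal sketch of such tuning with scikit-learn's GridSearchCV (illustrative only; the small parameter grid below is an arbitrary starting point, not a recommendation):
In [ ]:
#small grid search over a few random forest hyperparameters, scored by cross-validated accuracy on the training set
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [25, 100, 250], 'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid = grid.fit(pred_train, tar_train)
print(grid.best_params_)
print(grid.best_score_)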
Support vector machines
SVMs are a set of supervised learning methods, used for classification and regression. They handle nonlinearity, through kernelization, much better than a standard logistic classifier.
A couple of very good explanations:
http://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
http://scikit-learn.org/stable/modules/svm.html
In [657]:
from sklearn import svm
classifier3 = svm.SVC()
classifier3 = classifier3.fit(pred_train, tar_train)
predictions3 = classifier3.predict(pred_test)
In [658]:
sklearn.metrics.confusion_matrix(tar_test, predictions3)
Out[658]:
In [659]:
sklearn.metrics.accuracy_score(tar_test, predictions3)
Out[659]:
We obtain a relatively low accuracy. Once more, as for random forests, the more complex a machine learning tool is, the more work we need to do to adapt it to our problem and hopefully reach a high accuracy for our model.
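One simple adaptation worth trying, given that SVMs are sensitive to the scale of the features, is to standardize the predictors before fitting. This is a sketch under that assumption, not a tuned model:
In [ ]:
#standardize the predictors (mean 0, sd 1) before fitting the SVM, since SVMs are sensitive to feature scales
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
pred_train_scaled = scaler.fit_transform(pred_train)
pred_test_scaled = scaler.transform(pred_test)
classifier4 = svm.SVC()
classifier4 = classifier4.fit(pred_train_scaled, tar_train)
predictions4 = classifier4.predict(pred_test_scaled)
print(sklearn.metrics.accuracy_score(tar_test, predictions4))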
MOOCS
This is perhaps the best introductory data analysis course I have found online; it covers many of the topics treated here in more depth, in both SAS and pandas: https://www.coursera.org/learn/data-visualization
A similarly good course but in R: https://www.coursera.org/specializations/jhu-data-science
Another course I really like is Bill Chambers' Udemy one: https://www.udemy.com/data-analysis-in-python-with-pandas/
For machine learning algorithms and their implementation: https://www.coursera.org/specializations/machine-learning
Statistics books - classics:
Machine learning readings
https://www.amazon.com/Machine-Learning-Python-Techniques-Predictive/dp/1118961749/ref=sr_1_1?ie=UTF8&qid=1499250696&sr=8-1&keywords=machine+learning+in+python+bowles
https://www.amazon.com/Machine-Learning-Hands-Developers-Professionals/dp/1118889061/ref=sr_1_1?ie=UTF8&qid=1499250725&sr=8-1&keywords=Machine+learning+hands+on+for+developers+and+technical+professionals+Bell
https://www.amazon.com/Fundamentals-Machine-Learning-Predictive-Analytics/dp/0262029448/ref=sr_1_1?ie=UTF8&qid=1499251042&sr=8-1&keywords=fundamentals+of+machine+learning
Following R.S.'s question on how to get more practice and go further after this tutorial: I genuinely believe that there is no teacher like experience. What I find most useful is to go to a large online repository of datasets, e.g.
https://archive.ics.uci.edu/ml/datasets.html
or
https://bigml.com/gallery/datasets
Choose datasets that 1. focus on a topic that you find interesting, and 2. are relatively clean i.e. it’s very frustrating to spend a lot of time just cleaning data - that is an exercise in itself, and something you can learn by itself at your convenience (unless you are already a pro at it of course).
And then take one dataset from the sources above each week, and analyze it: Import it, visualize it, manage your data, analyze it with regression, try to improve your models with feature engineering, and then apply any machine learning methods to it as well. And every week make sure you do something new/add something i.e. one different type of graph, one different type of residuals test for your regression, (feature engineering by definition will be different for each dataset), one new machine learning model.
Document each one of them, build a notebook for each, and upload them to your github/portfolio!
I genuinely believe this is the best, and the most ‘realistic’ way of learning and becoming an applied data analyst as well.
Once you are comfortable with such tasks, you can of course try to compete in Kaggle for example, whether to replicate past performances or in current competitions.
Thank you and please don't hesitate to contact me if I can assist you further.