Analysis of fake data



In [1]:

    
# Making sure Julia is working properly
+(2, 2)









    Out[1]:





4

In this lesson

Import packages
Import dataset (in csv format) using the DataFrames package
Change the coded values
Have a look around the imported dataset
Descriptive statistics, including simple plotting with the Gadfly package
Inferential statistics using the HypothesisTests package, including how to choose between parametric and nonparametric tests

Importing packages



In [2]:

    
# Install any missing package once, e.g. using Pkg; Pkg.add("DataFrames")



In [3]:

    
using DataFrames









    



┌ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
└ @ Base loading.jl:1260



In [4]:

    
using Gadfly









    



┌ Info: Precompiling Gadfly [c91e804a-d5a3-530f-b6f0-dfbca275c004]
└ @ Base loading.jl:1260
WARNING: Method definition dot(Any, Any, Any) in module LinearAlgebra at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\generic.jl:924 overwritten in module Optim at C:\Users\juank\.julia\packages\Optim\Agd3B\src\multivariate\precon.jl:23.
  ** incremental compilation may be fatally broken for this module **




In [5]:

    
using StatsBase



In [6]:

    
using HypothesisTests









    



┌ Info: Precompiling HypothesisTests [09f84164-cd44-5f33-b23f-e6b0d136a0d5]
└ @ Base loading.jl:1260



In [7]:

    
using Distributions



In [8]:

    
#using Plotly



In [9]:

    
#include("plotly_credentials.jl")

Importing the data file



In [8]:

    
df = readtable("CCS.csv");









    



┌ Warning: readtable is deprecated, use CSV.read from the CSV package instead
│   caller = top-level scope at In[8]:1
└ @ Core In[8]:1



In [10]:

    
first(df, 6)









    Out[10]:




6 rows × 6 columns
Row  PatientID  Cat1     Cat2     Var1      Var2      Var3
     Int64⍰     String⍰  String⍰  Float64⍰  Float64⍰  Float64⍰
1    1          A        C        38.2568   5.93913   35.0579
2    2          A        C        17.8317   5.34754   21.131
3    8          A        B        16.0218   6.60709   60.9436
4    9          A        C        45.1158   6.00733   21.8797
5    16         A        C        20.448    8.54819   20.6623
6    18         A        B        28.3549   7.95642   33.1807



In [11]:

    
# Making sure there are no missing values
# Looking at the data types
# showcols is no longer defined in DataFrames; describe(df) lists the element type
# and the number of missing values for each column
describe(df)

Changing coded values



In [12]:

    
# Calculating the number of rows and columns in the DataFrame
nrows, ncols = size(df)









    Out[12]:





(120, 6)



In [13]:

    
# Results of a specific row and column entry
df[3, 4]









    Out[13]:





16.021847362622296



In [14]:

    
# Column 4 is also :Var1
df[3, :Var1]









    Out[14]:





16.021847362622296



In [15]:

    
# Select some rows and all columns
df[3:5, :]









    Out[15]:




3 rows × 6 columns
Row  PatientID  Cat1     Cat2     Var1      Var2      Var3
     Int64⍰     String⍰  String⍰  Float64⍰  Float64⍰  Float64⍰
1    8          A        B        16.0218   6.60709   60.9436
2    9          A        C        45.1158   6.00733   21.8797
3    16         A        C        20.448    8.54819   20.6623



In [16]:

    
# Select some rows and some columns
df[3:5, [2, 4]]









    Out[16]:




3 rows × 2 columns
Row  Cat1     Var1
     String⍰  Float64⍰
1    A        16.0218
2    A        45.1158
3    A        20.448



In [17]:

    
# Select some rows and some columns
df[3:5, [:Cat1, :Var1]]









    Out[17]:




3 rows × 2 columns
Row  Cat1     Var1
     String⍰  Float64⍰
1    A        16.0218
2    A        45.1158
3    A        20.448



In [18]:

    
# More selection
df[[2, 5, 99], 2:4]









    Out[18]:




3 rows × 3 columns
Row  Cat1     Cat2     Var1
     String⍰  String⍰  Float64⍰
1    A        C        17.8317
2    A        C        20.448
3    B        F        22.5817



In [19]:

    
# Changing the values of Cat1
# A was minor infections
# B was major infections
for r in 1:nrows # Loop through all the rows
    temp = df[r, :Cat1] # Current value of Cat1 in this row
    if ismissing(temp) # ismissing replaces isna, which is no longer defined
        # do nothing
    elseif temp == "A"
        df[r, :Cat1] = "Minor infection"
    elseif temp == "B"
        df[r, :Cat1] = "Major infection"
    else
        # do nothing
    end
end



In [21]:

    
# Changing the values of Cat2
for r in 1:nrows
    temp = df[r, :Cat2]
    if ismissing(temp) # ismissing replaces isna, which is no longer defined
        # do nothing
    elseif temp == "C" || temp == "X" || temp == "R" # Using OR
        df[r, :Cat2] = "Female"
    elseif temp == "L" || temp == "B" || temp == "F"
        df[r, :Cat2] = "Male"
    else
        # do nothing
    end
end
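
The two recoding loops above can also be written as a lookup table. The following is only a sketch on a made-up vector (the values in `cat1` are hypothetical and are not read from CCS.csv); it uses nothing beyond base Julia, and the name `labels` is an assumption introduced for illustration.

```julia
# Toy stand-in for the Cat1 column (hypothetical values, not read from CCS.csv)
cat1 = Union{Missing,String}["A", "B", missing, "A", "Z"]

# Code-to-label lookup, following the comments in the Cat1 loop
labels = Dict("A" => "Minor infection", "B" => "Major infection")

# get(labels, x, x) returns the label for a known code and x itself otherwise;
# missing entries are passed through untouched
recoded = [ismissing(x) ? missing : get(labels, x, x) for x in cat1]
```

Unknown codes such as "Z" are left as they are, which mirrors the empty `else` branch of the loops.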



In [22]:

    
# Correcting the age: subtract 5 from every value (elementwise, hence the dot)
df[:Var1] = df[:Var1] .- 5









    Out[22]:





120-element DataArray{Float64,1}:
 33.2568
 12.8317
 11.0218
 40.1158
 15.448 
 23.3549
 17.4497
 43.4125
 35.0075
 15.7181
 12.0396
 37.6687
 14.954 
  ⋮     
 12.0029
 50.3879
 15.2205
 11.4172
 42.6224
 68.0229
 11.4106
 11.2801
 11.8883
 27.3537
 15.1379
 12.6144



In [23]:

    
# Renaming the columns (old name => new name)
rename!(df, :Cat1 => :Infection)
rename!(df, :Cat2 => :Gender)
rename!(df, :Var1 => :Age)
rename!(df, :Var2 => :HbA1c)
rename!(df, :Var3 => :CRP)









    Out[23]:




Row  PatientID  Infection        Gender  Age                 HbA1c               CRP
1    1          Minor infection  Female  33.25682170735211   5.939131803063266   35.05790787394423
2    2          Minor infection  Female  12.831672926455425  5.3475437647467015  21.130960534087748
3    8          Minor infection  Male    11.021847362622296  6.60708739107548    60.94357572800236
4    9          Minor infection  Female  40.11578946046756   6.007331523437179   21.879716257527214
5    16         Minor infection  Female  15.448024664719128  8.548191553013755   20.662273742223093
6    18         Minor infection  Male    23.354866592358434  7.956423420109708   33.180721180524046
7    25         Minor infection  Female  17.449698055243154  6.346176553966556   40.23647859806062
8    28         Minor infection  Male    43.41249747282861   5.325830066483782   28.89558282117991
9    29         Minor infection  Female  35.00749019795842   11.418946000149159  71.59107138476448
10   33         Minor infection  Female  15.718078088759942  5.3776825349838875  27.421634761166143
11   37         Minor infection  Male    12.039552902790813  5.341678126021321   24.350125791798412
12   38         Minor infection  Male    37.668677048130725  5.822836826149952   52.361023970347645
13   41         Minor infection  Male    14.954005492731604  5.139109665644092   93.1999049245544
14   42         Minor infection  Male    15.616861807456626  5.373775257683365   22.956316044932706
15   45         Minor infection  Male    26.96943333536114   7.03175222186184    32.4353920798117
16   48         Minor infection  Female  10.896503262657347  6.816307422196048   56.91791495074396
17   50         Minor infection  Male    29.126376716232713  10.132043236302893  95.0800494348705
18   54         Minor infection  Female  12.266696029416444  6.232851527491789   30.592429272236995
19   56         Minor infection  Male    12.65139392415621   6.482234591532695   30.411111164705403
20   57         Minor infection  Female  14.411007643097165  6.7381729160524415  51.404061512592136
21   58         Minor infection  Male    15.72167664093812   12.544965670716241  23.516481089946932
22   72         Minor infection  Female  36.067434948202134  6.790353447911059   25.463601202747274
23   77         Minor infection  Female  16.748637652644852  5.331207233871429   53.37901535144456
24   79         Minor infection  Male    19.795379875133985  6.421028372159899   67.80454363409564
25   81         Minor infection  Female  14.455898596105506  15.58264883701855   42.09359631060543
26   87         Minor infection  Male    15.42366215424995   6.0313350908620045  20.315296150337286
27   90         Minor infection  Male    27.12531899176456   7.5079670766812505  30.041428426466076
28   91         Minor infection  Female  22.80314774867038   5.569594345445192   20.873838118661297
29   93         Minor infection  Male    17.519476318918194  8.263168140286755   38.939609071578595
30   96         Minor infection  Female  47.42066995572101   10.589195161626716  22.849503575655408
⋮    ⋮          ⋮                ⋮       ⋮                   ⋮                   ⋮

Descriptive statistics



In [24]:

    
# Count the number of patients in each infection group
# These counts are used for the categorical data analysis
groups = by(df, :Infection, d -> DataFrame(N = size(d, 1)))









    Out[24]:




Row  Infection        N
1    Major infection  60
2    Minor infection  60
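
The `by` function used above belongs to older DataFrames releases. As a hedged sketch only, and assuming a DataFrames 1.x installation, the same per-group count can be written with `groupby` and `combine`. The frame `toy` and its five rows are made up for illustration and are not the CCS data.

```julia
using DataFrames

# Hypothetical five-row frame standing in for df
toy = DataFrame(Infection = ["Minor infection", "Major infection",
                             "Minor infection", "Major infection",
                             "Minor infection"])

# nrow => :N counts the rows of each group and stores the count in column N
counts = combine(groupby(toy, :Infection), nrow => :N)
```

The result has one row per infection group, just like the output of `by` above.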



In [25]:

    
# Count the number of patients per gender
gender = by(df, :Gender, d -> DataFrame(N = size(d, 1)))









    Out[25]:




Row  Gender  N
1    Female  60
2    Male    60



In [26]:

    
# Calculating the mean of a column
mean(df[:Age])









    Out[26]:





22.967863792237893



In [27]:

    
median(df[:Age])









    Out[27]:





17.68007894647561



In [28]:

    
std(df[:Age])









    Out[28]:





13.116990926253415



In [29]:

    
# Describe the values in a column
describe(df[:Age])









    



Summary Stats:
Mean:         22.967864
Minimum:      10.235601
1st Quartile: 12.967475
Median:       17.680079
3rd Quartile: 29.746373
Maximum:      79.237810
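
Each entry in the summary above can be reproduced with a single call from the Statistics standard library. A minimal sketch on a made-up vector follows; the values in `ages` are hypothetical and the field names of `stats` are chosen only for illustration.

```julia
using Statistics

# Hypothetical ages, not taken from CCS.csv
ages = [10.0, 12.0, 15.0, 20.0, 43.0]

# Six-number summary collected in a named tuple
stats = (
    avg = mean(ages),
    lo  = minimum(ages),
    q1  = quantile(ages, 0.25),
    med = median(ages),
    q3  = quantile(ages, 0.75),
    hi  = maximum(ages),
)
```

A mean well above the median, as in the Age output above (22.97 against 17.68), points to a right-skewed column.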



In [30]:

    
describe(df[:HbA1c])









    



Summary Stats:
Mean:         5.921205
Minimum:      3.011733
1st Quartile: 4.065523
Median:       5.642406
3rd Quartile: 6.839651
Maximum:      15.582649



In [31]:

    
describe(df[:CRP])









    



Summary Stats:
Mean:         51.950031
Minimum:      20.315296
1st Quartile: 32.235514
Median:       44.304176
3rd Quartile: 64.858850
Maximum:      147.397402



In [32]:

    
# Using the Gadfly package
plot(df, x = "Infection", y = "Age", Geom.boxplot, Guide.title("Age analysis by type of infection"),
Guide.xlabel("Type of infection"), Guide.ylabel("Age"))









    Out[32]:



In [33]:

    
plot(df, x = "Gender", y = "Age", Geom.boxplot, Guide.title("Age analysis by gender"),
Guide.xlabel("Gender"), Guide.ylabel("Age"), Theme(default_color = colorant"orange"))









    Out[33]:



In [34]:

    
plot(df, x = "Age", color = "Infection", Geom.density, Guide.title("Age distribution by type of infection"), 
Guide.xlabel("Age"), Guide.ylabel("Distribution"))









    Out[34]:



In [35]:

    
plot(df, x = "Age", color = "Gender", Geom.density, Guide.title("Age distribution by gender"), 
Guide.xlabel("Age"), Guide.ylabel("Distribution"))









    Out[35]:



In [36]:

    
plot(df, x = "Infection", y = "HbA1c", Geom.boxplot, Guide.title("HbA1c analysis by type of infection"), 
Guide.xlabel("Type of infection"), Guide.ylabel("HbA1c"))









    Out[36]:



In [37]:

    
plot(df, x = "HbA1c", color = "Infection", Geom.density, Guide.title("HbA1c distribution by type of infection"), 
Guide.xlabel("HbA1c"), Guide.ylabel("Distribution"))









    Out[37]:



In [38]:

    
plot(df, x = "Gender", y = "HbA1c", Geom.boxplot, Guide.title("HbA1c analysis by gender"), 
Guide.xlabel("Gender"), Guide.ylabel("HbA1c"), Theme(default_color = colorant"orange"))









    Out[38]:



In [39]:

    
plot(df, x = "Infection", y = "CRP", Geom.boxplot, Guide.title("CRP analysis by type of infection"), 
Guide.xlabel("Type of infection"), Guide.ylabel("CRP"))









    Out[39]:



In [40]:

    
plot(df, x = "Gender", y = "CRP", Geom.boxplot, Guide.title("CRP analysis by gender"), 
Guide.xlabel("Gender"), Guide.ylabel("CRP"), Theme(default_color = colorant"orange"))









    Out[40]:

Inferential statistics



In [41]:

    
# Creating individual DataFrames
minor = df[df[:Infection] .== "Minor infection", :]
major = df[df[:Infection] .== "Major infection", :]
female = df[df[:Gender] .== "Female", :]
male = df[df[:Gender] .== "Male", :];



In [42]:

    
# Count patients by gender within the minor infection group
by(minor, :Gender, d -> DataFrame(N = size(d, 1)))









    Out[42]:




Row  Gender  N
1    Female  29
2    Male    31



In [43]:

    
# Same count within the major infection group
by(major, :Gender, d -> DataFrame(N = size(d, 1)))









    Out[43]:




Row  Gender  N
1    Female  31
2    Male    29



In [44]:

    
# 2x2 contingency table of infection type by gender
# Using FisherExactTest from the HypothesisTests package
#                     Female         Male
# Minor infection     29             31
# Major infection     31             29
FisherExactTest(29, 31, 31, 29)









    Out[44]:





Fisher's exact test
-------------------
Population details:
    parameter of interest:   Odds ratio
    value under h_0:         1.0
    point estimate:          0.8761040629077481
    95% confidence interval: (0.4021316586321565,1.9032998148973006)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8552344413446413 (not signficant)

Details:
    contingency table:
        29  31
        31  29
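
As a quick sanity check on the output above, the plain sample odds ratio can be computed directly from the four counts. This is only a sketch in base Julia; the variable names are introduced for illustration. As far as I understand the package, the point estimate printed by FisherExactTest is a conditional maximum likelihood estimate, so it is expected to be close to, but not identical with, this value.

```julia
# 2x2 table from the counts above: rows = infection group, columns = gender
a, b = 29, 31   # minor infection: female, male
c, d = 31, 29   # major infection: female, male

# Sample (unconditional) odds ratio: cross-product of the table
odds_ratio = (a * d) / (b * c)
```

Here the ratio is 841/961, a little below 1, which agrees in direction and size with the reported estimate of about 0.876.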



In [45]:

    
# Checking distributions using the Kolmogorov-Smirnov test from the HypothesisTests package
ExactOneSampleKSTest(df[:Age], Normal(mean(df[:Age]), std(df[:Age])))









    Out[45]:





Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.1689987485297917

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.00182487802092679 (very significant)

Details:
    number of observations:   120
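
The point estimate reported above is the largest gap between the empirical CDF of the sample and the CDF of the reference distribution. The following is a minimal sketch of that calculation, assuming the Distributions package is installed. The helper name `ks_statistic` and the values in `xs` are made up for illustration and are not part of HypothesisTests or of the CCS data.

```julia
using Distributions, Statistics

# Hypothetical helper: two-sided KS distance between a sample and a distribution
function ks_statistic(x, dist)
    n = length(x)
    F = [cdf(dist, xi) for xi in sort(x)]   # reference CDF at the sorted sample
    d_plus = maximum((1:n) ./ n .- F)       # empirical CDF above the reference
    d_minus = maximum(F .- (0:n-1) ./ n)    # empirical CDF below the reference
    return max(d_plus, d_minus)
end

# Made-up right-skewed sample, not taken from CCS.csv
xs = [10.2, 11.4, 12.6, 15.1, 17.7, 22.8, 29.7, 43.4, 68.0, 79.2]
D = ks_statistic(xs, Normal(mean(xs), std(xs)))
```

The reference distribution is built from the sample's own mean and standard deviation, as in the cell above.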



In [46]:

    
ExactOneSampleKSTest(df[:HbA1c], Normal(mean(df[:HbA1c]), std(df[:HbA1c])))









    Out[46]:





Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.10542457878357658

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.12916532225692645 (not signficant)

Details:
    number of observations:   120



In [47]:

    
ExactOneSampleKSTest(df[:CRP], Normal(mean(df[:CRP]), std(df[:CRP])))









    Out[47]:





Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.1285224834354674

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.03457928687894718 (significant)

Details:
    number of observations:   120



In [48]:

    
# The KS test rejected normality for Age, so use a nonparametric test for the two groups
MannWhitneyUTest(minor[:Age], major[:Age])









    Out[48]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -2.4292312272307086

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.2878502713319637 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1597.0
    rank sums:                            [3427.0,3833.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-203.0,190.5255888325765)
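
The details in the output above are linked by simple arithmetic. As a sketch in base Julia, using the group sizes and the first rank sum printed above (variable names are introduced for illustration, and no tie correction is applied because the reported adjustment is 0.0):

```julia
# Group sizes and the rank sum of the first group (minor infection), from the output above
n1, n2 = 60, 60
rank_sum_minor = 3427.0

# U statistic of the first group: its rank sum minus the smallest possible rank sum
U = rank_sum_minor - n1 * (n1 + 1) / 2

# Centre and spread used by the normal approximation
mu = U - n1 * n2 / 2
sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
```

These three values reproduce the reported statistic (1597.0) and the normal approximation (-203.0, 190.5256).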



In [49]:

    
MannWhitneyUTest(minor[:HbA1c], major[:HbA1c])









    Out[49]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          2.204107265879177

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           8.911043220275755e-10 (extremely significant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             2968.0
    rank sums:                            [4798.0,2462.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (1168.0,190.5255888325765)



In [50]:

    
var(minor[:HbA1c]), var(major[:HbA1c])









    Out[50]:





(4.4725594213938695,3.051053184641962)



In [51]:

    
# Using a parametric test, since the KS test did not reject normality for HbA1c
EqualVarianceTTest(minor[:HbA1c], major[:HbA1c])









    Out[51]:





Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.2735847881097992
    95% confidence interval: (1.572351556834641,2.9748180193849576)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           2.9598988574158584e-9 (extremely significant)

Details:
    number of observations:   [60,60]
    t-statistic:              6.420569735515418
    degrees of freedom:       118
    empirical standard error: 0.3541095076864366
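
The standard error and t-statistic above follow from the two group variances and the mean difference. A sketch in base Julia, with the numbers copied from the outputs above and variable names introduced for illustration:

```julia
# Group sizes, variances (from the var(...) cell) and mean difference (point estimate above)
n1, n2 = 60, 60
var_minor, var_major = 4.4725594213938695, 3.051053184641962
mean_diff = 2.2735847881097992

# Pooled variance under the equal-variance assumption
pooled_var = ((n1 - 1) * var_minor + (n2 - 1) * var_major) / (n1 + n2 - 2)

# Standard error of the mean difference and the resulting t-statistic
se = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = mean_diff / se
```

With n1 + n2 - 2 = 118 degrees of freedom, this matches the reported standard error (0.3541) and t-statistic (6.4206).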



In [52]:

    
# Checking descriptive statistics for HbA1c in the two groups
describe(minor[:HbA1c])









    



Summary Stats:
Mean:         7.057998
Minimum:      5.007773
1st Quartile: 5.601074
Median:       6.237020
3rd Quartile: 7.977990
Maximum:      15.582649



In [53]:

    
describe(major[:HbA1c])









    



Summary Stats:
Mean:         4.784413
Minimum:      3.011733
1st Quartile: 3.453761
Median:       4.032912
3rd Quartile: 5.712855
Maximum:      11.917937



In [54]:

    
MannWhitneyUTest(minor[:CRP], major[:CRP])









    Out[54]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -15.866078281489749

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           9.525817333080766e-5 (extremely significant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1056.0
    rank sums:                            [2886.0,4374.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-744.0,190.5255888325765)



In [55]:

    
# Using the Mann-Whitney U test to compare the two genders
MannWhitneyUTest(female[:Age], male[:Age])









    Out[55]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -1.036465493753596

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9393353532121992 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1785.0
    rank sums:                            [3615.0,3645.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-15.0,190.5255888325765)



In [56]:

    
MannWhitneyUTest(female[:HbA1c], male[:HbA1c])









    Out[56]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.11678243974029012

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9018354821522864 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1824.0
    rank sums:                            [3654.0,3606.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (24.0,190.5255888325765)



In [57]:

    
MannWhitneyUTest(female[:CRP], male[:CRP])









    Out[57]:





Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.7878830756252739

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7389164706392968 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1736.0
    rank sums:                            [3566.0,3694.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-64.0,190.5255888325765)


