Analysis of fake data


In [1]:
# Making sure Julia is working properly
+(2, 2)


Out[1]:
4

In this lesson

  • Import packages
  • Import dataset (in CSV format) using the DataFrames package
  • Change the coded values
  • Have a look around the imported dataset
  • Descriptive statistics including simple plotting using the Gadfly package
  • Inferential statistics using the HypothesisTests package including how to decide between the use of parametric vs nonparametric tests

Importing packages


In [2]:
# Pkg.add("")

In [3]:
using DataFrames


┌ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
└ @ Base loading.jl:1260

In [4]:
using Gadfly


┌ Info: Precompiling Gadfly [c91e804a-d5a3-530f-b6f0-dfbca275c004]
└ @ Base loading.jl:1260
WARNING: Method definition dot(Any, Any, Any) in module LinearAlgebra at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\generic.jl:924 overwritten in module Optim at C:\Users\juank\.julia\packages\Optim\Agd3B\src\multivariate\precon.jl:23.
  ** incremental compilation may be fatally broken for this module **



In [5]:
using StatsBase

In [6]:
using HypothesisTests


┌ Info: Precompiling HypothesisTests [09f84164-cd44-5f33-b23f-e6b0d136a0d5]
└ @ Base loading.jl:1260

In [7]:
using Distributions

In [8]:
#using Plotly

In [9]:
#include("plotly_credentials.jl")

Importing the data file


In [8]:
df = readtable("CCS.csv");


┌ Warning: readtable is deprecated, use CSV.read from the CSV package instead
│   caller = top-level scope at In[8]:1
└ @ Core In[8]:1
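
The deprecation warning above points to the CSV package. A minimal sketch of that route, assuming CSV.jl 0.9 or later and a recent DataFrames release (neither is loaded in this notebook). The two data rows are the first two rows of this dataset as displayed in the notebook, written to a temporary file so the sketch runs without CCS.csv; the names `path` and `sample` are illustrative only.

```julia
using CSV, DataFrames

# Write a two-row stand-in for CCS.csv to a temporary file
path = tempname() * ".csv"
write(path, """
PatientID,Cat1,Cat2,Var1,Var2,Var3
1,A,C,38.2568,5.93913,35.0579
2,A,C,17.8317,5.34754,21.131
""")

# stringtype = String keeps text columns as plain Strings, so longer labels
# such as "Minor infection" can be assigned into them later
sample = CSV.read(path, DataFrame; stringtype = String)
size(sample)
```

With the real file, the call would presumably be `CSV.read("CCS.csv", DataFrame; stringtype = String)`.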

In [10]:
first(df, 6)


Out[10]:

6 rows × 6 columns

    PatientID  Cat1     Cat2     Var1      Var2      Var3
    Int64⍰     String⍰  String⍰  Float64⍰  Float64⍰  Float64⍰
1   1          A        C        38.2568   5.93913   35.0579
2   2          A        C        17.8317   5.34754   21.131
3   8          A        B        16.0218   6.60709   60.9436
4   9          A        C        45.1158   6.00733   21.8797
5   16         A        C        20.448    8.54819   20.6623
6   18         A        B        28.3549   7.95642   33.1807

In [11]:
# Making sure there are no missing values
# Looking at the data types
# showcols is not defined in this DataFrames version; describe(df) lists
# the element type and the number of missing values for each column
describe(df)

Changing coded values


In [12]:
# Calculating the number of rows and columns in the DataFrame
nrows, ncols = size(df)


Out[12]:
(120, 6)

In [13]:
# Results of a specific row and column entry
df[3, 4]


Out[13]:
16.021847362622296

In [14]:
# Column 4 is also :Var1
df[3, :Var1]


Out[14]:
16.021847362622296

In [15]:
# Select some rows and all columns
df[3:5, :]


Out[15]:

3 rows × 6 columns

    PatientID  Cat1     Cat2     Var1      Var2      Var3
    Int64⍰     String⍰  String⍰  Float64⍰  Float64⍰  Float64⍰
1   8          A        B        16.0218   6.60709   60.9436
2   9          A        C        45.1158   6.00733   21.8797
3   16         A        C        20.448    8.54819   20.6623

In [16]:
# Select some rows and some columns
df[3:5, [2, 4]]


Out[16]:

3 rows × 2 columns

    Cat1     Var1
    String⍰  Float64⍰
1   A        16.0218
2   A        45.1158
3   A        20.448

In [17]:
# Select some rows and some columns
df[3:5, [:Cat1, :Var1]]


Out[17]:

3 rows × 2 columns

    Cat1     Var1
    String⍰  Float64⍰
1   A        16.0218
2   A        45.1158
3   A        20.448

In [18]:
# More selection
df[[2, 5, 99], 2:4]


Out[18]:

3 rows × 3 columns

    Cat1     Cat2     Var1
    String⍰  String⍰  Float64⍰
1   A        C        17.8317
2   A        C        20.448
3   B        F        22.5817

In [19]:
# Changing the values of Cat1
# A codes for minor infections
# B codes for major infections
for r in 1:nrows # Loop through all the rows
    temp = df[r, :Cat1] # Current code in this row
    if ismissing(temp) # isna is no longer defined; ismissing is its replacement
        # Leave missing entries untouched
    elseif temp == "A"
        df[r, :Cat1] = "Minor infection"
    elseif temp == "B"
        df[r, :Cat1] = "Major infection"
    end
end

In [21]:
# Changing the values of Cat2
for r in 1:nrows
    temp = df[r, :Cat2]
    if ismissing(temp)
        # Leave missing entries untouched
    elseif temp == "C" || temp == "X" || temp == "R" # Using OR
        df[r, :Cat2] = "Female"
    elseif temp == "L" || temp == "B" || temp == "F"
        df[r, :Cat2] = "Male"
    end
end
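
The two recoding loops above can also be written without an explicit loop. The following is a sketch using only Base Julia; the short vectors `cat1` and `cat2` are hypothetical stand-ins for the Cat1 and Cat2 columns, and `gender_codes` is an illustrative name, not something defined in this notebook.

```julia
# Hypothetical stand-ins for the coded columns; missing marks an absent entry
cat1 = ["A", "B", missing, "A"]
cat2 = ["C", "F", missing, "Q"]

# replace returns a new vector; values without a matching pair pass through
infection = replace(cat1, "A" => "Minor infection", "B" => "Major infection")

# Several codes map to the same label, so a lookup table keeps this readable
gender_codes = Dict("C" => "Female", "X" => "Female", "R" => "Female",
                    "L" => "Male", "B" => "Male", "F" => "Male")

# Unknown codes (here "Q") and missing entries are left unchanged
gender = [ismissing(c) ? missing : get(gender_codes, c, c) for c in cat2]
```

Both forms build new vectors rather than modifying the originals, so the result would still need to be assigned back to the DataFrame column.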

In [22]:
# Correcting the age: subtract 5 from every value (element-wise, hence .-)
df[:Var1] = df[:Var1] .- 5


Out[22]:
120-element DataArray{Float64,1}:
 33.2568
 12.8317
 11.0218
 40.1158
 15.448 
 23.3549
 17.4497
 43.4125
 35.0075
 15.7181
 12.0396
 37.6687
 14.954 
  ⋮     
 12.0029
 50.3879
 15.2205
 11.4172
 42.6224
 68.0229
 11.4106
 11.2801
 11.8883
 27.3537
 15.1379
 12.6144

In [23]:
# Renaming the columns (old name => new name)
rename!(df, :Cat1 => :Infection)
rename!(df, :Cat2 => :Gender)
rename!(df, :Var1 => :Age)
rename!(df, :Var2 => :HbA1c)
rename!(df, :Var3 => :CRP)


Out[23]:
    PatientID  Infection        Gender  Age                 HbA1c               CRP
1   1          Minor infection  Female  33.25682170735211   5.939131803063266   35.05790787394423
2   2          Minor infection  Female  12.831672926455425  5.3475437647467015  21.130960534087748
3   8          Minor infection  Male    11.021847362622296  6.60708739107548    60.94357572800236
4   9          Minor infection  Female  40.11578946046756   6.007331523437179   21.879716257527214
5   16         Minor infection  Female  15.448024664719128  8.548191553013755   20.662273742223093
6   18         Minor infection  Male    23.354866592358434  7.956423420109708   33.180721180524046
7   25         Minor infection  Female  17.449698055243154  6.346176553966556   40.23647859806062
8   28         Minor infection  Male    43.41249747282861   5.325830066483782   28.89558282117991
9   29         Minor infection  Female  35.00749019795842   11.418946000149159  71.59107138476448
10  33         Minor infection  Female  15.718078088759942  5.3776825349838875  27.421634761166143
11  37         Minor infection  Male    12.039552902790813  5.341678126021321   24.350125791798412
12  38         Minor infection  Male    37.668677048130725  5.822836826149952   52.361023970347645
13  41         Minor infection  Male    14.954005492731604  5.139109665644092   93.1999049245544
14  42         Minor infection  Male    15.616861807456626  5.373775257683365   22.956316044932706
15  45         Minor infection  Male    26.96943333536114   7.03175222186184    32.4353920798117
16  48         Minor infection  Female  10.896503262657347  6.816307422196048   56.91791495074396
17  50         Minor infection  Male    29.126376716232713  10.132043236302893  95.0800494348705
18  54         Minor infection  Female  12.266696029416444  6.232851527491789   30.592429272236995
19  56         Minor infection  Male    12.65139392415621   6.482234591532695   30.411111164705403
20  57         Minor infection  Female  14.411007643097165  6.7381729160524415  51.404061512592136
21  58         Minor infection  Male    15.72167664093812   12.544965670716241  23.516481089946932
22  72         Minor infection  Female  36.067434948202134  6.790353447911059   25.463601202747274
23  77         Minor infection  Female  16.748637652644852  5.331207233871429   53.37901535144456
24  79         Minor infection  Male    19.795379875133985  6.421028372159899   67.80454363409564
25  81         Minor infection  Female  14.455898596105506  15.58264883701855   42.09359631060543
26  87         Minor infection  Male    15.42366215424995   6.0313350908620045  20.315296150337286
27  90         Minor infection  Male    27.12531899176456   7.5079670766812505  30.041428426466076
28  91         Minor infection  Female  22.80314774867038   5.569594345445192   20.873838118661297
29  93         Minor infection  Male    17.519476318918194  8.263168140286755   38.939609071578595
30  96         Minor infection  Female  47.42066995572101   10.589195161626716  22.849503575655408
⋮   ⋮          ⋮                ⋮       ⋮                   ⋮                   ⋮

Descriptive statistics


In [24]:
# Number of patients per infection group
# These counts feed the categorical data analysis later on
groups = by(df, :Infection, d -> DataFrame(N = size(d, 1)))


Out[24]:
    Infection        N
1   Major infection  60
2   Minor infection  60

In [25]:
# Number of patients per gender
gender = by(df, :Gender, d -> DataFrame(N = size(d, 1)))


Out[25]:
    Gender  N
1   Female  60
2   Male    60
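
In newer DataFrames releases `by` is no longer available, and the usual replacement is `groupby` followed by `combine`. A minimal sketch, assuming a recent DataFrames version; `toy` is a hypothetical three-row miniature of the recoded table, not the real data.

```julia
using DataFrames

# Hypothetical miniature of the recoded table; the real df has 120 rows
toy = DataFrame(Infection = ["Minor infection", "Major infection", "Minor infection"],
                Gender = ["Female", "Male", "Male"])

# groupby splits the rows by key; nrow => :N stores each group's size in column N
counts = combine(groupby(toy, :Infection), nrow => :N)
```

Applied to the notebook's table, the same pattern would read `combine(groupby(df, :Infection), nrow => :N)`.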

In [26]:
# Calculating the mean of a column
mean(df[:Age])


Out[26]:
22.967863792237893

In [27]:
median(df[:Age])


Out[27]:
17.68007894647561

In [28]:
std(df[:Age])


Out[28]:
13.116990926253415

In [29]:
# Describe the values in a column
describe(df[:Age])


Summary Stats:
Mean:         22.967864
Minimum:      10.235601
1st Quartile: 12.967475
Median:       17.680079
3rd Quartile: 29.746373
Maximum:      79.237810

In [30]:
describe(df[:HbA1c])


Summary Stats:
Mean:         5.921205
Minimum:      3.011733
1st Quartile: 4.065523
Median:       5.642406
3rd Quartile: 6.839651
Maximum:      15.582649

In [31]:
describe(df[:CRP])


Summary Stats:
Mean:         51.950031
Minimum:      20.315296
1st Quartile: 32.235514
Median:       44.304176
3rd Quartile: 64.858850
Maximum:      147.397402

In [32]:
# Using the Gadfly package
plot(df, x = "Infection", y = "Age", Geom.boxplot, Guide.title("Age analysis by type of infection"),
Guide.xlabel("Type of infection"), Guide.ylabel("Age"))


Out[32]:
[Box plot: "Age analysis by type of infection"; x-axis: Type of infection (Minor infection, Major infection); y-axis: Age]

In [33]:
plot(df, x = "Gender", y = "Age", Geom.boxplot, Guide.title("Age analysis by gender"),
Guide.xlabel("Gender"), Guide.ylabel("Age"), Theme(default_color = colorant"orange"))


Out[33]:
[Box plot: "Age analysis by gender"; x-axis: Gender (Female, Male); y-axis: Age]

In [34]:
plot(df, x = "Age", color = "Infection", Geom.density, Guide.title("Age distribution by type of infection"), 
Guide.xlabel("Age"), Guide.ylabel("Distribution"))


Out[34]:
[Density plot: "Age distribution by type of infection"; x-axis: Age; y-axis: Distribution; color: Infection (Minor infection, Major infection)]

In [35]:
plot(df, x = "Age", color = "Gender", Geom.density, Guide.title("Age distribution by gender"), 
Guide.xlabel("Age"), Guide.ylabel("Distribution"))


Out[35]:
[Density plot: "Age distribution by gender"; x-axis: Age; y-axis: Distribution; color: Gender (Female, Male)]

In [36]:
plot(df, x = "Infection", y = "HbA1c", Geom.boxplot, Guide.title("HbA1c analysis by type of infection"), 
Guide.xlabel("Type of infection"), Guide.ylabel("HbA1c"))


Out[36]:
[Box plot: "HbA1c analysis by type of infection"; x-axis: Type of infection (Minor infection, Major infection); y-axis: HbA1c]

In [37]:
plot(df, x = "HbA1c", color = "Infection", Geom.density, Guide.title("HbA1c distribution by type of infection"), 
Guide.xlabel("HbA1c"), Guide.ylabel("Distribution"))


Out[37]:
[Density plot: "HbA1c distribution by type of infection"; x-axis: HbA1c; y-axis: Distribution; color: Infection (Minor infection, Major infection)]

In [38]:
plot(df, x = "Gender", y = "HbA1c", Geom.boxplot, Guide.title("HbA1c analysis by gender"), 
Guide.xlabel("Gender"), Guide.ylabel("HbA1c"), Theme(default_color = colorant"orange"))


Out[38]:
[Box plot: "HbA1c analysis by gender"; x-axis: Gender (Female, Male); y-axis: HbA1c]

In [39]:
plot(df, x = "Infection", y = "CRP", Geom.boxplot, Guide.title("CRP analysis by type of infection"), 
Guide.xlabel("Type of infection"), Guide.ylabel("CRP"))


Out[39]:
[Box plot: "CRP analysis by type of infection"; x-axis: Type of infection (Minor infection, Major infection); y-axis: CRP]

In [40]:
plot(df, x = "Gender", y = "CRP", Geom.boxplot, Guide.title("CRP analysis by gender"), 
Guide.xlabel("Gender"), Guide.ylabel("CRP"), Theme(default_color = colorant"orange"))


Out[40]:
[Box plot: "CRP analysis by gender"; x-axis: Gender (Female, Male); y-axis: CRP]

Inferential statistics


In [41]:
# Creating individual DataFrames
minor = df[df[:Infection] .== "Minor infection", :]
major = df[df[:Infection] .== "Major infection", :]
female = df[df[:Gender] .== "Female", :]
male = df[df[:Gender] .== "Male", :];

In [42]:
# Count patients by gender within each infection group
by(minor, :Gender,d -> DataFrame(N = size(d, 1)))


Out[42]:
    Gender  N
1   Female  29
2   Male    31

In [43]:
by(major, :Gender,d -> DataFrame(N = size(d, 1)))


Out[43]:
    Gender  N
1   Female  31
2   Male    29

In [44]:
# Arranging the counts above into a 2x2 contingency table
# Using FisherExactTest from the HypothesisTests package
#                     Female         Male
# Minor infection     29             31
# Major infection     31             29
FisherExactTest(29, 31, 31, 29)


Out[44]:
Fisher's exact test
-------------------
Population details:
    parameter of interest:   Odds ratio
    value under h_0:         1.0
    point estimate:          0.8761040629077481
    95% confidence interval: (0.4021316586321565,1.9032998148973006)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8552344413446413 (not signficant)

Details:
    contingency table:
        29  31
        31  29
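
As a cross-check on the reported odds ratio, the plain cross-product odds ratio can be computed by hand from the same table. The sketch below uses only Base Julia; the variable names are illustrative. The value it gives (about 0.8751) is close to, but not identical with, the point estimate of 0.8761 printed above, presumably because HypothesisTests reports a conditional maximum likelihood estimate rather than the simple cross-product.

```julia
# Counts from the contingency table above
a, b = 29, 31   # Minor infection: Female, Male
c, d = 31, 29   # Major infection: Female, Male

# Cross-product (sample) odds ratio: (a * d) / (b * c) = 841 / 961
sample_or = (a * d) / (b * c)
```

An odds ratio this close to 1 is consistent with the test failing to reject the null hypothesis.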

In [45]:
# Checking distributions using the Kolmogorov-Smirnov test from the HypothesisTests package
ExactOneSampleKSTest(df[:Age], Normal(mean(df[:Age]), std(df[:Age])))


Out[45]:
Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.1689987485297917

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.00182487802092679 (very significant)

Details:
    number of observations:   120

In [46]:
ExactOneSampleKSTest(df[:HbA1c], Normal(mean(df[:HbA1c]), std(df[:HbA1c])))


Out[46]:
Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.10542457878357658

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.12916532225692645 (not signficant)

Details:
    number of observations:   120

In [47]:
ExactOneSampleKSTest(df[:CRP], Normal(mean(df[:CRP]), std(df[:CRP])))


Out[47]:
Exact one sample Kolmorov-Smirnov test
--------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.1285224834354674

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.03457928687894718 (significant)

Details:
    number of observations:   120
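
The normality checks above are what drive the choice between a parametric and a nonparametric comparison in the cells that follow. One way to express that decision in code is sketched here. `looks_normal` and `compare_groups` are hypothetical helpers, not part of HypothesisTests, and the sketch checks each group separately, whereas the notebook checks the pooled column. The samples are simulated only to show the call.

```julia
using HypothesisTests, Distributions, Statistics, Random

# Hypothetical helper: true when the KS test does not reject a normal
# distribution fitted to the sample at level α
function looks_normal(x; α = 0.05)
    ks = ExactOneSampleKSTest(x, Normal(mean(x), std(x)))
    return pvalue(ks) > α
end

# Hypothetical helper: t-test when both groups look normal,
# otherwise the Mann-Whitney U test
function compare_groups(x, y; α = 0.05)
    if looks_normal(x; α = α) && looks_normal(y; α = α)
        return EqualVarianceTTest(x, y)
    else
        return MannWhitneyUTest(x, y)
    end
end

# Simulated samples, only to demonstrate the call
Random.seed!(1)
result = compare_groups(randn(60), randn(60) .+ 1)
pvalue(result)
```

Whether an automatic rule like this is appropriate depends on the study; here it only mirrors the manual steps taken in the notebook.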

In [48]:
# Using nonparametric tests for two groups
MannWhitneyUTest(minor[:Age], major[:Age])


Out[48]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -2.4292312272307086

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.2878502713319637 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1597.0
    rank sums:                            [3427.0,3833.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-203.0,190.5255888325765)

In [49]:
MannWhitneyUTest(minor[:HbA1c], major[:HbA1c])


Out[49]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          2.204107265879177

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           8.911043220275755e-10 (extremely significant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             2968.0
    rank sums:                            [4798.0,2462.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (1168.0,190.5255888325765)

In [50]:
var(minor[:HbA1c]), var(major[:HbA1c])


Out[50]:
(4.4725594213938695,3.051053184641962)

In [51]:
# Using a parametric test (the KS test above did not reject normality for HbA1c)
EqualVarianceTTest(minor[:HbA1c], major[:HbA1c])


Out[51]:
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.2735847881097992
    95% confidence interval: (1.572351556834641,2.9748180193849576)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           2.9598988574158584e-9 (extremely significant)

Details:
    number of observations:   [60,60]
    t-statistic:              6.420569735515418
    degrees of freedom:       118
    empirical standard error: 0.3541095076864366

In [52]:
# Checking descriptive statistics for HbA1c in the two groups
describe(minor[:HbA1c])


Summary Stats:
Mean:         7.057998
Minimum:      5.007773
1st Quartile: 5.601074
Median:       6.237020
3rd Quartile: 7.977990
Maximum:      15.582649

In [53]:
describe(major[:HbA1c])


Summary Stats:
Mean:         4.784413
Minimum:      3.011733
1st Quartile: 3.453761
Median:       4.032912
3rd Quartile: 5.712855
Maximum:      11.917937

In [54]:
MannWhitneyUTest(minor[:CRP], major[:CRP])


Out[54]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -15.866078281489749

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           9.525817333080766e-5 (extremely significant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1056.0
    rank sums:                            [2886.0,4374.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-744.0,190.5255888325765)

In [55]:
# Using MannWhitneyU test
MannWhitneyUTest(female[:Age], male[:Age])


Out[55]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -1.036465493753596

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9393353532121992 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1785.0
    rank sums:                            [3615.0,3645.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-15.0,190.5255888325765)

In [56]:
MannWhitneyUTest(female[:HbA1c], male[:HbA1c])


Out[56]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.11678243974029012

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.9018354821522864 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1824.0
    rank sums:                            [3654.0,3606.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (24.0,190.5255888325765)

In [57]:
MannWhitneyUTest(female[:CRP], male[:CRP])


Out[57]:
Approximate Mann-Whitney U test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.7878830756252739

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7389164706392968 (not signficant)

Details:
    number of observations in each group: [60,60]
    Mann-Whitney-U statistic:             1736.0
    rank sums:                            [3566.0,3694.0]
    adjustment for ties:                  0.0
    normal approximation (μ, σ):          (-64.0,190.5255888325765)

In [ ]: