Epilepsy Comorbidities (using Brown MySQL server)

This script runs a PubMed-Comorbidities pipeline with the following characteristics:

Main MeSH Heading: Epilepsy
UMLS filtering concepts: Disease or Syndrome, Mental or Behavioral Dysfunction, Neoplastic Process
Articles analysed: All MEDLINE 2017AA articles tagged with Epilepsy as a MeSH Heading. Note that this is equivalent to searching PubMed using Epilepsy[MH:noexp].
Total number of articles found: 66720
UMLS concept filtering: Comorbidities are analysed on all other MeSH descriptors associated with the specified UMLS concepts
This script uses the following Brown MySQL databases:
- medline
- umls_meta
- pubmed_miner



In [1]:

    
addprocs(2);



In [2]:

    
using Revise #used during development to detect changes in module
using PubMedMiner
 
#Settings
const mh = "Epilepsy"
const concepts = ("Disease or Syndrome", "Mental or Behavioral Dysfunction", "Neoplastic Process");

The following code saves to the pubmed_miner database, for each UMLS concept, a table containing the PMIDs and MeSH descriptors that match the specified filtering criteria.



In [3]:

    
overwrite = false
@time save_semantic_occurrences(mh, concepts...; overwrite = overwrite)









    



INFO: 66720 Articles related to MH:Epilepsy
INFO: ----------------------------------------
INFO: Start all articles
INFO: Using concept table: MESH_T047
INFO: Using results table: epilepsy_mesh_t047
INFO: Table exists and will remain unchanged
INFO: Using concept table: MESH_T048
INFO: Using results table: epilepsy_mesh_t048
INFO: Table doesn't exist, create
INFO: Using concept table: MESH_T191
INFO: Using results table: epilepsy_mesh_t191
INFO: Table doesn't exist, create






    



632.862412 seconds (14.74 M allocations: 475.459 MiB, 0.02% gc time)

2. Retrieve results and analyse simple occurrences and co-occurrences



In [5]:

    
using FreqTables

occurrence_df = get_semantic_occurrences_df(mh, concepts...)
@time mesh_frequencies = freqtable(occurrence_df, :pmid, :descriptor);
info("Found ", size(occurrence_df, 1), " related descriptors")









    



INFO: Using concept table: MESH_T047
INFO: Using results table: epilepsy_mesh_t047
INFO: Using concept table: MESH_T048
INFO: Using results table: epilepsy_mesh_t048
INFO: Using concept table: MESH_T191
INFO: Using results table: epilepsy_mesh_t191






    



  0.771777 seconds (2.10 M allocations: 1.032 GiB, 19.26% gc time)






    



INFO: Found 136136 related descriptors



In [6]:

    
using PlotlyJS
using NamedArrays

# Visualize frequency 
topn = 50
mesh_counts = vec(sum(mesh_frequencies, 1))
count_perm = sortperm(mesh_counts, rev=true)
mesh_names = collect(keys(mesh_frequencies.dicts[2]))

#traces
#most frequent is epilepsy - remove from plot for better scaling
freq_trace = PlotlyJS.bar(; x = mesh_names[count_perm[2:topn]], y= mesh_counts[count_perm[2:topn]], marker_color="orange")

data = [freq_trace]
layout = Layout(;title="$(topn)-Most Frequent MeSH ",
                 showlegend=false,
                 margin= Dict(:t=> 70, :r=> 0, :l=> 50, :b=>200),
                 xaxis_tickangle = 90,)
plot(data, layout)









    






 

 






    



WARNING: deprecated syntax "abstract Shell".
Use "abstract type Shell end" instead.







    Out[6]:

3. Pair Statistics

Pointwise mutual information
Chi-square
Co-occurrence matrix
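
As a rough illustration of the first and last of these statistics, the sketch below builds a co-occurrence matrix from a toy binary occurrence matrix and derives pointwise mutual information from the counts. It uses the textbook definition of PMI; `X`, `C` and `pmi` are illustrative names, and this is not necessarily how `BCBIStats.COOccur.pmi_mat` computes its result.

```julia
# Toy occurrence matrix: rows are articles, columns are descriptors (1 = tagged).
X = [1 1 0;
     1 0 1;
     1 1 0;
     0 1 0]

n_articles = size(X, 1)

# Co-occurrence counts: C[i, j] = number of articles tagged with both i and j.
# The diagonal C[i, i] holds each descriptor's own article count.
C = X' * X

# Pointwise mutual information from counts (assumed textbook definition):
# PMI(i, j) = log(p(i, j) / (p(i) * p(j))), with probabilities estimated as counts / n.
function pmi(C, n, i, j)
    p_ij = C[i, j] / n
    p_i = C[i, i] / n
    p_j = C[j, j] / n
    return log(p_ij / (p_i * p_j))
end

println(pmi(C, n_articles, 1, 2))
println(pmi(C, n_articles, 2, 3))
```

Descriptors 2 and 3 never appear together in the toy data, so their PMI is `-Inf`; a real implementation may prefer to clip or skip such pairs.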



In [10]:

    
using BCBIStats.COOccur
using StatsBase

# Co-occurrence matrix - only for the top MeSH descriptors
# min_frequency = 5 -- alternatively compute topn based on min-frequency
top_occ = mesh_frequencies.array[:, count_perm[2:topn]]
top_mesh_labels = mesh_names[count_perm[2:topn]]
top_occ_sp = sparse(top_occ)
top_coo_sp = top_occ_sp' * top_occ_sp

# Pointwise mutual information
pmi_sp = BCBIStats.COOccur.pmi_mat(top_coo_sp)
# Chi-square
top_chi2 = BCBIStats.COOccur.chi2_mat(top_occ, min_freq=0);



In [12]:

    
using PlotlyJSFactory

p = create_chord_plot(top_coo_sp, labels = top_mesh_labels)
relayout!(p, title="Co-occurrences between top 50 MeSH terms")
JupyterPlot(p)









    Out[12]:

Association Rules

Compute using apriori algorithm (eclat version)
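
For readers unfamiliar with the rule metrics reported further down (`supp`, `conf`, `lift`), here is a minimal sketch of their standard definitions, evaluated for a single rule on a toy Boolean matrix. The function `rule_metrics` and the toy data are illustrative assumptions; the sketch does not reproduce the search performed by `apriori` in ARules.

```julia
# Toy transactions: rows are articles, columns are items A (1) and B (2).
occ = Bool[1 1;
           1 0;
           1 1;
           0 1;
           0 0]

# Support, confidence and lift of the rule {lhs...} => rhs (standard definitions).
function rule_metrics(occ, lhs, rhs)
    n = size(occ, 1)
    n_lhs = 0    # transactions containing every lhs item
    n_rhs = 0    # transactions containing the rhs item
    n_both = 0   # transactions containing lhs and rhs together
    for t in 1:n
        has_lhs = all(occ[t, j] for j in lhs)
        has_rhs = occ[t, rhs]
        n_lhs += has_lhs
        n_rhs += has_rhs
        n_both += has_lhs && has_rhs
    end
    supp = n_both / n            # fraction of transactions matching the whole rule
    conf = n_both / n_lhs        # how often rhs appears when lhs does
    lift = conf / (n_rhs / n)    # confidence relative to the baseline rate of rhs
    return supp, conf, lift
end

supp, conf, lift = rule_metrics(occ, [1], 2)
println("supp = ", supp, ", conf = ", conf, ", lift = ", lift)
```

A lift above 1 means the right-hand item appears with the left-hand items more often than its overall frequency alone would suggest.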



In [13]:

    
using ARules
using DataTables









    



WARNING: Method definition ==(Base.Nullable{S}, Base.Nullable{T}) in module Base at nullable.jl:238 overwritten in module NullableArrays at /Users/isa/.julia/v0.6/NullableArrays/src/operators.jl:99.
WARNING: Method definition append!(NullableArrays.NullableArray{WeakRefStrings.WeakRefString{T}, 1}, NullableArrays.NullableArray{WeakRefStrings.WeakRefString{T}, 1}) in module Data at /Users/isa/.julia/v0.6/DataStreams/src/DataStreams.jl:344 overwritten in module DataTables at /Users/isa/.julia/v0.6/DataTables/src/abstractdatatable/io.jl:318.
WARNING: Method definition describe(AbstractArray{T, N} where N where T) in module StatsBase at /Users/isa/.julia/v0.6/StatsBase/src/scalarstats.jl:559 overwritten in module DataTables at /Users/isa/.julia/v0.6/DataTables/src/abstractdatatable/abstractdatatable.jl:381.
WARNING: Compat.AsyncCondition is deprecated, use Base.AsyncCondition instead.
  likely near /Users/isa/.julia/v0.6/IJulia/src/kernel.jl:31



In [14]:

    
# Remove the search MeSH heading (set its column to 0). Rules of length 2 that include the
# search term correspond to the histogram, since the term is in every transaction
mh_occ = convert(BitArray{2}, mesh_frequencies.array)

mh_col = mesh_frequencies.dicts[2][mh]
mh_occ[:, mh_col] = zeros(size(mh_occ,1))

mh_lkup = convert(DataStructures.OrderedDict{String,Int16}, mesh_frequencies.dicts[2]) 
@time mh_rules = apriori(mh_occ, supp = 0.001, conf = 0.1, maxlen = 9)

#Pretty print of rules
mh_lkup = Dict(zip(values(mesh_frequencies.dicts[2]), keys(mesh_frequencies.dicts[2])))
rules_dt= ARules.rules_to_datatable(mh_rules, mh_lkup, join_str = " | ");









    



  0.915391 seconds (510.08 k allocations: 171.878 MiB, 5.38% gc time)



In [15]:

    
println(head(rules_dt))
println("Found ", size(rules_dt, 1), " rules")









    



6×5 DataTables.DataTable
│ Row │ lhs                 │ rhs                     │ supp       │ conf     │ lift    │
├─────┼─────────────────────┼─────────────────────────┼────────────┼──────────┼─────────┤
│ 1   │ {Alcoholism}        │ Mental Disorders        │ 0.00146882 │ 0.212121 │ 6.91727 │
│ 2   │ {Alcoholism}        │ Schizophrenia           │ 0.00110911 │ 0.160173 │ 12.3833 │
│ 3   │ {Astrocytoma}       │ Brain Neoplasms         │ 0.00289269 │ 0.846491 │ 28.7566 │
│ 4   │ {Autistic Disorder} │ Intellectual Disability │ 0.00185851 │ 0.286374 │ 7.78284 │
│ 5   │ {Bipolar Disorder}  │ Mental Disorders        │ 0.00125899 │ 0.270968 │ 8.83625 │
│ 6   │ {Bipolar Disorder}  │ Psychotic Disorders     │ 0.00121403 │ 0.26129  │ 20.4137 │
Found 112 rules

Frequent item sets



In [ ]:

    
supp_int = round(Int, 0.001 * size(mh_occ, 1))
@time root = frequent_item_tree(mh_occ, supp_int, 9);

supp_lkup = gen_support_dict(root, size(mh_occ, 1))
item_lkup = mesh_frequencies.dicts[2]
item_lkup_t = Dict(zip(values(item_lkup), keys(item_lkup)))
freq = ARules.suppdict_to_datatable(supp_lkup, item_lkup_t);

Visualization

Basic visualization of frequent item sets using a Sankey diagram



In [ ]:

    
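# Walk the frequent item tree depth-first and record one Sankey link per node:
# from the second-to-last item on the node's path to the last one, weighted by support.
function fill_sankey_data!(node, sources, targets, vals)
    if length(node.item_ids) > 1
        # item ids are shifted by one to give zero-based node indices for the plot
        push!(sources, node.item_ids[end-1] - 1)
        push!(targets, node.item_ids[end] - 1)
        push!(vals, node.supp)
    end
    if has_children(node)
        for nd in node.children
            fill_sankey_data!(nd, sources, targets, vals)
        end
    end
end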
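
To see what this traversal produces without running the full pipeline, the sketch below applies the same logic to a tiny hand-built tree. `DemoNode` is a hypothetical stand-in that models only the fields accessed above (`item_ids`, `supp`, `children`); the real node type returned by `frequent_item_tree` may differ, and `collect_links!` is an illustrative name.

```julia
# Hypothetical stand-in for the tree nodes; only the fields used by the
# traversal are modelled.
struct DemoNode
    item_ids::Vector{Int}        # items on the path from the root to this node (1-based)
    supp::Int                    # support count of that item set
    children::Vector{DemoNode}
end

# Same traversal as fill_sankey_data!, with has_children replaced by iterating
# over a possibly empty children vector.
function collect_links!(node, sources, targets, vals)
    if length(node.item_ids) > 1
        push!(sources, node.item_ids[end-1] - 1)  # zero-based index of the parent item
        push!(targets, node.item_ids[end] - 1)    # zero-based index of the added item
        push!(vals, node.supp)
    end
    for child in node.children
        collect_links!(child, sources, targets, vals)
    end
end

# Toy tree: {} -> {1} -> {1, 2} -> {1, 2, 3}
demo_leaf = DemoNode([1, 2, 3], 2, DemoNode[])
demo_mid = DemoNode([1, 2], 5, [demo_leaf])
demo_top = DemoNode([1], 9, [demo_mid])
demo_root = DemoNode(Int[], 0, [demo_top])

demo_sources, demo_targets, demo_vals = Int[], Int[], Int[]
collect_links!(demo_root, demo_sources, demo_targets, demo_vals)
println(demo_sources, " ", demo_targets, " ", demo_vals)
```

Single-item nodes contribute no link, so the toy tree yields two links: item 1 to item 2 with support 5, and item 2 to item 3 with support 2 (stored as zero-based indices).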



In [ ]:

    
sources = []
targets = []
vals = []
fill_sankey_data!(root, sources, targets, vals)



In [ ]:

    
trace = sankey(orientation="h",
               node = attr(domain=attr(x=[0,1], y=[0,1]),
                           pad=1/size(mh_occ, 1), thickness=1/size(mh_occ, 1),
                           line = attr(color="black", width=0.5),
                           label=mesh_names),
               link = attr(source=sources, target=targets, value=vals))
plot([trace])