Colonic Neoplasms Comorbidities (using Brown MySQL server)

This script run a PubMed-Comorbidities pipeline using the following characteristics:

  • Main MeSH Heading: Colonic Neoplasms
  • UMLS filtering concept: "Disease or Syndrome", "Mental or Behavioral Dysfunction" or "Neoplastic Process"
  • Articles analysed: All MEDLINE 2017AA articles tagged with the as a MeSH Heading. Note that this is equivalent to searching PubMed using [MH:noexp] Total number of articles found: 63304
  • UMLS concept filtering: Comorbidities are analysed on all other MeSH descriptors associated with the specified UMLS concepts
  • This script uses Brown MySQL databases:
    • medline
    • umls_meta
    • pubmed_miner

In [1]:
#Optional for running Step 1. However, sometimes the concurrancy of DatabaBase creates errors. 
#If that happens, reduce the number of process or comment out the line completely. 
#The fewer the processes the longer the task. At times I've successfully used 12 workers, at others only 2
# addprocs(2);

In [2]:
using Revise #used during development to detect changes in module - unknown behavior if using multiple processes
using PubMedMiner

In [3]:
#Settings
const mh = "Colonic Neoplasms"
const concepts = ("Disease or Syndrome", "Mental or Behavioral Dysfunction", "Neoplastic Process");

The folllowing code is designed to save to the pubmed_miner database a table containing the list of pmids and mesh descriptors that match the specified filtering criteria.


In [4]:
overwrite = false
@time save_semantic_occurrences(mh, concepts...; overwrite = overwrite)


INFO: 65256 Articles related to MH:Colonic Neoplasms
INFO: ----------------------------------------
INFO: Start all articles
INFO: Using concept table: MESH_T047
INFO: Using results table: colonic_neoplasms_mesh_t047
INFO: Table exists and will remain unchanged
  5.003936 seconds (2.79 M allocations: 144.687 MiB, 0.98% gc time)
INFO: Using concept table: MESH_T048
INFO: Using results table: colonic_neoplasms_mesh_t048
INFO: Table exists and will remain unchanged
INFO: Using concept table: MESH_T191
INFO: Using results table: colonic_neoplasms_mesh_t191
INFO: Table exists and will remain unchanged

2. Retrieve results and analyze simple occurrences and co-occurrences


In [5]:
using FreqTables

@time occurrence_df = get_semantic_occurrences_df(mh, concepts...);


INFO: Using concept table: MESH_T047
INFO: Using results table: colonic_neoplasms_mesh_t047
INFO: Using concept table: MESH_T048
INFO: Using results table: colonic_neoplasms_mesh_t048
INFO: Using concept table: MESH_T191
INFO: Using results table: colonic_neoplasms_mesh_t191
  2.479173 seconds (3.02 M allocations: 127.288 MiB, 2.27% gc time)

In [6]:
#temporary fix - FreqTables breaking with Unions of Missings
#see https://github.com/nalimilan/FreqTables.jl/issues/23

occurrence_df[:descriptor] = convert(Vector{String}, occurrence_df[:descriptor])
occurrence_df[:pmid] = convert(Vector{Int}, occurrence_df[:pmid])

@time mesh_frequencies = freqtable(occurrence_df, :pmid, :descriptor);

info("Found ", size(occurrence_df, 1), " related descriptors")


  1.927919 seconds (3.32 M allocations: 960.874 MiB, 5.41% gc time)
INFO: Found 181379 related descriptors

In [7]:
using PlotlyJS
using NamedArrays

# Visualize frequency 
topn = 50
mesh_counts = vec(sum(mesh_frequencies, 1))
count_perm = sortperm(mesh_counts, rev=true)
mesh_names = collect(keys(mesh_frequencies.dicts[2]))

#traces
#most frequent is epilepsy - remove from plot for better scaling
freq_trace = PlotlyJS.bar(; x = mesh_names[count_perm[2:topn]], y= mesh_counts[count_perm[2:topn]], marker_color="orange")

data = [freq_trace]
layout = Layout(;title="$(topn)-Most Frequent MeSH ",
                 showlegend=false,
                 margin= Dict(:t=> 70, :r=> 0, :l=> 50, :b=>200),
                 xaxis_tickangle = 90,)
plot(data, layout)


Plotly javascript loaded.

To load again call

init_notebook(true)

Out[7]:

3. Pair Statistics

  • Mutual information
  • Chi-Square
  • Co-occurrance matrix

In [8]:
using BCBIStats.COOccur
using StatsBase

#co-occurrance matrix - only for topp MeSH 
# min_frequency = 5 -- alternatively compute topn based on min-frequency
top_occ = mesh_frequencies.array[:, count_perm[2:topn]]
top_mesh_labels = mesh_names[count_perm[2:topn]]
top_occ_sp = sparse(top_occ)
top_coo_sp = top_occ_sp' * top_occ_sp


#Point Mutual Information
pmi_sp = BCBIStats.COOccur.pmi_mat(top_coo_sp)
#chi2
top_chi2= BCBIStats.COOccur.chi2_mat(top_occ, min_freq=0);
#correlation
corrcoef = BCBIStats.COOccur.corrcoef(top_occ);

Plot Matrix of Pair Statistics


In [9]:
function plot_stat_mat(stat_mat, labels)
    stat_trace = heatmap(x=labels, y=labels, z=full(stat_mat- spdiagm(diag(stat_mat))))

    data = [stat_trace]
    layout = Layout(;
                     showlegend=false,
                     height = 900, width=900,
                     margin= Dict(:t=> 300, :r=> 0, :l=> 200, :b=>0),
                     xaxis_tickangle = 90, xaxis_autotick=false, yaxis_autotick=false,
                     yaxis_autorange = "reversed",
                     xaxis_side = "top", 
                     xaxis_ticks = "", yaxis_ticks = "")
    plot(data, layout)
end


Out[9]:
plot_stat_mat (generic function with 1 method)

Correlation Coefficient


In [10]:
plot_stat_mat(corrcoef, top_mesh_labels)


Out[10]:

Pointwise Mutual Information


In [11]:
plot_stat_mat(pmi_sp, top_mesh_labels)


Out[11]:

In [12]:
using PlotlyJSFactory

p = create_chord_plot(top_coo_sp, labels = top_mesh_labels)
relayout!(p, title="Co-occurrances between top 50 MeSH terms")
JupyterPlot(p)


Out[12]:

Association Rules

  • Compute using apriori algorithm (eclat version)

In [13]:
using ARules
using DataFrames

In [14]:
# Remove serch MeSH (set column to 0). Rules of lenght 2 with search term correspond to histogram, 
# since it is in every transaction
mh_occ = convert(BitArray{2}, mesh_frequencies.array)

mh_col = mesh_frequencies.dicts[2][mh]
mh_occ[:, mh_col] = zeros(size(mh_occ,1))

mh_lkup = convert(DataStructures.OrderedDict{String,Int16}, mesh_frequencies.dicts[2]) 
@time mh_rules = apriori(mh_occ, supp = 0.001, conf = 0.1, maxlen = 9)

#Pretty print of rules
mh_lkup = Dict(zip(values(mesh_frequencies.dicts[2]), keys(mesh_frequencies.dicts[2])))
rules_dt= ARules.rules_to_dataframe(mh_rules, mh_lkup, join_str = " | ");


  1.164144 seconds (541.16 k allocations: 175.056 MiB, 17.62% gc time)

In [15]:
println(head(rules_dt))
println("Found ", size(rules_dt, 1), " rules")


6×5 DataFrames.DataFrame. Omitted printing of 3 columns
│ Row │ lhs                                           │ rhs                    │
├─────┼───────────────────────────────────────────────┼────────────────────────┤
│ 1   │ {Acute Disease}                               │ Intestinal Obstruction │
│ 2   │ {Adenocarcinoma, Mucinous}                    │ Adenocarcinoma         │
│ 3   │ {Adenocarcinoma, Mucinous | Rectal Neoplasms} │ Adenocarcinoma         │
│ 4   │ {Adenocarcinoma | Adenocarcinoma, Mucinous}   │ Rectal Neoplasms       │
│ 5   │ {Adenoma}                                     │ Adenocarcinoma         │
│ 6   │ {Adenoma | Neoplasms, Multiple Primary}       │ Adenocarcinoma         │
Found 816 rules

Frequent Item Sets


In [16]:
supp_int = round(Int, 0.001 * size(mh_occ, 1))
@time root = frequent_item_tree(mh_occ, supp_int, 9);

supp_lkup = gen_support_dict(root, size(mh_occ, 1))
item_lkup = mesh_frequencies.dicts[2]
item_lkup_t = Dict(zip(values(item_lkup), keys(item_lkup)))
freq = ARules.suppdict_to_dataframe(supp_lkup, item_lkup_t);


  0.112915 seconds (86.07 k allocations: 148.683 MiB, 33.42% gc time)

In [17]:
println(head(freq))
println("Found ", size(freq, 1), " frequent itemsets")


6×2 DataFrames.DataFrame
│ Row │ itemset                                                 │ supp │
├─────┼─────────────────────────────────────────────────────────┼──────┤
│ 1   │ {Adenocarcinoma, Mucinous,Rectal Neoplasms}             │ 203  │
│ 2   │ {Obesity}                                               │ 281  │
│ 3   │ {Diverticulitis}                                        │ 139  │
│ 4   │ {Carcinoma, Squamous Cell,Melanoma}                     │ 68   │
│ 5   │ {Adenocarcinoma,Pancreatic Neoplasms,Stomach Neoplasms} │ 81   │
│ 6   │ {Carcinoma, Squamous Cell,Rectal Neoplasms}             │ 154  │
Found 679 frequent itemsets

Visualization of Frequent Item Sets

  • Basic visualization of frequent item sets using Sankey diagram (experimental - use with caution)
  • Future work inclused better layout for more links as well as the ability to dinamically change the number of of itemsets

In [18]:
function fill_sankey_data!(node, sources, targets, vals)
    if length(node.item_ids) >1
        push!(sources, node.item_ids[end-1]-1)
        push!(targets, node.item_ids[end]-1)
        push!(vals, node.supp)
    end
    if has_children(node)     
        for nd in node.children
            fill_sankey_data!(nd,  sources, targets, vals)
        end
    end
end


Out[18]:
fill_sankey_data! (generic function with 1 method)

In [19]:
sources = []
targets = []
vals = []
fill_sankey_data!(root, sources, targets, vals);

In [20]:
# size(sources)
topn_links = 50
freq_vals_perm = sortperm(vals, rev=true)
s = sources[freq_vals_perm[1:topn_links]]
t = targets[freq_vals_perm[1:topn_links]]
v = vals[freq_vals_perm[1:topn_links]]
l = mesh_names

println("Found, ", length(sources), "links, showing ", topn_links)


Found, 513links, showing 50

In [21]:
pad = 1e-7
trace=sankey(orientation="h",
             node = attr(domain=attr(x=[0,1], y=[0,1]), pad=pad, thickness=pad, line = attr(color="black", width= 0.5),
                         label=l), 
             link = attr(source=s, target=t, value = v))

layout = Layout(width=900, height=1100)

plot([trace], layout)


Out[21]:

In [ ]: