In [192]:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pyspark.sql import HiveContext

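# `sc` is an existing SparkContext; it is not created in this cell.
# Each input line is one "sentence" of space-separated clinical concept tokens.
sentences = sc.textFile("practice_fusion/sentences_nlp").map(lambda row: row.split(" "))
word2vec = Word2Vec()
# Fix the seed for reproducibility and learn 100-dimensional vectors.
word2vec.setSeed(0)
word2vec.setVectorSize(100)
model = word2vec.fit(sentences)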

In [193]:
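def normalize_icd9(icd9):
    """Insert the decimal point into an undotted ICD-9 code, e.g. '4400' -> '440.0'."""
    first_part = icd9[0:3].lower()
    second_part = icd9[3:]
    if len(second_part) > 0:
        return first_part + '.' + second_part
    else:
        return first_part

def read_diag_map():
    """Map normalized ICD-9 diagnosis codes to their long descriptions."""
    ret = {}
    with open('/root/clinical2vec/CMS32_DESC_LONG_DX.txt') as f:
        content = f.readlines()
    for line in content:
        # The first six columns hold the code; the rest of the line is the description.
        key = normalize_icd9(line[0:6].strip())
        value = line[6:].strip()
        ret[key] = value
    return ret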


diag_map = read_diag_map()

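def pretty_print(concept):
    """Render a 'dx::<code>' token as 'dx: <code> -- <description>'.

    If the code is not in diag_map, retry with trailing zeros appended.
    Unresolved diagnosis tokens come back in raw 'dx::' form (possibly
    zero-padded); non-diagnosis tokens are returned unchanged.
    """
    tokens = concept.split('::')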
    if tokens[0] == 'dx':
        diag = tokens[1]
        if diag[-1] == '.':
            diag = diag[0:len(diag) - 1]
        try:
            return 'dx: {} -- {}'.format(diag, diag_map[diag])
        except KeyError:
            if '.' not in diag:
                first_try = pretty_print('dx::' + diag + '.0')
                if first_try.startswith('dx::'):
                    return pretty_print('dx::' + diag + '.00')
                else:
                    return first_try
            else:
                if diag.endswith('00'):
                    return concept
                else:
                    return pretty_print('dx::' + diag + '0')
    else:
        return concept

    
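def print_synonyms_filt(clinical_concept, model, prefix):
    # Fetch a large candidate list so enough matches survive the prefix filter.
    synonyms = model.findSynonyms(clinical_concept, 10000)
    i = 0
    # findSynonyms yields (word, cosine similarity) pairs, most similar first.
    for word, cosine_similarity in synonyms:
        if prefix is None or word.startswith(prefix):
            print("{}: {}".format(cosine_similarity, pretty_print(word)))
            i = i+1
        # Stop once 11 matches have been printed.
        if i > 10:
            return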

def print_synonyms(clinical_concept, model):
    print_synonyms_filt(clinical_concept, model, None)
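
To make the code handling above concrete, here is a minimal sketch of the normalization and the zero-padding fallback on toy data. `normalize_icd9` is repeated from the cell above so the snippet runs on its own. `toy_diag_map` and `describe` are hypothetical names introduced only for illustration: the table holds three descriptions copied from the outputs in this notebook, and `describe` is a simplified, non-recursive stand-in for the retry logic in `pretty_print`, not a replacement for it.

```python
def normalize_icd9(icd9):
    """Insert the decimal point after the third character and lowercase the prefix."""
    first_part = icd9[0:3].lower()
    second_part = icd9[3:]
    return first_part + '.' + second_part if second_part else first_part

# Toy description table (hypothetical); entries copied from outputs in this notebook.
toy_diag_map = {
    '135': 'Sarcoidosis',
    '389.10': 'Sensorineural hearing loss, unspecified',
    'v43.65': 'Knee joint replacement',
}

def describe(diag, diag_map):
    """Hypothetical helper: try the code as written, then with one or two trailing zeros."""
    base = diag if '.' in diag else diag + '.'
    for candidate in (diag, base + '0', base + '00'):
        if candidate in diag_map:
            return candidate, diag_map[candidate]
    return diag, None

print(normalize_icd9('4400'))           # 440.0
print(normalize_icd9('V4365'))          # v43.65
print(normalize_icd9('135'))            # 135
print(describe('389.1', toy_diag_map))  # ('389.10', 'Sensorineural hearing loss, unspecified')
print(describe('999', toy_diag_map))    # ('999', None)
```

The fallback matters because the same diagnosis can appear in the data with or without trailing zeros, while the description file lists only one spelling.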

Atherosclerosis of the Aorta

Atherosclerosis is also known as hardening of the arteries. It is a major contributor to heart disease, the number one killer of Americans.


In [200]:
print_synonyms('dx::440.0', model)


0.930721402168: dx: v12.71 -- Personal history of peptic ulcer disease
0.926115810871: dx: 533.40 -- Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction
0.91034334898: dx: 153.6 -- Malignant neoplasm of ascending colon
0.90947073698: dx: 238.75 -- Myelodysplastic syndrome, unspecified
0.907130658627: dx: 389.10 -- Sensorineural hearing loss, unspecified
0.90490090847: dx: 428.30 -- Diastolic heart failure, unspecified
0.902494549751: dx: v43.65 -- Knee joint replacement
0.898817896843: dx: 396.2 -- Mitral valve insufficiency and aortic valve stenosis
0.89858096838: dx: 433.10 -- Occlusion and stenosis of carotid artery without mention of cerebral infarction
0.890106379986: dx: 405.11 -- Benign renovascular hypertension
0.889486372471: dx: 433.10 -- Occlusion and stenosis of carotid artery without mention of cerebral infarction

Peptic Ulcers

A connection between ulcers and atherosclerosis has long been noted, partly because smokers have a higher than average incidence of both peptic ulcers and atherosclerosis. An editorial in the British Medical Journal discussed this as far back as the 1970s.

Hearing Loss

From a 2012 article in the Journal of Atherosclerosis:

Sensorineural hearing loss seemed to be associated with vascular endothelial dysfunction and an increased cardiovascular risk

Knee Joint Replacements

These procedures are common among people with osteoarthritis, and the literature reports a solid correlation between osteoarthritis and atherosclerosis.

Crohn's Disease

Crohn's disease is a type of inflammatory bowel disease that is caused by a combination of environmental, immune and bacterial factors. Let's see if we can recover some of these connections from the data.


In [194]:
#Crohn's Disease
print_synonyms('dx::555.9', model)


0.870913982391: dx: 274.03 -- Chronic gouty arthropathy with tophus (tophi)
0.869603157043: dx: 522.5 -- Periapical abscess without sinus
0.863405406475: dx: 579.3 -- Other and unspecified postsurgical nonabsorption
0.859003782272: dx: 135 -- Sarcoidosis
0.85546463728: dx: 112.3 -- Candidiasis of skin and nails
0.853841125965: dx: v16.42 -- Family history of malignant neoplasm of prostate
0.852528512478: dx: 478.8 -- Upper respiratory tract hypersensitivity reaction, site unspecified
0.850185573101: dx::287.400
0.849603950977: dx: 339.10 -- Tension type headache, unspecified
0.848721027374: dx: 728.0 -- Infective myositis
0.84819817543: dx: 786.51 -- Precordial pain

Arthritis

From the Crohn's and Colitis Foundation of America:

Arthritis, or inflammation of the joints, is the most common extraintestinal complication of IBD. It may affect as many as 25% of people with Crohn’s disease or ulcerative colitis. Although arthritis is typically associated with advancing age, in IBD it often strikes the youngest patients.

Dental Abscesses

While little medical literature draws a specific link between dental abscesses and Crohn's (general oral issues have been noted), the Crohn's forums contain lengthy discussions of abscesses being a common occurrence with Crohn's.

Yeast Infections

Candidiasis of skin and nails is a yeast infection of the skin. From the journal "Critical Reviews in Microbiology":

It is widely accepted that Crohn's disease could result from an inappropriate inflammatory response to intestinal microorganisms in a genetically susceptible host. Most studies to date have concerned the involvement of bacteria in disease progression. In addition to bacteria, there appears to be a possible link between the commensal yeast Candida albicans and disease development.

Drugs associated with HIV/AIDS

The notion of a 'synonym' can also find connections between clinical data types. Here we look for the drugs most associated with HIV/AIDS.
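
The idea behind a prefix-filtered synonym lookup can be sketched without Spark: rank every other token by cosine similarity to the query vector, then keep only tokens of the wanted type. This is a brute-force illustration under stated assumptions, not how MLlib implements `findSynonyms`. The two-dimensional vectors, the token names `rx::drug_a` and `rx::drug_b`, and the helpers `cosine_similarity` and `nearest` are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embedding; real vectors would come from the trained model.
toy_vectors = {
    'dx::042': [1.0, 0.0],
    'dx::135': [0.9, 0.1],
    'rx::drug_a': [0.8, 0.6],
    'rx::drug_b': [0.0, 1.0],
}

def nearest(query, vectors, prefix=None, top_n=10):
    """Return up to top_n (token, similarity) pairs, most similar first."""
    q = vectors[query]
    scored = [(word, cosine_similarity(q, vec))
              for word, vec in vectors.items()
              if word != query and (prefix is None or word.startswith(prefix))]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Only 'rx' tokens are ranked; the closer diagnosis 'dx::135' is filtered out.
for word, score in nearest('dx::042', toy_vectors, prefix='rx'):
    print("{}: {}".format(score, word))
```

Because diagnoses and drugs share one vector space, restricting the neighbor list by prefix is all it takes to ask a cross-type question such as "which drugs sit closest to this diagnosis?".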


In [195]:
print_synonyms_filt('dx::042', model, 'rx')


0.510371923447: rx::efavirenz/emtricitabine/tenofovir
0.40061840415: rx::benzoyl_peroxide_topical
0.36375400424: rx::morphine
0.354891896248: rx::emollients,_topical
0.350577175617: rx::ziconotide
0.347965329885: rx::scopolamine
0.34552615881: rx::tobramycin
0.342438489199: rx::oxycodone
0.337265789509: rx::fentanyl
0.331533819437: rx::tizanidine
0.330983161926: rx::apap/butalbital/caffeine/codeine

From the list above, we see:

  • A combination of antiretrovirals (efavirenz/emtricitabine/tenofovir) commonly used to treat HIV.
  • Benzoyl peroxide and topical emollients, used to treat skin problems that are side effects of HIV medication.
  • A selection of pain relievers.