Welcome to the Jupyter Notebook describing BioTechTopics!

Here, I will show a brief introduction to the application. Thanks for visiting!

-Ryan Davis

All of the code is wrapped in a class called Topics, which lives in the BioTechTopics module.

Here we will create an instance of Topics called t.

Creating the instance reads the corpus (in JSON) using pandas, builds a matrix representation of the corpus using CountVectorizer, then performs Latent Dirichlet Allocation (LDA). As the log below shows, a tf-idf vectorizer is trained along the way as well. Setup is finished when "Topics instance ready" is printed.


In [1]:
from BioTechTopics import Topics
t=Topics()


Reading corpus
Corpus contains 2419 unique files

Training Count Vectorizer
Done training after 93.5439291 seconds

Training tfidf Vectorizer
Done training after 98.9199659824 seconds

Training LDA probability distributions
Done training after 41.9361331463 seconds

Topics instance ready
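
For readers who want to see the count-matrix and LDA steps from the log above in isolation, here is a minimal sketch using scikit-learn. It is not the code inside Topics: the four toy documents, the choice of two topics, the English stop-word list, and the `ngram_range=(1, 2)` setting are all assumptions made for illustration (the bigram setting is a guess based on two-word terms such as "blood test" appearing in the topic lists).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the corpus (hypothetical documents, not the real data).
docs = [
    "blood test lab results from a finger prick",
    "lab blood test accuracy questioned by regulators",
    "gut bacteria and the microbiome in disease",
    "microbiome therapy uses bacteria to treat infection",
]

# Step 1: bag-of-words count matrix, one row per document.
# ngram_range=(1, 2) is an assumption; it keeps single words and word pairs.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# Step 2: fit LDA on the counts; n_components is the number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # each row is P(topic | document)

# Step 3: normalise each topic's row of components_ to get P(word | topic).
word_given_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.shape, word_given_topic.shape)
```

On the real corpus the same steps would simply run on 2419 documents instead of four, which is where the training times in the log come from.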

Now we can show Word Cloud representations of the corpus!

t.showTopicWordCloud(x) will show the word cloud for topic number x found by LDA. The size of each word in the word cloud is proportional to its count within the corpus multiplied by P(word|topic=x) (from LDA). Words for other topics will be shown below.
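
As a rough illustration of that sizing rule, the sketch below multiplies a corpus count by P(word|topic) for each word and rescales so the largest weight is 1. The function name `cloud_weights` and all of the numbers are hypothetical; they are not taken from the Topics class or the real corpus.

```python
# Hypothetical corpus-wide counts and P(word | topic) values for one topic.
corpus_counts = {"blood": 120, "test": 300, "patient": 900, "walgreen": 15}
p_word_given_topic = {"blood": 0.20, "test": 0.10, "patient": 0.01, "walgreen": 0.30}


def cloud_weights(counts, probs):
    """Return count * P(word | topic) per word, scaled so the largest is 1."""
    raw = {w: counts[w] * probs.get(w, 0.0) for w in counts}
    top = max(raw.values())
    if top == 0:
        return raw
    return {w: v / top for w, v in raw.items()}


weights = cloud_weights(corpus_counts, p_word_given_topic)
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.2f}")
```

Note how "patient" is by far the most frequent word here but ends up small, because its probability under this topic is low. That is the point of weighting by both factors.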


In [2]:
t.showTopicWordCloud(0)


Topic 0 appears to be about liquid biopsy, since it references blood tests, finger pricks, and Theranos. Future work will conduct sentiment analysis on sentences containing these keywords and display the sentiment as color on these word clouds.

We can view the most important words in all topics

Here, "most important words" means the words with the highest values of P(word|topic=x) for each topic x. For some of the topics I manually assigned a name, which is given in all caps.
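
The ranking itself is simple enough to sketch in a few lines. The helper below, `top_words`, is a hypothetical stand-in and not the implementation of t.printTopWords; the vocabulary and probabilities are made up for the example.

```python
def top_words(p_word_given_topic, vocabulary, n):
    """For each topic row, return the n words with the highest probability."""
    result = []
    for row in p_word_given_topic:
        # Indices of the vocabulary sorted by descending P(word | topic).
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n]])
    return result


# Hypothetical vocabulary and P(word | topic) rows, one row per topic.
vocabulary = ["blood test", "lab", "microbi", "patient", "trial"]
p = [
    [0.40, 0.30, 0.05, 0.15, 0.10],
    [0.05, 0.05, 0.60, 0.20, 0.10],
]

ranked_words = top_words(p, vocabulary, 3)
for k, words in enumerate(ranked_words):
    print("Topic #%d: %s" % (k, ", ".join(words)))
```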


In [3]:
t.printTopWords(10)


Topic #0:  BLOOD TESTS AND HEALTHCARE PAYERS: therano, sleep, lab, medicar, medicaid, walgreen, holm, medicar medicaid, blood test, center medicar

Topic #1:  MICROBIOME AND BACTERIA: finch, jazz, difficil, celat, draper, c difficil, vyxeo, draper esprit, esprit, microbi

Topic #2:  NEW DRUGS: sarepta, eteplirsen, medivir, woodcock, dr, rubiu, flagship, warp, warp drive, epstein

Topic #3:  CLINICAL DRUG DEVELOPMENT: compani, patient, said, drug, cancer, year, million, develop, new, use

Topic #4:  IMMUNOTHERAPIES: phase, trial, data, patient, drug, studi, treatment, year, endpoint, dose

Topic #5:   : dynavax, zymework, alcon, lens, cataract, gammadelta, len, heplisavb, oxstem, synthego

Topic #6:  HEALTH DATA: elligo, kush, realworld data, data differ, differ sourc, use case, allow data, structur allow, esourc, provid safeti

Topic #7:   : winter, silicon therapeut, silicon, come sanofi, cluster head, bostonbas compani, look scale, overcom immunosuppress, diseas challeng, head overcom

Topic #8:  DEVICES AND DIAGNOSTICS: abbott, uniqur, aler, jude, st jude, st, fix, jude medic, acquisit, diagnost

Topic #9:  TOPIC 9: opioid, quintil, curevac, im, rotat, im health, biontech, p53, mrna, cuff

Here we use the "Who's Who" function, t.ww, to retrieve the named entities associated with the query "microbiome". Some examples include:

Indigo Agriculture: Indigo Agriculture is a startup using plant microbiomes to strengthen crops against disease and drought to increase crop yield for farmers.

Noubar Afeyan: Cofounder of Evelo Therapeutics, which is using microbiome therapy to treat cancer.

Asit Parikh, MD, PhD: Head of Takeda's gastroenterology therapeutic area unit, which just acquired NuBiyota. NuBiyota is a microbiome therapeutic company that focuses on gastrointestinal indications.


In [4]:
t.ww('microbiome')