Welcome to the Jupyter Notebook describing the class BioTechTopics (BTT)!

Purpose: To enable biotech business professionals to quickly identify prominent individuals and companies in the life science industry


In [4]:
from BioTechTopics import Topics
from plotBokehJpnb2 import plotBokehInJpnb
import time

# make instance of Topics object and load the data
t=Topics()
t.load() # unpickles LDA, tf, and tf-idf representations, puts text data from JSON into pandas dataframe 
#plotBokehInJpnb(t,'antibody')


Topics instance ready
Corpus has 33884 documents

Section 1: Introduction & Contents

Section 2: How it works:

BTT is trained on the entire 100 MB corpus of biotech business articles from Fiercebiotech.com.

Section 3: Performance:

A query result is returned in about one second.

Section 4: Business takeaways:

Digital health topics are gaining visibility and prominence in the life science business world.

Section 2: How BTT works

Basic operation:

BTT input: search query.

Output: interactive scatterplot that identifies prominent individuals and companies related to the query.

Most of the code is wrapped up in the Topics class. After creating an instance, the tf-idf representations and TextRank keywords are loaded from JSON and pickle files using the load() method. See the first code cell above for an example.

Step-by-step

1) tf-idf is used to find documents related to the query.

2) pandas is used to search a JSON file for the pre-computed named entities of each document. Named entities are only returned if they are in the top 50 percent by TextRank score. Call these TextRank-weighted named entities "prominent entities".

3) Each prominent entity is given a score equal to (cosine similarity of the document the entity is found in)*(TextRank score). The top 200 are plotted in Bokeh. The y-axis value is equal to the product of the TextRank and tf-idf scores.
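
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code inside the Topics class. The helper names (tfidf_vectors, cosine, prominent_entities), the toy documents and TextRank scores, and the use of the corpus-wide median as the 50 percent cutoff are all assumptions made for the example. The hand-rolled tf-idf stands in for the pickled scikit-learn representation that BTT loads.

```python
import math
from collections import Counter
from statistics import median


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document plus the idf table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf keeps every weight strictly positive.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prominent_entities(query, docs, entities, top_n=200):
    """Rank entities by (document similarity to query) * (TextRank score).

    entities[i] holds hypothetical (name, textrank_score) pairs for docs[i].
    """
    vectors, idf = tfidf_vectors(docs)
    query_counts = Counter(query.lower().split())
    query_vec = {t: c * idf[t] for t, c in query_counts.items() if t in idf}

    # Step 2 (assumed corpus-wide cutoff): keep the top half by TextRank.
    cutoff = median(score for ents in entities for _, score in ents)

    scored = []
    for vec, ents in zip(vectors, entities):
        similarity = cosine(query_vec, vec)  # step 1
        if similarity == 0.0:
            continue  # document is unrelated to the query
        for name, textrank in ents:
            if textrank >= cutoff:
                scored.append((name, similarity * textrank))  # step 3

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data, invented for illustration only.
docs = [
    "antibody drug trial results",
    "digital health app launch",
    "antibody licensing deal",
]
entities = [
    [("xoma ltd", 0.9), ("trial site", 0.1)],
    [("ibm watson", 0.8)],
    [("tesaro", 0.6), ("deal desk", 0.2)],
]
results = prominent_entities("antibody", docs, entities)
print(results)
```

Low-scoring entities are dropped before ranking, so the plot only ever receives entities that are both prominent in their document and found in a document related to the query.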

User experience

I currently interact with this app through a Bokeh server. The user can type in a query and see the results in real time (see the screenshot below). Soon this will be on Heroku.

Section 3: Performance speed

I pickled the tf-idf representation and threw all of the TextRank keywords into a JSON file (read with pandas) so that identification of prominent individuals could be done very quickly.


In [5]:
start = time.time()

# these next two lines allow you to submit a query to the algorithm
t.ww2('antibody') # Who's who? function - does information retrieval
data_scatter_dict = t.formatSearchResults(format='tfidf_tf_product',return_top_n=200) #user can format data in various ways
end = time.time()
print('Query took ' + str(end-start) + ' seconds to execute')
print('Some hits:')
print([str(data_scatter_dict['keywords'][x]) for x in [0,11,20]])


Found 463 documents relevant to query "antibody"
Query took 1.18653202057 seconds to execute
Some hits:
['tesaro conference call webcast', 'xoma ltd', 'stop kyprolis']

Fast named entity return is possible because all NLP (TextRank, tf-idf, named entity recognition) is done offline, stored in pickle or JSON format, and loaded at query time.
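
The precompute-then-load pattern can be sketched as follows. This is an assumption-laden illustration rather than the body of load(): the file names (tfidf.pkl, keywords.json), the function names, and the toy objects are hypothetical, and the standard-library json module is used here where BTT reads its JSON with pandas.

```python
import json
import os
import pickle
import tempfile


def save_artifacts(directory, tfidf_model, keywords_by_doc):
    """Offline step: run once after the expensive NLP has finished."""
    with open(os.path.join(directory, "tfidf.pkl"), "wb") as fh:
        pickle.dump(tfidf_model, fh, protocol=pickle.HIGHEST_PROTOCOL)
    with open(os.path.join(directory, "keywords.json"), "w") as fh:
        json.dump(keywords_by_doc, fh)


def load_artifacts(directory):
    """Online step: cheap deserialization, no NLP at query time."""
    # Only unpickle files you produced yourself; pickle can execute code.
    with open(os.path.join(directory, "tfidf.pkl"), "rb") as fh:
        tfidf_model = pickle.load(fh)
    with open(os.path.join(directory, "keywords.json")) as fh:
        keywords_by_doc = json.load(fh)
    return tfidf_model, keywords_by_doc


# Toy stand-ins for the real tf-idf model and TextRank keywords.
with tempfile.TemporaryDirectory() as workdir:
    model = {"vocabulary": {"antibody": 0, "health": 1}}
    keywords = {"doc-0": [["xoma ltd", 0.9]], "doc-1": [["ibm watson", 0.8]]}
    save_artifacts(workdir, model, keywords)
    loaded_model, loaded_keywords = load_artifacts(workdir)

print(loaded_model == model and loaded_keywords == keywords)
```

Pickle suits arbitrary Python objects such as a fitted vectorizer, while JSON keeps the keyword data readable and easy to load into a dataframe.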

The user can mouse over the scatterplot data to identify the named entity.

The user can track trends and named entities relevant to their query as a function of time.

Key packages: Scikit-learn, pandas, nltk, Bokeh, Scrapy

Section 4: Business Take-away

The plot below (bottom half) shows that digital health is gaining visibility and attention in the life science industry. For each year, it shows the sum of the cosine similarity between the phrase "digital health" and each document in that year, normalized by the total number of documents in that year. Thus, the plot can be interpreted as showing that digital health is occupying more and more attention among life science business professionals. This is therefore an exciting time for biomedical researchers with data-handling skills like myself! I believe that my unique combination of programming and life science industry exposure would compound well with the additional training from the Data Incubator, resulting in a quick offer from one of the Data Incubator's partner companies after completing the program.

Data points in the upper right quadrant include:

IBM Watson: Health AI

England's National Health Service: recently launched NHS Digital Academy, a health informatics training program

Launchpad Digital Health: Incubator/VC for digital health companies
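
The yearly trend described in this section (summed cosine similarity per year, divided by that year's document count) can be sketched as below. The function name yearly_attention and the sample numbers are hypothetical. In BTT the similarities would come from the tf-idf comparison against "digital health".

```python
from collections import defaultdict


def yearly_attention(similarities, years):
    """Return {year: sum of similarities / number of documents} per year.

    similarities[i] is the cosine similarity between the query and
    document i; years[i] is that document's publication year.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for similarity, year in zip(similarities, years):
        totals[year] += similarity
        counts[year] += 1
    # Dividing by the document count keeps years with more articles
    # from looking more "attentive" just because they are larger.
    return {year: totals[year] / counts[year] for year in sorted(counts)}


# Invented numbers, for illustration only.
similarities = [0.0, 0.2, 0.1, 0.5]
years = [2015, 2015, 2016, 2016]
trend = yearly_attention(similarities, years)
print(trend)
```

Because each year is divided by its own document count, a rising curve reflects a larger share of attention rather than growth in the number of articles published.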


In [6]:
#plotBokehInJpnb(t,'digital health')

Summary: Key Takeaways

Performance (compared to semi-final)

1) BioTechTopics is faster:

All keyword extraction, named entity recognition, and LDA are performed beforehand, and the results are pickled or put into JSON. This vastly speeds up the process because NLP does not have to be done in real time.

2) BioTechTopics is bigger:

Scrapy was used to scrape the entire fiercebiotech.com website, resulting in 99 MB of text data and 32,000 separate documents.

3) BioTechTopics is more user friendly:

Interactive Bokeh plots were implemented.

Business take-away

4) Digital health is consistently gaining attention in the life science industry