In [4]:
from BioTechTopics import Topics
from plotBokehJpnb2 import plotBokehInJpnb
import time
# make instance of Topics object and load the data
t=Topics()
t.load() # unpickles LDA, tf, and tf-idf representations, puts text data from JSON into pandas dataframe
#plotBokehInJpnb(t,'antibody')
BTT is trained on the entire 100 MB corpus of biotech business articles from Fiercebiotech.com.
Query result is returned in about one second.
Digital health topics are gaining visibility and prominence in the life science business world.
BTT input: search query.
Output: interactive scatterplot that identifies prominent individuals and companies related to the query.
Most of the code is wrapped up in the Topics class. After creating an instance, the tf-idf representations and TextRank keywords are loaded from JSON and pickle files by the load() function; see the code above for an example.
1) tf-idf is used to find documents related to query
2) pandas is used to search a JSON file for the pre-computed named entities of each document. Named entities are returned only if their TextRank score is in the top 50th percentile; call these TextRank-weighted named entities "prominent entities".
3) Each prominent entity is given a score equal to (cosine similarity of the document the entity is found in) * (TextRank score). The top 200 are plotted in Bokeh, with the y-axis value equal to the product of the TextRank and tf-idf scores.
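The three steps above can be sketched in a few lines. This is a toy illustration, not the project's actual Topics code: the documents, entity names, and TextRank scores below are made up, and the real pipeline loads pre-computed values rather than fitting a vectorizer at query time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus; the real corpus is the scraped Fiercebiotech articles
docs = [
    "antibody therapy shows promise in oncology trials",
    "digital health startup raises funding round",
    "new antibody platform licensed by pharma company",
]
# pretend pre-computed TextRank scores, one named entity per document
entities = ["AcmeBio", "HealthCo", "PharmaX"]   # hypothetical names
textrank = {"AcmeBio": 0.9, "HealthCo": 0.4, "PharmaX": 0.7}

vec = TfidfVectorizer()
doc_tfidf = vec.fit_transform(docs)

# step 1: cosine similarity between the query and each document
sims = cosine_similarity(vec.transform(["antibody"]), doc_tfidf).ravel()

# step 3: entity score = (cosine similarity of its document) * (TextRank score)
scores = {e: sims[i] * textrank[e] for i, e in enumerate(entities)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Entities from documents unrelated to the query (here HealthCo, whose document shares no terms with "antibody") get a score of zero and fall to the bottom of the ranking.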
I currently interact with this app through a Bokeh server. The user can type in a query and see the results in real time (see the screenshot below). Soon this will be on Heroku.
I pickled the tf-idf representation and put all of the TextRank keywords into a JSON file (read with pandas) so that prominent individuals could be identified very quickly.
In [5]:
start = time.time()
# these next two lines allow you to submit a query to the algorithm
t.ww2('antibody') # Who's who? function - does information retrieval
data_scatter_dict = t.formatSearchResults(format='tfidf_tf_product',return_top_n=200) #user can format data in various ways
end = time.time()
print('Query took ' + str(end - start) + ' seconds to execute')
print('Some hits:')
print([str(data_scatter_dict['keywords'][x]) for x in [0, 11, 20]])
Fast named-entity lookup is possible because all NLP (TextRank, tf-idf, named entity recognition) is done offline, stored in pickle or JSON format, and loaded later.
The user can mouse over the scatterplot points to identify each named entity, and can track trends and named entities relevant to a query as a function of time.
Key packages: Scikit-learn, pandas, nltk, Bokeh, Scrapy
The plot below (bottom half) shows that digital health is gaining visibility and attention in the life science industry. For each year, it plots the sum of the cosine similarity between the phrase "digital health" and each document from that year, normalized by the total number of documents in that year. The plot can therefore be interpreted as showing that digital health is occupying more and more attention among life science business professionals. This is an exciting time for biomedical researchers with data-handling skills like myself! I believe that my unique combination of programming skills and life science industry exposure would compound well with the additional training from the Data Incubator, resulting in a quick offer from one of the Data Incubator's partner companies after completing the program.
Data points in the upper right quadrant include:
IBM Watson: Health AI
England's National Health Service: recently launched NHS Digital Academy, a health informatics training program
Launchpad Digital Health: Incubator/VC for digital health companies
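The yearly trend metric behind that plot can be sketched as follows. The dated documents here are toy stand-ins for the scraped corpus, and fitting a vectorizer inline is a simplification of the real pre-computed pipeline.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (year, text) pairs standing in for the dated Fiercebiotech articles
docs = [
    (2014, "pharma merger announced"),
    (2016, "digital health app approved"),
    (2016, "digital health funding grows"),
]
texts = [t for _, t in docs]

# include bigrams so the phrase "digital health" is matched as a unit too
vec = TfidfVectorizer(ngram_range=(1, 2))
doc_m = vec.fit_transform(texts)
sims = cosine_similarity(vec.transform(["digital health"]), doc_m).ravel()

# sum similarities per year, then normalize by that year's document count
totals, counts = defaultdict(float), defaultdict(int)
for (year, _), s in zip(docs, sims):
    totals[year] += s
    counts[year] += 1
trend = {y: totals[y] / counts[y] for y in totals}
print(trend)
```

Normalizing by the yearly document count matters: it separates "more articles about digital health" from "more articles overall".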
In [6]:
#plotBokehInJpnb(t,'digital health')
All keyword extraction, named entity recognition, and LDA is performed beforehand and the results are pickled or put into JSON. This vastly speeds up querying, since no NLP has to be done in real time.
Scrapy was used to scrape the entire fiercebiotech.com website, resulting in 99 MB of text data and 32,000 separate documents.
Interactive Bokeh plots were implemented.