As a biomedical engineer and healthcare strategy consultant, I often need to quickly identify important companies and individuals relevant to a client's business. For example, I may need to understand which companies are aggregating electronic medical record data and how they are monetizing it. Or maybe an immunotherapy client wants a list of companies with PD-1 inhibitors in their pipeline (by tomorrow morning, of course!).
Identifying prominent entities in a given area could take hours of combing through biotech business literature, but what if the search could be completed in seconds? The need to quickly identify important companies, people, and drugs within the biotech business literature inspired me to create the app Who's Who in BioTech (WWBT). WWBT uses natural language processing, machine learning, and information retrieval to quickly identify important entities related to the user's query.
Simply enter a search term (e.g. electronic medical records, PD1, antibiotics, etc.) and you'll see a scatterplot where each dot represents a named entity related to your query. Mouse over a dot to see the named entity and the context in which it was mentioned. Click on a dot to visit the original article where the entity was mentioned.
The corpus consists of 38k articles published between 2008 and 2017, scraped from FierceBiotech.com.
WWBT uses a tf-idf representation of the corpus to identify the articles most related to the user's query. For each article, named entity recognition and TextRank are used to identify the most important people, companies, and drugs within the article. The relevance score (y-axis of the WWBT scatterplot) is the tf-idf score of the article multiplied by the TextRank score of the named entity within that article. Thus WWBT shows the user named entities with a high TextRank score in the documents most related to the query. WWBT returns the 200 named entities with the highest relevance scores.
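The scoring step described above can be sketched in a few lines of Python. This is only an illustration, not the actual WWBT code: the function name rank_entities, the EntityHit record, and the toy scores and entity names are all hypothetical, and it assumes the per-article tf-idf query scores and per-entity TextRank scores have already been computed (e.g. with sklearn and summa). It also assumes each (entity, article) pair is scored separately, which matches each dot linking to a single article.

```python
import heapq
from typing import Dict, List, NamedTuple


class EntityHit(NamedTuple):
    entity: str
    article_id: str
    relevance: float


def rank_entities(
    article_scores: Dict[str, float],
    entity_scores: Dict[str, Dict[str, float]],
    top_n: int = 200,
) -> List[EntityHit]:
    """Return the top_n (entity, article) pairs by relevance.

    article_scores: article id -> tf-idf score of the article for the query
    entity_scores:  article id -> {entity name -> TextRank score in that article}
    """
    hits = []
    for article_id, query_score in article_scores.items():
        for entity, textrank in entity_scores.get(article_id, {}).items():
            # Relevance = article's tf-idf score * entity's TextRank score
            hits.append(EntityHit(entity, article_id, query_score * textrank))
    return heapq.nlargest(top_n, hits, key=lambda hit: hit.relevance)


# Toy, made-up scores for two articles (not real WWBT data)
article_scores = {"a1": 0.8, "a2": 0.5}
entity_scores = {
    "a1": {"Company A": 0.5, "Drug X": 0.25},
    "a2": {"Company B": 0.9},
}
top = rank_entities(article_scores, entity_scores, top_n=2)
for hit in top:
    print(hit.entity, hit.article_id, round(hit.relevance, 3))
```

In this toy example, an entity that is central to a moderately relevant article (Company B) can outrank an entity in a more relevant article, which is the intended effect of multiplying the two scores.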
Visualization: Bokeh
Publishing: Flask, Git, GitHub, Heroku, Django, Jinja2
Natural language processing: nltk, summa
Information retrieval: sklearn (for tf-idf), pandas
Web Scraping: Scrapy
Thanks for visiting!
Ryan M Davis, PhD