Exercise 01: extract text from HTML

The course repo has a subdirectory called html which includes some example HTML files.


In [1]:
%sx ls html/


Out[1]:
['article1.html', 'article2.html', 'article3.html', 'article4.html']

Select one of those files to use as an example, and take a look at its HTML content.


In [2]:
file = "html/article1.html"
print(open(file, "r").readlines())


['<!DOCTYPE html>\n', '<html lang="en">\n', '<head>\n', " <title>The current state of machine intelligence 3.0 - O'Reilly Media</title>\n", '</head>\n', '<body>\n', '<div id="article-body">\n', '<p>Almost a year ago, we published our now-annual <a href="https://www.oreilly.com/ideas/the-current-state-of-machine-intelligence-2-0">landscape</a> of machine intelligence companies, and goodness have we seen a lot of activity since then. This year\'s landscape has <em>a third more companies</em> than our first one did two years ago, and it feels even more futile to try to be comprehensive, since this just scratches the surface of all of the activity out there.</p>\n', '\n', '<p>As has been the case for the last couple of years, our fund still obsesses over "problem first" machine intelligence -- we\'ve invested in 35 machine intelligence companies solving 35 meaningful problems in areas from security to recruiting to software development. (Our fund focuses on the future of work, so there are some machine intelligence domains where we invest more than others.)</p>\n', '\n', '<p>At the same time, the hype around machine intelligence methods continues to grow: the words "deep learning" now equally represent a series of meaningful breakthroughs (wonderful) but also a hyped phrase like "big data" (not so good!). We care about whether a founder uses the right method to solve a problem, not the fanciest one. We favor those who apply technology thoughtfully.</p>\n', '\n', '<p>What\'s the biggest change in the last year? We are getting inbound inquiries from a different mix of people. For <a href="https://medium.com/@shivon/the-current-state-of-machine-intelligence-f76c20db2fe1#.ei67436b7">v1.0</a>, we heard almost exclusively from founders and academics. Then came a healthy mix of investors, both private and public. Now overwhelmingly we have heard from existing companies trying to figure out how to transform their businesses using machine intelligence.</p>\n', '</div>\n', '\n', '<div class="text-group">\n', ' <p class="dek">Shivon Zilis is a partner and founding member of Bloomberg Beta, an early-stage VC firm that invests in startups making work better, with a focus on machine intelligence. She\'s particularly fascinated by intelligence tools and industry applications. Like any good Canadian, she spends her spare time playing hockey and snowboarding. She holds a degree in economics and philosophy from Yale.</p>\n', '    \n', ' <p class="dek">James Cham is a Partner at Bloomberg Beta based in Palo Alto. He invests in data-centric and machine learning-related companies. He was a principal at Trinity Ventures and Vice President at Bessemer Venture Partners, where he worked with investments like Dropcam, Twilio, and LifeLock. He\'s a former software developer and management consultant. He has an M.B.A. from the Massachusetts Institute of Technology and was an undergraduate in computer science at Harvard College.</p>\n', '\n', ' <p class="copyright">&copy; 2016 O\'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.</p><br>\n', '</div>\n', '\n', '</body>\n', '</html>\n']

Next, use Beautiful Soup to extract text out of the HTML. Following the DOM structure of the HTML document, select the <div/> that encloses the article text, then iterate through the <p/> paragraphs to extract the text from each.


In [4]:
from bs4 import BeautifulSoup

with open(file, "r") as f:
  soup = BeautifulSoup(f, "html.parser")

  for div in soup.find_all("div", id="article-body"):
    for p in div.find_all("p"):
      print(p.get_text(), "\n")


Almost a year ago, we published our now-annual landscape of machine intelligence companies, and goodness have we seen a lot of activity since then. This year's landscape has a third more companies than our first one did two years ago, and it feels even more futile to try to be comprehensive, since this just scratches the surface of all of the activity out there. 

As has been the case for the last couple of years, our fund still obsesses over "problem first" machine intelligence -- we've invested in 35 machine intelligence companies solving 35 meaningful problems in areas from security to recruiting to software development. (Our fund focuses on the future of work, so there are some machine intelligence domains where we invest more than others.) 

At the same time, the hype around machine intelligence methods continues to grow: the words "deep learning" now equally represent a series of meaningful breakthroughs (wonderful) but also a hyped phrase like "big data" (not so good!). We care about whether a founder uses the right method to solve a problem, not the fanciest one. We favor those who apply technology thoughtfully. 

What's the biggest change in the last year? We are getting inbound inquiries from a different mix of people. For v1.0, we heard almost exclusively from founders and academics. Then came a healthy mix of investors, both private and public. Now overwhelmingly we have heard from existing companies trying to figure out how to transform their businesses using machine intelligence.