Extracting text from an HTML file

There are lots of data sources from which we might want to extract information, such as the prospectuses for initial public offerings (IPOs) of various companies; Tesla's IPO prospectus is one example. One can imagine trying to mine such documents in an effort to predict which IPOs will do poorly or well.

HTML contains both text and so-called markup, such as <b>, which specifies formatting information.
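To make the text/markup distinction concrete, here is a minimal sketch that separates the two using only Python's standard html.parser module. The class name TagTextSplitter and the sample snippet are made up for illustration; the rest of this document uses Beautiful Soup instead.

```python
from html.parser import HTMLParser

class TagTextSplitter(HTMLParser):
    """Hypothetical helper: record tag names and text separately."""
    def __init__(self):
        super().__init__()
        self.tags = []   # markup: names of the opening tags we saw
        self.text = []   # text: everything between the tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

splitter = TagTextSplitter()
splitter.feed("<p>Shares are <b>not</b> listed yet.</p>")  # made-up snippet
splitter.close()

print(splitter.tags)            # the markup
print("".join(splitter.text))   # the text with markup removed
```

The tag list holds the formatting instructions, and the joined text is what a reader would actually see on the page.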

We will use the well-known Beautiful Soup Python library to extract text.

First, either do a "save as" or do what the cool kids do:


In [44]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2306k    0 2306k    0     0  4186k      0 --:--:-- --:--:-- --:--:-- 4178k

If you then do open /tmp/TeslaIPO.html from the command line, it will pop up in your browser window. Also take a look at what HTML looks like in the wild:


In [45]:
! head -15 /tmp/TeslaIPO.html


<DOCUMENT>
<TYPE>S-1
<SEQUENCE>1
<FILENAME>ds1.htm
<DESCRIPTION>REGISTRATION STATEMENT ON FORM S-1
<TEXT>
<HTML><HEAD>
<TITLE>Registration Statement on Form S-1</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>

 <P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>As filed with the Securities and Exchange Commission on January 29, 2010 </B></FONT></P>
<P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="right"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Registration No.&nbsp;333-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</B></FONT></P>
<P STYLE="font-size:2px;margin-top:0px;margin-bottom:0px">&nbsp;</P> <P STYLE="line-height:0px;margin-top:0px;margin-bottom:0px;border-bottom:0.5pt solid #000000">&nbsp;</P> <P

Main script

Our main program opens the HTML file, reads its text, converts the HTML to plain text, and closes the file (the with statement closes it for us). A standalone script would accept the file name as a parameter from the command line; here the path is hardcoded. Our first attempt, after looking at the documentation, might be the following (file ipo-text.py):


In [46]:
import sys
from bs4 import BeautifulSoup

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
soup = BeautifulSoup(html_text, 'html.parser')
text = soup.get_text()
print(text[0:300])


S-1
1
ds1.htm
REGISTRATION STATEMENT ON FORM S-1


Registration Statement on Form S-1


Table of Contents
As filed with the Securities and Exchange Commission on January 29, 2010 
Registration No. 333-                
      UNITED STATES  SECURITIES AND EXCHANGE COMMISSION  Washington, D.C. 20549  

Tidy up

Let's improve our program by creating a function to extract text from HTML text:


In [17]:
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

Then, our main program looks like:


In [18]:
import sys
from bs4 import BeautifulSoup

def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
text = html2text(html_text)
print(text[0:1000])


S-1
1
ds1.htm
REGISTRATION STATEMENT ON FORM S-1


Registration Statement on Form S-1


Table of Contents
As filed with the Securities and Exchange Commission on January 29, 2010 
Registration No. 333-                
      UNITED STATES  SECURITIES AND EXCHANGE COMMISSION  Washington, D.C. 20549      FORM S-1 
 REGISTRATION STATEMENT  UNDER  THE SECURITIES ACT OF 1933      Tesla Motors, Inc.  (Exact name of Registrant as
specified in its charter)       








Delaware
 
3711
 
91-2197729

 (State or other jurisdiction of incorporation or organization)
 
 (Primary Standard Industrial Classification Code Number)
 
 (I.R.S. Employer Identification Number) 3500 Deer Creek Road
 Palo Alto, California 94304  (650) 413-4000  (Address, including zip code, and telephone number,
including area code, of Registrant’s principal executive offices)      Elon Musk 
 Chief Executive Officer  Tesla Motors, Inc.  3500 Deer Creek Road  Palo Alto, California 94304  (650) 413-4000  (Name, address, inclu

Exercise

Copy that program into a Python file called ipo-text.py and run it from the command line. You will have to download the TeslaIPO.html file first. You will notice that there is weird stuff in the output like: Registrant<U+0092>s. That 92 is the character code, in hexadecimal, for the fancy single quote (’) in the Windows-1252 encoding.
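One plausible explanation for that 92, sketched below, is that the quote was written as Windows-1252 byte 0x92 but interpreted as Latin-1 or as a raw Unicode code point. The byte string raw is a made-up stand-in, not taken from the filing.

```python
import unicodedata

raw = b"Registrant\x92s"             # hypothetical bytes containing 0x92

as_cp1252 = raw.decode("cp1252")     # interpret the byte as Windows-1252
as_latin1 = raw.decode("latin-1")    # interpret the same byte as Latin-1

quote = as_cp1252[10]                # character right after "Registrant"
print(unicodedata.name(quote))       # name of the Windows-1252 reading
print(hex(ord(as_latin1[10])))       # code point of the Latin-1 reading
```

Under Windows-1252 the byte decodes to the curly apostrophe, while under Latin-1 it becomes the invisible control character U+0092, which is what shows up as <U+0092>.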

Converting non-ASCII characters

We should clean up the text extracted from the HTML so that the non-ASCII characters are stripped or converted.


In [19]:
text = [c for c in text if ord(c)<=127]
text = ''.join(text)
print(text[:300])


S-1
1
ds1.htm
REGISTRATION STATEMENT ON FORM S-1


Registration Statement on Form S-1


Table of Contents
As filed with the Securities and Exchange Commission on January 29, 2010 
Registration No.333-
   UNITED STATES  SECURITIES AND EXCHANGE COMMISSION  Washington, D.C. 20549    FORM S-1 
 REGISTR
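Stripping also removes the non-breaking spaces, which is why "Registration No.333-" runs together above. As a hedged alternative, the sketch below converts a few common fancy characters to ASCII stand-ins before dropping whatever non-ASCII remains. The mapping table and the function name to_ascii are assumptions for illustration, not part of the original program.

```python
# Hypothetical mapping from common "fancy" characters to ASCII stand-ins.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u00a0": " ",   # non-breaking space
})

def to_ascii(text):
    """Convert known fancy characters, then drop any remaining non-ASCII ones."""
    converted = text.translate(ASCII_EQUIVALENTS)
    return converted.encode("ascii", "ignore").decode("ascii")

sample = "Registrant\u2019s\u00a0principal \u201cexecutive\u201d offices \u0f43"
print(to_ascii(sample))
```

Characters in the table are converted; anything else above code point 127 (such as the last character in sample) is still stripped.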

Exercise

Print out the number of unique words in the document (split on whitespace). For Tesla's IPO, I get 10602 unique words.


In [20]:
len(set(text.split()))


Out[20]:
10602
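Splitting on whitespace treats "We" and "we", or "stock" and "stock,", as different words. The sketch below shows one possible normalization (lowercasing and trimming surrounding punctuation); the function name unique_words and the sample sentence are made up for illustration.

```python
import string

def unique_words(text):
    """Count distinct words after lowercasing and trimming surrounding punctuation."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return len({w for w in words if w})   # skip tokens that were pure punctuation

sample = "We believe we can. The stock, the shares, and THE offering."
print(len(set(sample.split())))   # raw whitespace split
print(unique_words(sample))       # normalized count
```

On the sample sentence the raw split finds 11 distinct tokens, while the normalized count is 8.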

Exercise

Create a histogram using a dictionary that maps each word to its count. I use defaultdict(int) to define my histogram; very convenient. Sort and print out the list of tuples from items().


In [27]:
from collections import defaultdict
counts = defaultdict(int)
for w in text.split():
    counts[w] += 1
list(counts.items())[:10]


Out[27]:
[('S-1', 4),
 ('1', 17),
 ('ds1.htm', 1),
 ('REGISTRATION', 3),
 ('STATEMENT', 2),
 ('ON', 1),
 ('FORM', 2),
 ('Registration', 20),
 ('Statement', 6),
 ('on', 739)]

In [29]:
sorted(counts.items())[:5]


Out[29]:
[('$', 362), ('$,', 1), ('$.', 4), ('$0', 2), ('$0,', 3)]

In [38]:
def thecount(pair): return pair[1]
histo = sorted(counts.items(), key=thecount, reverse=True)
#histo = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for p in histo[0:10]:
    print(f"{p[1]} {p[0]}")


6455 the
5762 of
4265 and
3814 to
2502 our
2380 in
1689 a
1280 we
1264 for
1194 or

Exercise

Now, create the histogram the easy way using Counter. If you print that object, it will show you Counter({'the': 6455, 'of': 5762, 'and': 4265, ....


In [43]:
from collections import Counter
counts = Counter(text.split())
print(str(counts)[0:400])


Counter({'the': 6455, 'of': 5762, 'and': 4265, 'to': 3814, 'our': 2502, 'in': 2380, 'a': 1689, 'we': 1280, 'for': 1264, 'or': 1194, 'as': 965, 'that': 862, 'be': 838, 'on': 739, 'with': 734, 'are': 701, 'We': 676, 'have': 648, 'will': 642, 'by': 616, 'stock': 606, 'is': 588, 'an': 575, 'shares': 568, 'not': 536, 'may': 531, 'Tesla': 529, 'from': 524, 'which': 523, 'The': 520, 'electric': 436, 'thi
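Counter can also hand back the sorted top entries directly through its most_common method, so the manual sort by count is not needed. A small sketch on a made-up sentence:

```python
from collections import Counter

sample = "the cat and the hat and the bat"   # made-up text
counts = Counter(sample.split())

# most_common(n) returns the n (word, count) pairs with the highest counts
for word, n in counts.most_common(2):
    print(f"{n} {word}")
```

This prints "3 the" followed by "2 and".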

Stripping characters beyond 255 from the command line

If there are characters within the file that are non-ASCII and larger than 255, we can convert the file using the command line. Here's a simple version of the problem I put into file /tmp/foo.html:

<html>
<body>
གྷ
</body>
</html>

I deliberately injected a Unicode code point > 255, which takes more than one byte to store in UTF-8 (three bytes for this one, starting with e0 bd). All of the other characters in the file require just one byte. Here is the first part of the file:

$ od -c -t xC /tmp/foo.html
0000000    <   h   t   m   l   >  \n   <   b   o   d   y   >  \n   གྷ  **
           3c  68  74  6d  6c  3e  0a  3c  62  6f  64  79  3e  0a  e0  bd
...
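The byte counts can be checked from Python as well. This sketch assumes the injected character is U+0F43, which is consistent with the e0 bd bytes in the dump; the other sample characters are arbitrary.

```python
# ASCII letter, accented letter, curly quote, and the assumed injected character
samples = ["a", "\u00e9", "\u2019", "\u0f43"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Code points up to 127 take one byte in UTF-8, those up to U+07FF take two, and the rest of the Basic Multilingual Plane takes three.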

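Here is how you could strip all non-ASCII characters (anything that takes more than one byte in UTF-8) from the file before processing; the -c option tells iconv to drop characters it cannot convert: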

$ iconv -c -f utf-8 -t ascii /tmp/foo.html 
<html>
<body>

</body>
</html>
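Roughly the same effect can be had from Python. The sketch below is an approximation of the iconv command, not an exact equivalent, and the function name strip_non_ascii is made up; the sample bytes assume the injected character is U+0F43.

```python
def strip_non_ascii(data: bytes) -> bytes:
    """Decode UTF-8 bytes and drop every character that is not ASCII."""
    text = data.decode("utf-8", errors="ignore")    # skip undecodable bytes
    return text.encode("ascii", errors="ignore")    # drop non-ASCII characters

# Stand-in for the contents of /tmp/foo.html
html = "<html>\n<body>\n\u0f43\n</body>\n</html>\n".encode("utf-8")
print(strip_non_ascii(html).decode("ascii"))
```

As with iconv, the line that held the non-ASCII character is left empty rather than removed.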