There are lots of data sources from which we might want to extract information, such as initial public offering (IPO) prospectuses for various companies, e.g., Tesla's IPO prospectus. One can imagine trying to mine such documents in an effort to predict which IPOs will do poorly or well.
HTML contains both text and so-called markup like <b>
, which is used to specify formatting information.
We will use the well-known Beautiful Soup Python library to extract the text.
First, either do a "save as" or do what the cool kids do:
In [44]:
! curl https://www.sec.gov/Archives/edgar/data/1318605/000119312510017054/ds1.htm > /tmp/TeslaIPO.html
If you then do open /tmp/TeslaIPO.html
from the command line, it will pop up in your browser window. Also take a look at what HTML looks like in the wild:
In [45]:
! head -15 /tmp/TeslaIPO.html
In [46]:
import sys
from bs4 import BeautifulSoup
with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
soup = BeautifulSoup(html_text, 'html.parser')
text = soup.get_text()
print(text[0:300])
In [17]:
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text
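Beautiful Soup is the easiest route, but as a rough sketch, the standard library's html.parser module can also pull out just the text nodes, with no third-party install required (the TextExtractor class name here is my own invention, not a library API):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextExtractor()
parser.feed("<html><body><b>Hello</b> world</body></html>")
print(''.join(parser.chunks))  # Hello world
```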
Then, our main program looks like:
In [18]:
import sys
from bs4 import BeautifulSoup
def html2text(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    text = soup.get_text()
    return text

with open("/tmp/TeslaIPO.html", "r") as f:
    html_text = f.read()
text = html2text(html_text)
print(text[0:1000])
Copy that program into a Python file called ipo-text.py
and run it from the command line. You will notice that there is weird stuff in the output like: Registrant<U+0092>s
. That 92 is the character code, in hexadecimal, for the fancy single quote ’ in the Windows-1252 encoding (in Unicode proper, U+0092 is a control character). You will have to download the TeslaIPO.html file first.
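As an aside, rather than stripping those bytes, you could decode them as Windows-1252 (a common encoding for documents produced with Windows tools), which maps byte 0x92 to the fancy quote. A small sketch:

```python
# Byte 0x92 is the right single quote in the Windows-1252 (cp1252) encoding.
raw = b"Registrant\x92s"
print(raw.decode("cp1252"))  # Registrant’s
```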
In [19]:
text = [c for c in text if ord(c)<=127]  # keep only ASCII characters
text = ''.join(text)
print(text[:300])
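The list comprehension above works fine; an equivalent one-liner, if you just want to drop everything non-ASCII, is str.encode with errors='ignore' (the sample string here is mine, not from the Tesla file):

```python
# Encoding to ASCII with errors="ignore" silently drops non-ASCII characters.
text = "Registrant\u2019s fa\u00e7ade"
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)  # Registrants faade
```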
In [20]:
len(set(text.split()))
Out[20]:
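That expression computes the vocabulary size: split() breaks the text into words and set() removes duplicates. A tiny illustration:

```python
words = "the cat sat on the mat".split()
print(len(words), len(set(words)))  # 6 5  (6 words, 5 unique)
```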
In [27]:
from collections import defaultdict
counts = defaultdict(int)
for w in text.split():
    counts[w] += 1
list(counts.items())[:10]
Out[27]:
In [29]:
sorted(counts.items())[:5]
Out[29]:
In [38]:
def thecount(pair): return pair[1]
histo = sorted(counts.items(), key=thecount, reverse=True)
#histo = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for p in histo[0:10]:
    print(f"{p[1]} {p[0]}")
In [43]:
from collections import Counter
counts = Counter(text.split())
print(str(counts)[0:400])
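Counter also gives you the sorted histogram directly via most_common(), which replaces the manual sorted(..., key=...) dance; a quick sketch on toy data:

```python
from collections import Counter

# most_common(n) returns the n highest-count (word, count) pairs.
counts = Counter("the cat sat on the mat the end".split())
print(counts.most_common(1))  # [('the', 3)]
```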
If the file contains non-ASCII characters (code points above 127, some of which need more than one byte in UTF-8), we can convert the file using the command line. Here's a simple version of the problem that I put into file /tmp/foo.html
:
<html>
<body>
གྷ
</body>
</html>
I deliberately injected a Unicode code point > 255, which requires more than one byte to store in UTF-8; most of the characters require just one byte. Here is the first part of the file:
$ od -c -t xC /tmp/foo.html
0000000 < h t m l > \n < b o d y > \n གྷ **
3c 68 74 6d 6c 3e 0a 3c 62 6f 64 79 3e 0a e0 bd
...
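To see the variable-width encoding at work, Python can report how many bytes each code point takes in UTF-8 (these sample characters are my own, not taken from the file above):

```python
# UTF-8 uses 1 byte for ASCII, and 2 or more bytes for higher code points.
for ch in "A\u00e9\u2019":  # 'A', 'é', '’'
    print(hex(ord(ch)), len(ch.encode("utf-8")))
# 0x41 1
# 0xe9 2
# 0x2019 3
```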
Here is how you could strip any non-ASCII characters from the file before processing:
$ iconv -c -f utf-8 -t ascii /tmp/foo.html
<html>
<body>
</body>
</html>
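If you'd rather stay in Python than shell out, a rough analogue of `iconv -c -f utf-8 -t ascii` is to decode the raw bytes as ASCII and silently drop anything that doesn't fit (here using the Tibetan character U+0F43 as a stand-in for the multi-byte character):

```python
# Decode bytes as ASCII, dropping any byte outside the ASCII range,
# which mimics iconv's -c (discard unconvertible characters) flag.
raw = "<html>\n<body>\n\u0f43\n</body>\n</html>\n".encode("utf-8")
print(raw.decode("ascii", errors="ignore"))
```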