Example: http://seclists.org/fulldisclosure/2017/Jan/0
With each reply, we'll attempt to parse out the plain-text body, strip the list footer and any PGP blocks, separate out a personal signature if present, and count the HTML tags and linked sites it contains.
In [1]:
import re
import requests
from bs4 import BeautifulSoup
We'll gather the contents of a single message. 2017_Jan_0 is one that includes a personal signature, as well as the standard Full Disclosure footer.
2017_Jan_45 is a message that includes a PGP signature.
In [30]:
year = '2005'
month = 'Jan'
msg_id = '0'   # message id within the month (avoids shadowing the built-in id)
url = 'http://seclists.org/fulldisclosure/' + year + '/' + month + '/' + msg_id
r = requests.get(url)
content = r.text
from IPython.display import Pretty
Pretty(content)
Out[30]:
Each message in the FD list is wrapped in seclists.org code, including navigation, ads, and trackers, all irrelevant to us. The body of the reply is contained between two comments, <!--X-Body-of-Message--> and <!--X-Body-of-Message-End-->.
BeautifulSoup isn't great at handling comments, so we first use simple string indexing to extract the relevant characters. We'll then send the result through BeautifulSoup so we can use its .text property to strip out the HTML tags. BS4 (with the html5lib parser) automatically adds tags to build a valid HTML document, so remember to parse from the generated <body> tag.
What we end up with is a plaintext version of the message's body.
In [45]:
start = content.index('<!--X-Body-of-Message-->') + 24   # 24 = len('<!--X-Body-of-Message-->')
end = content.index('<!--X-Body-of-Message-End-->')
body = content[start:end]
soup = BeautifulSoup(body, 'html5lib')
bodyhtml = soup.find('body')
raw = bodyhtml.text
Pretty(raw)
Out[45]:
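As a quick illustration of why we parse from the generated <body> tag (the fragment string below is made up, not from a real message): html5lib promotes any fragment it is given to a complete document.
demo = BeautifulSoup('<p>hello</p>', 'html5lib')
print(demo)   # <html><head></head><body><p>hello</p></body></html>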
Messages to the FD list usually end with a common footer:
2002-2005:
_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.netsys.com/full-disclosure-charter.html
2005-2014:
_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.grok.org.uk/full-disclosure-charter.html
Hosted and sponsored by Secunia - http://secunia.com/
2014-onward:
_______________________________________________
Sent through the Full Disclosure mailing list
http://nmap.org/mailman/listinfo/fulldisclosure
Web Archives & RSS: http://seclists.org/fulldisclosure/
We'll look for the first line (47 underscores), then test the lines below to make sure it's a match. If so, we'll strip out that footer from our content.
In [60]:
workcopy = raw
footers = [m.start() for m in re.finditer('_{47}', workcopy)]
for f in reversed(footers):
    # the 2005-2014 and 2014-onward footers are four lines long
    possible = workcopy[f:f+190]
    lines = possible.splitlines()
    if (len(lines) == 4
            and lines[1][0:15] == 'Full-Disclosure'
            and lines[2][0:8] == 'Charter:'
            and lines[3][0:20] == 'Hosted and sponsored'):
        workcopy = workcopy[:f] + workcopy[f+213:]
        continue
    if (len(lines) == 4
            and lines[1][0:16] == 'Sent through the'
            and lines[2][0:17] == 'https://nmap.org/'
            and lines[3][0:14] == 'Web Archives &'):
        workcopy = workcopy[:f] + workcopy[f+211:]
        continue
    # the 2002-2005 footer is only three lines long
    possible = workcopy[f:f+146]
    lines = possible.splitlines()
    if (len(lines) == 3
            and lines[1][0:15] == 'Full-Disclosure'
            and lines[2][0:8] == 'Charter:'):
        workcopy = workcopy[:f] + workcopy[f+146:]
        continue
print(workcopy)
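Since we'll want to repeat this for other messages, the same logic could be wrapped in a helper. strip_footer below is a hypothetical convenience function (it isn't one of the original cells) that mirrors the code above, keeping the same fixed offsets:
def strip_footer(raw):
    # hypothetical helper: the footer-stripping logic from the cell above,
    # packaged so later cells can reuse it
    workcopy = raw
    footers = [m.start() for m in re.finditer('_{47}', workcopy)]
    for f in reversed(footers):
        lines = workcopy[f:f+190].splitlines()
        if (len(lines) == 4 and lines[1].startswith('Full-Disclosure')
                and lines[2].startswith('Charter:')
                and lines[3].startswith('Hosted and sponsored')):
            workcopy = workcopy[:f] + workcopy[f+213:]
            continue
        if (len(lines) == 4 and lines[1].startswith('Sent through the')
                and lines[2].startswith('https://nmap.org/')
                and lines[3].startswith('Web Archives &')):
            workcopy = workcopy[:f] + workcopy[f+211:]
            continue
        lines = workcopy[f:f+146].splitlines()
        if (len(lines) == 3 and lines[1].startswith('Full-Disclosure')
                and lines[2].startswith('Charter:')):
            workcopy = workcopy[:f] + workcopy[f+146:]
    return workcopy

print(strip_footer(raw) == workcopy)   # should print True if it mirrors the cell above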
As you'd expect, many messages include a PGP signature. That isn't useful to our processing, so we'll strip it out. First, we define get_raw_message from the code we've used previously. We then create strip_pgp, which looks for the PGP blocks. Simple text searches work here again, with the exception of the Hash: line, which can vary, so we match it with a regex.
http://seclists.org/fulldisclosure/2017/Oct/11 is a message that includes a PGP signature, so we'll use that to test.
In [13]:
def get_raw_message(url):
    r = requests.get(url)
    content = r.text
    start = content.index('<!--X-Body-of-Message-->') + 24
    end = content.index('<!--X-Body-of-Message-End-->')
    body = content[start:end]
    soup = BeautifulSoup(body, 'html5lib')
    bodyhtml = soup.find('body')
    return bodyhtml.text

#rawmsg = get_raw_message('http://seclists.org/fulldisclosure/2017/Oct/11')
rawmsg = get_raw_message('http://seclists.org/fulldisclosure/2005/Jan/719')

def strip_pgp(raw):
    try:
        # locate and remove the signature block itself
        pgp_sig_start = raw.index('-----BEGIN PGP SIGNATURE-----')
        pgp_sig_end = raw.index('-----END PGP SIGNATURE-----') + 27
        cleaned = raw[:pgp_sig_start] + raw[pgp_sig_end:]
        # if we find a public key block, then strip that out too
        # (search the already-cleaned text so the offsets line up)
        try:
            pgp_pk_start = cleaned.index('-----BEGIN PGP PUBLIC KEY BLOCK-----')
            pgp_pk_end = cleaned.index('-----END PGP PUBLIC KEY BLOCK-----') + 35
            cleaned = cleaned[:pgp_pk_start] + cleaned[pgp_pk_end:]
        except ValueError:
            pass
        # finally, try to remove the signed message header
        pgp_msg = cleaned.index('-----BEGIN PGP SIGNED MESSAGE-----')
        pgp_hash = re.search(r'Hash:(.)+\n', cleaned)
        if pgp_hash is not None:
            first_hash = pgp_hash.span(0)
            if first_hash[0] == pgp_msg + 35:
                # if we found a hash designation immediately after the header, strip that too
                cleaned = cleaned[:pgp_msg] + cleaned[first_hash[1]:]
            else:
                # just strip the header
                cleaned = cleaned[:pgp_msg] + cleaned[pgp_msg + 34:]
        else:
            cleaned = cleaned[:pgp_msg] + cleaned[pgp_msg + 34:]
        return cleaned
    except ValueError:
        # no PGP signature block found; return the message untouched
        return raw

unpgp = strip_pgp(rawmsg)
Pretty(unpgp)
#Pretty(strip_pgp(raw))
Out[13]:
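The Hash: line is the one piece that varies from message to message, which is why we match it with a regex rather than a fixed string. A quick sanity check (the digest names below are just examples):
for line in ['Hash: SHA1\n', 'Hash: SHA256\n', 'Hash: RIPEMD160\n']:
    print(re.search(r'Hash:(.)+\n', line).group(0).strip())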
In [28]:
import talon
from talon.signature.bruteforce import extract_signature
reply, signature = extract_signature(raw)
if signature is not None:
    Pretty(signature)
In [29]:
Pretty(reply)
Out[29]:
At least for 2017_Jan_0, this is pretty effective; for 2017_Jan_45 it was not successful at all. Now we'll try the machine-learning approach and compare.
In [8]:
talon.init()
from talon import signature   # note: this rebinds the name `signature` used in the previous cell
reply_ml, sig_ml = signature.extract(raw, sender="dawid@legalhackers.com")
print(sig_ml)
#reply_ml
This doesn't seem to output anything. It's unclear whether the library ships pre-trained; the documentation states it was trained on the authors' personal email and an Enron dataset. There is an open GitHub issue (https://github.com/mailgun/talon/issues/143) from July asking the same thing. We'll stick with the "brute force" method for now and continue to look for other libraries.
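Before moving on, the pieces so far could be chained into a single pass over a message. This is a hypothetical consolidation (clean_message is not one of the original cells, and it relies on the strip_footer helper sketched earlier):
def clean_message(url):
    # fetch the body, drop the list footer and PGP blocks,
    # then peel off any personal signature with talon's brute-force extractor
    text = get_raw_message(url)
    text = strip_footer(text)
    text = strip_pgp(text)
    reply_text, sig_text = extract_signature(text)
    return reply_text, sig_text

reply_text, sig_text = clean_message('http://seclists.org/fulldisclosure/2017/Jan/0')
Pretty(reply_text)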
We'll use a fairly simple regex to extract any tags from the reply.
<([^\s>]+)(\s|/>)+
The leading < matches the opening angle bracket; [^\s>]+ captures one or more characters that are neither whitespace nor >, i.e. the tag name; (\s|/>)+ then matches either a whitespace character or /> for self-closing tags. We then use a dictionary to count the instances of each unique tag.
In [9]:
rx = re.compile(r'<([^\s>]+)(\s|/>)+')
tags = {}
for tag in rx.findall(str(bodyhtml)):
    tagtype = tag[0]
    # skip anything that looks like a closing tag
    if not tagtype.startswith('/'):
        if tagtype in tags:
            tags[tagtype] = tags[tagtype] + 1
        else:
            tags[tagtype] = 1
print(tags)
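The same tally could be written more compactly with collections.Counter; this is just an equivalent alternative to the dictionary above:
from collections import Counter
tag_counts = Counter(tag[0] for tag in rx.findall(str(bodyhtml))
                     if not tag[0].startswith('/'))
print(dict(tag_counts))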
In [10]:
from urllib.parse import urlparse
sites = {}
atags = bodyhtml.find_all('a')
hrefs = [link.get('href') for link in atags]
for link in hrefs:
    if link is None:
        continue   # skip anchors without an href attribute
    parsedurl = urlparse(link)
    site = parsedurl.netloc
    if site in sites:
        sites[site] = sites[site] + 1
    else:
        sites[site] = 1
sites
Out[10]:
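One possible refinement (a sketch, not part of the original cell): mailto: links and relative hrefs parse to an empty netloc, so those could be keyed by scheme instead:
site_or_scheme = {}
for link in hrefs:
    if link is None:
        continue
    parsed = urlparse(link)
    key = parsed.netloc or parsed.scheme or '(relative)'
    site_or_scheme[key] = site_or_scheme.get(key, 0) + 1
site_or_scheme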