In [ ]:
# Display as slides with the Jupyter notebook RISE extension
# https://github.com/damianavila/RISE
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'theme': 'sans-serif',
              'transition': 'default',
              'start_slideshow_at': 'selected',
})

Intro to Web Scraping

 

Matt Bauman

July 6, 2016

What is HTML?

  • Human and machine-readable text
  • Supposed to be the semantic structure of a document
  • Horribly abused
  • Often terribly malformed
  • Frequently unreadable by humans and just barely readable by machines
  • It's a miracle (really, a ton of effort) that browsers work at all

Okay, but what is it?

  • Plain-text markup that wraps content in tags
  • Tags are marked in brackets like <body>
  • And everything that follows is considered part of body until it's closed with a </body>.
  • Tags can be nested
  • Can be closed immediately without enclosing any content <div />.
  • Can have attributes to modify their behavior or name them
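
To make the nesting and attributes concrete, here is a minimal sketch that walks a small snippet with the standard library's html.parser and records each opening tag, its depth, and its attributes. The snippet and the TagOutline class are made up for this example; this is not the library used later in the talk.

```python
from html.parser import HTMLParser

class TagOutline(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <div /> open and close in one step
        self.outline.append((self.depth, tag, dict(attrs)))

# Hypothetical snippet, written for this example
snippet = '<body><div id="main" class="story"><p>Hello</p><div /></div></body>'
parser = TagOutline()
parser.feed(snippet)
parser.close()
for depth, tag, attrs in parser.outline:
    print('  ' * depth + tag, attrs)
```

The indentation of the printed outline mirrors how deeply each tag is nested.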

In [ ]:
from IPython.display import display, HTML
display(HTML('<p style="color:red;">Hello, world</p>'))

In [ ]:
import requests
#print(requests.get('http://www.nytimes.com/').text)

Important tags for scraping

  • div - major sections
  • table - broken down into tr (rows) and td (data cells)
  • form - contains input tags that get submitted
  • ul/ol - lists (unordered and ordered), contains li (list items)
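
As a rough sketch of how the table tags map onto data, the following pulls the td cells out of each tr with BeautifulSoup. The table is a made-up example, and 'html.parser' is just one of the parsers BeautifulSoup can use.

```python
from bs4 import BeautifulSoup

# Hypothetical table, written for this example
html = """
<table>
  <tr><th>Date</th><th>Start</th></tr>
  <tr><td>14 May</td><td>02:19:24</td></tr>
  <tr><td>14 May</td><td>03:55:48</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row uses th cells, so it yields no td cells
        rows.append(cells)
print(rows)
```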

Important attributes for scraping

  • id and class
  • They name tags; web developers use these names for styling and interactivity
  • ids are unique; classes are groups
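
A small sketch of looking tags up by these names with BeautifulSoup. The markup, the id, and the class names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written for this example
html = """
<div id="top-stories">
  <h2 class="story-heading">First headline</h2>
  <h2 class="story-heading featured">Second headline</h2>
</div>
<p class="byline">Not a headline</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# An id should match at most one tag, so find() is enough
section = soup.find(id='top-stories')

# A class can match many tags; class_ avoids clashing with the Python keyword
headlines = [h.get_text() for h in soup.find_all(class_='story-heading')]
print(section.name, headlines)
```

Note that a tag can carry several classes at once; searching for one of them still matches.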

Why web scraping is terrible

Invalid pages and incompatibilities

  • W3C (World Wide Web Consortium) sets standards for HTML, CSS, XML, etc.
  • They have a validator to ensure that pages meet their specs

HTML can be extremely hard to read

  • Fortunately, web inspector tools can make your life easier
  • Check out The NY Times in the browser

Some sites require JavaScript to work

  • There aren't any libraries (that I'm aware of) that implement JavaScript
  • Try turning off JavaScript in your browser and make sure the site still works
  • You can often emulate the JavaScript code to make the same requests... but it's a pain

It's fragile

  • While the markup is machine readable, that just specifies page layout
  • The same content can be coded in HTML in an infinite number of ways and still look identical
  • Web authors can change their code at any point...

Working around the terrible-ness

  • Don't worry about parsing yourself -- no regexes or string searches!
  • Don't worry about traversing individual nested levels (e.g., inside two divs and ...)

Instead...

  • Think of each webpage as a "tag soup"
  • Try to find a way to describe the tags you're looking for in a minimal way
  • And use a good library

Scraping in five lines:


In [4]:
# Look for headlines in the NYTimes
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.nytimes.com/')
soup = BeautifulSoup(r.text, 'html.parser')
tags = soup.find_all(attrs={'class': 'story-heading'})

In [5]:
for tag in tags: display(HTML(str(tag)))

Hedging your bets

  • There are lots of ways to specify a search through the tag soup
  • Some methods may be more robust than others...
  • But it's not worth spending too much time trying to out-wit whatever might be updating the site on the other side
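
To illustrate, here is a sketch of three different searches that land on the same tags in a made-up page: by class, by CSS selector, and by document structure. Which one survives a redesign depends on what the site changes, so none of them is "the" right answer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, written for this example
html = """
<article><h2 class="story-heading"><a href="/a">Alpha</a></h2></article>
<article><h3 class="story-heading">Beta</h3></article>
<aside><h2 class="promo">Subscribe</h2></aside>
"""
soup = BeautifulSoup(html, 'html.parser')

# 1. By class name
by_class = soup.find_all(class_='story-heading')
# 2. By CSS selector: anything with that class inside an article
by_css = soup.select('article .story-heading')
# 3. By structure: any heading tag (h1, h2, ...) inside an article
by_structure = [h for art in soup.find_all('article')
                for h in art.find_all(re.compile(r'^h\d$'))]

results = [[tag.get_text() for tag in found]
           for found in (by_class, by_css, by_structure)]
print(results)
```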

In [6]:
# Another way to get the headlines
import re
articles = soup.find_all('article')
[article.find_all(re.compile(r'^h\d')) for article in articles]


Out[6]:
[[<h2 class="css-km70tz esl82me0">Listen to ‘The Daily’</h2>],
 [<h2 class="css-km70tz esl82me0">The Daily Mini Crossword</h2>],
 [<h2 class="css-km70tz esl82me0">Got a confidential news tip?</h2>],
 [<h2 class="css-1qwxefa esl82me0"><span>In China, Some Fear the End of ‘Chimerica’</span></h2>],
 [<h2 class="css-n2blzn esl82me0">Why the U.S.-China Trade War Could Be Long and Painful</h2>],
 [<h2 class="css-1qwxefa esl82me0"><span>White House Reviews Military Plans Against Iran, in Echoes of Iraq War</span></h2>],
 [<h2 class="css-1qwxefa esl82me0"><span>How House Democrats in Key Districts Plan to Keep Their Seats</span></h2>],
 [<h2 class="css-n2blzn esl82me0">Elizabeth Warren refused to participate in a town hall on Fox News, which she called a “a hate-for-profit racket.”</h2>],
 [<h2 class="css-n2blzn esl82me0">Beto O’Rourke tried to reset his flagging campaign, saying on “The View” that he regretted his “born to be in it” comment.</h2>],
 [<h2 class="css-1qwxefa esl82me0"><span>Orban’s ‘Double Game’ on Anti-Semitism</span></h2>],
 [<h2 class="css-1qwxefa esl82me0"><span>Tim Conway, Beloved TV Bumbler, Is Dead at 85</span></h2>],
 [<h2 class="css-14bttnj esl82me0"><span>This Gen X Mess</span></h2>],
 [<h2 class="css-o2lisy esl82me0">Don’t Visit Your Doctor in the Afternoon</h2>],
 [<h2 class="css-1m5bs2v esl82me0">Supreme Court Liberals Raise Alarm Bells About Roe v. Wade</h2>],
 [<h2 class="css-1m5bs2v esl82me0">This Could Be Your Legacy, Governor</h2>],
 [<h2 class="css-1m5bs2v esl82me0">The Rise of the Haphazard Self</h2>],
 [<h2 class="css-1m5bs2v esl82me0">How a Pristine Ecosystem Was Wrecked Almost Overnight</h2>],
 [<h2 class="css-1m5bs2v esl82me0">In Baltimore, Police Officers Are the Bad Guys With Guns</h2>],
 [<h2 class="css-1m5bs2v esl82me0">Twitter Isn’t Real Life (if You’re a Democrat)</h2>],
 [<h2 class="css-1m5bs2v esl82me0">‘You Can’t Escape the World of Facebook’</h2>],
 [<h2 class="css-1m5bs2v esl82me0">It’s Time for the Leaders of Saudi Arabia and Iran to Talk</h2>],
 [<h2 class="css-1m5bs2v esl82me0">A de Gaulle of Our Own</h2>],
 [<h2 class="css-1m5bs2v esl82me0">A Million Americans Need This Drug. Trump’s Deal Won’t Help Enough of Them.</h2>],
 [<h2 class="css-14bttnj esl82me0"><span>The Fusion Reactor Next Door</span></h2>],
 [<h2 class="css-n2blzn esl82me0">Megan Mullally and Stephanie Hunt Are Giving Cabaret a Jolt</h2>],
 [<h2 class="css-n2blzn esl82me0">A New Book Argues That Generic Drugs Are Poisoning Us</h2>]]

Advanced topics: HTTP

  • HTTP specifies how you ask for and retrieve content
  • Also specifies metadata in headers that control caching, redirects, sessions, and more

In [7]:
r = requests.get('http://google.com/')
r.headers


Out[7]:
{'Date': 'Tue, 14 May 2019 19:14:48 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '4900', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2019-05-14-19; expires=Thu, 13-Jun-2019 19:14:48 GMT; path=/; domain=.google.com, NID=183=FUUJhrZssgRPnIV2AIR2bX1hnftGj3H4O_97-UaZLCwakWFN0geeMv8dUCz0adnmV-V1_Lg058gyxApxOhTe9RSBs7S3L2K2FpVGy4p3kPYndj8CU-GYoEHwPHF1SZZPRrfJBcL1GMbd0H4J-ChraOM_8ha4mkaUeLvxzgLKcbg; expires=Wed, 13-Nov-2019 19:14:48 GMT; path=/; domain=.google.com; HttpOnly'}
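
Header values are plain strings with a little structure inside them. As a sketch, the standard library's email.message.Message (the class that http.client builds its own header objects on) can pick apart two of the values from the output above. Nothing here touches the network.

```python
from email.message import Message

# Header values copied from the Google response shown above
msg = Message()
msg['Content-Type'] = 'text/html; charset=ISO-8859-1'
msg['Cache-Control'] = 'private, max-age=0'

# Header names are case-insensitive
content_type = msg['content-type']
media_type = msg.get_content_type()
charset = msg.get_content_charset()

# Cache-Control is a comma-separated list of directives
directives = [d.strip() for d in msg['cache-control'].split(',')]
print(media_type, charset, directives)
```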

Searches and forms

  • Typically, the most interesting things to scrape are hidden behind searches and forms
  • How do you enter text into Google's search box via Python?

In [9]:
soup = BeautifulSoup(requests.get('http://google.com').text, 'html.parser')
print(soup.find('form').prettify())


<form action="/search" name="f">
 <table cellpadding="0" cellspacing="0">
  <tr valign="top">
   <td width="25%">
   </td>
   <td align="center" nowrap="">
    <input name="ie" type="hidden" value="ISO-8859-1"/>
    <input name="hl" type="hidden" value="en"/>
    <input name="source" type="hidden" value="hp"/>
    <input name="biw" type="hidden"/>
    <input name="bih" type="hidden"/>
    <div class="ds" style="height:32px;margin:4px 0">
     <input autocomplete="off" class="lst" maxlength="2048" name="q" size="57" style="color:#000;margin:0;padding:5px 8px 0 6px;vertical-align:top" title="Google Search" value=""/>
    </div>
    <br style="line-height:0"/>
    <span class="ds">
     <span class="lsbb">
      <input class="lsb" name="btnG" type="submit" value="Google Search"/>
     </span>
    </span>
    <span class="ds">
     <span class="lsbb">
      <input class="lsb" name="btnI" onclick="if(this.form.q.value)this.checked=1; else top.location='/doodles/'" type="submit" value="I'm Feeling Lucky"/>
     </span>
    </span>
   </td>
   <td align="left" class="fl sblc" nowrap="" width="25%">
    <a href="/advanced_search?hl=en&amp;authuser=0">
     Advanced search
    </a>
    <a href="/language_tools?hl=en&amp;authuser=0">
     Language tools
    </a>
   </td>
  </tr>
 </table>
 <input id="gbv" name="gbv" type="hidden" value="1"/>
 <script nonce="fOv169/vkrRXlxafv1ykyA==">
  (function(){var a,b="1";if(document&&document.getElementById)if("undefined"!=typeof XMLHttpRequest)b="2";else if("undefined"!=typeof ActiveXObject){var c,d,e=["MSXML2.XMLHTTP.6.0","MSXML2.XMLHTTP.3.0","MSXML2.XMLHTTP","Microsoft.XMLHTTP"];for(c=0;d=e[c++];)try{new ActiveXObject(d),b="2"}catch(h){}}a=b;if("2"==a&&-1==location.search.indexOf("&gbv=2")){var f=google.gbvu,g=document.getElementById("gbv");g&&(g.value=a);f&&window.setTimeout(function(){location.href=f},0)};}).call(this);
 </script>
</form>


In [10]:
r = requests.get('http://google.com/search', 
                 params={'q':  'how long does a walrus live?',
                         'btnI': "I'm Feeling Lucky"})

Types of requests

  • requests.get is actually doing a GET
    • It encodes the parameters (if any) directly into the url: ?param=value&param2=value2...
    • This means that it gets saved into your browser history
    • Back buttons, refresh may send the same parameters again
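
A quick sketch of what that encoding looks like, using the standard library's urllib.parse on the same parameters as the walrus search. This only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {'q': 'how long does a walrus live?', 'btnI': "I'm Feeling Lucky"}

# Spaces, question marks, and quotes are escaped so they fit in a URL
url = 'http://google.com/search?' + urlencode(params)
print(url)

# Decoding the query string recovers the original parameters
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```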

Other HTTP verbs:

  • POST is the other most common method
    • Just like GET, except that it sends its parameters in the request body instead of the URL
    • Often used for purchases, posts, etc, that you don't want to submit twice
  • There are others (PUT, DELETE, HEAD, ...), but they're rarer
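
To see the difference between GET and POST without sending anything, this sketch builds both kinds of request with the standard library's urllib.request. The example.com URL and the field names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

# example.com and the field names are placeholders for illustration
fields = {'item': 'walrus plush', 'quantity': '1'}
encoded = urlencode(fields)

# GET: the parameters ride along in the URL
get_req = Request('http://example.com/order?' + encoded)
# POST: the URL stays clean and the parameters go in the request body
post_req = Request('http://example.com/order', data=encoded.encode('ascii'))

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With requests, the rough equivalents are requests.get(url, params=fields) and requests.post(url, data=fields).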

A slightly more complicated example


In [11]:
# Scrape the times that the ISS is visible
r = requests.get('http://heavens-above.com/PassSummary.aspx?satid=25544&lat=41.8781&lng=-87.6298&loc=Chicago&alt=181&tz=CST')
def scrape_times(text):
    soup = BeautifulSoup(text, 'html.parser')
    rows = soup.find_all('tr', attrs={'class':'clickableRow'})
    times = []
    for row in rows:
        cols = row.find_all('td')
        times.append(cols[0].text + ' ' + cols[2].text)
    return times
scrape_times(r.text)


Out[11]:
['14 May 02:19:24',
 '14 May 03:55:48',
 '15 May 01:29:57',
 '15 May 03:04:56',
 '15 May 04:41:09',
 '16 May 00:40:20',
 '16 May 02:13:58',
 '16 May 03:50:22',
 '17 May 01:23:02',
 '17 May 02:59:33',
 '17 May 04:35:50',
 '17 May 21:16:59',
 '17 May 22:53:16',
 '18 May 00:31:29',
 '18 May 02:08:43',
 '18 May 03:44:55',
 '18 May 22:02:06',
 '18 May 23:40:02',
 '19 May 01:17:47',
 '19 May 02:54:03',
 '19 May 04:30:48',
 '19 May 21:11:07',
 '19 May 22:48:33',
 '20 May 00:26:41',
 '20 May 02:03:13',
 '20 May 03:39:36',
 '20 May 21:57:08',
 '20 May 23:35:24',
 '21 May 01:12:21',
 '21 May 21:05:50',
 '21 May 22:43:57',
 '22 May 00:21:25',
 '22 May 21:52:26',
 '22 May 23:30:23',
 '23 May 01:06:45',
 '23 May 21:00:56',
 '23 May 22:39:09']

In [18]:
# Get the next page
from urllib.parse import urljoin
r = requests.get('http://heavens-above.com/PassSummary.aspx?satid=25544&lat=41.8781&lng=-87.6298&loc=Chicago&alt=181&tz=CST')
def get_next_page(r):
    soup = BeautifulSoup(r.text, 'html.parser')
    # Collect every named form field; inputs without a value submit as ''
    d = {tag.attrs['name']: tag.attrs.get('value', '')
         for tag in soup.find_all('input') if 'name' in tag.attrs}
    # Leave out the "previous" button so the request acts like a click on "next"
    d.pop('ctl00$cph1$btnPrev', None)
    d['ctl00_cph1_radioAll'] = 'radioVisible'
    # The form's action may be relative, so resolve it against the page URL
    url = urljoin(r.url, soup.find('form').attrs['action'])
    return requests.post(url, data=d)
scrape_times(get_next_page(r).text)


Out[18]:
['24 May 00:15:52',
 '24 May 01:52:08',
 '24 May 18:33:13',
 '24 May 20:09:31',
 '24 May 21:47:44',
 '24 May 23:24:56',
 '25 May 01:01:07',
 '25 May 02:38:38',
 '25 May 17:43:09',
 '25 May 19:18:15',
 '25 May 20:56:11',
 '25 May 22:33:55',
 '26 May 00:10:11',
 '26 May 01:46:56',
 '26 May 18:27:11',
 '26 May 20:04:38',
 '26 May 21:42:44',
 '26 May 23:19:16',
 '27 May 00:55:38',
 '27 May 17:36:24',
 '27 May 19:13:07',
 '27 May 20:51:23',
 '27 May 22:28:20',
 '28 May 00:04:32',
 '28 May 01:42:53',
 '28 May 16:45:58',
 '28 May 18:21:44',
 '28 May 19:59:51',
 '28 May 21:37:19',
 '28 May 23:13:31',
 '29 May 00:50:34',
 '29 May 15:56:24',
 '29 May 17:30:32',
 '29 May 19:08:15',
 '29 May 20:46:11',
 '29 May 22:22:33',
 '29 May 23:59:05',
 '30 May 16:39:33',
 '30 May 18:16:39',
 '30 May 19:54:52',
 '30 May 21:31:36',
 '30 May 23:07:52',
 '31 May 15:48:53',
 '31 May 17:25:10',
 '31 May 19:03:22',
 '31 May 20:40:35',
 '31 May 22:16:46',
 '31 May 23:54:15',
 '01 Jun 14:58:43',
 '01 Jun 16:33:49',
 '01 Jun 18:11:45',
 '01 Jun 19:49:29',
 '01 Jun 21:25:44',
 '01 Jun 23:02:28',
 '02 Jun 15:42:40',
 '02 Jun 17:20:06',
 '02 Jun 18:58:13',
 '02 Jun 20:34:44',
 '02 Jun 22:11:06']

In [21]:
# Get the next 10 pages!
from tqdm import tqdm
r = requests.get('http://heavens-above.com/PassSummary.aspx?satid=25544&lat=41.8781&lng=-87.6298&loc=Chicago&alt=181&tz=CST')

times = []
for i in tqdm(range(10)):
    times.extend(scrape_times(r.text))
    r = get_next_page(r)
times


100%|██████████| 10/10 [00:12<00:00,  1.22s/it]
Out[21]:
['14 May 02:19:24',
 '14 May 03:55:48',
 '15 May 01:29:57',
 '15 May 03:04:56',
 '15 May 04:41:09',
 '16 May 00:40:20',
 '16 May 02:13:58',
 '16 May 03:50:22',
 '17 May 01:23:02',
 '17 May 02:59:33',
 '17 May 04:35:50',
 '17 May 21:16:59',
 '17 May 22:53:16',
 '18 May 00:31:29',
 '18 May 02:08:43',
 '18 May 03:44:55',
 '18 May 22:02:06',
 '18 May 23:40:02',
 '19 May 01:17:47',
 '19 May 02:54:03',
 '19 May 04:30:48',
 '19 May 21:11:07',
 '19 May 22:48:33',
 '20 May 00:26:41',
 '20 May 02:03:13',
 '20 May 03:39:36',
 '20 May 21:57:08',
 '20 May 23:35:24',
 '21 May 01:12:21',
 '21 May 21:05:50',
 '21 May 22:43:57',
 '22 May 00:21:25',
 '22 May 21:52:26',
 '22 May 23:30:23',
 '23 May 01:06:45',
 '23 May 21:00:56',
 '23 May 22:39:09',
 '24 May 00:15:52',
 '24 May 01:52:08',
 '24 May 18:33:13',
 '24 May 20:09:31',
 '24 May 21:47:44',
 '24 May 23:24:56',
 '25 May 01:01:07',
 '25 May 02:38:38',
 '25 May 17:43:09',
 '25 May 19:18:15',
 '25 May 20:56:11',
 '25 May 22:33:55',
 '26 May 00:10:11',
 '26 May 01:46:56',
 '26 May 18:27:11',
 '26 May 20:04:38',
 '26 May 21:42:44',
 '26 May 23:19:16',
 '27 May 00:55:38',
 '27 May 17:36:24',
 '27 May 19:13:07',
 '27 May 20:51:23',
 '27 May 22:28:20',
 '28 May 00:04:32',
 '28 May 01:42:53',
 '28 May 16:45:58',
 '28 May 18:21:44',
 '28 May 19:59:51',
 '28 May 21:37:19',
 '28 May 23:13:31',
 '29 May 00:50:34',
 '29 May 15:56:24',
 '29 May 17:30:32',
 '29 May 19:08:15',
 '29 May 20:46:11',
 '29 May 22:22:33',
 '29 May 23:59:05',
 '30 May 16:39:33',
 '30 May 18:16:39',
 '30 May 19:54:52',
 '30 May 21:31:36',
 '30 May 23:07:52',
 '31 May 15:48:53',
 '31 May 17:25:10',
 '31 May 19:03:22',
 '31 May 20:40:35',
 '31 May 22:16:46',
 '31 May 23:54:15',
 '01 Jun 14:58:43',
 '01 Jun 16:33:49',
 '01 Jun 18:11:45',
 '01 Jun 19:49:29',
 '01 Jun 21:25:44',
 '01 Jun 23:02:28',
 '02 Jun 15:42:40',
 '02 Jun 17:20:06',
 '02 Jun 18:58:13',
 '02 Jun 20:34:44',
 '02 Jun 22:11:06',
 '03 Jun 14:51:48',
 '03 Jun 16:28:31',
 '03 Jun 18:06:45',
 '03 Jun 19:43:42',
 '03 Jun 21:19:55',
 '03 Jun 22:58:14',
 '04 Jun 14:01:19',
 '04 Jun 15:37:03',
 '04 Jun 17:15:08',
 '04 Jun 18:52:37',
 '04 Jun 20:28:49',
 '04 Jun 22:05:50',
 '05 Jun 13:11:41',
 '05 Jun 14:45:45',
 '05 Jun 16:23:27',
 '05 Jun 18:01:24',
 '05 Jun 19:37:46',
 '05 Jun 21:14:17',
 '06 Jun 13:54:41',
 '06 Jun 15:31:47',
 '06 Jun 17:10:00',
 '06 Jun 18:46:43',
 '06 Jun 20:22:59',
 '07 Jun 13:03:57',
 '07 Jun 14:40:12',
 '07 Jun 16:18:24',
 '07 Jun 17:55:37',
 '07 Jun 19:31:48',
 '07 Jun 21:09:15',
 '08 Jun 12:13:44',
 '08 Jun 13:48:47',
 '08 Jun 15:26:42',
 '08 Jun 17:04:26',
 '08 Jun 18:40:41',
 '08 Jun 20:17:24',
 '09 Jun 12:57:34',
 '09 Jun 14:34:58',
 '09 Jun 16:13:05',
 '09 Jun 17:49:36',
 '09 Jun 19:25:58',
 '10 Jun 12:06:37',
 '10 Jun 13:43:18',
 '10 Jun 15:21:32',
 '10 Jun 16:58:30',
 '10 Jun 18:34:41',
 '10 Jun 20:12:55',
 '11 Jun 11:16:03',
 '11 Jun 12:51:44',
 '11 Jun 14:29:50',
 '11 Jun 16:07:20',
 '11 Jun 17:43:31',
 '11 Jun 19:20:31',
 '12 Jun 10:26:26',
 '12 Jun 12:00:23',
 '12 Jun 13:38:04',
 '12 Jun 15:16:01',
 '12 Jun 16:52:23',
 '12 Jun 18:28:53',
 '13 Jun 11:09:15',
 '13 Jun 12:46:18',
 '13 Jun 14:24:31',
 '13 Jun 16:01:16',
 '13 Jun 17:37:30',
 '14 Jun 10:18:27',
 '14 Jun 11:54:39',
 '14 Jun 13:32:50',
 '14 Jun 15:10:05',
 '14 Jun 16:46:15',
 '14 Jun 18:23:38',
 '15 Jun 09:28:12',
 '15 Jun 11:03:09',
 '15 Jun 12:41:03',
 '15 Jun 14:18:48',
 '15 Jun 15:55:04',
 '15 Jun 17:31:45',
 '16 Jun 10:11:52',
 '16 Jun 11:49:14',
 '16 Jun 13:27:21',
 '16 Jun 15:03:53',
 '16 Jun 16:40:14',
 '17 Jun 09:20:50',
 '17 Jun 10:57:28',
 '17 Jun 12:35:43',
 '17 Jun 14:12:42',
 '17 Jun 15:48:53',
 '17 Jun 17:26:59',
 '18 Jun 08:30:14',
 '18 Jun 10:05:51',
 '18 Jun 11:43:55',
 '18 Jun 13:21:26',
 '18 Jun 14:57:38',
 '18 Jun 16:34:35',
 '19 Jun 07:40:40',
 '19 Jun 09:14:25',
 '19 Jun 10:52:03',
 '19 Jun 12:30:02',
 '19 Jun 14:06:25',
 '19 Jun 15:42:53',
 '20 Jun 08:23:13',
 '20 Jun 10:00:14',
 '20 Jun 11:38:27',
 '20 Jun 13:15:12',
 '20 Jun 14:51:26',
 '21 Jun 07:32:22',
 '21 Jun 09:08:30',
 '21 Jun 10:46:40',
 '21 Jun 12:23:56',
 '21 Jun 14:00:06',
 '21 Jun 15:37:26',
 '22 Jun 06:42:05',
 '22 Jun 08:16:55',
 '22 Jun 09:54:47',
 '22 Jun 11:32:34',
 '22 Jun 13:08:50',
 '22 Jun 14:45:29',
 '23 Jun 07:25:34',
 '23 Jun 09:02:53',
 '23 Jun 10:41:02',
 '23 Jun 12:17:35',
 '23 Jun 13:53:54',
 '24 Jun 06:34:30',
 '24 Jun 08:11:03',
 '24 Jun 09:49:17',
 '24 Jun 11:26:19',
 '24 Jun 13:02:29',
 '24 Jun 14:40:25',
 '25 Jun 05:43:51',
 '25 Jun 07:19:22',
 '25 Jun 08:57:24',
 '25 Jun 10:34:57',
 '25 Jun 12:11:09',
 '25 Jun 13:48:02',
 '26 Jun 04:54:29',
 '26 Jun 06:27:52',
 '26 Jun 08:05:27',
 '26 Jun 09:43:27',
 '26 Jun 11:19:51',
 '26 Jun 12:56:18',
 '27 Jun 05:36:37',
 '27 Jun 07:13:33',
 '27 Jun 08:51:46',
 '27 Jun 10:28:33',
 '27 Jun 12:04:46',
 '28 Jun 04:45:43',
 '28 Jun 06:21:45',
 '28 Jun 07:59:54',
 '28 Jun 09:37:12',
 '28 Jun 11:13:22',
 '28 Jun 12:50:36',
 '29 Jun 03:55:27',
 '29 Jun 05:30:07',
 '29 Jun 07:07:56',
 '29 Jun 08:45:44',
 '29 Jun 10:22:01',
 '29 Jun 11:58:37',
 '30 Jun 04:38:42',
 '30 Jun 06:15:57',
 '30 Jun 07:54:06',
 '30 Jun 09:30:41',
 '30 Jun 11:06:58',
 '01 Jul 03:47:34',
 '01 Jul 05:24:03',
 '01 Jul 07:02:16',
 '01 Jul 08:39:19',
 '01 Jul 10:15:29',
 '01 Jul 11:53:15',
 '02 Jul 02:56:55',
 '02 Jul 04:32:18',
 '02 Jul 06:10:17',
 '02 Jul 07:47:53',
 '02 Jul 09:24:05',
 '02 Jul 11:00:55',
 '03 Jul 03:40:44',
 '03 Jul 05:18:15',
 '03 Jul 06:56:17',
 '03 Jul 08:32:42',
 '03 Jul 10:09:06',
 '04 Jul 02:49:26',
 '04 Jul 04:26:16',
 '04 Jul 06:04:29',
 '04 Jul 07:41:20',
 '04 Jul 09:17:31',
 '05 Jul 01:58:29',
 '05 Jul 03:34:24',
 '05 Jul 05:12:31',
 '05 Jul 06:49:53',
 '05 Jul 08:26:02',
 '05 Jul 10:03:11',
 '06 Jul 01:08:16',
 '06 Jul 02:42:42',
 '06 Jul 04:20:28',
 '06 Jul 05:58:19',
 '06 Jul 07:34:37',
 '06 Jul 09:11:10',
 '07 Jul 01:51:15',
 '07 Jul 03:28:25',
 '07 Jul 05:06:35',
 '07 Jul 06:43:12',
 '07 Jul 08:19:28',
 '08 Jul 01:00:05',
 '08 Jul 02:36:27',
 '08 Jul 04:14:38',
 '08 Jul 05:51:45',
 '08 Jul 07:27:55',
 '08 Jul 09:05:30',
 '09 Jul 00:09:25',
 '09 Jul 01:44:38',
 '09 Jul 03:22:34',
 '09 Jul 05:00:13',
 '09 Jul 06:36:25',
 '09 Jul 08:13:10',
 '10 Jul 00:53:01',
 '10 Jul 02:30:27',
 '10 Jul 04:08:31',
 '10 Jul 05:44:58',
 '10 Jul 07:21:20',
 '11 Jul 00:01:39',
 '11 Jul 01:38:24',
 '11 Jul 03:16:37',
 '11 Jul 04:53:30',
 '11 Jul 06:29:40',
 '11 Jul 08:08:11',
 '11 Jul 23:10:42',
 '12 Jul 00:46:28',
 '12 Jul 02:24:33',
 '12 Jul 04:01:58',
 '12 Jul 05:38:07',
 '12 Jul 07:15:09',
 '12 Jul 22:20:35',
 '13 Jul 01:32:24',
 '13 Jul 03:10:18',
 '13 Jul 04:46:38',
 '13 Jul 06:23:07',
 '13 Jul 23:03:12',
 '14 Jul 00:40:16',
 '14 Jul 02:18:27',
 '14 Jul 03:55:08',
 '14 Jul 05:31:21',
 '14 Jul 22:12:01',
 '14 Jul 23:48:14',
 '15 Jul 01:26:25',
 '15 Jul 03:03:36',
 '15 Jul 04:39:44',
 '15 Jul 06:17:09',
 '15 Jul 21:21:23',
 '15 Jul 22:56:22',
 '16 Jul 00:34:15',
 '16 Jul 02:11:57',
 '16 Jul 03:48:10',
 '16 Jul 05:24:51',
 '16 Jul 22:04:41',
 '16 Jul 23:42:03',
 '17 Jul 01:20:09',
 '17 Jul 02:56:39',
 '17 Jul 04:32:57',
 '17 Jul 21:13:19',
 '17 Jul 22:49:56',
 '18 Jul 00:28:09',
 '18 Jul 02:05:06',
 '18 Jul 03:41:15',
 '18 Jul 05:19:20',
 '18 Jul 20:22:21',
 '18 Jul 21:57:56',
 '18 Jul 23:35:58',
 '19 Jul 01:13:28',
 '19 Jul 02:49:37',
 '19 Jul 04:26:33',
 '19 Jul 19:32:30',
 '19 Jul 21:06:08',
 '19 Jul 22:43:44',
 '20 Jul 00:21:42',
 '20 Jul 01:58:03',
 '20 Jul 03:34:29',
 '20 Jul 20:14:35',
 '20 Jul 21:51:32',
 '20 Jul 23:29:44',
 '21 Jul 01:06:29',
 '21 Jul 02:42:40',
 '21 Jul 19:23:23',
 '21 Jul 20:59:27',
 '21 Jul 22:37:35',
 '22 Jul 00:14:51',
 '22 Jul 01:50:58',
 '22 Jul 03:28:14',
 '22 Jul 18:32:48',
 '22 Jul 20:07:31',
 '22 Jul 21:45:20',
 '22 Jul 23:23:06',
 '23 Jul 00:59:21',
 '23 Jul 02:35:56',
 '23 Jul 19:15:48',
 '23 Jul 20:53:03',
 '23 Jul 22:31:11',
 '24 Jul 00:07:44',
 '24 Jul 01:44:00',
 '24 Jul 18:24:24',
 '24 Jul 20:00:52',
 '24 Jul 21:39:04',
 '24 Jul 23:16:05',
 '25 Jul 00:52:14',
 '25 Jul 02:30:01',
 '25 Jul 17:33:28',
 '25 Jul 19:08:49',
 '25 Jul 20:46:48',
 '25 Jul 22:24:22',
 '26 Jul 00:00:32',
 '26 Jul 01:37:21',
 '26 Jul 18:16:58',
 '26 Jul 19:54:29',
 '26 Jul 21:32:29',
 '26 Jul 23:08:53',
 '27 Jul 00:45:16',
 '27 Jul 17:25:23',
 '27 Jul 19:02:12',
 '27 Jul 20:40:25',
 '27 Jul 22:17:13',
 '27 Jul 23:53:23',
 '28 Jul 16:34:11',
 '28 Jul 18:10:03',
 '28 Jul 19:48:09',
 '28 Jul 21:25:30',
 '28 Jul 23:01:38',
 '29 Jul 00:38:44',
 '29 Jul 15:43:44',
 '29 Jul 17:18:05',
 '29 Jul 18:55:48',
 '29 Jul 20:33:39',
 '29 Jul 22:09:56',
 '29 Jul 23:46:27',
 '30 Jul 16:26:20',
 '30 Jul 18:03:27',
 '30 Jul 19:41:37',
 '30 Jul 21:18:14',
 '30 Jul 22:54:28',
 '31 Jul 15:34:55',
 '31 Jul 17:11:12',
 '31 Jul 18:49:23',
 '31 Jul 20:26:31',
 '31 Jul 22:02:38',
 '31 Jul 23:40:08',
 '01 Aug 14:44:02',
 '01 Aug 16:19:07',
 '01 Aug 17:57:01',
 '01 Aug 19:34:41',
 '01 Aug 21:10:52',
 '01 Aug 22:47:34',
 '02 Aug 15:27:13',
 '02 Aug 17:04:37',
 '02 Aug 18:42:41',
 '02 Aug 20:19:08',
 '02 Aug 21:55:27',
 '03 Aug 14:35:37',
 '03 Aug 16:12:16',
 '03 Aug 17:50:29',
 '03 Aug 19:27:23',
 '03 Aug 21:03:31',
 '03 Aug 22:41:46',
 '04 Aug 13:44:26',
 '04 Aug 15:20:04',
 '04 Aug 16:58:07',
 '04 Aug 18:35:34',
 '04 Aug 20:11:42',
 '04 Aug 21:48:39',
 '05 Aug 12:54:16',
 '05 Aug 14:28:04',
 '05 Aug 16:05:40',
 '05 Aug 17:43:36',
 '05 Aug 19:19:56',
 '05 Aug 20:56:22',
 '06 Aug 13:36:18',
 '06 Aug 15:13:15',
 '06 Aug 16:51:27',
 '06 Aug 18:28:09',
 '06 Aug 20:04:20',
 '07 Aug 12:44:52',
 '07 Aug 14:20:57',
 '07 Aug 15:59:05',
 '07 Aug 17:36:19',
 '07 Aug 19:12:26',
 '07 Aug 20:49:43',
 '08 Aug 11:54:05',
 '08 Aug 13:28:48',
 '08 Aug 15:06:37',
 '08 Aug 16:44:23',
 '08 Aug 18:20:37',
 '08 Aug 19:57:12',
 '09 Aug 12:36:53',
 '09 Aug 14:14:09',
 '09 Aug 15:52:16',
 '09 Aug 17:28:48',
 '09 Aug 19:05:03',
 '10 Aug 11:45:17',
 '10 Aug 13:21:45',
 '10 Aug 14:59:56',
 '10 Aug 16:36:57',
 '10 Aug 18:13:05',
 '10 Aug 19:50:52',
 '11 Aug 10:54:09',
 '11 Aug 12:29:30',
 '11 Aug 14:07:28',
 '11 Aug 15:45:01',
 '11 Aug 17:21:11',
 '11 Aug 18:57:59',
 '12 Aug 11:37:27',
 '12 Aug 13:14:57',
 '12 Aug 14:52:56',
 '12 Aug 16:29:20',
 '12 Aug 18:05:41',
 '13 Aug 10:45:40',
 '13 Aug 12:22:27',
 '13 Aug 14:00:39',
 '13 Aug 15:37:28',
 '13 Aug 17:13:37',
 '14 Aug 09:54:16',
 '14 Aug 11:30:06',
 '14 Aug 13:08:11',
 '14 Aug 14:45:32',
 '14 Aug 16:21:40',
 '14 Aug 17:58:44',
 '15 Aug 09:03:40',
 '15 Aug 10:37:56',
 '15 Aug 12:15:38',
 '15 Aug 13:53:29',
 '15 Aug 15:29:45',
 '15 Aug 17:06:14',
 '16 Aug 09:45:59',
 '16 Aug 11:23:05',
 '16 Aug 13:01:14',
 '16 Aug 14:37:52',
 '16 Aug 16:14:04',
 '17 Aug 08:54:23',
 '17 Aug 10:30:37',
 '17 Aug 12:08:47',
 '17 Aug 13:45:55',
 '17 Aug 15:22:02',
 '17 Aug 16:59:29',
 '18 Aug 08:03:20',
 '18 Aug 09:38:20',
 '18 Aug 11:16:12',
 '18 Aug 12:53:53',
 '18 Aug 14:30:05',
 '18 Aug 16:06:45',
 '19 Aug 08:46:16',
 '19 Aug 10:23:36',
 '19 Aug 12:01:41',
 '19 Aug 13:38:09',
 '19 Aug 15:14:26',
 '20 Aug 07:54:29',
 '20 Aug 09:31:04',
 '20 Aug 11:09:16',
 '20 Aug 12:46:11',
 '20 Aug 14:22:19',
 '20 Aug 16:00:23',
 '21 Aug 07:03:08',
 '21 Aug 08:38:40',
 '21 Aug 10:16:40',
 '21 Aug 11:54:10',
 '21 Aug 13:30:18',
 '21 Aug 15:07:10']
