Acquire the Data

Sources of Data

We want to understand which trends in machine learning matter right now, so we need a list of machine learning articles that people are discussing. Many sources could provide this; we picked three.

  1. Reddit.com - Machine Learning - Reddit is a user-generated discussion forum where the community discusses recent articles and topics on machine learning.

  2. Data Tau - Data Tau is the Hacker News of machine learning. Users post articles about the latest trends in data science and machine learning and can discuss them.

  3. Twitter #machinelearning - We can also search Twitter for the #machinelearning hashtag to find the latest articles and posts about machine learning being discussed on social media.

Working with Data Tau

Let us start with the Data Tau site and scrape it to acquire the data.

We want to scrape the title and date of each article on this page.
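
Data Tau shows each article's age as a relative string (for example "4 hours ago" or "1 day ago" in the page source further down) rather than a calendar date. The sketch below shows one possible way to turn such a string into an approximate timestamp. The helper name `age_to_datetime`, the supported units, and the reference time are assumptions made for illustration, not part of the original notebook.

```python
import re
from datetime import datetime, timedelta

# Units seen in the Data Tau listing are "hour" and "day";
# "minute" is included here as an assumption.
_UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}
_AGE_PATTERN = re.compile(r"(\d+)\s+(minute|hour|day)s?\s+ago")


def age_to_datetime(age_text, now=None):
    """Convert a relative age such as '4 hours ago' to an approximate datetime.

    Returns None when no recognisable age is found in the text.
    """
    now = now or datetime.now()
    match = _AGE_PATTERN.search(age_text)
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return now - timedelta(seconds=count * _UNIT_SECONDS[unit])


reference = datetime(2016, 3, 22, 12, 0)  # hypothetical scrape time
print(age_to_datetime("4 hours ago", now=reference))  # 2016-03-22 08:00:00
print(age_to_datetime("1 day ago", now=reference))    # 2016-03-21 12:00:00
```

The result is only as precise as the unit on the page: "1 day ago" could be anywhere within a 24-hour window, so these timestamps are best treated as approximate.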


In [1]:
import requests
from bs4 import BeautifulSoup 
import re
import pandas as pd

In [2]:
base_url = 'http://www.datatau.com'

Understand the HTML Structure


In [3]:
# Use requests to fetch the page at base_url
dataTau = requests.get(base_url)

In [4]:
# Check that the request succeeded - we should see <Response [200]>
dataTau


Out[4]:
<Response [200]>

In [5]:
with open('dataTau.html', 'rb') as f:  # saved copy of the page; replaces the Response above
    dataTau = f.read()
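
The cell above reads the page from a local file, `dataTau.html`, instead of using the live response. The notebook does not show how that file was created; the sketch below is one possible way to combine the two steps, fetching only when no saved copy exists. The helper `load_or_fetch` and its `fetch` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only if the file is missing."""
    cache = Path(path)
    if cache.exists():
        return cache.read_bytes()
    content = fetch()           # expected to return the page as bytes
    cache.write_bytes(content)  # keep a copy so later runs skip the download
    return content


# Possible usage with the names from the cells above (the first call needs
# network access):
# dataTau = load_or_fetch('dataTau.html', lambda: requests.get(base_url).content)
```

Working from a saved copy keeps the rest of the notebook reproducible, since the live front page changes as new articles are posted.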

In [6]:
# Look at the raw content of the page (a bytes object)
dataTau


Out[6]:
b'<html><head><link rel="stylesheet" type="text/css" href="news.css">\n<link rel="shortcut icon" href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico">\n<script>\nfunction byId(id) {\n  return document.getElementById(id);\n}\n\nfunction vote(node) {\n  var v = node.id.split(/_/);   // {\'up\', \'123\'}\n  var item = v[1]; \n\n  // adjust score\n  var score = byId(\'score_\' + item);\n  var newscore = parseInt(score.innerHTML) + (v[0] == \'up\' ? 1 : -1);\n  score.innerHTML = newscore + (newscore == 1 ? \' point\' : \' points\');\n\n  // hide arrows\n  byId(\'up_\'   + item).style.visibility = \'hidden\';\n  byId(\'down_\' + item).style.visibility = \'hidden\';\n\n  // ping server\n  var ping = new Image();\n  ping.src = node.href;\n\n  return false; // cancel browser nav\n} </script><script>\n\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)\n  })(window,document,\'script\',\'//www.google-analytics.com/analytics.js\',\'ga\');\n\n  ga(\'create\', \'UA-46326769-1\', \'datatau.com\');\n  ga(\'send\', \'pageview\');\n\n</script><title>DataTau</title></head><body><center><table border=0 cellpadding=0 cellspacing=0 width="85%" bgcolor=#f6f6ef><tr><td bgcolor=#00b4b4><table border=0 cellpadding=0 cellspacing=0 width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.datatau.com"><img src="arc.png" width=18 height=18 style="border:1px #b4b400 solid;"></img></a></td><td style="line-height:12pt; height:10px;"><span class="pagetop"><b><a href="news">DataTau</a></b><img src="s.gif" height=1 width=10><a href="newest">new</a> | <a href="newcomments">comments</a> | <a href="leaders">leaders</a> | <a href="submit">submit</a></span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop"><a 
href="/x?fnid=fDLzOSbeCa">login</a></span></td></tr></table></td></tr><tr style="height:10px"></tr><tr><td><table border=0 cellpadding=0 cellspacing=0><tr><td align=right valign=top class="title">1.</td><td><center><a id=nil href="vote?for=11989&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11989></span></center></td><td class="title"><a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a><span class="comhead"> (springboard.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11989>5 points</span> by <a href="user?id=Rogerh91">Rogerh91</a> 4 hours ago  | <a href="item?id=11989">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">2.</td><td><center><a id=nil href="vote?for=11986&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11986></span></center></td><td class="title"><a href="http://blog.insightdatalabs.com/spark-pipelines-elegant-yet-powerful/" rel="nofollow">Spark Pipelines: Elegant Yet Powerful</a><span class="comhead"> (insightdatalabs.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11986>3 points</span> by <a href="user?id=aouyang1">aouyang1</a> 7 hours ago  | <a href="item?id=11986">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">3.</td><td><center><a id=nil href="vote?for=11973&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11973></span></center></td><td class="title"><a href="https://www.youtube.com/watch?v=KeJINHjyzOU">Deep Advances in Generative Modeling</a><span class="comhead"> (youtube.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11973>7 points</span> by <a 
href="user?id=gwulfs">gwulfs</a> 13 hours ago  | <a href="item?id=11973">1 comment</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">4.</td><td><center><a id=nil href="vote?for=11980&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11980></span></center></td><td class="title"><a href="http://www.buzzfeed.com/westleyargentum/stuff-vcs-say#.lk1wooEBL" rel="nofollow">Shit VCs Say</a><span class="comhead"> (buzzfeed.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11980>3 points</span> by <a href="user?id=Argentum01">Argentum01</a> 8 hours ago  | <a href="item?id=11980">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">5.</td><td><center><a id=nil href="vote?for=11967&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11967></span></center></td><td class="title"><a href="http://sebastianraschka.com/blog/2015/why-python.html" rel="nofollow">Python, Machine Learning, and Language Wars</a><span class="comhead"> (sebastianraschka.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11967>4 points</span> by <a href="user?id=pmigdal">pmigdal</a> 15 hours ago  | <a href="item?id=11967">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">6.</td><td><center><a id=nil href="vote?for=11975&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11975></span></center></td><td class="title"><a href="https://iamtrask.github.io/2015/07/12/basic-python-network/" rel="nofollow">A Neural Network in 11 lines of Python </a><span class="comhead"> (github.io) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11975>3 points</span> by <a href="user?id=dekhtiar">dekhtiar</a> 13 hours ago  | <a href="item?id=11975">discuss</a></td></tr><tr 
style="height:5px"></tr><tr><td align=right valign=top class="title">7.</td><td><center><a id=nil href="vote?for=11955&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11955></span></center></td><td class="title"><a href="http://setosa.io/ev/markov-chains/">Markov Chains Explained Visually</a><span class="comhead"> (setosa.io) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11955>13 points</span> by <a href="user?id=zeroviscosity">zeroviscosity</a> 1 day ago  | <a href="item?id=11955">1 comment</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">8.</td><td><center><a id=nil href="vote?for=11952&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11952></span></center></td><td class="title"><a href="https://github.com/dodger487/dplython">Dplython: Dplyr for Python</a><span class="comhead"> (github.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11952>13 points</span> by <a href="user?id=thenaturalist">thenaturalist</a> 1 day ago  | <a href="item?id=11952">3 comments</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">9.</td><td><center><a id=nil href="vote?for=11940&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11940></span></center></td><td class="title"><a href="http://research.google.com/pubs/pub41854.html">Inferring causal impact using Bayesian structural time-series models</a><span class="comhead"> (google.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11940>8 points</span> by <a href="user?id=Homunculiheaded">Homunculiheaded</a> 1 day ago  | <a href="item?id=11940">1 comment</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">10.</td><td><center><a id=nil href="vote?for=11948&dir=up&whence=%6e%65%77%73"><img 
src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11948></span></center></td><td class="title"><a href="http://tech.marksblogg.com/billion-nyc-taxi-rides-spark-emr.html" rel="nofollow">A Billion Taxi Rides on Amazon EMR running Spark</a><span class="comhead"> (marksblogg.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11948>5 points</span> by <a href="user?id=marklit">marklit</a> 1 day ago  | <a href="item?id=11948">1 comment</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">11.</td><td><center><a id=nil href="vote?for=11946&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11946></span></center></td><td class="title"><a href="http://trendct.org/2016/03/18/tutorial-web-scraping-and-mapping-breweries-with-import-io-and-r/" rel="nofollow">Tutorial: Web scraping and mapping breweries with import.io and R</a><span class="comhead"> (trendct.org) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11946>4 points</span> by <a href="user?id=jasdumas">jasdumas</a> 1 day ago  | <a href="item?id=11946">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">12.</td><td><center><a id=nil href="vote?for=11939&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11939></span></center></td><td class="title"><a href="http://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/" rel="nofollow">The rise of greedy robots</a><span class="comhead"> (yanirseroussi.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11939>4 points</span> by <a href="user?id=yanir">yanir</a> 2 days ago  | <a href="item?id=11939">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">13.</td><td><center><a id=nil href="vote?for=11905&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 
hspace=2></a><span id=down_11905></span></center></td><td class="title"><a href="https://github.com/jmportilla/Python-for-Algorithms--Data-Structures--and-Interviews">Python for Data Structures, Algorithms, and Interviews</a><span class="comhead"> (github.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11905>18 points</span> by <a href="user?id=kokoubaby">kokoubaby</a> 4 days ago  | <a href="item?id=11905">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">14.</td><td><center><a id=nil href="vote?for=11956&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11956></span></center></td><td class="title"><a href="http://techblog.netflix.com/2016/03/extracting-image-metadata-at-scale.html" rel="nofollow">Extracting image metadata at scale</a><span class="comhead"> (netflix.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11956>2 points</span> by <a href="user?id=zachwill">zachwill</a> 1 day ago  | <a href="item?id=11956">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">15.</td><td><center><a id=nil href="vote?for=11909&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11909></span></center></td><td class="title"><a href="http://blog.datalifebalance.com/lift-charts-a-data-scientists-secret-weapon/">Lift charts - A data scientist\'s secret weapon</a><span class="comhead"> (datalifebalance.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11909>14 points</span> by <a href="user?id=datenheini">datenheini</a> 4 days ago  | <a href="item?id=11909">2 comments</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">16.</td><td><center><a id=nil href="vote?for=11934&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span 
id=down_11934></span></center></td><td class="title"><a href="http://swanintelligence.com/how-to-become-a-machine-learning-expert-in-one-simple-step.html" rel="nofollow">How To Become A Machine Learning Expert In One Simple Step</a><span class="comhead"> (swanintelligence.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11934>4 points</span> by <a href="user?id=swanint">swanint</a> 2 days ago  | <a href="item?id=11934">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">17.</td><td><center><a id=nil href="vote?for=11910&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11910></span></center></td><td class="title"><a href="http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/">Engineers Shouldn\xe2\x80\x99t Write ETL: High Functioning Data Science Departments</a><span class="comhead"> (stitchfix.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11910>10 points</span> by <a href="user?id=legel">legel</a> 4 days ago  | <a href="item?id=11910">3 comments</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">18.</td><td><center><a id=nil href="vote?for=11937&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11937></span></center></td><td class="title"><a href="http://www.willmcginnis.com/2016/03/15/simple-estimation-hierarchical-events-petersburg/" rel="nofollow">Simple estimation of hierarchical events with petersburg</a><span class="comhead"> (willmcginnis.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11937>3 points</span> by <a href="user?id=wdm0006">wdm0006</a> 2 days ago  | <a href="item?id=11937">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">19.</td><td><center><a id=nil href="vote?for=11938&dir=up&whence=%6e%65%77%73"><img 
src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11938></span></center></td><td class="title"><a href="item?id=11938">Data Science Side Project</a></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11938>6 points</span> by <a href="user?id=yashpatel5400">yashpatel5400</a> 2 days ago  | <a href="item?id=11938">8 comments</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">20.</td><td><center><a id=nil href="vote?for=11920&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11920></span></center></td><td class="title"><a href="http://multithreaded.stitchfix.com/blog/2016/02/04/computer-vision-state-of-the-art/">Unsupervised Computer Vision: The Current State of the Art</a><span class="comhead"> (stitchfix.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11920>6 points</span> by <a href="user?id=carlosfaham">carlosfaham</a> 3 days ago  | <a href="item?id=11920">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">21.</td><td><center><a id=nil href="vote?for=11882&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11882></span></center></td><td class="title"><a href="https://drive.google.com/file/d/0BxGB59WxQI5oTXpQd09jbVpvalE/view">Data Engineering at Slack: Twelve Mistakes I\'ve Made In My First Three Months</a><span class="comhead"> (google.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11882>14 points</span> by <a href="user?id=gwulfs">gwulfs</a> 6 days ago  | <a href="item?id=11882">2 comments</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">22.</td><td><center><a id=nil href="vote?for=11931&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11931></span></center></td><td class="title"><a 
href="http://www.randalolson.com/2016/03/11/what-data-visualization-tools-do-rdataisbeautiful-oc-creators-use/" rel="nofollow">What data visualization tools do /r/DataIsBeautiful OC creators use?</a><span class="comhead"> (randalolson.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11931>3 points</span> by <a href="user?id=pmigdal">pmigdal</a> 2 days ago  | <a href="item?id=11931">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">23.</td><td><center><a id=nil href="vote?for=11917&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11917></span></center></td><td class="title"><a href="https://nikolaygrozev.wordpress.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/">Reshaping in Pandas</a><span class="comhead"> (nikolaygrozev.wordpress.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11917>6 points</span> by <a href="user?id=carlosgg">carlosgg</a> 4 days ago  | <a href="item?id=11917">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">24.</td><td><center><a id=nil href="vote?for=11923&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11923></span></center></td><td class="title"><a href="http://blackboxchallenge.com/eng" rel="nofollow">An unusual interactive machine learning challenge</a><span class="comhead"> (blackboxchallenge.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11923>4 points</span> by <a href="user?id=gglumov">gglumov</a> 3 days ago  | <a href="item?id=11923">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">25.</td><td><center><a id=nil href="vote?for=11922&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11922></span></center></td><td 
class="title"><a href="http://blog.datumbox.com/datumbox-machine-learning-framework-0-7-0-released/" rel="nofollow">Datumbox Machine Learning Framework 0.7.0 Released</a><span class="comhead"> (datumbox.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11922>4 points</span> by <a href="user?id=datumbox">datumbox</a> 3 days ago  | <a href="item?id=11922">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">26.</td><td><center><a id=nil href="vote?for=11865&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11865></span></center></td><td class="title"><a href="http://p.migdal.pl/2016/03/15/data-science-intro-for-math-phys-background.html">Data science intro for math/phys background</a><span class="comhead"> (p.migdal.pl) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11865>14 points</span> by <a href="user?id=pmigdal">pmigdal</a> 7 days ago  | <a href="item?id=11865">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">27.</td><td><center><a id=nil href="vote?for=11837&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11837></span></center></td><td class="title"><a href="http://lumiverse.io/series/neural-networks-demystified">Neural Networks demystified</a><span class="comhead"> (lumiverse.io) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11837>16 points</span> by <a href="user?id=elyase">elyase</a> 8 days ago  | <a href="item?id=11837">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">28.</td><td><center><a id=nil href="vote?for=11880&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11880></span></center></td><td class="title"><a href="http://insighthealthdata.com/blog/HealthyBeats/">What machines can learn from 
Apple Watch: detecting undiagnosed heart condition</a><span class="comhead"> (insighthealthdata.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11880>9 points</span> by <a href="user?id=koukouhappy">koukouhappy</a> 6 days ago  | <a href="item?id=11880">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">29.</td><td><center><a id=nil href="vote?for=11862&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11862></span></center></td><td class="title"><a href="http://blog.dominodatalab.com/open-source-winning-against-proprietary-data-science-vendors/">Data Science Tools: The Biggest Winners and Losers</a><span class="comhead"> (dominodatalab.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11862>12 points</span> by <a href="user?id=AnnaOnTheWeb">AnnaOnTheWeb</a> 7 days ago  | <a href="item?id=11862">discuss</a></td></tr><tr style="height:5px"></tr><tr><td align=right valign=top class="title">30.</td><td><center><a id=nil href="vote?for=11868&dir=up&whence=%6e%65%77%73"><img src="grayarrow.gif" border=0 vspace=3 hspace=2></a><span id=down_11868></span></center></td><td class="title"><a href="https://medium.com/@thomasrorystone/10-years-of-open-source-machine-learning-64bb6fb18eb2#.vp3r4try5">10 Years of Open Source Machine Learning</a><span class="comhead"> (medium.com) </span></td></tr><tr><td colspan=2></td><td class="subtext"><span id=score_11868>9 points</span> by <a href="user?id=tstonez">tstonez</a> 6 days ago  | <a href="item?id=11868">1 comment</a></td></tr><tr style="height:5px"></tr><tr style="height:10px"></tr><tr><td colspan=2></td><td class="title"><a href="/x?fnid=CSS821ucAs" rel="nofollow">More</a></td></tr></table></td></tr><tr><td><img src="s.gif" height=10 width=0><table width="100%" cellspacing=0 cellpadding=1><tr><td 
bgcolor=#00b4b4></td></tr></table><br>\n<center></center></td></tr></table></center><center><a href="http://www.datatau.com/rss">RSS\n</a><a href="http://www.datatau.com/item?id=1">| Announcements\n</a></center></body></html>'

In [7]:
# Parse the HTML with BeautifulSoup to create a soup!
soup = BeautifulSoup(dataTau,'html.parser')
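
Once the page is parsed, the titles and ages can be pulled out by class name. The sketch below is a minimal illustration on a small fragment copied and trimmed from the page source shown above, so it runs without the saved file. The variable names `sample_html` and `sample_soup` are assumptions chosen to avoid clobbering the notebook's own `soup`.

```python
from bs4 import BeautifulSoup

# One article entry, copied and trimmed from the page source shown above.
sample_html = """
<table>
<tr><td align=right valign=top class="title">1.</td>
<td class="title"><a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a><span class="comhead"> (springboard.com) </span></td></tr>
<tr><td colspan=2></td><td class="subtext"><span id=score_11989>5 points</span> by <a href="user?id=Rogerh91">Rogerh91</a> 4 hours ago  | <a href="item?id=11989">discuss</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Title links sit inside <td class="title"> cells. The rank cells ("1.")
# share that class but contain no <a>, so they are skipped.
titles = []
for cell in sample_soup.find_all("td", class_="title"):
    link = cell.find("a")
    if link is not None:
        titles.append(link.get_text(strip=True))

# The relative age ("4 hours ago") is plain text inside <td class="subtext">.
subtexts = [cell.get_text(" ", strip=True)
            for cell in sample_soup.find_all("td", class_="subtext")]

print(titles)
print(subtexts)
```

On the full page, the "More" link at the bottom also sits in a `td` with class `title`, so it would need to be filtered out before pairing titles with their subtext rows.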

In [8]:
# Print the prettified HTML - not so pretty, though!
print(soup.prettify())


<html>
 <head>
  <link href="news.css" rel="stylesheet" type="text/css">
   <link href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico" rel="shortcut icon">
    <script>
     function byId(id) {
  return document.getElementById(id);
}

function vote(node) {
  var v = node.id.split(/_/);   // {'up', '123'}
  var item = v[1]; 

  // adjust score
  var score = byId('score_' + item);
  var newscore = parseInt(score.innerHTML) + (v[0] == 'up' ? 1 : -1);
  score.innerHTML = newscore + (newscore == 1 ? ' point' : ' points');

  // hide arrows
  byId('up_'   + item).style.visibility = 'hidden';
  byId('down_' + item).style.visibility = 'hidden';

  // ping server
  var ping = new Image();
  ping.src = node.href;

  return false; // cancel browser nav
}
    </script>
    <script>
     (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-46326769-1', 'datatau.com');
  ga('send', 'pageview');
    </script>
    <title>
     DataTau
    </title>
   </link>
  </link>
 </head>
 <body>
  <center>
   <table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" width="85%">
    <tr>
     <td bgcolor="#00b4b4">
      <table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%">
       <tr>
        <td style="width:18px;padding-right:4px">
         <a href="http://www.datatau.com">
          <img height="18" src="arc.png" style="border:1px #b4b400 solid;" width="18"/>
         </a>
        </td>
        <td style="line-height:12pt; height:10px;">
         <span class="pagetop">
          <b>
           <a href="news">
            DataTau
           </a>
          </b>
          <img height="1" src="s.gif" width="10">
           <a href="newest">
            new
           </a>
           |
           <a href="newcomments">
            comments
           </a>
           |
           <a href="leaders">
            leaders
           </a>
           |
           <a href="submit">
            submit
           </a>
          </img>
         </span>
        </td>
        <td style="text-align:right;padding-right:4px;">
         <span class="pagetop">
          <a href="/x?fnid=fDLzOSbeCa">
           login
          </a>
         </span>
        </td>
       </tr>
      </table>
     </td>
    </tr>
    <tr style="height:10px">
    </tr>
    <tr>
     <td>
      <table border="0" cellpadding="0" cellspacing="0">
       <tr>
        <td align="right" class="title" valign="top">
         1.
        </td>
        <td>
         <center>
          <a href="vote?for=11989&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11989">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">
          An Exploration of R, Yelp, and the Search for Good Indian Food
         </a>
         <span class="comhead">
          (springboard.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11989">
          5 points
         </span>
         by
         <a href="user?id=Rogerh91">
          Rogerh91
         </a>
         4 hours ago  |
         <a href="item?id=11989">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         2.
        </td>
        <td>
         <center>
          <a href="vote?for=11986&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11986">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://blog.insightdatalabs.com/spark-pipelines-elegant-yet-powerful/" rel="nofollow">
          Spark Pipelines: Elegant Yet Powerful
         </a>
         <span class="comhead">
          (insightdatalabs.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11986">
          3 points
         </span>
         by
         <a href="user?id=aouyang1">
          aouyang1
         </a>
         7 hours ago  |
         <a href="item?id=11986">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         3.
        </td>
        <td>
         <center>
          <a href="vote?for=11973&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11973">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://www.youtube.com/watch?v=KeJINHjyzOU">
          Deep Advances in Generative Modeling
         </a>
         <span class="comhead">
          (youtube.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11973">
          7 points
         </span>
         by
         <a href="user?id=gwulfs">
          gwulfs
         </a>
         13 hours ago  |
         <a href="item?id=11973">
          1 comment
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         4.
        </td>
        <td>
         <center>
          <a href="vote?for=11980&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11980">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://www.buzzfeed.com/westleyargentum/stuff-vcs-say#.lk1wooEBL" rel="nofollow">
          Shit VCs Say
         </a>
         <span class="comhead">
          (buzzfeed.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11980">
          3 points
         </span>
         by
         <a href="user?id=Argentum01">
          Argentum01
         </a>
         8 hours ago  |
         <a href="item?id=11980">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         5.
        </td>
        <td>
         <center>
          <a href="vote?for=11967&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11967">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://sebastianraschka.com/blog/2015/why-python.html" rel="nofollow">
          Python, Machine Learning, and Language Wars
         </a>
         <span class="comhead">
          (sebastianraschka.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11967">
          4 points
         </span>
         by
         <a href="user?id=pmigdal">
          pmigdal
         </a>
         15 hours ago  |
         <a href="item?id=11967">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         6.
        </td>
        <td>
         <center>
          <a href="vote?for=11975&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11975">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://iamtrask.github.io/2015/07/12/basic-python-network/" rel="nofollow">
          A Neural Network in 11 lines of Python
         </a>
         <span class="comhead">
          (github.io)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11975">
          3 points
         </span>
         by
         <a href="user?id=dekhtiar">
          dekhtiar
         </a>
         13 hours ago  |
         <a href="item?id=11975">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         7.
        </td>
        <td>
         <center>
          <a href="vote?for=11955&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11955">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://setosa.io/ev/markov-chains/">
          Markov Chains Explained Visually
         </a>
         <span class="comhead">
          (setosa.io)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11955">
          13 points
         </span>
         by
         <a href="user?id=zeroviscosity">
          zeroviscosity
         </a>
         1 day ago  |
         <a href="item?id=11955">
          1 comment
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         8.
        </td>
        <td>
         <center>
          <a href="vote?for=11952&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11952">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://github.com/dodger487/dplython">
          Dplython: Dplyr for Python
         </a>
         <span class="comhead">
          (github.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11952">
          13 points
         </span>
         by
         <a href="user?id=thenaturalist">
          thenaturalist
         </a>
         1 day ago  |
         <a href="item?id=11952">
          3 comments
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         9.
        </td>
        <td>
         <center>
          <a href="vote?for=11940&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11940">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://research.google.com/pubs/pub41854.html">
          Inferring causal impact using Bayesian structural time-series models
         </a>
         <span class="comhead">
          (google.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11940">
          8 points
         </span>
         by
         <a href="user?id=Homunculiheaded">
          Homunculiheaded
         </a>
         1 day ago  |
         <a href="item?id=11940">
          1 comment
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         10.
        </td>
        <td>
         <center>
          <a href="vote?for=11948&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11948">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://tech.marksblogg.com/billion-nyc-taxi-rides-spark-emr.html" rel="nofollow">
          A Billion Taxi Rides on Amazon EMR running Spark
         </a>
         <span class="comhead">
          (marksblogg.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11948">
          5 points
         </span>
         by
         <a href="user?id=marklit">
          marklit
         </a>
         1 day ago  |
         <a href="item?id=11948">
          1 comment
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         11.
        </td>
        <td>
         <center>
          <a href="vote?for=11946&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11946">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://trendct.org/2016/03/18/tutorial-web-scraping-and-mapping-breweries-with-import-io-and-r/" rel="nofollow">
          Tutorial: Web scraping and mapping breweries with import.io and R
         </a>
         <span class="comhead">
          (trendct.org)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11946">
          4 points
         </span>
         by
         <a href="user?id=jasdumas">
          jasdumas
         </a>
         1 day ago  |
         <a href="item?id=11946">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         12.
        </td>
        <td>
         <center>
          <a href="vote?for=11939&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11939">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://yanirseroussi.com/2016/03/20/the-rise-of-greedy-robots/" rel="nofollow">
          The rise of greedy robots
         </a>
         <span class="comhead">
          (yanirseroussi.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11939">
          4 points
         </span>
         by
         <a href="user?id=yanir">
          yanir
         </a>
         2 days ago  |
         <a href="item?id=11939">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         13.
        </td>
        <td>
         <center>
          <a href="vote?for=11905&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11905">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://github.com/jmportilla/Python-for-Algorithms--Data-Structures--and-Interviews">
          Python for Data Structures, Algorithms, and Interviews
         </a>
         <span class="comhead">
          (github.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11905">
          18 points
         </span>
         by
         <a href="user?id=kokoubaby">
          kokoubaby
         </a>
         4 days ago  |
         <a href="item?id=11905">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         14.
        </td>
        <td>
         <center>
          <a href="vote?for=11956&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11956">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://techblog.netflix.com/2016/03/extracting-image-metadata-at-scale.html" rel="nofollow">
          Extracting image metadata at scale
         </a>
         <span class="comhead">
          (netflix.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11956">
          2 points
         </span>
         by
         <a href="user?id=zachwill">
          zachwill
         </a>
         1 day ago  |
         <a href="item?id=11956">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         15.
        </td>
        <td>
         <center>
          <a href="vote?for=11909&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11909">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://blog.datalifebalance.com/lift-charts-a-data-scientists-secret-weapon/">
          Lift charts - A data scientist's secret weapon
         </a>
         <span class="comhead">
          (datalifebalance.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11909">
          14 points
         </span>
         by
         <a href="user?id=datenheini">
          datenheini
         </a>
         4 days ago  |
         <a href="item?id=11909">
          2 comments
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         16.
        </td>
        <td>
         <center>
          <a href="vote?for=11934&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11934">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://swanintelligence.com/how-to-become-a-machine-learning-expert-in-one-simple-step.html" rel="nofollow">
          How To Become A Machine Learning Expert In One Simple Step
         </a>
         <span class="comhead">
          (swanintelligence.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11934">
          4 points
         </span>
         by
         <a href="user?id=swanint">
          swanint
         </a>
         2 days ago  |
         <a href="item?id=11934">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         17.
        </td>
        <td>
         <center>
          <a href="vote?for=11910&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11910">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/">
          Engineers Shouldn’t Write ETL: High Functioning Data Science Departments
         </a>
         <span class="comhead">
          (stitchfix.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11910">
          10 points
         </span>
         by
         <a href="user?id=legel">
          legel
         </a>
         4 days ago  |
         <a href="item?id=11910">
          3 comments
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         18.
        </td>
        <td>
         <center>
          <a href="vote?for=11937&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11937">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://www.willmcginnis.com/2016/03/15/simple-estimation-hierarchical-events-petersburg/" rel="nofollow">
          Simple estimation of hierarchical events with petersburg
         </a>
         <span class="comhead">
          (willmcginnis.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11937">
          3 points
         </span>
         by
         <a href="user?id=wdm0006">
          wdm0006
         </a>
         2 days ago  |
         <a href="item?id=11937">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         19.
        </td>
        <td>
         <center>
          <a href="vote?for=11938&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11938">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="item?id=11938">
          Data Science Side Project
         </a>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11938">
          6 points
         </span>
         by
         <a href="user?id=yashpatel5400">
          yashpatel5400
         </a>
         2 days ago  |
         <a href="item?id=11938">
          8 comments
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         20.
        </td>
        <td>
         <center>
          <a href="vote?for=11920&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11920">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://multithreaded.stitchfix.com/blog/2016/02/04/computer-vision-state-of-the-art/">
          Unsupervised Computer Vision: The Current State of the Art
         </a>
         <span class="comhead">
          (stitchfix.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11920">
          6 points
         </span>
         by
         <a href="user?id=carlosfaham">
          carlosfaham
         </a>
         3 days ago  |
         <a href="item?id=11920">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         21.
        </td>
        <td>
         <center>
          <a href="vote?for=11882&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11882">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://drive.google.com/file/d/0BxGB59WxQI5oTXpQd09jbVpvalE/view">
          Data Engineering at Slack: Twelve Mistakes I've Made In My First Three Months
         </a>
         <span class="comhead">
          (google.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11882">
          14 points
         </span>
         by
         <a href="user?id=gwulfs">
          gwulfs
         </a>
         6 days ago  |
         <a href="item?id=11882">
          2 comments
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         22.
        </td>
        <td>
         <center>
          <a href="vote?for=11931&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11931">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://www.randalolson.com/2016/03/11/what-data-visualization-tools-do-rdataisbeautiful-oc-creators-use/" rel="nofollow">
          What data visualization tools do /r/DataIsBeautiful OC creators use?
         </a>
         <span class="comhead">
          (randalolson.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11931">
          3 points
         </span>
         by
         <a href="user?id=pmigdal">
          pmigdal
         </a>
         2 days ago  |
         <a href="item?id=11931">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         23.
        </td>
        <td>
         <center>
          <a href="vote?for=11917&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11917">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://nikolaygrozev.wordpress.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/">
          Reshaping in Pandas
         </a>
         <span class="comhead">
          (nikolaygrozev.wordpress.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11917">
          6 points
         </span>
         by
         <a href="user?id=carlosgg">
          carlosgg
         </a>
         4 days ago  |
         <a href="item?id=11917">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         24.
        </td>
        <td>
         <center>
          <a href="vote?for=11923&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11923">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://blackboxchallenge.com/eng" rel="nofollow">
          An unusual interactive machine learning challenge
         </a>
         <span class="comhead">
          (blackboxchallenge.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11923">
          4 points
         </span>
         by
         <a href="user?id=gglumov">
          gglumov
         </a>
         3 days ago  |
         <a href="item?id=11923">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         25.
        </td>
        <td>
         <center>
          <a href="vote?for=11922&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11922">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://blog.datumbox.com/datumbox-machine-learning-framework-0-7-0-released/" rel="nofollow">
          Datumbox Machine Learning Framework 0.7.0 Released
         </a>
         <span class="comhead">
          (datumbox.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11922">
          4 points
         </span>
         by
         <a href="user?id=datumbox">
          datumbox
         </a>
         3 days ago  |
         <a href="item?id=11922">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         26.
        </td>
        <td>
         <center>
          <a href="vote?for=11865&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11865">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://p.migdal.pl/2016/03/15/data-science-intro-for-math-phys-background.html">
          Data science intro for math/phys background
         </a>
         <span class="comhead">
          (p.migdal.pl)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11865">
          14 points
         </span>
         by
         <a href="user?id=pmigdal">
          pmigdal
         </a>
         7 days ago  |
         <a href="item?id=11865">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         27.
        </td>
        <td>
         <center>
          <a href="vote?for=11837&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11837">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://lumiverse.io/series/neural-networks-demystified">
          Neural Networks demystified
         </a>
         <span class="comhead">
          (lumiverse.io)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11837">
          16 points
         </span>
         by
         <a href="user?id=elyase">
          elyase
         </a>
         8 days ago  |
         <a href="item?id=11837">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         28.
        </td>
        <td>
         <center>
          <a href="vote?for=11880&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11880">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://insighthealthdata.com/blog/HealthyBeats/">
          What machines can learn from Apple Watch: detecting undiagnosed heart condition
         </a>
         <span class="comhead">
          (insighthealthdata.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11880">
          9 points
         </span>
         by
         <a href="user?id=koukouhappy">
          koukouhappy
         </a>
         6 days ago  |
         <a href="item?id=11880">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         29.
        </td>
        <td>
         <center>
          <a href="vote?for=11862&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11862">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="http://blog.dominodatalab.com/open-source-winning-against-proprietary-data-science-vendors/">
          Data Science Tools: The Biggest Winners and Losers
         </a>
         <span class="comhead">
          (dominodatalab.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11862">
          12 points
         </span>
         by
         <a href="user?id=AnnaOnTheWeb">
          AnnaOnTheWeb
         </a>
         7 days ago  |
         <a href="item?id=11862">
          discuss
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr>
        <td align="right" class="title" valign="top">
         30.
        </td>
        <td>
         <center>
          <a href="vote?for=11868&amp;dir=up&amp;whence=%6e%65%77%73" id="nil">
           <img border="0" hspace="2" src="grayarrow.gif" vspace="3"/>
          </a>
          <span id="down_11868">
          </span>
         </center>
        </td>
        <td class="title">
         <a href="https://medium.com/@thomasrorystone/10-years-of-open-source-machine-learning-64bb6fb18eb2#.vp3r4try5">
          10 Years of Open Source Machine Learning
         </a>
         <span class="comhead">
          (medium.com)
         </span>
        </td>
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="subtext">
         <span id="score_11868">
          9 points
         </span>
         by
         <a href="user?id=tstonez">
          tstonez
         </a>
         6 days ago  |
         <a href="item?id=11868">
          1 comment
         </a>
        </td>
       </tr>
       <tr style="height:5px">
       </tr>
       <tr style="height:10px">
       </tr>
       <tr>
        <td colspan="2">
        </td>
        <td class="title">
         <a href="/x?fnid=CSS821ucAs" rel="nofollow">
          More
         </a>
        </td>
       </tr>
      </table>
     </td>
    </tr>
    <tr>
     <td>
      <img height="10" src="s.gif" width="0">
       <table cellpadding="1" cellspacing="0" width="100%">
        <tr>
         <td bgcolor="#00b4b4">
         </td>
        </tr>
       </table>
       <br>
        <center>
        </center>
       </br>
      </img>
     </td>
    </tr>
   </table>
  </center>
  <center>
   <a href="http://www.datatau.com/rss">
    RSS
   </a>
   <a href="http://www.datatau.com/item?id=1">
    | Announcements
   </a>
  </center>
 </body>
</html>

Get the title in each page

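Each page lists 30 articles. Let us find the HTML tag and attribute that hold the titles.

In the HTML above, each title sits in a <td> cell with class "title", so we start with the selector 'td .title'. Note the space: this matches any element with class "title" nested inside a <td>, which works here because the listing table itself sits inside an outer layout cell.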


In [9]:
title_class = soup.select('td .title')

In [10]:
len(title_class)


Out[10]:
61

That is about double the 30 we expected. Let us see why by examining the first two elements of the list and then the last one.


In [11]:
title_class[0:2]


Out[11]:
[<td align="right" class="title" valign="top">1.</td>,
 <td class="title"><a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a><span class="comhead"> (springboard.com) </span></td>]

In [12]:
title_class[-1]


Out[12]:
<td class="title"><a href="/x?fnid=CSS821ucAs" rel="nofollow">More</a></td>

Aha, we are getting both the rank number and the title for every article (30 + 30), plus the "More" cell, which makes 61. We need to be more specific and select only the <a> link inside each cell.


In [13]:
title_class = soup.select('td .title a')

In [14]:
len(title_class)


Out[14]:
31
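
To see why the broader selector over-counts, the sketch below runs both selectors on a tiny hand-written fragment. The fragment is an assumption that only mimics the nesting of the page (two articles plus a "More" row inside an outer layout cell); it is not the real page, and soup_demo is a name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the page layout: an outer layout cell
# wrapping an inner table with rank cells, title cells and a "More" row.
html = """
<table><tr><td>
  <table>
    <tr><td class="title">1.</td>
        <td class="title"><a href="http://example.com/a">First</a></td></tr>
    <tr><td class="title">2.</td>
        <td class="title"><a href="http://example.com/b">Second</a></td></tr>
    <tr><td class="title"><a href="/x?fnid=abc">More</a></td></tr>
  </table>
</td></tr></table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

cells = soup_demo.select("td .title")    # rank cells + title cells + "More" cell
links = soup_demo.select("td .title a")  # only the links inside those cells

print(len(cells), len(links))
print([a.get_text() for a in links])
```

With n articles, the first selector yields 2n + 1 cells and the second n + 1 links, which is consistent with the 61 and 31 seen above for n = 30.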

Why do we get 31 links and not 30? Let's check.


In [15]:
title_class[0]


Out[15]:
<a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a>

In [16]:
title_class[0].get_text()


Out[16]:
'An Exploration of R, Yelp, and the Search for Good Indian Food'

In [17]:
title_class[-1]


Out[17]:
<a href="/x?fnid=CSS821ucAs" rel="nofollow">More</a>

So the last link is "More", which points to the next page. That is useful: we can use its href to build the URL of the next page to scrape.
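
As a minimal sketch of that step, the relative href of the "More" link can be resolved against the site's base URL with the standard library's urljoin. The href value is the one shown in the output above; the variable names more_href and next_url are only illustrative.

```python
from urllib.parse import urljoin

base_url = "http://www.datatau.com"
more_href = "/x?fnid=CSS821ucAs"  # href of the "More" link shown above

# urljoin resolves the relative href against the base URL
next_url = urljoin(base_url, more_href)
print(next_url)
```

Plain string concatenation also works here because the href starts with a slash; urljoin is simply a bit more forgiving if the form of the href ever changes.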

NOTE: Taking care of the edge cases

When we run this on multiple pages, we find that a title cell sometimes contains more than one <a> link. To take care of this, we rewrite the selector to pick only the first <a> link that is a direct child of the title cell.


In [18]:
title_class = soup.select('td .title > a:nth-of-type(1)')

In [19]:
title_class[0].get_text()


Out[19]:
'An Exploration of R, Yelp, and the Search for Good Indian Food'
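
The effect of the stricter selector can be checked on a small made-up example. The fragment below is hypothetical (a single title cell holding two links, wrapped in an outer layout cell so that the 'td .title' part matches as it does on the real page); it only illustrates the selector, not the actual edge case found on the site.

```python
from bs4 import BeautifulSoup

# Hypothetical title cell with a second link, nested in an outer layout cell
# so that the 'td .title' part of the selector matches as it does on the page.
html = """
<table><tr><td>
  <table>
    <tr><td class="title">
      <a href="http://example.com/post">Main article</a>
      <a href="http://example.com/mirror">mirror</a>
    </td></tr>
  </table>
</td></tr></table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

all_links = soup_demo.select("td .title a")
first_links = soup_demo.select("td .title > a:nth-of-type(1)")

print([a.get_text() for a in all_links])    # both links in the cell
print([a.get_text() for a in first_links])  # only the first <a> child
```

Because of the child combinator (>), a link wrapped in another element inside the cell would also be excluded, since it is not a direct child of the title cell.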

Get the date for each title

To get the date for each title, we select the cells with class "subtext", using the selector '.subtext'.


In [20]:
date_class = soup.select('.subtext')

In [21]:
len(date_class)


Out[21]:
30

In [22]:
date_class[0]


Out[22]:
<td class="subtext"><span id="score_11989">5 points</span> by <a href="user?id=Rogerh91">Rogerh91</a> 4 hours ago  | <a href="item?id=11989">discuss</a></td>

In [23]:
date_class[0].get_text()


Out[23]:
'5 points by Rogerh91 4 hours ago  | discuss'
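
Note that this text is not a calendar date: it mixes the score, the user name and a relative age. One possible way to pull these apart later is a regular expression. The sketch below assumes every string follows the layout seen above ("N points by user age ago | ..."); parse_subtext and SUBTEXT_RE are hypothetical names, not part of the notebook.

```python
import re

# Assumed layout of the text: "<N> point(s) by <user> <age> ago | <comments>"
SUBTEXT_RE = re.compile(
    r"(?P<points>\d+)\s+points?\s+by\s+(?P<user>\S+)\s+"
    r"(?P<age>\d+\s+\w+)\s+ago"
)

def parse_subtext(text):
    """Return points, user and age from a subtext string, or None if it does not match."""
    match = SUBTEXT_RE.search(text)
    if match is None:
        return None
    return {
        "points": int(match.group("points")),
        "user": match.group("user"),
        "age": match.group("age"),
    }

print(parse_subtext("5 points by Rogerh91 4 hours ago  | discuss"))
```

Returning None for strings that do not match keeps unexpected rows visible instead of raising an error in the middle of a scrape.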

Automate the Scraping Process

We now write a function that takes a page URL, collects every title and date string on that page into a dataframe, and returns the "More" link so that we can move on to the next page.


In [24]:
# Let us create an empty dataframe to store the data
df = pd.DataFrame(columns=['title','date'])
df.count()


Out[24]:
title    0
date     0
dtype: int64

In [25]:
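def get_data_from_tau(url):
    """Scrape one page into the global df and return the "More" link tag."""
    print(url)
    dataTau = requests.get(url)
    soup = BeautifulSoup(dataTau.content, 'html.parser')
    # First <a> in each title cell: the article links followed by the "More" link
    title_class = soup.select('td .title > a:nth-of-type(1)')
    date_class = soup.select('.subtext')
    print(len(title_class), len(date_class))
    # Skip the last link, which is "More" rather than an article
    for i in range(len(title_class) - 1):
        df.loc[df.shape[0]] = [title_class[i].get_text(), date_class[i].get_text()]
    print('updated df with data')
    # The last link is assumed to be "More"; its href leads to the next page
    return title_class[-1]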

In [26]:
url = base_url
# Scrape six pages, following the "More" link from each page to the next
for _ in range(6):
    more_url = get_data_from_tau(url)
    url = base_url + more_url['href']


http://www.datatau.com
31 30
updated df with data
http://www.datatau.com/x?fnid=aFffLhBQyN
31 30
updated df with data
http://www.datatau.com/x?fnid=0urLeo7gjV
31 30
updated df with data
http://www.datatau.com/x?fnid=uMJcXgJIJs
31 30
updated df with data
http://www.datatau.com/x?fnid=qyftRfLQ6D
31 30
updated df with data
http://www.datatau.com/x?fnid=VcML5GXJiJ
31 30
updated df with data
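
The function above assumes that every page ends with a "More" link and that titles and dates line up one to one. A more defensive variant is sketched below under those stated assumptions: it takes the HTML as a string (so it can be tried without a network call), returns plain rows instead of writing to a global dataframe, and only treats the last link as pagination if its text is "More". The name parse_page and the demo fragment are hypothetical.

```python
from bs4 import BeautifulSoup

def parse_page(html):
    """Return (rows, more_href) for one listing page; more_href is None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("td .title > a:nth-of-type(1)")
    dates = soup.select(".subtext")

    more_href = None
    # Treat the final link as pagination only if its text is "More".
    if links and links[-1].get_text(strip=True) == "More":
        more_href = links.pop()["href"]

    # zip stops at the shorter list, so a count mismatch cannot raise IndexError.
    rows = [
        {"title": a.get_text(strip=True), "date": d.get_text(" ", strip=True)}
        for a, d in zip(links, dates)
    ]
    return rows, more_href

# Hypothetical one-article page used only to exercise the function.
demo_html = """
<table><tr><td><table>
  <tr><td class="title">1.</td>
      <td class="title"><a href="http://example.com/a">First</a></td></tr>
  <tr><td class="subtext"><span id="score_1">5 points</span> by
      <a href="user?id=u1">u1</a> 4 hours ago | <a href="item?id=1">discuss</a></td></tr>
  <tr><td class="title"><a href="/x?fnid=abc">More</a></td></tr>
</table></td></tr></table>
"""

rows, more_href = parse_page(demo_html)
print(rows)
print(more_href)
```

Collecting rows in a list and building the dataframe once at the end (for example with pd.DataFrame(rows)) also avoids growing the dataframe one row at a time. One caveat of this sketch: an article literally titled "More" in the last position would be mistaken for the pagination link.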

In [27]:
df.shape


Out[27]:
(180, 2)

In [28]:
df.head()


Out[28]:
   title                                              date
0  An Exploration of R, Yelp, and the Search for ...  5 points by Rogerh91 6 hours ago | discuss
1  Deep Advances in Generative Modeling               7 points by gwulfs 15 hours ago | 1 comment
2  Spark Pipelines: Elegant Yet Powerful              3 points by aouyang1 9 hours ago | discuss
3  Shit VCs Say                                       3 points by Argentum01 10 hours ago | discuss
4  Python, Machine Learning, and Language Wars        4 points by pmigdal 17 hours ago | discuss

In [29]:
df.to_csv('data_tau.csv', encoding = "utf8", index = False)
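
The saved date column still holds relative text such as "6 hours ago". If an approximate timestamp is wanted later, one option is to subtract the stated age from the time of the scrape. The sketch below is an assumption-laden illustration: approximate_posted_at is a hypothetical helper, the scrape time is made up, and only minutes, hours and days are handled (the pages above show hours and days; minutes is a guess).

```python
import re
from datetime import datetime, timedelta

# Map the unit words to timedelta keyword arguments.
# Only minutes, hours and days are handled here; other units are an open case.
UNIT_TO_KWARG = {"minute": "minutes", "hour": "hours", "day": "days"}

def approximate_posted_at(date_text, scraped_at):
    """Estimate when an item was posted from text such as '... 4 hours ago | discuss'."""
    match = re.search(r"(\d+)\s+(minute|hour|day)s?\s+ago", date_text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_at - timedelta(**{UNIT_TO_KWARG[unit]: amount})

# Hypothetical scrape time, used only to make the example reproducible.
scraped_at = datetime(2016, 3, 23, 12, 0)
print(approximate_posted_at("5 points by Rogerh91 6 hours ago | discuss", scraped_at))
```

The result is only as precise as the text: "1 day ago" could mean anything from 24 to just under 48 hours, so these timestamps are best treated as rough estimates.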
