We all know that the most important aspect of data science or machine learning is data; with enough quality data you can do anything. It is also no mystery that the hard part of big data is getting that much data into a queryable, reportable, or understandable format. We now have a lot of amazing new tools to store that much data (Cassandra, HBase, and more), but I still believe that almost nothing beats collecting a good amount of structured data (not necessarily huge, although the more you have the better), and there is nothing more structured than SQL.

There is a lot of information power in the web, and crawling it gives you that power (or is at least the first step). Google does it, and I am pretty sure I don't have to say more; I cannot even begin to imagine the amount of work they do to understand that data. So I created my own mini crawler to collect what I call the relevant content of websites, more specifically blogs. Yes, I believe blogs, not Twitter, hold a lot of information power; that is why I am writing this in a blog.

All I needed was Python plus some libraries, mainly the Readability API. The idea is very simple: get the feed of each blog to list its posts, then ask Readability for the text content of each post. For now this code only works with Blogspot and WordPress blogs, because it is easy to get more than 10 posts from their feeds. Most blogs are on those services anyway.

The Readability API is beautiful because I don't have to write BeautifulSoup code for each site. I tried some implementations of the arc90 Readability (JavaScript and Python) without very good results. Still, if you need to go past the Readability API limit of 1000 posts per hour, those implementations are the way to go; they just work. I don't mind waiting 3.6 seconds for each post if the content is better.
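The 3.6 seconds come from dividing one hour by the limit: 3600 / 1000. The crawler below does not show where that pause happens, so here is a minimal sketch of one way to space out the calls. The `Throttle` class and its parameters are hypothetical names for illustration, not part of the Readability client.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        # 3600 seconds / 1000 calls = 3.6 seconds between calls.
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each request to the API would keep a single account under the hourly limit; the `clock` and `sleep` arguments exist only so the class can be tested without really sleeping.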

OK, here is the code!


In [1]:
import sqlalchemy as sql
from sqlalchemy.ext.declarative import declarative_base

In [2]:
engine = sql.create_engine('sqlite:///blogs.db')

In [3]:
Base = declarative_base()

In [4]:
class Post(Base):
    __tablename__ = 'post'
    url = sql.Column(sql.String(50), primary_key=True)
    date = sql.Column(sql.DateTime)
    content = sql.Column(sql.String(10000))

    def __init__(self, url, date, content):
        self.url = url
        self.date = date
        self.content = content

    def __repr__(self):
        return "<Post('%s','%s')>" % (self.url, self.date)

In [5]:
# Create the 'post' table in blogs.db if it does not exist yet.
Base.metadata.create_all(engine)

In [6]:
from __future__ import division
import math
import time
import logging
import requests
import feedparser
import dateutil.parser
from datetime import datetime
import readability
from bs4 import BeautifulSoup
import sqlalchemy as sql
from sqlalchemy.orm import sessionmaker

In [7]:
logger = logging.getLogger('crawler')
logger.setLevel(logging.INFO)
handler = logging.FileHandler('crawler.log')
f = logging.Formatter("%(asctime)s %(message)s")
handler.setFormatter(f)
logger.addHandler(handler)

In [8]:
blogs = [
    {'url': 'http://mypreciousconfessions.blogspot.com', 'kind': 'blogspot'},
    {'url': 'http://cupcakesandcashmere.com', 'kind': 'wordpress'},
]

Don't ask why those are fashion blogs, I just needed the data.

In [9]:
def parse_info(blog):
    """Return (feed_url, kind) for a blog dict with a 'feed' or 'url' key."""
    feed = ''
    kind = ''
    if 'feed' in blog:
        feed = blog['feed']
        if 'blogger.com' in blog['feed']:
            kind = 'blogspot'
        elif 'wordpress.com' in blog['feed']:
            kind = 'wordpress'
        else:
            # Custom domain: fall back to the kind declared in the dict.
            kind = blog.get('kind', '')
    elif 'url' in blog:
        if 'blogspot.com' in blog['url'] or blog.get('kind') == 'blogspot':
            # Blogspot pages advertise their feed in a <link rel="service.post"> tag.
            r = requests.get(blog['url'])
            html = r.text
            soup = BeautifulSoup(html, 'html.parser')
            feed = soup.find('link', rel='service.post')['href']
            kind = 'blogspot'
        elif 'wordpress.com' in blog['url'] or blog.get('kind') == 'wordpress':
            feed = blog['url'] + '/feed/'
            kind = 'wordpress'
    return feed, kind

In [10]:
def get_posts(blog, limit=10000):
    """Return up to `limit` (link, published_date) tuples for a blog."""
    feed, kind = parse_info(blog)
    posts = []
    if kind == 'blogspot':
        # Ask Blogspot for up to `limit` entries in one request.
        feed = feed + '?max-results=%i' % limit
        json_feed = feedparser.parse(feed)
        for entry in json_feed['entries']:
            date = dateutil.parser.parse(entry['published'])
            posts.append((entry['link'], date))
    elif kind == 'wordpress':
        # WordPress feeds are paged; this code assumes 10 posts per page.
        page = 1
        while page <= math.ceil(limit / 10):
            url = feed + '?paged=%i' % page
            r = requests.get(url)
            if r.status_code == 200:
                json_feed = feedparser.parse(r.text)
                for entry in json_feed['entries']:
                    if len(posts) < limit:
                        date = dateutil.parser.parse(entry['published'])
                        posts.append((entry['link'], date))
                page += 1
            else:
                # Past the last page (or the request failed): stop paging.
                break
    return posts

In [11]:
Session = sessionmaker(bind=engine)  # session factory for the engine above

def insert_post(post_link, date, content):
    session = Session()
    session.add(Post(post_link, date, content))
    session.commit()

In [12]:
def exists(post_link):
    session = Session()
    response = session.query(Post).filter(Post.url == post_link).all()
    return len(response) == 1

In [13]:
def crawl(blogs):
    parser = readability.ParserClient('YOUR_READABILITY_API')
    for blog in blogs:
        posts = get_posts(blog, limit=1000)
        n_posts = len(posts)
        if 'url' in blog:
            logger.info('{0} ({1})'.format(blog['url'], n_posts))
        else:
            logger.info('{0} ({1})'.format(blog['feed'], n_posts))
        for i, (post_link, post_date) in enumerate(posts):
            if exists(post_link):
                logger.info('{0}/{1} Already exists: {2}'.format(i + 1, n_posts, post_link))
            else:
                parser_response = parser.get_article_content(post_link)
                try:
                    soup = BeautifulSoup(parser_response.content['content'], 'html.parser')
                    # Flatten the article HTML to plain text.
                    content = soup.get_text(" ", strip=True)
                    content = content.replace('\t', ' ')
                    content = content.replace('"', '')
                    insert_post(post_link, post_date, content)
                except Exception as e:
                    logger.info('{0}/{1} FAIL: {2}'.format(i + 1, n_posts, post_link))
                else:
                    logger.info('{0}/{1} OK: {2}'.format(i + 1, n_posts, post_link))

That is it! You just need to call crawl(blogs).
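Once a crawl has finished, the data can be checked with plain SQL. This is a minimal sketch, assuming the `post` table defined above. It uses the standard-library `sqlite3` module and an in-memory database with two made-up rows so that it runs on its own; `summarize_posts` is a hypothetical helper, not part of the crawler.

```python
import sqlite3


def summarize_posts(conn):
    """Return (number of posts, average content length in characters)."""
    row = conn.execute(
        "SELECT COUNT(*), AVG(LENGTH(content)) FROM post"
    ).fetchone()
    return row[0], row[1]


# Stand-in for blogs.db: same table name and columns as the Post model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (url TEXT PRIMARY KEY, date TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO post VALUES (?, ?, ?)",
    [
        ("http://example.com/a", "2013-01-01 00:00:00", "hello world"),
        ("http://example.com/b", "2013-02-01 00:00:00", "hi"),
    ],
)
n_posts, avg_length = summarize_posts(conn)
print(n_posts, avg_length)
```

Pointing `sqlite3.connect` at `blogs.db` instead of `":memory:"` should give the same summary for the real crawl.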

Q: I need to crawl faster!

A: One easy way to double the crawling speed is to create another Readability account and cycle through the parsers, or even better, just contact Readability ;)
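Cycling through the parsers could look like the sketch below. It only shows the rotation: `assign_clients` is a hypothetical helper, and the strings stand in for `readability.ParserClient` objects created with different API keys.

```python
from itertools import cycle


def assign_clients(post_links, clients):
    """Pair every post link with the next client in round-robin order."""
    return list(zip(post_links, cycle(clients)))


# Placeholder values; in the crawler these would be real links and parsers.
links = ['post-1', 'post-2', 'post-3']
parsers = ['parser-a', 'parser-b']
pairs = assign_clients(links, parsers)
print(pairs)
```

Each parser then handles every second post, so two accounts together stay under their individual hourly limits at twice the overall rate.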

Q: Why is this data useful (spoiler of my next post)?

A: https://code.google.com/p/word2vec/