We need to collect data from the net.
To collect links from the internet we will use a simple tool called Scrapy: http://doc.scrapy.org/
pip install Scrapy
Starting a tutorial project in the code directory: http://doc.scrapy.org/en/latest/intro/tutorial.html
mkdir code
cd code
scrapy startproject tutorial
cd tutorial
scrapy genspider example example.com
In [1]:
!ls -Rl code/tutorial
In [ ]:
# %load code/tutorial/tutorial/items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


# Adding a container to hold scraped data, using scrapy.Item as the parent class
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
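A scrapy.Item behaves like a dict, so fields are set and read by key. A quick sketch of how DmozItem is used (the values here are made up for illustration):

item = DmozItem(title='Example Book', link='http://example.com', desc='a sample entry')
print(item['title'])       # fields are accessed like dict keys
item['desc'] = 'updated'   # and can be reassigned the same way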
In [ ]:
# %load code/tutorial/tutorial/spiders/dmoz_spider.py
import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # save each page body to a local file named after the last URL segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
In [4]:
cd code/tutorial
In [1]:
import requests
from scrapy.http import TextResponse
# fetch a URL with requests and wrap it in a Scrapy TextResponse
r = requests.get('http://stackoverflow.com/')
response = TextResponse(r.url, body=r.text, encoding='utf-8')
In [2]:
print(response)
In [3]:
# Using XPath to access data
response.xpath('//title')
Out[3]:
In [4]:
response.xpath('//title/text()').extract()
Out[4]:
In [5]:
response.xpath('//ul/li')
Out[5]:
In [6]:
response.xpath('//ul/li/a/@href').extract()
Out[6]:
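Putting these selectors to work: a sketch of how the spider's parse method could populate the DmozItem defined earlier (the //ul/li structure is assumed from the dmoz.org pages in the Scrapy tutorial):

def parse(self, response):
    # one DmozItem per list entry on the page
    for sel in response.xpath('//ul/li'):
        item = DmozItem()
        item['title'] = sel.xpath('a/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
        item['desc'] = sel.xpath('text()').extract()
        yield item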
In a terminal window, go to the tutorial project directory and type:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
Try running the crawler to see the response, using:
scrapy crawl dmoz
Let's try an example crawler that creates a CSV, from:
https://github.com/mjhea0/Scrapy-Samples/tree/master/crawlspider
I cloned the code into the code directory as code/Scrapy-Samples-master and added a parent URL column to the output CSV file. To run it, cd to code/Scrapy-Samples-master/crawlspider and run:
scrapy crawl craigs -o items.csv -t csv
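For reference, the crawler in that repo is a Scrapy CrawlSpider. A minimal sketch of the general shape, not the repo's exact code (the XPaths, the allow pattern, and yielding plain dicts, which Scrapy supports since 1.0, are assumptions here):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CraigsSpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    # follow pagination links and hand each result page to parse_page
    rules = (
        Rule(LinkExtractor(allow=(r"search/npo",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # one record per listing; 'parent' records the page it was found on
        for sel in response.xpath('//p[@class="row"]'):
            yield {
                'title': sel.xpath('.//a/text()').extract_first(),
                'link': sel.xpath('.//a/@href').extract_first(),
                'parent': response.url,
            }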
The first 20 lines of the CSV file:
In [ ]:
# %load -r 1-20 code/Scrapy-Samples-master/crawlspider/items.csv
link,parent,title
/pen/npo/5295330281.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: 8am-1pm M-F
/sby/npo/5295329495.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor-Teen Group Home: Graveyard
/pen/npo/5295329005.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: Teen Group Home (sexually exploited) Graveyard
/pen/npo/5295328420.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: Teen Group Home (sexually exploited) Graveyard
/pen/npo/5295319941.html,http://sfbay.craigslist.org/search/npo?s=100,Open House 11/5 9am-12pm! Residential Counselor: Teen Group Home
/sby/npo/5295318633.html,http://sfbay.craigslist.org/search/npo?s=100,Open House 11/5 9am-12pm! Residential Counselor: Teen Group Home
/sby/npo/5295314523.html,http://sfbay.craigslist.org/search/npo?s=100,Open House! Residential Counselor: Teen Group Home
/pen/npo/5295313019.html,http://sfbay.craigslist.org/search/npo?s=100,Open House! Residential Counselor: Teen Group Home
/nby/npo/5293967692.html,http://sfbay.craigslist.org/search/npo?s=100,"Mental Health: part-time Care Manager, Sun-Wed"
/eby/npo/5293356884.html,http://sfbay.craigslist.org/search/npo?s=100,Executive Assistant for Progressive Organizations - Full or Part Time
/sfc/npo/5293340020.html,http://sfbay.craigslist.org/search/npo?s=100,Curriculum & Instruction Director - Level Playing Field Institute
/sfc/npo/5293326898.html,http://sfbay.craigslist.org/search/npo?s=100,Site Director - Level Playing Field
/nby/npo/5293324327.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Treatment Evening/Weekend Supervisor
/nby/npo/5293323754.html,http://sfbay.craigslist.org/search/npo?s=100,Group Counselor
/eby/npo/5293294634.html,http://sfbay.craigslist.org/search/npo?s=100,Job Developer / Vocational Counselor
/nby/npo/5293293327.html,http://sfbay.craigslist.org/search/npo?s=100,Counselor / Case Managers
/nby/npo/5293292799.html,http://sfbay.craigslist.org/search/npo?s=100,Assistant Program Director
/sfc/npo/5293292256.html,http://sfbay.craigslist.org/search/npo?s=100,PROGRAM MANAGER
/nby/npo/5293291685.html,http://sfbay.craigslist.org/search/npo?s=100,Licensed Therapist - Part Time
I am including implementations of the PageRank and HITS algorithms from:
https://cs7083.wordpress.com/2013/01/31/demystifying-the-pagerank-and-hits-algorithms/
In [31]:
from numpy import *

def pagerank(H):
    n = len(H)
    w = zeros(n)
    rho = 1./n * ones(n)
    # flag dangling nodes (rows that are all zeros)
    for i in range(n):
        if multiply.reduce(H[i] == zeros(n)):
            w[i] = 1
    # replace each dangling row with the uniform distribution
    newH = H + outer((1./n * w), ones(n))
    theta = 0.85
    # Google matrix: damped combination of newH and the teleportation matrix
    G = (theta * newH) + ((1 - theta) * outer(1./n * ones(n), ones(n)))
    print(rho)
    # power iteration: repeatedly multiply the distribution by G
    for j in range(10):
        rho = dot(rho, G)
        print(rho)
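In matrix form, the loop above is the power method on the Google matrix: with newH the row-stochastic matrix obtained after dangling rows are replaced by uniform rows, and damping factor θ = 0.85,

$$G = \theta \, \text{newH} + (1 - \theta) \, \tfrac{1}{n} \mathbf{1}\mathbf{1}^T, \qquad \rho_{k+1} = \rho_k G$$

The ten iterations drive ρ toward the stationary distribution of G, which is the PageRank vector.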
In [32]:
def hits(A):
    n = len(A)
    Au = dot(transpose(A), A)   # authority update matrix, A^T A
    Hu = dot(A, transpose(A))   # hub update matrix, A A^T
    a = ones(n); h = ones(n)
    print(a, h)
    # iterate the mutually reinforcing updates, normalizing each step
    for j in range(5):
        a = dot(a, Au)
        a = a / sum(a)
        h = dot(h, Hu)
        h = h / sum(h)
        print(a, h)
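Each pass of the loop is one step of the standard HITS power iteration, with both vectors normalized to sum to 1:

$$a_{k+1} = \frac{a_k (A^T A)}{\lVert a_k (A^T A) \rVert_1}, \qquad h_{k+1} = \frac{h_k (A A^T)}{\lVert h_k (A A^T) \rVert_1}$$

The authority vector a converges to the principal eigenvector of $A^T A$ and the hub vector h to that of $A A^T$.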
Now we need to create the stochastic matrix to perform the analysis.
In [35]:
H2 = array([[0,    1./2, 0,    0,    1./2, 0],
            [0,    0,    1,    0,    0,    0],
            [1./3, 1./3, 0,    1./3, 0,    0],
            [1./2, 0,    0,    0,    1./2, 0],
            [1./2, 0,    0,    1./2, 0,    0],
            [0,    1./2, 1./2, 0,    0,    0]])
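As a quick sanity check, every row of a stochastic matrix with no dangling nodes should sum to 1:

print(H2.sum(axis=1))   # expect [ 1.  1.  1.  1.  1.  1.]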
In [36]:
pagerank(H2)
In [37]:
# For the HITS algorithm we only need the connectivity (adjacency) matrix
A2 = array([[0, 1, 0, 0, 1, 0],
            [0, 0, 1, 0, 0, 0],
            [1, 1, 0, 1, 0, 0],
            [1, 0, 0, 0, 1, 0],
            [1, 0, 0, 1, 0, 0],
            [0, 1, 1, 0, 0, 0]])
hits(A2)
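To tie this back to the crawling section, link data such as the scraped (parent, link) pairs can be converted into the adjacency matrix HITS needs. A minimal sketch, assuming pages have already been mapped to integer indices (adjacency_from_edges is a hypothetical helper, not from the source post; the edge list simply re-encodes the six-node example graph above):

import numpy as np

def adjacency_from_edges(edges, n):
    # hypothetical helper: build a 0/1 adjacency matrix
    # from (source, target) index pairs
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1
    return A

# the six-node example graph above, written as an edge list
edges = [(0, 1), (0, 4), (1, 2), (2, 0), (2, 1), (2, 3),
         (3, 0), (3, 4), (4, 0), (4, 3), (5, 1), (5, 2)]
A = adjacency_from_edges(edges, 6)
hits(A)

# dividing each row by its out-degree recovers the stochastic matrix H2
H = A / A.sum(axis=1)[:, None]
pagerank(H)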
In [ ]: