Link Analysis

We need to collect data from the web.

Simple Spiders

Tools to collect links from the internet. We will use a simple tool called Scrapy. http://doc.scrapy.org/

pip install Scrapy

Starting a tutorial project in the code directory: http://doc.scrapy.org/en/latest/intro/tutorial.html

mkdir code
cd code
scrapy startproject tutorial
cd tutorial
scrapy genspider example example.com


In [1]:
!ls -Rl code/tutorial


code/tutorial:
total 5
-rwxrwxr-x+ 1 PS None   0 Nov  1 21:59 dmoz_crawl_results.txt
-rwxrwxr-x+ 1 PS None 260 Nov  1 21:33 scrapy.cfg
drwxrwxr-x+ 1 PS None   0 Nov  1 21:47 tutorial

code/tutorial/tutorial:
total 13
-rwxrwxr-x+ 1 PS None    0 Nov  1 20:56 __init__.py
-rwxrwxr-x+ 1 PS None  184 Nov  1 21:35 __init__.pyc
-rwxrwxr-x+ 1 PS None  475 Nov  1 21:47 items.py
-rwxrwxr-x+ 1 PS None  287 Nov  1 21:33 items.py~
-rwxrwxr-x+ 1 PS None  288 Nov  1 21:33 pipelines.py
-rwxrwxr-x+ 1 PS None 3004 Nov  1 21:33 settings.py
-rwxrwxr-x+ 1 PS None  296 Nov  1 21:35 settings.pyc
drwxrwxr-x+ 1 PS None    0 Nov  1 21:56 spiders

code/tutorial/tutorial/spiders:
total 13
-rwxrwxr-x+ 1 PS None 161 Nov  1 20:56 __init__.py
-rwxrwxr-x+ 1 PS None 192 Nov  1 21:35 __init__.pyc
-rw-rwxr--+ 1 PS None 333 Nov  1 21:56 dmoz_spider.py
-rw-rwxr--+ 1 PS None 402 Nov  1 21:49 dmoz_spider.py~
-rwxrwxr-x+ 1 PS None 999 Nov  1 21:56 dmoz_spider.pyc
-rwxrwxr-x+ 1 PS None 240 Nov  1 21:35 example.py
-rwxrwxr-x+ 1 PS None 866 Nov  1 21:56 example.pyc

In [ ]:
# %load code/tutorial/tutorial/items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

# A container to hold the scraped data, using scrapy.Item as the parent class
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Creating the first spider

Create a file in the spiders directory and set the name, start_urls, and parse function.


In [ ]:
# %load code/tutorial/tutorial/spiders/dmoz_spider.py
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Save the page body to a file named after the last path segment of the URL
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

In [4]:
cd code/tutorial


C:\users\ps\My Documents\GitHub\big-data-python-class\Lectures\code\tutorial

In [1]:
import requests
from scrapy.http import TextResponse

# Fetching a page with requests and wrapping it in a Scrapy TextResponse
r = requests.get('http://stackoverflow.com/')
response = TextResponse(r.url, body=r.text, encoding='utf-8')

In [2]:
print response


<200 http://stackoverflow.com/>

In [3]:
#Using xpath to access data
response.xpath('//title')


Out[3]:
[<Selector xpath='//title' data=u'<title>Stack Overflow</title>'>]

In [4]:
response.xpath('//title/text()').extract()


Out[4]:
[u'Stack Overflow']

In [5]:
response.xpath('//ul/li')


Out[5]:
[<Selector xpath='//ul/li' data=u'<li>\r\n                        <div class'>,
 <Selector xpath='//ul/li' data=u'<li class="related-site">\r\n             '>,
 <Selector xpath='//ul/li' data=u'<li class="related-site">\r\n             '>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                        <a href="/'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                    <a href="/help'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                        <a href="/'>,
 <Selector xpath='//ul/li' data=u'<li><a id="nav-questions" href="/questio'>,
 <Selector xpath='//ul/li' data=u'<li><a id="nav-tags" href="/tags">Tags</'>,
 <Selector xpath='//ul/li' data=u'<li><a id="nav-users" href="/users">User'>,
 <Selector xpath='//ul/li' data=u'<li><a id="nav-badges" href="/help/badge'>,
 <Selector xpath='//ul/li' data=u'<li><a id="nav-unanswered" href="/unansw'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                            <a id='>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li>\r\n                <div class="favico'>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>,
 <Selector xpath='//ul/li' data=u'<li class="dno js-hidden">\r\n            '>]

In [6]:
response.xpath('//ul/li/a/@href').extract()


Out[6]:
[u'//stackoverflow.com',
 u'http://meta.stackoverflow.com',
 u'//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider',
 u'/tour',
 u'/help',
 u'//meta.stackoverflow.com',
 u'/questions',
 u'/tags',
 u'/users',
 u'/help/badges',
 u'/unanswered',
 u'/questions/ask',
 u'http://chemistry.stackexchange.com/questions/40058/photochemistry-of-beta-gamma-unsaturated-ketones',
 u'http://mathoverflow.net/questions/222376/argument-of-zariski-density-to-prove-rationality-of-a-regular-map',
 u'http://worldbuilding.stackexchange.com/questions/28844/how-many-major-cities-would-need-to-be-destroyed-by-nuclear-strikes-to-completel',
 u'http://academia.stackexchange.com/questions/57314/choose-journal-before-writing-paper',
 u'http://scifi.stackexchange.com/questions/106610/looking-for-a-futuristic-story-about-american-football',
 u'http://movies.stackexchange.com/questions/42848/in-back-to-the-future-how-did-marty-get-the-timing-right-with-the-lightning-str',
 u'http://mathematica.stackexchange.com/questions/98419/incomplete-elliptic-integral-of-first-kind',
 u'http://physics.stackexchange.com/questions/215997/what-is-the-smallest-item-for-which-gravity-has-been-recorded-or-observed',
 u'http://academia.stackexchange.com/questions/57286/how-to-deal-with-classmate-who-refuses-to-acknowledge-self-plagiarism-on-group-p',
 u'http://cs.stackexchange.com/questions/48977/when-to-use-djikstra-or-bellman-kallaba-algorithm',
 u'http://skeptics.stackexchange.com/questions/30599/have-islamic-prayers-been-introduced-into-ontario-public-schools-while-christian',
 u'http://security.stackexchange.com/questions/103908/is-there-any-real-value-in-hashing-salting-passwords',
 u'http://politics.stackexchange.com/questions/9184/why-would-governments-legalize-marijuana',
 u'http://codegolf.stackexchange.com/questions/62350/halloween-golf-the-2spooky4me-challenge',
 u'http://bicycles.stackexchange.com/questions/35338/super-heavyweight-biker-keeps-destroying-dutch-bikes',
 u'http://askubuntu.com/questions/692599/how-to-make-open-in-terminal-in-the-right-click-menu-use-terminator-instead-gn',
 u'http://mathoverflow.net/questions/222403/arithmetical-results-to-help-study-arithmetic-geometry',
 u'http://programmers.stackexchange.com/questions/301400/why-is-the-git-git-objects-folder-subdivided-in-many-sha-prefix-folders',
 u'http://unix.stackexchange.com/questions/240146/why-tmux-sets-term-variable-to-screen',
 u'http://scifi.stackexchange.com/questions/106582/harry-potter-why-7',
 u'http://electronics.stackexchange.com/questions/198428/general-question-about-analog-and-digital-signals',
 u'http://academia.stackexchange.com/questions/57042/students-staying-hours-past-end-of-office-hours',
 u'http://codegolf.stackexchange.com/questions/62476/sort-this-quick',
 u'http://scifi.stackexchange.com/questions/106203/what-is-the-latest-date-for-the-setting-in-a-sci-fi-text-or-movie']
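
The extracted hrefs mix scheme-relative (//host), site-relative (/path), and absolute URLs. Before feeding them to a crawler they can be resolved against the page URL; a minimal sketch using the standard library (Python 3's urllib.parse; in Python 2 the same function lives in the urlparse module):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://stackoverflow.com/'
hrefs = ['//stackoverflow.com', '/tour', 'http://meta.stackoverflow.com']

# urljoin fills in whatever the href omits: the scheme for //host links,
# scheme and host for /path links; absolute URLs pass through unchanged
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
```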

Scrapy shell

In a command/terminal window, go to the tutorial project directory and type:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

Try running the crawler to see the response using:

scrapy crawl dmoz

Let's try an example crawler that creates a csv from:

https://github.com/mjhea0/Scrapy-Samples/tree/master/crawlspider

I cloned the code into the code directory as code/Scrapy-Samples-master and added a parent URL column to the output csv file. To run it, cd to code/Scrapy-Samples-master/crawlspider and run:

scrapy crawl craigs -o items.csv -t csv

A sample of the first 20 lines from the csv file:


In [ ]:
# %load -r 1-20  code/Scrapy-Samples-master/crawlspider/items.csv
link,parent,title
/pen/npo/5295330281.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: 8am-1pm M-F
/sby/npo/5295329495.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor-Teen Group Home: Graveyard
/pen/npo/5295329005.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: Teen Group Home (sexually exploited) Graveyard
/pen/npo/5295328420.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor: Teen Group Home (sexually exploited) Graveyard
/pen/npo/5295319941.html,http://sfbay.craigslist.org/search/npo?s=100,Open House 11/5 9am-12pm! Residential Counselor: Teen Group Home
/sby/npo/5295318633.html,http://sfbay.craigslist.org/search/npo?s=100,Open House 11/5 9am-12pm! Residential Counselor: Teen Group Home
/sby/npo/5295314523.html,http://sfbay.craigslist.org/search/npo?s=100,Open House! Residential Counselor: Teen Group Home
/pen/npo/5295313019.html,http://sfbay.craigslist.org/search/npo?s=100,Open House! Residential Counselor: Teen Group Home
/nby/npo/5293967692.html,http://sfbay.craigslist.org/search/npo?s=100,"Mental Health: part-time Care Manager, Sun-Wed"
/eby/npo/5293356884.html,http://sfbay.craigslist.org/search/npo?s=100,Executive Assistant for Progressive Organizations - Full or Part Time
/sfc/npo/5293340020.html,http://sfbay.craigslist.org/search/npo?s=100,Curriculum & Instruction Director - Level Playing Field Institute
/sfc/npo/5293326898.html,http://sfbay.craigslist.org/search/npo?s=100,Site Director - Level Playing Field
/nby/npo/5293324327.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Treatment Evening/Weekend Supervisor
/nby/npo/5293323754.html,http://sfbay.craigslist.org/search/npo?s=100,Group Counselor
/eby/npo/5293294634.html,http://sfbay.craigslist.org/search/npo?s=100,Job Developer / Vocational Counselor
/nby/npo/5293293327.html,http://sfbay.craigslist.org/search/npo?s=100,Counselor / Case Managers
/nby/npo/5293292799.html,http://sfbay.craigslist.org/search/npo?s=100,Assistant Program Director
/sfc/npo/5293292256.html,http://sfbay.craigslist.org/search/npo?s=100,PROGRAM MANAGER
/nby/npo/5293291685.html,http://sfbay.craigslist.org/search/npo?s=100,Licensed Therapist - Part Time
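
Since the link column holds site-relative paths and parent holds the listing URL, the csv can be post-processed to rebuild absolute URLs. A sketch with the standard csv module, using an inline two-row sample in the same shape as items.csv rather than reading the actual file:

```python
import csv
import io
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Inline sample with the same header as items.csv (link,parent,title)
sample = io.StringIO(
    "link,parent,title\n"
    "/pen/npo/5295330281.html,http://sfbay.craigslist.org/search/npo?s=100,Residential Counselor\n"
    "/sby/npo/5295329495.html,http://sfbay.craigslist.org/search/npo?s=100,Group Counselor\n"
)
rows = list(csv.DictReader(sample))
for row in rows:
    # Resolve the relative link against the parent listing page
    row['url'] = urljoin(row['parent'], row['link'])
print(rows[0]['url'])
```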

In [15]:

I am including implementations of PageRank and HITS from
https://cs7083.wordpress.com/2013/01/31/demystifying-the-pagerank-and-hits-algorithms/

In [31]:
from numpy import *
 
def pagerank(H):
    # H is a row-stochastic hyperlink matrix (dangling nodes have all-zero rows)
    n = len(H)
    w = zeros(n)
    rho = 1./n * ones(n)
    # Flag dangling nodes (rows of all zeros)
    for i in range(n):
        if multiply.reduce(H[i] == zeros(n)):
            w[i] = 1
    # Replace each dangling row with a uniform distribution
    newH = H + outer((1./n * w), ones(n))

    # Google matrix: damped link-following plus uniform teleportation
    theta = 0.85
    G = (theta * newH) + ((1-theta) * outer(1./n * ones(n), ones(n)))
    print rho
    # Power iteration: rho converges to the stationary distribution of G
    for j in range(10):
        rho = dot(rho, G)
        print rho

In [32]:
def hits(A):
    # A is the adjacency matrix of the link graph
    n = len(A)
    Au = dot(transpose(A), A)  # authority matrix A^T A
    Hu = dot(A, transpose(A))  # hub matrix A A^T
    a = ones(n); h = ones(n)
    print a, h
    # Power iteration with L1 normalization at each step
    for j in range(5):
        a = dot(a, Au)
        a = a / sum(a)
        h = dot(h, Hu)
        h = h / sum(h)
        print a, h
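
The loop above is a power iteration: a converges (up to normalization) to the principal eigenvector of A^T A and h to that of A A^T. A tiny self-contained check on a 3-node directed cycle, where symmetry forces every node to the same score (an illustrative example I added, not part of the original post):

```python
import numpy as np

# Directed cycle 0 -> 1 -> 2 -> 0
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
Au = A.T @ A  # authority matrix; for a cycle this is the identity
a = np.ones(3)
for _ in range(5):
    a = a @ Au
    a = a / a.sum()
# Every node is equally authoritative: a == [1/3, 1/3, 1/3]
print(a)
```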

Now we need to create the stochastic matrix to perform the analysis:

H2 = array([[0, 1./2, 0, 0, 1./2, 0],
            [0, 0, 1, 0, 0, 0],
            [1./3, 1./3, 0, 1./3, 0, 0],
            [1./2, 0, 0, 0, 1./2, 0],
            [1./2, 0, 0, 1./2, 0, 0],
            [0, 1./2, 1./2, 0, 0, 0]])


In [35]:
H2 = array([[0, 1./2, 0, 0, 1./2, 0],
            [0, 0, 1, 0, 0, 0],
            [1./3, 1./3, 0, 1./3, 0, 0],
            [1./2, 0, 0, 0, 1./2, 0],
            [1./2, 0, 0, 1./2, 0, 0],
            [0, 1./2, 1./2, 0, 0, 0]])

In [36]:
pagerank(H2)


[ 0.16666667  0.16666667  0.16666667  0.16666667  0.16666667  0.16666667]
[ 0.21388889  0.21388889  0.2375      0.14305556  0.16666667  0.025     ]
[ 0.22392361  0.19381944  0.21743056  0.163125    0.17670139  0.025     ]
[ 0.23103154  0.19239786  0.20037153  0.16170341  0.18949566  0.025     ]
[ 0.23103154  0.19058534  0.19916318  0.16230759  0.19191236  0.025     ]
[ 0.23197304  0.19024297  0.19762254  0.16299232  0.19216913  0.025     ]
[ 0.23193667  0.1902066   0.19733153  0.16266493  0.19286028  0.025     ]
[ 0.23200881  0.19010868  0.19730061  0.16287622  0.19270568  0.025     ]
[ 0.23202414  0.19013058  0.19721738  0.16280175  0.19282614  0.025     ]
[ 0.23202011  0.19011352  0.197236    0.16282937  0.19280101  0.025     ]
[ 0.23202644  0.19011708  0.19722149  0.16282396  0.19281103  0.025     ]
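
The last rows have essentially converged. Since H2 has no dangling (all-zero) rows, the Google matrix here is just G = theta*H2 + (1-theta)/n * ones, and the limit rho should satisfy rho = rho G. A quick standalone check of that fixed point (my own sketch, outside the original pagerank function; written with 1./2-style literals so it runs under Python 2 as well):

```python
import numpy as np

H2 = np.array([[0, 1./2, 0, 0, 1./2, 0],
               [0, 0, 1, 0, 0, 0],
               [1./3, 1./3, 0, 1./3, 0, 0],
               [1./2, 0, 0, 0, 1./2, 0],
               [1./2, 0, 0, 1./2, 0, 0],
               [0, 1./2, 1./2, 0, 0, 0]])
n = len(H2)
theta = 0.85
# No dangling rows in H2, so the damped matrix needs no dangling-node fix
G = theta * H2 + (1 - theta) / n * np.ones((n, n))

rho = np.ones(n) / n
for _ in range(100):
    rho = np.dot(rho, G)

# Stationarity: rho G == rho, and rho remains a probability distribution
print(np.allclose(np.dot(rho, G), rho), round(rho.sum(), 6))
```

Note that node 5 has no in-links (column 5 of H2 is all zeros), so its score is pure teleportation mass, (1-theta)/n = 0.025 — exactly the constant last column in the printout above.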

In [37]:
# For the HITS algorithm we only need the adjacency (connectivity) matrix
A2 = array([[0, 1, 0, 0, 1, 0],
            [0, 0, 1, 0, 0, 0],
            [1, 1, 0, 1, 0, 0],
            [1, 0, 0, 0, 1, 0],
            [1, 0, 0, 1, 0, 0],
            [0, 1, 1, 0, 0, 0]])
hits(A2)


[ 1.  1.  1.  1.  1.  1.] [ 1.  1.  1.  1.  1.  1.]
[ 0.26923077  0.26923077  0.11538462  0.19230769  0.15384615  0.        ] [ 0.16666667  0.06666667  0.26666667  0.16666667  0.16666667  0.16666667]
[ 0.28378378  0.27027027  0.08783784  0.20945946  0.14864865  0.        ] [ 0.16666667  0.04166667  0.29166667  0.16666667  0.18452381  0.14880952]
[ 0.29205607  0.26635514  0.0771028   0.21728972  0.14719626  0.        ] [ 0.16356108  0.03312629  0.30020704  0.16977226  0.19461698  0.13871636]
[ 0.29650462  0.26355966  0.0723182   0.22097228  0.14664524  0.        ] [ 0.16131335  0.0296217   0.30371163  0.17201999  0.19985724  0.13347609]
[ 0.29880066  0.26199338  0.07003033  0.22277364  0.14640199  0.        ] [ 0.16004659  0.02801275  0.30532058  0.17328675  0.20252544  0.1308079 ]

In [ ]: