Gathering Data from the Web

Website Crawler (spider)

Automatically traverses the hyperlinks of the World Wide Web and gathers web pages.

Snowball Sampling: Start from some seed URLs and recursively extract hyperlinks to other URLs.
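The snowball idea can be sketched with Python's standard library alone. This is a minimal breadth-first sketch, not a production crawler: the `fetch` argument is a hypothetical placeholder for a function that performs an HTTP GET (and, in practice, honors robots.txt and rate limits).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def snowball(seed_urls, fetch, max_pages=100):
    """Breadth-first snowball crawl: start from seeds, follow extracted links.
    `fetch` maps a URL to its HTML text (caller-supplied)."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            # Resolve relative links against the current page's URL
            queue.append(urljoin(url, link))
    return visited
```

Because `fetch` is injected, the traversal logic can be exercised on canned pages before any real network code is written.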

Restrictions to consider:

  1. To avoid overloading servers, website admins usually provide a guideline file named "robots.txt" that specifies crawling restrictions for that website.
  2. Meta tags also convey restrictions to a crawler
    Example: <META NAME="ROBOTS" CONTENT="NOINDEX">

    NOINDEX: do not appear in Google's index
    NOFOLLOW: do not follow links on this page
    NOARCHIVE: do not show an archived (cached) copy in search results
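Python's standard library can check these robots.txt rules directly. Below is a small sketch using `urllib.robotparser`; the robots.txt body and host name are made-up examples (a real crawler would download `http://<host>/robots.txt` with `rp.set_url(...)` and `rp.read()` instead of parsing a string).

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body as a site admin might publish it (made-up example)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether our crawler may fetch a given URL before requesting it
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```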

wget

Wget is free GNU software for crawling and retrieving data over HTTP, HTTPS, and FTP.

Wget supports

  • recursively traversing HTML documents and FTP directory trees
  • wildcards to match certain files
  • restricting the max depth of directory traversal
Examples
  • Retrieve the index.html file from vahidmirjalili.com, retrying up to 20 times if access fails.

    wget -t 20 http://vahidmirjalili.com
    
  • Recursively retrieve files (default depth = 5)

    wget -r http://vahidmirjalili.com
    
wget options
-O                         specify output file
--limit-rate=200k          limit the download speed
-b                         download in background
--user-agent="Mozilla/.."  identify wget as a browser
-i list-of-urls.txt        download multiple URLs listed in an input file
--mirror                   turn on mirroring options
-p                         download all page requisites (images, CSS, etc.)
--convert-links            convert links to allow local viewing

Download full website to be viewed locally:

wget --mirror -p --convert-links http://vahidmirjalili.com

Retrieving Data Using APIs

Format of Data Returned From APIs:

  • JSON (Javascript Object Notation)
  • XML (Extensible Markup Language)

Encoding Data into JSON in PHP:

Using array:

<?php
   $book = array("code" => "DS110",
                 "title" => "Elements of Data Science",
                 "Year" => "2016");

   echo json_encode($book);
?>

Using class:

<?php
   class Book {
       public $code = "";
       public $title = "";
       public $year = "";
   }

   $b = new Book();
   $b->code = "DS110";
   $b->title = "Elements of Data Science";
   $b->year = "2016";

   echo json_encode($b);
?>

Decoding JSON Data in PHP:

<?php
   $myjson = '{"a":"1", "b":"2", "c":"3"}';
   $arr = json_decode($myjson, true);

   echo $arr['a']." ".$arr['b']."<BR>";
?>
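For comparison, the same encode/decode round trip with Python's built-in `json` module (a sketch mirroring the PHP examples above):

```python
import json

# Encoding: dict -> JSON string (counterpart of PHP's json_encode)
book = {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}
encoded = json.dumps(book)
print(encoded)  # {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}

# Decoding: JSON string -> dict (counterpart of json_decode($myjson, true))
myjson = '{"a":"1", "b":"2", "c":"3"}'
arr = json.loads(myjson)
print(arr["a"], arr["b"])  # 1 2
```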

Twitter API

  • Streaming API

  • Search (REST) API

Python Twitter Search API

import tweepy
import sys

C_KEY    = 'XXXXX'
C_SECRET = 'XXXXX'
ACCESS_TOKEN_KEY = 'XXXXX'
ACCESS_TOKEN_SECRET = 'XXXXX'

# Authentication
auth = tweepy.OAuthHandler(C_KEY, C_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Search API
if len(sys.argv) == 1:
    print("Please provide a keyword to search")
else:
    # Query the Search (REST) API for up to 15 matching tweets
    posts = api.search(q=sys.argv[1], count=15)
    for tweet in posts:
        print(" " + tweet.text)
