Gathering Data from the Web

Website Crawler (spider)

Automatically traverses the hyperlinks of the World Wide Web and gathers web pages.

Snowball Sampling: Start from some seed URLs and recursively extract hyperlinks to other URLs.
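The snowball idea can be sketched with Python's standard library alone. This is a minimal breadth-first sketch, not a production crawler: the `fetch` argument is a hypothetical placeholder for a function that performs an HTTP GET (and, in practice, honors robots.txt and rate limits).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def snowball(seed_urls, fetch, max_pages=100):
    """Breadth-first snowball crawl: start from seeds, follow extracted links.
    `fetch` maps a URL to its HTML text (caller-supplied)."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            # Resolve relative links against the current page's URL
            queue.append(urljoin(url, link))
    return visited
```

Because `fetch` is injected, the traversal logic can be exercised on canned pages before any real network code is written.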

Restrictions to consider:

  1. To avoid overloading servers, website admins usually provide a guideline file named "robots.txt" that specifies crawling restrictions for that website.
  2. Meta tags also convey restrictions to a crawler
    Example: <META NAME="ROBOTS" CONTENT="NOINDEX">

    NOINDEX: do not appear in Google's index
    NOFOLLOW: do not follow links on this page
    NOARCHIVE: do not show an archived (cached) copy in search results
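Python's standard library can check these robots.txt rules directly. Below is a small sketch using `urllib.robotparser`; the robots.txt body and host name are made-up examples (a real crawler would download `http://<host>/robots.txt` with `rp.set_url(...)` and `rp.read()` instead of parsing a string).

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body as a site admin might publish it (made-up example)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether our crawler may fetch a given URL before requesting it
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```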

wget

Wget is free GNU software for crawling and retrieving data over HTTP, HTTPS, and FTP.

Wget supports

  • recursively traversing HTML documents and FTP directory trees
  • wildcards to match certain files
  • restricting the max depth of directory traversal
Examples
  • Retrieve the index.html file from vahidmirjalili.com, retrying up to 20 times if access fails.

    wget -t 20 http://vahidmirjalili.com
    
  • Recursively retrieve files (default depth = 5)

    wget -r http://vahidmirjalili.com
    
wget options
-O                         specify output file
--limit-rate=200k          limit the download speed
-b                         download in background
--user-agent="Mozilla/.."  identify wget as a browser
-i list-of-urls.txt        download multiple URLs listed in an input file
--mirror                   turn on mirroring options
-p                         download all page requisites (images, CSS, etc.)
--convert-links            convert links to allow local viewing

Download full website to be viewed locally:

wget --mirror -p --convert-links http://vahidmirjalili.com

Retrieving Data Using APIs

Format of Data Returned From APIs:

  • JSON (Javascript Object Notation)
  • XML (Extensible Markup Language)

Encoding Data into JSON in PHP:

Using array:

<?php
   $book = array("code" => "DS110",
                 "title" => "Elements of Data Science",
                 "Year" => "2016");

   echo json_encode($book);
?>

Using class:

<?php
   class Book {
       public $code = "";
       public $title = "";
       public $year = "";
   }

   $b = new Book();
   $b->code = "DS110";
   $b->title = "Elements of Data Science";
   $b->year = "2016";

   echo json_encode($b);
?>

Decoding JSON Data in PHP:

<?php
   $myjson = '{"a":"1", "b":"2", "c":"3"}';
   $arr = json_decode($myjson, true);

   echo $arr['a']." ".$arr['b']."<BR>";
?>
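For comparison, the same encode/decode round trip with Python's built-in `json` module (a sketch mirroring the PHP examples above):

```python
import json

# Encoding: dict -> JSON string (counterpart of PHP's json_encode)
book = {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}
encoded = json.dumps(book)
print(encoded)  # {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}

# Decoding: JSON string -> dict (counterpart of json_decode($myjson, true))
myjson = '{"a":"1", "b":"2", "c":"3"}'
arr = json.loads(myjson)
print(arr["a"], arr["b"])  # 1 2
```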

Twitter API

  • Streaming API

  • Search (REST) API

Python Twitter Search API

import tweepy
import sys

C_KEY    = 'XXXXX'
C_SECRET = 'XXXXX'
ACCESS_TOKEN_KEY = 'XXXXX'
ACCESS_TOKEN_SECRET = 'XXXXX'

# Authentication
auth = tweepy.OAuthHandler(C_KEY, C_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Search API
if len(sys.argv) == 1:
    print("Please provide a keyword to search")
else:
    # Query the Search (REST) API for up to 15 matching tweets
    posts = api.search(q=sys.argv[1], count=15)
    for tweet in posts:
        print(" " + tweet.text)
