Automatically traverses hyperlinks on the World Wide Web and gathers web pages.
Snowball Sampling: start from some seed URLs and recursively extract hyperlinks to other URLs.
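A minimal sketch of this snowball-sampling idea in Python, assuming pages are fetched with the standard library (`urllib`, `html.parser`); the function names and the `max_pages` cap are illustrative, not part of the notes:

```python
# Sketch of snowball sampling: start from seed URLs and recursively
# follow the hyperlinks found on each fetched page.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for every hyperlink found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]

def snowball(seeds, max_pages=10):
    """Breadth-first crawl: visit seed URLs, then the URLs they link to."""
    frontier, seen, pages = list(seeds), set(), {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page; skip it
        pages[url] = html
        frontier.extend(extract_links(html, url))
    return pages
```

A real crawler would also respect robots.txt and rate-limit its requests; this sketch only shows the seed-and-expand structure.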
Restrictions to consider:
Meta tags also convey restrictions to a crawler.
Example: <META NAME="ROBOTS" CONTENT="NOINDEX">
NOINDEX: do not include this page in the search engine's index
NOFOLLOW: do not follow links on this page
NOARCHIVE: do not show a cached copy of this page in search results
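The meta-tag directives above can be read programmatically; a minimal sketch in Python using the standard library's `html.parser` (the class and function names here are illustrative):

```python
# Sketch: extract ROBOTS meta-tag directives from a page so a polite
# crawler can honor NOINDEX / NOFOLLOW / NOARCHIVE.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <META NAME="ROBOTS" CONTENT="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # html.parser lowercases tag and attribute names
        if tag == "meta" and a.get("name", "").lower() == "robots":
            for d in a.get("content", "").split(","):
                self.directives.add(d.strip().upper())

def robots_directives(html):
    """Return the set of robots directives declared in `html`."""
    p = RobotsMetaParser()
    p.feed(html)
    return p.directives
```

For example, a page whose head contains `<meta name="robots" content="noindex, nofollow">` yields the set `{"NOINDEX", "NOFOLLOW"}`.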
Wget is free GNU software for crawling and retrieving files over HTTP, HTTPS, and FTP.
Wget supports:
Retrieve the index.html file from www.vahidmirjalili.com, retrying up to 20 times if access fails:
wget -t 20 http://vahidmirjalili.com
Recursively retrieve files (default depth = 5):
wget -r http://vahidmirjalili.com
-O specify output file
--limit-rate=200k limit the download speed to 200 KB/s
-b download in background
--user-agent="Mozilla/.." identify Wget to the server as a browser
-i list-of-urls.txt download multiple URLs listed in input file
--mirror turn on mirror options
-p download all files needed to display a page properly (images, CSS, etc.)
--convert-links Convert links to allow local viewing
Download full website to be viewed locally:
wget --mirror -p --convert-links http://vahidmirjalili.com
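The retry behavior of `-t 20` above can be sketched in Python; `fetch` here is a hypothetical zero-argument callable standing in for the actual HTTP request:

```python
# Sketch of wget's -t (retries) behavior: attempt a fetch up to `tries`
# times, re-raising the last error if every attempt fails.
# `fetch` is a hypothetical callable (an assumption of this sketch) that
# returns the page body or raises OSError on failure.
def retrieve_with_retries(fetch, tries=20):
    last_error = None
    for _ in range(tries):
        try:
            return fetch()
        except OSError as err:
            last_error = err  # access failed; try again
    raise last_error
```

For instance, a flaky `fetch` that fails twice and then succeeds returns its result on the third attempt.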
Format of Data Returned From APIs:
Encoding Data into JSON in PHP:
Using array:
<?php
$book = array("code" => "DS110",
              "title" => "Elements of Data Science",
              "year" => "2016");
echo json_encode($book);
?>
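For comparison, the same encoding in Python uses `json.dumps` on a dict (a sketch, not part of the PHP notes):

```python
# Python analogue of the PHP array example: encode a dict as JSON.
import json

book = {"code": "DS110",
        "title": "Elements of Data Science",
        "year": "2016"}
print(json.dumps(book))
# {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}
```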
Using class:
<?php
class Book {
    public $code = "";
    public $title = "";
    public $year = "";
}
$b = new Book();
$b->code = "DS110";
$b->title = "Elements of Data Science";
$b->year = "2016";
echo json_encode($b);
?>
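The class-based example has a direct Python sketch: `json.dumps` cannot serialize an arbitrary object, so a common approach is to encode its attribute dictionary (`__dict__`):

```python
# Python analogue of the PHP class example: encode an object's
# attributes as JSON via its __dict__.
import json

class Book:
    def __init__(self):
        self.code = ""
        self.title = ""
        self.year = ""

b = Book()
b.code = "DS110"
b.title = "Elements of Data Science"
b.year = "2016"
print(json.dumps(b.__dict__))
# {"code": "DS110", "title": "Elements of Data Science", "year": "2016"}
```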
Decoding JSON Data in PHP:
<?php
$myjson = '{"a":"1", "b":"2", "c":"3"}';
$arr = json_decode($myjson, true);
echo $arr['a']." ".$arr['b']."<BR>";
?>
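Decoding has the same shape in Python: `json.loads` returns a dict, the counterpart of `json_decode($myjson, true)` returning an associative array (a sketch):

```python
# Python analogue of the PHP decoding example: parse a JSON string
# into a dict and index it by key.
import json

myjson = '{"a":"1", "b":"2", "c":"3"}'
arr = json.loads(myjson)
print(arr["a"], arr["b"])
# 1 2
```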
import tweepy
import sys
C_KEY = 'XXXXX'
C_SECRET = 'XXXXX'
ACCESS_TOKEN_KEY = 'XXXXX'
ACCESS_TOKEN_SECRET = 'XXXXX'
# Authentication
auth = tweepy.OAuthHandler(C_KEY, C_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
# Search API
if len(sys.argv) == 1:
    print("Please provide a keyword to search")
else:
    # count=15 requests 15 results (replaces the deprecated rpp parameter)
    posts = api.search(q=sys.argv[1], count=15)
    for tweet in posts:
        print(" " + tweet.text)