The links below list all the box office info. "Movie title" and "release date" can also be extracted from them.

Use the movie title and release date on http://www.omdbapi.com/ to get the imdbID.

Example from omdbapi.com for The Big Chill (1983): can get rating, genre, and actors. Plot info is different from IMDb, but the actor list is in the same order.
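
A minimal sketch of that lookup, assuming OMDb's `t` (title), `y` (year), `r` (response format) and `apikey` query parameters and the JSON field names `imdbID`, `imdbRating`, `Genre`, `Actors` and `Response`; check these against the OMDb docs before relying on them. The helper names and the sample response are made up for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

OMDB_URL = "http://www.omdbapi.com/"


def build_omdb_query(title, year, api_key):
    """Build a title + year lookup URL (parameter names are assumptions)."""
    params = {"t": title, "y": year, "r": "json", "apikey": api_key}
    return OMDB_URL + "?" + urlencode(params)


def extract_fields(payload):
    """Pull imdbID, rating, genre and actors out of a decoded response."""
    if payload.get("Response") != "True":
        return None  # the lookup failed; nothing to extract
    return {
        "imdbID": payload.get("imdbID"),
        "rating": payload.get("imdbRating"),
        "genre": payload.get("Genre"),
        # Split the comma-separated string, keeping the order as returned.
        "actors": [name.strip() for name in payload.get("Actors", "").split(",")],
    }


# Placeholder response: these values are invented, not real OMDb output.
sample = json.loads(
    '{"Title": "The Big Chill", "Year": "1983", "imdbID": "tt0000000", '
    '"imdbRating": "0.0", "Genre": "Genre One, Genre Two", '
    '"Actors": "Actor One, Actor Two", "Response": "True"}'
)

url = build_omdb_query("The Big Chill", 1983, "YOUR_API_KEY")
fields = extract_fields(sample)
print(url)
print(fields)
```

In practice the URL would be fetched and the response body decoded with `json.loads` before calling `extract_fields`.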

http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time

All Time Domestic Box Office (Rank 1-100)

http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/1401

All Time Domestic Box Office (Rank 1,401-1,500)

http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2001

All Time Domestic Box Office (Rank 2,001-2,100)
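
The URL naming scheme can be sketched as below: the number at the end of the URL looks like the first rank on that page. This assumes every page holds 100 rows, which is only inferred from the captions above; `rank_range` and `caption` are hypothetical helper names.

```python
def rank_range(page_start=1, page_size=100):
    """Return the (first, last) rank on the page whose URL ends in page_start."""
    return page_start, page_start + page_size - 1


def caption(page_start=1):
    """Rebuild the page caption from the number at the end of the URL."""
    first, last = rank_range(page_start)
    return "All Time Domestic Box Office (Rank {:,}-{:,})".format(first, last)


print(caption())      # base URL, no number at the end
print(caption(2001))  # URL ending in /2001
```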


In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
from time import sleep

In [2]:
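class TheNumbersSpider(scrapy.Spider):
    """Save each all-time domestic box office ranking page as a local HTML file."""
    name = "thenumbers"
    # NOTE: as a plain class attribute this setting has no effect (the log below
    # still lists CookiesMiddleware); Scrapy reads per-spider settings from
    # custom_settings, e.g. custom_settings = {'COOKIES_ENABLED': False}.
    COOKIES_ENABLED = False

    def make_urls(self, upper, lower=2001):
        """Return page URLs from `upper` down to, but not including, `lower`."""
        # Page URLs end in the first rank they list (2001, 2101, ...), so snap to x01.
        upper = round(upper, -2) + 1
        base_url = "http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
        # range() excludes its stop value, so the default lower=2001 stops at 2101.
        return [base_url + '/' + str(num) for num in range(upper, lower, -100)]

    def start_requests(self):
        for url in self.make_urls(6001):
            # Pause about 11 seconds between requests to go easy on the site.
            # time.sleep blocks Scrapy's event loop; the DOWNLOAD_DELAY setting
            # is the non-blocking way to throttle.
            sleep(11)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Name the file after the last URL segment, e.g. 6001.html.
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('saved file %s' % filename)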
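
A possible alternative to `sleep(11)` and the class-level `COOKIES_ENABLED`, sketched as a plain settings dict. The keys are standard Scrapy setting names; the dict name `polite_settings` and the chosen values are assumptions, and this is not what the run logged below used.

```python
# Hypothetical settings dict; nothing here was used for the logged crawl.
polite_settings = {
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'DOWNLOAD_DELAY': 11,                 # seconds between requests to the same site
    'RANDOMIZE_DOWNLOAD_DELAY': False,    # keep the delay fixed instead of 0.5x-1.5x
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # one request in flight at a time
    'COOKIES_ENABLED': False,             # disables the cookies middleware
}

# Could be passed as CrawlerProcess(polite_settings), or set on the spider
# as custom_settings = polite_settings, with the sleep(11) call removed.
for key in sorted(polite_settings):
    print(key, '=', polite_settings[key])
```

With the delay handled by the downloader, `start_requests` can yield all requests at once and Scrapy spaces them out itself.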

In [3]:
# Run the spider in-process, sending a browser-like user agent.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(TheNumbersSpider)
process.start()  # blocks until the crawl finishes


2017-07-26 15:40:54 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-07-26 15:40:54 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2017-07-26 15:40:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-07-26 15:40:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-26 15:40:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-26 15:40:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-26 15:40:54 [scrapy.core.engine] INFO: Spider opened
2017-07-26 15:40:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:40:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-07-26 15:42:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/6001> (referer: None)
2017-07-26 15:42:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5901> (referer: None)
2017-07-26 15:42:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:42:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5801> (referer: None)
2017-07-26 15:42:11 [thenumbers] DEBUG: saved file 6001.html
2017-07-26 15:42:11 [thenumbers] DEBUG: saved file 5901.html
2017-07-26 15:42:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5701> (referer: None)
2017-07-26 15:42:22 [thenumbers] DEBUG: saved file 5801.html
2017-07-26 15:42:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5601> (referer: None)
2017-07-26 15:42:33 [thenumbers] DEBUG: saved file 5701.html
2017-07-26 15:42:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5501> (referer: None)
2017-07-26 15:42:44 [thenumbers] DEBUG: saved file 5601.html
2017-07-26 15:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5401> (referer: None)
2017-07-26 15:42:55 [thenumbers] DEBUG: saved file 5501.html
2017-07-26 15:42:55 [scrapy.extensions.logstats] INFO: Crawled 7 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:43:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5301> (referer: None)
2017-07-26 15:43:06 [thenumbers] DEBUG: saved file 5401.html
2017-07-26 15:43:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5201> (referer: None)
2017-07-26 15:43:17 [thenumbers] DEBUG: saved file 5301.html
2017-07-26 15:43:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5101> (referer: None)
2017-07-26 15:43:28 [thenumbers] DEBUG: saved file 5201.html
2017-07-26 15:43:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/5001> (referer: None)
2017-07-26 15:43:39 [thenumbers] DEBUG: saved file 5101.html
2017-07-26 15:43:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4901> (referer: None)
2017-07-26 15:43:50 [thenumbers] DEBUG: saved file 5001.html
2017-07-26 15:44:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4801> (referer: None)
2017-07-26 15:44:01 [thenumbers] DEBUG: saved file 4901.html
2017-07-26 15:44:01 [scrapy.extensions.logstats] INFO: Crawled 13 pages (at 6 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:44:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4701> (referer: None)
2017-07-26 15:44:12 [thenumbers] DEBUG: saved file 4801.html
2017-07-26 15:44:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4601> (referer: None)
2017-07-26 15:44:23 [thenumbers] DEBUG: saved file 4701.html
2017-07-26 15:44:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4501> (referer: None)
2017-07-26 15:44:34 [thenumbers] DEBUG: saved file 4601.html
2017-07-26 15:44:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4401> (referer: None)
2017-07-26 15:44:45 [thenumbers] DEBUG: saved file 4501.html
2017-07-26 15:44:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4301> (referer: None)
2017-07-26 15:44:56 [thenumbers] DEBUG: saved file 4401.html
2017-07-26 15:44:56 [scrapy.extensions.logstats] INFO: Crawled 18 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:45:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4201> (referer: None)
2017-07-26 15:45:07 [thenumbers] DEBUG: saved file 4301.html
2017-07-26 15:45:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4101> (referer: None)
2017-07-26 15:45:18 [thenumbers] DEBUG: saved file 4201.html
2017-07-26 15:45:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/4001> (referer: None)
2017-07-26 15:45:29 [thenumbers] DEBUG: saved file 4101.html
2017-07-26 15:45:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3901> (referer: None)
2017-07-26 15:45:40 [thenumbers] DEBUG: saved file 4001.html
2017-07-26 15:45:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3801> (referer: None)
2017-07-26 15:45:51 [thenumbers] DEBUG: saved file 3901.html
2017-07-26 15:46:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3701> (referer: None)
2017-07-26 15:46:02 [thenumbers] DEBUG: saved file 3801.html
2017-07-26 15:46:02 [scrapy.extensions.logstats] INFO: Crawled 24 pages (at 6 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:46:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3601> (referer: None)
2017-07-26 15:46:13 [thenumbers] DEBUG: saved file 3701.html
2017-07-26 15:46:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3501> (referer: None)
2017-07-26 15:46:24 [thenumbers] DEBUG: saved file 3601.html
2017-07-26 15:46:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3401> (referer: None)
2017-07-26 15:46:35 [thenumbers] DEBUG: saved file 3501.html
2017-07-26 15:46:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3301> (referer: None)
2017-07-26 15:46:46 [thenumbers] DEBUG: saved file 3401.html
2017-07-26 15:46:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3201> (referer: None)
2017-07-26 15:46:57 [thenumbers] DEBUG: saved file 3301.html
2017-07-26 15:46:57 [scrapy.extensions.logstats] INFO: Crawled 29 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:47:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3101> (referer: None)
2017-07-26 15:47:09 [thenumbers] DEBUG: saved file 3201.html
2017-07-26 15:47:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/3001> (referer: None)
2017-07-26 15:47:20 [thenumbers] DEBUG: saved file 3101.html
2017-07-26 15:47:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2901> (referer: None)
2017-07-26 15:47:31 [thenumbers] DEBUG: saved file 3001.html
2017-07-26 15:47:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2801> (referer: None)
2017-07-26 15:47:42 [thenumbers] DEBUG: saved file 2901.html
2017-07-26 15:47:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2701> (referer: None)
2017-07-26 15:47:53 [thenumbers] DEBUG: saved file 2801.html
2017-07-26 15:48:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2601> (referer: None)
2017-07-26 15:48:04 [thenumbers] DEBUG: saved file 2701.html
2017-07-26 15:48:04 [scrapy.extensions.logstats] INFO: Crawled 35 pages (at 6 pages/min), scraped 0 items (at 0 items/min)
2017-07-26 15:48:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2501> (referer: None)
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2601.html
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2501.html
2017-07-26 15:48:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2401> (referer: None)
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2401.html
2017-07-26 15:48:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2201> (referer: None)
2017-07-26 15:48:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2301> (referer: None)
2017-07-26 15:48:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/2101> (referer: None)
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2201.html
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2301.html
2017-07-26 15:48:15 [thenumbers] DEBUG: saved file 2101.html
2017-07-26 15:48:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-26 15:48:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 11880,
 'downloader/request_count': 40,
 'downloader/request_method_count/GET': 40,
 'downloader/response_bytes': 2005195,
 'downloader/response_count': 40,
 'downloader/response_status_count/200': 40,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 26, 19, 48, 15, 684788),
 'log_count/DEBUG': 81,
 'log_count/INFO': 14,
 'response_received_count': 40,
 'scheduler/dequeued': 40,
 'scheduler/dequeued/memory': 40,
 'scheduler/enqueued': 40,
 'scheduler/enqueued/memory': 40,
 'start_time': datetime.datetime(2017, 7, 26, 19, 40, 54, 484267)}
2017-07-26 15:48:15 [scrapy.core.engine] INFO: Spider closed (finished)
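
A quick check of the crawl pace, using `start_time`, `finish_time` and the response count from the stats dump above. The stats timestamps (19:xx) appear to be UTC, four hours ahead of the local times on the log lines (15:xx); that reading is an assumption, but it does not affect the difference.

```python
from datetime import datetime

# Values copied from the stats dump above.
start_time = datetime(2017, 7, 26, 19, 40, 54, 484267)
finish_time = datetime(2017, 7, 26, 19, 48, 15, 684788)
pages = 40  # 'response_received_count'

elapsed = finish_time - start_time
seconds_per_page = elapsed.total_seconds() / pages

print(elapsed)                     # total crawl time
print(round(seconds_per_page, 2))  # average seconds per page
```

The average lands close to the 11-second sleep, even though the last few pages arrived in a burst once the request generator was exhausted.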
