Working Scrapy Code Snippets

# extracts total content of page

keywords = response.xpath('//meta[@name="keywords"]').re(r'content="(.*?)"')

# extracts keywords from meta tag as list
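
Because `.re()` runs against the serialized tag, the result depends on how greedy the pattern is and on which attributes follow `content`. The sketch below checks this outside Scrapy with the standard `re` module. The tag string is hypothetical (the trailing `lang` attribute is an assumption, not taken from the scraped pages).

```python
import re

# Hypothetical serialized tag; the extra `lang` attribute is an assumption.
tag = '<meta name="keywords" content="Telecommunications, Agreements" lang="en">'

# Greedy: runs on to the last double quote in the tag.
greedy = re.findall(r'content="(.*)"', tag)

# Lazy: stops at the first closing quote after content=".
lazy = re.findall(r'content="(.*?)"', tag)

print(greedy)
print(lazy)
```

In Scrapy itself, selecting the attribute directly (`//meta[@name="keywords"]/@content` followed by `.extract()`) would likely sidestep the regex altogether.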

item['number'] = response.xpath('//title/text()').re(r'Decision\s(.*)\r')

# extracts file name, but not reliably

for metadata in response.xpath('//head'):
    # './/' keeps each query relative to this <head>; a bare '//' searches the whole document
    title = metadata.xpath('.//meta[@name="dcterms.title"]').re(r'content="(.*?)"')
    # NOTE: doctype queries the same "keywords" meta tag as keywords below
    doctype = metadata.xpath('.//meta[@name="keywords"]').re(r'content="(.*?)"')
    keywords = metadata.xpath('.//meta[@name="keywords"]').re(r'content="(.*?)"')
    print(dict(title=title, doctype=doctype, keywords=keywords))
# extracts metadata

head ../scrapy/result11a.json

{"file": ["PT2001-99", [{"content": [{"text": "ARCHIVED - Public Notice CRTC 2001-99 Public Notice CRTC 2001-99 Ottawa, 31 August 2001 Terms and conditions of existing agreements for access to municipal property Reference: 8690-A4-01/01 In this Public Notice, the Commission initiates a proceeding to consider the circumstances, if any, where the Commission could alter the terms and conditions of an existing property-access contract between a carrier and a municipality. Background 1. On 28 May 2001, AT&T Canada Corp. on behalf of itself and AT&T Canada Telecom Services Company (AT&T Canada) filed a Part VII application requesting relief pursuant to sections 32(d) and (e) and 43(4) of the Telecommunications Act (the Act), naming the City of Toronto (the City) as respondent. AT&T Canada stated that it wanted the Commission to substitute for those terms and conditions of its current access agreement with the City that are inconsistent with the principles set out in Decision CRTC 2001-23 , Ledcor/Vancouver - Construction, operation and maintenance of transmission lines in Vancouver, dated 25 January 2001, the terms and conditions that are based on the principles set out in that decision. 2. On 27 June 2001, the City filed its response to AT&T Canada's application. The City argued that the principles developed in Decision 2001-23 were not applicable in the unique factual circumstances of the contractual situation between itself and AT&T Canada. Further, the City argued that the Commission lacked jurisdiction under the Act to interfere with an existing agreement for access to municipal property like the agreement between itself and AT&T Canada. Scope of proceeding 3. 
In this proceeding, the Commission will consider, given the framework set out in sections 43(1) to 43(4) and any other relevant provisions of the Act, and the principles laid out in Decision 2001-23 , what circumstances, if any, would justify an intervention by the Commission to alter the terms of an existing contract between a carrier and a municipality for access to municipal rights-of-way. Procedure 4. The contract filed in the Part VII application as Appendix \"A\" to AT&T Canada's application is made part of the record in this proceeding. 5. AT&T Canada and the City are made parties to this proceeding. 6. Other parties wishing to participate in this proceeding must notify the Commission of their intention to do so by 1 October 2001. These parties should contact the Secretary General, by mail at CRTC, Ottawa, Ontario, K1A 0N2; by fax at (819) 953-0795; or by email at They are to indicate in the notice their email address, where available. If parties do not have access to the Internet, they are to indicate in their notice whether they wish to receive disk versions of hard copy filings. 7. The Commission will issue, as soon as possible after the registration date, a complete list of interested parties and their mailing addresses (including their email address, if available), identifying those parties who wish to receive disk versions. 8. All parties may submit comments on the circumstances, if any, for Commission intervention, serving a copy of their submission on all the parties on the interested parties list, by 29 October 2001. Submissions longer than five pages should include a summary. In order to streamline the process and reduce the workload for all concerned, the Commission encourages parties with similar interests to file joint submissions and to participate jointly in subsequent stages of the proceeding. 9. Parties may file reply comments with the Commission, serving a copy on those parties who filed comments, by 28 November 2001. 
Submissions longer than five pages should include a summary. 10. Where a document is to be filed or served by a specific date, the document must be actually received, and not merely sent, by that date. 11. Parties wishing to file electronic versions of their comments can do so by email at the address shown above, or on diskette. 12. The electronic version should be in the HTML format. As an alternative, those submitting comments may use \"Microsoft Word\" for text and \"Microsoft Excel\" for spreadsheets. 13. Please number each paragraph of your submission. In addition, please enter the line ***End of document*** following the last paragraph. This will help the Commission verify that the document has not been damaged during transmission. 14. The Commission will make submissions filed in electronic form available on its web site at in the official language and format in which they are submitted. This will make it easier for members of the public to consult the documents. 15. The Commission also encourages interested parties to monitor the public examination file (and/or the Commission's web site) for additional information that they may find useful when preparing their submission. 16. Submissions may be examined or will be made available promptly upon request at the Commission offices during normal business hours: Central Building Les Terasses de la Chaudi\u00e8re 1 Promenade du Portage, Room G-5 Hull, Quebec K1A 0N2 Tel: (819) 997-2429 - TDD: 994-0423 Fax: (819) 994-0218 Bank of Commerce Building 1809 Barrington Street Suite 1007 Halifax, Nova Scotia B3J 3K8 Tel: (902) 426-7997 - TDD: 426-6997 Fax: (902) 426-2721 405 de Maisonneuve Blvd. East 2 nd Floor, Suite B2300 Montr\u00e9al, Quebec H2L 4J5 Tel: (514) 283-6607 - TDD: 283-8316 Fax: (514) 283-3689 55 St. 
Clair Avenue East Suite 624 Toronto, Ontario M4T 1M2 Tel: (416) 952-9096 Fax: (416) 954-6343 Kensington Building 275 Portage Avenue Suite 1810 Winnipeg, Manitoba R3B 2B3 Tel: (204) 983-6306 - TDD:983-8274 Fax: (204) 983-6317 Cornwall Professional Building 2125 - 11 th Avenue Room 103 Regina, Saskatchewan S4P 3X3 Tel: (306) 780-3422 Fax: (306) 780-3319 10405 Jasper Avenue Suite 520 Edmonton, Alberta T5J 3N4 Tel: (780) 495-3224 Fax: (780) 495-3214 530-580 Hornby Street Vancouver, British Columbia V6C 3B6 Tel: (604) 666-2111 - TDD:666-0778 Fax: (604) 666-8322 Secretary General This document is available in alternative format upon request, and may also be examined at the following Internet site: Date Modified: 2001-08-31 Date modified: 2001-08-31"}], "metadata": [{"subject": ["Telecommunications, Agreements, Property, Municipal governments, AT&T Canada, City of Toronto, Access arrangements"], "dateIssued": [], "title": ["ARCHIVED - Terms and conditions of existing agreements for access to municipal property"], "docType": ["Notices of consultation"], "keywords": ["Telecommunications, Agreements, Property, Municipal governments, AT&T Canada, City of Toronto, Access arrangements", "Telecommunications, Agreements, Property, Municipal governments, AT&T Canada, City of Toronto, Access arrangements"], "date": [], "dateCreated": ["2001-08-31"], "dateMod": []}]}]]}

grep -E  ../scrapy/result11.json

import ijson
from ijson import items

In [205]:
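filename = "../scrapy/result11.json"
# ijson expects an open file object; binary mode avoids an extra decoding step
with open(filename, 'rb') as f:
    # the prefix is the full path from the root; this one assumes a top-level array of {"file": [...]} records
    objects = ijson.items(f, 'item.file.item.item.metadata.item')
    records = list(objects)  # not named `items`, which would shadow the ijson import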

In [240]:
f = "../scrapy/result11.json"
objects = items(f, 'file.metadata.item')
titles = (o for o in objects if o['type'] == 'title')
for title in titles:
    print(title)

AttributeError                            Traceback (most recent call last)
<ipython-input-240-2a683b902f5b> in <module>()
      2 objects = items(f, 'file.metadata.item')
      3 titles = (o for o in objects if o['type'] == 'title')
----> 4 for title in titles:
      5     print(title)

<ipython-input-240-2a683b902f5b> in <genexpr>(.0)
      1 f = "../scrapy/result11.json"
      2 objects = items(f, 'file.metadata.item')
----> 3 titles = (o for o in objects if o['type'] == 'title')
      4 for title in titles:
      5     print(title)

/usr/local/lib/python3.5/site-packages/ijson/ in items(prefixed_events, prefix)
    136     try:
    137         while True:
--> 138             current, event, value = next(prefixed_events)
    139             if current == prefix:
    140                 if event in ('start_map', 'start_array'):

/usr/local/lib/python3.5/site-packages/ijson/ in parse(basic_events)
     63     '''
     64     path = []
---> 65     for event, value in basic_events:
     66         if event == 'map_key':
     67             prefix = '.'.join(path[:-1])

/usr/local/lib/python3.5/site-packages/ijson/backends/ in basic_parse(file, buf_size)
    183     '''
    184     lexer = iter(Lexer(file, buf_size))
--> 185     for value in parse_value(lexer):
    186         yield value
    187     try:

/usr/local/lib/python3.5/site-packages/ijson/backends/ in parse_value(lexer, symbol, pos)
    106     try:
    107         if symbol is None:
--> 108             pos, symbol = next(lexer)
    109         if symbol == 'null':
    110             yield ('null', None)

/usr/local/lib/python3.5/site-packages/ijson/backends/ in Lexer(f, buf_size)
     24 def Lexer(f, buf_size=BUFSIZE):
---> 25     if type( == bytetype:
     26         f = getreader('utf-8')(f)
     27     buf =

AttributeError: 'str' object has no attribute 'read'

In [72]:
url = ''
page = url.split("/")[-1]
file = page.split(".")[0]
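
Chained `split` calls give the wrong result when the URL ends in a slash, carries a query string containing `/`, or has more than one dot in the file name. The sketch below shows one possible alternative using only the standard library. The URLs and the helper name `page_stem` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

def page_stem(url):
    """Return the last path segment of url without its extension."""
    path = urlparse(url).path                    # drops query string and fragment
    name = posixpath.basename(path.rstrip('/'))  # tolerate a trailing slash
    stem, _ext = posixpath.splitext(name)        # strips only the final extension
    return stem

# Hypothetical URLs for illustration only.
print(page_stem('https://example.org/archive/pt2001-99.htm'))
print(page_stem('https://example.org/archive/pt2001-99.htm?ref=/home'))
print(page_stem('https://example.org/archive/pt2001.99.htm'))
```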


In [56]:
import json
import glob
import os
import pandas as pd
import numpy as np
from pprint import pprint

In [249]:
data = json.loads(json_data)

In [117]:
crtc_files = "../scrapy/result11.json"
crtc_data = []

In [118]:
with open(crtc_files) as f:
    for line in f:

crtc = pd.read_json("../scrapy/result12.json", orient = "index")

crtc ="../scrapy/result9.json", format = 'text')[:160]

import ijson
filename = "../scrapy/result9.json"
text = []
with open(filename, 'r') as f:
    for line in f:

objects = ijson.items(f, 'file.metadata.subject.item')
columns = list(objects)

In [315]:
# Finally, something that works. This code reads each level of the JSON file (file, metadata, text).
with open("../scrapy/result12.json") as json_file:
    data = json.load(json_file)
    for d in data:
        print(d['file'][0])  # change the index from 0 to 2 to read each level
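
Once `json.load` returns a list of records, each one can be unpacked by position and key. The sketch below assumes records shaped like the result11a.json sample shown earlier, that is `{"file": [name, [{"content": [...], "metadata": [...]}]]}`. result12.json may be laid out differently. The helper name `flatten` and the shortened values are made up for illustration.

```python
import json

# Miniature record in the shape of the result11a.json sample; values are shortened.
raw = '''[{"file": ["PT2001-99", [{
    "content": [{"text": "ARCHIVED - Public Notice CRTC 2001-99"}],
    "metadata": [{"title": ["ARCHIVED - Terms and conditions"],
                  "docType": ["Notices of consultation"],
                  "dateIssued": [],
                  "dateCreated": ["2001-08-31"]}]}]]}]'''

def flatten(record):
    """Turn one scraped record into a flat dict of scalars."""
    name, body = record['file']   # [file name, [ {content, metadata} ]]
    doc = body[0]
    row = {'file': name, 'text': doc['content'][0]['text']}
    for key, values in doc['metadata'][0].items():
        # Each metadata value is a list; keep the first entry, or None if empty.
        row[key] = values[0] if values else None
    return row

rows = [flatten(r) for r in json.loads(raw)]
print(rows[0]['title'])
```

A list of flat dicts like `rows` is also a shape that `pd.DataFrame(rows)` accepts, which may be easier than pointing `pd.read_json` at the nested file.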


In [261]:
text = file['text']

TypeError                                 Traceback (most recent call last)
<ipython-input-261-da2633f3f826> in <module>()
----> 1 text = file['text']

TypeError: list indices must be integers or slices, not str

In [327]:
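# prints every (prefix, event, value) triple, which shows the prefixes ijson.items() accepts
with open("../scrapy/result12.json", "rb") as f:
    for prefix, the_type, value in ijson.parse(f):
        print(prefix, the_type, value)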

import xml.dom.minidom

# bind the document to `dom`; assigning it to `xml` would shadow the xml package
dom = xml.dom.minidom.parse("../scrapy/result10.xml")  # or xml.dom.minidom.parseString(xml_string)
pretty_xml_as_string = dom.toprettyxml()
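
The `parseString` variant mentioned in the comment can be tried without a file on disk. This is a small sketch with a hypothetical XML string. The element names are assumptions, not taken from result10.xml.

```python
import xml.dom.minidom

# Hypothetical XML; the element names are not taken from result10.xml.
xml_string = '<items><item><file>PT2001-99</file></item></items>'

# Keep the parsed document in `dom` so the `xml` module name stays usable.
dom = xml.dom.minidom.parseString(xml_string)
pretty = dom.toprettyxml(indent='  ')
print(pretty)
```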

