Lesson 1: Data Extraction Fundamentals

01 - Intro

  • Data scientists spend about 70% of their time on data wrangling
  • Data wrangling is the process of gathering, extracting, cleaning, and storing our data
  • We need to make sure the data is in good shape before doing any analysis
  • Otherwise we may
    • waste a lot of time
    • lose the faith of our colleagues

03 - Assessing the Quality of Data Pt 1

  • We should not trust any data as we receive it, because it was either
    • entered by a human, or
    • created by a program written by a human

04 - Assessing the Quality of Data Pt 2

We need to assess our data to

  • test assumptions about
    • values
    • data types
    • shape
  • identify errors or outliers
  • find missing values
  • ensure that our data will support the kinds of queries we need to make
  • eliminate any surprises later on (a small sketch of such checks follows this list)
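
A minimal sketch of such checks, assuming rows shaped like the diskography records we parse below (field names are taken from that data; the checks themselves are illustrative):

def assess(rows):
    # illustrative checks only; field names assume the Beatles diskography data
    for row in rows:
        assert len(row) == 7                      # shape: expected field count
        assert type(row['Title']) == str          # data type
        if row['UK Chart Position'].isdigit():
            assert int(row['UK Chart Position']) >= 1   # value sanity check
        if row['Title'] == '':
            print "missing title:", row           # missing value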

05 - Tabular Format

06 - CSV Format

CSV is lightweight

  • Each line of text is a single row
  • Fields are separated by a delimiter
  • It stores just the data itself
  • No special-purpose software is needed
  • Virtually all spreadsheet software can read and write CSV (a sample snippet follows this list)
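
For illustration only, the first two lines of such a file might look like this (the column order here is a guess; the real file is parsed below):

Title,Released,Label,UK Chart Position,US Chart Position,BPI Certification,RIAA Certification
Please Please Me,22 March 1963,Parlophone(UK),1,—,Gold,Platinum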

07 - Parsing CSV Files in Python

  • not all spreadsheet software can handle big files
  • when files are large, or there are many of them, opening them manually is not an option

We will parse a CSV file into a list of dictionaries


In [39]:
import os

DATA_FILE_CSV = "beatles-diskography.csv"

def parse_file(data_file):
    # Naive parser: splitting on commas will break on quoted fields
    # that themselves contain commas (see section 11 below).
    data = []
    row_count = 0
    with open(data_file) as f:
        header = f.readline().split(',')
        for line in f:
            # only parse the first 10 data rows
            if row_count >= 10:
                break

            fields = line.strip().split(',')

            # map each header name to the corresponding field value
            row = {}
            for i, value in enumerate(fields):
                row[header[i].strip()] = value.strip()

            data.append(row)
            row_count += 1

    return data

In [40]:
d = parse_file(DATA_FILE_CSV)

d[0]


Out[40]:
{'BPI Certification': 'Gold',
 'Label': 'Parlophone(UK)',
 'RIAA Certification': 'Platinum',
 'Released': '22 March 1963',
 'Title': 'Please Please Me',
 'UK Chart Position': '1',
 'US Chart Position': '\xe2\x80\x94'}

In [41]:
def test(data):
    assert data[0] == {'BPI Certification': 'Gold',
      'Label': 'Parlophone(UK)',
      'RIAA Certification': 'Platinum',
      'Released': '22 March 1963',
      'Title': 'Please Please Me',
      'UK Chart Position': '1',
      'US Chart Position': '\xe2\x80\x94'}

    assert data[9] == {'BPI Certification': 'Gold',
     'Label': 'Parlophone(UK)',
     'RIAA Certification': '',
     'Released': '10 July 1964',
     'Title': '',
     'UK Chart Position': '1',
     'US Chart Position': '\xe2\x80\x94'}

In [42]:
test(d)

11 - Using CSV Module

But there are many small things (quoted fields, embedded delimiters, inconsistent line endings) that will cause problems if we write a CSV reader ourselves, so we will rewrite the above using Python's csv module.


In [43]:
import csv

def parse_csv(data_file):
    data = []
    # Python 2's csv module expects files opened in binary mode
    with open(data_file, 'rb') as sd:
        r = csv.DictReader(sd)  # uses the first row as the field names
        for line in r:
            data.append(line)
    return data

In [44]:
test(parse_csv(DATA_FILE_CSV))
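
One example of those "small things": a quoted field containing the delimiter. The csv module handles it correctly, while a naive split does not (an illustrative snippet, not part of the lesson data):

import csv

line = 'Please Please Me,"Parlophone, UK",Gold'
print list(csv.reader([line]))[0]  # ['Please Please Me', 'Parlophone, UK', 'Gold']
print line.split(',')              # the quoted field is wrongly split in two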

12 - Intro to XLRD

This module allows us to work with Excel documents, whether they are in the old .xls or the new .xlsx format.

We can install xlrd using

pip install xlrd


In [45]:
def read_sheet(sheet):
    # read every cell of the sheet into a list of row lists
    return [[sheet.cell_value(r, col)
                for col in range(sheet.ncols)]
                    for r in range(sheet.nrows)]

In [46]:
import xlrd

DATA_FILE_EXCEL = "2013_ERCOT_Hourly_Load_Data.xls"

def parse_excel_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    data = read_sheet(sheet)

    print "\nList Comprehension"
    print "data[3][2]:",
    print data[3][2]

    print "\nCells in a nested loop:"
    # iterate over every cell, but print only the values in row 50
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            if row == 50:
                print sheet.cell_value(row, col),


    ### other useful methods:
    print "\nROWS, COLUMNS, and CELLS:"
    print "Number of rows in the sheet:", 
    print sheet.nrows
    print "Type of data in cell (row 3, col 2):", 
    print sheet.cell_type(3, 2)
    print "Value in cell (row 3, col 2):", 
    print sheet.cell_value(3, 2)
    print "Get a slice of values in column 3, from rows 1-3:"
    print sheet.col_values(3, start_rowx=1, end_rowx=4)

    print "\nDATES:"
    print "Type of data in cell (row 1, col 0):", 
    print sheet.cell_type(1, 0)
    exceltime = sheet.cell_value(1, 0)
    print "Time in Excel format:",
    print exceltime
    print "Convert time to a Python datetime tuple, from the Excel float:",
    print xlrd.xldate_as_tuple(exceltime, 0)

    return data

In [47]:
data = parse_excel_file(DATA_FILE_EXCEL)

data[0:2]


List Comprehension
data[3][2]: 1036.088697

Cells in a nested loop:
41277.0833333 9238.73731 1438.20528 1565.442856 916.708348 14010.903488 3027.98334 6165.211119 1157.741663 37520.933404 
ROWS, COLUMNS, and CELLS:
Number of rows in the sheet: 7296
Type of data in cell (row 3, col 2): 2
Value in cell (row 3, col 2): 1036.088697
Get a slice of values in column 3, from rows 1-3:
[1411.7505669999982, 1403.4722870000019, 1395.053150000001]

DATES:
Type of data in cell (row 1, col 0): 3
Time in Excel format: 41275.0416667
Convert time to a Python datetime tuple, from the Excel float: (2013, 1, 1, 1, 0, 0)
Out[47]:
[[u'Hour_End',
  u'COAST',
  u'EAST',
  u'FAR_WEST',
  u'NORTH',
  u'NORTH_C',
  u'SOUTHERN',
  u'SOUTH_C',
  u'WEST',
  u'ERCOT'],
 [41275.041666666664,
  7606.263544000012,
  1073.892438,
  1411.7505669999982,
  784.9781659999992,
  10369.094390000051,
  2206.6750770000012,
  4368.490945000006,
  882.9319009999975,
  28704.077028000065]]
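
xldate_as_tuple returns a plain tuple; if an actual datetime object is needed, the tuple can be unpacked into the datetime constructor (a small sketch using the cell value shown above):

import datetime
import xlrd

exceltime = 41275.0416667                # cell (1, 0) from the output above
t = xlrd.xldate_as_tuple(exceltime, 0)   # (2013, 1, 1, 1, 0, 0)
print datetime.datetime(*t)              # 2013-01-01 01:00:00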

13 - Reading Excel Files

  • Read the ERCOT load excel file
  • Find the min, max, and average for COAST, and report the timestamp (Hour_End) for the min and max

In [48]:
def data_for_column(sheet, column_index):
    return sheet.col_values(column_index, start_rowx=1, end_rowx=None)

def row_index_for_value_in_column(data, value):
    # +1 because the column data excludes the header row, while sheet rows include it
    return data.index(value) + 1

def cell_value_at_position(sheet, row, col):
    return sheet.cell_value(row, col)

def parse_excel_date(excel_date):
    return xlrd.xldate_as_tuple(excel_date, 0)

def get_date_for_row_containing_value(sheet, column_data, value):
    index = row_index_for_value_in_column(column_data, value)
    date = cell_value_at_position(sheet, index, 0)
    result = parse_excel_date(date)

    return result

In [49]:
def parse_file_13(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    cv = data_for_column(sheet, 1)  # column 1 is COAST
    max_data = max(cv)
    min_data = min(cv)

    data = {
            'maxtime': get_date_for_row_containing_value(sheet, cv, max_data),
            'maxvalue': max_data,
            'mintime': get_date_for_row_containing_value(sheet, cv, min_data),
            'minvalue': min_data,
            'avgcoast': sum(cv) / float(len(cv))
    }
    return data

In [50]:
import pprint
data = parse_file_13(DATA_FILE_EXCEL)

pprint.pprint(data)

assert data['maxtime'] == (2013, 8, 13, 17, 0, 0)
assert round(data['maxvalue'], 10) == round(18779.02551, 10)


{'avgcoast': 10976.933460679751,
 'maxtime': (2013, 8, 13, 17, 0, 0),
 'maxvalue': 18779.025510000003,
 'mintime': (2013, 2, 3, 4, 0, 0),
 'minvalue': 6602.113898999982}

15 - Intro to JSON

  • sometimes fields have nested fields
  • sometimes items have different sets of fields, and some fields are optional (see the example below)
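
A tiny made-up JSON document illustrating both points: "area" is a nested field, and "country" appears on only one of the items:

import json

sample = '''[
  {"name": "Nirvana", "country": "US",
   "area": {"name": "United States"}},
  {"name": "Nirvana",
   "area": {"name": "United Kingdom"}}
]'''
artists = json.loads(sample)
print artists[0]["area"]["name"]   # nested field -> United States
print artists[1].get("country")    # optional field -> None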


17 - JSON Playground


In [51]:
import json
import requests

BASE_URL = "http://musicbrainz.org/ws/2/"
ARTIST_URL = BASE_URL + "artist/"

# query parameters are given to the requests.get function as a dictionary; this
# variable contains some starter parameters.
query_type = {  "simple": {},
                "atr": {"inc": "aliases+tags+ratings"},
                "aliases": {"inc": "aliases"},
                "releases": {"inc": "releases"}}

def query_site(url, params, uid="", fmt="json"):
    # This is the main function for making queries to the musicbrainz API.
    # A json document should be returned by the query.
    params["fmt"] = fmt
    r = requests.get(url + uid, params=params)
    print "requesting", r.url

    if r.status_code == requests.codes.ok:
        return r.json()
    else:
        r.raise_for_status()


def query_by_name(url, params, name):
    # This adds an artist name to the query parameters before making
    # an API call to the function above.
    params["query"] = "artist:" + name
    return query_site(url, params)


def pretty_print(data, indent=4):
    # After we get our output, we can format it to be more readable
    # by using this function.
    if type(data) == dict:
        print json.dumps(data, indent=indent, sort_keys=True)
    else:
        print data

In [52]:
def json_play():
    '''
    Modify the function calls and indexing below to answer the questions on
    the next quiz. HINT: Note how the output we get from the site is a
    multi-level JSON document, so try making print statements to step through
    the structure one level at a time or copy the output to a separate output
    file.
    '''
    results = query_by_name(ARTIST_URL, query_type["simple"], "Nirvana")
    print "All Results for Nirvana"
    pretty_print(results)

    artist_id = results["artists"][1]["id"]
    print "\nARTIST:"
    pretty_print(results["artists"][1])

    artist_data = query_site(ARTIST_URL, query_type["releases"], artist_id)
    releases = artist_data["releases"]
    print "\nONE RELEASE:"
    pretty_print(releases[0], indent=2)
    release_titles = [r["title"] for r in releases]

    print "\nALL TITLES:"
    for t in release_titles:
        print t

In [53]:
json_play()


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3ANirvana&fmt=json
All Results for Nirvana
{
    "artists": [
        {
            "aliases": [
                {
                    "begin-date": null, 
                    "end-date": null, 
                    "locale": null, 
                    "name": "Nirvana US", 
                    "primary": null, 
                    "sort-name": "Nirvana US", 
                    "type": null
                }
            ], 
            "area": {
                "id": "489ce91b-6658-3307-9877-795b68554c98", 
                "name": "United States", 
                "sort-name": "United States"
            }, 
            "begin-area": {
                "id": "a640b45c-c173-49b1-8030-973603e895b5", 
                "name": "Aberdeen", 
                "sort-name": "Aberdeen"
            }, 
            "country": "US", 
            "disambiguation": "90s US grunge band", 
            "id": "5b11f4ce-a62d-471e-81fc-a69a8278c7da", 
            "life-span": {
                "begin": "1988-01", 
                "end": "1994-04-05", 
                "ended": true
            }, 
            "name": "Nirvana", 
            "score": "100", 
            "sort-name": "Nirvana", 
            "tags": [
                {
                    "count": 1, 
                    "name": "punk"
                }, 
                {
                    "count": 0, 
                    "name": "legendary"
                }, 
                {
                    "count": 0, 
                    "name": "90"
                }, 
                {
                    "count": 1, 
                    "name": "seattle"
                }, 
                {
                    "count": 0, 
                    "name": "northwest"
                }, 
                {
                    "count": 0, 
                    "name": "alternative"
                }, 
                {
                    "count": 0, 
                    "name": "rock and indie"
                }, 
                {
                    "count": 1, 
                    "name": "usa"
                }, 
                {
                    "count": 0, 
                    "name": "am\u00e9ricain"
                }, 
                {
                    "count": 0, 
                    "name": "united states"
                }, 
                {
                    "count": 0, 
                    "name": "kurt cobain"
                }, 
                {
                    "count": 1, 
                    "name": "90s"
                }, 
                {
                    "count": 4, 
                    "name": "alternative rock"
                }, 
                {
                    "count": 0, 
                    "name": "band"
                }, 
                {
                    "count": 13, 
                    "name": "grunge"
                }, 
                {
                    "count": 9, 
                    "name": "rock"
                }, 
                {
                    "count": 1, 
                    "name": "acoustic rock"
                }, 
                {
                    "count": 1, 
                    "name": "noise rock"
                }, 
                {
                    "count": 0, 
                    "name": "nirvana"
                }, 
                {
                    "count": 5, 
                    "name": "american"
                }
            ], 
            "type": "Group"
        }, 
        {
            "area": {
                "id": "8a754a16-0027-3a29-b6d7-2b40ea0481ed", 
                "name": "United Kingdom", 
                "sort-name": "United Kingdom"
            }, 
            "begin-area": {
                "id": "f03d09b3-39dc-4083-afd6-159e3f0d462f", 
                "name": "London", 
                "sort-name": "London"
            }, 
            "country": "GB", 
            "disambiguation": "60s band from the UK", 
            "id": "9282c8b4-ca0b-4c6b-b7e3-4f7762dfc4d6", 
            "life-span": {
                "begin": "1967", 
                "ended": null
            }, 
            "name": "Nirvana", 
            "score": "100", 
            "sort-name": "Nirvana", 
            "tags": [
                {
                    "count": 1, 
                    "name": "rock"
                }, 
                {
                    "count": 1, 
                    "name": "pop"
                }, 
                {
                    "count": 1, 
                    "name": "progressive rock"
                }, 
                {
                    "count": 1, 
                    "name": "orchestral"
                }, 
                {
                    "count": 1, 
                    "name": "british"
                }, 
                {
                    "count": 1, 
                    "name": "power pop"
                }, 
                {
                    "count": 1, 
                    "name": "psychedelic rock"
                }, 
                {
                    "count": 1, 
                    "name": "soft rock"
                }, 
                {
                    "count": 1, 
                    "name": "symphonic rock"
                }, 
                {
                    "count": 1, 
                    "name": "english"
                }
            ], 
            "type": "Group"
        }, 
        {
            "area": {
                "id": "6a264f94-6ff1-30b1-9a81-41f7bfabd616", 
                "name": "Finland", 
                "sort-name": "Finland"
            }, 
            "country": "FI", 
            "disambiguation": "Early 1980's Finnish punk band", 
            "id": "85af0709-95db-4fbc-801a-120e9f4766d0", 
            "life-span": {
                "ended": null
            }, 
            "name": "Nirvana", 
            "score": "100", 
            "sort-name": "Nirvana", 
            "tags": [
                {
                    "count": 1, 
                    "name": "punk"
                }, 
                {
                    "count": 1, 
                    "name": "finland"
                }
            ], 
            "type": "Group"
        }, 
        {
            "disambiguation": "founded in 1987 by a Michael Jackson double/imitator", 
            "id": "3aa878c0-224b-41e5-abd1-63be359d2bca", 
            "life-span": {
                "begin": "1987", 
                "ended": null
            }, 
            "name": "Nirvana", 
            "score": "100", 
            "sort-name": "Nirvana"
        }, 
        {
            "disambiguation": "French band from Martigues, activ during the 70s.", 
            "id": "c49d69dc-e008-47cf-b5ff-160fafb1fe1f", 
            "life-span": {
                "ended": null
            }, 
            "name": "Nirvana", 
            "score": "100", 
            "sort-name": "Nirvana"
        }, 
        {
            "id": "b305320e-c158-43f4-b5be-4450e2f99a32", 
            "life-span": {
                "ended": null
            }, 
            "name": "El Nirvana", 
            "score": "62", 
            "sort-name": "Nirvana, El"
        }, 
        {
            "aliases": [
                {
                    "begin-date": null, 
                    "end-date": null, 
                    "locale": null, 
                    "name": "Nirvana", 
                    "primary": null, 
                    "sort-name": "Nirvana", 
                    "type": null
                }, 
                {
                    "begin-date": null, 
                    "end-date": null, 
                    "locale": null, 
                    "name": "Prophet 2002", 
                    "primary": null, 
                    "sort-name": "Prophet 2002", 
                    "type": null
                }
            ], 
            "area": {
                "id": "23d10872-f5ae-3f0c-bf55-332788a16ecb", 
                "name": "Sweden", 
                "sort-name": "Sweden"
            }, 
            "country": "SE", 
            "disambiguation": "Swedish death metal band", 
            "id": "f2dfdff9-3862-4be0-bf85-9c833fa3059e", 
            "life-span": {
                "begin": "1988", 
                "ended": null
            }, 
            "name": "Nirvana 2002", 
            "score": "62", 
            "sort-name": "Nirvana 2002", 
            "type": "Group"
        }, 
        {
            "id": "329c04ae-3b73-4ca3-996f-75608ab1befb", 
            "life-span": {
                "ended": null
            }, 
            "name": "Nirvana Singh", 
            "score": "62", 
            "sort-name": "Singh, Nirvana", 
            "type": "Person"
        }, 
        {
            "area": {
                "id": "489ce91b-6658-3307-9877-795b68554c98", 
                "name": "United States", 
                "sort-name": "United States"
            }, 
            "country": "US", 
            "id": "c3a64a25-251b-4d03-afba-1471440245b8", 
            "life-span": {
                "begin": "2009", 
                "ended": null
            }, 
            "name": "Approaching Nirvana", 
            "score": "62", 
            "sort-name": "Approaching Nirvana", 
            "type": "Group"
        }, 
        {
            "area": {
                "id": "489ce91b-6658-3307-9877-795b68554c98", 
                "name": "United States", 
                "sort-name": "United States"
            }, 
            "country": "US", 
            "gender": "female", 
            "id": "206419e0-3a7a-49ce-8437-4e757767d02b", 
            "life-span": {
                "ended": null
            }, 
            "name": "Nirvana Savoury", 
            "score": "62", 
            "sort-name": "Savoury, Nirvana", 
            "type": "Person"
        }, 
        {
            "id": "86f9ae24-ba2a-4d55-9275-0b89b85f6e3a", 
            "life-span": {
                "ended": null
            }, 
            "name": "Weed Nirvana", 
            "score": "62", 
            "sort-name": "Weed Nirvana"
        }, 
        {
            "area": {
                "id": "e8ad73e9-9e7f-41c4-a395-6e29260ff1df", 
                "name": "Graz", 
                "sort-name": "Graz"
            }, 
            "begin-area": {
                "id": "e8ad73e9-9e7f-41c4-a395-6e29260ff1df", 
                "name": "Graz", 
                "sort-name": "Graz"
            }, 
            "disambiguation": "Nirvana-Coverband", 
            "id": "46d8dae4-abec-438b-9c62-a3dbb2aaa1b7", 
            "life-span": {
                "begin": "2000", 
                "ended": null
            }, 
            "name": "Nirvana Teen Spirit", 
            "score": "50", 
            "sort-name": "Nirvana Teen Spirit", 
            "type": "Group"
        }, 
        {
            "area": {
                "id": "c621114d-73cc-4832-8afe-f13dc261e5af", 
                "name": "Gatineau", 
                "sort-name": "Gatineau"
            }, 
            "begin-area": {
                "id": "c621114d-73cc-4832-8afe-f13dc261e5af", 
                "name": "Gatineau", 
                "sort-name": "Gatineau"
            }, 
            "id": "02c4e6bb-7b7a-4686-8c23-df01bfd42b0e", 
            "life-span": {
                "begin": "2012-04-05", 
                "ended": null
            }, 
            "name": "Sappy Nirvana Tribute", 
            "score": "50", 
            "sort-name": "Sappy Nirvana Tribute", 
            "type": "Group"
        }, 
        {
            "id": "e1388435-f80d-434a-9980-f1c9f5aa9b90", 
            "life-span": {
                "ended": null
            }, 
            "name": "Nirvana Sitar & String Group", 
            "score": "43", 
            "sort-name": "Nirvana Sitar & String Group"
        }
    ], 
    "count": 14, 
    "created": "2016-08-19T13:37:44.018Z", 
    "offset": 0
}

ARTIST:
{
    "area": {
        "id": "8a754a16-0027-3a29-b6d7-2b40ea0481ed", 
        "name": "United Kingdom", 
        "sort-name": "United Kingdom"
    }, 
    "begin-area": {
        "id": "f03d09b3-39dc-4083-afd6-159e3f0d462f", 
        "name": "London", 
        "sort-name": "London"
    }, 
    "country": "GB", 
    "disambiguation": "60s band from the UK", 
    "id": "9282c8b4-ca0b-4c6b-b7e3-4f7762dfc4d6", 
    "life-span": {
        "begin": "1967", 
        "ended": null
    }, 
    "name": "Nirvana", 
    "score": "100", 
    "sort-name": "Nirvana", 
    "tags": [
        {
            "count": 1, 
            "name": "rock"
        }, 
        {
            "count": 1, 
            "name": "pop"
        }, 
        {
            "count": 1, 
            "name": "progressive rock"
        }, 
        {
            "count": 1, 
            "name": "orchestral"
        }, 
        {
            "count": 1, 
            "name": "british"
        }, 
        {
            "count": 1, 
            "name": "power pop"
        }, 
        {
            "count": 1, 
            "name": "psychedelic rock"
        }, 
        {
            "count": 1, 
            "name": "soft rock"
        }, 
        {
            "count": 1, 
            "name": "symphonic rock"
        }, 
        {
            "count": 1, 
            "name": "english"
        }
    ], 
    "type": "Group"
}
requesting http://musicbrainz.org/ws/2/artist/9282c8b4-ca0b-4c6b-b7e3-4f7762dfc4d6?fmt=json&inc=releases

ONE RELEASE:
{
  "barcode": null, 
  "country": "GB", 
  "date": "1969", 
  "disambiguation": "", 
  "id": "0b44cb36-550a-491d-bfd9-8751271f9de7", 
  "packaging": null, 
  "packaging-id": null, 
  "quality": "normal", 
  "release-events": [
    {
      "area": {
        "disambiguation": "", 
        "id": "8a754a16-0027-3a29-b6d7-2b40ea0481ed", 
        "iso-3166-1-codes": [
          "GB"
        ], 
        "name": "United Kingdom", 
        "sort-name": "United Kingdom"
      }, 
      "date": "1969"
    }
  ], 
  "status": "Official", 
  "status-id": "4e304316-386d-3409-af2e-78857eec5cfe", 
  "text-representation": {
    "language": "eng", 
    "script": "Latn"
  }, 
  "title": "To Markos III"
}

ALL TITLES:
To Markos III
Travelling on a Cloud
Songs Of Love And Praise
Songs of Love and Praise
Songs of Love and Praise
Secret Theatre
The Story of Simon Simopath
Me And My Friend
All of Us
The Story of Simon Simopath
To Markos III
Chemistry
Local Anaesthetic
Orange & Blue
Pentecost Hotel
Black Flower
All of Us

18 - Exploring JSON


In [54]:
def is_group(artist):
    return 'type' in artist and artist['type'].lower() == 'group'

def has_same_name(artist, name):
    return artist['name'].lower() == name.lower()

def band_by_name(name):
    results = query_by_name(ARTIST_URL, query_type["simple"], name)
    
    return filter(lambda x: is_group(x) and has_same_name(x, name), results['artists'])

In [55]:
# number of bands with this name
len(band_by_name("FIRST AID KIT"))


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3AFIRST+AID+KIT&fmt=json
Out[55]:
2

In [56]:
# Queen's begin-area name
band_by_name("queen")[0]['begin-area']['name']


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3Aqueen&fmt=json
Out[56]:
u'London'

In [57]:
# Spanish alias for The Beatles
all_aliases = band_by_name('the beatles')[0]['aliases']
filter(lambda x: x['locale'] == 'es', all_aliases)[0]['name']


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3Athe+beatles&fmt=json
Out[57]:
u'Los Beatles'

In [58]:
# disambiguation for the US Nirvana; use .get() since some artists have no 'country' field
filter(lambda x: x.get('country') == 'US', band_by_name('nirvana'))[0]['disambiguation']


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3Anirvana&fmt=json
Out[58]:
u'90s US grunge band'

In [59]:
# When was One Direction formed?
band_by_name('one direction')[0]['life-span']['begin']


requesting http://musicbrainz.org/ws/2/artist/?query=artist%3Aone+direction&fmt=json
Out[59]:
u'2010-07'

Problem Set starts here


Using CSV

Your task is to process the supplied file and use the csv module to extract data from it. The data comes from the NREL (National Renewable Energy Laboratory) website. Each file contains information from one meteorological station, in particular the amount of solar and wind energy for each hour of the day.

Note that the first line of the datafile is neither a data entry nor a header. It is a line describing the data source, and you should extract the name of the station from it.

The data should be returned as a list of lists (not dictionaries). You can use the csv module's reader method to get data in this format. Another useful method is next(), which gets the next line from the iterator. You should only change the parse_file function.

Resources

Data comes from the NREL website. The datafile in this exercise is a small subset of the full file for one of the stations. You can download it from the Downloadables section, or see the full data files for other stations on the National Solar Radiation Data Base.

Documentation on csv.reader on docs.python.org

Documentation on Reader object methods on docs.python.org


In [60]:
import csv
import os

DATA_DIR = ""
DATA_FILE = "745090.csv"


def parse_file(datafile):
    name = ""
    data = []
    with open(datafile, 'rb') as f:
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if i == 0:
                name = row[1]   # the first line describes the data source
            elif i == 1:
                pass            # the second line holds the headers; skip it
            else:
                data.append(row)

    # Do not change the line below
    return name, data
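
An equivalent sketch using the reader's next() method mentioned in the exercise text, consuming the first two lines explicitly instead of checking indices (an alternative, not required by the test):

def parse_file_alt(datafile):
    with open(datafile, 'rb') as f:
        reader = csv.reader(f)
        name = reader.next()[1]   # first line describes the source; station name is field 1
        reader.next()             # second line holds the headers; discard it
        data = list(reader)       # everything else is data rows
    return name, data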

In [61]:
def test1():
    datafile = os.path.join(DATA_DIR, DATA_FILE)
    name, data = parse_file(datafile)

    assert name == "MOUNTAIN VIEW MOFFETT FLD NAS"
    assert data[0][1] == "01:00"
    assert data[2][0] == "01/01/2005"
    assert data[2][5] == "2"

In [62]:
test1()

Excel to CSV

Find the time and value of max load for each of the regions COAST, EAST, FAR_WEST, NORTH, NORTH_C, SOUTHERN, SOUTH_C, and WEST, and write the result out in a CSV file, using the pipe character | as the delimiter.

An example output can be seen in the "example.csv" file.

Resources

See the csv module documentation on how to use different delimiters with csv.writer: http://docs.python.org/2/library/csv.html


In [63]:
import xlrd
import os
import csv

DATA_FILE = "2013_ERCOT_Hourly_Load_Data.xls"
OUT_FILE = "2013_Max_Loads.csv"

def get_max_and_max_date_for_column(sheet, column_index):
    data = data_for_column(sheet, column_index)
    max_data = max(data)
    date = get_date_for_row_containing_value(sheet, data, max_data)
    return max_data, date


def parse_file(datafile):

    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    return {
        'COAST': get_max_and_max_date_for_column(sheet, 1),
        'EAST': get_max_and_max_date_for_column(sheet, 2),
        'FAR_WEST': get_max_and_max_date_for_column(sheet, 3),
        'NORTH': get_max_and_max_date_for_column(sheet, 4),
        'NORTH_C': get_max_and_max_date_for_column(sheet, 5),
        'SOUTHERN': get_max_and_max_date_for_column(sheet, 6),
        'SOUTH_C': get_max_and_max_date_for_column(sheet, 7),
        'WEST': get_max_and_max_date_for_column(sheet, 8)
    }


def save_file(data, filename):
    # build the pipe-delimited text by hand, then write it in one go
    result = ""
    with open(filename, 'w') as f:
        result += "Station|Year|Month|Day|Hour|Max Load\n"
        for key, value in data.iteritems():
            result += "{}|{}|{}|{}|{}|{}\n".format(
                key, value[1][0], value[1][1], value[1][2], value[1][3], value[0])
        result = result.strip("\n")
        f.write(result)
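
save_file above assembles the output string by hand. A sketch of the same output using csv.writer with delimiter='|', as the resource suggests (assuming the same {station: (max_load, date_tuple)} layout; note it leaves a trailing newline after the last row, which the DictReader-based test tolerates):

def save_file_alt(data, filename):
    # let csv.writer handle the pipe delimiter and row formatting
    with open(filename, 'wb') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(['Station', 'Year', 'Month', 'Day', 'Hour', 'Max Load'])
        for station, (max_load, date) in data.iteritems():
            writer.writerow([station, date[0], date[1], date[2], date[3], max_load])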

In [64]:
def test2():
    # open_zip(DATA_FILE)
    data = parse_file(DATA_FILE)
    save_file(data, OUT_FILE)

    number_of_rows = 0
    stations = []

    ans = {'FAR_WEST': {'Max Load': '2281.2722140000024',
                        'Year': '2013',
                        'Month': '6',
                        'Day': '26',
                        'Hour': '17'}}
    correct_stations = ['COAST', 'EAST', 'FAR_WEST', 'NORTH',
                        'NORTH_C', 'SOUTHERN', 'SOUTH_C', 'WEST']
    fields = ['Year', 'Month', 'Day', 'Hour', 'Max Load']

    with open(OUT_FILE) as of:
        csvfile = csv.DictReader(of, delimiter="|")
        for line in csvfile:

            station = line['Station']

            if station == 'FAR_WEST':
                for field in fields:
                    # Check if 'Max Load' is within .1 of answer
                    if field == 'Max Load':

                        max_answer = round(float(ans[station][field]), 1)
                        max_line = round(float(line[field]), 1)

                        assert max_answer == max_line

                    # Otherwise check for equality
                    else:
                        assert ans[station][field] == line[field]

            number_of_rows += 1
            stations.append(station)

        # Output should be 8 lines not including header
        assert number_of_rows == 8

        # Check Station Names
        assert set(stations) == set(correct_stations)

In [65]:
test2()

Wrangling JSON

This exercise shows some important concepts that you should be aware of:

  • using the codecs module to write unicode files (see the sketch after this list)
  • using authentication with web APIs
  • using an offset when accessing web APIs
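
For the first point, a minimal sketch of why codecs matters in Python 2 (the filename here is hypothetical): json.dumps with ensure_ascii=False returns a unicode string, which a file opened with plain open() cannot reliably write:

import codecs
import json

doc = {"tag": u"am\u00e9ricain"}   # non-ASCII value, like the MusicBrainz tags above
with codecs.open("sample.json", encoding='utf-8', mode='w') as f:
    f.write(json.dumps(doc, ensure_ascii=False))   # codecs encodes to UTF-8 for us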

To run this code locally you have to register at the NYTimes developer site and get your own API key. You will be able to complete this exercise in our UI without doing so, as we have provided a sample result.

Your task is to process the saved file that represents the most popular articles (by view count) from the last day, and return the following data:

  • a list of dictionaries, where each dictionary's key is the article's section and its value is the article's title
  • a list of URLs for all media entries with "format": "Standard Thumbnail"

All your changes should be in the article_overview function. The rest of the functions are provided for your convenience, in case you want to access the API yourself.

If you want to know more, or query the site by yourself, please read the NYTimes Developer Documentation for the Most Popular API and apply for your own API Key for NY Times.


In [66]:
import json
import codecs
import requests

URL_MAIN = "http://api.nytimes.com/svc/"
URL_POPULAR = URL_MAIN + "mostpopular/v2/"
API_KEY = { "popular": "",
            "article": ""}


def get_from_file(kind, period):
    filename = "popular-{0}-{1}.json".format(kind, period)
    with open(filename, "r") as f:
        return json.loads(f.read())


def article_overview(kind, period):
    data = get_from_file(kind, period)
    titles = []
    urls = []
    for row in data:
        titles.append({row['section']: row['title']})
        for media in row['media']:
            for metadata in media['media-metadata']:
                if metadata['format'] == 'Standard Thumbnail':
                    urls.append(metadata['url'])

    return titles, urls


def query_site(url, target, offset):
    # This will set up the query with the API key and offset.
    # Web services often use an offset parameter to return data in small chunks;
    # NYTimes returns 20 articles per request, so if you want the next 20
    # you have to provide the offset parameter.
    if API_KEY["popular"] == "" or API_KEY["article"] == "":
        print "You need to register for a NYTimes Developer account to run this program."
        print "See Instructor notes for information."
        return False
    params = {"api-key": API_KEY[target], "offset": offset}
    r = requests.get(url, params = params)

    if r.status_code == requests.codes.ok:
        return r.json()
    else:
        r.raise_for_status()


def get_popular(url, kind, days, section="all-sections", offset=0):
    # This function will construct the query according to the requirements of the site
    # and return the data, or print an error message if called incorrectly
    if days not in [1, 7, 30]:
        print "Time period can be 1, 7, or 30 days only"
        return False
    if kind not in ["viewed", "shared", "emailed"]:
        print "kind can be only one of viewed/shared/emailed"
        return False

    url += "most{0}/{1}/{2}.json".format(kind, section, days)
    data = query_site(url, "popular", offset)

    return data


def save_file(kind, period):
    # This will process all results by calling the API repeatedly with
    # increasing offset values, combine the data, and write everything to a file.
    # first request just to learn the total number of results
    data = get_popular(URL_POPULAR, kind, period)
    num_results = data["num_results"]
    full_data = []
    with codecs.open("popular-{0}-{1}.json".format(kind, period), encoding='utf-8', mode='w') as v:
        for offset in range(0, num_results, 20):
            data = get_popular(URL_POPULAR, kind, period, offset=offset)
            full_data += data["results"]

        v.write(json.dumps(full_data, indent=2))

In [67]:
def test3():
    titles, urls = article_overview("viewed", 1)
    assert len(titles) == 20
    assert len(urls) == 30
    assert titles[2] == {'Opinion': 'Professors, We Need You!'}
    assert urls[20] == 'http://graphics8.nytimes.com/images/2014/02/17/sports/ICEDANCE/ICEDANCE-thumbStandard.jpg'

In [68]:
test3()