Getting Started with Data

Example code blocks showing how to load various data types into Python

Designed for Python 3.5+


In [21]:
#pandas is commonly imported as pd
import pandas as pd

#We'll import the other libraries as needed

.csv — Comma-Separated Values

Data type with the least overhead (aka file size) and very common

Std Python

Conceptually, the basics of a CSV are just like splitting a string by commas:


In [24]:
print("Split on comma as strings")
csv_row = '1,2.0,Three point five,True'
print(csv_row.split(','))

print("\nSplit on comma and converted to ints")
csv_row = '1,2,3,4,5,6,7,8,9'
print([int(c) for c in csv_row.split(',')])


Split on comma as strings
['1', '2.0', 'Three point five', 'True']

Split on comma and converted to ints
[1, 2, 3, 4, 5, 6, 7, 8, 9]

But there are more than just strings or rows with a single data type. We can use the csv library to handle edge cases and memory management for us.


In [11]:
import csv
with open('data/volcanoes.csv') as fin:
    csv_data = csv.reader(fin)
    #csv.reader is a generator
    #for row in csv_data: do stuff
    print(csv_data.__next__())
    print()
    print(csv_data.__next__())


['Number', 'Name', 'Country', 'Region', 'Type', 'Activity Evidence', 'Last Known Eruption', 'Latitude', 'Longitude', 'Elevation (Meters)', 'Dominant Rock Type', 'Tectonic Setting']

['210010', 'West Eifel Volcanic Field', 'Germany', 'Mediterranean and Western Asia', 'Maar(s)', 'Eruption Dated', '8300 BCE', '50.17', '6.85', '600', 'Foidite', 'Rift Zone / Continental Crust (>25 km)']

In [16]:
with open('data/volcanoes.csv') as fin:
    csv_dict_data = csv.DictReader(fin)
    for row in csv_dict_data:
        print(row)
        print()
        print('Name:', row['Name'], '\nType:', row['Type'])
        break


OrderedDict([('Number', '210010'), ('Name', 'West Eifel Volcanic Field'), ('Country', 'Germany'), ('Region', 'Mediterranean and Western Asia'), ('Type', 'Maar(s)'), ('Activity Evidence', 'Eruption Dated'), ('Last Known Eruption', '8300 BCE'), ('Latitude', '50.17'), ('Longitude', '6.85'), ('Elevation (Meters)', '600'), ('Dominant Rock Type', 'Foidite'), ('Tectonic Setting', 'Rift Zone / Continental Crust (>25 km)')])

Name: West Eifel Volcanic Field 
Type: Maar(s)

Pandas

Because CSV matches the structure of pandas.DataFrame, it's really easy to import the data.


In [31]:
csv_df = pd.read_csv(open('data/volcanoes.csv'))
csv_df.head()


Out[31]:
Number Name Country Region Type Activity Evidence Last Known Eruption Latitude Longitude Elevation (Meters) Dominant Rock Type Tectonic Setting
0 210010 West Eifel Volcanic Field Germany Mediterranean and Western Asia Maar(s) Eruption Dated 8300 BCE 50.170 6.85 600 Foidite Rift Zone / Continental Crust (>25 km)
1 210020 Chaine des Puys France Mediterranean and Western Asia Lava dome(s) Eruption Dated 4040 BCE 45.775 2.97 1464 Basalt / Picro-Basalt Rift Zone / Continental Crust (>25 km)
2 210030 Olot Volcanic Field Spain Mediterranean and Western Asia Pyroclastic cone(s) Evidence Credible Unknown 42.170 2.53 893 Trachybasalt / Tephrite Basanite Intraplate / Continental Crust (>25 km)
3 210040 Calatrava Volcanic Field Spain Mediterranean and Western Asia Pyroclastic cone(s) Eruption Dated 3600 BCE 38.870 -4.02 1117 Basalt / Picro-Basalt Intraplate / Continental Crust (>25 km)
4 211001 Larderello Italy Mediterranean and Western Asia Explosion crater(s) Eruption Observed 1282 CE 43.250 10.87 500 No Data Subduction Zone / Continental Crust (>25 km)

.xls .xlsx — Excel Spreadsheet

Excel spreadsheets are more complicated than CSV (ex they can contain multiple datasets), but they still follow the 2D array format of a CSV and DataFrames

Std Python


In [19]:
import openpyxl
# You can also use xlrd, xlsxwriter, and a host of others.
# See http://www.python-excel.org/ for all your options.
wb = openpyxl.Workbook()
ws = wb.create_sheet("NewSheet")

ws['A1'] = "Name"
ws['A2'] = "Michael"
ws['B1'] = "fav_color"
ws['B2'] = "Purple"

wb.save("data/ExcelData.xlsx")


excel_row = 'data/ExcelData.xlsx'
data = openpyxl.load_workbook(excel_row)
for sheet in data:
    for row in sheet:
        print(row)
        print(*[cell.value for cell in row])


(<Cell Sheet.A1>,)
None
(<Cell NewSheet.A1>, <Cell NewSheet.B1>)
Name fav_color
(<Cell NewSheet.A2>, <Cell NewSheet.B2>)
Michael Purple

Pandas


In [30]:
file = "data/volcanoes.xlsx"

# Pandas excel loader is built off of several other excel readers,
# such as openXLRD and xlsxwriter

# this is reflected in how many ways there are to read in an excel file.
basicLoad = pd.read_excel(file)
alternateLoad = pd.ExcelFile(file)

basicLoad.head()


Out[30]:
Number Name Country Region Type Activity Evidence Last Known Eruption Latitude Longitude Elevation (Meters) Dominant Rock Type Tectonic Setting
0 210010 West Eifel Volcanic Field Germany Mediterranean and Western Asia Maar(s) Eruption Dated 8300 BCE 50.170 6.85 600 Foidite Rift Zone / Continental Crust (>25 km)
1 210020 Chaine des Puys France Mediterranean and Western Asia Lava dome(s) Eruption Dated 4040 BCE 45.775 2.97 1464 Basalt / Picro-Basalt Rift Zone / Continental Crust (>25 km)
2 210030 Olot Volcanic Field Spain Mediterranean and Western Asia Pyroclastic cone(s) Evidence Credible Unknown 42.170 2.53 893 Trachybasalt / Tephrite Basanite Intraplate / Continental Crust (>25 km)
3 210040 Calatrava Volcanic Field Spain Mediterranean and Western Asia Pyroclastic cone(s) Eruption Dated 3600 BCE 38.870 -4.02 1117 Basalt / Picro-Basalt Intraplate / Continental Crust (>25 km)
4 211001 Larderello Italy Mediterranean and Western Asia Explosion crater(s) Eruption Observed 1282 CE 43.250 10.87 500 No Data Subduction Zone / Continental Crust (>25 km)

In [33]:
# read_excel defaults to read the first sheet in an excel book
# For a comprehensive list of parameters for read_excel, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
# you can specify the sheets you want by name
sheetByName = pd.read_excel(file, sheetname="volcanoes")
# by index
sheetsByIndex = pd.read_excel(file, sheetname=[0])

# if you don't know which sheets you want, you can specify header=None and
# all sheets will be loaded in a nested structure:
allSheetsByHeader = pd.read_excel(file, header=None)
allSheetsBySheets = pd.read_excel(file, sheetname=0)

# You can skip rows or columns
cols = ['Number', 'Name', 'Country', 'Region', 'Type', 'Activity Evidence',
        'Last Known Eruption', 'Latitude', 'Longitude', 'Elevation (Meters)',
        'Dominant Rock Type', 'Tectonic Setting']
subset = pd.read_excel(file, skip_footer=5, skiprows=2, names=cols)

.json — JavaScript Object Notation

JSON is a common web and API communication format and easy to read. It also happens to be copy/paste compatible as native Python code (with some restrictions the other way around)

Std Python


In [69]:
import json

color_data = [
    {"name": "Michael", "fav_color": "purple"},
    {"name": "Casey", "fav_color": "turquoise"}
]

#load, loads, dump, and dumps are common library APIs
#load/dump tkae file objects while loads/dumps take strings
print(json.dumps(color_data))


[{"name": "Michael", "fav_color": "purple"}, {"name": "Casey", "fav_color": "turquoise"}]

In [54]:
json_data = json.load(open('data/volcanoes.json'))
print('Rows:', len(json_data))


Rows: 1508

Pandas


In [52]:
json_df = pd.read_json('data/volcanoes.json')
json_df.head()


Out[52]:
Activity Evidence Country Dominant Rock Type Elevation (Meters) Last Known Eruption Latitude Longitude Name Number Region Tectonic Setting Type
0 Eruption Dated Germany Foidite 600 8300 BCE 50.170 6.85 West Eifel Volcanic Field 210010 Mediterranean and Western Asia Rift Zone / Continental Crust (>25 km) Maar(s)
1 Eruption Dated France Basalt / Picro-Basalt 1464 4040 BCE 45.775 2.97 Chaine des Puys 210020 Mediterranean and Western Asia Rift Zone / Continental Crust (>25 km) Lava dome(s)
2 Evidence Credible Spain Trachybasalt / Tephrite Basanite 893 Unknown 42.170 2.53 Olot Volcanic Field 210030 Mediterranean and Western Asia Intraplate / Continental Crust (>25 km) Pyroclastic cone(s)
3 Eruption Dated Spain Basalt / Picro-Basalt 1117 3600 BCE 38.870 -4.02 Calatrava Volcanic Field 210040 Mediterranean and Western Asia Intraplate / Continental Crust (>25 km) Pyroclastic cone(s)
4 Eruption Observed Italy No Data 500 1282 CE 43.250 10.87 Larderello 211001 Mediterranean and Western Asia Subduction Zone / Continental Crust (>25 km) Explosion crater(s)

.xml — eXtensible Markup Language

XML looks and acts just like HTML. It is a popular web format but is being superseded by JSON

Std Python


In [60]:
import xml.etree.ElementTree as ET

tree = ET.parse('data/volcano.xml')
#ET.fromstring(some_xml_string)
root = tree.getroot()

for item in root:
    print(item.tag, ' -- ', item.text)


Number  --  210010
Name  --  West Eifel Volcanic Field
Country  --  Germany
Region  --  Mediterranean and Western Asia
Type  --  Maar(s)
Activity_Evidence  --  Eruption Dated
Last_Known_Eruption  --  8300 BCE
Latitude  --  50.17
Longitude  --  6.85
key  --  600
Dominant_Rock_Type  --  Foidite
Tectonic_Setting  --  Rift Zone / Continental Crust (>25 km)

Pandas

There is no native way to read or write XML with pandas. There are plenty of code snippets out there if you absolutely need to

.yml .yaml — Yet Another Markup Language

YAML has less overhead than JSON and can reference other objects within the same file to save space

Std Python


In [67]:
import yaml

yaml_data = yaml.load(open('data/volcanoes.yml'))
print('Rows:', len(yaml_data))


Rows: 1508

Pandas

There's not a native way to read in YAML data to Pandas, but you could use it's JSON lib to do so


In [68]:
yaml_df = pd.io.json.json_normalize(yaml.load(open('data/volcanoes.yml')))
yaml_df.head()


Out[68]:
Activity Evidence Country Dominant Rock Type Elevation (Meters) Last Known Eruption Latitude Longitude Name Number Region Tectonic Setting Type
0 Eruption Dated Germany Foidite 600 8300 BCE 50.17 6.85 West Eifel Volcanic Field 210010 Mediterranean and Western Asia Rift Zone / Continental Crust (>25 km) Maar(s)
1 Eruption Dated France Basalt / Picro-Basalt 1464 4040 BCE 45.775 2.97 Chaine des Puys 210020 Mediterranean and Western Asia Rift Zone / Continental Crust (>25 km) Lava dome(s)
2 Evidence Credible Spain Trachybasalt / Tephrite Basanite 893 Unknown 42.17 2.53 Olot Volcanic Field 210030 Mediterranean and Western Asia Intraplate / Continental Crust (>25 km) Pyroclastic cone(s)
3 Eruption Dated Spain Basalt / Picro-Basalt 1117 3600 BCE 38.87 -4.02 Calatrava Volcanic Field 210040 Mediterranean and Western Asia Intraplate / Continental Crust (>25 km) Pyroclastic cone(s)
4 Eruption Observed Italy No Data 500 1282 CE 43.25 10.87 Larderello 211001 Mediterranean and Western Asia Subduction Zone / Continental Crust (>25 km) Explosion crater(s)

.p .pkl — Pickle

Pickle files allow us to save and restore binary copies of native language objects (int, list, DataFrame, custom class, etc)

Std Python


In [72]:
import pickle

print('Raw hex format:\n', pickle.dumps(color_data))
pickle.dump(color_data, open('data/colors.pkl', 'wb'))
new_data = pickle.load(open('data/colors.pkl', 'rb'))

print("\nSame? :", color_data == new_data)


Raw hex format:
 b'\x80\x03]q\x00(}q\x01(X\x04\x00\x00\x00nameq\x02X\x07\x00\x00\x00Michaelq\x03X\t\x00\x00\x00fav_colorq\x04X\x06\x00\x00\x00purpleq\x05u}q\x06(h\x02X\x05\x00\x00\x00Caseyq\x07h\x04X\t\x00\x00\x00turquoiseq\x08ue.'

Same? : True

Pandas


In [46]:
color_df = pd.DataFrame(pd.read_pickle('data/colors.pkl'))
color_df


Out[46]:
fav_color name
0 purple Michael
1 turquoise Casey

In [47]:
color_df.to_pickle('data/color_df.pkl')

Bonus

Here are some scripts we used to create the different datasets


In [37]:
#Create full JSON from CSV
with open('data/volcanoes.csv') as fin:
    data = [dict(row) for row in csv.DictReader(fin)]
    json.dump(data, open('data/volcanoes.json', 'w'))

In [ ]:
#Create single element XML from the JSON data
#This library is not included
from dicttoxml import dicttoxml
print(dicttoxml(json_data[0]), file=open('data/volcano.xml'))

In [64]:
#Create full YAML from the JSON data
yaml.dump(json_data, open('data/volcanoes.yml', 'w'))