Getting Started with Data

Example code blocks showing how to load various data types into Python

Designed for Python 3.5+



In [21]:

    
#pandas is commonly imported as pd
import pandas as pd

#We'll import the other libraries as needed

.csv — Comma-Separated Values

Data type with the least overhead (aka file size) and very common

Std Python

Conceptually, the basics of a CSV are just like splitting a string by commas:



In [24]:

    
print("Split on comma as strings")
csv_row = '1,2.0,Three point five,True'
print(csv_row.split(','))

print("\nSplit on comma and converted to ints")
csv_row = '1,2,3,4,5,6,7,8,9'
print([int(c) for c in csv_row.split(',')])









    



Split on comma as strings
['1', '2.0', 'Three point five', 'True']

Split on comma and converted to ints
[1, 2, 3, 4, 5, 6, 7, 8, 9]

But there are more than just strings or rows with a single data type. We can use the csv library to handle edge cases and memory management for us.



In [11]:

    
import csv
with open('data/volcanoes.csv') as fin:
    csv_data = csv.reader(fin)
    #csv.reader is a generator
    #for row in csv_data: do stuff
    print(csv_data.__next__())
    print()
    print(csv_data.__next__())









    



['Number', 'Name', 'Country', 'Region', 'Type', 'Activity Evidence', 'Last Known Eruption', 'Latitude', 'Longitude', 'Elevation (Meters)', 'Dominant Rock Type', 'Tectonic Setting']

['210010', 'West Eifel Volcanic Field', 'Germany', 'Mediterranean and Western Asia', 'Maar(s)', 'Eruption Dated', '8300 BCE', '50.17', '6.85', '600', 'Foidite', 'Rift Zone / Continental Crust (>25 km)']



In [16]:

    
with open('data/volcanoes.csv') as fin:
    csv_dict_data = csv.DictReader(fin)
    for row in csv_dict_data:
        print(row)
        print()
        print('Name:', row['Name'], '\nType:', row['Type'])
        break









    



OrderedDict([('Number', '210010'), ('Name', 'West Eifel Volcanic Field'), ('Country', 'Germany'), ('Region', 'Mediterranean and Western Asia'), ('Type', 'Maar(s)'), ('Activity Evidence', 'Eruption Dated'), ('Last Known Eruption', '8300 BCE'), ('Latitude', '50.17'), ('Longitude', '6.85'), ('Elevation (Meters)', '600'), ('Dominant Rock Type', 'Foidite'), ('Tectonic Setting', 'Rift Zone / Continental Crust (>25 km)')])

Name: West Eifel Volcanic Field 
Type: Maar(s)

Pandas

Because CSV matches the structure of pandas.DataFrame, it's really easy to import the data.



In [31]:

    
csv_df = pd.read_csv(open('data/volcanoes.csv'))
csv_df.head()









    Out[31]:






  
    
      
      Number
      Name
      Country
      Region
      Type
      Activity Evidence
      Last Known Eruption
      Latitude
      Longitude
      Elevation (Meters)
      Dominant Rock Type
      Tectonic Setting
    
  
  
    
      0
      210010
      West Eifel Volcanic Field
      Germany
      Mediterranean and Western Asia
      Maar(s)
      Eruption Dated
      8300 BCE
      50.170
      6.85
      600
      Foidite
      Rift Zone / Continental Crust (>25 km)
    
    
      1
      210020
      Chaine des Puys
      France
      Mediterranean and Western Asia
      Lava dome(s)
      Eruption Dated
      4040 BCE
      45.775
      2.97
      1464
      Basalt / Picro-Basalt
      Rift Zone / Continental Crust (>25 km)
    
    
      2
      210030
      Olot Volcanic Field
      Spain
      Mediterranean and Western Asia
      Pyroclastic cone(s)
      Evidence Credible
      Unknown
      42.170
      2.53
      893
      Trachybasalt / Tephrite Basanite
      Intraplate / Continental Crust (>25 km)
    
    
      3
      210040
      Calatrava Volcanic Field
      Spain
      Mediterranean and Western Asia
      Pyroclastic cone(s)
      Eruption Dated
      3600 BCE
      38.870
      -4.02
      1117
      Basalt / Picro-Basalt
      Intraplate / Continental Crust (>25 km)
    
    
      4
      211001
      Larderello
      Italy
      Mediterranean and Western Asia
      Explosion crater(s)
      Eruption Observed
      1282 CE
      43.250
      10.87
      500
      No Data
      Subduction Zone / Continental Crust (>25 km)

.xls .xlsx — Excel Spreadsheet

Excel spreadsheets are more complicated than CSV (ex they can contain multiple datasets), but they still follow the 2D array format of a CSV and DataFrames

Std Python



In [19]:

    
import openpyxl
# You can also use xlrd, xlsxwriter, and a host of others.
# See http://www.python-excel.org/ for all your options.
wb = openpyxl.Workbook()
ws = wb.create_sheet("NewSheet")

ws['A1'] = "Name"
ws['A2'] = "Michael"
ws['B1'] = "fav_color"
ws['B2'] = "Purple"

wb.save("data/ExcelData.xlsx")


excel_row = 'data/ExcelData.xlsx'
data = openpyxl.load_workbook(excel_row)
for sheet in data:
    for row in sheet:
        print(row)
        print(*[cell.value for cell in row])









    



(<Cell Sheet.A1>,)
None
(<Cell NewSheet.A1>, <Cell NewSheet.B1>)
Name fav_color
(<Cell NewSheet.A2>, <Cell NewSheet.B2>)
Michael Purple

Pandas



In [30]:

    
file = "data/volcanoes.xlsx"

# Pandas excel loader is built off of several other excel readers,
# such as openXLRD and xlsxwriter

# this is reflected in how many ways there are to read in an excel file.
basicLoad = pd.read_excel(file)
alternateLoad = pd.ExcelFile(file)

basicLoad.head()









    Out[30]:






  
    
      
      Number
      Name
      Country
      Region
      Type
      Activity Evidence
      Last Known Eruption
      Latitude
      Longitude
      Elevation (Meters)
      Dominant Rock Type
      Tectonic Setting
    
  
  
    
      0
      210010
      West Eifel Volcanic Field
      Germany
      Mediterranean and Western Asia
      Maar(s)
      Eruption Dated
      8300 BCE
      50.170
      6.85
      600
      Foidite
      Rift Zone / Continental Crust (>25 km)
    
    
      1
      210020
      Chaine des Puys
      France
      Mediterranean and Western Asia
      Lava dome(s)
      Eruption Dated
      4040 BCE
      45.775
      2.97
      1464
      Basalt / Picro-Basalt
      Rift Zone / Continental Crust (>25 km)
    
    
      2
      210030
      Olot Volcanic Field
      Spain
      Mediterranean and Western Asia
      Pyroclastic cone(s)
      Evidence Credible
      Unknown
      42.170
      2.53
      893
      Trachybasalt / Tephrite Basanite
      Intraplate / Continental Crust (>25 km)
    
    
      3
      210040
      Calatrava Volcanic Field
      Spain
      Mediterranean and Western Asia
      Pyroclastic cone(s)
      Eruption Dated
      3600 BCE
      38.870
      -4.02
      1117
      Basalt / Picro-Basalt
      Intraplate / Continental Crust (>25 km)
    
    
      4
      211001
      Larderello
      Italy
      Mediterranean and Western Asia
      Explosion crater(s)
      Eruption Observed
      1282 CE
      43.250
      10.87
      500
      No Data
      Subduction Zone / Continental Crust (>25 km)



In [33]:

    
# read_excel defaults to read the first sheet in an excel book
# For a comprehensive list of parameters for read_excel, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
# you can specify the sheets you want by name
sheetByName = pd.read_excel(file, sheetname="volcanoes")
# by index
sheetsByIndex = pd.read_excel(file, sheetname=[0])

# if you don't know which sheets you want, you can specify header=None and
# all sheets will be loaded in a nested structure:
allSheetsByHeader = pd.read_excel(file, header=None)
allSheetsBySheets = pd.read_excel(file, sheetname=0)

# You can skip rows or columns
cols = ['Number', 'Name', 'Country', 'Region', 'Type', 'Activity Evidence',
        'Last Known Eruption', 'Latitude', 'Longitude', 'Elevation (Meters)',
        'Dominant Rock Type', 'Tectonic Setting']
subset = pd.read_excel(file, skip_footer=5, skiprows=2, names=cols)

.json — JavaScript Object Notation

JSON is a common web and API communication format and easy to read. It also happens to be copy/paste compatible as native Python code (with some restrictions the other way around)

Std Python



In [69]:

    
import json

color_data = [
    {"name": "Michael", "fav_color": "purple"},
    {"name": "Casey", "fav_color": "turquoise"}
]

#load, loads, dump, and dumps are common library APIs
#load/dump tkae file objects while loads/dumps take strings
print(json.dumps(color_data))









    



[{"name": "Michael", "fav_color": "purple"}, {"name": "Casey", "fav_color": "turquoise"}]



In [54]:

    
json_data = json.load(open('data/volcanoes.json'))
print('Rows:', len(json_data))









    



Rows: 1508

Pandas



In [52]:

    
json_df = pd.read_json('data/volcanoes.json')
json_df.head()









    Out[52]:






  
    
      
      Activity Evidence
      Country
      Dominant Rock Type
      Elevation (Meters)
      Last Known Eruption
      Latitude
      Longitude
      Name
      Number
      Region
      Tectonic Setting
      Type
    
  
  
    
      0
      Eruption Dated
      Germany
      Foidite
      600
      8300 BCE
      50.170
      6.85
      West Eifel Volcanic Field
      210010
      Mediterranean and Western Asia
      Rift Zone / Continental Crust (>25 km)
      Maar(s)
    
    
      1
      Eruption Dated
      France
      Basalt / Picro-Basalt
      1464
      4040 BCE
      45.775
      2.97
      Chaine des Puys
      210020
      Mediterranean and Western Asia
      Rift Zone / Continental Crust (>25 km)
      Lava dome(s)
    
    
      2
      Evidence Credible
      Spain
      Trachybasalt / Tephrite Basanite
      893
      Unknown
      42.170
      2.53
      Olot Volcanic Field
      210030
      Mediterranean and Western Asia
      Intraplate / Continental Crust (>25 km)
      Pyroclastic cone(s)
    
    
      3
      Eruption Dated
      Spain
      Basalt / Picro-Basalt
      1117
      3600 BCE
      38.870
      -4.02
      Calatrava Volcanic Field
      210040
      Mediterranean and Western Asia
      Intraplate / Continental Crust (>25 km)
      Pyroclastic cone(s)
    
    
      4
      Eruption Observed
      Italy
      No Data
      500
      1282 CE
      43.250
      10.87
      Larderello
      211001
      Mediterranean and Western Asia
      Subduction Zone / Continental Crust (>25 km)
      Explosion crater(s)

.xml — eXtensible Markup Language

XML looks and acts just like HTML. It is a popular web format but is being superseded by JSON

Std Python



In [60]:

    
import xml.etree.ElementTree as ET

tree = ET.parse('data/volcano.xml')
#ET.fromstring(some_xml_string)
root = tree.getroot()

for item in root:
    print(item.tag, ' -- ', item.text)









    



Number  --  210010
Name  --  West Eifel Volcanic Field
Country  --  Germany
Region  --  Mediterranean and Western Asia
Type  --  Maar(s)
Activity_Evidence  --  Eruption Dated
Last_Known_Eruption  --  8300 BCE
Latitude  --  50.17
Longitude  --  6.85
key  --  600
Dominant_Rock_Type  --  Foidite
Tectonic_Setting  --  Rift Zone / Continental Crust (>25 km)

Pandas

There is no native way to read or write XML with pandas. There are plenty of code snippets out there if you absolutely need to

.yml .yaml — Yet Another Markup Language

YAML has less overhead than JSON and can reference other objects within the same file to save space

Std Python



In [67]:

    
import yaml

yaml_data = yaml.load(open('data/volcanoes.yml'))
print('Rows:', len(yaml_data))









    



Rows: 1508

Pandas

There's not a native way to read in YAML data to Pandas, but you could use it's JSON lib to do so



In [68]:

    
yaml_df = pd.io.json.json_normalize(yaml.load(open('data/volcanoes.yml')))
yaml_df.head()









    Out[68]:






  
    
      
      Activity Evidence
      Country
      Dominant Rock Type
      Elevation (Meters)
      Last Known Eruption
      Latitude
      Longitude
      Name
      Number
      Region
      Tectonic Setting
      Type
    
  
  
    
      0
      Eruption Dated
      Germany
      Foidite
      600
      8300 BCE
      50.17
      6.85
      West Eifel Volcanic Field
      210010
      Mediterranean and Western Asia
      Rift Zone / Continental Crust (>25 km)
      Maar(s)
    
    
      1
      Eruption Dated
      France
      Basalt / Picro-Basalt
      1464
      4040 BCE
      45.775
      2.97
      Chaine des Puys
      210020
      Mediterranean and Western Asia
      Rift Zone / Continental Crust (>25 km)
      Lava dome(s)
    
    
      2
      Evidence Credible
      Spain
      Trachybasalt / Tephrite Basanite
      893
      Unknown
      42.17
      2.53
      Olot Volcanic Field
      210030
      Mediterranean and Western Asia
      Intraplate / Continental Crust (>25 km)
      Pyroclastic cone(s)
    
    
      3
      Eruption Dated
      Spain
      Basalt / Picro-Basalt
      1117
      3600 BCE
      38.87
      -4.02
      Calatrava Volcanic Field
      210040
      Mediterranean and Western Asia
      Intraplate / Continental Crust (>25 km)
      Pyroclastic cone(s)
    
    
      4
      Eruption Observed
      Italy
      No Data
      500
      1282 CE
      43.25
      10.87
      Larderello
      211001
      Mediterranean and Western Asia
      Subduction Zone / Continental Crust (>25 km)
      Explosion crater(s)

.p .pkl — Pickle

Pickle files allow us to save and restore binary copies of native language objects (int, list, DataFrame, custom class, etc)

Std Python



In [72]:

    
import pickle

print('Raw hex format:\n', pickle.dumps(color_data))
pickle.dump(color_data, open('data/colors.pkl', 'wb'))
new_data = pickle.load(open('data/colors.pkl', 'rb'))

print("\nSame? :", color_data == new_data)









    



Raw hex format:
 b'\x80\x03]q\x00(}q\x01(X\x04\x00\x00\x00nameq\x02X\x07\x00\x00\x00Michaelq\x03X\t\x00\x00\x00fav_colorq\x04X\x06\x00\x00\x00purpleq\x05u}q\x06(h\x02X\x05\x00\x00\x00Caseyq\x07h\x04X\t\x00\x00\x00turquoiseq\x08ue.'

Same? : True

Pandas



In [46]:

    
color_df = pd.DataFrame(pd.read_pickle('data/colors.pkl'))
color_df









    Out[46]:






  
    
      
      fav_color
      name
    
  
  
    
      0
      purple
      Michael
    
    
      1
      turquoise
      Casey



In [47]:

    
color_df.to_pickle('data/color_df.pkl')

Bonus

Here are some scripts we used to create the different datasets



In [37]:

    
#Create full JSON from CSV
with open('data/volcanoes.csv') as fin:
    data = [dict(row) for row in csv.DictReader(fin)]
    json.dump(data, open('data/volcanoes.json', 'w'))



In [ ]:

    
#Create single element XML from the JSON data
#This library is not included
from dicttoxml import dicttoxml
print(dicttoxml(json_data[0]), file=open('data/volcano.xml'))



In [64]:

    
#Create full YAML from the JSON data
yaml.dump(json_data, open('data/volcanoes.yml', 'w'))

	Number	Name	Country	Region	Type	Activity Evidence	Last Known Eruption	Latitude	Longitude	Elevation (Meters)	Dominant Rock Type	Tectonic Setting
0	210010	West Eifel Volcanic Field	Germany	Mediterranean and Western Asia	Maar(s)	Eruption Dated	8300 BCE	50.170	6.85	600	Foidite	Rift Zone / Continental Crust (>25 km)
1	210020	Chaine des Puys	France	Mediterranean and Western Asia	Lava dome(s)	Eruption Dated	4040 BCE	45.775	2.97	1464	Basalt / Picro-Basalt	Rift Zone / Continental Crust (>25 km)
2	210030	Olot Volcanic Field	Spain	Mediterranean and Western Asia	Pyroclastic cone(s)	Evidence Credible	Unknown	42.170	2.53	893	Trachybasalt / Tephrite Basanite	Intraplate / Continental Crust (>25 km)
3	210040	Calatrava Volcanic Field	Spain	Mediterranean and Western Asia	Pyroclastic cone(s)	Eruption Dated	3600 BCE	38.870	-4.02	1117	Basalt / Picro-Basalt	Intraplate / Continental Crust (>25 km)
4	211001	Larderello	Italy	Mediterranean and Western Asia	Explosion crater(s)	Eruption Observed	1282 CE	43.250	10.87	500	No Data	Subduction Zone / Continental Crust (>25 km)