Extracting & Cleaning Data with Python

$~$

“I Have a Data File, Now What?”

$~$

Naomi Ceder

  • naomi@naomiceder.tech
  • @NaomiCeder
  • projects.naomiceder.tech

My main qualification? I'm quite old...

Who am I?

  • Python since 2001
  • Author of Quick Python Book, 2nd Edition (3rd edition late 2017)
  • Python Software Foundation, Vice Chair of Board of Directors
  • Python Training (ask your company to hire me)
  • Long time data grunt - student records, product info, transactions, etc.

About you...

  • Do you have data cleaning issues?
  • What problems are you trying to solve?

Workshop Philosophy

“All happy families are alike; each unhappy family is unhappy in its own way.” - Tolstoy

“What could possibly go wrong?”

“If anything can go wrong, it will” - Murphy

Doing things the hard way - can't we just use pandas?

If things work, you can use high-level tools and never care... but what about when they don't?

Pandas objects also carry additional memory overhead.

I need you to...

  • Ask questions
  • Try the samples
  • Raise issues
  • Stop me if I'm being boring or irrelevant

Listening to me drone on will be boring... for all of us.

Workshop plan

  • Introduction
    • Tools
    • Getting Help
    • Debugging
  • Sample problems - code and discussion (depending on interest)

    • Plain text - word counting Moby Dick
    • Fixed width
    • CSV
    • JSON
    • XML
    • Excel
    • HTML/Scraping
    • Binary

Tools

Getting help

  • Python - docs.python.org
  • Library documentation
  • dir() and help() from Python shell

Python.org documentation


In [ ]:
# Using dir() and help()
import pandas
print(dir(pandas))     # list the names the pandas module defines
help(pandas.read_csv)  # show the docstring for one function

Debugging

Use print a lot

  • simple
  • fast
  • “Did the code get here, and what was x when it did?”
  • Not so good for edge cases in loops, large structures, etc.
  • a bit awkward to clean up afterwards

In [ ]:
for x in range(10):
    # do various things here... 
    y = x*x
    print("y: ", y)

logging


In [ ]:
import logging

# create the logger
logger = logging.getLogger('my_process')
logger.setLevel(logging.DEBUG)

# set up file for debug level messages
file_handler = logging.FileHandler('process.log')
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)

# setup console for errors only
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.ERROR)
logger.addHandler(console_handler)

logger.debug("This goes only to the file")
logger.error("This only goes to the console and the file")

In [ ]:
print(open('process.log').read())

Python (or IDE) debugger


In [ ]:
import pdb

for x in range(10):
    # do various things here... 
    y = x*x
    pdb.set_trace()

Opening and reading files

Opening

  • encoding and errors
  • binary vs text
  • universal newline
  • "with" context handler

Binary files

  • open as binary
  • bytes, not strings
  • struct module
  • "endianness" or byte order
  • The format specifier passed to unpack() must match the one used with pack()

In [ ]:
import sys
sys.byteorder

In [ ]:
# example of binary files
import struct
answer = 42
month = 5
day = 6

# pack 3 ints into binary
buffer = struct.pack("III", answer, month, day)
print("Buffer as bytes:", buffer)

# write to file in binary mode
open("test_binary", "wb").write(buffer)
print("Decoded to string :", buffer.decode())
print("Unpacked to tuple:", struct.unpack("III", buffer))

# read from file in binary mode
buffer2 = open("test_binary", "rb").read()
print("Read from file:", buffer2)
print(int(buffer[0]))   # indexing bytes in Python 3 gives an int...
print(buffer[0])        # ...so both of these print 42

strings vs. bytes

  • no difference in Python 2
    • series of bytes, but treated as strings
    • conversion needed for unicode
  • big difference in Python 3
    • bytes - series of bytes, but really treated as bytes (integers 0-255)
    • string - Unicode by default
    • conversion needed between the two, bytes are not a string!

In [ ]:
b = [0, 9, 32, 48, 65, 66, 67]
b_string = bytes(b)
print(b_string.decode())

Text files

Reading

  • a text file object iterates as a series of lines - no need to call read() explicitly
  • newlines still present
  • context handlers - "with"

In [ ]:
# text file example

open("test", "w", newline='').write("this is\nä\x80\ff\r\ntest\xc3\x28")
print("this is\nä\x80\ff\r\ntest\xc3\x28")

In [ ]:
text = open("test", "r", newline='\r\n').read()
text2 = open("test2", "r").read()
print(text2)
text2

Pitfalls and solutions

  • unknown encoding
    • use binary
    • set errors to ignore, replace, or something else
  • newline issues
    • strip(), rstrip(), then add back
    • newline option
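
For the encoding pitfall, a quick sketch - write a deliberately invalid UTF-8 byte sequence, then read it back with different error handling (the file name is made up):

In [ ]:
# write bytes that are not valid UTF-8, then read them back in text mode
open("bad_encoding.txt", "wb").write(b"caf\xe9 test \xc3\x28\n")

print(open("bad_encoding.txt", encoding="utf-8", errors="ignore").read())    # undecodable bytes dropped
print(open("bad_encoding.txt", encoding="utf-8", errors="replace").read())   # undecodable bytes become U+FFFD
print(open("bad_encoding.txt", "rb").read())                                 # or stay in binary and decide later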

Plain text

  • NLTK
  • data cleaning/normalization

Problem - word count of Moby Dick

  • dictionary
  • collections counter
  • NLTK tokens

In [ ]:
moby = []

with open("moby_dick_01.txt") as ch01:
    for line in ch01:
        moby.append(line)
        
print(moby[:10])
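
One way to get counts from here - a sketch using collections.Counter on the lines just read, with only crude cleanup (lowercase, split on whitespace, strip punctuation); NLTK tokenization would do a better job:

In [ ]:
import string
from collections import Counter

counts = Counter()
for line in moby:
    for word in line.lower().split():
        word = word.strip(string.punctuation)   # crude cleanup
        if word:
            counts[word] += 1

print(counts.most_common(10))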

Data Cleaning - String Fu

  • string methods - upper(), lower(), replace(), find(), index(), in, startswith(), endswith(), etc.
  • str.translate()
  • regular expressions - https://docs.python.org/3/howto/regex.html
    ‘Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.’ - Jamie Zawinski
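
A small sketch of those tools on a made-up string:

In [ ]:
import re

line = "  Temp: 23.4C;  Humidity: 61%  "

print(line.strip().lower())                          # whitespace and case normalization
print(line.replace(";", ","))                        # simple substitution
print(line.translate(str.maketrans("", "", "%:")))   # delete a set of characters
print(re.findall(r"\d+(?:\.\d+)?", line))            # pull out the numbers with a regex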

In [ ]:
with open("heathrowdata.txt") as LHR_input:
    for row in LHR_input:
        print(row)

In [ ]:
import pandas

# read the whole file into one string this time (the previous cell only printed it)
LHR = open("heathrowdata.txt").read()

lines = LHR.split("\n")
header_1 = lines[0].split()
header_2 = [""] * 2 + lines[1].split()   # second header row is offset by two columns
print(header_2)
header = ["\n".join(x) for x in zip(header_1, header_2)]
print(header)
#df = pandas.read_fwf(LHR)
records = [dict(zip(header, line.split())) for line in lines[2:]]

print(records[2])
df = pandas.DataFrame([x.split() for x in lines])
df2 = pandas.DataFrame(records[2:])
print(df.shape)
print(df2)
help(pandas.read_fwf)

In [ ]:
with open("temp_data_01.txt") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("temp_data_01.csv") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("temp_data_pipes_01.txt") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("Meteorite_Landings.tsv") as input_file:
    for row in input_file:
        print(row)
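
The csv module handles all of these delimiter variations; a sketch with DictReader (this assumes each file's first row is a header):

In [ ]:
import csv

# comma-delimited - the header row supplies the keys
with open("temp_data_01.csv") as input_file:
    for record in csv.DictReader(input_file):
        print(record)

# the same idea works for pipe- and tab-delimited files
with open("temp_data_pipes_01.txt") as input_file:
    for record in csv.DictReader(input_file, delimiter="|"):
        print(record)

with open("Meteorite_Landings.tsv") as input_file:
    for record in csv.DictReader(input_file, delimiter="\t"):
        print(record)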

In [ ]:
with open("london.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("landslide.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
open("test2", "wb").write(b"this,is,a\ntest\x00,null,file")
import csv
for x in csv.reader(open("test2", "r")):
    print(x)

# Cleaning NULL (\x00) bytes from a data file
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace(b'\x00', b''))   # data is bytes, so the replace arguments must be bytes too
fo.close()

# alternative - strip NULs on the fly, where mycsv is an already-open file object
reader = csv.reader(x.replace('\0', '') for x in mycsv)

In [ ]:
with open("chicago.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("mars.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("landslide.xml") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("observations.xml") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("weather_01.xml") as input_file:
    for row in input_file:
        print(row)

JSON files

  • one giant object, vs list of objects

In [ ]:
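import json

# a sketch, not the workshop's worked solution - london.json was previewed above;
# json.load returns one object (a dict or a list), and from there it's ordinary
# dictionary/list handling
with open("london.json") as input_file:
    london = json.load(input_file)

print(type(london))
if isinstance(london, dict):
    print(list(london.keys())[:10])
else:
    print(london[:2])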

XML files

  • xmltodict
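
A sketch using the third-party xmltodict package (assuming it is installed; landslide.xml was previewed above):

In [ ]:
import xmltodict

with open("landslide.xml") as input_file:
    doc = xmltodict.parse(input_file.read())

# xmltodict turns the document into nested dictionaries keyed by tag name
print(list(doc.keys()))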

HTML and Scraping
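
The usual combination is requests plus Beautiful Soup; a minimal sketch, with a placeholder URL:

In [ ]:
import requests
from bs4 import BeautifulSoup

# placeholder URL - substitute the page you actually need
response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)                                    # the page title
print([a.get("href") for a in soup.find_all("a")][:10])     # first few link targets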

Excel files
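
pandas reads Excel directly (via openpyxl/xlrd under the hood); a sketch with a hypothetical file name:

In [ ]:
import pandas

# hypothetical file name - any .xlsx with a header row works the same way
df = pandas.read_excel("temp_data_01.xlsx", sheet_name=0)
print(df.head())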


In [ ]:
import sys

# note: get_size() is defined in the next cell, so run that cell first
print(get_size(LHR))
print(get_size(lines))
print(get_size(records))
print(get_size(df2))
df2.info(memory_usage='deep')

In [ ]:
def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size

Tips and tricks

  • Use command-line (Unix) tools
  • sets for uniqueness
  • dictionaries for matching
  • list and dictionary comprehensions
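
A tiny sketch of the last three ideas on made-up data:

In [ ]:
rows = [("A123", "widget"), ("B456", "gadget"), ("A123", "widget")]

unique_skus = {sku for sku, name in rows}            # sets for uniqueness
print(unique_skus)

name_by_sku = {sku: name for sku, name in rows}      # dictionaries for matching
print(name_by_sku["B456"])

upper_names = [name.upper() for sku, name in rows]   # list comprehension
print(upper_names)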

Optimizations

  • avoid premature optimization
  • processor time is generally cheaper than human time
  • beware of loops - move things out
  • avoid repeated string operations (concatenation, etc.)
  • parallelization and concurrency
  • avoid reading and especially avoid writing to disk (or DB, or virtual memory)
  • divide and conquer (map/reduce)
  • sorting can help (binary search)
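
For the repeated-string-operations point, the usual fix is to collect pieces in a list and join once at the end - a quick sketch:

In [ ]:
parts = []
for i in range(100000):
    parts.append(str(i))      # cheap appends inside the loop...
report = ",".join(parts)      # ...and a single join, instead of report += str(i) each time

print(len(report))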

Very large files

(Example: based on Grainger product feed -> MongoDB, combining 4 files)

  • 4 delimited flat files, unsorted
    • items - ~2 million rows; sku, description, categories, dimensions, compliance, brand, MPN, etc.
    • attributes - 20 million rows; sku, attr_id, attr_name, attr_value
    • alternates/accessories - sku, type, alt_sku
    • cross reference - sku, competitor, alt part number

Testing?

  • Problems with testing over very large data sets
  • Small sample for a sanity check
  • Loud errors
  • Be sparing with exceptions

Getting Data

Where is the data?

  • Database
  • Online
  • Files

Data from a database

  • Control over what fields are extracted and how
  • Control over output format
  • Less likely to need cleaning/normalization

Python tool of choice - SQLAlchemy and/or low level drivers

  • ORM
  • Standardized interface for queries
  • Results as objects

In [ ]:
import sqlalchemy
# SQLAlchemy example
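# a sketch only - the connection URL, table, and query are placeholders
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.db")                    # placeholder database
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM items LIMIT 5"))    # hypothetical table
    for row in result:
        print(row)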

Data from online sources

  • FTP - Python module - ftplib
  • SFTP - FTP (sort of) over SSH - pysftp
  • API - REST interfaces over http/https - requests
  • Scraping - http/https - requests/scrapy/Beautiful Soup/etc.

In [ ]:
#FTP example
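# a sketch only - host, directory, and file name are placeholders
from ftplib import FTP

ftp = FTP("ftp.example.com")
ftp.login()                                   # anonymous login; pass user/password otherwise
ftp.cwd("pub")                                # hypothetical directory
with open("datafile.csv", "wb") as local_file:
    ftp.retrbinary("RETR datafile.csv", local_file.write)
ftp.quit()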

In [ ]:
#SFTP example
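# a sketch only - host, credentials, and paths are placeholders
import pysftp

with pysftp.Connection("sftp.example.com", username="user", password="secret") as sftp:
    sftp.get("remote/datafile.csv", "datafile.csv")    # download one file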

In [ ]:
# API example
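# a sketch only - URL and parameters are placeholders for a real REST endpoint
import requests

response = requests.get("https://api.example.com/records", params={"limit": 10})
response.raise_for_status()           # fail loudly if the request didn't succeed
data = response.json()                # most REST APIs return JSON
print(data)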