Extracting & Cleaning Data with Python

$~$

“I Have a Data File, Now What?”

$~$

Naomi Ceder

  • naomi@naomiceder.tech
  • @NaomiCeder
  • projects.naomiceder.tech

My main qualification? I'm quite old...

Who am I?

  • Python since 2001
  • Author of Quick Python Book, 2nd Edition (3rd edition late 2017)
  • Python Software Foundation, Vice Chair of Board of Directors
  • Python Training (ask your company to hire me)
  • Long time data grunt - student records, product info, transactions, etc.

About you...

  • Do you have data cleaning issues?
  • What problems are you trying to solve?

Workshop Philosophy

“All happy families are alike; each unhappy family is unhappy in its own way.” - Tolstoy

“What could possibly go wrong?”

“If anything can go wrong, it will” - Murphy

Doing things the hard way - can't we just use pandas?

If things work, you can use high-level tools and never care... but what about when they don't?

Pandas objects also carry additional memory overhead.

I need you to...

  • Ask questions
  • Try the samples
  • Raise issues
  • Stop me if I'm being boring or irrelevant

Listening to me drone on will be boring... for all of us.

Workshop plan

  • Introduction
    • Tools
    • Getting Help
    • Debugging
  • Sample problems - code and discussion (depending on interest)

    • Plain text - word counting Moby Dick
    • Fixed width
    • CSV
    • JSON
    • XML
    • Excel
    • HTML/Scraping
    • Binary

Tools

Getting help

  • Python - docs.python.org
  • Library documentation
  • dir() and help() from Python shell

Python.org documentation


In [ ]:
# Using dir() and help()
import pandas
print(dir(pandas))     # list the names the pandas module defines
help(pandas.read_csv)  # show the docstring for one function

Debugging

Use print a lot

  • simple
  • fast
  • “Did the code get here, and what was x when it did?”
  • Not so good for edge cases in loops, large structures, etc.
  • a bit awkward to clean up afterwards

In [ ]:
for x in range(10):
    # do various things here... 
    y = x*x
    print("y: ", y)

logging


In [ ]:
import logging

# create the logger
logger = logging.getLogger('my_process')
logger.setLevel(logging.DEBUG)

# set up file for debug level messages
file_handler = logging.FileHandler('process.log')
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)

# setup console for errors only
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.ERROR)
logger.addHandler(console_handler)

logger.debug("This goes only to the file")
logger.error("This only goes to the console and the file")

In [ ]:
print(open('process.log').read())

Python (or IDE) debugger


In [ ]:
import pdb

for x in range(10):
    # do various things here... 
    y = x*x
    pdb.set_trace()

Opening and reading files

Opening

  • encoding and errors
  • binary vs text
  • universal newline
  • "with" context handler

Binary files

  • open as binary
  • bytes, not strings
  • struct module
  • "endianness" or byte order
  • The format specifier passed to unpack() must match the one used with pack()

In [ ]:
import sys
sys.byteorder

In [ ]:
# example of binary files
import struct
answer = 42
month = 5
day = 6

# pack 3 ints into binary
buffer = struct.pack("III", answer, month, day)
print("Buffer as bytes:", buffer)

# write to file in binary mode
open("test_binary", "wb").write(buffer)
print("Decoded to string :", buffer.decode())
print("Unpacked to tuple:", struct.unpack("III", buffer))

# read from file in binary mode
buffer2 = open("test_binary", "rb").read()
print("Read from file:", buffer2)
print(int(buffer[0]))   # indexing bytes in Python 3 gives an int...
print(buffer[0])        # ...so both of these print 42

strings vs. bytes

  • no difference in Python 2
    • series of bytes, but treated as strings
    • conversion needed for unicode
  • big difference in Python 3
    • bytes - series of bytes, but really treated as bytes (integers 0-255)
    • string - Unicode by default
    • conversion needed between the two, bytes are not a string!

In [ ]:
b = [0, 9, 32, 48, 65, 66, 67]
b_string = bytes(b)
print(b_string.decode())

Text files

Reading

  • a text file object iterates as a series of lines - no need to call read() explicitly
  • newlines still present
  • context handlers - "with"

In [ ]:
# text file example

open("test", "w", newline='').write("this is\nä\x80\ff\r\ntest\xc3\x28")
print("this is\nä\x80\ff\r\ntest\xc3\x28")

In [ ]:
text = open("test", "r", newline='\r\n').read()
text2 = open("test2", "r").read()
print(text2)
text2

Pitfalls and solutions

  • unknown encoding
    • use binary
    • set errors to ignore, replace, or something else
  • newline issues
    • strip(), rstrip(), then add back
    • newline option
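
For the encoding pitfall, a quick sketch - write a deliberately invalid UTF-8 byte sequence, then read it back with different error handling (the file name is made up):

In [ ]:
# write bytes that are not valid UTF-8, then read them back in text mode
open("bad_encoding.txt", "wb").write(b"caf\xe9 test \xc3\x28\n")

print(open("bad_encoding.txt", encoding="utf-8", errors="ignore").read())    # undecodable bytes dropped
print(open("bad_encoding.txt", encoding="utf-8", errors="replace").read())   # undecodable bytes become U+FFFD
print(open("bad_encoding.txt", "rb").read())                                 # or stay in binary and decide later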

Plain text

  • NLTK
  • data cleaning/normalization

Problem - word count of Moby Dick

  • dictionary
  • collections counter
  • NLTK tokens

In [ ]:
moby = []

with open("moby_dick_01.txt") as ch01:
    for line in ch01:
        moby.append(line)
        
print(moby[:10])
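
One way to get counts from here - a sketch using collections.Counter on the lines just read, with only crude cleanup (lowercase, split on whitespace, strip punctuation); NLTK tokenization would do a better job:

In [ ]:
import string
from collections import Counter

counts = Counter()
for line in moby:
    for word in line.lower().split():
        word = word.strip(string.punctuation)   # crude cleanup
        if word:
            counts[word] += 1

print(counts.most_common(10))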

Data Cleaning - String Fu

  • string methods - upper(), lower(), replace(), find(), index(), in, startswith(), endswith(), etc.
  • str.translate()
  • regular expressions - https://docs.python.org/3/howto/regex.html
    ‘Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.’ - Jamie Zawinski
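
A small sketch of those tools on a made-up string:

In [ ]:
import re

line = "  Temp: 23.4C;  Humidity: 61%  "

print(line.strip().lower())                          # whitespace and case normalization
print(line.replace(";", ","))                        # simple substitution
print(line.translate(str.maketrans("", "", "%:")))   # delete a set of characters
print(re.findall(r"\d+(?:\.\d+)?", line))            # pull out the numbers with a regex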

In [ ]:
with open("heathrowdata.txt") as LHR_input:
    for row in LHR_input:
        print(row)

In [ ]:
import pandas

# read the whole file into one string this time (the previous cell only printed it)
LHR = open("heathrowdata.txt").read()

lines = LHR.split("\n")
header_1 = lines[0].split()
header_2 = [""] * 2 + lines[1].split()   # second header row is offset by two columns
print(header_2)
header = ["\n".join(x) for x in zip(header_1, header_2)]
print(header)
#df = pandas.read_fwf(LHR)
records = [dict(zip(header, line.split())) for line in lines[2:]]

print(records[2])
df = pandas.DataFrame([x.split() for x in lines])
df2 = pandas.DataFrame(records[2:])
print(df.shape)
print(df2)
help(pandas.read_fwf)

In [ ]:
with open("temp_data_01.txt") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("temp_data_01.csv") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("temp_data_pipes_01.txt") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("Meteorite_Landings.tsv") as input_file:
    for row in input_file:
        print(row)
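
The csv module handles all of these delimiter variations; a sketch with DictReader (this assumes each file's first row is a header):

In [ ]:
import csv

# comma-delimited - the header row supplies the keys
with open("temp_data_01.csv") as input_file:
    for record in csv.DictReader(input_file):
        print(record)

# the same idea works for pipe- and tab-delimited files
with open("temp_data_pipes_01.txt") as input_file:
    for record in csv.DictReader(input_file, delimiter="|"):
        print(record)

with open("Meteorite_Landings.tsv") as input_file:
    for record in csv.DictReader(input_file, delimiter="\t"):
        print(record)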

In [ ]:
with open("london.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("landslide.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
open("test2", "wb").write(b"this,is,a\ntest\x00,null,file")
import csv
for x in csv.reader(open("test2", "r")):
    print(x)

# Cleaning NULL (\x00) bytes from a data file
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace(b'\x00', b''))   # data is bytes, so the replace arguments must be bytes too
fo.close()

# alternative - strip NULs on the fly, where mycsv is an already-open file object
reader = csv.reader(x.replace('\0', '') for x in mycsv)

In [ ]:
with open("chicago.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("mars.json") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("landslide.xml") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("observations.xml") as input_file:
    for row in input_file:
        print(row)

In [ ]:
with open("weather_01.xml") as input_file:
    for row in input_file:
        print(row)

JSON files

  • one giant object, vs list of objects

In [ ]:
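import json

# a sketch, not the workshop's worked solution - london.json was previewed above;
# json.load returns one object (a dict or a list), and from there it's ordinary
# dictionary/list handling
with open("london.json") as input_file:
    london = json.load(input_file)

print(type(london))
if isinstance(london, dict):
    print(list(london.keys())[:10])
else:
    print(london[:2])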

XML files

  • xmltodict
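
A sketch using the third-party xmltodict package (assuming it is installed; landslide.xml was previewed above):

In [ ]:
import xmltodict

with open("landslide.xml") as input_file:
    doc = xmltodict.parse(input_file.read())

# xmltodict turns the document into nested dictionaries keyed by tag name
print(list(doc.keys()))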

HTML and Scraping
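
The usual combination is requests plus Beautiful Soup; a minimal sketch, with a placeholder URL:

In [ ]:
import requests
from bs4 import BeautifulSoup

# placeholder URL - substitute the page you actually need
response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)                                    # the page title
print([a.get("href") for a in soup.find_all("a")][:10])     # first few link targets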

Excel files
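
pandas reads Excel directly (via openpyxl/xlrd under the hood); a sketch with a hypothetical file name:

In [ ]:
import pandas

# hypothetical file name - any .xlsx with a header row works the same way
df = pandas.read_excel("temp_data_01.xlsx", sheet_name=0)
print(df.head())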


In [ ]:
import sys

# note: get_size() is defined in the next cell, so run that cell first
print(get_size(LHR))
print(get_size(lines))
print(get_size(records))
print(get_size(df2))
df2.info(memory_usage='deep')

In [ ]:
def get_size(obj, seen=None):
    """Recursively finds size of objects"""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    # Important: mark as seen *before* entering recursion to gracefully handle
    # self-referential objects
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([get_size(v, seen) for v in obj.values()])
        size += sum([get_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([get_size(i, seen) for i in obj])
    return size

Tips and tricks

  • Use command-line (Unix) tools
  • sets for uniqueness
  • dictionaries for matching
  • list and dictionary comprehensions
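
A tiny sketch of the last three ideas on made-up data:

In [ ]:
rows = [("A123", "widget"), ("B456", "gadget"), ("A123", "widget")]

unique_skus = {sku for sku, name in rows}            # sets for uniqueness
print(unique_skus)

name_by_sku = {sku: name for sku, name in rows}      # dictionaries for matching
print(name_by_sku["B456"])

upper_names = [name.upper() for sku, name in rows]   # list comprehension
print(upper_names)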

Optimizations

  • avoid premature optimization
  • processor time is generally cheaper than human time
  • beware of loops - move things out
  • avoid repeated string operations (concatenation, etc.)
  • parallelization and concurrency
  • avoid reading and especially avoid writing to disk (or DB, or virtual memory)
  • divide and conquer (map/reduce)
  • sorting can help (binary search)
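
For the repeated-string-operations point, the usual fix is to collect pieces in a list and join once at the end - a quick sketch:

In [ ]:
parts = []
for i in range(100000):
    parts.append(str(i))      # cheap appends inside the loop...
report = ",".join(parts)      # ...and a single join, instead of report += str(i) each time

print(len(report))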

Very large files

(Example: based on Grainger product feed -> MongoDB, combining 4 files)

  • 4 delimited flat files, unsorted
    • items - ~2 million rows; sku, description, categories, dimensions, compliance, brand, MPN, etc.
    • attributes - 20 million rows; sku, attr_id, attr_name, attr_value
    • alternates/accessories - sku, type, alt_sku
    • cross reference - sku, competitor, alt part number

Testing?

  • Problems with testing over very large data sets
  • Small sample for a sanity check
  • Loud errors
  • Be sparing with exceptions

Getting Data

Where is the data?

  • Database
  • Online
  • Files

Data from a database

  • Control over what fields are extracted and how
  • Control over output format
  • Less likely to need cleaning/normalization

Python tool of choice - SQLAlchemy and/or low level drivers

  • ORM
  • Standardized interface for queries
  • Results as objects

In [ ]:
import sqlalchemy
# SQLAlchemy example
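# a sketch only - the connection URL, table, and query are placeholders
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///example.db")                    # placeholder database
with engine.connect() as conn:
    result = conn.execute(text("SELECT * FROM items LIMIT 5"))    # hypothetical table
    for row in result:
        print(row)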

Data from online sources

  • FTP - Python module - ftplib
  • SFTP - FTP (sort of) over SSH - pysftp
  • API - REST interfaces over http/https - requests
  • Scraping - http/https - requests/scrapy/Beautiful Soup/etc.

In [ ]:
#FTP example
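# a sketch only - host, directory, and file name are placeholders
from ftplib import FTP

ftp = FTP("ftp.example.com")
ftp.login()                                   # anonymous login; pass user/password otherwise
ftp.cwd("pub")                                # hypothetical directory
with open("datafile.csv", "wb") as local_file:
    ftp.retrbinary("RETR datafile.csv", local_file.write)
ftp.quit()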

In [ ]:
#SFTP example
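# a sketch only - host, credentials, and paths are placeholders
import pysftp

with pysftp.Connection("sftp.example.com", username="user", password="secret") as sftp:
    sftp.get("remote/datafile.csv", "datafile.csv")    # download one file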

In [ ]:
# API example
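# a sketch only - URL and parameters are placeholders for a real REST endpoint
import requests

response = requests.get("https://api.example.com/records", params={"limit": 10})
response.raise_for_status()           # fail loudly if the request didn't succeed
data = response.json()                # most REST APIs return JSON
print(data)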