Tools for CSV: csvkit

Like jq for JSON, csvkit provides a suite of command-line tools for working with data in the comma-separated values (CSV) format. As with jq, one of the chief virtues of the csvkit tools is how readily they can be combined with other common Unix tools, and the documentation is quite good.
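
csvkit is a Python package, so if it isn't already installed in your environment you can typically get it from PyPI (a one-off setup step, sketched here):


In [ ]:
!pip install csvkit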


In [ ]:
import tarfile
import re
import os
from itertools import count


# You have a copy of this file in your `data` directory. Tate provides the data in a single TAR (tape archive) file
DATA_PATH = '../data/tate-collection-1.2.tar.gz'
DATA_FOBJ = tarfile.open(DATA_PATH)

# We can use Python's tools for working with tar files to inspect the data package,
# for instance by listing the files it contains without unpacking it
FILES = DATA_FOBJ.getmembers()
for i, f in enumerate(FILES[:10]):
    print('{0} \t {1}'.format(i, f.name))

In [ ]:
# This time we only extract the first few members of the archive (enough to get at the artist_data.csv file) rather than unpacking everything
DATA_FOBJ.extractall(path='../data/tate-collection', members=FILES[0:4])

In [ ]:
CSV_FILE_PATH = '../data/tate-collection/collection-1.2/artist_data.csv'

Exploring

csvkit's csvlook tool gives us a way to inspect the contents of a CSV file without opening it in a program like Excel (especially handy when the file is so large it might crash a spreadsheet application).


In [ ]:
%%bash
csvlook ../data/tate-collection/collection-1.2/artist_data.csv | head -n 5

What we're seeing here is effectively a plain-text display of our CSV data … not super pretty, but faster than opening Excel. Let's make this more useful.

Selecting and Slicing

We can combine this facility with the csvcut tool to see only subsets of our data. First, let's use the -n flag to print out the column headings:


In [ ]:
!csvcut -n ../data/tate-collection/collection-1.2/artist_data.csv

We can look at just a subset of columns …


In [ ]:
!csvcut -c 1,2,5,6,9 ../data/tate-collection/collection-1.2/artist_data.csv

Once you know the column names you can also use those to subset or slice:


In [ ]:
!csvcut -c name,url ../data/tate-collection/collection-1.2/artist_data.csv
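
csvcut can also exclude columns rather than select them, using the capital -C flag. A minimal sketch (the column positions 3 and 4 here are chosen arbitrarily for illustration):


In [ ]:
!csvcut -C 3,4 ../data/tate-collection/collection-1.2/artist_data.csv | head -n 5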

Now let's combine these tools using Unix pipes:


In [ ]:
!csvcut -c name,dates ../data/tate-collection/collection-1.2/artist_data.csv | csvlook | head -n 10

In our case this still doesn't look that helpful, because some of these columns (like name) are really wide.
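
Recent csvkit releases can truncate wide columns for display; a minimal sketch, assuming your version of csvlook supports the --max-column-width option (check csvlook --help if it errors):


In [ ]:
!csvcut -c name,dates ../data/tate-collection/collection-1.2/artist_data.csv | \
csvlook --max-column-width 20 | head -n 10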

Summary Statistics

The csvkit tools offer other ways of peeking at our data that can prove useful, for instance by providing summary statistics.


In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath,url ../data/tate-collection/collection-1.2/artist_data.csv | \
csvstat

Super useful!
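
If you only need a single figure, csvstat also accepts flags for individual statistics such as --min, --max, and --mean, which is quicker than computing everything. A minimal sketch:


In [ ]:
!csvcut -c yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvstat --min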

Searching

We can now use tools for searching (or "grepping") to find specific bits of information in our file; csvkit provides a csvgrep tool for doing just this.


In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfBirth -m 1497 | csvlook
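
csvgrep can also match with a regular expression via the -r flag instead of an exact string; a minimal sketch that finds artists born anywhere in the 1490s:


In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfBirth -r "^149\d$" | csvlook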

In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfDeath -m 2005 | csvlook

Let's sort by the year of birth to see which of our 2005 decedents were the oldest


In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfDeath -m 2005 | csvsort -c yearOfBirth | csvlook
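
csvsort orders ascending by default; adding the -r flag reverses the sort, so the same pipeline can show the youngest of those artists first. A minimal sketch:


In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfDeath -m 2005 | csvsort -c yearOfBirth -r | csvlook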

Power Usage

One of the really cool features of csvkit is that it lets us move back and forth between CSVs and relational database structures quickly and easily, and even (for rapid exploration) treat our CSVs as though they were databases we could run SQL queries against.


In [ ]:
!csvsql --query "select name from artist_data where yearOfBirth > 1700;" \
../data/tate-collection/collection-1.2/artist_data.csv | csvlook

This facility is best for one-off ad hoc queries and, since csvkit is building a little in-memory database behind the scenes to make it possible, large datasets can be very slow to work with this way.

However, csvkit's SQL tools also make it really easy to turn a CSV into a proper database, by generating the necessary SQL statements for you and trying to guess the correct data types for each column. csvsql will even load the data for you if you have a SQL database available.


In [ ]:
!csvsql -i sqlite ../data/tate-collection/collection-1.2/artist_data.csv

In [ ]:
!csvsql --db sqlite:///tate_artists.db --insert ../data/tate-collection/collection-1.2/artist_data.csv

In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select * from artist_data"

The data imported successfully, so now we can run any SQL query we want:


In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select * from artist_data where gender='Female'"

In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select name,dates from artist_data where gender='Female'" | \
    csvlook
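
Because sql2csv emits plain CSV, its output can feed straight back into the rest of the csvkit toolchain; for example, a sketch piping a query result into csvstat:


In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select yearOfBirth from artist_data where gender='Female'" | \
    csvstat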
