Like jq
for JSON, csvkit provides a suite of tools for working with data in comma separated value (CSV) formats. Again, one of the chief virtues of the csvkit
tools is the way they can be combined with other common unix tools. Also the documentation is quite good.
In [ ]:
import tarfile
import re
import os
from itertools import count
# You have a copy of this file in your `data` directory. Tate provides the data in a single TAR (tape archive) file
DATA_PATH = '../data/tate-collection-1.2.tar.gz'
DATA_FOBJ = tarfile.open(DATA_PATH)
# We can use Python's tools for working with tar files to inspect the data package
# For instance by listing the files it contains without unpacking it
FILES = DATA_FOBJ.getmembers()
for i, f in enumerate(FILES[:10]):
print('{0} \t {1}'.format(i, f))
In [ ]:
# This time we're only going to extract the CSV file
DATA_FOBJ.extractall(path='../data/tate-collection', members=FILES[0:4])
In [ ]:
CSV_FILE_PATH = '../data/tate-collection/collection-1.2/artist_data.csv'
In [ ]:
%%bash
csvlook ../data/tate-collection/collection-1.2/artist_data.csv | head -n 5
What we're seeing here is effectively a plain text display of our CSV data … not super pretty but faster than opening Excel. Let's make this more useful
In [ ]:
!csvcut -n ../data/tate-collection/collection-1.2/artist_data.csv
We can look at just a subset of columns …
In [ ]:
!csvcut -c 1,2,5,6,9 ../data/tate-collection/collection-1.2/artist_data.csv
Once you know the column names you can also use those to subset or slice:
In [ ]:
!csvcut -c name,url ../data/tate-collection/collection-1.2/artist_data.csv
Now combine these tools together using unix pipes
In [ ]:
!csvcut -c name,dates ../data/tate-collection/collection-1.2/artist_data.csv | csvlook | head -n 10
In our case, this still doesn't look that helpful because some of these columns (like name) are really wide
In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath,url ../data/tate-collection/collection-1.2/artist_data.csv | \
csvstat
Super useful!
In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfBirth -m 1497 | csvlook
In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfDeath -m 2005 | csvlook
Let's sort by the year of birth to see which of our 2005 decedents were the oldest
In [ ]:
!csvcut -c name,yearOfBirth,yearOfDeath ../data/tate-collection/collection-1.2/artist_data.csv | \
csvgrep -c yearOfDeath -m 2005 | csvsort -c yearOfBirth | csvlook
In [ ]:
!csvsql --query "select name from artist_data where yearOfBirth > 1700;" \
../data/tate-collection/collection-1.2/artist_data.csv | csvlook
This facility is best for one-off ad hoc queries and, since csvkit
is building a little in-memory database behind the scenes to make it possible, large datasets can be very slow to work with this way.
However, csvkit
SQL tools also make it really easy to turn CSVs into a proper database — by creating the necessary SQL statements for you and trying to guess the correct data types for columns. csvsql
will even load the data for you if you have an SQL database installed.
In [ ]:
!csvsql -i sqlite ../data/tate-collection/collection-1.2/artist_data.csv
In [ ]:
!csvsql --db sqlite:///tate_artists.db --insert ../data/tate-collection/collection-1.2/artist_data.csv
In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select * from artist_data"
Imported successfully, now we can run any SQL query we want
In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select * from artist_data where gender='Female'"
In [ ]:
!sql2csv --db sqlite:///tate_artists.db --query "select name,dates from artist_data where gender='Female'" | \
csvlook
In [ ]: