You might be surprised how much data slicing and dicing you can do from the command line using some simple tools and I/O redirection + piping. (See A Quick Introduction to Pipes and Redirection). We've already seen I/O redirection where we took the output of a command and wrote it to a file (/tmp/t.csv
):
$ iconv -c -f utf-8 -t ascii SampleSuperstoreSales.csv > /tmp/t.csv
pip install csvkit
Now, let me introduce you to the grep
command that lets us filter the lines in a file according to a regular expression. Here's how to find all rows that contain Annie Cyprus
:
In [29]:
! grep 'Annie Cyprus' data/SampleSuperstoreSales.csv | head -3
We didn't have to write any code. We didn't have to jump into a development environment or editor. We just asked for the sales for Annie. If we want to write the data to a file, we simply redirect it:
In [2]:
! grep 'Annie Cyprus' data/SampleSuperstoreSales.csv > /tmp/Annie.csv
! head -3 /tmp/Annie.csv # show first 3 lines of that new file
csvkit is an amazing package with lots of cool CSV utilities for use on the command line. csvgrep
is one of them.
If we want a specific row ID, then we need to use the more powerful csvgrep
not just grep
. We use a different regular expression that looks for a specific string at the left edge of a line (^
means the beginning of a line, $
means end of line or end of record):
In [12]:
! csvgrep -c 1 -r '^80$' -e latin1 data/SampleSuperstoreSales.csv
What if you want, say, two different rows added to another file? We do two grep
s and a >>
concatenation redirection:
In [13]:
! csvgrep -c 1 -r '^80$' -e latin1 data/SampleSuperstoreSales.csv > /tmp/two.csv # write first row
! csvgrep -c 1 -r '^160$' -e latin1 data/SampleSuperstoreSales.csv >> /tmp/two.csv # append second row
! cat /tmp/two.csv
If we'd like to see just the header row, we can use head
:
In [14]:
! head -1 data/SampleSuperstoreSales.csv
If, on the other hand, we want to see everything but that row, we can use tail
(which I pipe to head
so then I see only the first two lines of output):
In [15]:
! tail +2 data/SampleSuperstoreSales.csv | head -2
The output would normally be many thousands of lines here so I have piped the output to the head
command to print just the first two rows. We can pipe many commands together, sending the output of one command as input to the next command.
In [1]:
! grep Technology, data/SampleSuperstoreSales.csv | grep High, | wc -l
In [18]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | head -10
Actually, hang on a second. We don't want the Customer Name
header to appear in the list so we combine with the tail
we just saw to strip the header.
In [19]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | head -10
What if we want a unique list? All we have to do is sort and then call uniq
:
In [20]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq | head -10
You can get multiple columns at once in the order specified. For example, here is how to get the sales ID and the customer name together (name first then ID):
In [24]:
! csvcut -c 12,2 -e latin1 data/SampleSuperstoreSales.csv |head -10
Naturally, we can write any of this output to a file using the >
redirection operator. Let's do that and put each of those columns into a separate file and then paste
them back with the customer name first.
In [26]:
! csvcut -c 2 -e latin1 data/SampleSuperstoreSales.csv > /tmp/IDs
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv > /tmp/names
! paste /tmp/names /tmp/IDs | head -10
Amazing, right?! This is often a very efficient means of manipulating data files because you are directly talking to the operating system instead of through Python libraries. We also don't have to write any code, we just have to know some syntax for terminal commands.
Not impressed yet? Ok, how about creating a histogram indicating the number of sales per customer sorted in reverse numerical order? We've already got the list of customers and we can use an argument on uniq
to get the count instead of just making a unique set. Then, we can use a second sort
with arguments to reverse sort and use numeric rather than text-based sorting. This gives us a histogram:
In [27]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq -c | sort -r -n | head -10
In [28]:
! csvcut -c 8 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq -c | sort -r -n | head -10