CSV command-line kung fu

You might be surprised how much data slicing and dicing you can do from the command line using some simple tools and I/O redirection + piping. (See A Quick Introduction to Pipes and Redirection). We've already seen I/O redirection where we took the output of a command and wrote it to a file (/tmp/t.csv):

$ iconv -c -f utf-8 -t ascii SampleSuperstoreSales.csv > /tmp/t.csv

Set up

pip install csvkit

Extracting rows with grep

Now, let me introduce you to the grep command that lets us filter the lines in a file according to a regular expression. Here's how to find all rows that contain Annie Cyprus:


In [29]:
! grep 'Annie Cyprus' data/SampleSuperstoreSales.csv | head -3


249,1702,5/6/11,High,23,67.24,0.06,Regular Air,4.90,2.84,0.93,Annie Cyprus,Nunavut,Nunavut,Home Office,Office Supplies,Pens & Art Supplies,SANFORD Liquid Accent� Tank-Style Highlighters,Wrap Bag,0.54,5/7/11
669,4676,8/31/11,High,11,1210.0515,0.04,Regular Air,-104.25,125.99,7.69,Annie Cyprus,Nunavut,Nunavut,Home Office,Technology,Telephones and Communication,Timeport L7089,Small Box,0.58,9/1/11
670,4676,8/31/11,High,50,187.83,0.03,Regular Air,85.96,3.75,0.5,Annie Cyprus,Nunavut,Nunavut,Home Office,Office Supplies,Labels,Avery 510,Small Box,0.37,9/2/11

We didn't have to write any code. We didn't have to jump into a development environment or editor. We just asked for the sales for Annie. If we want to write the data to a file, we simply redirect it:


In [2]:
! grep 'Annie Cyprus' data/SampleSuperstoreSales.csv > /tmp/Annie.csv
! head -3 /tmp/Annie.csv # show first 3 lines of that new file


249,1702,5/6/11,High,23,67.24,0.06,Regular Air,4.90,2.84,0.93,Annie Cyprus,Nunavut,Nunavut,Home Office,Office Supplies,Pens & Art Supplies,SANFORD Liquid Accent� Tank-Style Highlighters,Wrap Bag,0.54,5/7/11
669,4676,8/31/11,High,11,1210.0515,0.04,Regular Air,-104.25,125.99,7.69,Annie Cyprus,Nunavut,Nunavut,Home Office,Technology,Telephones and Communication,Timeport L7089,Small Box,0.58,9/1/11
670,4676,8/31/11,High,50,187.83,0.03,Regular Air,85.96,3.75,0.5,Annie Cyprus,Nunavut,Nunavut,Home Office,Office Supplies,Labels,Avery 510,Small Box,0.37,9/2/11

Filtering with csvgrep

csvkit is an amazing package with lots of cool CSV utilities for use on the command line. csvgrep is one of them.

If we want a specific row ID, then we need to use the more powerful csvgrep not just grep. We use a different regular expression that looks for a specific string at the left edge of a line (^ means the beginning of a line, $ means end of line or end of record):


In [12]:
! csvgrep -c 1 -r '^80$' -e latin1 data/SampleSuperstoreSales.csv


Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Profit,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
80,483,7/10/11,High,30,4965.7595,0.08,Regular Air,1198.97,195.99,3.99,Clay Rozendal,Nunavut,Nunavut,Corporate,Technology,Telephones and Communication,R380,Small Box,0.58,7/12/11

What if you want, say, two different rows added to another file? We do two greps and a >> concatenation redirection:


In [13]:
! csvgrep -c 1 -r '^80$' -e latin1 data/SampleSuperstoreSales.csv > /tmp/two.csv # write first row
! csvgrep -c 1 -r '^160$' -e latin1 data/SampleSuperstoreSales.csv >> /tmp/two.csv # append second row
! cat /tmp/two.csv


Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Profit,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
80,483,7/10/11,High,30,4965.7595,0.08,Regular Air,1198.97,195.99,3.99,Clay Rozendal,Nunavut,Nunavut,Corporate,Technology,Telephones and Communication,R380,Small Box,0.58,7/12/11
Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Profit,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date
160,995,5/30/11,Medium,46,1815.49,0.03,Regular Air,782.91,39.89,3.04,Neola Schneider,Nunavut,Nunavut,Home Office,Furniture,Office Furnishings,Ultra Commercial Grade Dual Valve Door Closer,Wrap Bag,0.53,5/31/11

Beginning, end of files

If we'd like to see just the header row, we can use head:


In [14]:
! head -1 data/SampleSuperstoreSales.csv


Row ID,Order ID,Order Date,Order Priority,Order Quantity,Sales,Discount,Ship Mode,Profit,Unit Price,Shipping Cost,Customer Name,Province,Region,Customer Segment,Product Category,Product Sub-Category,Product Name,Product Container,Product Base Margin,Ship Date

If, on the other hand, we want to see everything but that row, we can use tail (which I pipe to head so then I see only the first two lines of output):


In [15]:
! tail +2 data/SampleSuperstoreSales.csv | head -2


1,3,10/13/10,Low,6,261.54,0.04,Regular Air,-213.25,38.94,35,Muhammed MacIntyre,Nunavut,Nunavut,Small Business,Office Supplies,Storage & Organization,"Eldon Base for stackable storage shelf, platinum",Large Box,0.8,10/20/10
49,293,10/1/12,High,49,10123.02,0.07,Delivery Truck,457.81,208.16,68.02,Barry French,Nunavut,Nunavut,Consumer,Office Supplies,Appliances,"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",Jumbo Drum,0.58,10/2/12
tail: stdout: Broken pipe

The output would normally be many thousands of lines here so I have piped the output to the head command to print just the first two rows. We can pipe many commands together, sending the output of one command as input to the next command.

Exercise

Count how many sales items there are in the Technology product category that are also High order priorities? Hint: wc -l counts the number of lines.


In [1]:
! grep Technology, data/SampleSuperstoreSales.csv | grep High, | wc -l


     449

Extracting columns with csvcut

Extracting columns is also pretty easy with csvcut. For example, let's say we wanted to get the customer name column (which is 12th by my count).


In [18]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | head -10


Customer Name
Muhammed MacIntyre
Barry French
Barry French
Clay Rozendal
Carlos Soltero
Carlos Soltero
Carl Jackson
Carl Jackson
Monica Federle

Actually, hang on a second. We don't want the Customer Name header to appear in the list so we combine with the tail we just saw to strip the header.


In [19]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | head -10


Muhammed MacIntyre
Barry French
Barry French
Clay Rozendal
Carlos Soltero
Carlos Soltero
Carl Jackson
Carl Jackson
Monica Federle
Dorothy Badders
tail: stdout: Broken pipe

What if we want a unique list? All we have to do is sort and then call uniq:


In [20]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq | head -10


Aaron Bergman
Aaron Hawkins
Aaron Smayling
Adam Bellavance
Adam Hart
Adam Shillingsburg
Adrian Barton
Adrian Hane
Adrian Shami
Aimee Bixby

You can get multiple columns at once in the order specified. For example, here is how to get the sales ID and the customer name together (name first then ID):


In [24]:
! csvcut -c 12,2 -e latin1 data/SampleSuperstoreSales.csv |head -10


Customer Name,Order ID
Muhammed MacIntyre,3
Barry French,293
Barry French,293
Clay Rozendal,483
Carlos Soltero,515
Carlos Soltero,515
Carl Jackson,613
Carl Jackson,613
Monica Federle,643

Naturally, we can write any of this output to a file using the > redirection operator. Let's do that and put each of those columns into a separate file and then paste them back with the customer name first.


In [26]:
! csvcut -c 2 -e latin1 data/SampleSuperstoreSales.csv > /tmp/IDs
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv > /tmp/names
! paste /tmp/names /tmp/IDs | head -10


Customer Name	Order ID
Muhammed MacIntyre	3
Barry French	293
Barry French	293
Clay Rozendal	483
Carlos Soltero	515
Carlos Soltero	515
Carl Jackson	613
Carl Jackson	613
Monica Federle	643

Amazing, right?! This is often a very efficient means of manipulating data files because you are directly talking to the operating system instead of through Python libraries. We also don't have to write any code, we just have to know some syntax for terminal commands.

Not impressed yet? Ok, how about creating a histogram indicating the number of sales per customer sorted in reverse numerical order? We've already got the list of customers and we can use an argument on uniq to get the count instead of just making a unique set. Then, we can use a second sort with arguments to reverse sort and use numeric rather than text-based sorting. This gives us a histogram:


In [27]:
! csvcut -c 12 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq -c | sort -r -n | head -10


  41 Darren Budd
  38 Ed Braxton
  35 Brad Thomas
  33 Carlos Soltero
  30 Patrick Jones
  29 Tony Sayre
  28 Nora Price
  28 Mark Cousins
  28 Lena Creighton
  28 Joy Smith

Exercise

Modify the command so that you get a histogram of the shipping mode.


In [28]:
! csvcut -c 8 -e latin1 data/SampleSuperstoreSales.csv | tail +2 | sort | uniq -c | sort -r -n | head -10


6270 Regular Air
1146 Delivery Truck
 983 Express Air