x lines of Python

Read and write CSV files

This notebook accompanies the agilescientific.com blog post of the same name:

x lines of Python: read and write CSV

Introduction

CSV files are the de facto standard way to store data on the web. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in Python.

Nine times out of ten, the way to read and write CSV files with Python is with pandas. We'll do that first. But that's not always an option (maybe you don't want the dependency in your code), so we'll do it lots of ways:

With pandas
2 ways with the built-in package csv...
- csv.reader
- csv.DictReader
With NumPy.

We'll also use pandas to read a couple of CSV files from the web (as opposed to from flat files on our computer).

Using `pandas`

Using pd.read_csv.



In [1]:

    
import pandas as pd



In [3]:

    
fname = "../data/periods.csv"



In [4]:

    
df = pd.read_csv(fname)



In [5]:

    
df









    Out[5]:







  
    
      
      name
      abbreviation
      start
      end
    
  
  
    
      0
      Cambrian
      C
      541.00
      485.40
    
    
      1
      Ordovician
      O
      485.40
      443.80
    
    
      2
      Silurian
      S
      443.80
      419.20
    
    
      3
      Devonian
      D
      419.20
      358.90
    
    
      4
      Mississipian
      Miss
      358.90
      323.20
    
    
      5
      Pennsylvanian
      Penn
      323.20
      298.90
    
    
      6
      Permian
      P
      298.90
      252.20
    
    
      7
      Triassic
      Tr
      252.20
      201.30
    
    
      8
      Jurassic
      J
      201.30
      145.00
    
    
      9
      Cretaceous
      K
      154.00
      66.00
    
    
      10
      Palaeogene
      Pg
      66.00
      20.03
    
    
      11
      Neogene
      Ng
      23.03
      2.58
    
    
      12
      Quaternary
      Q
      2.58
      0.00

We can get the start of the Permian like this:



In [6]:

    
df[df.name=="Permian"].start









    Out[6]:





6    298.9
Name: start, dtype: float64

Let's fix the start of the Cretaceous:



In [7]:

    
df.loc[df.name=='Cretaceous', 'start'] = 145.0
df.loc[df.name=='Cretaceous', 'start']









    Out[7]:





9    145.0
Name: start, dtype: float64

After you have changed or added to a DataFrame, pandas also makes it very easy to write a CSV file containing your data.



In [8]:

    
df.to_csv("../data/pdout.csv")

Using `csv.reader`

Docs for the ordinary reader.



In [9]:

    
import csv



In [10]:

    
with open(fname) as f:
    reader = csv.reader(f)
    data = [row for row in reader]



In [11]:

    
data









    Out[11]:





[['name', 'abbreviation', 'start', 'end'],
 ['Cambrian', 'C', '541', '485.4'],
 ['Ordovician', 'O', '485.4', '443.8'],
 ['Silurian', 'S', '443.8', '419.2'],
 ['Devonian', 'D', '419.2', '358.9'],
 ['Mississipian', 'Miss', '358.9', '323.2'],
 ['Pennsylvanian', 'Penn', '323.2', '298.9'],
 ['Permian', 'P', '298.9', '252.2'],
 ['Triassic', 'Tr', '252.2', '201.3'],
 ['Jurassic', 'J', '201.3', '145'],
 ['Cretaceous', 'K', '154', '66'],
 ['Palaeogene', 'Pg', '66', '20.03'],
 ['Neogene', 'Ng', '23.03', '2.58'],
 ['Quaternary', 'Q', '2.58', '0.0']]



In [12]:

    
[d[2] for d in data if d[0]=="Permian"]









    Out[12]:





['298.9']

Note that we needed to know the positions of the items in the rows, which we could only get by inspection. We could skip that header row if we wanted to, but there's a better way: use the header as the keys in a dictionary...

Using `csv.DictReader`

Docs for the DictReader.



In [13]:

    
with open(fname) as f:
    reader = csv.DictReader(f)
    data = [row for row in reader]



In [14]:

    
data









    Out[14]:





[{'abbreviation': 'C', 'end': '485.4', 'name': 'Cambrian', 'start': '541'},
 {'abbreviation': 'O', 'end': '443.8', 'name': 'Ordovician', 'start': '485.4'},
 {'abbreviation': 'S', 'end': '419.2', 'name': 'Silurian', 'start': '443.8'},
 {'abbreviation': 'D', 'end': '358.9', 'name': 'Devonian', 'start': '419.2'},
 {'abbreviation': 'Miss',
  'end': '323.2',
  'name': 'Mississipian',
  'start': '358.9'},
 {'abbreviation': 'Penn',
  'end': '298.9',
  'name': 'Pennsylvanian',
  'start': '323.2'},
 {'abbreviation': 'P', 'end': '252.2', 'name': 'Permian', 'start': '298.9'},
 {'abbreviation': 'Tr', 'end': '201.3', 'name': 'Triassic', 'start': '252.2'},
 {'abbreviation': 'J', 'end': '145', 'name': 'Jurassic', 'start': '201.3'},
 {'abbreviation': 'K', 'end': '66', 'name': 'Cretaceous', 'start': '154'},
 {'abbreviation': 'Pg', 'end': '20.03', 'name': 'Palaeogene', 'start': '66'},
 {'abbreviation': 'Ng', 'end': '2.58', 'name': 'Neogene', 'start': '23.03'},
 {'abbreviation': 'Q', 'end': '0.0', 'name': 'Quaternary', 'start': '2.58'}]



In [15]:

    
[d['start'] for d in data if d['name']=="Permian"]









    Out[15]:





['298.9']

There is a corresponding DictWriter class for writing CSVs.

Note that pandas has lots of file readers, including ones for:

Excel files
JSON
SAS
Stata

It can even read the clipboard!

Bonus: reading a CSV file from the web

For example, this one is hosted by GitHub. It's publicly readable, so we don't need to authenticate.

In the X Lines of Python: Machine Learning notebook, I read the online file into a buffer, but it turns out you don't need to do this — you can just give pd.read_csv() a URL!



In [24]:

    
import requests
import io

df = pd.read_csv('https://raw.githubusercontent.com/agile-geoscience/xlines/master/data/periods.csv')

df.head()









    Out[24]:







  
    
      
      name
      abbreviation
      start
      end
    
  
  
    
      0
      Cambrian
      C
      541.0
      485.4
    
    
      1
      Ordovician
      O
      485.4
      443.8
    
    
      2
      Silurian
      S
      443.8
      419.2
    
    
      3
      Devonian
      D
      419.2
      358.9
    
    
      4
      Mississipian
      Miss
      358.9
      323.2

Bonus: using NumPy

pandas is perfect for this CSV because it's really a table, containing a mixture of data types (strings and floats).

Nonetheless, we can read it as an array... I'm not really into 'named arrays', so I'll just read the two numeric columns.

We'll use np.genfromtxt.



In [17]:

    
import numpy as np



In [18]:

    
x = np.genfromtxt(fname, delimiter=',', skip_header=1, usecols=[2,3])
x









    Out[18]:





array([[ 541.  ,  485.4 ],
       [ 485.4 ,  443.8 ],
       [ 443.8 ,  419.2 ],
       [ 419.2 ,  358.9 ],
       [ 358.9 ,  323.2 ],
       [ 323.2 ,  298.9 ],
       [ 298.9 ,  252.2 ],
       [ 252.2 ,  201.3 ],
       [ 201.3 ,  145.  ],
       [ 154.  ,   66.  ],
       [  66.  ,   20.03],
       [  23.03,    2.58],
       [   2.58,    0.  ]])

We can write a CSV like so:



In [19]:

    
np.savetxt("../data/npout.csv", x, delimiter=",", header="start,end")

Bonus: reading a CSV file from Google Docs

It used to be easy to anonymously read a public file directly from Google Docs, but now you need an API key. It's not too hard to set up, but you'll need to read some docs.

When you have an API key, put it here...



In [20]:

    
key = "PUT YOUR KEY HERE"



In [22]:

    
import json



In [23]:

    
url = "https://sheets.googleapis.com/v4/spreadsheets/{id}/values/{sheet}"

meta = {"id": "1YlnEGT8uHpRllk7rjAgFFl8V6B5-kl02DBie11PjG9Q",
        "sheet": "Sheet1"
        }
url = url.format(**meta)

params = {"key": key}

r = requests.get(url, params=params)
j = json.loads(r.text)['values']
df = pd.DataFrame(j[1:], columns=j[0])
df.head()









    Out[23]:







  
    
      
      name
      abbreviation
      start
      end
    
  
  
    
      0
      Cambrian
      C
      541
      485.4
    
    
      1
      Ordovician
      O
      485.4
      443.8
    
    
      2
      Silurian
      S
      443.8
      419.2
    
    
      3
      Devonian
      D
      419.2
      358.9
    
    
      4
      Mississipian
      Miss
      358.9
      323.2



In [ ]:

	name	abbreviation	start	end
0	Cambrian	C	541.00	485.40
1	Ordovician	O	485.40	443.80
2	Silurian	S	443.80	419.20
3	Devonian	D	419.20	358.90
4	Mississipian	Miss	358.90	323.20
5	Pennsylvanian	Penn	323.20	298.90
6	Permian	P	298.90	252.20
7	Triassic	Tr	252.20	201.30
8	Jurassic	J	201.30	145.00
9	Cretaceous	K	154.00	66.00
10	Palaeogene	Pg	66.00	20.03
11	Neogene	Ng	23.03	2.58
12	Quaternary	Q	2.58	0.00

	name	abbreviation	start	end
0	Cambrian	C	541.0	485.4
1	Ordovician	O	485.4	443.8
2	Silurian	S	443.8	419.2
3	Devonian	D	419.2	358.9
4	Mississipian	Miss	358.9	323.2