Title: Loading A CSV Into Pandas
Slug: pandas_dataframe_importing_csv
Summary: Loading A CSV Into Pandas
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

import modules


In [1]:
import pandas as pd
import numpy as np

Create dataframe (that we will be importing)


In [2]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df


Out[2]:
first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25,000
1 Molly Jacobson 52 24 94,000
2 Tina . 36 31 57
3 Jake Milner 24 . 62
4 Amy Cooze 73 . 70

Save dataframe as csv in the working director


In [3]:
df.to_csv('pandas_dataframe_importing_csv/example.csv')

Load a csv


In [4]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
df


Out[4]:
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4 25,000
1 1 Molly Jacobson 52 24 94,000
2 2 Tina . 36 31 57
3 3 Jake Milner 24 . 62
4 4 Amy Cooze 73 . 70

Load a csv with no headers


In [5]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None)
df


Out[5]:
0 1 2 3 4 5
0 NaN first_name last_name age preTestScore postTestScore
1 0.0 Jason Miller 42 4 25,000
2 1.0 Molly Jacobson 52 24 94,000
3 2.0 Tina . 36 31 57
4 3.0 Jake Milner 24 . 62
5 4.0 Amy Cooze 73 . 70

Load a csv while specifying column names


In [6]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df


Out[6]:
UID First Name Last Name Age Pre-Test Score Post-Test Score
0 NaN first_name last_name age preTestScore postTestScore
1 0.0 Jason Miller 42 4 25,000
2 1.0 Molly Jacobson 52 24 94,000
3 2.0 Tina . 36 31 57
4 3.0 Jake Milner 24 . 62
5 4.0 Amy Cooze 73 . 70

Load a csv with setting the index column to UID


In [7]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df


Out[7]:
First Name Last Name Age Pre-Test Score Post-Test Score
UID
NaN first_name last_name age preTestScore postTestScore
0.0 Jason Miller 42 4 25,000
1.0 Molly Jacobson 52 24 94,000
2.0 Tina . 36 31 57
3.0 Jake Milner 24 . 62
4.0 Amy Cooze 73 . 70

Load a csv while setting the index columns to First Name and Last Name


In [8]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df


Out[8]:
UID Age Pre-Test Score Post-Test Score
First Name Last Name
first_name last_name NaN age preTestScore postTestScore
Jason Miller 0.0 42 4 25,000
Molly Jacobson 1.0 52 24 94,000
Tina . 2.0 36 31 57
Jake Milner 3.0 24 . 62
Amy Cooze 4.0 73 . 70

Load a csv while specifying "." as missing values


In [9]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.'])
pd.isnull(df)


Out[9]:
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 False False False False False False
1 False False False False False False
2 False False True False False False
3 False False False False True False
4 False False False False True False

Load a csv while specifying "." and "NA" as missing values in the Last Name column and "." as missing values in Pre-Test Score column


In [10]:
sentinels = {'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']}

In [11]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df


Out[11]:
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4 25,000
1 1 Molly Jacobson 52 24 94,000
2 2 Tina . 36 31 57
3 3 Jake Milner 24 . 62
4 4 Amy Cooze 73 . 70

Load a csv while skipping the top 3 rows


In [12]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3)
df


Out[12]:
2 Tina . 36 31 57
0 3 Jake Milner 24 . 62
1 4 Amy Cooze 73 . 70

Load a csv while interpreting "," in strings around numbers as thousands seperators


In [13]:
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',')
df


Out[13]:
Unnamed: 0 first_name last_name age preTestScore postTestScore
0 0 Jason Miller 42 4 25000
1 1 Molly Jacobson 52 24 94000
2 2 Tina . 36 31 57
3 3 Jake Milner 24 . 62
4 4 Amy Cooze 73 . 70