Accompanying the PB Python article here
In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True")
In [3]:
df
Out[3]:
Use df.info and df.dtypes to look at the types that pandas automatically infers based on the data
In [4]:
df.info()
df.dtypes
Since the 2016 and 2017 columns were read in as objects, trying to add the values will result in string concatenation not numerical addition
In [5]:
df['2016'] + df['2017']
Out[5]:
The simplest way to to convert to a type is using astype.
We can apply it to the customer number first.
In [6]:
df['Customer Number'].astype('int')
Out[6]:
The code above does not alter the original dataframe
In [7]:
df.dtypes
Out[7]:
Assign the new integer customer number back to the original frame and check the type
In [8]:
df["Customer Number"] = df['Customer Number'].astype('int')
df.dtypes
Out[8]:
In [9]:
df
Out[9]:
The data all looks good for the Customer Number.
If we try to convert the Jan Units column, we will get an error.
In [10]:
df['Jan Units'].astype('int')
In a similar manner we get an error if we try to convert the sales column
In [11]:
df['2016'].astype('float')
We can try to use astype with a bool type but that does not give expected results
In [12]:
df['Active'].astype('bool')
Out[12]:
In [13]:
# astype can take a dictionary of column names and data types
df.astype({'Customer Number': 'int', 'Customer Name': 'str'}).dtypes
Out[13]:
In order to convert the currency and percentages, we need to use custom functions
In [14]:
def convert_currency(val):
"""
Convert the string number value to a float
- Remove $
- Remove commas
- Convert to float type
"""
new_val = val.replace(',','').replace('$', '')
return float(new_val)
In [15]:
def convert_percent(val):
"""
Convert the percentage string to an actual floating point percent
"""
new_val = val.replace('%', '')
return float(new_val) / 100
Use apply to convert the 2016 and 2017 columns to floating point numbers
In [16]:
df['2016'].apply(convert_currency)
Out[16]:
In [17]:
df['2017'].apply(convert_currency)
Out[17]:
We could use a lambda function as well but it may be more difficult for new users to understand
In [18]:
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
Out[18]:
In [19]:
# Assign the converted values back to the columns
df['2016'] = df['2016'].apply(convert_currency)
df['2017'] = df['2017'].apply(convert_currency)
Use a lambda function to convert the percentage strings to numbers
In [20]:
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100
Out[20]:
In [21]:
df['Percent Growth'] = df['Percent Growth'].apply(convert_percent)
In [22]:
df.dtypes
Out[22]:
In [23]:
# Let's look at the data so far
df
Out[23]:
pd.to_numeric is another option for handling column conversions when invalid values are included
In [24]:
pd.to_numeric(df['Jan Units'], errors='coerce')
Out[24]:
In [25]:
# Fill in the NaN with 0
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
Out[25]:
Make sure to populate the original column of data
In [26]:
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
pd.to_datetime is very useful for working with date conversions
In [27]:
pd.to_datetime(df[['Month', 'Day', 'Year']])
Out[27]:
In [28]:
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
In [29]:
# Check out the dataframe
df
Out[29]:
Use np.where to convert the active column to a boolean
In [30]:
df["Active"] = np.where(df["Active"] == "Y", True, False)
In [31]:
df
Out[31]:
In [32]:
df.dtypes
Out[32]:
Many of the examples shown above can be used when reading in data using dtypes or converters arguments
In [33]:
df_2 = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True",
dtype={'Customer Number':'int'},
converters={'2016':convert_currency,
'2017': convert_currency,
'Percent Growth': convert_percent,
'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
'Active': lambda x: np.where(x == "Y", True, False)
})
In [34]:
df_2.dtypes
Out[34]:
In [35]:
df_2
Out[35]:
In [36]:
# This can not be applied at the time the data is read in
df_2["Start_Date"] = pd.to_datetime(df_2[['Month', 'Day', 'Year']])
In [37]:
df_2
Out[37]: