Pandas Data Types

This notebook accompanies the PB Python article on pandas data types.



In [1]:

    
import pandas as pd
import numpy as np



In [2]:

    
df = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True")



In [3]:

    
df









    Out[3]:







  
    
      
   Customer Number     Customer Name         2016          2017  Percent Growth  Jan Units  Month  Day  Year  Active
0          10002.0  Quest Industries  $125,000.00    $162500.00          30.00%        500      1   10  2015       Y
1         552278.0    Smith Plumbing  $920,000.00  $101,2000.00          10.00%        700      6   15  2014       Y
2          23477.0   ACME Industrial   $50,000.00     $62500.00          25.00%        125      3   29  2016       Y
3          24900.0        Brekke LTD  $350,000.00    $490000.00           4.00%         75     10   27  2015       Y
4         651029.0         Harbor Co   $15,000.00     $12750.00         -15.00%     Closed      2    2  2014       N

Use df.info() and df.dtypes to look at the types that pandas automatically infers from the data.



In [4]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null float64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes

df.dtypes

Since the 2016 and 2017 columns were read in as objects (strings), trying to add the values results in string concatenation, not numerical addition.



In [5]:

    
df['2016'] + df['2017']









    Out[5]:





0      $125,000.00$162500.00
1    $920,000.00$101,2000.00
2        $50,000.00$62500.00
3      $350,000.00$490000.00
4        $15,000.00$12750.00
dtype: object
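The concatenation behaviour can be reproduced without downloading the file. The sketch below uses a hypothetical two-row frame (the name `sample` and its values are illustrative, copied from the rows above) to show that `+` joins the text when both columns hold strings.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the currency strings
# stored in the 2016 and 2017 columns
sample = pd.DataFrame({"2016": ["$125,000.00", "$15,000.00"],
                       "2017": ["$162500.00", "$12750.00"]})

# Both columns hold Python strings, so + joins the text
# instead of adding numbers
combined = sample["2016"] + sample["2017"]
print(combined.tolist())
```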

The simplest way to convert a column to a different type is to use astype.

We can apply it to the customer number first.



In [6]:

    
df['Customer Number'].astype('int')









    Out[6]:





0     10002
1    552278
2     23477
3     24900
4    651029
Name: Customer Number, dtype: int64

The code above does not alter the original dataframe



In [7]:

    
df.dtypes









    Out[7]:





Customer Number    float64
Customer Name       object
2016                object
2017                object
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
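A minimal sketch of the same point on a small, made-up frame: astype returns a new Series and leaves the source column alone until the result is assigned back. The name `sample` is hypothetical, and `int64` is spelled out here only to keep the result the same across platforms.

```python
import pandas as pd

# Hypothetical frame with a float customer number, as read from the CSV
sample = pd.DataFrame({"Customer Number": [10002.0, 552278.0]})

# astype returns a converted copy
converted = sample["Customer Number"].astype("int64")

# The original column still has its old dtype at this point
dtype_before = str(sample["Customer Number"].dtype)

# Only an explicit assignment changes the frame
sample["Customer Number"] = converted
dtype_after = str(sample["Customer Number"].dtype)

print(dtype_before, "->", dtype_after)
```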

Assign the new integer customer number back to the original frame and check the type



In [8]:

    
df["Customer Number"] = df['Customer Number'].astype('int')
df.dtypes









    Out[8]:





Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object



In [9]:

    
df









    Out[9]:







  
    
      
   Customer Number     Customer Name         2016          2017  Percent Growth  Jan Units  Month  Day  Year  Active
0            10002  Quest Industries  $125,000.00    $162500.00          30.00%        500      1   10  2015       Y
1           552278    Smith Plumbing  $920,000.00  $101,2000.00          10.00%        700      6   15  2014       Y
2            23477   ACME Industrial   $50,000.00     $62500.00          25.00%        125      3   29  2016       Y
3            24900        Brekke LTD  $350,000.00    $490000.00           4.00%         75     10   27  2015       Y
4           651029         Harbor Co   $15,000.00     $12750.00         -15.00%     Closed      2    2  2014       N

The data all looks good for the Customer Number.

If we try to convert the Jan Units column, we will get an error.



In [10]:

    
df['Jan Units'].astype('int')









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-31333711e4a4> in <module>()
----> 1 df['Jan Units'].astype('int')

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: 'Closed'

In a similar manner we get an error if we try to convert the sales column



In [11]:

    
df['2016'].astype('float')









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-999869d577b0> in <module>()
----> 1 df['2016'].astype('float')

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    701 
    702     if copy:
--> 703         return arr.astype(dtype)
    704     return arr.view(dtype)
    705 

ValueError: could not convert string to float: '$15,000.00'
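Both tracebacks end in a ValueError raised by the first value that cannot be parsed. As a rough sketch, the failure can be caught and the offending values listed before deciding how to clean them. The Series `units` below is a hypothetical stand-in for the Jan Units column, and using pd.to_numeric with errors='coerce' to locate bad values is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical stand-in for the Jan Units column
units = pd.Series(["500", "700", "Closed"], dtype=object)

# astype raises as soon as it meets a value it cannot parse
try:
    units.astype("int64")
    failed = False
except ValueError as exc:
    failed = True
    print("Conversion failed:", exc)

# Values that turn into NaN under errors='coerce' are the ones
# that blocked the conversion
bad_values = units[pd.to_numeric(units, errors="coerce").isna()]
print(bad_values.tolist())
```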

We can try to use astype with a bool type, but that does not give the expected results: every non-empty string converts to True, so both Y and N become True.



In [12]:

    
df['Active'].astype('bool')









    Out[12]:





0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool
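The all-True result follows from Python's truthiness rules, where any non-empty string is truthy. The short sketch below, on a hypothetical two-value Series named `active`, contrasts astype('bool') with an explicit comparison.

```python
import pandas as pd

# Hypothetical stand-in for the Active column
active = pd.Series(["Y", "N"], dtype=object)

# Any non-empty string is truthy in Python, including "N"
n_is_truthy = bool("N")

# astype('bool') applies the same rule to every element
as_bool = active.astype("bool")

# Comparing against the value that means "active" gives the intended flags
compared = active == "Y"

print(as_bool.tolist(), compared.tolist())
```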



In [13]:

    
# astype can take a dictionary of column names and data types
df.astype({'Customer Number': 'int', 'Customer Name': 'str'}).dtypes









    Out[13]:





Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object

In order to convert the currency and percentages, we need to use custom functions



In [14]:

    
def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',','').replace('$', '')
    return float(new_val)



In [15]:

    
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

Use apply to convert the 2016 and 2017 columns to floating point numbers



In [16]:

    
df['2016'].apply(convert_currency)









    Out[16]:





0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64



In [17]:

    
df['2017'].apply(convert_currency)









    Out[17]:





0     162500.0
1    1012000.0
2      62500.0
3     490000.0
4      12750.0
Name: 2017, dtype: float64

We could use a lambda function as well but it may be more difficult for new users to understand



In [18]:

    
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')









    Out[18]:





0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
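Another option, not used in this notebook, is the vectorized `.str` accessor, which avoids calling a Python function per element. This is a sketch on hypothetical sample Series; it assumes a pandas version in which `Series.str.replace` accepts the `regex` keyword.

```python
import pandas as pd

# Hypothetical samples mimicking the currency and percentage columns
sales = pd.Series(["$125,000.00", "$920,000.00", "$15,000.00"])
growth = pd.Series(["30.00%", "10.00%", "-15.00%"])

# regex=False treats "$" as a literal character rather than a regex anchor
sales_clean = (sales.str.replace("$", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype("float64"))

# Strip the trailing percent sign, then scale to a fraction
growth_clean = growth.str.rstrip("%").astype("float64") / 100

print(sales_clean.tolist())
print(growth_clean.tolist())
```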



In [19]:

    
# Assign the converted values back to the columns
df['2016'] = df['2016'].apply(convert_currency)
df['2017'] = df['2017'].apply(convert_currency)

Use a lambda function to convert the percentage strings to numbers



In [20]:

    
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100









    Out[20]:





0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64



In [21]:

    
df['Percent Growth'] = df['Percent Growth'].apply(convert_percent)



In [22]:

    
df.dtypes









    Out[22]:





Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object



In [23]:

    
# Let's look at the data so far
df









    Out[23]:







  
    
      
   Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active
0            10002  Quest Industries  125000.0   162500.0            0.30        500      1   10  2015       Y
1           552278    Smith Plumbing  920000.0  1012000.0            0.10        700      6   15  2014       Y
2            23477   ACME Industrial   50000.0    62500.0            0.25        125      3   29  2016       Y
3            24900        Brekke LTD  350000.0   490000.0            0.04         75     10   27  2015       Y
4           651029         Harbor Co   15000.0    12750.0           -0.15     Closed      2    2  2014       N

pd.to_numeric is another option for handling column conversions when invalid values are included



In [24]:

    
pd.to_numeric(df['Jan Units'], errors='coerce')









    Out[24]:





0    500.0
1    700.0
2    125.0
3     75.0
4      NaN
Name: Jan Units, dtype: float64



In [25]:

    
# Fill in the NaN with 0
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)









    Out[25]:





0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64

Make sure to populate the original column of data



In [26]:

    
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
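The column ends up as float64 because NaN is a floating point value. If whole-number units are preferred, one possible follow-up (an extra step, not part of the original notebook) is to cast to an integer type once the NaN values have been filled. The Series `units` is a hypothetical copy of the column.

```python
import pandas as pd

# Hypothetical copy of the Jan Units column as read from the CSV
units = pd.Series(["500", "700", "125", "75", "Closed"], dtype=object)

# Unparseable entries become NaN, which forces a float64 result
numeric = pd.to_numeric(units, errors="coerce")
coerced_count = int(numeric.isna().sum())

# After the NaN values are filled, an integer cast is safe
filled = numeric.fillna(0).astype("int64")

print(coerced_count, filled.tolist())
```

Filling with 0 treats "Closed" as zero units, which may or may not be the right business meaning; counting the coerced values first makes that decision visible.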

pd.to_datetime is very useful for working with date conversions



In [27]:

    
pd.to_datetime(df[['Month', 'Day', 'Year']])









    Out[27]:





0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]



In [28]:

    
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
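When pd.to_datetime is given a DataFrame, it assembles each row's year, month, and day columns into a single timestamp. A small self-contained sketch, using a hypothetical frame named `parts` with the same column names as above:

```python
import pandas as pd

# Hypothetical frame holding the date components in separate columns
parts = pd.DataFrame({"Month": [1, 6], "Day": [10, 15], "Year": [2015, 2014]})

# The columns are matched by name, so their order in the list does not matter
dates = pd.to_datetime(parts[["Month", "Day", "Year"]])

print(dates.dt.strftime("%Y-%m-%d").tolist())
```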



In [29]:

    
# Check out the dataframe
df









    Out[29]:







  
    
      
   Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active  Start_Date
0            10002  Quest Industries  125000.0   162500.0            0.30      500.0      1   10  2015       Y  2015-01-10
1           552278    Smith Plumbing  920000.0  1012000.0            0.10      700.0      6   15  2014       Y  2014-06-15
2            23477   ACME Industrial   50000.0    62500.0            0.25      125.0      3   29  2016       Y  2016-03-29
3            24900        Brekke LTD  350000.0   490000.0            0.04       75.0     10   27  2015       Y  2015-10-27
4           651029         Harbor Co   15000.0    12750.0           -0.15        0.0      2    2  2014       N  2014-02-02

Use np.where to convert the active column to a boolean



In [30]:

    
df["Active"] = np.where(df["Active"] == "Y", True, False)
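For a simple two-value flag like this one, a plain comparison or a dictionary lookup should give the same result as np.where. The sketch below compares the three on a hypothetical Series named `active`; the `map` variant is an alternative that is not used in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Active column
active = pd.Series(["Y", "Y", "N"], dtype=object)

# np.where picks True where the condition holds and False elsewhere
via_where = np.where(active == "Y", True, False)

# The comparison on its own already produces a boolean Series
via_compare = active == "Y"

# map makes the meaning of each code explicit; unlisted codes become NaN
via_map = active.map({"Y": True, "N": False})

print(via_where.tolist(), via_compare.tolist(), via_map.tolist())
```

np.where becomes more useful when the two outcomes are something other than True and False, for example two different labels or numbers.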



In [31]:

    
df









    Out[31]:







  
    
      
   Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active  Start_Date
0            10002  Quest Industries  125000.0   162500.0            0.30      500.0      1   10  2015    True  2015-01-10
1           552278    Smith Plumbing  920000.0  1012000.0            0.10      700.0      6   15  2014    True  2014-06-15
2            23477   ACME Industrial   50000.0    62500.0            0.25      125.0      3   29  2016    True  2016-03-29
3            24900        Brekke LTD  350000.0   490000.0            0.04       75.0     10   27  2015    True  2015-10-27
4           651029         Harbor Co   15000.0    12750.0           -0.15        0.0      2    2  2014   False  2014-02-02



In [32]:

    
df.dtypes









    Out[32]:





Customer Number             int64
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_Date         datetime64[ns]
dtype: object

Many of the conversions shown above can be applied while reading in the data, using the dtype or converters arguments of read_csv.



In [33]:

    
df_2 = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True", 
                   dtype={'Customer Number':'int'},
                   converters={'2016':convert_currency,
                               '2017': convert_currency,
                               'Percent Growth': convert_percent,
                               'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
                               'Active': lambda x: np.where(x == "Y", True, False)
                              })



In [34]:

    
df_2.dtypes









    Out[34]:





Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
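Note that Active is still reported as object here. A likely reason is that a converter receives one string at a time, and np.where applied to a single value returns a zero-dimensional NumPy array rather than a plain Python bool. The sketch below, which reads a hypothetical inline CSV instead of the real file, shows a converter that returns a plain bool; under that assumption the column should come back with a bool dtype.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "Customer Name,Active\nQuest Industries,Y\nHarbor Co,N\n"

# The converter is called once per cell with the raw string,
# so a plain comparison returns a Python bool for each row
frame = pd.read_csv(io.StringIO(csv_text),
                    converters={"Active": lambda x: x == "Y"})

print(frame["Active"].tolist(), frame["Active"].dtype)
```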



In [35]:

    
df_2









    Out[35]:







  
    
      
   Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active
0            10002  Quest Industries  125000.0   162500.0            0.30      500.0      1   10  2015    True
1           552278    Smith Plumbing  920000.0  1012000.0            0.10      700.0      6   15  2014    True
2            23477   ACME Industrial   50000.0    62500.0            0.25      125.0      3   29  2016    True
3            24900        Brekke LTD  350000.0   490000.0            0.04       75.0     10   27  2015    True
4           651029         Harbor Co   15000.0    12750.0           -0.15        NaN      2    2  2014   False



In [36]:

    
# This can not be applied at the time the data is read in
df_2["Start_Date"] = pd.to_datetime(df_2[['Month', 'Day', 'Year']])



In [37]:

    
df_2









    Out[37]:







  
    
      
   Customer Number     Customer Name      2016       2017  Percent Growth  Jan Units  Month  Day  Year  Active  Start_Date
0            10002  Quest Industries  125000.0   162500.0            0.30      500.0      1   10  2015    True  2015-01-10
1           552278    Smith Plumbing  920000.0  1012000.0            0.10      700.0      6   15  2014    True  2014-06-15
2            23477   ACME Industrial   50000.0    62500.0            0.25      125.0      3   29  2016    True  2016-03-29
3            24900        Brekke LTD  350000.0   490000.0            0.04       75.0     10   27  2015    True  2015-10-27
4           651029         Harbor Co   15000.0    12750.0           -0.15        NaN      2    2  2014   False  2014-02-02
