In [1]:

    
%load_ext cython

Case Study: Slow Pandas dates

Batches of data are collected from field instruments. These instruments capture the date in three separate columns: day, month and year.

The data is processed in Pandas, but converting the three columns into datetimes is currently slow.

Example (randomised) data



In [2]:

    
import numpy as np
import pandas as pd

def make_sample_data(size):
    d = dict(
        # Years: 1980 - 2015
        year=np.random.randint(1980, 2016, int(size)),
        # Months 1 - 12
        month=np.random.randint(1, 13, int(size)),
        # Day number: 1 - 28 (randint excludes the upper bound)
        day=np.random.randint(1, 29, int(size)),
        )
    return pd.DataFrame(d)
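
The sample data keeps the day number at 28 or below, presumably so that every random (year, month, day) triple is a valid calendar date, including February in non-leap years. A minimal sketch of why that matters, using only the standard library: `datetime.datetime` validates its arguments and raises `ValueError` for impossible dates. The helper name `is_valid_date` is made up for illustration and is not part of the notebook.

```python
import datetime

def is_valid_date(year, month, day):
    """Return True if (year, month, day) names a real calendar date."""
    try:
        datetime.datetime(year, month, day)
    except ValueError:
        return False
    return True

# Day 28 exists in every month of every year; day 30 never exists in February.
print(is_valid_date(2015, 2, 28))
print(is_valid_date(2015, 2, 30))
```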

Start with a small sample



In [3]:

    
df = make_sample_data(5)
df

Goal: make a single `datetime` column

Let's see the Python code first:



In [4]:

    
import datetime

def create_datetime_py(year, month, day):
    """ Take year, month, day and return a datetime """
    return datetime.datetime(year, month, day, 0, 0, 0, 0, None)

Use the Python conversion function

Pandas has an `apply()` method that, with `axis=1`, runs your function on every row of the DataFrame.

Your function receives a row and must return a value. All the returned values are collected into a new Pandas Series.
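
Conceptually, a row-wise `apply` behaves like the loop below: one Python function call per row, with the results collected in order. This is only a mental model written with plain dictionaries, not how Pandas implements `apply`; `apply_rowwise` and the sample rows are made up for illustration.

```python
import datetime

def apply_rowwise(rows, func):
    """Call func once per row and collect the results, in order."""
    return [func(row) for row in rows]

# Each dict plays the role of one DataFrame row, with fields looked up by name.
rows = [
    {"year": 1996, "month": 9, "day": 8},
    {"year": 2012, "month": 3, "day": 2},
]

dates = apply_rowwise(
    rows, lambda x: datetime.datetime(x["year"], x["month"], x["day"]))
print(dates)
```

The point to notice is that the work is done by an ordinary Python call for every single row.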



In [5]:

    
# Refer to fields by name! Very cool 👍
df.apply(lambda x : create_datetime_py(
        x['year'], x['month'], x['day']), axis=1)









    Out[5]:





0   1996-09-08
1   2012-03-02
2   2003-04-25
3   2013-01-27
4   2007-09-18
dtype: datetime64[ns]

Note: the type is "datetime64[ns]".
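
A `datetime64[ns]` value is stored as a 64-bit integer counting nanoseconds since the Unix epoch (1970-01-01 00:00:00). The sketch below shows that encoding with the standard library only; `to_epoch_ns` and `from_epoch_ns` are hypothetical helper names, and the dates are treated as timezone-naive.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
NS_PER_DAY = 86400 * 1_000_000_000

def to_epoch_ns(year, month, day):
    """Nanoseconds from the epoch to midnight on the given date."""
    return (datetime.datetime(year, month, day) - EPOCH).days * NS_PER_DAY

def from_epoch_ns(ns):
    """Inverse of to_epoch_ns (datetime only resolves microseconds)."""
    return EPOCH + datetime.timedelta(microseconds=ns // 1000)

ns = to_epoch_ns(1996, 9, 8)
print(ns, from_epoch_ns(ns))
```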

The `apply` call is awkward to type out each time. Let's wrap it in a convenience function.



In [6]:

    
def make_datetime_py(df):
    return df.apply(lambda x : create_datetime_py(
        x['year'], x['month'], x['day']), axis=1)

Then we can just call it like so:



In [7]:

    
make_datetime_py(df)









    Out[7]:





0   1996-09-08
1   2012-03-02
2   2003-04-25
3   2013-01-27
4   2007-09-18
dtype: datetime64[ns]

Problem: this is slow

With lots of data, the conversion to a datetime column takes a very long time! Let's try 100,000 rows:



In [8]:

    
df_big = make_sample_data(100000)

%timeit make_datetime_py(df_big)









    



1 loop, best of 3: 2.6 s per loop
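
`%timeit` is an IPython magic. Outside a notebook, a comparable "best of 3" figure can be obtained with the standard `timeit` module. The sketch below times only the per-row `datetime` construction on plain Python lists, so its numbers are not directly comparable with the Pandas timing above; `build_all` and the list size are illustrative assumptions.

```python
import datetime
import random
import timeit

n = 10_000
years = [random.randint(1980, 2015) for _ in range(n)]
months = [random.randint(1, 12) for _ in range(n)]
days = [random.randint(1, 28) for _ in range(n)]

def build_all():
    """Create one datetime per (year, month, day) triple."""
    return [datetime.datetime(y, m, d)
            for y, m, d in zip(years, months, days)]

# Run the function 3 times and keep the fastest run, like "best of 3".
best = min(timeit.repeat(build_all, number=1, repeat=3))
print("best of 3: %.2f ms per loop" % (best * 1000))
```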

What to do?

The first thing to check is whether Cython provides a low-level PXD interface file for the Python datetime object. It does: `cpython.datetime`.

Let's use Cython!



In [9]:

    
%%cython
# cython: boundscheck = False
# cython: wraparound = False
from cpython.datetime cimport (
    import_datetime, datetime_new, datetime, timedelta)
from pandas import Timestamp

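import_datetime()  # Initialise the datetime C API before using it

cpdef convert_arrays_ts(
        long[:] year, long[:] month, long[:] day,
        long long[:] out):
    """ Result goes into `out`  """
    cdef int i, n = year.shape[0]
    cdef datetime dt
    for i in range(n):
        # Build the datetime through the C API instead of a Python-level call
        dt = <datetime>datetime_new(
                year[i], month[i], day[i], 0, 0, 0, 0, None)
        # Timestamp.value is nanoseconds since the epoch; creating the
        # Timestamp is still a Python call on every iteration
        out[i] = Timestamp(dt).value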

Utility function for applying our conversion



In [10]:

    
def make_datetime_cy(df, method):
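    # Pre-allocate the result as a datetime64[ns] Series
    s = pd.Series(np.zeros(len(df), dtype='datetime64[ns]'))
    # Pass an int64 view of the same memory so `method` can fill it in place
    method(df['year'].values, df['month'].values, df['day'].values,
           s.values.view('int64'))
    return s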
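
`make_datetime_cy` relies on views sharing memory: `s.values.view('int64')` reinterprets the Series' buffer as integers, and the Cython function writes its results through the `out` memoryview into that same buffer. The same buffer-protocol idea can be seen with the standard library alone. This is an analogy that uses `array.array` rather than NumPy, and `fill_with` is a made-up stand-in for the compiled function.

```python
from array import array

def fill_with(out, value):
    """Write `value` into every slot of `out`, in place."""
    for i in range(len(out)):
        out[i] = value

buf = array("q", [0, 0, 0])   # three signed 64-bit integers
view = memoryview(buf)        # a view onto buf's memory, no copy is made

fill_with(view, 842140800 * 10**9)
print(buf.tolist())           # the writes are visible through buf
```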



In [11]:

    
# Test it out
make_datetime_cy(df, convert_arrays_ts)









    Out[11]:





0   1996-09-08
1   2012-03-02
2   2003-04-25
3   2013-01-27
4   2007-09-18
dtype: datetime64[ns]

Speed Test



In [12]:

    
df_big = make_sample_data(100000)

%timeit make_datetime_py(df_big)
%timeit make_datetime_cy(df_big, convert_arrays_ts)









    



1 loop, best of 3: 2.96 s per loop
10 loops, best of 3: 87.7 ms per loop

Speed-up: 2.96 s / 87.7 ms, about 34x faster.

Check annotation

Eliminate the Python overhead



In [13]:

    
%%cython -a
# cython: boundscheck = False
# cython: wraparound = False
from cpython.datetime cimport (
    import_datetime, datetime_new, datetime, timedelta,
    timedelta_seconds, timedelta_days)

import_datetime()  # <-- Pretty important

cpdef convert_arrays_dt(long[:] year, long[:] month, long[:] day, 
        long long[:] out):
    """ Result goes into `out`  """
    cdef int i, n = year.shape[0]
    cdef datetime dt, epoch = datetime_new(1970, 1, 1, 0, 0, 0, 0, None)
    cdef timedelta td
    cdef long seconds
    for i in range(n):
        dt = <datetime>datetime_new(
                year[i], month[i], day[i], 0, 0, 0, 0, None)
        td = <timedelta>(dt - epoch)
        seconds = timedelta_days(td) * 86400 
        out[i] = seconds * 1000000000  # Nanoseconds, remember?









    Out[13]:









    
    Cython: _cython_magic_13b6c719ea7226d3de9d43538eca339f.pyx
    
    


Generated by Cython 0.25.2

    Yellow lines hint at Python interaction.

    Lines that start with a "+" are followed by the C code that Cython generated for them.

+01: # cython: boundscheck = False
  __pyx_t_1 = PyDict_New(); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 02: # cython: wraparound = False
 03: from cpython.datetime cimport (
 04:     import_datetime, datetime_new, datetime, timedelta,
 05:     timedelta_seconds, timedelta_days)
 06: 
+07: import_datetime()  # <-- Pretty important
  __pyx_f_7cpython_8datetime_import_datetime();
 08: 
+09: cpdef convert_arrays_dt(long[:] year, long[:] month, long[:] day,
static PyObject *__pyx_pw_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_1convert_arrays_dt(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static PyObject *__pyx_f_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_convert_arrays_dt(__Pyx_memviewslice __pyx_v_year, __Pyx_memviewslice __pyx_v_month, __Pyx_memviewslice __pyx_v_day, __Pyx_memviewslice __pyx_v_out, CYTHON_UNUSED int __pyx_skip_dispatch) {
  int __pyx_v_i;
  int __pyx_v_n;
  PyDateTime_DateTime *__pyx_v_dt = 0;
  PyDateTime_DateTime *__pyx_v_epoch = 0;
  PyDateTime_Delta *__pyx_v_td = 0;
  long __pyx_v_seconds;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_dt", 0);
/* … */
  /* function exit code */
  __pyx_r = Py_None; __Pyx_INCREF(Py_None);
  goto __pyx_L0;
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_7);
  __Pyx_AddTraceback("_cython_magic_13b6c719ea7226d3de9d43538eca339f.convert_arrays_dt", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = 0;
  __pyx_L0:;
  __Pyx_XDECREF((PyObject *)__pyx_v_dt);
  __Pyx_XDECREF((PyObject *)__pyx_v_epoch);
  __Pyx_XDECREF((PyObject *)__pyx_v_td);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_1convert_arrays_dt(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static char __pyx_doc_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_convert_arrays_dt[] = " Result goes into `out`  ";
static PyObject *__pyx_pw_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_1convert_arrays_dt(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  __Pyx_memviewslice __pyx_v_year = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_month = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_day = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_dt (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_year,&__pyx_n_s_month,&__pyx_n_s_day,&__pyx_n_s_out,0};
    PyObject* values[4] = {0,0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  4: values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_year)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        case  1:
        if (likely((values[1] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_month)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_dt", 1, 4, 4, 1); __PYX_ERR(0, 9, __pyx_L3_error)
        }
        case  2:
        if (likely((values[2] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_day)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_dt", 1, 4, 4, 2); __PYX_ERR(0, 9, __pyx_L3_error)
        }
        case  3:
        if (likely((values[3] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_out)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_dt", 1, 4, 4, 3); __PYX_ERR(0, 9, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "convert_arrays_dt") < 0)) __PYX_ERR(0, 9, __pyx_L3_error)
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 4) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
      values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
    }
    __pyx_v_year = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[0]); if (unlikely(!__pyx_v_year.memview)) __PYX_ERR(0, 9, __pyx_L3_error)
    __pyx_v_month = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[1]); if (unlikely(!__pyx_v_month.memview)) __PYX_ERR(0, 9, __pyx_L3_error)
    __pyx_v_day = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[2]); if (unlikely(!__pyx_v_day.memview)) __PYX_ERR(0, 9, __pyx_L3_error)
    __pyx_v_out = __Pyx_PyObject_to_MemoryviewSlice_ds_PY_LONG_LONG(values[3]); if (unlikely(!__pyx_v_out.memview)) __PYX_ERR(0, 10, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("convert_arrays_dt", 1, 4, 4, PyTuple_GET_SIZE(__pyx_args)); __PYX_ERR(0, 9, __pyx_L3_error)
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_13b6c719ea7226d3de9d43538eca339f.convert_arrays_dt", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_convert_arrays_dt(__pyx_self, __pyx_v_year, __pyx_v_month, __pyx_v_day, __pyx_v_out);

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_convert_arrays_dt(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_year, __Pyx_memviewslice __pyx_v_month, __Pyx_memviewslice __pyx_v_day, __Pyx_memviewslice __pyx_v_out) {
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_dt", 0);
  __Pyx_XDECREF(__pyx_r);
  if (unlikely(!__pyx_v_year.memview)) { __Pyx_RaiseUnboundLocalError("year"); __PYX_ERR(0, 9, __pyx_L1_error) }
  if (unlikely(!__pyx_v_month.memview)) { __Pyx_RaiseUnboundLocalError("month"); __PYX_ERR(0, 9, __pyx_L1_error) }
  if (unlikely(!__pyx_v_day.memview)) { __Pyx_RaiseUnboundLocalError("day"); __PYX_ERR(0, 9, __pyx_L1_error) }
  if (unlikely(!__pyx_v_out.memview)) { __Pyx_RaiseUnboundLocalError("out"); __PYX_ERR(0, 9, __pyx_L1_error) }
  __pyx_t_1 = __pyx_f_46_cython_magic_13b6c719ea7226d3de9d43538eca339f_convert_arrays_dt(__pyx_v_year, __pyx_v_month, __pyx_v_day, __pyx_v_out, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 9, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_r = __pyx_t_1;
  __pyx_t_1 = 0;
  goto __pyx_L0;

  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_AddTraceback("_cython_magic_13b6c719ea7226d3de9d43538eca339f.convert_arrays_dt", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XDEC_MEMVIEW(&__pyx_v_year, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_month, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_day, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
 10:         long long[:] out):
 11:     """ Result goes into `out`  """
+12:     cdef int i, n = year.shape[0]
  __pyx_v_n = (__pyx_v_year.shape[0]);
+13:     cdef datetime dt, epoch = datetime_new(1970, 1, 1, 0, 0, 0, 0, None)
  __pyx_t_1 = __pyx_f_7cpython_8datetime_datetime_new(0x7B2, 1, 1, 0, 0, 0, 0, Py_None); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 13, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (!(likely(((__pyx_t_1) == Py_None) || likely(__Pyx_TypeTest(__pyx_t_1, __pyx_ptype_7cpython_8datetime_datetime))))) __PYX_ERR(0, 13, __pyx_L1_error)
  __pyx_v_epoch = ((PyDateTime_DateTime *)__pyx_t_1);
  __pyx_t_1 = 0;
 14:     cdef timedelta td
 15:     cdef long seconds
+16:     for i in range(n):
  __pyx_t_2 = __pyx_v_n;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_i = __pyx_t_3;
+17:         dt = <datetime>datetime_new(
    __pyx_t_1 = __pyx_f_7cpython_8datetime_datetime_new((*((long *) ( /* dim=0 */ (__pyx_v_year.data + __pyx_t_4 * __pyx_v_year.strides[0]) ))), (*((long *) ( /* dim=0 */ (__pyx_v_month.data + __pyx_t_5 * __pyx_v_month.strides[0]) ))), (*((long *) ( /* dim=0 */ (__pyx_v_day.data + __pyx_t_6 * __pyx_v_day.strides[0]) ))), 0, 0, 0, 0, Py_None); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 17, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    __pyx_t_7 = __pyx_t_1;
    __Pyx_INCREF(__pyx_t_7);
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
    __Pyx_XDECREF_SET(__pyx_v_dt, ((PyDateTime_DateTime *)__pyx_t_7));
    __pyx_t_7 = 0;
+18:                 year[i], month[i], day[i], 0, 0, 0, 0, None)
    __pyx_t_4 = __pyx_v_i;
    __pyx_t_5 = __pyx_v_i;
    __pyx_t_6 = __pyx_v_i;
+19:         td = <timedelta>(dt - epoch)
    __pyx_t_7 = PyNumber_Subtract(((PyObject *)__pyx_v_dt), ((PyObject *)__pyx_v_epoch)); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 19, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_7);
    __pyx_t_1 = __pyx_t_7;
    __Pyx_INCREF(__pyx_t_1);
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    __Pyx_XDECREF_SET(__pyx_v_td, ((PyDateTime_Delta *)__pyx_t_1));
    __pyx_t_1 = 0;
+20:         seconds = timedelta_days(td) * 86400
    __pyx_v_seconds = (__pyx_f_7cpython_8datetime_timedelta_days(((PyObject *)__pyx_v_td)) * 0x15180);
+21:         out[i] = seconds * 1000000000  # Nanoseconds, remember?
    __pyx_t_8 = __pyx_v_i;
    *((PY_LONG_LONG *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_8 * __pyx_v_out.strides[0]) )) = (__pyx_v_seconds * 0x3B9ACA00);
  }

Test it out



In [14]:

    
make_datetime_cy(df, convert_arrays_dt)









    Out[14]:





0   1996-09-08
1   2012-03-02
2   2003-04-25
3   2013-01-27
4   2007-09-18
dtype: datetime64[ns]

Speed Test



In [15]:

    
df_big = make_sample_data(100000)

%timeit make_datetime_py(df_big)
%timeit make_datetime_cy(df_big, convert_arrays_ts)
%timeit make_datetime_cy(df_big, convert_arrays_dt)









    



1 loop, best of 3: 3 s per loop
10 loops, best of 3: 88.9 ms per loop
100 loops, best of 3: 6.79 ms per loop

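Speed-up: 3 s / 6.79 ms, about 440x faster than pure Python, and 88.9 ms / 6.79 ms, about 13x faster than the `Timestamp` version.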

Using C standard library



In [16]:

    
%%cython -a
# cython: boundscheck = False
# cython: wraparound = False
from libc.time cimport mktime, tm, timezone

cdef inline long to_unix(long year, long month, long day):
    """ month: 1 - 12, day: 1 - 31    
        Result is in UTC. """
    cdef tm tms
    tms.tm_year = year - 1900  # years since 1900 !!
    tms.tm_mon = month - 1     # 0 to 11 !!  
    tms.tm_mday = day          # 1 - 31  
    tms.tm_hour, tms.tm_min, tms.tm_sec  = 0, 0, 0
    return mktime(&tms) - timezone

cpdef convert_arrays_libc(
        long[:] year, long[:] month, long[:] day, 
        long long[:] out):
    """ Result goes into `out`  """
    cdef int i, n = year.shape[0]
    cdef long unix
    for i in range(n):
        unix = to_unix(year[i], month[i], day[i])
        #print(unix)
        #out[i] = to_unix(year[i], month[i], day[i]) * 1000000000  
        out[i] = unix * 1000000000









    Out[16]:









    
    Cython: _cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0.pyx
    
    


Generated by Cython 0.25.2

    Yellow lines hint at Python interaction.

    Lines that start with a "+" are followed by the C code that Cython generated for them.

+01: # cython: boundscheck = False
  __pyx_t_1 = PyDict_New(); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_test, __pyx_t_1) < 0) __PYX_ERR(0, 1, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
 02: # cython: wraparound = False
 03: from libc.time cimport mktime, tm, timezone
 04: 
+05: cdef inline long to_unix(long year, long month, long day):
static CYTHON_INLINE long __pyx_f_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_to_unix(long __pyx_v_year, long __pyx_v_month, long __pyx_v_day) {
  struct tm __pyx_v_tms;
  long __pyx_r;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("to_unix", 0);
/* … */
  /* function exit code */
  __pyx_L0:;
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
 06:     """ month: 1 - 12, day: 1 - 31    
 07:         Result is in UTC. """
 08:     cdef tm tms
+09:     tms.tm_year = year - 1900  # years since 1900 !!
  __pyx_v_tms.tm_year = (__pyx_v_year - 0x76C);
+10:     tms.tm_mon = month - 1     # 0 to 11 !!  
  __pyx_v_tms.tm_mon = (__pyx_v_month - 1);
+11:     tms.tm_mday = day          # 1 - 31  
  __pyx_v_tms.tm_mday = __pyx_v_day;
+12:     tms.tm_hour, tms.tm_min, tms.tm_sec  = 0, 0, 0
  __pyx_t_1 = 0;
  __pyx_t_2 = 0;
  __pyx_t_3 = 0;
  __pyx_v_tms.tm_hour = __pyx_t_1;
  __pyx_v_tms.tm_min = __pyx_t_2;
  __pyx_v_tms.tm_sec = __pyx_t_3;
+13:     return mktime(&tms) - timezone
  __pyx_r = (mktime((&__pyx_v_tms)) - timezone);
  goto __pyx_L0;
 14: 
+15: cpdef convert_arrays_libc(
static PyObject *__pyx_pw_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_1convert_arrays_libc(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static PyObject *__pyx_f_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_convert_arrays_libc(__Pyx_memviewslice __pyx_v_year, __Pyx_memviewslice __pyx_v_month, __Pyx_memviewslice __pyx_v_day, __Pyx_memviewslice __pyx_v_out, CYTHON_UNUSED int __pyx_skip_dispatch) {
  int __pyx_v_i;
  int __pyx_v_n;
  long __pyx_v_unix;
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_libc", 0);
/* … */
  /* function exit code */
  __pyx_r = Py_None; __Pyx_INCREF(Py_None);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

/* Python wrapper */
static PyObject *__pyx_pw_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_1convert_arrays_libc(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds); /*proto*/
static char __pyx_doc_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_convert_arrays_libc[] = " Result goes into `out`  ";
static PyObject *__pyx_pw_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_1convert_arrays_libc(PyObject *__pyx_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  __Pyx_memviewslice __pyx_v_year = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_month = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_day = { 0, 0, { 0 }, { 0 }, { 0 } };
  __Pyx_memviewslice __pyx_v_out = { 0, 0, { 0 }, { 0 }, { 0 } };
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_libc (wrapper)", 0);
  {
    static PyObject **__pyx_pyargnames[] = {&__pyx_n_s_year,&__pyx_n_s_month,&__pyx_n_s_day,&__pyx_n_s_out,0};
    PyObject* values[4] = {0,0,0,0};
    if (unlikely(__pyx_kwds)) {
      Py_ssize_t kw_args;
      const Py_ssize_t pos_args = PyTuple_GET_SIZE(__pyx_args);
      switch (pos_args) {
        case  4: values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
        case  3: values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
        case  2: values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
        case  1: values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
        case  0: break;
        default: goto __pyx_L5_argtuple_error;
      }
      kw_args = PyDict_Size(__pyx_kwds);
      switch (pos_args) {
        case  0:
        if (likely((values[0] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_year)) != 0)) kw_args--;
        else goto __pyx_L5_argtuple_error;
        case  1:
        if (likely((values[1] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_month)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_libc", 1, 4, 4, 1); __PYX_ERR(0, 15, __pyx_L3_error)
        }
        case  2:
        if (likely((values[2] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_day)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_libc", 1, 4, 4, 2); __PYX_ERR(0, 15, __pyx_L3_error)
        }
        case  3:
        if (likely((values[3] = PyDict_GetItem(__pyx_kwds, __pyx_n_s_out)) != 0)) kw_args--;
        else {
          __Pyx_RaiseArgtupleInvalid("convert_arrays_libc", 1, 4, 4, 3); __PYX_ERR(0, 15, __pyx_L3_error)
        }
      }
      if (unlikely(kw_args > 0)) {
        if (unlikely(__Pyx_ParseOptionalKeywords(__pyx_kwds, __pyx_pyargnames, 0, values, pos_args, "convert_arrays_libc") < 0)) __PYX_ERR(0, 15, __pyx_L3_error)
      }
    } else if (PyTuple_GET_SIZE(__pyx_args) != 4) {
      goto __pyx_L5_argtuple_error;
    } else {
      values[0] = PyTuple_GET_ITEM(__pyx_args, 0);
      values[1] = PyTuple_GET_ITEM(__pyx_args, 1);
      values[2] = PyTuple_GET_ITEM(__pyx_args, 2);
      values[3] = PyTuple_GET_ITEM(__pyx_args, 3);
    }
    __pyx_v_year = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[0]); if (unlikely(!__pyx_v_year.memview)) __PYX_ERR(0, 16, __pyx_L3_error)
    __pyx_v_month = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[1]); if (unlikely(!__pyx_v_month.memview)) __PYX_ERR(0, 16, __pyx_L3_error)
    __pyx_v_day = __Pyx_PyObject_to_MemoryviewSlice_ds_long(values[2]); if (unlikely(!__pyx_v_day.memview)) __PYX_ERR(0, 16, __pyx_L3_error)
    __pyx_v_out = __Pyx_PyObject_to_MemoryviewSlice_ds_PY_LONG_LONG(values[3]); if (unlikely(!__pyx_v_out.memview)) __PYX_ERR(0, 17, __pyx_L3_error)
  }
  goto __pyx_L4_argument_unpacking_done;
  __pyx_L5_argtuple_error:;
  __Pyx_RaiseArgtupleInvalid("convert_arrays_libc", 1, 4, 4, PyTuple_GET_SIZE(__pyx_args)); __PYX_ERR(0, 15, __pyx_L3_error)
  __pyx_L3_error:;
  __Pyx_AddTraceback("_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0.convert_arrays_libc", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __Pyx_RefNannyFinishContext();
  return NULL;
  __pyx_L4_argument_unpacking_done:;
  __pyx_r = __pyx_pf_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_convert_arrays_libc(__pyx_self, __pyx_v_year, __pyx_v_month, __pyx_v_day, __pyx_v_out);

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_convert_arrays_libc(CYTHON_UNUSED PyObject *__pyx_self, __Pyx_memviewslice __pyx_v_year, __Pyx_memviewslice __pyx_v_month, __Pyx_memviewslice __pyx_v_day, __Pyx_memviewslice __pyx_v_out) {
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("convert_arrays_libc", 0);
  __Pyx_XDECREF(__pyx_r);
  if (unlikely(!__pyx_v_year.memview)) { __Pyx_RaiseUnboundLocalError("year"); __PYX_ERR(0, 15, __pyx_L1_error) }
  if (unlikely(!__pyx_v_month.memview)) { __Pyx_RaiseUnboundLocalError("month"); __PYX_ERR(0, 15, __pyx_L1_error) }
  if (unlikely(!__pyx_v_day.memview)) { __Pyx_RaiseUnboundLocalError("day"); __PYX_ERR(0, 15, __pyx_L1_error) }
  if (unlikely(!__pyx_v_out.memview)) { __Pyx_RaiseUnboundLocalError("out"); __PYX_ERR(0, 15, __pyx_L1_error) }
  __pyx_t_1 = __pyx_f_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_convert_arrays_libc(__pyx_v_year, __pyx_v_month, __pyx_v_day, __pyx_v_out, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 15, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  __pyx_r = __pyx_t_1;
  __pyx_t_1 = 0;
  goto __pyx_L0;

  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_AddTraceback("_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0.convert_arrays_libc", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __PYX_XDEC_MEMVIEW(&__pyx_v_year, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_month, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_day, 1);
  __PYX_XDEC_MEMVIEW(&__pyx_v_out, 1);
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
 16:         long[:] year, long[:] month, long[:] day,
 17:         long long[:] out):
 18:     """ Result goes into `out`  """
+19:     cdef int i, n = year.shape[0]
  __pyx_v_n = (__pyx_v_year.shape[0]);
 20:     cdef long unix
+21:     for i in range(n):
  __pyx_t_1 = __pyx_v_n;
  for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2+=1) {
    __pyx_v_i = __pyx_t_2;
+22:         unix = to_unix(year[i], month[i], day[i])
    __pyx_t_3 = __pyx_v_i;
    __pyx_t_4 = __pyx_v_i;
    __pyx_t_5 = __pyx_v_i;
    __pyx_v_unix = __pyx_f_46_cython_magic_44b8bbbb5d6bcf3b8ad948e5ec67afe0_to_unix((*((long *) ( /* dim=0 */ (__pyx_v_year.data + __pyx_t_3 * __pyx_v_year.strides[0]) ))), (*((long *) ( /* dim=0 */ (__pyx_v_month.data + __pyx_t_4 * __pyx_v_month.strides[0]) ))), (*((long *) ( /* dim=0 */ (__pyx_v_day.data + __pyx_t_5 * __pyx_v_day.strides[0]) ))));
 23:         #print(unix)
 24:         #out[i] = to_unix(year[i], month[i], day[i]) * 1000000000  
+25:         out[i] = unix * 1000000000
    __pyx_t_6 = __pyx_v_i;
    *((PY_LONG_LONG *) ( /* dim=0 */ (__pyx_v_out.data + __pyx_t_6 * __pyx_v_out.strides[0]) )) = (__pyx_v_unix * 0x3B9ACA00);
  }



In [17]:

    
make_datetime_cy(df, convert_arrays_libc)









    Out[17]:





0   1996-09-08 00:00:00
1   2012-03-01 23:00:00
2   2003-04-25 00:00:00
3   2013-01-26 23:00:00
4   2007-09-18 00:00:00
dtype: datetime64[ns]
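
Rows 1 and 3 above are an hour early. `mktime` interprets the `tm` struct as local time, and `to_unix` never sets `tm_isdst`, so a daylight-saving offset is a likely cause (an assumption, since the notebook does not say which timezone it ran in). For comparison, Python's `calendar.timegm` treats its input as UTC and involves no local-time rules. The sketch below is a pure-Python reference for the expected values, not a drop-in replacement for the Cython code; `to_unix_utc` is a hypothetical name.

```python
import calendar

def to_unix_utc(year, month, day):
    """Seconds since the epoch for midnight UTC on the given date."""
    return calendar.timegm((year, month, day, 0, 0, 0))

for y, m, d in [(1996, 9, 8), (2012, 3, 2), (2013, 1, 27)]:
    seconds = to_unix_utc(y, m, d)
    # Midnight UTC is always a whole number of days after the epoch,
    # so the remainder printed last should be zero.
    print(y, m, d, seconds, seconds % 86400)
```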



In [18]:

    
df_big = make_sample_data(100000)

%timeit make_datetime_py(df_big)
%timeit make_datetime_cy(df_big, convert_arrays_dt)
%timeit make_datetime_cy(df_big, convert_arrays_ts)
%timeit make_datetime_cy(df_big, convert_arrays_libc)









    



1 loop, best of 3: 2.61 s per loop
100 loops, best of 3: 6.08 ms per loop
10 loops, best of 3: 79.9 ms per loop
1 loop, best of 3: 595 ms per loop
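
The timings above can be turned into speed-up factors relative to the pure-Python version. The numbers in the dictionary are copied from the output of this run and will differ on other machines.

```python
# Best-of-3 timings in seconds, copied from the output above.
timings = {
    "make_datetime_py": 2.61,
    "convert_arrays_dt": 6.08e-3,
    "convert_arrays_ts": 79.9e-3,
    "convert_arrays_libc": 595e-3,
}

baseline = timings["make_datetime_py"]
speedups = {name: baseline / t for name, t in timings.items()}

# Print the fastest variant first.
for name, factor in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print("%-20s %6.0fx" % (name, factor))
```

On this run the `libc` version is the slowest of the three compiled variants, and it is also the one whose output was off by an hour for some rows.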


