Which MONTH/YEAR_OF_BIRTH variable is correct?

There are two pairs of month and year of birth variables, listed here with their reference numbers:

  1. R00003.00 and R00005.00 (from 1979)
  2. R04101.00 and R04103.00 (from 1981)

Under "Important information about using age data", the user guide section on age says:

The 1981 birth dates should be used to determine age with the 1979 dates used only as a backup. Differences between 1979 and 1981 birth dates remained for approximately 200-250 respondents after the 1981 fielding; editing on a case-by-case basis was performed by CHRR staff on only the 1981 variable.
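
To make the guidance concrete, the sketch below counts respondents whose 1979 and 1981 birth dates disagree. It runs on a small made-up frame, not the NLSY79 extract. The column names match the labels assigned later in this notebook, and treating -5 in the 1981 columns as missing mirrors the cleaning step in the code cells; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not real NLSY79 records).
toy = pd.DataFrame({
    'MONTH_OF_BIRTH_1979': [1, 6, 12, 3],
    'YEAR_OF_BIRTH_1979': [58, 60, 62, 64],
    'MONTH_OF_BIRTH_1981': [1, 7, -5, 3],
    'YEAR_OF_BIRTH_1981': [58, 60, -5, 63],
})

# Treat -5 in the 1981 columns as missing (assumption, see lead-in).
cols_1981 = ['MONTH_OF_BIRTH_1981', 'YEAR_OF_BIRTH_1981']
toy[cols_1981] = toy[cols_1981].replace(-5, np.nan)

# A respondent "disagrees" when 1981 reports a full date
# and either the month or the year differs from 1979.
both_reported = toy[cols_1981].notnull().all(axis=1)
differs = (
    (toy.MONTH_OF_BIRTH_1979 != toy.MONTH_OF_BIRTH_1981) |
    (toy.YEAR_OF_BIRTH_1979 != toy.YEAR_OF_BIRTH_1981)
)
n_disagree = int((both_reported & differs).sum())
print(n_disagree)
```

Run against the real extract, a count of this kind could be compared with the 200-250 respondents mentioned in the user guide, although the exact figure depends on how missing codes are handled.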


In [1]:
import os

import numpy as np
import pandas as pd

In [2]:
afqt = pd.read_csv(os.path.join('..', 'data', 'external',
                                'afqt', 'afqt.csv'),
                   index_col=False, header=0)

column_labels = dict()
column_labels['R0000100'] = 'IDENTIFIER'
column_labels['R0000500'] = 'YEAR_OF_BIRTH_1979'
column_labels['R0000300'] = 'MONTH_OF_BIRTH_1979'

column_labels['R0410100'] = 'MONTH_OF_BIRTH_1981'
column_labels['R0410300'] = 'YEAR_OF_BIRTH_1981'

afqt.rename(columns=column_labels, inplace=True)

In [3]:
### Construct MONTH_OF_BIRTH and YEAR_OF_BIRTH
# Replace -5 values with np.nan
afqt.loc[
    afqt.YEAR_OF_BIRTH_1981 == -5, 'YEAR_OF_BIRTH_1981'] = np.nan
afqt.loc[
    afqt.MONTH_OF_BIRTH_1981 == -5, 'MONTH_OF_BIRTH_1981'] = np.nan

# Fill missing 1981 values with the values from the 1979 survey
afqt.loc[afqt.YEAR_OF_BIRTH_1981.isnull(),
         'YEAR_OF_BIRTH_1981'] = afqt.YEAR_OF_BIRTH_1979
afqt.loc[afqt.MONTH_OF_BIRTH_1981.isnull(),
         'MONTH_OF_BIRTH_1981'] = afqt.MONTH_OF_BIRTH_1979

# Cast to integers; astype(int) raises if any NaN remains, so this doubles as a check
afqt['MONTH_OF_BIRTH'] = afqt.MONTH_OF_BIRTH_1981.astype(int)
afqt['YEAR_OF_BIRTH'] = afqt.YEAR_OF_BIRTH_1981.astype(int)

# Drop old variables
afqt.drop([
    i for i in afqt.columns
    if i.startswith(('MONTH_OF_BIRTH_', 'YEAR_OF_BIRTH_'))],
    axis=1, inplace=True)
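
The same "prefer 1981, fall back to 1979" rule can also be written as a small helper built on `Series.fillna`. This is a minimal sketch on toy data, not the code used above: `prefer_1981` is a hypothetical name, and it assumes, as the cell above does, that -5 is the only missing code to handle.

```python
import numpy as np
import pandas as pd


def prefer_1981(df, name):
    """Return the 1981 value of `name`, using 1979 where 1981 is -5 or NaN."""
    primary = df[name + '_1981'].replace(-5, np.nan)
    # astype(int) raises if a NaN survives, so it doubles as a check.
    return primary.fillna(df[name + '_1979']).astype(int)


# Toy data (hypothetical values, not real NLSY79 records).
toy = pd.DataFrame({
    'MONTH_OF_BIRTH_1979': [1, 6, 12],
    'YEAR_OF_BIRTH_1979': [58, 60, 62],
    'MONTH_OF_BIRTH_1981': [1, 7, -5],
    'YEAR_OF_BIRTH_1981': [58, 60, -5],
})

month = prefer_1981(toy, 'MONTH_OF_BIRTH')
year = prefer_1981(toy, 'YEAR_OF_BIRTH')
```

The helper leaves the input frame untouched, which makes it easier to compare the combined columns against the originals before dropping them.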