Find this notebook at https://tinyurl.com/epidemium-ramp

Introduction

Cancer is still a terrible disease. Surprisingly, cancer incidence and mortality rates vary substantially across regions worldwide. This raises several questions about as-yet-unknown preventive and risk factors.

Within the Epidemium initiative, the Cancer Baseline project aims to collect open-data aggregate cancer mortality risks ($y$) worldwide and potential explanatory factors ($X$) in order to model $y=f(X)$, and thereby shed light on new cancer-related factors.

For this first RAMP, you will be the first to analyze the data collected by more than 30 volunteers over three months, and you will compete to build the best cancer mortality prediction model $y=f(X)$. May your work lead to a new generation of solutions against cancer!

If you want to join the project after the RAMP, see http://wiki.epidemium.cc/wiki/Baseline#How_to_start

Tools and setup

The simple way: install the Anaconda Python distribution https://www.continuum.io/downloads

The fine-grained way: install each of the following tools

  • Python
  • Jupyter
  • Scikit-learn
  • Pandas
  • seaborn

Data description

Global description

  • $y$: Cancer mortality risk by region, gender, cancer type, year, and origin (only for the USA).
  • $X$: Explanatory variables. They include
    • the distribution of the population by region, age and gender
    • incidence of cancer by age and gender
    • and other variables, described below. Often, these variables are only present for some years (e.g. 2000, 2007, and 2012), but they can be interpolated in between.
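
Filling the years between two observations can be sketched as follows. This is a minimal illustration on made-up values, not the project's method: the toy frame, the prevalence numbers, and the new column name `smoker_prevalence_interp` are all assumptions. It interpolates linearly within each region and leaves years before the first observation empty.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): an indicator observed only in a few years.
years = list(range(2000, 2013))
toy = pd.DataFrame({
    "Region": ["A"] * len(years) + ["B"] * len(years),
    "Year": years * 2,
    "smoker_prevalence": np.nan,
})
known = {("A", 2000): 30.0, ("A", 2007): 23.0, ("A", 2012): 18.0,
         ("B", 2007): 40.0, ("B", 2012): 35.0}
for (region, year), value in known.items():
    mask = (toy["Region"] == region) & (toy["Year"] == year)
    toy.loc[mask, "smoker_prevalence"] = value

# Interpolate within each region, in chronological order.
# limit_area="inside" only fills gaps surrounded by observed values,
# so nothing is extrapolated before the first or after the last observation.
toy = toy.sort_values(["Region", "Year"])
toy["smoker_prevalence_interp"] = (
    toy.groupby("Region")["smoker_prevalence"]
       .transform(lambda s: s.interpolate(method="linear", limit_area="inside"))
)
print(toy[toy["Region"] == "B"])
```

Linear interpolation treats consecutive rows as equally spaced, which holds here because every year is present; with missing years, interpolating against the `Year` values themselves would be safer.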

Note regarding the names of variables

  • g_* : if the variable takes different values for different genders, the name of the variable starts with g_. Example: for males in Corrèze, “smoker_prevalence” is the number of smokers in Corrèze, whereas “g_smoker_prevalence” is the number of male smokers in Corrèze.
  • a_* : if the variable takes different values depending on age, the name of the variable starts with a_. Example: for a given age group in Corrèze, “smoker_prevalence” is the number of smokers in Corrèze, whereas “a_smoker_prevalence” is the number of smokers of that age group in Corrèze.
  • ga_* : if both g and a are combined.
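
The prefix convention above can be read programmatically. The helper below is a hypothetical sketch (the function name `variable_scope` and its return labels are assumptions, not part of the dataset tooling). It checks the longest prefix first and requires the trailing underscore, so that a name such as `gas_to_electricity` is not mistaken for a `ga_` variable.

```python
def variable_scope(col_name):
    """Return the dimensions a column varies by, based on its name prefix."""
    if col_name.startswith("ga_"):   # check the combined prefix first
        return "gender and age"
    if col_name.startswith("g_"):
        return "gender"
    if col_name.startswith("a_"):
        return "age"
    return "neither"

columns = ["smoker_prevalence", "g_smoker_prevalence", "ga_HIV", "gas_to_electricity"]
scopes = {name: variable_scope(name) for name in columns}
for name, scope in scopes.items():
    print(name, "->", scope)
```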

Column description

  • alcool_consumption: average alcohol consumption in liters per person per year (at this stage of the project, slight variations in definition are permitted depending on the country. Example: Germany: average alcohol consumption for people of sex X aged 16 and over. USA: alcohol consumption in 2009)
  • alcool_consumption_beer: average beer consumption per person in 2009 - litre/person/year
  • alcool_consumption_spirit: average spirit consumption per person in 2009 - litre/person/year
  • alcool_consumption_wine: average wine consumption per person in 2009 - litre/person/year
  • alcool_death: deaths per 100,000 population due to alcohol use in Germany in 2011
  • arsenic_concentration: arsenic concentrations in soil.
  • arsenic_emission: Amount of arsenic from emissions
  • benzo(a)pyren_emission: Amount of benzo(a)pyrene from emissions
  • benzo(b)fluoranthen_emission: Amount of benzo(b)fluoranthene from emissions
  • benzo(k)fluoranthen_emission: Amount of benzo(k)fluoranthene from emissions
  • beryllium_emission: level of beryllium in the environment from facilities that manufacture or process Beryllium in 1990 - kg/capita/year
  • bmi_18.5-: underweight (Body Mass Index < 18.5 kg/m2). Type : rate
  • bmi_18.5_25: healthy weight BMI from 18.5 to 25 kg/m2. Type : rate
  • bmi_25_30: overweight from 25 to 30 kg/m2. Type : rate
  • bmi_30+: obese BMI >= 30 kg/m2. Type : rate
  • bmi_score: BMI (source for many countries: MRC-HPA Centre for Environment and Health)
  • cadmium_emission: Amount of cadmium from emissions
  • cholesterol_prevalence: The percentage of the population with high cholesterol levels in Germany
  • chromium_emission: Amount of chromium from emissions
  • coal_to_electricity: Electricity production in 2013, Percentage of electricity produced from coal
  • copper_emission: Amount of copper from emissions
  • dietary_characteristics_alcohol: daily consumption of beer, wine, sparkling wine, spirits, and others (e.g. alcopops, alcoholic cocktails)
  • dietary_characteristics_calcis: quantity of calcium in the daily diet in Germany
  • dietary_characteristics_cereals_bread: daily consumption of bread, bakery products, cereals and cereal products, and cereal-based dishes
  • dietary_characteristics_cheese_milk: daily consumption of milk, cheese and products thereof, and milk-based dishes
  • dietary_characteristics_cholesterol: quantity of cholesterol in the daily diet in Germany
  • dietary_characteristics_coffee_tea: daily consumption of water, coffee and tea (green / black), herbal and fruit tea, fruit juices / nectars, and sodas
  • dietary_characteristics_confectionery: daily consumption of confectionery
  • dietary_characteristics_eggs: daily consumption of eggs and egg-based dishes
  • dietary_characteristics_fat: daily consumption of fats and oils (men)
  • dietary_characteristics_fish: daily consumption of fish / crustaceans and products thereof, and dishes based on fish / shellfish
  • dietary_characteristics_fruit: daily consumption of fruit products (juice)
  • dietary_characteristics_iodine: quantity of iodine in the daily diet in Germany
  • dietary_characteristics_meat: daily consumption of meat, meat products and sausages, and meat-based dishes
  • dietary_characteristics_potatoes: daily consumption of potato products and potato-based dishes
  • dietary_characteristics_proteins: quantity of protein in the daily diet in Germany
  • dietary_characteristics_snacks: daily consumption of snack mixes
  • dietary_characteristics_soup: daily consumption of soups and stews
  • dietary_characteristics_vegetables: daily consumption of vegetables (juice), mushrooms, legumes, and vegetable-based dishes
  • dietary_characteristics_vitamine_A: quantity of vitamin A (retinol equivalents) in the daily diet in Germany
  • dietary_characteristics_vitamine_B1: quantity of vitamin B1 in the daily diet in Germany
  • dietary_characteristics_vitamine_B12: quantity of vitamin B12 in the daily diet in Germany
  • dietary_characteristics_vitamine_B2: quantity of vitamin B2 in the daily diet in Germany
  • dietary_characteristics_vitamine_B6: quantity of vitamin B6 in the daily diet in Germany
  • dietary_characteristics_vitamine_C: quantity of vitamin C in the daily diet in Germany
  • dietary_characteristics_vitamine_D: quantity of vitamin D in the daily diet in Germany
  • dietary_characteristics_vitamine_E: quantity of vitamin E in the daily diet in Germany
  • dioxin_emission: Amount of dioxin from emissions
  • diphteria_vacc: The percentage of people vaccinated against diphtheria
  • drugs_crack: Estimated prevalence of crack/cocaine users in the population aged 15 to 64. Type : rate
  • drugs_inject: Estimated prevalence of injecting drug users in the population aged 15 to 64. Type : rate
  • drugs_opiates: Estimated prevalence of opiate users in the population aged 15 to 64. Type : rate
  • fastfood_spending: The amount of money spent on fast food in 2007 - USD/capita/year
  • first_birth_age: The average age at first birth in 2006
  • first_menstruation_age: the age at first menstruation in Germany
  • gas_to_electricity: Electricity production in 2013, Percentage of electricity produced from natural gas
  • gov_health_spending_perperson: Government health spending per person (international \$), WHO
  • HAV: The number of cases of acute hepatitis A in 2012 - cases/population/year
  • HBV: The number of cases of acute hepatitis B in 2012 - cases/population/year
  • hcb_emission: Amount of HCB from emissions
  • hepb_vacc: The percentage of people vaccinated against Hepatitis B
  • hib_vacc: The percentage of people vaccinated against Hib (Haemophilus influenzae type b)
  • HIV: The number of AIDS cases in Germany from 1982 to 2009
  • HPV_vacc_1+, 2+, 3+: HPV vaccination coverage among adolescents aged 13-17, by sex, in 2013: at least 1, 2, or 3 HPV vaccinations respectively
  • Indeno(1|2|3-cd)pyren_emission: Amount of indeno(1,2,3-cd)pyrene from emissions
  • inequality_index: the Gini index by the World Bank
  • lead_emission: Amount of lead from emissions
  • liver_transplant_prevalence: Living donor liver transplants excluded/Million people/number of centers in 2009-2011
  • life_expectancy: used as a measure of aggregate mortality rates in a given year. A factor may appear to reduce cancer mortality only because it increases the risk of dying earlier from other causes; as a simple benefit/risk check, a preventive factor should not have a negative impact on life expectancy.
  • g_long_term_unemployment: Male/Female long term unemployment rate (source: International Labour Organization)
  • measles_vacc_1: The percentage of people vaccinated against measles (the first time)
  • measles_vacc_2: The percentage of people vaccinated against measles (the second time)
  • menC_vacc: The percentage of people vaccinated against meningococcus C
  • mercury_emission: Amount of mercury from emissions
  • metal_water_concentration: Concentrations of metals in rainwater
  • mumps_vacc_1: The percentage of people vaccinated against mumps (the first time)
  • mumps_vacc_2: The percentage of people vaccinated against mumps (the second time)
  • nickel_emission: Amount of nickel from emissions
  • nmvok_emission: Amount of NMVOC (non-methane volatile organic compounds) from emissions
  • nox_emission: Amount of NOx from emissions
  • nuclear_to_electricity: Electricity production in 2013, Percentage of electricity produced from nuclear energy
  • other_to_electricity: Electricity production in 2013, Percentage of electricity produced from others
  • individual_health_spending: Out-of-pocket share of total health spending (%) (source: Global Health Expenditure Database)
  • pah_emission: Amount of PAH from emissions
  • pcb_emission: Amount of PCB from emissions
  • pertussis_vacc: The percentage of people vaccinated against pertussis
  • pm10_emission: Amount of pm10 from emissions
  • pm2.5_emission: Amount of pm2.5 from emissions
  • pneumo_vacc: The percentage of people vaccinated against pneumococcus
  • population: population (males and females combined)
  • g_population_60+: share of males (respectively females) aged 60+
  • urban_population: share of persons living in urban areas
  • polio_vacc: The percentage of people vaccinated against Poliomyelitis
  • radon_level: average radon level in 2015 - Bq/m3
  • renewable_to_electricity: Electricity production in 2013, Percentage of electricity produced from renewable energy
  • rubella_vacc_1: The percentage of people vaccinated against rubella (the first time)
  • rubella_vacc_2: The percentage of people vaccinated against rubella (the second time)
  • shale_oil: Shale Production in 2012 - kg/capita/year
  • smoker_prevalence: The percentage of people smoking. Type: Rate
  • smoking_10_19cigarets: Moderate smoking (10 - 19 cigarettes per day). Type : rate
  • smoking_20+cigarets: Heavy smoking (20 cigarettes or more per day). Type : rate
  • so2_emission: Amount of SO2 from emissions
  • tetanus_vacc: The percentage of people vaccinated against tetanus
  • transplants_cases: The number of transplants in 2012 - cases/population/year
  • tsp_emission: Amount of TSP (total suspended particles) from emissions
  • uninsured: The estimated number of uninsured individuals under age 65 in the county in 2006 - cases/population/year
  • varicella_vacc_1: The percentage of people vaccinated against varicella (the first time)
  • varicella_vacc_2: The percentage of people vaccinated against varicella (the second time)

Challenge

The vast majority of the variables are extremely sparse, and part of the challenge will be to:

  • select the right variables to include in the model,
  • potentially impute missing values, or interpolate between different dates.
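
One possible starting point for both steps is sketched below, on a toy frame with made-up values. The threshold `min_notnull_rate` and the choice of median imputation are assumptions for illustration only; in a real submission, the medians would normally be computed on the training fold alone.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one well-filled and one sparse feature.
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "measles_vacc_1": [84.0, np.nan, 91.0, 97.0],  # mostly observed
    "radon_level": [np.nan, np.nan, np.nan, 8.2],  # mostly missing
    "target": [279.5, 740.0, 2210.0, 148.05],
})

# 1. Variable selection: keep columns whose share of non-null values
#    reaches an (arbitrary, assumed) threshold.
min_notnull_rate = 0.5
notnull_rate = toy.notnull().mean()
kept = toy.loc[:, notnull_rate >= min_notnull_rate]

# 2. Imputation: fill the remaining gaps in the features with column medians.
features = kept.drop(columns="target")
imputed = features.fillna(features.median())
print(imputed)
```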

Imports & config


In [79]:
%matplotlib inline

In [2]:
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

In [3]:
pd.set_option('display.max_columns', None)

Meta data analysis

Let's start with a meta-data analysis, i.e. the number of missing values, the number of unique categories, etc.


In [80]:
filename = 'data/public/train.csv'

In [81]:
df = pd.read_csv(filename)

In [82]:
df.shape


Out[82]:
(62091, 223)

In [83]:
df.head(3)


Out[83]:
RegionType Part of Region Year Gender Age MainOrigin asbestos_exposure asbestos_production arsenic_emission arsenic_concentration beryllium_emission cadmium_emission cadmium_intake cadmium_concentration_cortex cadmium_concentration_soil cadmium_export chromium_emission chromium_concentration crystalline_silica_concentration hexavalent_chromium_compounds nickel_emission nickel_concentration thorium_deposit thorium_concentration metal_water_concentration copper_emission mercury_emission lead_emission pm10_emission pm2.5_emission so2_emission nmvok_emission tsp_emission nox_emission benzo(a)pyren_emission benzo(b)fluoranthen_emission benzo(k)fluoranthen_emission Indeno(1|2|3-cd)pyren_emission pah_emission pcb_emission dioxin_emission hcb_emission aspartame_consumption no_sugar fluoride_consumption vitamine_D_summer vitamine_D_winter ga_dietary_characteristics_fat ga_dietary_characteristics_cholesterol ga_dietary_characteristics_proteins ga_dietary_characteristics_vitamine_A ga_dietary_characteristics_vitamine_D ga_dietary_characteristics_vitamine_E ga_dietary_characteristics_vitamine_B1 ga_dietary_characteristics_vitamine_B2 ga_dietary_characteristics_vitamine_B6 ga_dietary_characteristics_vitamine_B12 ga_dietary_characteristics_vitamine_C ga_dietary_characteristics_iodine ga_dietary_characteristics_calcis g_dietary_characteristics_fruit_0 g_dietary_characteristics_fruit_0_1 g_dietary_characteristics_fruit_1_2 g_dietary_characteristics_fruit_2_3.1 g_dietary_characteristics_fruit_3_4 g_dietary_characteristics_fruit_4_5 g_dietary_characteristics_fruit_5+ g_dietary_characteristics_fruit g_dietary_characteristics_cereals_bread g_dietary_characteristics_vegetables g_dietary_characteristics_potatoes g_dietary_characteristics_meat g_dietary_characteristics_fish g_dietary_characteristics_eggs g_dietary_characteristics_cheese_milk g_dietary_characteristics_alcohol g_dietary_characteristics_soup g_dietary_characteristics_confectionery g_dietary_characteristics_snacks 
g_dietary_characteristics_coffee_tea cholesterol_prevalence transplants_donation transplants_prevalence liver_transplant_prevalence ga_HIV HIV_15_49 g_HPV_vac3_15+ g_HPV_vacc_1+ g_HPV_vacc_2+ g_HPV_vacc_3+ ga_HPV HCV g_HBV HAV leukemia hhv8 diphteria_vacc tetanus_vacc pertussis_vacc hib_vacc polio_vacc hepb_vacc measles_vacc_1 measles_vacc_2 mumps_vacc_1 mumps_vacc_2 rubella_vacc_1 rubella_vacc_2 varicella_vacc_1 varicella_vacc_2 menC_vacc pneumo_vacc radon_level radon_high uv_radiation x_rays gamma_rays g_smoker_prevalence g_past_smoker_prevalence g_no_smoker_prevalence g_smoking_20+cigarets g_smoking_10_19cigarets g_smoking_10-cigarets g_past_smoker_prevalence.1 g_occasion_smoker_prevalence g_smoker_prevalence_0_18 g_smoker_prevalence_18+ alcool_prevalence_beer alcool_prevalence_wine alcool_consumption alcool_consumption_beer alcool_death drugs_opiates drugs_crack drugs_inject overweight bmi_score bmi_18.5- bmi_18.5_25 bmi_25_30 bmi_30+ aluminium_production aluminium_water_concentration benzene_emission benzene_air_concentration butadiene_emission butadiene_air_concentration ethylene_emission formaldehyde_emission formaldehyde_water_concentration sulfuric_acid_emission vinyl_chlorid_emission vinyl_chlorid_water_concentration iso_9001 iso_14001 iso/iec_27001 iso_50001 iso_13485 iso_22000 mobile_subscription pork_consumption chicken_consumption beef_consumption adolescent_birth_15_19 income race_white race_mixed race_asian race_black race_other height first_birth never_birth coke_ovens_co2_emission sintering_co2_emission blast_furnaces_co2_emission oxygen_furnaces_co2_emission flared_gases_co2_emission combustion_plant_co2_emission fuels_sold_co2_emission uninsured_0_65 first_birth_age shale_oil fastfood_spending g_first_menstruation_age coal_to_electricity gas_to_electricity nuclear_to_electricity renewable_to_electricity other_to_electricity women_age_1st_marriage family_size_2009 persons_per_home birth_rate unemployment_rate_15_64_2009 companies_agri 
companies_indus companies_construction companies_services companies_admin dietary_lead bread_eaters dietary_bread water_drinkers dietary_water milk_drinkers dietary_milk soft_drinkers dietary_soft diatery_aluminium fish_eaters dietary_fish dietary_sodium dietary_dioxin dietary_deoxynivalenol dietary_acrylamide fried_potato_eaters dietary_fried_potato coffee_drinkers dietary_coffee target cancer_type
0 Country World France 2000 Female 0_100 Any NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 97 97 NaN 86 97 51 84 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 26.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34774 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 279.5 Brain, central nervous system (C70-72)
1 State United States of America Missouri 2007 Female 0_100 White NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 740.0 Colon, rectum and anus (C18-21)
2 State United States of America Colorado 2007 Female 0_100 Black NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2210.0 Lung (C33-34)

In [55]:
df.describe()


Out[55]:
Year asbestos_exposure asbestos_production arsenic_emission arsenic_concentration beryllium_emission cadmium_emission cadmium_intake cadmium_concentration_cortex cadmium_concentration_soil cadmium_export chromium_emission chromium_concentration crystalline_silica_concentration hexavalent_chromium_compounds nickel_emission nickel_concentration thorium_deposit thorium_concentration metal_water_concentration copper_emission mercury_emission lead_emission pm10_emission pm2.5_emission so2_emission nmvok_emission tsp_emission nox_emission benzo(a)pyren_emission benzo(b)fluoranthen_emission benzo(k)fluoranthen_emission Indeno(1|2|3-cd)pyren_emission pah_emission pcb_emission dioxin_emission hcb_emission aspartame_consumption no_sugar fluoride_consumption vitamine_D_summer vitamine_D_winter ga_dietary_characteristics_fat ga_dietary_characteristics_cholesterol ga_dietary_characteristics_proteins ga_dietary_characteristics_vitamine_A ga_dietary_characteristics_vitamine_D ga_dietary_characteristics_vitamine_E ga_dietary_characteristics_vitamine_B1 ga_dietary_characteristics_vitamine_B2 ga_dietary_characteristics_vitamine_B6 ga_dietary_characteristics_vitamine_B12 ga_dietary_characteristics_vitamine_C ga_dietary_characteristics_iodine ga_dietary_characteristics_calcis g_dietary_characteristics_fruit_0 g_dietary_characteristics_fruit_0_1 g_dietary_characteristics_fruit_1_2 g_dietary_characteristics_fruit_2_3.1 g_dietary_characteristics_fruit_3_4 g_dietary_characteristics_fruit_4_5 g_dietary_characteristics_fruit_5+ g_dietary_characteristics_fruit g_dietary_characteristics_cereals_bread g_dietary_characteristics_vegetables g_dietary_characteristics_potatoes g_dietary_characteristics_meat g_dietary_characteristics_fish g_dietary_characteristics_eggs g_dietary_characteristics_cheese_milk g_dietary_characteristics_alcohol g_dietary_characteristics_soup g_dietary_characteristics_confectionery g_dietary_characteristics_snacks g_dietary_characteristics_coffee_tea 
cholesterol_prevalence transplants_donation transplants_prevalence liver_transplant_prevalence ga_HIV HIV_15_49 g_HPV_vac3_15+ g_HPV_vacc_1+ g_HPV_vacc_2+ g_HPV_vacc_3+ ga_HPV HCV g_HBV HAV leukemia hhv8 diphteria_vacc tetanus_vacc pertussis_vacc polio_vacc hepb_vacc measles_vacc_1 measles_vacc_2 mumps_vacc_1 mumps_vacc_2 rubella_vacc_1 rubella_vacc_2 varicella_vacc_1 varicella_vacc_2 menC_vacc pneumo_vacc radon_level radon_high uv_radiation x_rays gamma_rays g_past_smoker_prevalence g_no_smoker_prevalence g_smoking_20+cigarets g_smoking_10_19cigarets g_smoking_10-cigarets g_past_smoker_prevalence.1 g_occasion_smoker_prevalence g_smoker_prevalence_0_18 g_smoker_prevalence_18+ alcool_prevalence_beer alcool_prevalence_wine alcool_consumption alcool_consumption_beer alcool_death drugs_opiates drugs_crack drugs_inject overweight bmi_score bmi_18.5- bmi_18.5_25 bmi_25_30 bmi_30+ aluminium_production aluminium_water_concentration benzene_emission benzene_air_concentration butadiene_emission butadiene_air_concentration ethylene_emission formaldehyde_emission formaldehyde_water_concentration sulfuric_acid_emission vinyl_chlorid_emission vinyl_chlorid_water_concentration iso_9001 iso_14001 iso/iec_27001 iso_50001 iso_13485 iso_22000 mobile_subscription pork_consumption chicken_consumption beef_consumption adolescent_birth_15_19 income race_white race_mixed race_asian race_black race_other height first_birth never_birth coke_ovens_co2_emission sintering_co2_emission blast_furnaces_co2_emission oxygen_furnaces_co2_emission flared_gases_co2_emission combustion_plant_co2_emission fuels_sold_co2_emission uninsured_0_65 first_birth_age shale_oil fastfood_spending g_first_menstruation_age coal_to_electricity gas_to_electricity nuclear_to_electricity renewable_to_electricity other_to_electricity women_age_1st_marriage family_size_2009 persons_per_home birth_rate unemployment_rate_15_64_2009 companies_agri companies_indus companies_construction companies_services companies_admin 
dietary_lead bread_eaters dietary_bread water_drinkers dietary_water milk_drinkers dietary_milk soft_drinkers dietary_soft diatery_aluminium fish_eaters dietary_fish dietary_sodium dietary_dioxin dietary_deoxynivalenol dietary_acrylamide fried_potato_eaters dietary_fried_potato coffee_drinkers dietary_coffee target
count 62679.000000 0 166.000000 1.480000e+02 38 181.000000 148.000000 36 36 0 596.000000 148.000000 0 0 0 148.000000 0 0 0 0 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 148.000000 1.480000e+02 148.000000 32.0 0 0 0 34.000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.313000e+03 0 0 47812.000000 0 0 0 0 0 0 1.346000e+03 1.367000e+03 0 0 47397.000000 47397.000000 0 47397.000000 27861.000000 47397.000000 0 0 0 0 0 0 0 0 0 1399.000000 159.000000 0 36 29 0 0 0 0 0 0 0 0 0 0 0 4496.000000 0 0 0 0 0 0 0 0 0 0 0 32.000000 32 0 32.00 0 32.00 0 0 35 0 0 32.0 127.000000 127.000000 127 127.000000 127.000000 127.000000 0 0 0 0 169.000000 28982.000000 0 0 0 0 0 0 0 0 33 33 33 33 33 33 33 0 0 1340.000000 1362.000000 17.0 0 0 0 0 0 3468.000000 588.000000 588.000000 588.000000 588.000000 588.000000 588.000000 588.000000 588.000000 588.000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 62679.000000
mean 2001.575025 NaN 0.142465 6.551802e-05 300 0.000032 0.000094 59 90 NaN 0.000074 0.000708 NaN NaN NaN 0.001239 NaN NaN NaN NaN 0.026467 0.000127 0.002664 2.849791 1.450256 5.240903 14.439512 191.671861 15.981971 0.369054 0.015352 0.011381 0.008431 2.244276 0.002906 8.327222e-07 0.000043 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.039166e-06 NaN NaN 0.477991 NaN NaN NaN NaN NaN NaN 2.083764e-07 1.134499e-07 NaN NaN 86.461211 86.461211 NaN 86.461211 58.944797 85.171023 NaN NaN NaN NaN NaN NaN NaN NaN NaN 85.674053 0.088302 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 9.503003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 1252.834646 726.905512 0 0.472441 3.692913 18.149606 NaN NaN NaN NaN 0.001114 24372.556311 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 51713.487312 665.179979 13.5 NaN NaN NaN NaN NaN 25.543832 2.310887 1.880547 0.011603 0.078508 0.160734 0.061990 0.098149 0.537909 0.141218 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1306.549900
std 6.928115 NaN 0.000431 9.428223e-07 0 0.000021 0.000002 0 0 NaN 0.000081 0.000008 NaN NaN NaN 0.000072 NaN NaN NaN NaN 0.000365 0.000001 0.000054 0.034888 0.053895 0.088077 0.430828 170.955294 0.288171 0.027884 0.000274 0.000254 0.000159 0.172248 0.000063 2.755747e-08 0.000001 0.0 NaN NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.202000e-06 NaN NaN 1.755877 NaN NaN NaN NaN NaN NaN 2.046072e-07 1.334697e-07 NaN NaN 21.593261 21.593261 NaN 21.593261 42.739487 21.711363 NaN NaN NaN NaN NaN NaN NaN NaN NaN 22.711183 0.070309 NaN 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.272556 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 0 NaN 0.00 NaN 0.00 NaN NaN 0 NaN NaN 0.0 1280.910792 819.942492 0 0.764560 4.718272 17.767046 NaN NaN NaN NaN 0.000182 19007.509455 NaN NaN NaN NaN NaN NaN NaN NaN 0 0 0 0 0 0 0 NaN NaN 158668.642846 165.025427 0.0 NaN NaN NaN NaN NaN 2.970476 0.117506 0.281507 0.001927 0.012630 0.088169 0.010241 0.015638 0.083919 0.019945 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2893.029846
min 1985.000000 NaN 0.141844 6.456675e-05 300 0.000007 0.000091 59 90 NaN 0.000005 0.000699 NaN NaN NaN 0.001179 NaN NaN NaN NaN 0.025833 0.000125 0.002602 2.823842 1.394183 5.152575 14.087271 4.263573 15.692754 0.344928 0.014931 0.011010 0.008185 2.106322 0.002851 8.065691e-07 0.000042 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.299475e-08 NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN 0.000000e+00 0.000000e+00 NaN NaN 0.000000 0.000000 NaN 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.500000 0.010000 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.100000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 188.000000 51.000000 0 0.000000 0.000000 0.000000 NaN NaN NaN NaN 0.000923 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 0.000000 371.845051 13.5 NaN NaN NaN NaN NaN 20.448917 1.926032 1.085737 0.007825 0.050914 0.001108 0.036157 0.050476 0.363918 0.090387 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.200000
25% 1996.000000 NaN 0.142180 6.456675e-05 300 0.000013 0.000091 59 90 NaN 0.000010 0.000705 NaN NaN NaN 0.001179 NaN NaN NaN NaN 0.026549 0.000127 0.002602 2.823842 1.394183 5.152575 14.087271 4.425263 15.692754 0.344928 0.015356 0.011285 0.008400 2.106322 0.002883 8.065691e-07 0.000042 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.935369e-07 NaN NaN 0.060000 NaN NaN NaN NaN NaN NaN 4.124627e-08 2.855511e-08 NaN NaN 86.000000 86.000000 NaN 86.000000 0.000000 84.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 74.000000 0.040000 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.560000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 242.000000 110.000000 0 0.000000 0.000000 2.000000 NaN NaN NaN NaN 0.000923 9915.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 0.000000 547.367403 13.5 NaN NaN NaN NaN NaN 23.300493 2.239536 1.662349 0.010303 0.070346 0.103173 0.056539 0.088401 0.473223 0.127891 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 116.950000
50% 2002.000000 NaN 0.142405 6.554936e-05 300 0.000025 0.000093 59 90 NaN 0.000067 0.000705 NaN NaN NaN 0.001197 NaN NaN NaN NaN 0.026601 0.000127 0.002637 2.825061 1.412005 5.174038 14.101632 345.405220 15.752602 0.351454 0.015364 0.011456 0.008445 2.134741 0.002883 8.105712e-07 0.000043 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.367858e-06 NaN NaN 0.100000 NaN NaN NaN NaN NaN NaN 1.522939e-07 6.662859e-08 NaN NaN 94.000000 94.000000 NaN 94.000000 86.000000 93.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 74.000000 0.050000 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 9.700000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 467.000000 235.000000 0 0.000000 1.000000 10.000000 NaN NaN NaN NaN 0.001153 18477.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 0.000000 654.267682 13.5 NaN NaN NaN NaN NaN 25.199867 2.292997 1.895167 0.011184 0.078459 0.158474 0.061458 0.097294 0.532485 0.138942 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 328.950000
75% 2007.000000 NaN 0.142744 6.554936e-05 300 0.000059 0.000096 59 90 NaN 0.000113 0.000707 NaN NaN NaN 0.001236 NaN NaN NaN NaN 0.026653 0.000128 0.002694 2.850798 1.491012 5.311071 14.545496 347.757105 16.290821 0.369737 0.015689 0.011694 0.008619 2.230193 0.002884 8.518712e-07 0.000043 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.386611e-06 NaN NaN 0.400000 NaN NaN NaN NaN NaN NaN 2.728600e-07 1.459483e-07 NaN NaN 97.000000 97.000000 NaN 97.000000 95.000000 96.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 111.000000 0.130000 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 2877.000000 1916.000000 0 1.000000 7.000000 41.000000 NaN NaN NaN NaN 0.001153 36908.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 1946.579659 734.805659 13.5 NaN NaN NaN NaN NaN 27.582966 2.360156 2.067503 0.012931 0.084777 0.226664 0.068004 0.106683 0.582748 0.153228 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 847.300000
max 2013.000000 NaN 0.143084 6.707200e-05 300 0.000059 0.000098 59 90 NaN 0.000353 0.000722 NaN NaN NaN 0.001363 NaN NaN NaN NaN 0.026813 0.000128 0.002745 2.912773 1.522902 5.361578 15.141798 347.757105 16.302235 0.416434 0.015689 0.011694 0.008619 2.542314 0.003023 8.712807e-07 0.000045 5.3 NaN NaN NaN 0.730956 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.065406e-05 NaN NaN 18.100000 NaN NaN NaN NaN NaN NaN 7.836792e-07 6.631131e-07 NaN NaN 99.000000 99.000000 NaN 99.000000 99.000000 99.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN 148.000000 0.200000 NaN 1477 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23.010000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8.241804 200 NaN 16.25 NaN 2.25 NaN NaN 5 NaN NaN 0.5 3358.000000 1969.000000 0 2.000000 13.000000 47.000000 NaN NaN NaN NaN 0.001384 96245.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1241 1642 2978 98 1139 6984 174 NaN NaN 789229.908459 1340.014378 13.5 NaN NaN NaN NaN NaN 32.404984 2.682073 2.525907 0.018709 0.119442 0.391271 0.089922 0.148683 0.821445 0.207521 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 26497.400000

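The following utility function summarizes each column (dtype, share of non-missing values, number of unique values and a few example values), which helps visualize the sparsity of the dataset.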


In [56]:
def meta_dataframe(df, uniq_examples=7):
    from collections import defaultdict
    res = defaultdict(list)
    for i in range(df.shape[1]):
        res['col_name'].append(df.columns[i])
        uniques = df.iloc[:,i].unique()
        notnull_rate = df.iloc[:,i].dropna().size / df.iloc[:,i].size
        res['n_uniques'].append(uniques.size)
        res['n_notnull'].append(notnull_rate)
        res['dtype'].append(df.iloc[:,i].dtype)
        for j in range(1, uniq_examples + 1):
            v = uniques[j-1] if j <= uniques.size else ''
            res['value_' + str(j)].append(v)
    return pd.DataFrame(res, columns=sorted(res.keys())).set_index('col_name')

In [57]:
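# Drop the rows, then the columns, that are entirely empty
# (recent pandas versions no longer accept a list for `axis`)
meta_df = meta_dataframe(df.dropna(how='all', axis=0).dropna(how='all', axis=1))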

In [58]:
meta_df.sort_values('n_notnull', ascending=False)


Out[58]:
dtype n_notnull n_uniques value_1 value_2 value_3 value_4 value_5 value_6 value_7
col_name
RegionType object 1.000000 6 Country State Departement Land Prefecture Region
Age object 1.000000 1 0_100
target float64 1.000000 27209 279.5 740 2210 148.05 481.4 2.4 311.55
Part of object 1.000000 6 World United States of America France Germany Japan UK, England and Wales
MainOrigin object 1.000000 5 Any White Black Native Asian
cancer_type object 1.000000 32 Brain, central nervous system (C70-72) Colon, rectum and anus (C18-21) Lung (C33-34) Rectum and anus (C19-21) Cervix uteri (C53) Mesothelioma (C45) Breast (C50)
Gender object 1.000000 2 Female Male
Region object 1.000000 255 France Missouri Colorado Georgia Slovakia Kyrgyzstan Peru
Year int64 1.000000 26 2000 2007 1996 1997 2010 2012 2006
HIV_15_49 float64 0.762807 43 0.1 NaN 0 0.2 1.1 0.4 0.7
measles_vacc_1 float64 0.756186 59 84 NaN 65 99 91 97 76
polio_vacc float64 0.756186 58 97 NaN 80 99 90 54 0
tetanus_vacc float64 0.756186 58 97 NaN 80 99 90 54 0
diphteria_vacc float64 0.756186 58 97 NaN 80 99 90 54 0
hib_vacc object 0.721326 139 86 NaN 0 30 98 Greece 99
income float64 0.462388 870 34774 NaN 10913 40115 24438 0 9143
hepb_vacc float64 0.444503 73 51 NaN 74 99 83 88 82
g_smoker_prevalence object 0.102650 110 26.7 NaN 28 #N/ 38.3 34.7 7.3
alcool_consumption float64 0.071731 132 NaN 10.05 17.47 5.55 4.72 12.21 12.5
women_age_1st_marriage float64 0.055330 95 NaN 30.5082 30.5145 23.39 25.52 24.5 28.1573
radon_level float64 0.022320 5 NaN 111 74 148 15.5
HAV float64 0.021810 37 NaN 6.66286e-08 2.22095e-08 2.85551e-08 3.17279e-08 8.88381e-08 9.20109e-08
fastfood_spending float64 0.021730 51 NaN 842.774 565.768 518.984 654.268 617.339 1205.71
g_HBV float64 0.021474 43 NaN 6.98014e-08 1.39603e-07 2.85551e-08 9.51837e-09 1.64985e-07 1.07875e-07
shale_oil float64 0.021379 20 NaN 1514.01 476696 0 2811.73 432.573 648.86
transplants_prevalence float64 0.020948 48 NaN 1.26947e-07 1.67253e-06 1.42816e-07 1.96768e-07 2.36122e-06 2.1835e-06
cadmium_export float64 0.009509 20 NaN 5.43956e-06 0.000352908 8.95491e-05 0.00013209 7.54304e-06 0.000129348
companies_indus float64 0.009381 97 NaN 0.0616856 0.0628141 0.0598837 0.0614723 0.0605005 0.0699262
companies_agri float64 0.009381 97 NaN 0.241577 0.132187 0.235744 0.137961 0.158474 0.20163
unemployment_rate_15_64_2009 float64 0.009381 97 NaN 0.0655347 0.0723124 0.0785899 0.0660401 0.0703712 0.0794741
... ... ... ... ... ... ... ... ... ... ...
nickel_emission float64 0.002361 5 NaN 0.00117857 0.00123615 0.00119682 0.00136271
chromium_emission float64 0.002361 5 NaN 0.000704908 0.000722139 0.000707478 0.000698785
dioxin_emission float64 0.002361 5 NaN 8.06569e-07 8.51871e-07 8.10571e-07 8.71281e-07
iso_22000 float64 0.002026 8 NaN 0 41 10 2 9 14
iso_9001 float64 0.002026 8 NaN 188 3358 375 242 467 853
iso_13485 float64 0.002026 6 NaN 0 13 1 2 7
iso_50001 float64 0.002026 4 NaN 0 2 1
iso/iec_27001 float64 0.002026 2 NaN 0
iso_14001 float64 0.002026 8 NaN 51 1969 174 110 235 390
arsenic_concentration float64 0.000606 2 NaN 300
cadmium_concentration_cortex float64 0.000574 2 NaN 90
x_rays float64 0.000574 2 NaN 1477
cadmium_intake float64 0.000574 2 NaN 59
formaldehyde_water_concentration float64 0.000558 2 NaN 5
vitamine_D_winter float64 0.000542 2 NaN 0.730956
coke_ovens_co2_emission float64 0.000526 2 NaN 1241
fuels_sold_co2_emission float64 0.000526 2 NaN 174
combustion_plant_co2_emission float64 0.000526 2 NaN 6984
flared_gases_co2_emission float64 0.000526 2 NaN 1139
oxygen_furnaces_co2_emission float64 0.000526 2 NaN 98
blast_furnaces_co2_emission float64 0.000526 2 NaN 2978
sintering_co2_emission float64 0.000526 2 NaN 1642
vinyl_chlorid_water_concentration float64 0.000511 2 NaN 0.5
butadiene_air_concentration float64 0.000511 2 NaN 2.25
benzene_air_concentration float64 0.000511 2 NaN 16.25
aluminium_water_concentration float64 0.000511 2 NaN 200
aspartame_consumption float64 0.000511 2 NaN 5.3
aluminium_production float64 0.000511 2 NaN 8.2418
gamma_rays float64 0.000463 2 NaN 60
g_first_menstruation_age float64 0.000271 2 NaN 13.5

88 rows × 10 columns
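
The table shows that hib_vacc and g_smoker_prevalence have dtype object because stray strings (such as Greece or #N/) are mixed in with the numbers. Below is a minimal sketch of one way to clean such columns, assuming that non-numeric entries can safely be treated as missing. The helper name coerce_numeric and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def coerce_numeric(df, cols):
    """Return a copy of df where the given columns are parsed as numbers.

    Entries that cannot be parsed (e.g. 'Greece', '#N/') become NaN.
    """
    out = df.copy()
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out

# Toy data mimicking the stray strings seen in the table above
toy = pd.DataFrame({'hib_vacc': ['86', None, 'Greece', '99'],
                    'g_smoker_prevalence': ['26.7', '#N/', '28', None]})
clean = coerce_numeric(toy, ['hib_vacc', 'g_smoker_prevalence'])
```

After this step the columns are float64 and can go through the same imputation as the other numeric factors.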

How sparse is the data?


In [59]:
meta_df.n_notnull.sort_values(ascending=True, inplace=False).plot.barh(figsize=(10, 6), fontsize=1, alpha=.7);



In [60]:
meta_df.n_notnull.plot.hist(figsize=(10, 6), bins=30, alpha=.7);
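
Since most columns are almost empty, one simple option is to keep only the factors whose share of non-missing values exceeds a threshold. The sketch below illustrates the idea on a toy frame. The helper dense_columns and the 0.5 threshold are assumptions chosen for illustration, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

def dense_columns(df, min_notnull_rate=0.5):
    """Names of the columns whose fraction of non-missing values
    is at least min_notnull_rate."""
    notnull_rate = df.notnull().mean(axis=0)
    return notnull_rate[notnull_rate >= min_notnull_rate].index.tolist()

# Toy data: 'radon_level' is filled for only one row out of four
toy = pd.DataFrame({'Year': [2000, 2001, 2002, 2003],
                    'HIV_15_49': [0.1, np.nan, 0.2, 0.4],
                    'radon_level': [np.nan, np.nan, np.nan, 111.0]})
kept = dense_columns(toy, min_notnull_rate=0.5)
```

On the real data, the same call could be applied to df to choose candidate columns for the model.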


Prediction model

We follow the scikit-learn estimator API. In short:

  • inherit from BaseEstimator,
  • initialize all of the arguments and configuration in the __init__() function,
  • implement a fit() and a predict() function.

More information is available in the official scikit-learn documentation.
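
The three points above can be sketched with a deliberately trivial estimator. The class MeanRegressor and its shrinkage parameter are hypothetical and only illustrate the conventions; this is not the model used in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator

class MeanRegressor(BaseEstimator):
    """Toy regressor that always predicts the (shrunk) training mean."""

    def __init__(self, shrinkage=1.0):
        # __init__ only stores the arguments; no computation happens here
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Learned attributes get a trailing underscore by convention
        self.mean_ = self.shrinkage * np.mean(y)
        return self

    def predict(self, X):
        # One identical prediction per input row
        return np.full(len(X), self.mean_)

reg = MeanRegressor().fit(np.zeros((3, 2)), np.array([1.0, 2.0, 3.0]))
pred = reg.predict(np.zeros((2, 2)))
```

Inheriting from BaseEstimator provides get_params() and set_params(), which is why every configuration value should be an explicit __init__ argument.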


In [61]:
from sklearn.base import BaseEstimator

In [62]:
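# sklearn.cross_validation was removed in recent scikit-learn versions;
# the same function lives in sklearn.model_selection
from sklearn.model_selection import train_test_split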

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, ShuffleSplit

In [64]:
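# Imputer was replaced by SimpleImputer in recent scikit-learn versions;
# the alias keeps the rest of the notebook unchanged
from sklearn.impute import SimpleImputer as Imputer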
from sklearn.metrics import mean_squared_error

In [65]:
from sklearn.pipeline import make_pipeline

In [66]:
df_tmp = df[df['Part of'] == 'France']

In [70]:
# Number of columns containing at least one missing value
df.isnull().any(axis=0).sum()


Out[70]:
214

In [74]:
# Remove from df the rows selected in df_tmp (Part of == 'France')
df = df.drop(df_tmp.index)

In [21]:
class FeatureExtractor(object):
    # The columns you want to include without pre-processing
    core_cols = ['Year']
    # These columns must be discarded. They are only useful in case you would like to 
    # do joins with external data
    region_cols = ['RegionType', 'Part of', 'Region']
    # Categorical columns. They must be processed (use pd.get_dummies for the simplest way)
    categ_cols = ['Gender', 'Age', 'MainOrigin']
    # the different factors to include in the model
    additional_cols = ['HIV_15_49']

    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        ret = X_df[self.core_cols].copy()
        # dummify the categorical variables
        for col in self.categ_cols:
            ret = ret.join(pd.get_dummies(X_df[col], prefix=col[:3]))
        # add extra information
        for col in self.additional_cols:
            ret[col] = X_df[col]
        return ret.values
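
Note that pd.get_dummies creates one column per category present in the frame it receives. If a category is missing from the training or the test split, the two feature matrices will not have the same columns. A minimal sketch of one way to guard against this is to record the dummy columns in fit() and reindex in transform(). The class AlignedDummies is hypothetical and not part of the starting kit.

```python
import pandas as pd

class AlignedDummies(object):
    """Dummy-encode categorical columns with a column set fixed at fit time."""

    def __init__(self, categ_cols):
        self.categ_cols = categ_cols

    def fit(self, X_df, y_array=None):
        # Remember the dummy columns produced on the training data
        self.columns_ = pd.get_dummies(X_df[self.categ_cols]).columns
        return self

    def transform(self, X_df):
        dummies = pd.get_dummies(X_df[self.categ_cols])
        # Categories unseen at fit time are dropped,
        # categories absent from X_df are filled with zeros
        return dummies.reindex(columns=self.columns_, fill_value=0)

train = pd.DataFrame({'Gender': ['Female', 'Male'],
                      'MainOrigin': ['Any', 'White']})
test = pd.DataFrame({'Gender': ['Male', 'Male'],
                     'MainOrigin': ['Black', 'Any']})
enc = AlignedDummies(['Gender', 'MainOrigin']).fit(train)
X_test_toy = enc.transform(test)
```

Here the test frame has no Female row and contains an origin unseen in training, yet X_test_toy keeps exactly the four columns learned from the training frame.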

In [22]:
class Regressor(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(
            Imputer(strategy='median'), 
            RandomForestRegressor(n_estimators=20, max_depth=None))

    def fit(self, X, y):
        return self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)

In [28]:
df_features = df.drop('target', axis=1)
y = df.target.values

df_train, df_test, y_train, y_test = train_test_split(df_features, y, test_size=0.5, random_state=42)

Instantiating our model


In [29]:
feature_extractor = FeatureExtractor()
model = Regressor()

Feature processing and training


In [30]:
X_train = feature_extractor.transform(df_train)
model.fit(X_train, y_train)


Out[30]:
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])

Testing our model


In [32]:
X_test = feature_extractor.transform(df_test)
y_pred = model.predict(X_test)
print('RMSE = ', np.sqrt(mean_squared_error(y_test, y_pred)))


RMSE =  2976.77818377
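
An RMSE is easier to interpret next to a reference value. A common sanity check is to compare it with the RMSE of a constant predictor that always returns the training mean. The sketch below shows the computation on toy arrays. The numbers are made up for illustration and are not results from this dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy targets and predictions (illustrative values only)
y_train_toy = np.array([1.0, 3.0])
y_test_toy = np.array([1.0, 2.0, 3.0, 6.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 5.0])

# Baseline: predict the training mean for every test row
baseline_rmse = rmse(y_test_toy, np.full(len(y_test_toy), y_train_toy.mean()))
model_rmse = rmse(y_test_toy, y_pred_toy)
```

With the notebook's variables, the same comparison would use y_train, y_test and y_pred. A model is only useful if its RMSE is clearly below the baseline.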
