Breakout: Data Exploration and Visualization


In [ ]:
# Start with our normal batch of imports and settings
from __future__ import print_function, division

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns; sns.set()

1. Load the Data

  1. Download the files:
  2. Load the females and males data separately (try setting the index_col argument to set the index to the first column)
  3. Combine them into a single DataFrame using pd.concat

In [ ]:
females = pd.read_csv('data/femaleVisitsToPhysician.csv')
males = pd.read_csv('data/maleVisitsToPhysician.csv')
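
As a hint for the combining step, here is a minimal sketch. It uses two small made-up frames in place of the CSV files, and the column names `year`, `age`, `visits`, and `gender` are assumptions that may not match the real data. It also assumes the default integer index rather than one set with index_col.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; real column names may differ
females = pd.DataFrame({"year": [2009, 2010], "age": [30, 30], "visits": [3.1, 3.4]})
males = pd.DataFrame({"year": [2009, 2010], "age": [30, 30], "visits": [2.0, 2.2]})

# Tag each frame so the rows stay distinguishable after combining
# ("gender" is an assumed column name, not one from the files)
females["gender"] = "female"
males["gender"] = "male"

# Stack the two frames vertically and renumber the rows from zero
data = pd.concat([females, males], ignore_index=True)
print(data)
```

Adding the label before concatenating is one option; passing `keys=['female', 'male']` to pd.concat would instead record the gender as an extra index level.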

2. Visualize the data

For each gender, the data shows the per capita consultations by age and year. Use pd.pivot_table and plot the data.

Also, as you create these plots, experiment with sns.set_palette to get a color scheme which helps convey the information you're interested in.

  1. Use a pivot table to index the data by age and gender
  2. Plot age vs per capita visits for females, one line per year
  3. Plot age vs per capita visits for males, one line per year
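
One possible shape for the pivot step, sketched on made-up numbers. The column names `gender`, `year`, `age`, and `visits` are assumptions, so substitute whatever the loaded data actually uses.

```python
import pandas as pd

# Hypothetical combined data; values are invented for illustration
data = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "year": [2009, 2009, 2010, 2010] * 2,
    "age": [30, 40, 30, 40] * 2,
    "visits": [3.0, 3.5, 3.3, 3.9, 2.0, 2.5, 2.2, 2.8],
})

# One row per age, one column per (gender, year) pair
table = data.pivot_table(values="visits", index="age", columns=["gender", "year"])

# Selecting one gender leaves a frame with one column per year
female_table = table["female"]
print(female_table)
```

Calling `female_table.plot()` on a frame shaped like this draws age on the x-axis with one line per year. A sequential palette, set for example with `sns.set_palette('viridis', n_colors=...)`, may make the ordering of the years easier to read.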

3. Effect of the 2010 Copayment Elimination

The copayment for GP visits was eliminated in 2010. Let's see whether there is any indication that this affected the rate of visits.

  1. Add a column to the data called with_copay, which is True if the year is prior to 2010, and False otherwise
  2. Use a pivot table to plot the mean visits per capita for the years with a copay and without (one plot each for men and women)
  3. Plot the percentage increase in per capita visits as a function of age. What age ranges did the copay change most affect?
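
The three steps above could be approached roughly as follows. This is a sketch on invented numbers with assumed column names (`year`, `age`, `visits`); only `with_copay` comes from the exercise text.

```python
import pandas as pd

# Hypothetical data for a single gender; values are invented
data = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "age": [30, 40, 30, 40],
    "visits": [3.0, 4.0, 3.3, 5.0],
})

# True for years before the copayment was eliminated
data["with_copay"] = data["year"] < 2010

# Mean visits per age, with one column per copay state
means = data.pivot_table(values="visits", index="age", columns="with_copay")

# Give the boolean columns readable labels (these labels are our own choice)
means = means.rename(columns={True: "copay", False: "no_copay"})

# Percentage change relative to the copay years, for each age
pct_increase = 100 * (means["no_copay"] - means["copay"]) / means["copay"]
print(pct_increase)
```

Calling `pct_increase.plot()` then shows the change as a function of age. Note that a simple before/after comparison like this cannot separate the copay change from any other trend over the same years.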

4. Looking at Birth Year

Let's try to pull some information out of the data that's not obviously available.

Notice that the age column and the year column are intertwined: by subtracting the age from the year, we can find the birth year of the group of people recorded.

  1. Create a new column in the data containing the birth year.
  2. Plot the population by birth year, with a different color line for each observation year. What does this tell you about immigration and deaths in the population?
  3. Plot the per capita visits by birth year, with a different color line for each observation year. Are there any generations which have consistently more or consistently fewer visits than those in adjacent years?
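
A minimal sketch of the birth-year idea, using invented numbers. It assumes `age` is stored as an integer and that the data has a `population` column; both are assumptions about the files.

```python
import pandas as pd

# Hypothetical data: the same two cohorts observed in two different years
data = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "age": [30, 31, 31, 32],
    "population": [1000, 1100, 990, 1090],
})

# People aged 30 in 2009 and people aged 31 in 2010 share a birth year
data["birth_year"] = data["year"] - data["age"]

# One row per birth year, one column per observation year
by_birth = data.pivot_table(values="population", index="birth_year", columns="year")
print(by_birth)
```

Reading across a row of `by_birth` follows one cohort through time, so `by_birth.plot()` gives one line per observation year. Swapping `population` for the visits column gives the table for the third step.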

5. Bonus: Exploring Titanic Survivors

If you finish the above tasks, try this more open-ended exploration on a different dataset.

Seaborn includes a dataset representing the individuals who were on board the ill-fated maiden voyage of the Titanic. It has information about their age, gender, class, fare paid, the deck their quarters were on, whether they were traveling with someone, and whether they survived.

This is a fairly open-ended exploration, but try answering these questions:

  1. Did age influence chances of survival?
  2. Did gender influence chances of survival?
  3. Did wealth (measured by class or by fare paid) influence chances of survival?
  4. Did the deck the person was on influence chances of survival?

See what sort of interesting relationships you can find between the various pieces of data.


In [ ]:
# load the titanic data
titanic = sns.load_dataset('titanic')
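
As one possible starting point, survival rates can be compared across groups with groupby or a pivot table. The sketch below runs on a tiny invented frame so that it works without downloading anything. The rows are toy values, not real passengers, but the column names `survived`, `sex`, and `pclass` match those in seaborn's titanic dataset.

```python
import pandas as pd

# Toy stand-in for the titanic frame; the rows are invented
titanic = pd.DataFrame({
    "survived": [1, 0, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "pclass": [1, 1, 2, 2, 3, 3],
})

# The mean of a 0/1 column is the fraction of the group that survived
rates = titanic.groupby("sex")["survived"].mean()

# The same rate broken down by sex (rows) and passenger class (columns)
by_class = titanic.pivot_table(values="survived", index="sex", columns="pclass")
print(rates)
print(by_class)
```

The same two calls apply unchanged to the real frame returned by `sns.load_dataset('titanic')`. For a continuous variable such as age, binning it first (for example with pd.cut) would give groups that can be compared in the same way.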