Ice Cream

Instructions / Notes:

Read these carefully

  • Read and execute each cell in order, without skipping forward
  • You may create new Jupyter notebook cells for testing, debugging, exploring, etc. This is encouraged, in fact! Just make sure that your final dataframes and answers use the variables outlined below
  • Have fun!

In [1]:
# Run the following to import necessary packages and import dataset. Do not use any additional plotting libraries.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
datafile = "dataset/icecream.csv"
df = pd.read_csv(datafile)
df


Out[1]:
    month  ice_cream_sales  temperature  deaths_drowning  humidity
0      12             4.75           40                2        30
1      11             4.78           50                3        20
2       1             4.82           55                4        70
3       2             4.83           58                4        70
4       3             4.84           60                5        20
5      10             4.88           55                6        30
6       5             4.91           68                9        20
7       9             4.92           70                9        10
8       4             4.93           75                8        50
9       7             4.93           80               11        10
10      6             4.94           83               12        90
11      8             4.95           88               11        50

The dataset above contains the ice cream sales, temperature, number of deaths by drowning and humidity level in a city during a timespan of 12 months.


In [19]:
# Here are the correlation coefficients between pairs of columns
corr = df.corr()
corr


Out[19]:
                     month  ice_cream_sales  temperature  deaths_drowning   humidity
month             1.000000        -0.164441    -0.207638        -0.051340  -0.481519
ice_cream_sales  -0.164441         1.000000     0.937996         0.952490   0.083232
temperature      -0.207638         0.937996     1.000000         0.950942   0.191497
deaths_drowning  -0.051340         0.952490     0.950942         1.000000   0.080003
humidity         -0.481519         0.083232     0.191497         0.080003   1.000000
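
The coefficients in Out[19] are Pearson correlations, which is the default method of pandas' DataFrame.corr(). As a minimal sketch of what a single entry means, the temperature / deaths_drowning coefficient can be recomputed by hand from the values shown in Out[1]. The names dx, dy and r_manual are illustrative only and do not appear in the notebook.

```python
import numpy as np

# Values copied from the Out[1] table, in row order.
temperature = np.array([40, 50, 55, 58, 60, 55, 68, 70, 75, 80, 83, 88], dtype=float)
deaths_drowning = np.array([2, 3, 4, 4, 5, 6, 9, 9, 8, 11, 12, 11], dtype=float)

# Center each series on its mean.
dx = temperature - temperature.mean()
dy = deaths_drowning - deaths_drowning.mean()

# Pearson r: sum of cross-products divided by the square root of the
# product of the two sums of squares.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_manual, 6))
```

The printed value should agree with the temperature / deaths_drowning entry of the matrix in Out[19].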

In [60]:
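# Absolute values, so that strong negative correlations are caught as well
abs_corr = df.corr().abs()

indices = abs_corr.index
corr_pairs = []

# i < j restricts the scan to the upper triangle of the matrix: this skips
# the diagonal (self-correlation of 1.0) and reports each pair only once
for i, idx_i in enumerate(indices):
    for j, c in enumerate(abs_corr[idx_i]):
        if c > .9 and i < j:
            corr_pairs.append((idx_i, indices[j]))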

In [61]:
corr_pairs


Out[61]:
[('ice_cream_sales', 'temperature'),
 ('ice_cream_sales', 'deaths_drowning'),
 ('temperature', 'deaths_drowning')]
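
The nested loop above can also be written with itertools.combinations, which yields each unordered pair of columns exactly once and so removes the need for the i < j check. This is only a sketch of an alternative, not the notebook's own method. The DataFrame is rebuilt from the values in Out[1] so the snippet runs without the CSV file, and strong_pairs is an illustrative name.

```python
from itertools import combinations

import pandas as pd

# Values copied from the Out[1] table, so the sketch runs without the CSV.
df = pd.DataFrame({
    "month": [12, 11, 1, 2, 3, 10, 5, 9, 4, 7, 6, 8],
    "ice_cream_sales": [4.75, 4.78, 4.82, 4.83, 4.84, 4.88,
                        4.91, 4.92, 4.93, 4.93, 4.94, 4.95],
    "temperature": [40, 50, 55, 58, 60, 55, 68, 70, 75, 80, 83, 88],
    "deaths_drowning": [2, 3, 4, 4, 5, 6, 9, 9, 8, 11, 12, 11],
    "humidity": [30, 20, 70, 70, 20, 30, 20, 10, 50, 10, 90, 50],
})

corr = df.corr()

# Keep every column pair whose absolute Pearson coefficient exceeds 0.9.
strong_pairs = [
    (a, b)
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > 0.9
]

print(strong_pairs)
```

The result should contain the same three pairs as Out[61].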

Identify strong (i.e., correlation coefficient > 0.9) and meaningful correlations among pairs of columns in this dataset. Append each pair of correlated columns, in the form [column_x, column_y], to the variable below.


In [7]:
correlations = []
correlations.append(['ice_cream_sales', 'temperature'])

# do not touch
correlations.sort()
print(correlations)


[['ice_cream_sales', 'temperature']]

Clue

Some of the correlations you found above may be spurious (see https://en.wikipedia.org/wiki/Spurious_relationship). Only include meaningful correlations in the list!

If this clue changes your answer, try again below. Otherwise, if you are confident in your answer above, leave the following untouched.
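
One rough way to probe whether a strong correlation might be driven by a third variable is a first-order partial correlation, which measures the association between two columns after the linear effect of a control column is removed. The sketch below is an illustration under assumptions, not part of the exercise: partial_corr is a hypothetical helper (pandas has no built-in with this name), and choosing temperature as the control is an assumption made for the example. The DataFrame is rebuilt from the values in Out[1].

```python
import numpy as np
import pandas as pd

# Values copied from the Out[1] table, so the sketch runs without the CSV.
df = pd.DataFrame({
    "month": [12, 11, 1, 2, 3, 10, 5, 9, 4, 7, 6, 8],
    "ice_cream_sales": [4.75, 4.78, 4.82, 4.83, 4.84, 4.88,
                        4.91, 4.92, 4.93, 4.93, 4.94, 4.95],
    "temperature": [40, 50, 55, 58, 60, 55, 68, 70, 75, 80, 83, 88],
    "deaths_drowning": [2, 3, 4, 4, 5, 6, 9, 9, 8, 11, 12, 11],
    "humidity": [30, 20, 70, 70, 20, 30, 20, 10, 50, 10, 90, 50],
})


def partial_corr(data, x, y, control):
    """Hypothetical helper: first-order partial correlation of x and y given control."""
    r = data[[x, y, control]].corr()
    r_xy = r.loc[x, y]
    r_xz = r.loc[x, control]
    r_yz = r.loc[y, control]
    # Standard first-order formula built from the three pairwise Pearson coefficients.
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


raw = df["ice_cream_sales"].corr(df["deaths_drowning"])
adjusted = partial_corr(df, "ice_cream_sales", "deaths_drowning", "temperature")

print(f"raw r = {raw:.3f}, partial r given temperature = {adjusted:.3f}")
```

A partial correlation well below the raw coefficient suggests that much of the association runs through the control variable. With only 12 rows this is a rough indication rather than proof, and deciding which pairs are meaningful still requires reasoning about what could plausibly cause what.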


In [ ]:
correlations_clue = []
# correlations_clue.append(['column_x', 'column_y'])

# do not touch
correlations_clue.sort()
print(correlations_clue)