Sales

Instructions / Notes:

Read these carefully

  • Read and execute each cell in order, without skipping forward
  • You may create new Jupyter notebook cells for testing, debugging, or exploring (this is encouraged); just make sure your final dataframes and answers are stored in the variables named below
  • Have fun!

In [1]:
# Run the following to import the necessary packages and load the datasets. Do not use any additional plotting libraries.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

d1 = "dataset/sales1.csv"
d2 = "dataset/sales2.csv"
d3 = "dataset/sales3.csv"
d4 = "dataset/sales4.csv"

df1 = pd.read_csv(d1)
df2 = pd.read_csv(d2)
df3 = pd.read_csv(d3)
df4 = pd.read_csv(d4)
df1.head(n=5)   # Show the first n rows of the dataset


Out[1]:
time avg_sales
0 10 8.04
1 8 6.95
2 13 7.58
3 9 8.81
4 11 8.33

Each of the 4 dataframes loaded above represents a company's average sales over time. Check the descriptive statistics below.


In [2]:
df1.describe()


Out[2]:
time avg_sales
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.031568
min 4.000000 4.260000
25% 6.500000 6.315000
50% 9.000000 7.580000
75% 11.500000 8.570000
max 14.000000 10.840000

In [3]:
df2.describe()


Out[3]:
time avg_sales
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.031657
min 4.000000 3.100000
25% 6.500000 6.695000
50% 9.000000 8.140000
75% 11.500000 8.950000
max 14.000000 9.260000

In [4]:
df3.describe()


Out[4]:
time avg_sales
count 11.000000 11.000000
mean 9.000000 7.500000
std 3.316625 2.030424
min 4.000000 5.390000
25% 6.500000 6.250000
50% 9.000000 7.110000
75% 11.500000 7.980000
max 14.000000 12.740000

In [5]:
df4.describe()


Out[5]:
time avg_sales
count 11.000000 11.000000
mean 9.000000 7.500909
std 3.316625 2.030579
min 8.000000 5.250000
25% 8.000000 6.170000
50% 8.000000 7.040000
75% 8.000000 8.190000
max 19.000000 12.500000
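
The four summaries above are easier to compare when they sit in one table. The following is a minimal sketch of one way to do that, not part of the original exercise: `summary_table` is a hypothetical helper, and the demo frame holds only the five df1 rows printed in Out[1] as stand-in data.

```python
import pandas as pd


def summary_table(frames, column="avg_sales"):
    """Return describe() of `column` for each frame, one column per dataset.

    `frames` maps a label to a DataFrame (hypothetical helper, for illustration).
    """
    return pd.concat(
        {name: df[column].describe() for name, df in frames.items()},
        axis=1,
    )


# Stand-in data: the five df1 rows shown in Out[1]. In the notebook, pass
# {"df1": df1, "df2": df2, "df3": df3, "df4": df4} instead.
sample = pd.DataFrame(
    {"time": [10, 8, 13, 9, 11], "avg_sales": [8.04, 6.95, 7.58, 8.81, 8.33]}
)

table = summary_table({"sample_a": sample, "sample_b": sample})
print(table.round(3))
```

Calling the helper with `column="time"` on the real dataframes would line up the `time` statistics in the same way, which makes it easier to spot that every quartile of `time` in df4 equals 8.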

Can you identify a dataset that is least likely to represent a company's sales over time? Set the following variable to 'Yes' or 'No'.


In [4]:
least_rep_dataset_exists = 'Yes'

If you answered 'Yes', which dataset is least likely to represent a company's sales over time? Set the following variable to 1, 2, 3, or 4.


In [5]:
least_rep_dataset = 4

Clue

Pandas has a handy function to generate scatterplots: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.plot.scatter.html

If this clue changes your answer, try again below. Otherwise, if you are confident in your answer above, leave the following untouched.


In [13]:
df1.head()


Out[13]:
time avg_sales
0 10 8.04
1 8 6.95
2 13 7.58
3 9 8.81
4 11 8.33

In [16]:
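# Show your revised analysis below
df_all = [df1, df2, df3, df4]

# One scatterplot per dataset, titled so the four plots can be told apart
for i, df in enumerate(df_all, start=1):
    df.plot.scatter(x='time', y='avg_sales', title='Dataset {}'.format(i))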
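
Separate figures can be hard to compare because each one picks its own axis limits. As a sketch of an alternative, assuming only pandas and matplotlib (no extra plotting libraries), the scatterplots can be drawn on a single 2x2 grid with shared axes. `scatter_grid` is a hypothetical helper, and the demo again uses the five df1 rows from Out[1] as stand-in data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_grid(frames, x="time", y="avg_sales"):
    """Draw one scatterplot per frame on a 2x2 grid with shared axes.

    Hypothetical helper; expects up to four DataFrames with columns x and y.
    """
    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
    for i, (ax, df) in enumerate(zip(axes.flat, frames), start=1):
        # ax= tells pandas to draw on the existing subplot instead of a new figure
        df.plot.scatter(x=x, y=y, ax=ax, title="Dataset {}".format(i))
    return fig, axes


# Stand-in data: the five df1 rows shown in Out[1]. In the notebook, call
# scatter_grid([df1, df2, df3, df4]) instead.
sample = pd.DataFrame(
    {"time": [10, 8, 13, 9, 11], "avg_sales": [8.04, 6.95, 7.58, 8.81, 8.33]}
)

fig, axes = scatter_grid([sample] * 4)
```

Sharing both axes puts all four datasets on the same scale, so differences in shape are not hidden by automatic rescaling.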



In [18]:
least_rep_dataset_exists_clue = 'Yes'

In [19]:
least_rep_dataset_clue = 4
