Sales

Instructions / Notes:

Read these carefully

Read and execute each cell in order, without skipping forward
You may create new Jupyter notebook cells for testing, debugging, exploring, and so on (this is encouraged, in fact!). Just make sure that your final answer dataframes and answers use the variables outlined below
Have fun!



In [1]:

    
# Run the following to import necessary packages and import dataset. Do not use any additional plotting libraries.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

d1 = "dataset/sales1.csv"
d2 = "dataset/sales2.csv"
d3 = "dataset/sales3.csv"
d4 = "dataset/sales4.csv"

df1 = pd.read_csv(d1)
df2 = pd.read_csv(d2)
df3 = pd.read_csv(d3)
df4 = pd.read_csv(d4)
df1.head(n=5)   # Show the first n rows of the dataset

Each of the 4 dataframes loaded above represents a company's average sales over time. Check the descriptive statistics below.



In [2]:

    
df1.describe()



In [3]:

    
df2.describe()



In [4]:

    
df3.describe()



In [5]:

    
df4.describe()
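
The four describe() outputs are easier to compare when they sit in one table. The following is a minimal sketch of one way to do that, not part of the exercise itself. The helper name compare_summaries and the two tiny demo frames are hypothetical placeholders; in the notebook you would pass df1 to df4 instead.

```python
import pandas as pd


def compare_summaries(frames, column="avg_sales"):
    """Return one table holding describe() of `column` for every frame.

    `frames` maps a label to a DataFrame; each label becomes a column.
    """
    return pd.concat(
        {name: df[column].describe() for name, df in frames.items()},
        axis=1,
    )


# Placeholder frames with made-up values (NOT the real sales files).
demo = {
    "df_a": pd.DataFrame({"time": [1, 2, 3], "avg_sales": [2.0, 4.0, 6.0]}),
    "df_b": pd.DataFrame({"time": [1, 2, 3], "avg_sales": [4.0, 4.0, 4.0]}),
}

summary = compare_summaries(demo)
print(summary.round(3))
```

In the notebook, a call such as compare_summaries({'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}) would line up the avg_sales statistics, and passing column='time' would do the same for the time column.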

Can you identify a dataset that is least likely to represent a company's sales over time? Set the following variable to 'Yes' or 'No'.



In [4]:

    
least_rep_dataset_exists = 'Yes'

If you answered 'Yes' which dataset is least likely to represent a company's sales over time? Set the following variable to 1, 2, 3, or 4.



In [5]:

    
least_rep_dataset = 4

Clue

Pandas has a handy function to generate scatterplots: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.plot.scatter.html

If this clue changes your answer, try again below. Otherwise, if you are confident in your answer above, leave the following untouched.



In [13]:

    
df1.head()



In [16]:

    
# Show your revised analysis below
df_all = [df1, df2, df3, df4]

for df in df_all:
    df.plot.scatter('time', 'avg_sales')
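
The loop above draws each scatterplot on its own figure with its own axis limits. As a possible alternative, the plots can share one grid and one set of axes so that shapes are directly comparable. This is only a sketch: the helper name scatter_grid and the demo frames are hypothetical, and it uses nothing beyond pandas and matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_grid(frames, x="time", y="avg_sales"):
    """Draw one scatterplot per frame on a two-column grid with shared axes."""
    n = len(frames)
    rows = (n + 1) // 2  # enough rows to hold n plots in two columns
    fig, axes = plt.subplots(
        rows, 2, sharex=True, sharey=True,
        figsize=(8, 3 * rows), squeeze=False,
    )
    flat = axes.ravel()
    for ax, (name, df) in zip(flat, frames.items()):
        df.plot.scatter(x=x, y=y, ax=ax, title=name)
    for ax in flat[n:]:
        ax.set_visible(False)  # hide the unused panel when n is odd
    fig.tight_layout()
    return fig, flat


# Placeholder frames with made-up values (NOT the real sales files).
demo = {
    "df_a": pd.DataFrame({"time": [1, 2, 3], "avg_sales": [2.0, 4.0, 6.0]}),
    "df_b": pd.DataFrame({"time": [1, 2, 3], "avg_sales": [4.0, 4.0, 4.0]}),
}

fig, panels = scatter_grid(demo)
```

In the notebook, scatter_grid({'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4}) would place all four datasets on one figure.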



In [18]:

    
least_rep_dataset_exists_clue = 'Yes'



In [19]:

    
least_rep_dataset_clue = 4



In [ ]:

	time	avg_sales
0	10	8.04
1	8	6.95
2	13	7.58
3	9	8.81
4	11	8.33

	time	avg_sales
count	11.000000	11.000000
mean	9.000000	7.500909
std	3.316625	2.031568
min	4.000000	4.260000
25%	6.500000	6.315000
50%	9.000000	7.580000
75%	11.500000	8.570000
max	14.000000	10.840000

	time	avg_sales
count	11.000000	11.000000
mean	9.000000	7.500909
std	3.316625	2.031657
min	4.000000	3.100000
25%	6.500000	6.695000
50%	9.000000	8.140000
75%	11.500000	8.950000
max	14.000000	9.260000

	time	avg_sales
count	11.000000	11.000000
mean	9.000000	7.500000
std	3.316625	2.030424
min	4.000000	5.390000
25%	6.500000	6.250000
50%	9.000000	7.110000
75%	11.500000	7.980000
max	14.000000	12.740000

	time	avg_sales
count	11.000000	11.000000
mean	9.000000	7.500909
std	3.316625	2.030579
min	8.000000	5.250000
25%	8.000000	6.170000
50%	8.000000	7.040000
75%	8.000000	8.190000
max	19.000000	12.500000
