Exploring the data

When we look at spreadsheets or large amounts of raw data, it is hard to understand what is really happening. But when we interpret the data visually, everything starts to make sense.

Question 1. Find the total number of hard disks for a given model

Question 2. Find the total failures for each hard disk model

Question 3. How do you compute the failure rate for a model

Question 4. Given a model and capacity bytes, what does the failure count look like

Question 5. Let us count how many days each hard disk ran

Question 6. Find the average running time for failed hard disks and for hard disks that have not failed

Question 7. How about using the power-on hours column (SMART_9) and correlating it with failure

Question 8. Given the data, identify the model and capacity of hard disk to buy, based on how long it runs

Step-by-step approach

First let us look at our data


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize']=15,10

In [ ]:
df = pd.read_csv('data/data.csv')

Let us take a sneak peek at the data


In [ ]:
df.head()

What is the size of the dataset?


In [ ]:
df.shape

Now we see that there are different models of hard disks; let us list them.


In [ ]:
df_model = pd.DataFrame(df.model.unique(),columns=['model'])

In [ ]:
df_model.head()

In [ ]:
df_model.count()[0]

Let us see how many models there are in total


In [ ]:
print "Total number of distinct models : "+ str(df_model.count()[0])

In [ ]:


In [ ]:
# Exercise 1: Find the distinct number of serial numbers

In [ ]:


In [ ]:
# Exercise 2: Find the distinct number of capacity bytes

In [ ]:

Question 1. Find the total number of hard disks for a given model

Now let us see how many hard disks there are for each model and visualize it.

We see that the serial number identifies an individual hard disk and is related to a model, i.e., multiple serial numbers belong to one model.
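
We can sanity-check this assumption: if every serial number belongs to exactly one model, the maximum number of distinct models per serial number should be 1. A minimal check, assuming the column is named serial as above:


In [ ]:
# Sanity check: the maximum number of distinct models per serial number
# should be 1 if each serial number belongs to exactly one model
df.groupby('serial').model.nunique().max()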


In [ ]:
df_model_serial = pd.DataFrame(df.groupby(['model']).serial.nunique())

In [ ]:
df_model_serial.head()

In [ ]:
df_model_serial = df_model_serial.reset_index()

In [ ]:
df_model_serial.head()

In [ ]:
df_model_serial.columns = ['model','total_HD']

In [ ]:
df_model_serial.head(39)

In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")

Sort and plot


In [ ]:
df_model_serial.sort_values(by='total_HD',inplace=True)

In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")

In [ ]:
# Exercise 3: Find the count of different capacity bytes for a model and plot with and without sorting

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Question 2. Find the total failures for each hard disk model


In [ ]:
df_fail = pd.DataFrame(df.groupby('model').failure.sum())

In [ ]:
df_fail.head()

In [ ]:
df_fail = df_fail.reset_index()

In [ ]:
df_fail.head()

In [ ]:
df_fail.plot(kind="barh",x="model",y="failure",figsize=(18,10))

In [ ]:
# Exercise 4: Sort the above data frame and plot it

In [ ]:


In [ ]:


In [ ]:

Question 3. How do you compute the failure rate for a model

Now let us express failures / total number of hard disks as a ratio. This gives us an understanding of each model and its failure behavior.

To get that data, instead of computing it again, we can join the two data frames computed previously and derive the ratio.


In [ ]:
merged_df = df_model_serial.merge(df_fail,how='inner',on='model')

In [ ]:
merged_df.head()

Out of the total hard disks for each model, let us see how many failed and how many did not.


In [ ]:
merged_df['success'] = merged_df.total_HD - merged_df.failure

In [ ]:
merged_df.head()

In [ ]:
merged_df.plot(kind="bar",x="model",y=["failure","success"],subplots=True)

Now let us compute the ratio of failures to the total number of hard disks.


In [ ]:
merged_df['ratio_failure'] = merged_df.failure / merged_df.total_HD

In [ ]:
merged_df.head(25)

In [ ]:
merged_df.sort_values(by="ratio_failure",ascending=False,inplace=True)

In [ ]:
merged_df.head()

In [ ]:
merged_df.plot(kind="bar",x="model",y="ratio_failure")

The higher the ratio, the more failure-prone the model.
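
To zoom in on the worst offenders, we can pull out, say, the five models with the highest failure ratio; a quick sketch, where the cutoff of five is arbitrary:


In [ ]:
# Five models with the highest failure ratio (the cutoff of 5 is arbitrary)
merged_df.nlargest(5,'ratio_failure')[['model','total_HD','failure','ratio_failure']]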


In [ ]:
# Exercise 5: Find ratio of success and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:
# Exercise 6: Plot multiple bar charts comparing ratio of success and failure

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Now that we know which models fail the most, let us introduce a new feature into our analysis: capacity.

We proceed feature by feature because, as we add each feature that adds value to the outcome, we can see how our understanding of the data changes.

Let us look at the capacity


In [ ]:
df_capacity = pd.DataFrame(df.capacity.unique(),columns=['capacity'])

In [ ]:
df_capacity.head()

In [ ]:
df_capacity.shape

In [ ]:
# Exercise 7: For a given capacity bytes, find the total number of failures and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:

Question 4. Given a model and capacity bytes, what does the failure count look like


In [ ]:
df_fail_mod_cap = pd.DataFrame(df.groupby(['model','capacity']).failure.sum())

In [ ]:
df_fail_mod_cap.head()

In [ ]:
df_fail_mod_cap = df_fail_mod_cap.reset_index()

In [ ]:
df_fail_mod_cap.head(25)

In [ ]:
df_fail_mod_cap.plot(x="capacity",y="failure",kind="bar",figsize=(20,5))

Looking at this chart, can you tell what is not being represented correctly?

There are repeated entries for the same capacity, which does not give us insight into the relation between capacity and model.


In [ ]:
df_fail_mod_cap.head()

In [ ]:
df_fail_mod_cap_pivot = df_fail_mod_cap.pivot(index="model",columns="capacity",values="failure")

In [ ]:
df_fail_mod_cap_pivot.head()

We see that for some models and their respective capacities we do not have a failure count; let us fill those in with 0.


In [ ]:
df_fail_mod_cap_pivot.fillna(0,inplace=True)
df_fail_mod_cap_pivot.head()

In [ ]:
sns.heatmap(df_fail_mod_cap_pivot)

This heat map gives us a better understanding of model, capacity vs failure
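
If the matrix is small enough, annotating each cell makes the exact counts readable at a glance; a possible variant (the colormap choice is arbitrary):


In [ ]:
# Annotated variant: print the failure count inside each cell
sns.heatmap(df_fail_mod_cap_pivot, annot=True, fmt='.0f', cmap='Reds')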


In [ ]:
#Exercise : Find count of success for a model with different capacities and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:

The above charts show us which models failed the most, which models had the most hard disks running, the failure ratio per model, and, for a given model and capacity, what the failure count looks like.

Hard disk data is time series data, so let us start using time
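
Before doing time-based work it helps to parse the date column into proper datetimes; a minimal sketch, assuming the date column is in a format pandas can infer:


In [ ]:
# Parse the date column so pandas treats it as a datetime rather than a string
df['date'] = pd.to_datetime(df['date'])
df.date.dtype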

Question 5. Let us count how many days each hard disk ran


In [ ]:
df_days = pd.DataFrame(df.groupby(['capacity','serial']).date.count())

In [ ]:
df_days = df_days.reset_index()
df_days.head()

In [ ]:
df_days.columns = ['capacity','serial','total_days']

In [ ]:
df_days.head()

In [ ]:
df_days.capacity.value_counts()

In [ ]:
df_days.shape

In [ ]:
df_days_pivot = df_days.pivot(index='capacity',columns='serial',values='total_days')

In [ ]:
df_days_pivot.head()

In [ ]:
df_days_pivot.fillna(0,inplace=True)

In [ ]:
df_days_pivot.head()

In [ ]:
# Exercise 9: Visualize the above data frame

Question 6. Find the average running time for failed hard disks and for hard disks that have not failed


In [ ]:
df_fail_days = df.loc[df['failure'] == 1, ['capacity','serial','failure']]

In [ ]:
df_fail_days.head()

Now let us merge this with the previous data frame, which had the serial number and count of days.


In [ ]:
df_fail_count = df_days.merge(df_fail_days,how="left",on=['capacity','serial'])

In [ ]:
df_fail_count.head()

In [ ]:
df_fail_count.fillna(0,inplace=True)

In [ ]:
df_fail_count.head()

In [ ]:
df_fail_count.dtypes

In [ ]:
g = sns.FacetGrid(df_fail_count, col="failure",hue='failure',height=5,aspect=1.5)
g.map_dataframe(plt.scatter,x='capacity',y='total_days')

Now what can we do with this data? Is it useful? What can we generate from the above data that gives us a little more insight?

We can compute, for each capacity, the average running time of failed hard disks and of hard disks that did not fail.


In [ ]:
df_fail_count_avg = pd.DataFrame(df_fail_count.groupby(['capacity','failure']).total_days.mean())

In [ ]:
df_fail_count_avg.head()

In [ ]:
df_fail_count_avg = df_fail_count_avg.reset_index()

In [ ]:
df_fail_count_avg.head()

In [ ]:
df_fail_count_avg_pivot = df_fail_count_avg.pivot(index='capacity',columns='failure',values='total_days')

In [ ]:
df_fail_count_avg_pivot.head()

In [ ]:
df_fail_count_avg_pivot.plot(kind="bar")
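
As a side note, the groupby / reset_index / pivot chain above can be collapsed into a single pd.pivot_table call (the same function we use later), which should produce the same chart:


In [ ]:
# One-step equivalent of the groupby + pivot chain above
pd.pivot_table(df_fail_count, index='capacity', columns='failure',
               values='total_days', aggfunc='mean').plot(kind='bar')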

Question 7. How about using the power-on hours column (SMART_9) and correlating it with failure


In [ ]:
df_hours = df[['serial','capacity','failure','smart_9']]

In [ ]:
df_hours.head()

In [ ]:
df_hours.shape

Now, for each hard disk and capacity, we want to know how long the hard disk ran, i.e., its maximum power-on hours.


In [ ]:
df_hours_max = pd.DataFrame(df_hours.groupby(['serial','capacity']).smart_9.max())

In [ ]:
df_hours_max.head()

In [ ]:
df_hours_max.shape

In [ ]:
df_hours_max = df_hours_max.reset_index()

In [ ]:
df_hours_max_merge = df_hours_max.merge(df_hours,on=['serial','capacity','smart_9'],how='inner')

In [ ]:
df_hours_max_merge.head()

In [ ]:
df_hours_max_merge_pivot = pd.pivot_table(df_hours_max_merge, index='capacity', columns='failure',
                                          values='smart_9', aggfunc='mean')

In [ ]:
df_hours_max_merge_pivot.head()

In [ ]:
df_hours_max_merge_pivot.plot(kind='bar')

Question 8. Given the data, identify the model and capacity of hard disk to buy, based on how long it runs


In [ ]:
# .copy() avoids a SettingWithCopyWarning when we modify capacity below
df_model_capacity_hours = df[['model','capacity','failure','smart_9']].copy()

In [ ]:
df_model_capacity_hours.head()

Let us convert bytes to gigabytes and truncate to a whole number


In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity / 1024 ** 3

In [ ]:
df_model_capacity_hours.head()

In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity.astype(np.int64)

In [ ]:
df_model_capacity_hours.head()

In [ ]:
df_model_capacity_hours_pivot = pd.pivot_table(data=df_model_capacity_hours,index='model',columns=['failure','capacity'],
                                               values='smart_9',aggfunc='mean')

In [ ]:
df_model_capacity_hours_pivot.head()

In [ ]:
df_model_capacity_hours_pivot.fillna(0,inplace=True)

In [ ]:
df_model_capacity_hours_pivot.head()

In [ ]:
df_model_capacity_hours_pivot.plot(kind="barh")

The above visualization is confusing, as there is one bar for every combination of failure status and capacity.


In [ ]:
sns.heatmap(df_model_capacity_hours_pivot)
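
If we only care about the failed disks, we can slice out the failure columns of the pivot and plot those alone, which removes the clutter; a sketch, assuming failure is stored as the integer 1 in the column index:


In [ ]:
# Keep only the failure == 1 columns: average power-on hours per model/capacity
df_model_capacity_hours_pivot.xs(1, axis=1, level='failure').plot(kind='barh')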

In [ ]: