Exploring the data

When we look at spreadsheets or large amounts of raw data, it is hard to understand what is really happening. But when we interpret the data visually, everything starts to make sense.

Question 1. Find the total number of hard disks for a given model

Question 2. Find the total failures for each hard disk model

Question 3. How do you compute the failure rate for a model

Question 4. Given a model and capacity bytes, what does the failure count look like

Question 5. Let us count how many days each hard disk ran

Question 6. Find the average running time for failed hard disks and for hard disks that have not failed

Question 7. How about using the power-on hours column (SMART_9) and correlating it with failure

Question 8. Given the data, identify the model and capacity of hard disk to buy, based on how long it runs

Step-by-step approach

First let us look at our data


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize']=15,10

In [ ]:
df = pd.read_csv('data/data.csv')

Let us take a sneak peek at the data


In [ ]:
df.head()

What is the size of the dataset?


In [ ]:
df.shape

Now we see that there are different models of hard disks; let us list them.


In [ ]:
df_model = pd.DataFrame(df.model.unique(),columns=['model'])

In [ ]:
df_model.head()

In [ ]:
df_model.count()[0]

Let us see how many models there are in total


In [ ]:
print "Total number of distinct models : "+ str(df_model.count()[0])

In [ ]:


In [ ]:
# Exercise 1: Find the distinct number of serial numbers

In [ ]:


In [ ]:
# Exercise 2: Find the distinct number of capacity bytes

In [ ]:

Question 1. Find the total number of hard disks for a given model

Now let us see how many hard disks there are for each model and visualize it.

We see that the serial number identifies an individual hard disk and is related to a model, i.e., multiple serial numbers belong to one model.
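
We can sanity-check this assumption: if every serial number belongs to exactly one model, the maximum number of distinct models per serial number should be 1. A minimal check, assuming the column is named serial as above:


In [ ]:
# Sanity check: the maximum number of distinct models per serial number
# should be 1 if each serial number belongs to exactly one model
df.groupby('serial').model.nunique().max()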


In [ ]:
df_model_serial = pd.DataFrame(df.groupby(['model']).serial.nunique())

In [ ]:
df_model_serial.head()

In [ ]:
df_model_serial = df_model_serial.reset_index()

In [ ]:
df_model_serial.head()

In [ ]:
df_model_serial.columns = ['model','total_HD']

In [ ]:
df_model_serial.head(39)

In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")

Sort and plot


In [ ]:
df_model_serial.sort_values(by='total_HD',inplace=True)

In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")

In [ ]:
# Exercise 3: Find the count of different capacity bytes for a model and plot with and without sorting

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Question 2. Find the total failures for each hard disk model


In [ ]:
df_fail = pd.DataFrame(df.groupby('model').failure.sum())

In [ ]:
df_fail.head()

In [ ]:
df_fail = df_fail.reset_index()

In [ ]:
df_fail.head()

In [ ]:
df_fail.plot(kind="barh",x="model",y="failure",figsize=(18,10))

In [ ]:
# Exercise 4: Sort the above data frame and plot it

In [ ]:


In [ ]:


In [ ]:

Question 3. How do you compute the failure rate for a model

Now let us express failures / total number of hard disks as a ratio. This gives us an understanding of each model and its failure behavior.

To get that data, instead of computing it again, we can join the two data frames computed previously and derive the ratio.


In [ ]:
merged_df = df_model_serial.merge(df_fail,how='inner',on='model')

In [ ]:
merged_df.head()

Out of the total hard disks for each model, let us see how many failed and how many did not.


In [ ]:
merged_df['success'] = merged_df.total_HD - merged_df.failure

In [ ]:
merged_df.head()

In [ ]:
merged_df.plot(kind="bar",x="model",y=["failure","success"],subplots=True)

Now let us compute the ratio of failures to the total number of hard disks.


In [ ]:
merged_df['ratio_failure'] = merged_df.failure / merged_df.total_HD

In [ ]:
merged_df.head(25)

In [ ]:
merged_df.sort_values(by="ratio_failure",ascending=False,inplace=True)

In [ ]:
merged_df.head()

In [ ]:
merged_df.plot(kind="bar",x="model",y="ratio_failure")

The higher the ratio, the more failure-prone the model.
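
To zoom in on the worst offenders, we can pull out, say, the five models with the highest failure ratio; a quick sketch, where the cutoff of five is arbitrary:


In [ ]:
# Five models with the highest failure ratio (the cutoff of 5 is arbitrary)
merged_df.nlargest(5,'ratio_failure')[['model','total_HD','failure','ratio_failure']]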


In [ ]:
# Exercise 5: Find ratio of success and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:
# Exercise 6: Plot multiple bar charts comparing ratio of success and failure

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Now that we know which models fail the most, let us introduce a new feature into our analysis: capacity.

We proceed feature by feature because, as we add each feature that adds value to the outcome, we can see how our understanding of the data changes.

Let us look at the capacity


In [ ]:
df_capacity = pd.DataFrame(df.capacity.unique(),columns=['capacity'])

In [ ]:
df_capacity.head()

In [ ]:
df_capacity.shape

In [ ]:
# Exercise 7: For a given capacity bytes, find the total number of failures and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:

Question 4. Given a model and capacity bytes, what does the failure count look like


In [ ]:
df_fail_mod_cap = pd.DataFrame(df.groupby(['model','capacity']).failure.sum())

In [ ]:
df_fail_mod_cap.head()

In [ ]:
df_fail_mod_cap = df_fail_mod_cap.reset_index()

In [ ]:
df_fail_mod_cap.head(25)

In [ ]:
df_fail_mod_cap.plot(x="capacity",y="failure",kind="bar",figsize=(20,5))

Looking at this chart, can you tell what is not being represented correctly?

There are repeated entries for the same capacity, which does not give us insight into the relation between capacity and model.


In [ ]:
df_fail_mod_cap.head()

In [ ]:
df_fail_mod_cap_pivot = df_fail_mod_cap.pivot(index="model",columns="capacity",values="failure")

In [ ]:
df_fail_mod_cap_pivot.head()

We see that for some models and their respective capacities we do not have a failure count; let us fill those in with 0.


In [ ]:
df_fail_mod_cap_pivot.fillna(0,inplace=True)
df_fail_mod_cap_pivot.head()

In [ ]:
sns.heatmap(df_fail_mod_cap_pivot)

This heat map gives us a better understanding of model, capacity vs failure
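
If the matrix is small enough, annotating each cell makes the exact counts readable at a glance; a possible variant (the colormap choice is arbitrary):


In [ ]:
# Annotated variant: print the failure count inside each cell
sns.heatmap(df_fail_mod_cap_pivot, annot=True, fmt='.0f', cmap='Reds')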


In [ ]:
#Exercise : Find count of success for a model with different capacities and plot it

In [ ]:


In [ ]:


In [ ]:


In [ ]:

The above charts show us which models failed the most, which models had the most hard disks running, the failure ratio per model, and, for a given model and capacity, what the failure count looks like.

Hard disk data is time series data, so let us start using time
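
Before doing time-based work it helps to parse the date column into proper datetimes; a minimal sketch, assuming the date column is in a format pandas can infer:


In [ ]:
# Parse the date column so pandas treats it as a datetime rather than a string
df['date'] = pd.to_datetime(df['date'])
df.date.dtype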

Question 5. Let us count how many days each hard disk ran


In [ ]:
df_days = pd.DataFrame(df.groupby(['capacity','serial']).date.count())

In [ ]:
df_days = df_days.reset_index()
df_days.head()

In [ ]:
df_days.columns = ['capacity','serial','total_days']

In [ ]:
df_days.head()

In [ ]:
df_days.capacity.value_counts()

In [ ]:
df_days.shape

In [ ]:
df_days_pivot = df_days.pivot(index='capacity',columns='serial',values='total_days')

In [ ]:
df_days_pivot.head()

In [ ]:
df_days_pivot.fillna(0,inplace=True)

In [ ]:
df_days_pivot.head()

In [ ]:
# Exercise 9: Visualize the above data frame

Question 6. Find the average running time for failed hard disks and for hard disks that have not failed


In [ ]:
df_fail_days = df.loc[df['failure'] == 1, ['capacity','serial','failure']]

In [ ]:
df_fail_days.head()

Now let us merge this with the previous data frame, which had the serial number and count of days.


In [ ]:
df_fail_count = df_days.merge(df_fail_days,how="left",on=['capacity','serial'])

In [ ]:
df_fail_count.head()

In [ ]:
df_fail_count.fillna(0,inplace=True)

In [ ]:
df_fail_count.head()

In [ ]:
df_fail_count.dtypes

In [ ]:
g = sns.FacetGrid(df_fail_count, col="failure",hue='failure',height=5,aspect=1.5)
g.map_dataframe(plt.scatter,x='capacity',y='total_days')

Now what can we do with this data? Is it useful? What can we generate from the above data that gives us a little more insight?

We can compute, for each capacity, the average running time of failed hard disks and of hard disks that did not fail.


In [ ]:
df_fail_count_avg = pd.DataFrame(df_fail_count.groupby(['capacity','failure']).total_days.mean())

In [ ]:
df_fail_count_avg.head()

In [ ]:
df_fail_count_avg = df_fail_count_avg.reset_index()

In [ ]:
df_fail_count_avg.head()

In [ ]:
df_fail_count_avg_pivot = df_fail_count_avg.pivot(index='capacity',columns='failure',values='total_days')

In [ ]:
df_fail_count_avg_pivot.head()

In [ ]:
df_fail_count_avg_pivot.plot(kind="bar")
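
As a side note, the groupby / reset_index / pivot chain above can be collapsed into a single pd.pivot_table call (the same function we use later), which should produce the same chart:


In [ ]:
# One-step equivalent of the groupby + pivot chain above
pd.pivot_table(df_fail_count, index='capacity', columns='failure',
               values='total_days', aggfunc='mean').plot(kind='bar')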

Question 7. How about using the power-on hours column (SMART_9) and correlating it with failure


In [ ]:
df_hours = df[['serial','capacity','failure','smart_9']]

In [ ]:
df_hours.head()

In [ ]:
df_hours.shape

Now, for each hard disk and capacity, we want to know how long the hard disk ran, i.e., its maximum power-on hours.


In [ ]:
df_hours_max = pd.DataFrame(df_hours.groupby(['serial','capacity']).smart_9.max())

In [ ]:
df_hours_max.head()

In [ ]:
df_hours_max.shape

In [ ]:
df_hours_max = df_hours_max.reset_index()

In [ ]:
df_hours_max_merge = df_hours_max.merge(df_hours,on=['serial','capacity','smart_9'],how='inner')

In [ ]:
df_hours_max_merge.head()

In [ ]:
df_hours_max_merge_pivot = pd.pivot_table(df_hours_max_merge, index='capacity', columns='failure',
                                          values='smart_9', aggfunc='mean')

In [ ]:
df_hours_max_merge_pivot.head()

In [ ]:
df_hours_max_merge_pivot.plot(kind='bar')

Question 8. Given the data, identify the model and capacity of hard disk to buy, based on how long it runs


In [ ]:
# .copy() avoids a SettingWithCopyWarning when we modify capacity below
df_model_capacity_hours = df[['model','capacity','failure','smart_9']].copy()

In [ ]:
df_model_capacity_hours.head()

Let us convert bytes to gigabytes and truncate to a whole number


In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity / 1024 ** 3

In [ ]:
df_model_capacity_hours.head()

In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity.astype(np.int64)

In [ ]:
df_model_capacity_hours.head()

In [ ]:
df_model_capacity_hours_pivot = pd.pivot_table(data=df_model_capacity_hours,index='model',columns=['failure','capacity'],
                                               values='smart_9',aggfunc='mean')

In [ ]:
df_model_capacity_hours_pivot.head()

In [ ]:
df_model_capacity_hours_pivot.fillna(0,inplace=True)

In [ ]:
df_model_capacity_hours_pivot.head()

In [ ]:
df_model_capacity_hours_pivot.plot(kind="barh")

The above visualization is confusing, as there is one bar for every combination of failure status and capacity.


In [ ]:
sns.heatmap(df_model_capacity_hours_pivot)
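
If we only care about the failed disks, we can slice out the failure columns of the pivot and plot those alone, which removes the clutter; a sketch, assuming failure is stored as the integer 1 in the column index:


In [ ]:
# Keep only the failure == 1 columns: average power-on hours per model/capacity
df_model_capacity_hours_pivot.xs(1, axis=1, level='failure').plot(kind='barh')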

In [ ]: