When we look at spreadsheets or large amounts of raw data, it is hard to understand what is really happening. But when we visualize the data, everything starts to make sense.
Question 1. Find the total number of hard disks for a given model
Question 2. Find the total failures for each hard disk model
Question 3. How do you compute the failure rate for a model?
Question 4. Given a model and its capacity in bytes, what does the failure count look like?
Question 5. Count how many days each hard disk ran
Question 6. Find the average running time for failed hard disks and for hard disks that have not failed
Question 7. Use the hours (SMART_9) column and correlate it with failure
Question 8. Given the data, identify the model and capacity of hard disk to buy based on how long it runs
First let us look at our data
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize']=15,10
In [ ]:
df = pd.read_csv('data/data.csv')
Let us take a sneak peek at the data
In [ ]:
df.head()
What is the size of the dataset?
In [ ]:
df.shape
Now that we see there are different models of hard disks, let us list them.
In [ ]:
df_model = pd.DataFrame(df.model.unique(),columns=['model'])
In [ ]:
df_model.head()
In [ ]:
df_model.count()[0]
Let us see how many models there are in total.
In [ ]:
print "Total number of distinct models : "+ str(df_model.count()[0])
In [ ]:
In [ ]:
# Exercise 1: Find the distinct number of serial numbers
In [ ]:
In [ ]:
# Exercise 2: Find the distinct number of capacity bytes
In [ ]:
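If you want to check your work, here is one possible approach for both exercises (a minimal sketch; nunique() counts the distinct values in a column):
In [ ]:
# Possible solution for Exercises 1 and 2
print("Total number of distinct serial numbers : " + str(df.serial.nunique()))
print("Total number of distinct capacity bytes : " + str(df.capacity.nunique()))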
Now let us see how many hard disks there are for each model and visualize it.
We see that a serial number represents an individual hard disk and serial numbers are related to a model, i.e. multiple serial numbers belong to one model.
In [ ]:
df_model_serial = pd.DataFrame(df.groupby(['model']).serial.nunique())
In [ ]:
df_model_serial.head()
In [ ]:
df_model_serial = df_model_serial.reset_index()
In [ ]:
df_model_serial.head()
In [ ]:
df_model_serial.columns = ['model','total_HD']
In [ ]:
df_model_serial.head(39)
In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")
Sort and plot
In [ ]:
df_model_serial.sort_values(by='total_HD',inplace=True)
In [ ]:
df_model_serial.plot(kind="barh",x="model",y="total_HD")
In [ ]:
#Exercise 3: Find the count of different capacity bytes for a model and plot with and without sorting
In [ ]:
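One possible solution sketch for Exercise 3, mirroring the model/serial pattern above:
In [ ]:
# Count the distinct capacity values per model, then plot unsorted and sorted
df_model_capacity = pd.DataFrame(df.groupby(['model']).capacity.nunique()).reset_index()
df_model_capacity.columns = ['model','capacity_count']
df_model_capacity.plot(kind="barh",x="model",y="capacity_count")
df_model_capacity.sort_values(by='capacity_count').plot(kind="barh",x="model",y="capacity_count")
Next, let us find the total failures for each hard disk model (Question 2).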
In [ ]:
df_fail = pd.DataFrame(df.groupby('model').failure.sum())
In [ ]:
df_fail.head()
In [ ]:
df_fail = df_fail.reset_index()
In [ ]:
df_fail.head()
In [ ]:
df_fail.plot(kind="barh",x="model",y="failure",figsize=(18,10))
In [ ]:
# Exercise 4 : sort the above data frame and plot it
In [ ]:
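A possible solution for Exercise 4, following the same sort-then-plot pattern used for the hard disk counts:
In [ ]:
df_fail_sorted = df_fail.sort_values(by='failure')
df_fail_sorted.plot(kind="barh",x="model",y="failure",figsize=(18,10))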
Now let us express failures / total number of hard disks as a ratio. This will give us an understanding of the models and their failure behavior.
To get that data, instead of computing it again, we can join the two data frames computed previously and then compute the ratio.
In [ ]:
merged_df = df_model_serial.merge(df_fail,how='inner',on='model')
In [ ]:
merged_df.head()
Of the total hard disks for a model, let us see how many failed and how many did not.
In [ ]:
merged_df['success'] = merged_df.total_HD - merged_df.failure
In [ ]:
merged_df.head()
In [ ]:
merged_df.plot(kind="bar",x="model",y=["failure","success"],subplots=True)
Now let us compute the ratio: number of failures / total number of hard disks.
In [ ]:
merged_df['ratio_failure'] = merged_df.failure / merged_df.total_HD
In [ ]:
merged_df.head(25)
In [ ]:
merged_df.sort_values(by="ratio_failure",ascending=False,inplace=True)
In [ ]:
merged_df.head()
In [ ]:
merged_df.plot(kind="bar",x="model",y="ratio_failure")
The higher the ratio, the more failure-prone the model is.
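To pick out the worst offenders at a glance, we can also list the top few models by failure ratio (a small sketch using nlargest on the merged_df computed above):
In [ ]:
# The five most failure-prone models
merged_df.nlargest(5,'ratio_failure')[['model','total_HD','failure','ratio_failure']]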
In [ ]:
#Exercise: Find ratio of success and plot it
In [ ]:
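One possible sketch for the success-ratio exercise, reusing merged_df:
In [ ]:
merged_df['ratio_success'] = merged_df.success / merged_df.total_HD
merged_df.plot(kind="bar",x="model",y="ratio_success")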
In [ ]:
#Exercise : Plot multiple bar charts comparing ratio of success and failure
In [ ]:
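A sketch for the comparison exercise, assuming ratio_success was computed in the previous exercise; subplots=True draws one bar chart per column, as we did for the failure and success counts:
In [ ]:
merged_df.plot(kind="bar",x="model",y=["ratio_failure","ratio_success"],subplots=True)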
Now that we know which models fail the most, let us introduce a new feature into our analysis: capacity.
We are going feature by feature because, as we add features that add value to the outcome, our understanding of the data starts to change.
Let us look at the capacity.
In [ ]:
df_capacity = pd.DataFrame(df.capacity.unique(),columns=['capacity'])
In [ ]:
df_capacity.head()
In [ ]:
df_capacity.shape
In [ ]:
#Exercise : For a given capacity bytes, find the total number of failures and plot it
In [ ]:
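One possible approach for this exercise, grouping by capacity instead of model:
In [ ]:
df_fail_capacity = pd.DataFrame(df.groupby('capacity').failure.sum()).reset_index()
df_fail_capacity.plot(kind="bar",x="capacity",y="failure")
Now let us combine model and capacity and look at the failures for each pair.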
In [ ]:
df_fail_mod_cap = pd.DataFrame(df.groupby(['model','capacity']).failure.sum())
In [ ]:
df_fail_mod_cap.head()
In [ ]:
df_fail_mod_cap = df_fail_mod_cap.reset_index()
In [ ]:
df_fail_mod_cap.head(25)
In [ ]:
df_fail_mod_cap.plot(x="capacity",y="failure",kind="bar",figsize=(20,5))
In [ ]:
df_fail_mod_cap.head()
In [ ]:
df_fail_mod_cap_pivot = df_fail_mod_cap.pivot("model","capacity","failure")
In [ ]:
df_fail_mod_cap_pivot.head()
We see that for some models and their respective capacities we do not have a fail count; let us fill those with 0.
In [ ]:
df_fail_mod_cap_pivot.fillna(0,inplace=True)
df_fail_mod_cap_pivot.head()
In [ ]:
sns.heatmap(df_fail_mod_cap_pivot)
This heat map gives us a better understanding of model and capacity versus failure.
In [ ]:
#Exercise : Find count of success for a model with different capacities and plot it
In [ ]:
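A possible sketch for the success-count exercise: count the distinct disks per model and capacity, then subtract the failures (mirroring how merged_df was built):
In [ ]:
df_tot_mod_cap = pd.DataFrame(df.groupby(['model','capacity']).serial.nunique()).reset_index()
df_tot_mod_cap.columns = ['model','capacity','total_HD']
df_suc_mod_cap = df_tot_mod_cap.merge(df_fail_mod_cap,how='inner',on=['model','capacity'])
df_suc_mod_cap['success'] = df_suc_mod_cap.total_HD - df_suc_mod_cap.failure
sns.heatmap(df_suc_mod_cap.pivot('model','capacity','success').fillna(0))
Next, let us count how many days each hard disk ran (Question 5).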
In [ ]:
df_days = pd.DataFrame(df.groupby(['capacity','serial']).date.count())
In [ ]:
df_days = df_days.reset_index()
df_days.head()
In [ ]:
df_days.columns = ['capacity','serial','total_days']
In [ ]:
df_days.head()
In [ ]:
df_days.capacity.value_counts()
In [ ]:
df_days.shape
In [ ]:
df_days_pivot = df_days.pivot('capacity','serial','total_days')
In [ ]:
df_days_pivot.head()
In [ ]:
df_days_pivot.fillna(0,inplace=True)
In [ ]:
df_days_pivot.head()
In [ ]:
# Exercise : Visualize the above dataframe
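One way to do it (a sketch; the pivot is wide, so a heatmap reads better than bars):
In [ ]:
sns.heatmap(df_days_pivot)
Now let us pull out the disks that actually failed, so we can compare their running time with the rest.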
In [ ]:
df_fail_days = pd.DataFrame(df[['capacity','serial','failure']].loc[df['failure'] == 1 ])
In [ ]:
df_fail_days.head()
Now let us merge this with the previous data frame, which had the serial numbers and the count of days.
In [ ]:
df_fail_count = df_days.merge(df_fail_days,how="left",on=['capacity','serial'])
In [ ]:
df_fail_count.head()
In [ ]:
df_fail_count.fillna(0,inplace=True)
In [ ]:
df_fail_count.head()
In [ ]:
df_fail_count.dtypes
In [ ]:
g = sns.FacetGrid(df_fail_count, col="failure",hue='failure',height=5,aspect=1.5)  # 'height' was called 'size' in older seaborn versions
g.map_dataframe(plt.scatter,x='capacity',y='total_days')
Now what can we do with this data? Is it useful? What can we generate from it that gives a little more insight?
We can compute, for each capacity, the average running time of the disks that failed and of the disks that did not.
In [ ]:
df_fail_count_avg = pd.DataFrame(df_fail_count.groupby(['capacity','failure']).total_days.mean())
In [ ]:
df_fail_count_avg.head()
In [ ]:
df_fail_count_avg = df_fail_count_avg.reset_index()
In [ ]:
df_fail_count_avg.head()
In [ ]:
df_fail_count_avg_pivot = df_fail_count_avg.pivot('capacity','failure','total_days')
In [ ]:
df_fail_count_avg_pivot.head()
In [ ]:
df_fail_count_avg_pivot.plot(kind="bar")
In [ ]:
df_hours = df[['serial','capacity','failure','smart_9']]
In [ ]:
df_hours.head()
In [ ]:
df_hours.shape
Now, for each hard disk and capacity, we want to know how long the hard disk ran, i.e. the maximum hours recorded.
In [ ]:
df_hours_max = pd.DataFrame(df_hours.groupby(['serial','capacity']).smart_9.max())
In [ ]:
df_hours_max.head()
In [ ]:
df_hours_max.shape
In [ ]:
df_hours_max = df_hours_max.reset_index()
In [ ]:
df_hours_max_merge = df_hours_max.merge(df_hours,on=['serial','capacity','smart_9'],how='inner')
In [ ]:
df_hours_max_merge.head()
In [ ]:
df_hours_max_merge_pivot = pd.pivot_table(df_hours_max_merge, index='capacity',
                                          columns='failure', values='smart_9', aggfunc='mean')
In [ ]:
df_hours_max_merge_pivot.head()
In [ ]:
df_hours_max_merge_pivot.plot(kind='bar')
In [ ]:
df_model_capacity_hours = df[['model','capacity','failure','smart_9']].copy()  # copy so that modifying columns below does not trigger SettingWithCopyWarning
In [ ]:
df_model_capacity_hours.head()
Let us convert the capacity from bytes to gigabytes and truncate it to a whole number.
In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity / 1024 ** 3
In [ ]:
df_model_capacity_hours.head()
In [ ]:
df_model_capacity_hours.capacity = df_model_capacity_hours.capacity.astype(np.int64)
In [ ]:
df_model_capacity_hours.head()
In [ ]:
df_model_capacity_hours_pivot = pd.pivot_table(data=df_model_capacity_hours, index='model',
                                               columns=['failure','capacity'],
                                               values='smart_9', aggfunc='mean')
In [ ]:
df_model_capacity_hours_pivot.head()
In [ ]:
df_model_capacity_hours_pivot.fillna(0,inplace=True)
In [ ]:
df_model_capacity_hours_pivot.head()
In [ ]:
df_model_capacity_hours_pivot.plot(kind="barh")
The above visualization is confusing, as the bars reflect every combination of failure status and capacity for each model.
In [ ]:
sns.heatmap(df_model_capacity_hours_pivot)
In [ ]:
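Finally, for Question 8, one possible heuristic (a sketch, not a definitive answer): among the records of disks that did not fail, find the model and capacity combinations with the highest average hours, and cross-check them against the failure ratios computed earlier before deciding what to buy.
In [ ]:
# Average smart_9 hours for non-failed records, per model and capacity (GB)
df_longest = (df_model_capacity_hours[df_model_capacity_hours.failure == 0]
              .groupby(['model','capacity']).smart_9.mean()
              .sort_values(ascending=False))
df_longest.head(10)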