Refining the data

Why to refine?

Removing the un-necessary columns reduces the size of data thereby helping us compute things faster and us lesser memory.

Pre-loaded data

We have loaded the complete data set into sqlite and we have made that file available to you. We will be using this file to look at the data and refine it


In [ ]:
import pandas as pd
from sqlalchemy import create_engine
import os

In [ ]:
#engine = create_engine('sqlite:////home/nischal/hard-disk.db')
engine = create_engine('sqlite://///Users/standy/hard-disk.db')

In [ ]:
total_df = pd.read_sql_query(sql='''SELECT * from data limit 2000000''' ,con=engine)

In [ ]:
specific_columns_df = pd.read_sql_query(sql=''' SELECT date,model from data limit 10 offset 10'''
                                            ,con=engine)

In [ ]:
specific_columns_df.head()

In [ ]:
total_df.head()

In [ ]:
total_df.count()

In [ ]:
total_count_df = pd.DataFrame(total_df.count(),columns=['counts'])

In [ ]:
total_count_df.head()

In [ ]:
total_count_df[total_count_df.counts > 0]

How to refine?

When we look at our dataset , we see that a lot of columns have no data. Let us take a look at the columns that has data and those that dont

From this data we can see that we need only 10 columns

  • model
  • date
  • serial_number
  • capacity_bytes
  • failure
  • smart_1_raw
  • smart_5_raw
  • smart_9_raw
  • smart_194_raw
  • smart_197_raw


In [ ]:
#this is the code snippet to to load required columns into another database
reduced_data_df = pd.read_sql_query(sql='''select model,date,serial_number,capacity_bytes,failure,smart_1_raw,smart_5_raw,smart_9_raw,smart_194_raw,smart_197_raw from data where date between '2013-04-10' and '2013-08-20' ''',con=engine)

In [ ]:
reduced_data_df.to_csv('data.csv')

In [ ]:
reduced_data_df.to_sql("data_reduced",con=engine,index=False,if_exists='append')

Using the above columns we have created a reduced data set and loaded it onto the database. This is the database we will be using from here onwards.


In [ ]: