Pre-loaded data
We have loaded the complete data set into an SQLite database and made that file available to you. We will use this file to explore the data and refine it.
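The loading step itself is not shown in this notebook. For reference, here is a minimal sketch of how such a file could be built, assuming the raw data is a directory of daily CSV files (the raw_data/*.csv path below is hypothetical, adjust it to your own copy of the raw files):
In [ ]:
# Sketch: build the SQLite `data` table from a directory of daily CSV files.
# The 'raw_data/*.csv' pattern is hypothetical; point it at the actual raw files.
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:////Users/standy/hard-disk.db')
for csv_file in sorted(glob.glob('raw_data/*.csv')):
    daily_df = pd.read_csv(csv_file)
    # Append each day's rows into a single `data` table.
    daily_df.to_sql('data', con=engine, index=False, if_exists='append')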
In [ ]:
import pandas as pd
from sqlalchemy import create_engine
import os
In [ ]:
#engine = create_engine('sqlite:////home/nischal/hard-disk.db')
# Four slashes after `sqlite:` denote an absolute file path.
engine = create_engine('sqlite:////Users/standy/hard-disk.db')
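Before querying, it is worth confirming that the connection works and that the expected table is present. A quick check using SQLAlchemy's inspector (we expect to see a table named data):
In [ ]:
# Confirm the connection works and list the tables in the SQLite file.
from sqlalchemy import inspect
inspect(engine).get_table_names()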
In [ ]:
# Pull (up to) the first 2,000,000 rows of the `data` table into a DataFrame.
total_df = pd.read_sql_query(sql='''SELECT * from data limit 2000000''', con=engine)
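Pulling two million rows in one go can be memory-hungry. If the full pull does not fit comfortably in memory, read_sql_query also accepts a chunksize argument, which returns an iterator of smaller DataFrames instead. A sketch:
In [ ]:
# Stream the same query in chunks of 100,000 rows instead of one large pull.
chunks = pd.read_sql_query(sql='''SELECT * from data limit 2000000''',
                           con=engine, chunksize=100000)
total_rows = sum(len(chunk) for chunk in chunks)
total_rows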
In [ ]:
# Select only the date and model columns, skipping the first 10 rows and taking the next 10.
specific_columns_df = pd.read_sql_query(sql='''SELECT date, model from data limit 10 offset 10''',
                                        con=engine)
In [ ]:
specific_columns_df.head()
In [ ]:
total_df.head()
In [ ]:
total_df.count()
In [ ]:
# Summarise the non-null count of every column in a small DataFrame.
total_count_df = pd.DataFrame(total_df.count(), columns=['counts'])
In [ ]:
total_count_df.head()
In [ ]:
# Keep only the columns that actually contain data.
total_count_df[total_count_df.counts > 0]
How to refine?
When we look at our dataset, we see that a lot of columns have no data. Let us take a look at the columns that have data and those that don't.
From this data we can see that we need only 10 columns.
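Rather than reading the non-empty columns off by eye, the same list can be derived programmatically from the counts computed above. A small sketch:
In [ ]:
# Derive the list of columns that actually contain data from total_count_df.
non_empty_columns = total_count_df[total_count_df.counts > 0].index.tolist()
non_empty_columns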
In [ ]:
# Load only the required columns (over a limited date range); this reduced set is saved to a new table below.
reduced_data_df = pd.read_sql_query(sql='''select model,date,serial_number,capacity_bytes,failure,smart_1_raw,smart_5_raw,smart_9_raw,smart_194_raw,smart_197_raw from data where date between '2013-04-10' and '2013-08-20' ''',con=engine)
In [ ]:
# Also write the reduced data set out to a CSV file.
reduced_data_df.to_csv('data.csv')
In [ ]:
# Write the reduced data set into a new table, data_reduced, in the same SQLite file.
reduced_data_df.to_sql("data_reduced", con=engine, index=False, if_exists='append')
Using the above columns we have created a reduced data set and loaded it into the database as a new table, data_reduced. This is the table we will be using from here onwards.
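As a quick sanity check (assuming the append above was run exactly once), we can read the new table back and compare row counts. Note that if_exists='append' will add duplicate rows if the cell is re-run; use if_exists='replace' when regenerating the table from scratch.
In [ ]:
# Compare the row count of the new data_reduced table with the DataFrame we wrote.
check_df = pd.read_sql_query(sql='''SELECT count(*) AS n FROM data_reduced''', con=engine)
print(check_df['n'][0], len(reduced_data_df))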