We have downloaded the data from https://www.backblaze.com/hard-drive-test-data.html. We are using the 2013 data dump because the 2015 dump is considerably larger. The techniques we use to acquire the 2013 data can be applied to the 2015 dump as well.
When you download the 2013 data, you get a zipped package consisting of 266 files.
As the data is spread across 266 files, it is hard to read. Is there a way to put all of this data in one single place, so that reads are easier?
There are two options: combine everything into one big CSV file, or load the data into a database.
For this example, let us take the database path, for the simple reason that keeping a big CSV file in memory is very expensive. (A short sketch of the flat-file option is shown under Data Formats below.)
Data Formats
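If you prefer to stay with flat files, the 266 CSVs could instead be concatenated into one large CSV. The cell below is only a minimal sketch of that first option: it assumes the files sit in a `data` directory next to the notebook, that they share the same columns, and that the combined frame fits in memory (the output filename `hard-disk-test-combined.csv` is just an illustrative choice). We do not use this approach here.
In [ ]:
# option 1 (not used here): concatenate all csv files into one dataframe / one csv
# assumes every file has the same columns and that the result fits in memory
import os
import pandas as pd

csv_dir = os.path.join(os.getcwd(), 'data')
frames = [pd.read_csv(os.path.join(csv_dir, f))
          for f in os.listdir(csv_dir) if f.endswith('.csv')]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('hard-disk-test-combined.csv', index=False)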
Database
In [ ]:
import pandas as pd
from sqlalchemy import create_engine
import os
In [ ]:
import sqlalchemy
In [ ]:
sql_path = os.path.join(os.getcwd(), "hard-disk-test.db")
print(sql_path)
In [ ]:
# we have created a small db for you to use
# you will be able to see hard-disk-test.db in the repository you have cloned
# this command creates the connection to database
engine = create_engine('sqlite:///'+sql_path)
In [ ]:
engine
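As a quick sanity check, you can ask SQLAlchemy which tables the database already contains. This is only a sketch; the exact call depends on your SQLAlchemy version (`sqlalchemy.inspect` is available in reasonably recent releases).
In [ ]:
# list the tables that currently exist in hard-disk-test.db
inspector = sqlalchemy.inspect(engine)
print(inspector.get_table_names())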
In [ ]:
# the csv files live in the data directory; build its path so we can read them
path = os.path.join(os.getcwd(),'data')
print(path)
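Before loading anything, it can be worth checking that the directory really contains the CSV files you expect. A small sketch using the `path` defined above:
In [ ]:
# count the csv files sitting in the data directory
csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
print(str(len(csv_files)) + ' csv files found')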
In [ ]:
# we are reading the csv files from the directory and loading them into the database
no_of_files = 0
total_rows = 0
for file in os.listdir(path):
    if str(file).endswith(".csv"):
        print(file)
        inp_df = pd.read_csv(os.path.join(path, file))
        inp_df.to_sql("data", engine, index=False, chunksize=None, if_exists='append')
        no_of_files += 1
        total_rows = total_rows + inp_df.shape[0]
print('loading done: ' + str(no_of_files) + ' files, ' + str(total_rows) + ' rows')
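For the larger dumps (for example 2015), reading a whole CSV at once can still be expensive. One option is to let pandas read each file in chunks, so only a slice of it is in memory at any time. The cell below is a sketch that assumes the same `path` and `engine` from above; the chunk size of 100000 rows is an arbitrary choice.
In [ ]:
# chunked variant of the loader above: keeps memory usage bounded for large files
total_rows = 0
for file in os.listdir(path):
    if str(file).endswith(".csv"):
        for chunk in pd.read_csv(os.path.join(path, file), chunksize=100000):
            chunk.to_sql("data", engine, index=False, if_exists='append')
            total_rows += chunk.shape[0]
print('loaded ' + str(total_rows) + ' rows in chunks')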
In [ ]:
pd.read_sql_table(table_name="data",con=engine)
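Note that pd.read_sql_table pulls the entire table back into memory, which defeats the purpose of the database for large data. To peek at the data or count rows without loading everything, you can run a query instead; a minimal sketch against the `data` table created above:
In [ ]:
# fetch only the row count and a handful of rows instead of the whole table
print(pd.read_sql_query("SELECT COUNT(*) AS row_count FROM data", engine))
pd.read_sql_query("SELECT * FROM data LIMIT 5", engine)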
In [ ]:
# you can now put all 266 csv files into the data directory and the above piece of code will push all of the data
# into the database