We have downloaded the data from https://www.backblaze.com/hard-drive-test-data.html. We are using the 2013 data dump because the 2015 dump is considerably larger. The techniques we use to acquire the 2013 data can be applied to the 2015 dump as well.
When you download the 2013 data, you get a zipped package consisting of 266 files.
As the data is spread across 266 files, it is hard to read. Is there a way to put all of this data in one single place, so that reads are easier?
There are two options: combine everything into one big CSV file, or load the data into a database.
For this example, let us take the database path, for the simple reason that keeping a big CSV file in memory is very expensive. (A short sketch of the flat-file option is shown under Data Formats below.)
Data Formats
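If you prefer to stay with flat files, the 266 CSVs could instead be concatenated into one large CSV. The cell below is only a minimal sketch of that first option: it assumes the files sit in a `data` directory next to the notebook, that they share the same columns, and that the combined frame fits in memory (the output filename `hard-disk-test-combined.csv` is just an illustrative choice). We do not use this approach here.
In [ ]:
# option 1 (not used here): concatenate all csv files into one dataframe / one csv
# assumes every file has the same columns and that the result fits in memory
import os
import pandas as pd

csv_dir = os.path.join(os.getcwd(), 'data')
frames = [pd.read_csv(os.path.join(csv_dir, f))
          for f in os.listdir(csv_dir) if f.endswith('.csv')]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv('hard-disk-test-combined.csv', index=False)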
Database
In [ ]:
import pandas as pd
from sqlalchemy import create_engine
import os
In [ ]:
import sqlalchemy
In [ ]:
sql_path = os.path.join(os.getcwd(), "hard-disk-test.db")
print(sql_path)
In [ ]:
# we have created a small db for you to use
# you will be able to see hard-disk-test.db in the repository you have cloned
# this command creates the connection to database
engine = create_engine('sqlite:///'+sql_path)
In [ ]:
engine
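As a quick sanity check, you can ask SQLAlchemy which tables the database already contains. This is only a sketch; the exact call depends on your SQLAlchemy version (`sqlalchemy.inspect` is available in reasonably recent releases).
In [ ]:
# list the tables that currently exist in hard-disk-test.db
inspector = sqlalchemy.inspect(engine)
print(inspector.get_table_names())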
In [ ]:
# the csv files live in the data directory; build its path so we can read them
path = os.path.join(os.getcwd(),'data')
print(path)
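Before loading anything, it can be worth checking that the directory really contains the CSV files you expect. A small sketch using the `path` defined above:
In [ ]:
# count the csv files sitting in the data directory
csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]
print(str(len(csv_files)) + ' csv files found')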
In [ ]:
# we are reading the csv files from the directory and loading them into the database
no_of_files = 0
total_rows = 0
for file in os.listdir(path):
    if str(file).endswith(".csv"):
        print(file)
        inp_df = pd.read_csv(os.path.join(path, file))
        inp_df.to_sql("data", engine, index=False, chunksize=None, if_exists='append')
        no_of_files += 1
        total_rows = total_rows + inp_df.shape[0]
print('loading done: ' + str(no_of_files) + ' files, ' + str(total_rows) + ' rows')
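For the larger dumps (for example 2015), reading a whole CSV at once can still be expensive. One option is to let pandas read each file in chunks, so only a slice of it is in memory at any time. The cell below is a sketch that assumes the same `path` and `engine` from above; the chunk size of 100000 rows is an arbitrary choice.
In [ ]:
# chunked variant of the loader above: keeps memory usage bounded for large files
total_rows = 0
for file in os.listdir(path):
    if str(file).endswith(".csv"):
        for chunk in pd.read_csv(os.path.join(path, file), chunksize=100000):
            chunk.to_sql("data", engine, index=False, if_exists='append')
            total_rows += chunk.shape[0]
print('loaded ' + str(total_rows) + ' rows in chunks')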
In [ ]:
pd.read_sql_table(table_name="data",con=engine)
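Note that pd.read_sql_table pulls the entire table back into memory, which defeats the purpose of the database for large data. To peek at the data or count rows without loading everything, you can run a query instead; a minimal sketch against the `data` table created above:
In [ ]:
# fetch only the row count and a handful of rows instead of the whole table
print(pd.read_sql_query("SELECT COUNT(*) AS row_count FROM data", engine))
pd.read_sql_query("SELECT * FROM data LIMIT 5", engine)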
In [ ]:
# you can now put all 266 csv files into the data directory and the above piece of code will push all of the data
# into the database