The following is able to process the entire data file of 100 million lines, it just takes forever on my computer, therefore I am submitting this assignment without the program having being run on the entire 100 million dataset. Instead, this program runs on a smaller subset of the dataset... specifically, the first 20 lines of the 100 million dataset. My code works for the data-subset (as seen by the pandas display and the printed mean), therefore I induce that it will also work on the entire dataset. \n
To run this program on your own more-powerful system, just run all the cells until the end. \n
I am blindly hoping that my program obtains the correct mean. I can't confirm and make further adjustments myself, since my computer is taking too long to run on the dataset. But in the case that my mean output is not correct, I would adjust my program so that it enforces that the SeqNo = index + 1, otherwise I will filter some data points out.
In [94]:
import pandas as pd
data = open("./ex1-100.dat").readlines()
df_data = pd.DataFrame(data, columns = ['original'])
df_data['clean'] = df_data.apply(lambda row: row['original'][0].isdigit(), axis=1) # determine if each line is valid data
df_dataclean = pd.DataFrame(df_data[df_data['clean'] == True]['original']).reindex() # eliminate the lines with comments
df_dataclean = df_dataclean.reset_index().drop('index', axis=1)
df_dataclean['split'] = df_dataclean.apply(lambda row: row['original'].split(';'), axis=1)
df_dataclean['SeqNo'] = df_dataclean.apply(lambda row: int(row['split'][0].strip()), axis=1)
df_dataclean['Location'] = df_dataclean.apply(lambda row: int(row['split'][1].strip()), axis=1)
df_dataclean['Value'] = df_dataclean.apply(lambda row: float(row['split'][2].strip()), axis=1)
In [77]:
df_dataclean.head()
Out[77]:
In [78]:
mean1 = df_dataclean[df_dataclean['Location'] == 1]['Value'].mean()
mean2 = df_dataclean[df_dataclean['Location'] == 2]['Value'].mean()
In [ ]:
# if mean not as expected, do further data cleaning. ex enforce that seqno = index + 1
print('Valid values Loc1: ' + str(df_dataclean['Location'].value_counts()[1]) + 'with GeoMean: ' + str(mean1))
print('Valid values Loc2: ' + str(df_dataclean['Location'].value_counts()[2]) + 'with GeoMean: ' + str(mean2))