APPFS Exercise 1

The following is able to process the entire data file of 100 million lines, it just takes forever on my computer, therefore I am submitting this assignment without the program having being run on the entire 100 million dataset. Instead, this program runs on a smaller subset of the dataset... specifically, the first 20 lines of the 100 million dataset. My code works for the data-subset (as seen by the pandas display and the printed mean), therefore I induce that it will also work on the entire dataset. \n

To run this program on your own more-powerful system, just run all the cells until the end. \n

I am blindly hoping that my program obtains the correct mean. I can't confirm and make further adjustments myself, since my computer is taking too long to run on the dataset. But in the case that my mean output is not correct, I would adjust my program so that it enforces that the SeqNo = index + 1, otherwise I will filter some data points out.



In [94]:

    
import pandas as pd

data = open("./ex1-100.dat").readlines()

df_data = pd.DataFrame(data, columns = ['original'])
df_data['clean'] = df_data.apply(lambda row: row['original'][0].isdigit(), axis=1) # determine if each line is valid data
df_dataclean = pd.DataFrame(df_data[df_data['clean'] == True]['original']).reindex() # eliminate the lines with comments
df_dataclean = df_dataclean.reset_index().drop('index', axis=1)

df_dataclean['split'] = df_dataclean.apply(lambda row: row['original'].split(';'), axis=1)
df_dataclean['SeqNo'] = df_dataclean.apply(lambda row: int(row['split'][0].strip()), axis=1)
df_dataclean['Location'] = df_dataclean.apply(lambda row: int(row['split'][1].strip()), axis=1)
df_dataclean['Value'] = df_dataclean.apply(lambda row: float(row['split'][2].strip()), axis=1)



In [77]:

    
df_dataclean.head()









    Out[77]:






  
    
      
      original
      split
      SeqNo
      Location
      Value
    
  
  
    
      0
      1; 2; 10.5196\n
      [1,  2,  10.5196\n]
      1
      2
      10.5196
    
    
      1
      2; 1; 85.2815\n
      [2,  1,  85.2815\n]
      2
      1
      85.2815
    
    
      2
      3; 2; 63.8276\n
      [3,  2,  63.8276\n]
      3
      2
      63.8276
    
    
      3
      4; 2; 84.4309\n
      [4,  2,  84.4309\n]
      4
      2
      84.4309
    
    
      4
      5; 1; 73.6285\n
      [5,  1,  73.6285\n]
      5
      1
      73.6285



In [78]:

    
mean1 = df_dataclean[df_dataclean['Location'] == 1]['Value'].mean()
mean2 = df_dataclean[df_dataclean['Location'] == 2]['Value'].mean()









    



mean1: 66.14905714285713
mean2: 54.21118



In [ ]:

    
# if mean not as expected, do further data cleaning. ex enforce that seqno = index + 1
print('Valid values Loc1: ' + str(df_dataclean['Location'].value_counts()[1]) + 'with GeoMean: ' + str(mean1))
print('Valid values Loc2: ' + str(df_dataclean['Location'].value_counts()[2]) + 'with GeoMean: ' + str(mean2))

	original	split	SeqNo	Location	Value
0	1; 2; 10.5196\n	[1, 2, 10.5196\n]	1	2	10.5196
1	2; 1; 85.2815\n	[2, 1, 85.2815\n]	2	1	85.2815
2	3; 2; 63.8276\n	[3, 2, 63.8276\n]	3	2	63.8276
3	4; 2; 84.4309\n	[4, 2, 84.4309\n]	4	2	84.4309
4	5; 1; 73.6285\n	[5, 1, 73.6285\n]	5	1	73.6285