In [1]:
import pandas as pd
Prerequisite: the dataset archive has been downloaded and uncompressed in the same directory.
In [2]:
!cat UCI\ HAR\ Dataset/activity_labels.txt
In [3]:
act = pd.read_table('UCI HAR Dataset/activity_labels.txt', header=None, sep=' ', names=('ID','Activity'))
In [4]:
act
Out[4]:
In [5]:
type(act)
Out[5]:
In [6]:
act.columns
Out[6]:
The act table has 6 observations of 2 variables.
In [7]:
features = pd.read_table('UCI HAR Dataset/features.txt', sep=' ', header=None, names=('ID','Sensor'))
In [8]:
features.head()
Out[8]:
In [9]:
features.info()
The features table has 561 observations of 2 variables: the ID and the sensor's name.
In [10]:
testSub = pd.read_table('UCI HAR Dataset/test/subject_test.txt', header=None, names=['SubjectID'])
In [11]:
testSub.shape
Out[11]:
In [12]:
testSub.head()
Out[12]:
The subject_test table contains 2947 observations of 1 variable: the subject ID.
In [13]:
testX = pd.read_table('UCI HAR Dataset/test/X_test.txt', sep=r'\s+', header=None)
The X_test file requires a regular expression as the separator, because the values are sometimes separated by more than one blank.
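To see what goes wrong with a plain single-space separator, here is a minimal sketch (a small in-memory sample, not the real file; the real X_test.txt also has leading blanks, which '\s+' handles as well):

import io
sample = io.StringIO('1.0  2.0\n3.0  4.0\n')
bad = pd.read_table(sample, sep=' ', header=None)      # double blanks create empty fields -> extra NaN columns
sample.seek(0)
good = pd.read_table(sample, sep=r'\s+', header=None)  # regex collapses runs of blanks
bad.shape, good.shape                                  # (2, 3) vs (2, 2)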
In [14]:
testX.head()
Out[14]:
In [15]:
testX.shape
Out[15]:
The X_test table has 2947 observations of 561 variables.
The y_test file contains the outcome activity label for each observation.
In [16]:
testY = pd.read_table('UCI HAR Dataset/test/y_test.txt', sep=' ', header=None)
In [17]:
testY.shape
Out[17]:
In [18]:
testY.head()
Out[18]:
In [19]:
testY.tail()
Out[19]:
It's also possible to add a column name after creation:
In [20]:
testY.columns = ['ActivityID']
In [21]:
testY.head()
Out[21]:
Now let's move to the train folder.
In [22]:
trainSub = pd.read_table('UCI HAR Dataset/train/subject_train.txt', header=None, names=['SubjectID'])
In [23]:
trainSub.shape
Out[23]:
In [24]:
trainX = pd.read_table('UCI HAR Dataset/train/X_train.txt', sep=r'\s+', header=None)
In [25]:
trainX.shape
Out[25]:
In [26]:
trainY = pd.read_table('UCI HAR Dataset/train/y_train.txt', sep=' ', header=None, names=['ActivityID'])
In [27]:
trainY.shape
Out[27]:
As you can see, the train set has 7352 observations, spread across 3 files.
In [28]:
allSub = pd.concat([trainSub, testSub], ignore_index=True)
In [29]:
allSub.shape
Out[29]:
Now the allSub data frame contains 10299 = 7352 + 2947 rows. Note that ignore_index=True is necessary to get an index running from 0 to 10298, instead of restarting from 0 after the first 7352 observations. You can verify it with the tail() method:
In [30]:
allSub.tail()
Out[30]:
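For contrast, a minimal sketch of what happens without ignore_index (toy frames, not the HAR data):

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
pd.concat([a, b]).index.tolist()                     # [0, 1, 0, 1] -- the index restarts
pd.concat([a, b], ignore_index=True).index.tolist()  # [0, 1, 2, 3]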
Now we do the same for the X and Y data sets
In [31]:
allX = pd.concat([trainX, testX], ignore_index=True)
In [32]:
allX.shape
Out[32]:
In [33]:
allY = trainY.append(testY, ignore_index=True)
In [34]:
allY.shape
Out[34]:
In [35]:
allY.head()
Out[35]:
For the Y data set I used the DataFrame method append() just to show an alternative way to merge. Be aware that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions pd.concat() is required.
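For pandas 2.0 and later, the equivalent is the same concat pattern already used for allSub and allX:

allY = pd.concat([trainY, testY], ignore_index=True)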
Next goals: appropriately label the data set with descriptive variable names, and use descriptive activity names to name the activities in the data set.
In [36]:
allX.head()
Out[36]:
In [37]:
sensorNames = features['Sensor']
In [38]:
allX.columns = sensorNames
In [39]:
allX.head()
Out[39]:
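One caveat worth checking (an aside, not part of the original walk-through): features.txt contains some repeated sensor names (notably the bandsEnergy features), so allX now has duplicate column labels; pandas allows this, but selecting such a name returns several columns at once:

sensorNames.duplicated().sum()  # > 0 here: some sensor names repeat
allX.columns.is_unique          # False when duplicates are present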
In [40]:
allSub.head()
Out[40]:
Merge the Subjects and X data frames by columns (axis=1 concatenates side by side, matching rows by index). Note that the name all shadows Python's built-in all() function; it works, but a more distinctive name would be safer.
In [41]:
all = pd.concat([allX, allSub], axis=1)
In [42]:
all.shape
Out[42]:
In [43]:
all.head()
Out[43]:
Now the new data frame has 562 columns, and the last column is the Subject ID.
Same for allY: add it to the main data frame as an extra column, but first map each activity code to its descriptive label.
Map activity code to label
In [44]:
allY.head()
Out[44]:
In [45]:
act
Out[45]:
In [46]:
allY.tail()
Out[46]:
In [47]:
for i in act['ID']:
    activity = act[act['ID'] == i]['Activity']  # get the activity label for this ID
    allY = allY.replace({i: activity.iloc[0]})  # replace the ID with the activity string
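The loop works, but a vectorized sketch with Series.map does the same mapping in one pass (assuming allY still has its ActivityID column at this point):

id_to_label = act.set_index('ID')['Activity']             # ID -> label lookup Series
allY['ActivityID'] = allY['ActivityID'].map(id_to_label)  # replace codes with labels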
In [48]:
allY.columns = ['Activity']
In [49]:
allY.head()
Out[49]:
In [50]:
allY.tail()
Out[50]:
Now add allY to the new all dataframe
In [51]:
allY.shape
Out[51]:
In [52]:
all = pd.concat([all, allY], axis=1)
In [53]:
all.shape
Out[53]:
Now all has one more column.
In [54]:
all.head()
Out[54]:
Done with the first data frame. It can be written to a file with the to_csv() method (the pandas counterpart of R's write.table).
In [55]:
all.to_csv("tidyHARdata.csv")
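Note that to_csv() also writes the row index as an unnamed first column by default; pass index=False to leave it out:

all.to_csv('tidyHARdata.csv', index=False)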
Next step: from this data set, create a second, independent tidy data set with the average of each variable for each activity and each subject.
In [61]:
grouped = all.groupby(['SubjectID', 'Activity'])
The tidy data frame built from this grouping (the aggregation cell is not shown) has 900 rows and 563 columns.
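A minimal sketch of that step (since the aggregation cell is not shown above, mean() is my assumption about what was computed):

tidy = grouped.mean()      # average of each variable per (subject, activity) pair
tidy = tidy.reset_index()  # turn the group keys back into ordinary columns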
In [65]:
import numpy as np
In [67]:
tidier = all.groupby(['Activity']).aggregate(np.mean)
In [101]:
tidier = tidier.drop('SubjectID', axis=1)
In [69]:
tidier.head()
Out[69]:
In [ ]:
tidier.to_csv("tidierHARdata.csv")