Read and clean data using Python and Pandas


In [1]:
import pandas as pd

Read the files and load the datasets

Prerequisite: the dataset archive has been downloaded and uncompressed into the same directory


In [2]:
cat UCI\ HAR\ Dataset/activity_labels.txt


1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS
4 SITTING
5 STANDING
6 LAYING

In [3]:
act = pd.read_table('UCI HAR Dataset/activity_labels.txt', header=None, sep=' ', names=('ID','Activity'))

In [4]:
act


Out[4]:
ID Activity
0 1 WALKING
1 2 WALKING_UPSTAIRS
2 3 WALKING_DOWNSTAIRS
3 4 SITTING
4 5 STANDING
5 6 LAYING

In [5]:
type(act)


Out[5]:
pandas.core.frame.DataFrame

In [6]:
act.columns


Out[6]:
Index(['ID', 'Activity'], dtype='object')

act table has 6 observations of 2 variables


In [7]:
features = pd.read_table('UCI HAR Dataset/features.txt', sep=' ', header=None, names=('ID','Sensor'))

In [8]:
features.head()


Out[8]:
ID Sensor
0 1 tBodyAcc-mean()-X
1 2 tBodyAcc-mean()-Y
2 3 tBodyAcc-mean()-Z
3 4 tBodyAcc-std()-X
4 5 tBodyAcc-std()-Y

In [9]:
features.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 561 entries, 0 to 560
Data columns (total 2 columns):
ID        561 non-null int64
Sensor    561 non-null object
dtypes: int64(1), object(1)
memory usage: 13.1+ KB

The features table has 561 observations of 2 variables: the ID and the sensor name


In [10]:
testSub = pd.read_table('UCI HAR Dataset/test/subject_test.txt', header=None, names=['SubjectID'])

In [11]:
testSub.shape


Out[11]:
(2947, 1)

In [12]:
testSub.head()


Out[12]:
SubjectID
0 2
1 2
2 2
3 2
4 2

subject_test table contains 2947 observations of 1 variable: the subject ID


In [13]:
testX = pd.read_table('UCI HAR Dataset/test/X_test.txt', sep=r'\s+', header=None)

The file X_test.txt needs a regular expression as separator, because fields are sometimes padded with more than one blank (the raw string r'\s+' also avoids invalid-escape warnings in newer Python versions)
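A minimal sketch of why the regex separator matters, using a hypothetical two-row sample that mimics the variable-width padding of X_test.txt:

```python
import io
import pandas as pd

# Hypothetical sample mimicking X_test.txt: fields separated by a
# *variable* number of spaces, plus leading whitespace on each line.
sample = "  0.2571 -0.0232  -0.0146\n  0.2860 -0.0131  -0.1190\n"

# sep=' ' would turn every extra blank into an empty column;
# the regex r'\s+' treats each run of whitespace as one delimiter.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None)
print(df.shape)  # (2, 3)
```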


In [14]:
testX.head()


Out[14]:
0 1 2 3 4 5 6 7 8 9 ... 551 552 553 554 555 556 557 558 559 560
0 0.257178 -0.023285 -0.014654 -0.938404 -0.920091 -0.667683 -0.952501 -0.925249 -0.674302 -0.894088 ... 0.071645 -0.330370 -0.705974 0.006462 0.162920 -0.825886 0.271151 -0.720009 0.276801 -0.057978
1 0.286027 -0.013163 -0.119083 -0.975415 -0.967458 -0.944958 -0.986799 -0.968401 -0.945823 -0.894088 ... -0.401189 -0.121845 -0.594944 -0.083495 0.017500 -0.434375 0.920593 -0.698091 0.281343 -0.083898
2 0.275485 -0.026050 -0.118152 -0.993819 -0.969926 -0.962748 -0.994403 -0.970735 -0.963483 -0.939260 ... 0.062891 -0.190422 -0.640736 -0.034956 0.202302 0.064103 0.145068 -0.702771 0.280083 -0.079346
3 0.270298 -0.032614 -0.117520 -0.994743 -0.973268 -0.967091 -0.995274 -0.974471 -0.968897 -0.938610 ... 0.116695 -0.344418 -0.736124 -0.017067 0.154438 0.340134 0.296407 -0.698954 0.284114 -0.077108
4 0.274833 -0.027848 -0.129527 -0.993852 -0.967445 -0.978295 -0.994111 -0.965953 -0.977346 -0.938610 ... -0.121711 -0.534685 -0.846595 -0.002223 -0.040046 0.736715 -0.118545 -0.692245 0.290722 -0.073857

5 rows × 561 columns


In [15]:
testX.shape


Out[15]:
(2947, 561)

The X test table has 2947 observations of 561 variables

The file y_test contains the outcome activity label for each observation


In [16]:
testY = pd.read_table('UCI HAR Dataset/test/y_test.txt', sep=' ', header=None)

In [17]:
testY.shape


Out[17]:
(2947, 1)

In [18]:
testY.head()


Out[18]:
0
0 5
1 5
2 5
3 5
4 5

In [19]:
testY.tail()


Out[19]:
0
2942 2
2943 2
2944 2
2945 2
2946 2

It's also possible to add a column name after creation:


In [20]:
testY.columns = ['ActivityID']

In [21]:
testY.head()


Out[21]:
ActivityID
0 5
1 5
2 5
3 5
4 5

Now let's move to the train folder


In [22]:
trainSub = pd.read_table('UCI HAR Dataset/train/subject_train.txt', header=None, names=['SubjectID'])

In [23]:
trainSub.shape


Out[23]:
(7352, 1)

In [24]:
trainX = pd.read_table('UCI HAR Dataset/train/X_train.txt', sep=r'\s+', header=None)

In [25]:
trainX.shape


Out[25]:
(7352, 561)

In [26]:
trainY = pd.read_table('UCI HAR Dataset/train/y_train.txt', sep=' ', header=None, names=['ActivityID'])

In [27]:
trainY.shape


Out[27]:
(7352, 1)

As you can see, the train set has 7352 observations, spread across 3 files

Merge the training and the test datasets


In [28]:
allSub = pd.concat([trainSub, testSub], ignore_index=True)

In [29]:
allSub.shape


Out[29]:
(10299, 1)

Now the allSub data frame contains 10299 = 2947+7352 rows. Note that ignore_index=True is necessary to have an index starting from 0 and ending at 10298, without restarting after the first 7352 observations. You can see it using the tail() method:
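The effect of ignore_index can be seen on a toy pair of frames (made-up values):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# Without ignore_index the original row labels are kept, so they repeat.
kept = pd.concat([a, b])
# With ignore_index=True a fresh RangeIndex 0..n-1 is built instead.
fresh = pd.concat([a, b], ignore_index=True)

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```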


In [30]:
allSub.tail()


Out[30]:
SubjectID
10294 24
10295 24
10296 24
10297 24
10298 24

Now we do the same for the X and Y data sets


In [31]:
allX = pd.concat([trainX, testX], ignore_index=True)

In [32]:
allX.shape


Out[32]:
(10299, 561)

In [33]:
allY = trainY.append(testY, ignore_index=True)

In [34]:
allY.shape


Out[34]:
(10299, 1)

In [35]:
allY.head()


Out[35]:
ActivityID
0 5
1 5
2 5
3 5
4 5

For the Y dataset I used the DataFrame method append() just to show an alternative merge possibility. Note that append() was deprecated and removed in pandas 2.0, so in current versions pd.concat() is the way to go.

Next, appropriately label the data set with descriptive variable names, and use descriptive activity names for the activities in the data set


In [36]:
allX.head()


Out[36]:
0 1 2 3 4 5 6 7 8 9 ... 551 552 553 554 555 556 557 558 559 560
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.074323 -0.298676 -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... 0.158075 -0.595051 -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... 0.414503 -0.390748 -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... 0.404573 -0.117290 -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... 0.087753 -0.351471 -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892

5 rows × 561 columns


In [37]:
sensorNames = features['Sensor']

In [38]:
allX.columns = sensorNames

In [39]:
allX.head()


Out[39]:
Sensor tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-meanFreq() fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean)
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.074323 -0.298676 -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... 0.158075 -0.595051 -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... 0.414503 -0.390748 -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... 0.404573 -0.117290 -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... 0.087753 -0.351471 -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892

5 rows × 561 columns
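Assigning a Series to .columns, as done with sensorNames above, takes its values as the new labels; the lengths must match exactly. A small sketch with two made-up sensor names:

```python
import pandas as pd

df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])
names = pd.Series(['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y'])

# The Series' values become the column labels; a length mismatch
# would raise ValueError instead of silently truncating.
df.columns = names
print(list(df.columns))  # ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y']
```

This also explains why the outputs above show "Sensor" in the header row: the name of the assigned Series becomes the name of the column index.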


In [40]:
allSub.head()


Out[40]:
SubjectID
0 1
1 1
2 1
3 1
4 1

Merge the Subject and X data frames by columns. (Note: naming the result all shadows Python's built-in all() function; a name like allData would be safer, but all is kept here for brevity.)


In [41]:
all = pd.concat([allX, allSub], axis=1)

In [42]:
all.shape


Out[42]:
(10299, 562)

In [43]:
all.head()


Out[43]:
Sensor tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) SubjectID
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.298676 -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627 1
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... -0.595051 -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317 1
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... -0.390748 -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118 1
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... -0.117290 -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663 1
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... -0.351471 -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892 1

5 rows × 562 columns

Now the new data frame has 562 columns, and the last column is the SubjectID.
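Column-wise concatenation aligns rows by index label, which is why building both frames with the same fresh RangeIndex (via ignore_index=True earlier) matters. A toy sketch with made-up values:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [10, 20, 30]})

# Same RangeIndex on both sides: rows pair up one-to-one.
side = pd.concat([left, right], axis=1)
print(side.shape)  # (3, 2)

# Different indices: concat keeps the union of labels and fills NaN.
right.index = [1, 2, 3]
shifted = pd.concat([left, right], axis=1)
print(shifted.shape)  # (4, 2)
```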

We do the same for allY: add it to the main data frame as an extra column, but first map each activity code to its descriptive label

Map activity codes to labels


In [44]:
allY.head()


Out[44]:
ActivityID
0 5
1 5
2 5
3 5
4 5

In [45]:
act


Out[45]:
ID Activity
0 1 WALKING
1 2 WALKING_UPSTAIRS
2 3 WALKING_DOWNSTAIRS
3 4 SITTING
4 5 STANDING
5 6 LAYING

In [46]:
allY.tail()


Out[46]:
ActivityID
10294 2
10295 2
10296 2
10297 2
10298 2

In [47]:
for i in act['ID']:
    activity = act[act['ID'] == i]['Activity']  # get activity cell given ID
    allY = allY.replace({i: activity.iloc[0]})  # replace this ID with activity string

In [48]:
allY.columns = ['Activity']

In [49]:
allY.head()


Out[49]:
Activity
0 STANDING
1 STANDING
2 STANDING
3 STANDING
4 STANDING

In [50]:
allY.tail()


Out[50]:
Activity
10294 WALKING_UPSTAIRS
10295 WALKING_UPSTAIRS
10296 WALKING_UPSTAIRS
10297 WALKING_UPSTAIRS
10298 WALKING_UPSTAIRS
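The replace loop above can also be written as a single vectorized lookup. A sketch that recreates small stand-ins for act and allY:

```python
import pandas as pd

act = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'Activity': ['WALKING', 'WALKING_UPSTAIRS',
                                 'WALKING_DOWNSTAIRS', 'SITTING',
                                 'STANDING', 'LAYING']})
allY = pd.DataFrame({'ActivityID': [5, 5, 2]})

# Build a {code: label} dict once, then map every cell in one pass.
id_to_label = dict(zip(act['ID'], act['Activity']))
allY['Activity'] = allY['ActivityID'].map(id_to_label)
print(allY['Activity'].tolist())  # ['STANDING', 'STANDING', 'WALKING_UPSTAIRS']
```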

Now add allY to the new all dataframe


In [51]:
allY.shape


Out[51]:
(10299, 1)

In [52]:
all = pd.concat([all, allY], axis=1)

In [53]:
all.shape


Out[53]:
(10299, 563)

Now all has one more column, 563 in total


In [54]:
all.head()


Out[54]:
Sensor tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) SubjectID Activity
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627 1 STANDING
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317 1 STANDING
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118 1 STANDING
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663 1 STANDING
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892 1 STANDING

5 rows × 563 columns

Done with the first data frame. It can be saved to a file with the to_csv() method (pandas' counterpart to R's write.table)


In [55]:
all.to_csv("tidyHARdata.csv")
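Note that to_csv() writes the row index as an unnamed first column by default; pass index=False to drop it. A round-trip sketch on a tiny made-up frame:

```python
import io
import pandas as pd

df = pd.DataFrame({'SubjectID': [1, 1], 'Activity': ['STANDING', 'STANDING']})

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False drops the 0..n-1 index column
back = pd.read_csv(io.StringIO(buf.getvalue()))
print(back.equals(df))  # True
```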

Instead of stopping there, we now create from this data set a second, independent tidy data set with the average of each variable for each activity and each subject.


In [61]:
grouped = all.groupby(['SubjectID', 'Activity'])
tidy = grouped.mean()

The tidy data frame has 180 rows (30 subjects × 6 activities), with the average of each of the 561 sensor variables per subject/activity pair
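A groupby on the two keys followed by mean() is the core of this summary; on a toy frame with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'SubjectID': [1, 1, 2, 2],
                   'Activity': ['WALKING', 'WALKING', 'SITTING', 'SITTING'],
                   'value': [0.2, 0.4, 1.0, 2.0]})

# The grouping keys become a MultiIndex; every remaining numeric
# column is averaged within its (subject, activity) group.
tidy = df.groupby(['SubjectID', 'Activity']).mean()
print(tidy.shape)  # (2, 1)
```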


In [65]:
import numpy as np

In [67]:
tidier = all.groupby(['Activity']).aggregate(np.mean)

In [101]:
tidier = tidier.drop('SubjectID', axis=1)

In [69]:
tidier.head()


Out[69]:
Sensor tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) SubjectID
Activity
LAYING 0.268649 -0.018318 -0.107436 -0.960932 -0.943507 -0.948069 -0.963395 -0.944099 -0.947261 -0.911610 ... -0.342490 -0.658640 0.010366 0.016013 0.022788 0.009191 0.520261 -0.435944 -0.427749 16.643519
SITTING 0.273060 -0.012690 -0.105517 -0.983446 -0.934881 -0.938982 -0.985280 -0.934798 -0.936671 -0.926703 ... -0.384740 -0.686795 0.012034 0.002458 0.013413 -0.033260 -0.706042 0.006140 -0.089532 16.680360
STANDING 0.279153 -0.016152 -0.106587 -0.984435 -0.932509 -0.939914 -0.987205 -0.933875 -0.937739 -0.921265 ... -0.343769 -0.654989 0.006991 0.010397 0.004614 0.015957 -0.774143 0.209818 0.031743 16.513116
WALKING 0.276337 -0.017907 -0.108882 -0.314644 -0.023583 -0.273921 -0.352470 -0.055491 -0.274098 -0.113458 ... -0.071308 -0.400268 0.014918 -0.007011 0.011332 -0.019443 -0.761886 0.218599 0.059771 15.272938
WALKING_DOWNSTAIRS 0.288137 -0.016312 -0.105762 0.100766 0.059549 -0.190805 0.032426 0.003907 -0.214569 0.417152 ... -0.277020 -0.618394 -0.039692 -0.018665 0.203588 -0.076029 -0.780873 0.200189 0.055866 15.828592

5 rows × 562 columns


In [ ]:
tidier.to_csv("tidierHARdata.csv")