Parsing a single log file

parse: to examine in a minute way

In this notebook we'll extract the information on reaction time and accuracy from a single log file, and generalise our code to apply to any log file (written with the same structure).

It is considered good practice to import all the modules you use in a notebook at the beginning, so we'll start with that:


In [3]:
import string

We'll be using two constants defined in the string module. Both are strings, which behave like sequences of characters:

  1. the lowercase (ASCII) letters
  2. the digits (as a string, not as numbers)

In [34]:
print(string.ascii_lowercase)
print(type(string.ascii_lowercase))
print(string.digits)


abcdefghijklmnopqrstuvwxyz
<class 'str'>
0123456789

Read lines of a single log-file into a list

Assign the path to one of the logfiles to the variable logfile_name. You will need to adjust the path to wherever you placed the logs-directory containing them!


In [9]:
logfile_name = '../src/logs/0023_FCA_2017-03-09.log'

Open the file, read the lines & close the file.


In [10]:
fp = open(logfile_name, 'r')
all_lines = fp.readlines()
fp.close()
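
A common alternative to the explicit open/close pair is a with-block, which closes the file automatically when the block ends, even if reading fails part-way. The sketch below is self-contained: it first writes a tiny stand-in file (made-up for this illustration, modelled on the log format used here) and then reads it back. The names tmp_dir, demo_name and demo_lines are not part of the exercise.

```python
import os
import tempfile

# Write a tiny stand-in log file so the example runs anywhere
# (hypothetical content, not one of the real logs).
tmp_dir = tempfile.mkdtemp()
demo_name = os.path.join(tmp_dir, 'demo.log')
with open(demo_name, 'w') as fp:
    fp.write('# Time unit: 100 us\n')
    fp.write('35309\t42\tSTIM=x\n')

# The with-block closes the file automatically when the block ends.
with open(demo_name, 'r') as fp:
    demo_lines = fp.readlines()

print(demo_lines)
print(fp.closed)  # the file object still exists, but it is closed
```

Either style gives the same list of lines; the with-block simply removes the risk of forgetting fp.close().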

Display the first ten lines. For this, you can use the slice syntax [:10], which reads: 'from the start up to, but not including, index 10'.


In [30]:
all_lines[:10]


Out[30]:
['# Original filename: 0023_FCA_2017-03-09.log\n',
 '# Time unit: 100 us\n',
 '# RARECAT=digit\n',
 '#\n',
 '# Time\tHHGG\tEvent\n',
 '35309\t42\tSTIM=x\n',
 '38316\t42\tRESP=1\n',
 '51108\t42\tSTIM=h\n',
 '63261\t42\tRESP=1\n',
 '66731\t42\tSTIM=k\n']

The first five lines are comments, which we'll want to skip over. How many events are there in the file (how many rows after the comments)?


In [12]:
len(all_lines[5:])


Out[12]:
2560
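
Counting from the 6th line works because we know this file has exactly five comment lines. A slightly more defensive sketch, assuming that comment lines always begin with '#' (true of the lines displayed above, but not checked here for the other logs), filters on that character instead. The names sample_lines and event_lines are made up for this illustration.

```python
# A short in-memory sample, copied from the output above.
sample_lines = [
    '# Original filename: 0023_FCA_2017-03-09.log\n',
    '# Time unit: 100 us\n',
    '# RARECAT=digit\n',
    '#\n',
    '# Time\tHHGG\tEvent\n',
    '35309\t42\tSTIM=x\n',
    '38316\t42\tRESP=1\n',
]

# Keep only the lines that do not start with the comment character.
event_lines = [line for line in sample_lines if not line.startswith('#')]
print(len(event_lines))
```

The rest of this notebook keeps the simpler all_lines[5:] slice.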

Splitting the lines

From the above, determine the field-separator character used in the file.


In [14]:
field_sep = '\t'  # the fields are separated by a tab character

Split the 6th line and display:


In [15]:
line = all_lines[5]
split_line = line.split(field_sep)
print(split_line)


['35309', '42', 'STIM=x\n']

The 1st value of the split list is the time; the 3rd value tells us whether the event was a stimulus presentation (STIM) or a response (RESP). Because every event has the same layout, the stimulus itself (a letter or digit) is always the character just after the equals sign, so we only need to count how many characters into the string that is. The index of the stimulus is:


In [19]:
# what is the index of the stimulus?
# Try changing the relevant value below until you get 'x'
split_line[2][5]


Out[19]:
'x'

In [20]:
idx = 5  # which index gives you the letter/digit?

Note that this index is also the one we need for getting to the response (1 or 2).
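
The fixed index relies on 'STIM' and 'RESP' having the same length. As a cross-check, here is a minimal sketch that splits the event at the equals sign instead, using the string method partition. The variable names event, kind and value are made up for this illustration.

```python
event = 'STIM=x\n'

# partition splits at the first '=' into (before, separator, after);
# strip removes the trailing newline first
kind, _, value = event.strip().partition('=')
print(kind, value)

# the fixed index gives the same character
print(event[5] == value)
```

Both approaches agree for the lines shown so far, so the notebook continues with the index.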

  • split the 6th line & print the stimulus delivery time and stimulus presented
  • split the 7th line & print the response time and button number pressed
  • calculate the reaction time
    • NB: the contents of the file we are reading from are textual
    • arithmetic on text is very different from that on numbers...
    • (you'll need to convert the string to a number; use the int-function)
  • assign the reaction time to a variable ('RT') and print it
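
To see why the conversion matters, here is a small sketch using the two time stamps shown earlier (stim_text and resp_text are names made up for this illustration):

```python
stim_text = '35309'
resp_text = '38316'

# '+' on strings concatenates instead of adding
print(resp_text + stim_text)

# '-' is not defined for strings at all
try:
    resp_text - stim_text
except TypeError as err:
    print('TypeError:', err)

# after conversion, the arithmetic behaves as expected
print(int(resp_text) - int(stim_text))
```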

In [22]:
# 6th line: STIM
line = all_lines[5]
split_line = line.split(field_sep)
print(split_line)

stim_time = split_line[0]  # 1st field: the time stamp (still a string)
cur_stim = split_line[2][idx]  # the character after 'STIM='
print(stim_time, cur_stim)
    
# 7th line: RESP
line = all_lines[6]
split_line = line.split(field_sep)
print(split_line)
    
resp_time = split_line[0]  # 1st field: the time stamp (still a string)
cur_resp = split_line[2][idx]  # the character after 'RESP='
print(resp_time, cur_resp)

# calculate RT
RT = int(resp_time) - int(stim_time)  # convert to integers before subtracting
print('reaction time: ', RT)


['35309', '42', 'STIM=x\n']
35309 x
['38316', '42', 'RESP=1\n']
38316 1
reaction time:  3007

Loop over the lines

Convert the above into something that can be used to loop over the list. Start by just looping over the 6th and 7th rows: you should arrive at the same answer as above.

You'll need logic for determining whether the event (the 3rd element of the split line) starts with the string STIM. Strings have a method startswith for this! Use an if-else construct.


In [26]:
'STIM=x\n'.startswith('STIM')


Out[26]:
True

In [32]:
for line in all_lines[5:]:

    split_line = line.split(field_sep)

    # does the 3rd element of the list start with 'STIM'?
    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]
        # print(stim_time, cur_stim)

    else:  # nope; it starts with something other than 'STIM'
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]
        # print(resp_time, cur_resp)

        # calculate RT (relies on a STIM line having been read before this one)
        RT = int(resp_time) - int(stim_time)
        # print('reaction time: ', RT)
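
To check the loop logic without printing all the events, the same structure can be run on a handful of lines. The sketch below uses the first four events from the output shown earlier; sample_events is a name made up for this illustration, and field_sep and idx are repeated so that the example runs on its own.

```python
field_sep = '\t'
idx = 5

# the first four events, copied from the output shown earlier
sample_events = [
    '35309\t42\tSTIM=x\n',
    '38316\t42\tRESP=1\n',
    '51108\t42\tSTIM=h\n',
    '63261\t42\tRESP=1\n',
]

for line in sample_events:
    split_line = line.split(field_sep)
    if split_line[2].startswith('STIM'):
        # remember when the stimulus was shown
        stim_time = split_line[0]
    else:
        # a response: subtract the time of the most recent stimulus
        resp_time = split_line[0]
        RT = int(resp_time) - int(stim_time)
        print('reaction time:', RT)
```

Note that the loop assumes every response is preceded by a stimulus; a file starting with a RESP line would fail because stim_time would not yet exist.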

Saving the reaction times into lists

Instead of printing out 1280 RT values, we want to save them into memory for later use (we need to calculate mean and median values over them). Start with two empty lists for reaction times:

  • one for the frequent category of stimuli (letter)
  • one for the rare category of stimuli (digit)

and use the .append-method to add the values to the lists.


In [33]:
# empty lists for reaction times
rt_freq = []
rt_rare = []

In [37]:
for line in all_lines[5:]:
    split_line = line.split(field_sep)

    # does the 3rd element of the list start with 'STIM'?
    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]

    else:  # nope; it starts with something other than 'STIM'
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]

        # calculate RT (relies on a STIM line having been read before this one)
        RT = int(resp_time) - int(stim_time)

        # test if the current stimulus is in the `ascii_lowercase`-list
        if cur_stim in string.ascii_lowercase:
            rt_freq.append(RT)            
        # else test if the current stimulus is in the `digits`-list
        elif cur_stim in string.digits:
            rt_rare.append(RT)
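
The two membership tests can be tried out in isolation. Below is a minimal sketch with a hypothetical helper, category, that is not part of the exercise; the labels 'freq', 'rare' and 'unknown' are also made up for this illustration.

```python
import string


def category(stim):
    """Hypothetical helper: name the category of a one-character stimulus."""
    if stim in string.ascii_lowercase:
        return 'freq'
    elif stim in string.digits:
        return 'rare'
    return 'unknown'


print(category('x'), category('7'), category('X'))
```

Note that `in` applied to strings is a substring test, so it only acts as a membership test here because each stimulus is a single character. A stimulus that is neither a lowercase letter nor a digit is silently skipped by the loop above.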

Accuracy: is each response correct or incorrect?

Modify the above code to also include logic for determining whether the response is correct or not. Initialise two counters for the number of correct responses, one per category.


In [51]:
rt_freq = []
rt_rare = []
n_corr_freq = 0
n_corr_rare = 0

In [52]:
for line in all_lines[5:]:
    split_line = line.split(field_sep)

    # does the 3rd element of the list start with 'STIM'?
    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]

    else:  # nope; it starts with something other than 'STIM'
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]

        # calculate RT (relies on a STIM line having been read before this one)
        RT = int(resp_time) - int(stim_time)

        # test if the current stimulus is in the `ascii_lowercase`-list
        if cur_stim in string.ascii_lowercase:
            rt_freq.append(RT)
            if cur_resp == '1':
                n_corr_freq = n_corr_freq + 1
            
        # else test if the current stimulus is in the `digits`-list
        elif cur_stim in string.digits:
            rt_rare.append(RT)
            if cur_resp == '2':
                n_corr_rare = n_corr_rare + 1

In [54]:
rt_freq[:10]


Out[54]:
[3007, 12153, 4080, 5013, 3598, 2730, 5460, 4094, 3173, 4984]

Finally, calculate the mean and median reaction time, and the accuracy, for each category:

  • use the functions you previously wrote as an exercise
    • you'll have to copy the code into the present notebook and execute it
  • recall that times are given in the odd unit of hundreds of microseconds (the file header says 'Time unit: 100 us')
    • multiply by 100e-3 (i.e. 0.1) to obtain milliseconds
  • accuracy is simply the number of correct responses divided by the total number of responses (multiplied by 100 below to give a percentage)
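
As a worked example of the two conversions, take the first reaction time from the list above. The counts n_correct and n_total are made-up numbers for illustration only.

```python
rt_raw = 3007          # first reaction time, in units of 100 microseconds
rt_ms = 0.1 * rt_raw   # 100 us = 0.1 ms
print(round(rt_ms, 1))

n_correct, n_total = 3, 4   # made-up counts, for illustration only
accuracy = 100 * n_correct / n_total
print(accuracy)
```

The rounding is only for display: multiplying by 0.1 in floating point can leave a tiny error in the last digits, as in some of the medians printed later.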

In [50]:
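# copy-paste your mean- and median-function here:
def mean(values):
    # arithmetic mean: the sum of the values divided by how many there are
    return sum(values) / len(values)

def median(values):
    # middle element of the sorted values; for an even number of values
    # this picks the upper of the two middle ones (no averaging)
    return sorted(values)[len(values) // 2]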
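
For comparison, Python's standard library has a statistics module with ready-made versions. The median function above returns the upper of the two middle values when the list has an even number of elements, which corresponds to statistics.median_high rather than statistics.median (which averages the two). A minimal sketch, using the first four reaction times from the output above:

```python
import statistics

values = [3007, 12153, 4080, 5013]  # first four RTs from the output above

print(statistics.mean(values))         # arithmetic mean
print(statistics.median(values))       # averages the two middle values
print(statistics.median_high(values))  # upper of the two middle values

# the hand-written median used in this notebook agrees with median_high
print(sorted(values)[len(values) // 2] == statistics.median_high(values))
```

With many values the difference between the two definitions is usually small, but it is worth knowing which one you are reporting.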

In [58]:
# freq
mean_rt_freq = 0.1 * mean(rt_freq)
median_rt_freq = 0.1 * median(rt_freq)
accuracy_freq = 100 * n_corr_freq / len(rt_freq)

# rare
mean_rt_rare = 100e-3 * mean(rt_rare)
median_rt_rare = 100e-3 * median(rt_rare)
accuracy_rare = 100 * n_corr_rare / len(rt_rare)

In [59]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)


Frequent category:
------------------
Mean: 499.4505859375
Median: 464.70000000000005
Accuracy: 96.484375

In [60]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)


Rare category:
--------------
Mean: 595.238671875
Median: 565.1
Accuracy: 85.9375

Convert all of the above into a function

Now that we have code that works for one file, we can make it into a function and apply it to the other files (hoping they 'behave' the same way as the file we developed the code on...).


In [61]:
def read_log_file(logfile_name, field_sep='\t'):
    '''Read a single log file

    The default field separator is the tab character.

    Return the mean and median RT (in milliseconds) and the accuracy
    (in percent), separately for the frequent and rare categories.
    The values are returned as a tuple of 6 elements, in the order:
    (mean_rt_freq, median_rt_freq, accuracy_freq,
     mean_rt_rare, median_rt_rare, accuracy_rare)

    Relies on the mean and median functions defined earlier in the notebook.
    '''

    # initialise 
    rt_freq = []
    rt_rare = []
    n_corr_freq = 0
    n_corr_rare = 0

    # open file and read all its lines into a list
    fp = open(logfile_name, 'r')
    all_lines = fp.readlines()
    fp.close()

    # hard-code the index of the stimulus/response type/number
    idx = 5
    
    # loop over lines from 6th onwards
    for line in all_lines[5:]:
        split_line = line.split(field_sep)

        # does the 3rd element of the list start with 'STIM'?
        if split_line[2].startswith('STIM'):
            stim_time = split_line[0]
            cur_stim = split_line[2][idx]

        else:  # nope; it starts with something other than 'STIM'
            resp_time = split_line[0]
            cur_resp = split_line[2][idx]

            # calculate RT (relies on a STIM line having been read before this one)
            RT = int(resp_time) - int(stim_time)

            # test if the current stimulus is in the `ascii_lowercase`-list
            if cur_stim in string.ascii_lowercase:
                rt_freq.append(RT)
                if cur_resp == '1':
                    n_corr_freq = n_corr_freq + 1

            # else test if the current stimulus is in the `digits`-list
            elif cur_stim in string.digits:
                rt_rare.append(RT)
                if cur_resp == '2':
                    n_corr_rare = n_corr_rare + 1                 
                    
    # freq
    mean_rt_freq = 0.1 * mean(rt_freq)
    median_rt_freq = 0.1 * median(rt_freq)
    accuracy_freq = 100 * n_corr_freq / len(rt_freq)

    # rare
    mean_rt_rare = 100e-3 * mean(rt_rare)
    median_rt_rare = 100e-3 * median(rt_rare)
    accuracy_rare = 100 * n_corr_rare / len(rt_rare)

    return(mean_rt_freq, median_rt_freq, accuracy_freq,
           mean_rt_rare, median_rt_rare, accuracy_rare)

Test the function on the same file, then on a new one


In [62]:
(mean_rt_freq, median_rt_freq, accuracy_freq,
    mean_rt_rare, median_rt_rare, accuracy_rare) = read_log_file(logfile_name)

In [63]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)


Frequent category:
------------------
Mean: 499.4505859375
Median: 464.70000000000005
Accuracy: 96.484375

In [64]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)


Rare category:
--------------
Mean: 595.238671875
Median: 565.1
Accuracy: 85.9375
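
The same five print statements are repeated for every category and every file. One way to avoid the repetition is a small helper; the sketch below uses a hypothetical function, format_summary, that is not part of the exercise, and feeds it the rare-category values printed above.

```python
def format_summary(title, mean_rt, median_rt, accuracy):
    """Hypothetical helper: build the printed summary for one category."""
    lines = [
        title,
        '-' * len(title),  # underline as long as the title
        'Mean: ' + str(mean_rt),
        'Median: ' + str(median_rt),
        'Accuracy: ' + str(accuracy),
    ]
    return '\n'.join(lines)


# values copied from the rare-category output above
summary = format_summary('Rare category:', 595.238671875, 565.1, 85.9375)
print(summary)
```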

In [65]:
logfile_name = '../src/logs/0048_MSB_2016-09-23.log'

In [66]:
(mean_rt_freq, median_rt_freq, accuracy_freq,
    mean_rt_rare, median_rt_rare, accuracy_rare) = read_log_file(logfile_name)

In [67]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)


Frequent category:
------------------
Mean: 503.15771484375
Median: 466.6
Accuracy: 95.60546875

In [68]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)


Rare category:
--------------
Mean: 582.63359375
Median: 549.6
Accuracy: 88.671875
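
With a function in hand, applying it to a whole directory of logs is a short loop. The following is a minimal sketch, not a tested pipeline: summarise_logs is a hypothetical helper, and the demonstration uses a temporary directory with made-up files and a stand-in reader (count_lines) so that it runs without the real logs.

```python
import glob
import os
import tempfile


def summarise_logs(log_dir, reader):
    """Hypothetical helper: apply `reader` to every .log file in `log_dir`.

    Returns a dict mapping each file name to whatever `reader` returns.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(log_dir, '*.log'))):
        results[os.path.basename(path)] = reader(path)
    return results


# Demonstration on made-up files (not the real logs)
demo_dir = tempfile.mkdtemp()
for name in ('a.log', 'b.log', 'notes.txt'):
    with open(os.path.join(demo_dir, name), 'w') as fp:
        fp.write('35309\t42\tSTIM=x\n')


def count_lines(path):
    # stand-in reader: just count the lines in the file
    with open(path) as fp:
        return len(fp.readlines())


demo_results = summarise_logs(demo_dir, count_lines)
print(demo_results)
```

In this notebook the call would look like summarise_logs('../src/logs', read_log_file), assuming all the logs follow the structure of the two files tested above.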
