A simple (i.e. no error checking or sensible engineering) notebook to extract the student answer data from a single XML file.

I'll also export the data to a CSV file at the end, so that it's easy to read in at the beginning of another notebook.

Following discussions with Suraj, we want the representation to take into account the student's response, the official answer, and the grade. So there'll be a little fiddliness in linking each student response back to the gold-standard answer it matches.

So, first read the file:


In [25]:
filename='semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml'

It's an XML file, so we'll need the xml.etree parser, and pandas so that we can import the results into a dataframe:


In [26]:
import pandas as pd

from xml.etree import ElementTree as ET

In [27]:
tree=ET.parse(filename)

r=tree.getroot()

Now, the reference answers are in the second daughter of the root node. We can extract these and store them in a dictionary. To distinguish between reference answer tokens and student response tokens, I'm going to suffix each token in the reference answers with _RA, and each token in a student response with _SR.


In [28]:
from string import punctuation

def to_tokens(textIn):
    '''Convert the input textIn to a list of tokens'''
    tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
    # remove any empty tokens
    return [t for t in tokens_ls if t]

str='"Help!" yelped the banana, who was obviously scared out of his skin.'
print(str)
print(to_tokens(str))


"Help!" yelped the banana, who was obviously scared out of his skin.
['help', 'yelped', 'the', 'banana', 'who', 'was', 'obviously', 'scared', 'out', 'of', 'his', 'skin']
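
Note that strip(punctuation) only removes punctuation from the ends of tokens, so punctuation inside a word survives (the document frequencies below contain tokens like aren"t_SR and battery"s_SR for exactly this reason). If that mattered, a regex-based tokenizer would be stricter. A minimal sketch, not what's used in the rest of this notebook:

import re

def to_tokens_re(textIn):
    '''Keep only runs of lowercase letters, digits and apostrophes'''
    return re.findall(r"[a-z0-9']+", textIn.lower())

print(to_tokens_re('"Help!" yelped the banana.'))
# ['help', 'yelped', 'the', 'banana']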

In [29]:
refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens(refAnswer.text)] 
                 for refAnswer in r[1]}    
refAnswers_dict


Out[29]:
{'answer204': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'separated_RA',
  'by_RA',
  'the_RA',
  'gap_RA'],
 'answer205': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'not_RA',
  'connected_RA'],
 'answer206': ['terminal_RA',
  '1_RA',
  'is_RA',
  'connected_RA',
  'to_RA',
  'the_RA',
  'negative_RA',
  'battery_RA',
  'terminal_RA'],
 'answer207': ['terminal_RA',
  '1_RA',
  'is_RA',
  'not_RA',
  'separated_RA',
  'from_RA',
  'the_RA',
  'negative_RA',
  'battery_RA',
  'terminal_RA'],
 'answer207.NEW': ['terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'battery_RA',
  'terminal_RA',
  'are_RA',
  'in_RA',
  'different_RA',
  'electrical_RA',
  'states_RA']}

Next, we need to extract each of the student responses. These are in the third daughter of the root node:


In [30]:
print(r[2][0].text)
r[2][0].attrib


positive battery terminal is separated by a gap from terminal 1
Out[30]:
{'accuracy': 'correct',
 'answerMatch': 'answer204',
 'count': '1',
 'id': 'FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.sbj3-l1.qa193'}

In [31]:
responses_ls=[]
for studentResponse in r[2]:
    if 'answerMatch' in studentResponse.attrib:
        matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
    else:
        # No gold-standard match recorded, so only the student's own tokens are kept
        matchTokens_ls=[]
    responses_ls.append({'accuracy':studentResponse.attrib['accuracy'],
                         'text':studentResponse.text,
                         'tokens':[t+'_SR' for t in to_tokens(studentResponse.text)] + matchTokens_ls})

responses_ls[36]


Out[31]:
{'accuracy': 'correct',
 'text': 'the positive battery terminal and terminal 1 are not connected',
 'tokens': ['the_SR',
  'positive_SR',
  'battery_SR',
  'terminal_SR',
  'and_SR',
  'terminal_SR',
  '1_SR',
  'are_SR',
  'not_SR',
  'connected_SR',
  'terminal_RA',
  '1_RA',
  'and_RA',
  'the_RA',
  'positive_RA',
  'terminal_RA',
  'are_RA',
  'not_RA',
  'connected_RA']}

That seems to work. Now, let's define a function that takes a filename and returns the list of token dictionaries:


In [32]:
def extract_token_dictionaries(filenameIn):
    
    # Localise the to_tokens function
    def to_tokens_local(textIn):
        '''Convert the input textIn to a list of tokens'''
        tokens_ls=[t.lower().strip(punctuation) for t in textIn.split()]
        # remove any empty tokens
        return [t for t in tokens_ls if t]

    tree=ET.parse(filenameIn)
    root=tree.getroot()
    
    refAnswers_dict={refAnswer.attrib['id']:[t+'_RA' for t in to_tokens_local(refAnswer.text)]
                     for refAnswer in root[1]}

    responsesOut_ls=[]
    for studentResponse in root[2]:
        if 'answerMatch' in studentResponse.attrib:
            matchTokens_ls=refAnswers_dict[studentResponse.attrib['answerMatch']]
        else:
            matchTokens_ls=[]
        responsesOut_ls.append({'accuracy':studentResponse.attrib['accuracy'],
                                'text':studentResponse.text,
                                'tokens':[t+'_SR' for t in to_tokens_local(studentResponse.text)]
                                         + matchTokens_ls})
    return responsesOut_ls

We now have a function which takes a filename and returns a list of tokenised student responses and reference answers:


In [33]:
extract_token_dictionaries(filename)[:2]


Out[33]:
[{'accuracy': 'correct',
  'text': 'positive battery terminal is separated by a gap from terminal 1',
  'tokens': ['positive_SR',
   'battery_SR',
   'terminal_SR',
   'is_SR',
   'separated_SR',
   'by_SR',
   'a_SR',
   'gap_SR',
   'from_SR',
   'terminal_SR',
   '1_SR',
   'terminal_RA',
   '1_RA',
   'and_RA',
   'the_RA',
   'positive_RA',
   'terminal_RA',
   'are_RA',
   'separated_RA',
   'by_RA',
   'the_RA',
   'gap_RA']},
 {'accuracy': 'correct',
  'text': 'terminal 1 is not connected to the positive terminal',
  'tokens': ['terminal_SR',
   '1_SR',
   'is_SR',
   'not_SR',
   'connected_SR',
   'to_SR',
   'the_SR',
   'positive_SR',
   'terminal_SR',
   'terminal_RA',
   '1_RA',
   'and_RA',
   'the_RA',
   'positive_RA',
   'terminal_RA',
   'are_RA',
   'not_RA',
   'connected_RA']}]

So next we need to be able to build a document frequency dictionary from a list of tokenised documents: for each token, the number of documents it appears in.


In [34]:
def document_frequencies(listOfTokenLists):
    # Build the set of all tokens used:
    token_set=set()
    for tokenList in listOfTokenLists:
        token_set=token_set.union(set(tokenList))

    # Then return, for each token, the number of documents it appears in
    return {t:len([l for l in listOfTokenLists if t in l])
            for t in token_set}
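
(As an aside, the same counts can be built in a single pass by feeding each document's set of distinct tokens into a collections.Counter — a sketch, equivalent in output but quicker on larger collections:)

from collections import Counter

def document_frequencies_fast(listOfTokenLists):
    # Each document contributes each of its distinct tokens exactly once
    df_counter=Counter()
    for tokenList in listOfTokenLists:
        df_counter.update(set(tokenList))
    return dict(df_counter)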

In [35]:
tokenLists_ls=[x['tokens'] for x in extract_token_dictionaries(filename)]
document_frequencies(tokenLists_ls)


Out[35]:
{'1.5_SR': 3,
 '1_RA': 55,
 '1_SR': 40,
 '2_SR': 1,
 'a_SR': 31,
 'and_RA': 48,
 'and_SR': 20,
 'answer_SR': 1,
 'any_SR': 1,
 'are_RA': 48,
 'are_SR': 12,
 'aren"t_SR': 1,
 'at_SR': 3,
 'batteries_SR': 1,
 'battery"s_SR': 1,
 'battery_RA': 7,
 'battery_SR': 39,
 'becaquse_SR': 1,
 'because_SR': 28,
 'becuase_SR': 1,
 'between_SR': 9,
 'both_SR': 2,
 'bulb_SR': 7,
 'by_RA': 26,
 'by_SR': 10,
 'c_SR': 1,
 'charge_SR': 2,
 'circuit_SR': 3,
 'closed_SR': 1,
 'closing_SR': 1,
 'components_SR': 1,
 'connected_RA': 29,
 'connected_SR': 50,
 'connection_SR': 5,
 'contact_SR': 1,
 'created_SR': 1,
 'damaged_SR': 3,
 'difference_SR': 1,
 'different_SR': 3,
 'dint_SR': 1,
 'direct_SR': 1,
 'do_SR': 1,
 'dont_SR': 1,
 'each_SR': 2,
 'electrical_SR': 3,
 'end_SR': 1,
 'from_SR': 6,
 'gap_RA': 26,
 'gap_SR': 27,
 'gaps_SR': 1,
 'get_SR': 1,
 'had_SR': 2,
 'has_SR': 1,
 'have_SR': 1,
 'he_SR': 2,
 'i_SR': 4,
 'in_SR': 4,
 'is_RA': 7,
 'is_SR': 54,
 'it_SR': 6,
 'its_SR': 2,
 'know_SR': 2,
 'making_SR': 1,
 'me_SR': 1,
 'negative_RA': 7,
 'negative_SR': 13,
 'no_SR': 9,
 'not_RA': 22,
 'not_SR': 26,
 'of_SR': 2,
 'on_SR': 2,
 'one_SR': 8,
 'other_SR': 2,
 'path_SR': 1,
 'positive_RA': 48,
 'positive_SR': 52,
 'posittive_SR': 1,
 'positve_SR': 1,
 'postive_SR': 1,
 'psoitive_SR': 1,
 'reading_SR': 1,
 'same_SR': 1,
 'separated_RA': 26,
 'separated_SR': 6,
 'separates_SR': 1,
 'separation_SR': 2,
 'separted_SR': 1,
 'seperated_SR': 7,
 'so_SR': 1,
 'state_SR': 1,
 'states_SR': 3,
 'tell_SR': 1,
 'termianl_SR': 1,
 'terminal_RA': 55,
 'terminal_SR': 68,
 'terminals_SR': 6,
 'the_RA': 55,
 'the_SR': 71,
 'thebulb_SR': 1,
 'their_SR': 1,
 'then_SR': 2,
 'there_SR': 20,
 'they_SR': 3,
 'to_RA': 7,
 'to_SR': 42,
 'tot_SR': 1,
 'two_SR': 2,
 'understand_SR': 1,
 'v_SR': 1,
 'voltage_SR': 3,
 'was_SR': 18,
 'with_SR': 3}

Next, define a function which takes a list of tokens and a document frequency dictionary, and returns a dictionary of the tf.idf values for each of the tokens in the list. (Strictly speaking, what's computed below is term frequency divided by raw document frequency, rather than the textbook tf times log(N/df) weighting, but it serves the same purpose here.) Note: for this function, if a token isn't in the document frequency dictionary, then it won't be returned in the tf.idf dictionary.

We can use the collections.Counter object to get the tf values.


In [36]:
from collections import Counter

In [37]:
def get_tfidf(tokens_ls, docFreq_dict):
    tf_dict=Counter(tokens_ls)
    return {t:tf_dict[t]/docFreq_dict[t] for t in tf_dict if t in docFreq_dict}

In [38]:
get_tfidf('the cat sat on the mat'.split(), {'cat':2, 'the':1})


Out[38]:
{'cat': 0.5, 'the': 2.0}
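
For comparison, a version with the usual logarithmic idf weighting would look something like this (a sketch, assuming the total document count nDocs is also passed in; it isn't used below):

from collections import Counter
from math import log

def get_tfidf_log(tokens_ls, docFreq_dict, nDocs):
    '''Term frequency weighted by log(N/df), the textbook tf.idf'''
    tf_dict=Counter(tokens_ls)
    return {t:tf_dict[t]*log(nDocs/docFreq_dict[t])
            for t in tf_dict if t in docFreq_dict}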

Finally, we want to convert the outputs for all of the responses into a dataframe.


In [39]:
# Extract the data from the file:
tokenDictionaries_ls=extract_token_dictionaries(filename)

# Build the lists of tokens, one per response (reusing the data we just extracted):
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]

# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)

# Create the tf.idf for each response:
tfidf_ls=[get_tfidf(tokens_ls, docFreq_dict) for tokens_ls in tokenLists_ls]

# Now, create a dataframe which is indexed by the tokens
# in the document frequency dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())

# Use the index of each response in the list as its column header:
for (i, tfidf_dict) in enumerate(tfidf_ls):
    trainingText_df[i]=pd.Series(tfidf_dict, index=trainingText_df.index)

# Finally, transpose, and replace the NaNs with 0:
trainingText_df.fillna(0).T


Out[39]:
positive_RA same_SR 1_SR the_RA is_SR batteries_SR negative_RA connected_SR positve_SR psoitive_SR ... v_SR have_SR path_SR end_SR dont_SR so_SR tell_SR are_RA terminal_RA created_SR
0 0.020833 0.0 0.025 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
1 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
2 0.000000 0.0 0.025 0.000000 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
3 0.000000 0.0 0.025 0.000000 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
4 0.000000 0.0 0.000 0.000000 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
5 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
6 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
7 0.020833 0.0 0.025 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
8 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
9 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
10 0.000000 0.0 0.000 0.000000 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
11 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
12 0.020833 0.0 0.000 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
13 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
14 0.000000 0.0 0.025 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
15 0.000000 0.0 0.025 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
16 0.020833 0.0 0.025 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
17 0.020833 0.0 0.025 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
18 0.020833 0.0 0.025 0.018182 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
19 0.020833 0.0 0.000 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
20 0.020833 0.0 0.000 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
21 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
22 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
23 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
24 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
25 0.000000 1.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
26 0.000000 0.0 0.000 0.018182 0.018519 0.0 0.142857 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.036364 0.0
27 0.000000 0.0 0.025 0.000000 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
28 0.000000 0.0 0.025 0.018182 0.018519 0.0 0.142857 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.036364 0.0
29 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73 0.020833 0.0 0.000 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
74 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
75 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
76 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
77 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
78 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
79 0.020833 0.0 0.000 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
80 0.020833 0.0 0.025 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
81 0.020833 0.0 0.000 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
82 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
83 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.000000 0.000000 0.0
84 0.000000 0.0 0.025 0.000000 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
85 0.000000 0.0 0.025 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
86 0.000000 0.0 0.000 0.000000 0.018519 1.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
87 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
88 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
89 0.000000 0.0 0.025 0.018182 0.018519 0.0 0.142857 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.036364 0.0
90 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
91 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
92 0.020833 0.0 0.000 0.018182 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
93 0.020833 0.0 0.000 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
94 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
95 0.020833 0.0 0.025 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
96 0.000000 0.0 0.025 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
97 0.020833 0.0 0.000 0.036364 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.020833 0.036364 1.0
98 0.000000 0.0 0.000 0.000000 0.000000 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
99 0.020833 0.0 0.025 0.036364 0.018519 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
100 0.020833 0.0 0.025 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0
101 0.000000 0.0 0.025 0.000000 0.000000 0.0 0.000000 0.00 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0
102 0.020833 0.0 0.000 0.018182 0.018519 0.0 0.000000 0.02 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.020833 0.036364 0.0

103 rows × 112 columns

Cool, that seems to work. Now we just need to do it for the complete set of files; we'll use beetle/train/Core for the time being.


In [40]:
!ls semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/


FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY1.xml
FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY2.xml
FaultFinding-BULB_C_VOLTAGE_EXPLAIN_WHY6.xml
FaultFinding-BULB_ONLY_EXPLAIN_WHY2.xml
FaultFinding-BULB_ONLY_EXPLAIN_WHY4.xml
FaultFinding-BULB_ONLY_EXPLAIN_WHY6.xml
FaultFinding-BURNED_BULB_LOCATE_EXPLAIN_Q.xml
FaultFinding-OTHER_TERMINAL_STATE_EXPLAIN_Q.xml
FaultFinding-TERMINAL_STATE_EXPLAIN_Q.xml
FaultFinding-VOLTAGE_AND_GAP_DISCUSS_Q.xml
FaultFinding-VOLTAGE_DEFINE_Q.xml
FaultFinding-VOLTAGE_DIFF_DISCUSS_1_Q.xml
FaultFinding-VOLTAGE_DIFF_DISCUSS_2_Q.xml
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY1.xml
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY3.xml
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY4.xml
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY5.xml
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY6.xml
FaultFinding-VOLTAGE_INCOMPLETE_CIRCUIT_2_Q.xml
SwitchesBulbsParallel-BURNED_BULB_PARALLEL_EXPLAIN_Q1.xml
SwitchesBulbsParallel-BURNED_BULB_PARALLEL_EXPLAIN_Q2.xml
SwitchesBulbsParallel-BURNED_BULB_PARALLEL_EXPLAIN_Q3.xml
SwitchesBulbsParallel-BURNED_BULB_PARALLEL_WHY_Q.xml
SwitchesBulbsParallel-GIVE_CIRCUIT_TYPE_HYBRID_EXPLAIN_Q2.xml
SwitchesBulbsParallel-GIVE_CIRCUIT_TYPE_HYBRID_EXPLAIN_Q3.xml
SwitchesBulbsParallel-GIVE_CIRCUIT_TYPE_PARALLEL_EXPLAIN_Q2.xml
SwitchesBulbsParallel-HYBRID_BURNED_OUT_EXPLAIN_Q1.xml
SwitchesBulbsParallel-HYBRID_BURNED_OUT_EXPLAIN_Q3.xml
SwitchesBulbsParallel-HYBRID_BURNED_OUT_WHY_Q2.xml
SwitchesBulbsParallel-HYBRID_BURNED_OUT_WHY_Q3.xml
SwitchesBulbsParallel-OPT1_EXPLAIN_Q2.xml
SwitchesBulbsParallel-OPT2_EXPLAIN_Q.xml
SwitchesBulbsParallel-PARALLEL_SWITCH_EXPLAIN_Q1.xml
SwitchesBulbsParallel-PARALLEL_SWITCH_EXPLAIN_Q2.xml
SwitchesBulbsParallel-PARALLEL_SWITCH_EXPLAIN_Q3.xml
SwitchesBulbsParallel-SWITCH_TABLE_EXPLAIN_Q1.xml
SwitchesBulbsParallel-SWITCH_TABLE_EXPLAIN_Q2.xml
SwitchesBulbsParallel-SWITCH_TABLE_EXPLAIN_Q3.xml
SwitchesBulbsSeries-CONDITIONS_FOR_BULB_TO_LIGHT.xml
SwitchesBulbsSeries-DAMAGED_BUILD_EXPLAIN_Q.xml
SwitchesBulbsSeries-DAMAGED_BULB_EXPLAIN_2_Q.xml
SwitchesBulbsSeries-GIVE_CIRCUIT_TYPE_SERIES_EXPLAIN_Q.xml
SwitchesBulbsSeries-SHORT_CIRCUIT_EXPLAIN_Q_2.xml
SwitchesBulbsSeries-SHORT_CIRCUIT_EXPLAIN_Q_4.xml
SwitchesBulbsSeries-SHORT_CIRCUIT_EXPLAIN_Q_5.xml
SwitchesBulbsSeries-SHORT_CIRCUIT_X_Q.xml
SwitchesBulbsSeries-SWITCH_OPEN_EXPLAIN_Q.xml

Use os.walk to get the files:


In [41]:
import os

We can now do the same as before, but this time using all the files to construct the final dataframe. We also need a series containing the accuracy measures.


In [42]:
tokenDictionaries_ls=[]

# glob would have been easier...
for (root, dirs, files) in os.walk('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/'):
    for filename in files:
        if filename.endswith('.xml'):
            tokenDictionaries_ls.extend(extract_token_dictionaries(os.path.join(root, filename)))

# Now we've extracted the information from all the files. We can now construct the dataframe
# in the same way as before:

# Build the lists of responses:
tokenLists_ls=[x['tokens'] for x in tokenDictionaries_ls]

# Build the document frequency dict
docFreq_dict=document_frequencies(tokenLists_ls)

# Now, create a dataframe which is indexed by the tokens
# in the document frequency dictionary:
trainingText_df=pd.DataFrame(index=docFreq_dict.keys())

# Populate the dataframe with the tf.idf for each response. Also,
# create a dictionary of the accuracy values while we're at it.
accuracy_dict={}
for (i, response_dict) in enumerate(tokenDictionaries_ls):
    trainingText_df[i]=pd.Series(get_tfidf(response_dict['tokens'], docFreq_dict), 
                                 index=trainingText_df.index)
    accuracy_dict[i]=response_dict['accuracy']

# Finally, transpose, and replace the NaNs with 0:
trainingText_df=trainingText_df.fillna(0).T

# Also, to make it easier to store in a single csv file, let's put the accuracy
# values in a column called "accuracy_txt":

trainingText_df['accuracy_txt']=pd.Series(accuracy_dict)

# And a final column containing a numerical equivalent of the
# accuracy_txt column (called accuracy_num). Note that the mapping from
# label to number depends on set iteration order, so it isn't stable
# across runs:

labels_dict={label:i for (i, label) in enumerate(set(trainingText_df['accuracy_txt']))}
trainingText_df['accuracy_num']=[labels_dict[l] for l in trainingText_df['accuracy_txt']]
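
(For the record, the glob version that the comment above alludes to would be something like this — a sketch, equivalent for this flat directory:)

from glob import glob

tokenDictionaries_ls=[]
for filename in glob('semeval2013-task7/semeval2013-Task7-5way/beetle/train/Core/*.xml'):
    tokenDictionaries_ls.extend(extract_token_dictionaries(filename))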

In [43]:
trainingText_df.head()


Out[43]:
pr_SR anotehr_SR or_SR a_RA incomplete_RA is_SR wehere_SR postive_SR termials_SR is_RA ... seriously_SR cuts_SR allows_SR versa_SR seeing_SR copntained_SR though_SR opposing_SR accuracy_txt accuracy_num
0 0.0 0.0 0.0 0.0 0.0 0.000648 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 correct 3
1 0.0 0.0 0.0 0.0 0.0 0.000648 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 correct 3
2 0.0 0.0 0.0 0.0 0.0 0.000648 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory 2
3 0.0 0.0 0.0 0.0 0.0 0.000648 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory 2
4 0.0 0.0 0.0 0.0 0.0 0.000648 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 contradictory 2

5 rows × 1118 columns

And finish by exporting to a CSV file:


In [44]:
trainingText_df.to_csv('beetleTrainingData.csv', index=False)

Done! The data can now be read back into a dataframe with:

pd.read_csv('beetleTrainingData.csv')
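
and, if the next notebook wants the features and labels separated straight away, something along these lines would do (a sketch; the column names are the ones created above):

import pandas as pd

data_df=pd.read_csv('beetleTrainingData.csv')

# Separate the tf.idf columns from the two label columns
labels_df=data_df[['accuracy_txt', 'accuracy_num']]
features_df=data_df.drop(['accuracy_txt', 'accuracy_num'], axis=1)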