In [68]:
import csv
import gzip
import cPickle as pickle
from collections import defaultdict
import yaml
question_reader = csv.reader(open("../data/questions.csv"))
question_header = ["answer", "group", "category", "question", "pos_token"]
questions = defaultdict(dict)
for row in question_reader:
    question = {}
    row[-1] = yaml.load(row[-1].replace(": u'", ": '"))
    qid = int(row.pop(0))
    for index, item in enumerate(row):
        question[question_header[index]] = item
    questions[qid] = question
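As a side note, `yaml.load` will execute arbitrary tags on untrusted input. If the `pos_token` column is really just a Python dict literal, a safer alternative is `ast.literal_eval`, which only evaluates literals. Here is a minimal sketch on a hypothetical two-row sample (the column names follow `question_header` above; the sample rows are made up):

```python
import ast
import csv
import io

# Hypothetical sample mirroring questions.csv: qid, answer, group, category, question, pos_token
sample = io.StringIO(
    "1,cat,train,Science,What animal says meow?,\"{1: 'animal', 4: 'meow'}\"\n"
    "3,dog,train,Science,What animal says woof?,\"{1: 'animal', 4: 'woof'}\"\n"
)

questions = {}
for row in csv.reader(sample):
    qid = int(row[0])
    # literal_eval evaluates only Python literals, so it cannot run arbitrary code
    pos_token = ast.literal_eval(row[5].replace("u'", "'"))
    questions[qid] = {
        "answer": row[1], "group": row[2], "category": row[3],
        "question": row[4], "pos_token": pos_token,
    }

print(questions[1]["pos_token"])  # → {1: 'animal', 4: 'meow'}
```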
Let's check how many items are in the dictionary.
In [69]:
print len(questions)
Yes, 7949 items is right. What about the question numbers? Are they continuous or not? We can check the first and last 10 items.
In [70]:
print sorted(questions.keys())[:10]
print sorted(questions.keys())[-10:]
So the qids are not continuous, but that's OK. What does a single question look like? Let's just look at qid 1.
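Rather than eyeballing the sorted keys, one quick programmatic check (sketched here on a toy dict standing in for the real `questions`) is to compare the number of keys against the span of the key range:

```python
# Toy stand-in for the real questions dict; qid 2 is missing on purpose
questions = {1: "a", 3: "b", 4: "c"}

qids = sorted(questions)
# A contiguous range of n integers spans exactly n values from min to max
is_contiguous = len(qids) == qids[-1] - qids[0] + 1
print(is_contiguous)  # → False: at least one qid is skipped
```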
In [71]:
questions[1]
Out[71]:
Yes, it's a dictionary, so you can use the usual dictionary methods. Check this out.
In [72]:
questions[1].keys()
Out[72]:
In [73]:
questions[1]['answer']
Out[73]:
In [74]:
questions[1]['pos_token']
Out[74]:
In [75]:
questions[1]['pos_token'].keys()
Out[75]:
In [76]:
questions[1]['pos_token'].values()
Out[76]:
In [77]:
questions[1]['pos_token'].items()
Out[77]:
How can we figure out a question's length without tokenizing the question itself?
In [78]:
max(questions[1]['pos_token'].keys())
Out[78]:
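This trick works because `pos_token` maps each token's position in the question to the token itself, so the largest key gives a length estimate without re-tokenizing. A toy illustration (the sample `pos_token` below is made up):

```python
# Hypothetical pos_token mapping: token position -> token
pos_token = {0: "this", 1: "capital", 4: "country", 6: "paris"}

# The highest position tells us how far into the question the tokens reach
length_estimate = max(pos_token.keys())
print(length_estimate)  # → 6
```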
As you know, reading the CSV and converting it into a dictionary takes more than a minute. Once we have converted it, we can save it as pickled data and load that whenever we need it. It is really simple and fast. Look at that!
Wait! We will use gzip.open instead of open because the pickled file is too big, so we will compress it. It's easy, and the result takes only about 1/10 the size of the original. Of course, it will take a few more seconds than the plain version.
Also, "wb" means writing as binary mode, and "rb" means reading file as binary mode.
In [93]:
with gzip.open("questions.pklz", "wb") as output:
pickle.dump(questions, output)
Yes, now we can load the pickled data back into a variable.
In [94]:
with gzip.open("questions.pklz", "rb") as fp:
questions_new = pickle.load(fp)
print len(questions_new)
Yes, it took only a few seconds. I will save it, commit it, and push it to GitHub, so you can use the pickled data instead of converting questions.csv yourself.
In [95]:
print questions == questions
In [96]:
print questions == questions_new
In [97]:
questions_new[0] = 1
In [98]:
print questions == questions_new
Yes, the unpickled data is exactly the same as the original — and once we modify questions_new, the two dictionaries are no longer equal, which confirms that it is an independent copy.
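The same behavior can be shown without touching disk: a pickle round-trip yields an object that is equal in content but independent of the original, so mutating the copy breaks equality. A sketch using an in-memory buffer and a toy dictionary:

```python
import gzip
import io
import pickle

original = {1: {"answer": "cat"}, 3: {"answer": "dog"}}

# Round-trip through a gzip'd pickle held in memory (no files needed)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    pickle.dump(original, f)
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    restored = pickle.load(f)

print(restored == original)  # → True: equal contents
print(restored is original)  # → False: a separate, independent object
restored[0] = 1
print(restored == original)  # → False once the copy is modified
```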