In [68]:
import csv
import gzip
import cPickle as pickle
from collections import defaultdict
import yaml
question_reader = csv.reader(open("../data/questions.csv"))
question_header = ["answer", "group", "category", "question", "pos_token"]
questions = defaultdict(dict)
for row in question_reader:
    question = {}
    row[-1] = yaml.load(row[-1].replace(": u'", ": '"))
    qid = int(row.pop(0))
    for index, item in enumerate(row):
        question[question_header[index]] = item
    questions[qid] = question
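As a side note, `yaml.load` will execute arbitrary tags on untrusted input. If the `pos_token` column is really just a Python dict literal, a safer alternative is `ast.literal_eval`, which only evaluates literals. Here is a minimal sketch on a hypothetical two-row sample (the column names follow `question_header` above; the sample rows are made up):

```python
import ast
import csv
import io

# Hypothetical sample mirroring questions.csv: qid, answer, group, category, question, pos_token
sample = io.StringIO(
    "1,cat,train,Science,What animal says meow?,\"{1: 'animal', 4: 'meow'}\"\n"
    "3,dog,train,Science,What animal says woof?,\"{1: 'animal', 4: 'woof'}\"\n"
)

questions = {}
for row in csv.reader(sample):
    qid = int(row[0])
    # literal_eval evaluates only Python literals, so it cannot run arbitrary code
    pos_token = ast.literal_eval(row[5].replace("u'", "'"))
    questions[qid] = {
        "answer": row[1], "group": row[2], "category": row[3],
        "question": row[4], "pos_token": pos_token,
    }

print(questions[1]["pos_token"])  # → {1: 'animal', 4: 'meow'}
```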
Let's check how many items are in the dictionary.
In [69]:
print len(questions)
Yes, 7949 items is right. What about the question numbers? Are they continuous or not? We can check the first and last 10 items.
In [70]:
print sorted(questions.keys())[:10]
print sorted(questions.keys())[-10:]
So the qids are not continuous, but that's OK. What does a single question look like? Let's just look at qid 1.
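Rather than eyeballing the sorted keys, one quick programmatic check (sketched here on a toy dict standing in for the real `questions`) is to compare the number of keys against the span of the key range:

```python
# Toy stand-in for the real questions dict; qid 2 is missing on purpose
questions = {1: "a", 3: "b", 4: "c"}

qids = sorted(questions)
# A contiguous range of n integers spans exactly n values from min to max
is_contiguous = len(qids) == qids[-1] - qids[0] + 1
print(is_contiguous)  # → False: at least one qid is skipped
```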
In [71]:
questions[1]
Out[71]:
Yes, it's a dictionary, so you can use the usual dictionary methods. Check this out.
In [72]:
questions[1].keys()
Out[72]:
In [73]:
questions[1]['answer']
Out[73]:
In [74]:
questions[1]['pos_token']
Out[74]:
In [75]:
questions[1]['pos_token'].keys()
Out[75]:
In [76]:
questions[1]['pos_token'].values()
Out[76]:
In [77]:
questions[1]['pos_token'].items()
Out[77]:
How can we figure out a question's length without tokenizing the question itself?
In [78]:
max(questions[1]['pos_token'].keys())
Out[78]:
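This trick works because `pos_token` maps each token's position in the question to the token itself, so the largest key gives a length estimate without re-tokenizing. A toy illustration (the sample `pos_token` below is made up):

```python
# Hypothetical pos_token mapping: token position -> token
pos_token = {0: "this", 1: "capital", 4: "country", 6: "paris"}

# The highest position tells us how far into the question the tokens reach
length_estimate = max(pos_token.keys())
print(length_estimate)  # → 6
```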
As you know, reading the CSV and converting it into a dictionary takes more than a minute. Once we have converted it, we can save it as pickled data and load that whenever we need it. It is really simple and fast. Look at that!
Wait! We will use gzip.open instead of open because the pickled file is too big, so we will compress it. It's easy, and the result takes only about 1/10 the size of the original. Of course, it will take a few more seconds than the plain version.
Also, "wb" means writing as binary mode, and "rb" means reading file as binary mode.
In [93]:
with gzip.open("questions.pklz", "wb") as output:
pickle.dump(questions, output)
Yes, now we can load the pickled data back into a variable.
In [94]:
with gzip.open("questions.pklz", "rb") as fp:
questions_new = pickle.load(fp)
print len(questions_new)
Yes, it took only a few seconds. I will save it, commit it, and push it to GitHub, so you can use the pickled data instead of converting questions.csv yourself.
In [95]:
print questions == questions
In [96]:
print questions == questions_new
In [97]:
questions_new[0] = 1
In [98]:
print questions == questions_new
Yes, the unpickled data is exactly the same as the original — and once we modify questions_new, the two dictionaries are no longer equal, which confirms that it is an independent copy.
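The same behavior can be shown without touching disk: a pickle round-trip yields an object that is equal in content but independent of the original, so mutating the copy breaks equality. A sketch using an in-memory buffer and a toy dictionary:

```python
import gzip
import io
import pickle

original = {1: {"answer": "cat"}, 3: {"answer": "dog"}}

# Round-trip through a gzip'd pickle held in memory (no files needed)
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    pickle.dump(original, f)
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    restored = pickle.load(f)

print(restored == original)  # → True: equal contents
print(restored is original)  # → False: a separate, independent object
restored[0] = 1
print(restored == original)  # → False once the copy is modified
```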