This notebook shows how to read the questions.csv file.

Read a CSV file without a header

First of all, import csv and yaml, then open the CSV file.


In [2]:
import csv
import yaml

In [3]:
reader = csv.reader(open("../data/questions.csv"))

Read the first line and see the structure.


In [4]:
question_1 = next(reader)

In [5]:
question_1


Out[5]:
['1',
 'thomas cole',
 'test',
 'Fine Arts',
 "This painter's indulgence of visual fantasy, and appreciation of different historic architectural styles can be seen in his 1840 Architect's Dream. After a series of paintings on The Last of the Mohicans, he made a three year trip to Europe in 1829, but he is better known for a trip four years earlier in which he journeyed up the Hudson River to the Catskill Mountains. FTP, name this painter of The Oxbow and The Voyage of Life series.",
 "{0: '', 1: u'painters', 2: u'indulgence', 4: u'visual', 5: u'fantasy', 7: u'appreciation', 9: u'different', 10: u'historic', 11: u'architectural', 12: u'styles', 15: u'seen', 18: u'1840', 19: u'architects', 20: u'dream', 23: u'series', 25: u'paintings', 28: u'last', 31: u'mohicans', 33: u'made', 35: u'three', 36: u'year', 37: u'trip', 39: u'europe', 41: u'1829', 45: u'better', 46: u'known', 49: u'trip', 50: u'four', 51: u'years', 52: u'earlier', 56: u'journeyed', 59: u'hudson', 60: u'river', 63: u'catskill', 64: u'mountains', 65: u'ftp', 66: u'name', 68: u'this_painter', 71: u'oxbow', 74: u'voyage', 76: u'life', 77: u'series'}"]

As expected, each line is converted into a list with 6 items. But how can we use the last item? It is a string, yet its content looks like a dictionary or JSON.

Let's try to convert it into a dictionary.


In [6]:
# Strip the u'' prefixes so the string parses as a YAML flow mapping.
yaml.safe_load(question_1[-1].replace(": u'", ": '"))


Out[6]:
{0: '',
 1: 'painters',
 2: 'indulgence',
 4: 'visual',
 5: 'fantasy',
 7: 'appreciation',
 9: 'different',
 10: 'historic',
 11: 'architectural',
 12: 'styles',
 15: 'seen',
 18: '1840',
 19: 'architects',
 20: 'dream',
 23: 'series',
 25: 'paintings',
 28: 'last',
 31: 'mohicans',
 33: 'made',
 35: 'three',
 36: 'year',
 37: 'trip',
 39: 'europe',
 41: '1829',
 45: 'better',
 46: 'known',
 49: 'trip',
 50: 'four',
 51: 'years',
 52: 'earlier',
 56: 'journeyed',
 59: 'hudson',
 60: 'river',
 63: 'catskill',
 64: 'mountains',
 65: 'ftp',
 66: 'name',
 68: 'this_painter',
 71: 'oxbow',
 74: 'voyage',
 76: 'life',
 77: 'series'}

Now you know how to convert the fields of a CSV file into the formats you want, so you can handle all the given files.

Convert CSV into a list

Let's try to read train.csv.


In [12]:
reader = csv.reader(open("../data/train.csv"))

However, train.csv has a header row, which is not data we want to use, so we need to skip the first line. Note also that the reader returned by csv.reader is an iterator, not a list, so it can be consumed only once. To go through the file again, you need to call csv.reader again on a freshly opened file.


In [13]:
next(reader)


Out[13]:
['id', 'question', 'user', 'position', 'answer']
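
The single-pass behaviour described above can be checked on a small in-memory file. This is a minimal sketch assuming Python 3: io.StringIO stands in for train.csv, the two data rows are copied from the first and last rows of the real file, and the variable names are only illustrative.

```python
import csv
import io

# In-memory stand-in for train.csv (header plus two rows from the real file).
sample = io.StringIO(
    "id,question,user,position,answer\n"
    "1,1,0,61.0,cole\n"
    "33242,103765,50,94.0,olympia\n"
)

reader = csv.reader(sample)
header = next(reader)        # consumes the header line
first_pass = list(reader)    # consumes all remaining rows
second_pass = list(reader)   # the reader is exhausted, so this is empty

# To read the rows again, rewind the file and build a new reader.
sample.seek(0)
reader = csv.reader(sample)
next(reader)                 # skip the header again
third_pass = list(reader)

print(len(first_pass), len(second_pass), len(third_pass))
```

With a file on disk, reopening it and calling csv.reader again has the same effect as the seek(0) step here.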

OK, now the reader is on the 2nd line of the CSV file. Let's convert the rest into a list.


In [14]:
train_set = []
for row in reader:
    train_set.append(row)

In [15]:
print(len(train_set))


28494

In [16]:
print(len(train_set[0]))


5

In [18]:
print(train_set[0])
print(train_set[-1])


['1', '1', '0', '61.0', 'cole']
['33242', '103765', '50', '94.0', 'olympia']

You have probably realized by now why csv.reader returns an iterator instead of a list: we do not know in advance how large a given CSV file is. If the file is too big, loading all of it into a list can exhaust memory. In that case an iterator, which yields one row at a time, is much better than a list.

So if you don't need the whole CSV file as a list, don't convert it. Iterating over the reader directly saves memory.
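
As a rough illustration of processing rows without building a list, the sketch below keeps only running totals while it iterates. It assumes Python 3 and again uses an in-memory stand-in for train.csv with two rows taken from the real file. The names row_count and max_position are hypothetical, and treating the 4th column as a number is an assumption based on the sample rows shown earlier.

```python
import csv
import io

# In-memory stand-in for train.csv (header plus two rows from the real file).
sample = io.StringIO(
    "id,question,user,position,answer\n"
    "1,1,0,61.0,cole\n"
    "33242,103765,50,94.0,olympia\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

row_count = 0
max_position = 0.0
for row in reader:
    # Only the current row is held in memory at any time.
    row_count += 1
    max_position = max(max_position, float(row[3]))

print(row_count, max_position)
```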

