Sketches and progress for SHS I/O


In [1]:
%matplotlib inline
from __future__ import division, print_function
import numpy as np
import os



Read a list of all available URIs

def read_uris():
...

In [64]:
import SHS_data

uris, ids = SHS_data.read_uris()


Out[64]:
True

Read cliques

def read_cliques(clique_file='shs_pruned.txt'):
...

In [48]:
reload(SHS_data)

cliques_by_name, cliques_by_id = SHS_data.read_cliques()

Split cliques into train, test & validation sets

def split_train_test_validation(clique_dict, ratio=(50,20,30),
                           random_state=1988):
...

In [4]:
reload(SHS_data)

train_cliques, test_cliques, val_cliques = SHS_data.split_train_test_validation(cliques_by_name)
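
A minimal sketch of how such a split could work, since the body of split_train_test_validation is elided above. Everything here is an assumption for illustration: `split_by_ratio` is a hypothetical stand-in (not the SHS_data code), the clique dict is assumed to map clique names to lists of URIs, and the ratio is assumed to be ordered (train, test, validation).

```python
from __future__ import division, print_function
import numpy as np


def split_by_ratio(clique_dict, ratio=(50, 20, 30), random_state=1988):
    """Hypothetical stand-in: split a clique dict into three dicts by ratio."""
    # Sort the keys first so the shuffle is reproducible across runs.
    names = sorted(clique_dict.keys())
    rng = np.random.RandomState(random_state)
    rng.shuffle(names)

    # Cumulative fractions, e.g. (50, 20, 30) -> [0.5, 0.7, 1.0].
    fractions = np.cumsum(ratio) / np.sum(ratio)
    n = len(names)
    train_end = int(round(fractions[0] * n))
    test_end = int(round(fractions[1] * n))

    # Splitting on whole cliques keeps all versions of a song in one set.
    train = {k: clique_dict[k] for k in names[:train_end]}
    test = {k: clique_dict[k] for k in names[train_end:test_end]}
    val = {k: clique_dict[k] for k in names[test_end:]}
    return train, test, val


# Toy data: 10 made-up cliques with two made-up URIs each.
toy_cliques = {'clique_%d' % i: ['uri_%d_a' % i, 'uri_%d_b' % i]
               for i in range(10)}
toy_train, toy_test, toy_val = split_by_ratio(toy_cliques)
print(len(toy_train), len(toy_test), len(toy_val))
```

Splitting at the clique level rather than the URI level means no song has versions in more than one set, which is probably what we want for evaluation.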

Get URIs for a clique

First idea: get URIs and a ground truth matrix.

But maybe that's not what we want: an 18K x 18K ground truth matrix in dense form takes over 2 GB (about 2.6 GB at 8 bytes per entry).

Therefore: just URIs for now.
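
A quick back-of-the-envelope check of the dense matrix size. The 18000 below is just the rounded "18K" from the note above, not an exact count, and `dense_matrix_bytes` is a hypothetical helper written for this check.

```python
from __future__ import division, print_function
import numpy as np


def dense_matrix_bytes(n_items, dtype):
    """Bytes needed for a dense n_items x n_items matrix of the given dtype."""
    return n_items ** 2 * np.dtype(dtype).itemsize


n_songs = 18000  # assumption: rounded from "18K"
for dtype in (np.float64, np.float32, np.bool_):
    size_gb = dense_matrix_bytes(n_songs, dtype) / 1e9
    print(np.dtype(dtype).name, size_gb, 'GB')
```

Even a boolean matrix at one byte per entry is a few hundred MB, so skipping the matrix for now seems reasonable.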

Open Question

Should this function be in SHS_data or somewhere more general?

Probably somewhere more general, but no such module exists yet, so it stays in SHS_data for now.


In [44]:
reload(SHS_data)

train_uris = SHS_data.uris_from_clique_dict(train_cliques)
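
A rough sketch of what a function like this might do, plus a way to recover ground truth on demand without the dense matrix. Both helpers (`flatten_clique_uris`, `clique_lookup`) are hypothetical and not part of SHS_data; they assume a clique dict that maps clique names to lists of URIs.

```python
from __future__ import print_function


def flatten_clique_uris(clique_dict):
    """Hypothetical: collect the URIs of all cliques into one flat list."""
    uris = []
    # Iterate in sorted order so the output order is deterministic.
    for name in sorted(clique_dict):
        uris.extend(clique_dict[name])
    return uris


def clique_lookup(clique_dict):
    """Hypothetical: map each URI back to the name of its clique."""
    return {uri: name
            for name, members in clique_dict.items()
            for uri in members}


# Toy data with made-up names and URIs.
toy = {'song_a': ['uri_1', 'uri_2'], 'song_b': ['uri_3']}
toy_uris = flatten_clique_uris(toy)
lookup = clique_lookup(toy)

# Two URIs are versions of the same song if they share a clique.
same_clique = lookup['uri_1'] == lookup['uri_2']
print(toy_uris, same_clique)
```

The lookup dict needs one entry per URI, so any single ground truth value can be computed when needed instead of storing all pairs.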
