Test Data Setup

redshiftzero, November 29, 2016

Create some CSVs to populate the database on the dev VM

We want the PostgreSQL development VM to be populated with some test data so that we can run the feature generation and machine learning code without having to wait for a crawl to proceed or connect to the real database (of course one can do either of these things if they wish). Right now the unit tests populate the test database with a minimal amount of data and remove it after the tests execute, but we also want the dev database to be populated with a more realistic amount of data during provisioning of the development VM. This notebook shows how this test data was generated from real data collected by our crawlers and stored in the production database.
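As a rough sketch of how the CSVs produced below could be loaded into the dev database during provisioning (the connection string and the use of pandas' to_sql here are assumptions for illustration, not the actual provisioning code):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for the dev VM's PostgreSQL instance
dev_engine = create_engine("postgresql://dev_user:dev_pass@localhost/dev_db")

# Each CSV written in this notebook maps onto a table in the raw schema
for table in ["crawls", "hs_history", "frontpage_examples", "frontpage_traces"]:
    df = pd.read_csv("{}.csv".format(table))
    df.to_sql(table, dev_engine, schema="raw", if_exists="append", index=False)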


In [12]:
import os
import pandas as pd
from sqlalchemy import create_engine

# Read connection details from a .pgpass-style file (host:port:db:user:password)
with open(os.environ["PGPASS"], "rb") as f:
    content = f.readline().decode("utf-8").replace("\n", "").split(":")

# The port field (content[1]) is not used; the default PostgreSQL port is assumed
engine = create_engine("postgresql://{user}:{passwd}@{host}/{db}".format(user=content[3],
                                                                         passwd=content[4],
                                                                         host=content[0],
                                                                         db=content[2]))

frontpage_examples


In [16]:
df_examples = pd.read_sql("SELECT * FROM raw.frontpage_examples", con=engine)

Grab some examples:


In [80]:
# Take a slice of 100 rows as the test sample
test_example_data = df_examples[100:200]

In [81]:
test_example_data.to_csv('frontpage_examples.csv', index=False)

crawls


In [82]:
unique_crawls = test_example_data.crawlid.unique()

In [83]:
crawls_select_str = ', '.join([str(x) for x in unique_crawls])

In [84]:
df_crawls = pd.read_sql("SELECT * FROM raw.crawls WHERE crawlid IN ({})".format(crawls_select_str), con=engine)

Remove VPS related info:


In [85]:
df_crawls['ip'] = '127.0.0.1'
df_crawls['kernel_version'] = '1.2.3-4-generic'
df_crawls['os'] = 'Linux'
df_crawls['entry_node'] = '1A2B3C4D'

In [86]:
df_crawls.to_csv('crawls.csv', index=False)

onion services


In [87]:
unique_onion_services = test_example_data.hsid.unique()

In [88]:
onions_select_str = ', '.join([str(x) for x in unique_onion_services])

In [91]:
df_hs = pd.read_sql("SELECT * FROM raw.hs_history WHERE hsid IN ({})".format(onions_select_str), con=engine)

Probably unnecessary, but replace each hs_url with a random (fake and invalid) address:


In [92]:
import random

In [93]:
# System word list used as a source of fake hostnames
word_file = "/usr/share/dict/words"
WORDS = open(word_file).read().splitlines()

In [94]:
# Spot-check of random.choice against the word list; not used below
word = random.choice(WORDS)

In [95]:
# The original URL is discarded; each row gets a random fake .onion address
df_hs['hs_url'] = df_hs['hs_url'].map(lambda x: 'http://{}1234.onion'.format(random.choice(WORDS)))

In [96]:
df_hs.to_csv('hs_history.csv', index=False)

frontpage_traces

Get the trace cells for the examples in our dataset


In [97]:
examples_select_str = ', '.join([str(x) for x in test_example_data.exampleid])

In [98]:
df_cells = pd.read_sql("SELECT * FROM raw.frontpage_traces WHERE exampleid IN ({})".format(examples_select_str), con=engine)

In [99]:
df_cells.to_csv('frontpage_traces.csv', index=False)

Make a nice merged version for others to play with

Others who don't want to do a bunch of joins might want to play with this data, so let's construct a merged version of three of the above tables (excluding the crawls table).


In [104]:
df_examples_w_hs_info = pd.merge(test_example_data, df_hs, on='hsid')

In [105]:
df_examples_and_cells = pd.merge(df_examples_w_hs_info, df_cells, on='exampleid')

In [108]:
df_examples_and_cells.to_csv('test_data.csv', index=False)
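
To sanity-check the merged CSV (or just to start exploring it), it can be read straight back into pandas, for example:

import pandas as pd

# Read the merged examples + onion service + trace data written above
df_test = pd.read_csv('test_data.csv')
df_test.head()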