We want the PostgreSQL development VM to be populated with test data so that we can run the feature generation and machine learning code without waiting for a crawl to proceed or connecting to the real database (of course, one can still do either of those things). Right now the unit tests populate the test database with a minimal amount of data and remove it after the tests finish, but we also want the dev database to be filled with a more realistic amount of data during provisioning of the development VM. This notebook shows how that test data was generated from real data collected by our crawlers and stored in the production database.
In [12]:
import os
import pandas as pd
from sqlalchemy import create_engine
# Connection parameters are read from a PostgreSQL password file whose path
# is given by the PGPASS environment variable; each line has the form
# hostname:port:database:username:password.
with open(os.environ["PGPASS"], "rb") as f:
    content = f.readline().decode("utf-8").replace("\n", "").split(":")
engine = create_engine("postgresql://{user}:{passwd}@{host}/{db}".format(user=content[3],
                                                                         passwd=content[4],
                                                                         host=content[0],
                                                                         db=content[2]))
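For reference, the password file follows the standard .pgpass layout; the example line below is a made-up placeholder, not a real credential. Note that the code above reads fields 0, 2, 3, and 4 and ignores the port field, so the server is assumed to listen on PostgreSQL's default port:

hostname:port:database:username:password
db.example.org:5432:crawldb:analyst:s3cret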
In [16]:
df_examples = pd.read_sql("SELECT * FROM raw.frontpage_examples", con=engine)
Grab a slice of 100 examples:
In [80]:
# Take 100 examples by position (rows 100-200).
test_example_data = df_examples.iloc[100:200]
In [81]:
test_example_data.to_csv('frontpage_examples.csv', index=False)
In [82]:
unique_crawls = test_example_data.crawlid.unique()
In [83]:
crawls_select_str = ', '.join([str(x) for x in unique_crawls])
In [84]:
df_crawls = pd.read_sql("SELECT * FROM raw.crawls WHERE crawlid IN ({})".format(crawls_select_str), con=engine)
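Interpolating the ids into the SQL string is fine here because they are integers we just read out of the database ourselves, but a parameterized version avoids string interpolation entirely. A sketch using SQLAlchemy's expanding bind parameters (requires SQLAlchemy 1.2+):

from sqlalchemy import bindparam, text
stmt = text("SELECT * FROM raw.crawls WHERE crawlid IN :ids").bindparams(
    bindparam("ids", expanding=True))
df_crawls = pd.read_sql(stmt, con=engine, params={"ids": [int(x) for x in unique_crawls]})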
Remove VPS-related info:
In [85]:
# Overwrite fields that could identify the machines the crawls ran on.
df_crawls['ip'] = '127.0.0.1'
df_crawls['kernel_version'] = '1.2.3-4-generic'
df_crawls['os'] = 'Linux'
df_crawls['entry_node'] = '1A2B3C4D'
In [86]:
df_crawls.to_csv('crawls.csv', index=False)
In [87]:
unique_onion_services = test_example_data.hsid.unique()
In [88]:
onions_select_str = ', '.join([str(x) for x in unique_onion_services])
In [91]:
df_hs = pd.read_sql("SELECT * FROM raw.hs_history WHERE hsid IN ({})".format(onions_select_str), con=engine)
Probably unnecessary, but replace each hs_url with a random (fake and invalid) address:
In [92]:
import random
In [93]:
# The system word list; present on most Linux/macOS installs.
word_file = "/usr/share/dict/words"
with open(word_file) as f:
    WORDS = f.read().splitlines()
In [94]:
word = random.choice(WORDS)  # sanity check: sample a single word
In [95]:
# Replace every onion URL with a fresh fake one (the original value is ignored).
df_hs['hs_url'] = df_hs['hs_url'].map(lambda _: 'http://{}1234.onion'.format(random.choice(WORDS)))
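If more realistic-looking (but still fake) addresses were ever needed, onion labels are base32-encoded; a minimal sketch, assuming v2-style 16-character labels:

import base64
import os

def fake_onion_url():
    # 10 random bytes encode to exactly 16 base32 characters, padding-free.
    label = base64.b32encode(os.urandom(10)).decode("ascii").lower()
    return "http://{}.onion".format(label)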
In [96]:
df_hs.to_csv('hs_history.csv', index=False)
Get the trace cells for the examples in our dataset:
In [97]:
examples_select_str = ', '.join([str(x) for x in test_example_data.exampleid])
In [98]:
df_cells = pd.read_sql("SELECT * FROM raw.frontpage_traces WHERE exampleid IN ({})".format(examples_select_str), con=engine)
In [99]:
df_cells.to_csv('frontpage_traces.csv', index=False)
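Finally, join everything into a single denormalized table for convenience: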
In [104]:
# Attach the onion-service record to each example via hsid...
df_examples_w_hs_info = pd.merge(test_example_data, df_hs, on='hsid')
In [105]:
# ...then attach each example's trace cells via exampleid.
df_examples_and_cells = pd.merge(df_examples_w_hs_info, df_cells, on='exampleid')
In [108]:
df_examples_and_cells.to_csv('test_data.csv', index=False)
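For completeness, the provisioning side (not shown in this notebook) could load the CSVs back into the dev database with pandas; the engine, schema name, and load order below are assumptions, with parent tables loaded before child tables in case foreign keys are enforced:

# Sketch: load the generated CSVs into the dev VM's database.
# Assumes `engine` points at the dev instance and the raw.* tables exist.
for table in ["crawls", "hs_history", "frontpage_examples", "frontpage_traces"]:
    pd.read_csv("{}.csv".format(table)).to_sql(
        table, con=engine, schema="raw", if_exists="append", index=False)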