The target format is the Hierarchical Data Format (HDF) in version 5, a well-established data format with good reading routines for Python, Matlab, and IDL.
The first step is a straightforward parsing of the CSV output of the Mongo database dump.
While parsing, values of 'null' are replaced by numpy.NaN.
I made the conscious decision to NOT replace None in the marking column by NaN, because that detail is in itself usable data.
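For illustration, a minimal sketch of this parsing step could look as follows; the file name is hypothetical and the actual reduction.py code may differ:

```python
import numpy as np
import pandas as pd

# Minimal sketch, not the literal reduction.py code.
csv_fname = 'yyyy_mm_dd_xxxx.csv'  # hypothetical path to the unpacked dump
df = pd.read_csv(csv_fname)

# Replace 'null' entries with NaN; the None entries in the 'marking'
# column are deliberately left alone because they are usable data.
df = df.replace('null', np.nan)
```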
Both the acquisition_date and the created_at columns are currently parsed into Python datetime objects. This has been made optional: calling the reduction routine with the option --raw_times skips the parsing.
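Continuing the sketch above, the optional time parsing could look roughly like this; the actual reduction.py code may differ:

```python
# Parse the two time columns named in the text into datetimes.
# With --raw_times, reduction.py skips this step.
for col in ['acquisition_date', 'created_at']:
    df[col] = pd.to_datetime(df[col])
```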
Some markings for fans and blotches have some of their required data fields empty. By default, we remove these from the HDF5 database files. The way this is done is:
blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
fan_data_cols = 'x y image_x image_y distance angle spread'.split()
For each marking in ['fan', 'blotch'] do:
- separate that marking's data from the rest,
- drop all rows of the marking data that have missing values in the respective required columns,
- recombine the cleaned marking data with the rest.
To be noted: these incomplete data are not only from the first days during the TV event, but are scattered, albeit at lower frequency, throughout the following year.
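A minimal sketch of this cleaning loop, using the column lists defined above (not the literal reduction.py code), could look like this:

```python
import pandas as pd

def remove_incomplete_markings(df):
    """Drop fan/blotch rows that are missing required data fields."""
    for marking, required in [('fan', fan_data_cols),
                              ('blotch', blotch_data_cols)]:
        data = df[df.marking == marking]      # this marking's rows
        rest = df[df.marking != marking]      # everything else
        data = data.dropna(subset=required)   # drop rows with missing fields
        df = pd.concat([data, rest])          # recombine
    return df
```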
The application is called reduction.py, and when called with -h for help, it provides the following output:
usage: planet4_reduction.py [-h] [--raw_times] [--keep_dirt] csv_fname

positional arguments:
  csv_fname    Provide the filename of the database dump csv-file here.

optional arguments:
  -h, --help   show this help message and exit
  --raw_times  Do not parse the times into a Python datetime object. For the
               stone-age. ;) Default: parse into datetime object.
  --keep_dirt  Do not filter for dirty data. Keep everything. Default: Do the
               filtering.
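For orientation, a hypothetical argparse setup that would produce help output of this shape might look as follows; the actual reduction.py source may differ in details:

```python
import argparse

# Hypothetical sketch of the command-line interface, not the actual source.
parser = argparse.ArgumentParser()
parser.add_argument('csv_fname',
                    help='Provide the filename of the database dump csv-file here.')
parser.add_argument('--raw_times', action='store_true',
                    help='Do not parse the times into a Python datetime object. '
                         'Default: parse into datetime object.')
parser.add_argument('--keep_dirt', action='store_true',
                    help='Do not filter for dirty data. Keep everything. '
                         'Default: Do the filtering.')
args = parser.parse_args()
```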
I produce different versions of the reduced dataset with increasing levels of reduction, resulting in smaller and faster-to-read files.
For all file names, the date part indicates the date of the database dump, which is delivered every week by Stuart.
Fast_Read
This file is in fixed table format and contains all cleaned data, for the case that one needs to read everything into memory in the fastest way.
The above-mentioned filtering was applied, so tutorial entries and incomplete data rows have been removed.
Product file name is yyyy-mm-dd_planet_four_classifications_fast_all_read.h5
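For example, reading such a file back could look like this; the file name is a placeholder, and I assume the same internal table handle 'df' that is mentioned in the Queryable section below:

```python
import pandas as pd

# Read the whole cleaned table into memory in one go.
# Replace yyyy-mm-dd with the date of the database dump.
fname = 'yyyy-mm-dd_planet_four_classifications_fast_all_read.h5'
data = pd.read_hdf(fname, 'df')
```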
Queryable
This file contains the same data as the Fast_Read level, but combined with a multi-column index so that the database file can be queried.
The data columns that can be filtered for are:
data_columns=['classification_id', 'image_id',
              'image_name', 'user_name', 'marking',
              'acquisition_date', 'local_mars_time']
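For illustration, a store with such a multi-column index can be written with pandas roughly like this, where df stands for the cleaned DataFrame; the exact call in reduction.py may differ:

```python
# Sketch: format='table' plus data_columns is what enables the
# multi-column queries on the resulting HDF5 file.
df.to_hdf('yyyy-mm-dd_planet_four_classifications_queryable.h5', 'df',
          format='table',
          data_columns=['classification_id', 'image_id',
                        'image_name', 'user_name', 'marking',
                        'acquisition_date', 'local_mars_time'])
```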
Querying works like this (and is amazingly fast, by the way); for example, to get all data for one image_id:
data = pd.read_hdf(database_fname, 'df', where="image_id='<the_image_id>'")
Here, df is the HDF internal handle for the table. This is required because HDF files can contain more than one table structure.
Product file name is yyyy-mm-dd_planet_four_classifications_queryable.h5
Retired
(Not yet implemented in reduction.py.) This product is reduced to only include image_ids that have been retired (> 30 analyses done).
Product file name is yyyy-mm-dd_planet_four_classifications_retired.h5
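A hypothetical sketch of this filter, with df again standing for the cleaned DataFrame, could look like this:

```python
# Keep only image_ids for which more than 30 classifications exist.
counts = df.groupby('image_id')['classification_id'].nunique()
retired_ids = counts[counts > 30].index
retired = df[df.image_id.isin(retired_ids)]
```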
Note: This procedure currently creates approx 5 GB of data on top of the downloaded file size.
1. Download the xxx.csv.tar.gz file from the email you get every Sunday (if not, contact Meg).
2. Unpack it: tar zxvf yyyy_mm_dd_xxxx.csv.tar.gz
3. cd into P4_sandbox/planet4
4. Run python reduction.py path_to_csv_file. The argument is the full path to the unpacked CSV file.
You need a current Python environment that includes the required modules (at minimum numpy, pandas, and PyTables for the HDF5 support).
I can recommend the Anaconda distribution from Continuum Analytics; it contains extra features for academic users. I have also used Enthought's Canopy successfully for years, although on Linux I don't like the hoops one has to go through for a multi-user installation.
It will create both the queryable and fast-read HDF5 database files in the same folder where the given CSV file is stored.
I wrote a little checking tool that enables you to see, for each of the HTML gold standard link files that Meg sends around each week, which of the entries you have done already. Caveat: data dumps only appear on Sundays, so you can't check on anything that happened after the most recent Sunday until the next dump arrives.
Here are the steps for how to use it:
1. cd into the P4_sandbox/planet4 folder.
2. Launch gold_standard_checker.py. When launching it with option -h, you get some (hopefully) helpful text; in summary:
   - --user controls which username to check for. A list with the exact spellings to choose from is given.
   - --datadir is the path to the directory where you stored the CSV data file.
For example, the command could look like this for Meg:
python gold_standard_checker.py html_file_path --username mschwamb --datadir path_to_folder_with_csv
If you find any bugs, don't hesitate to report them via the Issues link on the right.