This first tutorial explains a little about what the data is and where to get it.
This project has been a huge success and we'd like to thank all of the participants, the winning team Effsubsee, and the scientific members of the SETI Institute and IBM.
We are beginning to decommission this project. However, it will still be useful as a learning tool. The only real change is that the primary full data set will be removed. The basic, primary small, and primary medium data sets will remain.
We learned a lot at the hackathon on June 10-11th and decided to regenerate the primary data set. This is called the v3 primary data set. The changes compared to v2 are:
* the noise background is Gaussian white noise instead of noise from the Sun
* the signal amplitudes are higher and their characteristics should make the signal classes more distinguishable
* there are only 140k simulations in the full set (20k per signal type), compared with 350k previously (50k per signal type)
The basic data set remains unchanged from before.
For the Code Challenge, you will be using the "primary" data set, as we've called it. The primary data set is:
* a labeled data set of 35,000 simulated signals
* 7 different labels, or "signal classifications"
* about 10 GB of data in total
This data set should be used to train your models.
As stated above, we no longer have the full 140,000-file data set (51 GB). All of the data are found in the primary medium data set below. Additionally, there is the basic4 data set and the primary small subset. They are explained below.
Each data file has a simple format:
* file name = <UUID>.dat
* a JSON header in the first line that contains:
  * UUID
  * signal_classification (label)
* followed by a stream of complex-valued time-series data.
The ibmseti Python package is available to assist in reading this data and performing some basic operations for you.
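As a minimal sketch of reading a single file, assuming the SimCompamp interface shown in the project's example notebooks (the file name below is a hypothetical placeholder):

import ibmseti

# Read the raw bytes of one simulation file (hypothetical UUID file name).
with open('abcdef01-2345-6789-abcd-ef0123456789.dat', 'rb') as f:
    raw = f.read()

# SimCompamp parses the JSON header and the complex-valued time series.
aca = ibmseti.compamp.SimCompamp(raw)
print(aca.header())                # includes uuid and, for training data, signal_classification
complex_data = aca.complex_data()  # numpy array of complex samples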
There is also a second, simple and clean data set that you may use for warmup, which we call the "basic" data set. This basic set should be used as a sanity check and for very early-stage prototyping. We recommend that everybody start with this.
* Only 4 different signal classifications
* 1000 simulation files for each class: 4000 files total
* Available as a single zip file
* ~1 GB in total.
The difference between the basic and primary data sets is that the signals simulated in the basic set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable. You should be able to get very high signal classification accuracy with the basic data set. The primary data set has smaller-amplitude signals whose classes can look more similar to each other, making high classification accuracy more difficult to achieve. There are also only 4 classes in the basic data set versus 7 classes in the primary set.
The primary small set is a subset of the full primary data set. Use it for early-stage prototyping.
The primary medium set was a subset of the full primary data set, but it now constitutes the entire available data set. You may want to consider ways to augment this data set in order to create more training samples. For example, you could split each file up into 4 or 5 smaller files and build models that accept the smaller files, as sketched below. You wouldn't be able to use these smaller files when posting scores to the Scoreboards, but it would be one way to generate more data. Finally, we hope to one day release the simulation code, which would allow you to generate your own data sets.
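A rough sketch of that splitting idea, assuming the complex samples for one file have already been read into a NumPy array (the array below is random stand-in data, and the chunk count of 4 is an arbitrary choice):

import numpy as np

# Stand-in for the complex time series of one simulation file; in practice
# this would come from ibmseti's complex_data() as shown earlier.
complex_data = np.random.randn(32768) + 1j * np.random.randn(32768)

# Split one simulation into 4 shorter "files"; each chunk inherits the
# parent file's signal_classification label.
chunks = np.array_split(complex_data, 4)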
For all data sets, there exists an index file, which is a CSV file. Each row holds the UUID and signal_classification (label) for a simulation file in the data set. You can use these index files in a few different ways, from keeping track of your downloads to facilitating parallelization of your analysis on Spark.
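For example, a minimal sketch of reading an index file with Python's standard csv module (the file name is a hypothetical placeholder, and whether the CSV carries a header row may vary by data set):

import csv

# Hypothetical index file name; substitute the CSV shipped with your data set.
with open('primary_small_index.csv') as f:
    rows = list(csv.reader(f))

# Each row: [UUID, signal_classification]
print(rows[0])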
It's probably easiest to download these zip files, unzip them separately, and then move the contents to a single folder.
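A sketch of doing that with Python's zipfile module (the zip file names here are hypothetical placeholders):

import glob
import zipfile

# Extract every downloaded archive into a single folder.
for zip_name in glob.glob('primary_medium_*.zip'):
    with zipfile.ZipFile(zip_name) as zf:
        zf.extractall('primary_data')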
Once you've trained your model, done all of your cross-validation testing, and are ready to submit an entry to the contest, you'll need to download the test data set and score the test set data with your model.
The test data files are nearly the same as the training sets. The only difference is that the JSON header in each file does not contain the signal class. You can use the ibmseti Python package to read each file, just like you would the training data. See Step_2_reading_SETI_code_challenge_data.ipynb for examples.
The primary_testset_preview_v3 data set contains 2414 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key.
The primary_testset_final_v3 data set contains 2496 test simulation files. Each data file is the same as the above training data except the JSON header does NOT contain the 'signal_classification' key.
See the Judging Criteria notebook for information on submitting your test-set classifications.
If you're working with the IBM Watson Data Platform (or Data Science Experience), you can use either wget or curl from a Jupyter notebook cell, or you can use the requests library, or similar, to download the files programmatically. (This should work for both the IBM Spark service backend and the IBM Analytics Engine backend.) Simply call the wget command from the shell using the appropriate shell command syntax. The shell command syntax is different for Python kernels versus Scala kernels. Below we show the Python kernel way, assuming that the vast majority will use Python.
In [33]:
#copy the link from above
#use the -O <filename.zip> flag to write the download to that file
!wget https://ibm.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip -O primary_testset_preview_v3.zip
In [34]:
!ls -al primary_testset_preview_v3.zip
In [35]:
import zipfile
# Open the downloaded archive without extracting it to disk.
zz = zipfile.ZipFile('primary_testset_preview_v3.zip')
In [36]:
zz.namelist()[:10]
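Alternatively, a sketch of the same download using the requests library (same URL and output file name as above):

import requests

url = 'https://ibm.box.com/shared/static/91z783n1ysyrzomcvj4o89f4b8ss76ct.zip'

# Stream the zip to disk so the large file isn't held entirely in memory.
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('primary_testset_preview_v3.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)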