Ingesting More Data

NEXUS uses Spring XD to ingest new data into the system. Spring XD is a distributed runtime that allows for parallel ingestion of data into data stores of all types. It requires a few supporting tools for administrative purposes, including Redis and a relational database management system (RDBMS).

The Spring XD architecture also includes a management application called XD Admin, which manages one or more XD Containers where the ingestion work actually runs. Spring XD uses Apache ZooKeeper to keep track of the state of the cluster and Apache Kafka to communicate between its components.
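
For example, once the ingestion cluster from Step 1 below is running, you can ask XD Admin which XD Containers it is managing through its REST API. This is a minimal sketch, not part of the original workshop, assuming the Spring XD 1.x /runtime/containers endpoint on the admin's default port 9393 and that the xd-admin container is reachable by name from this notebook:

In [ ]:
# A hedged sketch: list the XD Containers registered with XD Admin.
# Assumes the Spring XD 1.x REST API on the default admin port 9393 and that
# this notebook can resolve the xd-admin container by name.
import requests

response = requests.get("http://xd-admin:9393/runtime/containers")
# Spring XD returns a paged resource; each entry carries the container's attributes.
for container in response.json()['content']:
    print(container['attributes'])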

Step 1: Start an Ingestion Cluster

We can bring up an ingestion cluster by using docker-compose.

TODOs

  1. Navigate to the directory containing the docker-compose.yml file for the ingestion cluster

    $ cd ~/nexus/esip-workshop/docker/ingest
    
  2. Use docker-compose to bring up the containers in the ingestion cluster

    $ docker-compose up -d
    

Step 2: Verify the Ingestion Cluster is Working

Now that the cluster has started, we can use various commands to ensure that it is operational and monitor its status.

TODOs

  1. List all running docker containers.

    $ docker ps
    

    The output should look similar to this:

    CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS                                            NAMES
    581a05925ea6        nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago       Up 3 seconds        9393/tcp                                         xd-container2
    1af7ba346d31        nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago       Up 3 seconds        9393/tcp                                         xd-container3
    0668e2a48c9a        nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago       Up 3 seconds        9393/tcp                                         xd-container1
    d717e6629b4a        nexusjpl/ingest-admin         "/usr/local/nexus-..."   5 seconds ago       Up 4 seconds        9393/tcp                                         xd-admin
    a4dae8ca6757        nexusjpl/kafka                "kafka-server-star..."   7 seconds ago       Up 6 seconds                                                         kafka3
    c29664cfae4a        nexusjpl/kafka                "kafka-server-star..."   7 seconds ago       Up 6 seconds                                                         kafka2
    623bdaa50207        nexusjpl/kafka                "kafka-server-star..."   7 seconds ago       Up 6 seconds                                                         kafka1
    2266c2a54113        redis:3                       "docker-entrypoint..."   7 seconds ago       Up 5 seconds        6379/tcp                                         redis
    da3267942d5f        mysql:8                       "docker-entrypoint..."   7 seconds ago       Up 6 seconds        3306/tcp                                         mysqldb
    e5589456a78a        nexusjpl/nexus-webapp         "/tmp/docker-entry..."   31 hours ago        Up 31 hours         0.0.0.0:4040->4040/tcp, 0.0.0.0:8083->8083/tcp   nexus-webapp
    18e682b9af0e        nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago        Up 31 hours                                                          mesos-agent1
    8951841d1da6        nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago        Up 31 hours                                                          mesos-agent3
    c0240926a4a2        nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago        Up 31 hours                                                          mesos-agent2
    c97ad268833f        nexusjpl/spark-mesos-master   "/bin/bash -c './b..."   31 hours ago        Up 31 hours         0.0.0.0:5050->5050/tcp                           mesos-master
    90d370eb3a4e        nexusjpl/jupyter              "tini -- start-not..."   3 days ago          Up 3 days           0.0.0.0:8000->8888/tcp                           jupyter
    cd0f47fe303d        nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago          Up 3 days           8983/tcp                                         solr2
    8c0f5c8eeb45        nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago          Up 3 days           8983/tcp                                         solr3
    27e34d14c16e        nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago          Up 3 days           8983/tcp                                         solr1
    247f807cb5ec        cassandra:2.2.8               "/docker-entrypoin..."   3 days ago          Up 3 days           7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra3
    09cc86a27321        zookeeper                     "/docker-entrypoin..."   3 days ago          Up 3 days           2181/tcp, 2888/tcp, 3888/tcp                     zk1
    33e9d9b1b745        zookeeper                     "/docker-entrypoin..."   3 days ago          Up 3 days           2181/tcp, 2888/tcp, 3888/tcp                     zk3
    dd29e4d09124        cassandra:2.2.8               "/docker-entrypoin..."   3 days ago          Up 3 days           7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra2
    11e57e0c972f        zookeeper                     "/docker-entrypoin..."   3 days ago          Up 3 days           2181/tcp, 2888/tcp, 3888/tcp                     zk2
    2292803d942d        cassandra:2.2.8               "/docker-entrypoin..."   3 days ago          Up 3 days           7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra1
    
  2. View the log of the XD Admin container to verify it has started.

    $ docker logs -f xd-admin
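
If you prefer to check from this notebook instead of tailing the log, a hedged alternative is to poll the XD Admin REST port until it answers. This assumes xd-admin:9393 is reachable from the notebook's network, the same assumption the monitoring cells below make for solr1:

In [ ]:
# A hedged sketch: poll XD Admin's REST port until it responds.
# Assumes the default Spring XD admin port 9393 and that this notebook can
# resolve the xd-admin container by name.
import time
import requests

for attempt in range(30):
    try:
        requests.get("http://xd-admin:9393/", timeout=2)
        print("XD Admin is up")
        break
    except requests.exceptions.RequestException:
        # Not up yet; wait and retry.
        time.sleep(2)
else:
    print("XD Admin did not respond after 60 seconds")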
    

Step 3: Ingest Some Data

Now that the ingestion cluster has been started, we can ingest some new data into the system. Currently, AVHRR data has been ingested up through the end of 2016. In this step you will ingest the remaining AVHRR data through July 2017. The source granules for AVHRR have already been copied to the EBS volume attached to your EC2 instance and are mounted in the ingestion containers as /usr/local/data/nexus/avhrr/2017.

In order to begin ingesting data, we need to deploy a new ingestion stream. The ingestion stream needs a few key parameters: the name of the dataset, where to look for the data files, the variable name to extract from the granules, and approximately how many tiles should be created per granule. These parameters can all be provided to the nx-deploy-stream shell script that is present in the xd-admin container.

TODOs

  1. Deploy the stream to ingest the 2017 AVHRR data

    $ docker exec -it xd-admin /usr/local/nx-deploy-stream.sh --datasetName AVHRR_OI_L4_GHRSST_NCEI --dataDirectory /usr/local/data/nexus/avhrr/2017 --variableName analysed_sst --tilesDesired 1296
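
Once the script returns, the new stream should be registered with XD Admin. As a hedged check that is not part of the original workshop, you can list the stream definitions through the admin's REST API, again assuming the Spring XD 1.x endpoints on port 9393:

In [ ]:
# A hedged sketch: list the stream definitions known to XD Admin and their status.
# Assumes the Spring XD 1.x /streams/definitions endpoint on the default port 9393;
# the field names ('name', 'status') follow the Spring XD 1.x resource format.
import requests

response = requests.get("http://xd-admin:9393/streams/definitions")
for stream in response.json()['content']:
    print("%s : %s" % (stream['name'], stream['status']))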
    

Step 4: Monitor the Ingestion

Once the stream is deployed, the data will begin to flow into the system. Progress can be monitored by tailing the log files and monitoring the number of tiles and granules that have been ingested into the system.

TODOs

  1. Get a listing of granules and tiles per granule for AVHRR 2017
  2. Get a count of the number of granules ingested for AVHRR 2017
  3. Verify the dataset list shows that granules have been ingested through July 2017

In [ ]:
# TODO Run this cell multiple times to watch as the granules are ingested into the system.
import requests

dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017

response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&rows=0&fq=dataset_s:%s&facet.field=granule_s&facet=true&facet.mincount=1&facet.limit=-1&facet.sort=index" % (year, dataset))
data = response.json()
# Solr returns facet counts as a flat list alternating [granule, count, granule, count, ...]
facets = data['facet_counts']['facet_fields']['granule_s']
for granule, tile_count in zip(facets[::2], facets[1::2]):
    print("%s: %d tiles" % (granule, tile_count))

In [ ]:
# TODO Run this cell to get a count of the number of AVHRR granules ingested for the year 2017.
# Ingestion is finished when the total reaches 187
import requests

dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017

response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&json.facet={granule_s:'unique(granule_s)'}&rows=0&fq=dataset_s:%s" % (year, dataset))
data = response.json()
number_of_granules = data['facets']['granule_s'] if 'granule_s' in data['facets'] else 0
print("Number of granules for %s : %d" % (dataset, number_of_granules))

In [ ]:
# TODO Run this cell to get a list of datasets available along with their start and end dates.
import nexuscli
# Target the nexus webapp server
nexuscli.set_target("http://nexus-webapp:8083")
nexuscli.dataset_list()

Step 5: Run a Time Series with the New Data

Once the total reaches 187 granules ingested for 2017 and the dataset list shows AVHRR data through July 2017, the ingestion has completed. You can now use the analytical functions on the new data.

TODOs

  1. Generate a Time Series using the new data.

In [ ]:
# TODO Run this cell to produce a Time Series plot using AVHRR data from 2017.
%matplotlib inline
import matplotlib.pyplot as plt
import time
import nexuscli
from datetime import datetime

from shapely.geometry import box

bbox = box(-150, 40, -120, 55)
datasets = ["AVHRR_OI_L4_GHRSST_NCEI"]
start_time = datetime(2017, 1, 1)
end_time = datetime(2017, 7, 6)

start = time.perf_counter()
ts, = nexuscli.time_series(datasets, bbox, start_time, end_time, spark=True)
print("Time Series took {} seconds to generate".format(time.perf_counter() - start))

plt.figure(figsize=(10,5), dpi=100)
plt.plot(ts.time, ts.mean, 'b-', marker='|', markersize=2.0, mfc='b')
plt.grid(True, which='major', color='k', linestyle='-')
plt.xlabel("Time")
plt.ylabel("Sea Surface Temperature (C)")
plt.show()
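
As a quick sanity check on the newly ingested data, you can also pull a simple statistic straight out of the time series result. A minimal sketch, assuming ts.time and ts.mean are the parallel arrays plotted above:

In [ ]:
# A hedged follow-up: report the warmest mean SST in the series and when it occurred.
# Assumes ts.time and ts.mean are parallel arrays, as used in the plot above.
import numpy

hottest = numpy.argmax(ts.mean)
print("Warmest mean SST: %.2f C at %s" % (ts.mean[hottest], ts.time[hottest]))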

Congratulations!

You have completed this workshop. You now have a completely functional NEXUS cluster with all containers started.

If you would like, you can go back to the workshop 1 notebooks and verify they are still working. More information about NEXUS is available on our GitHub.

If you are interested in learning more about Docker, Nga Quach will be giving a presentation all about Docker on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.

If you are interested in learning more about our use of Apache Spark, Joe Jacob will be giving a presentation all about Spark on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.

Thank you for participating!