NEXUS uses Spring XD to ingest new data into the system. Spring XD is a distributed runtime that allows for parallel ingestion of data into data stores of all types. It requires a few supporting tools for administrative purposes, including Redis and a relational database management system (RDBMS).
The Spring XD architecture also includes a management application called XD Admin, which manages the XD Containers. Spring XD uses Apache ZooKeeper to keep track of the state of the cluster and Apache Kafka to communicate between its components.
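The pieces above map onto services in the ingestion cluster's compose file. As a rough, hypothetical sketch of how such a file might wire them together (the real docker-compose.yml in the ingest directory is more detailed, with networks, volumes, and environment settings):

```yaml
# Hypothetical sketch only -- service and image names are taken from the
# container list later in this notebook, but the actual compose file differs.
version: '2'
services:
  redis:                     # administrative store required by Spring XD
    image: redis:3
  mysqldb:                   # RDBMS required by Spring XD
    image: mysql:8
  kafka1:                    # transport between XD components
    image: nexusjpl/kafka
  xd-admin:                  # management application
    image: nexusjpl/ingest-admin
    depends_on: [redis, mysqldb, kafka1]
  xd-container1:             # worker managed by XD Admin
    image: nexusjpl/ingest-container
    depends_on: [xd-admin]
```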
We can bring up an ingestion cluster by using docker-compose.
Navigate to the directory containing the docker-compose.yml file for the ingestion cluster
$ cd ~/nexus/esip-workshop/docker/ingest
Use docker-compose to bring up the containers in the ingestion cluster
$ docker-compose up -d
Now that the cluster has started we can use various commands to ensure that it is operational and monitor its status.
List all running docker containers.
$ docker ps
The output should look similar to this:
CONTAINER ID   IMAGE                         COMMAND                  CREATED        STATUS        PORTS                                            NAMES
581a05925ea6   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container2
1af7ba346d31   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container3
0668e2a48c9a   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container1
d717e6629b4a   nexusjpl/ingest-admin         "/usr/local/nexus-..."   5 seconds ago  Up 4 seconds  9393/tcp                                         xd-admin
a4dae8ca6757   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka3
c29664cfae4a   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka2
623bdaa50207   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka1
2266c2a54113   redis:3                       "docker-entrypoint..."   7 seconds ago  Up 5 seconds  6379/tcp                                         redis
da3267942d5f   mysql:8                       "docker-entrypoint..."   7 seconds ago  Up 6 seconds  3306/tcp                                         mysqldb
e5589456a78a   nexusjpl/nexus-webapp         "/tmp/docker-entry..."   31 hours ago   Up 31 hours   0.0.0.0:4040->4040/tcp, 0.0.0.0:8083->8083/tcp   nexus-webapp
18e682b9af0e   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent1
8951841d1da6   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent3
c0240926a4a2   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent2
c97ad268833f   nexusjpl/spark-mesos-master   "/bin/bash -c './b..."   31 hours ago   Up 31 hours   0.0.0.0:5050->5050/tcp                           mesos-master
90d370eb3a4e   nexusjpl/jupyter              "tini -- start-not..."   3 days ago     Up 3 days     0.0.0.0:8000->8888/tcp                           jupyter
cd0f47fe303d   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr2
8c0f5c8eeb45   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr3
27e34d14c16e   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr1
247f807cb5ec   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra3
09cc86a27321   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk1
33e9d9b1b745   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk3
dd29e4d09124   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra2
11e57e0c972f   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk2
2292803d942d   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra1
View the log of the XD Admin container to verify it has started.
$ docker logs -f xd-admin
Now that the ingestion cluster has been started, we can ingest some new data into the system. Currently, there is AVHRR data ingested up through 2016. In this step you will ingest the remaining AVHRR data through July 2017. The source granules for AVHRR have already been copied to the EBS volume attached to your EC2 instance and mounted in the ingestion containers as /usr/local/data/nexus/avhrr/2017.
In order to begin ingesting data, we need to deploy a new ingestion stream. The ingestion stream needs a few key parameters: the name of the dataset, where to look for the data files, the variable name to extract from the granules, and approximately how many tiles should be created per granule. These parameters can all be provided to the nx-deploy-stream shell script that is present in the xd-admin container.
$ docker exec -it xd-admin /usr/local/nx-deploy-stream.sh --datasetName AVHRR_OI_L4_GHRSST_NCEI --dataDirectory /usr/local/data/nexus/avhrr/2017 --variableName analysed_sst --tilesDesired 1296
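The tilesDesired value is only a hint about granularity, not an exact count. As back-of-the-envelope arithmetic (assuming the 0.25-degree global AVHRR OI grid of 720 x 1440 points; the exact chunking scheme is up to the NEXUS tiler), 1296 tiles corresponds to splitting each axis into 36 chunks:

```python
# Illustrative arithmetic only -- the real tiler may decompose differently.
lat_points, lon_points = 720, 1440   # 0.25-degree global grid (assumption)
tiles_desired = 1296

splits = int(tiles_desired ** 0.5)   # 36 chunks per axis, since 36 * 36 = 1296
tile_lat = lat_points // splits      # 20 latitude points per tile
tile_lon = lon_points // splits      # 40 longitude points per tile
print(splits * splits, tile_lat, tile_lon)  # 1296 20 40
```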
Once the stream is deployed, the data will begin to flow into the system. Progress can be monitored by tailing the log files and monitoring the number of tiles and granules that have been ingested into the system.
In [ ]:
# TODO Run this cell multiple times to watch as the granules are ingested into the system.
import requests
dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017
response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&rows=0&fq=dataset_s:%s&facet.field=granule_s&facet=true&facet.mincount=1&facet.limit=-1&facet.sort=index" % (year, dataset))
data = response.json()
for k in data['facet_counts']['facet_fields']['granule_s']:
    print(k)
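Note that Solr's classic facet_fields response is a flat list that alternates facet value and count (e.g. granule name, tile count, granule name, tile count, ...), which is what the loop above prints. A small helper (illustrative only, not part of NEXUS) makes the pairing explicit:

```python
def pair_facets(flat):
    """Zip Solr's alternating [value, count, value, count, ...] facet list
    into (value, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))

# Stand-in for data['facet_counts']['facet_fields']['granule_s']:
sample = ['20170101-granule.nc', 24, '20170102-granule.nc', 18]
for granule, count in pair_facets(sample):
    print("%s: %d tiles" % (granule, count))
```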
In [ ]:
# TODO Run this cell to get a count of the number of AVHRR granules ingested for the year 2017.
# Ingestion is finished when the total reaches 187
import requests
dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017
response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&json.facet={granule_s:'unique(granule_s)'}&rows=0&fq=dataset_s:%s" % (year, dataset))
data = response.json()
number_of_granules = data['facets']['granule_s'] if 'granule_s' in data['facets'] else 0
print("Number of granules for %s : %d" % (dataset, number_of_granules))
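The target of 187 granules follows from AVHRR OI being a daily product: it is simply the number of days in the ingestion window, 1 January 2017 through 6 July 2017 inclusive. A quick sanity check:

```python
from datetime import date

# One granule per day from 2017-01-01 through 2017-07-06, inclusive.
expected_granules = (date(2017, 7, 6) - date(2017, 1, 1)).days + 1
print(expected_granules)  # 187
```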
In [ ]:
# TODO Run this cell to get a list of datasets available along with their start and end dates.
import nexuscli
# Target the nexus webapp server
nexuscli.set_target("http://nexus-webapp:8083")
nexuscli.dataset_list()
In [ ]:
# TODO Run this cell to produce a Time Series plot using AVHRR data from 2017.
%matplotlib inline
import matplotlib.pyplot as plt
import time
import nexuscli
from datetime import datetime
from shapely.geometry import box
bbox = box(-150, 40, -120, 55)
datasets = ["AVHRR_OI_L4_GHRSST_NCEI"]
start_time = datetime(2017, 1, 1)
end_time = datetime(2017, 7, 6)
start = time.perf_counter()
ts, = nexuscli.time_series(datasets, bbox, start_time, end_time, spark=True)
print("Time Series took {} seconds to generate".format(time.perf_counter() - start))
plt.figure(figsize=(10,5), dpi=100)
plt.plot(ts.time, ts.mean, 'b-', marker='|', markersize=2.0, mfc='b')
plt.grid(b=True, which='major', color='k', linestyle='-')
plt.xlabel("Time")
plt.ylabel("Sea Surface Temperature (C)")
plt.show()
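Beyond plotting, the ts object's parallel arrays (such as ts.time and ts.mean, used above) can be summarized directly. Sketched here with stand-in values, since the real numbers depend on the live cluster:

```python
import numpy as np

# Stand-in for ts.mean: spatially averaged SST (deg C) per time step.
sst_means = np.array([10.2, 10.8, 11.5, 12.1, 12.9])

print("min %.1f C, max %.1f C, mean %.1f C"
      % (sst_means.min(), sst_means.max(), sst_means.mean()))
```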
You have completed this workshop. You now have a completely functional NEXUS cluster with all containers started.
If you would like, you can go back to the workshop 1 notebooks and verify they are still working. More information about NEXUS is available on our GitHub.
If you are interested in learning more about Docker, Nga Quach will be giving a presentation all about Docker on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.
If you are interested in learning more about Apache Spark, Joe Jacob will be giving a presentation all about Spark on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.
Thank you for participating!