NEXUS uses Spring XD to ingest new data into the system. Spring XD is a distributed runtime that allows for parallel ingestion of data into data stores of all types. It requires a few supporting tools for administrative purposes, including Redis and a relational database management system (RDBMS).
The Spring XD architecture also includes a management application called XD Admin, which manages the XD Containers. Spring XD uses Apache ZooKeeper to keep track of the state of the cluster and Apache Kafka to communicate between its components.
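The pieces above map onto services in the ingestion cluster's compose file. As a rough, hypothetical sketch of how such a file might wire them together (the real docker-compose.yml in the ingest directory is more detailed, with networks, volumes, and environment settings):

```yaml
# Hypothetical sketch only -- service and image names are taken from the
# container list later in this notebook, but the actual compose file differs.
version: '2'
services:
  redis:                     # administrative store required by Spring XD
    image: redis:3
  mysqldb:                   # RDBMS required by Spring XD
    image: mysql:8
  kafka1:                    # transport between XD components
    image: nexusjpl/kafka
  xd-admin:                  # management application
    image: nexusjpl/ingest-admin
    depends_on: [redis, mysqldb, kafka1]
  xd-container1:             # worker managed by XD Admin
    image: nexusjpl/ingest-container
    depends_on: [xd-admin]
```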
We can bring up an ingestion cluster by using docker-compose.
Navigate to the directory containing the docker-compose.yml file for the ingestion cluster
$ cd ~/nexus/esip-workshop/docker/ingest
Use docker-compose to bring up the containers in the ingestion cluster
$ docker-compose up -d
Now that the cluster has started we can use various commands to ensure that it is operational and monitor its status.
List all running docker containers.
$ docker ps
The output should look similar to this:
CONTAINER ID   IMAGE                         COMMAND                  CREATED        STATUS        PORTS                                            NAMES
581a05925ea6   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container2
1af7ba346d31   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container3
0668e2a48c9a   nexusjpl/ingest-container     "/usr/local/nexus-..."   5 seconds ago  Up 3 seconds  9393/tcp                                         xd-container1
d717e6629b4a   nexusjpl/ingest-admin         "/usr/local/nexus-..."   5 seconds ago  Up 4 seconds  9393/tcp                                         xd-admin
a4dae8ca6757   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka3
c29664cfae4a   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka2
623bdaa50207   nexusjpl/kafka                "kafka-server-star..."   7 seconds ago  Up 6 seconds                                                   kafka1
2266c2a54113   redis:3                       "docker-entrypoint..."   7 seconds ago  Up 5 seconds  6379/tcp                                         redis
da3267942d5f   mysql:8                       "docker-entrypoint..."   7 seconds ago  Up 6 seconds  3306/tcp                                         mysqldb
e5589456a78a   nexusjpl/nexus-webapp         "/tmp/docker-entry..."   31 hours ago   Up 31 hours   0.0.0.0:4040->4040/tcp, 0.0.0.0:8083->8083/tcp   nexus-webapp
18e682b9af0e   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent1
8951841d1da6   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent3
c0240926a4a2   nexusjpl/spark-mesos-agent    "/tmp/docker-entry..."   31 hours ago   Up 31 hours                                                    mesos-agent2
c97ad268833f   nexusjpl/spark-mesos-master   "/bin/bash -c './b..."   31 hours ago   Up 31 hours   0.0.0.0:5050->5050/tcp                           mesos-master
90d370eb3a4e   nexusjpl/jupyter              "tini -- start-not..."   3 days ago     Up 3 days     0.0.0.0:8000->8888/tcp                           jupyter
cd0f47fe303d   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr2
8c0f5c8eeb45   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr3
27e34d14c16e   nexusjpl/nexus-solr           "docker-entrypoint..."   3 days ago     Up 3 days     8983/tcp                                         solr1
247f807cb5ec   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra3
09cc86a27321   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk1
33e9d9b1b745   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk3
dd29e4d09124   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra2
11e57e0c972f   zookeeper                     "/docker-entrypoin..."   3 days ago     Up 3 days     2181/tcp, 2888/tcp, 3888/tcp                     zk2
2292803d942d   cassandra:2.2.8               "/docker-entrypoin..."   3 days ago     Up 3 days     7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp      cassandra1
View the log of the XD Admin container to verify it has started.
$ docker logs -f xd-admin
Now that the ingestion cluster has been started, we can ingest some new data into the system. Currently, there is AVHRR data ingested up through 2016. In this step you will ingest the remaining AVHRR data through July 2017. The source granules for AVHRR have already been copied to the EBS volume attached to your EC2 instance and mounted in the ingestion containers as /usr/local/data/nexus/avhrr/2017.
In order to begin ingesting data, we need to deploy a new ingestion stream. The ingestion stream needs a few key parameters: the name of the dataset, where to look for the data files, the variable name to extract from the granules, and approximately how many tiles should be created per granule. These parameters can all be provided to the nx-deploy-stream shell script that is present in the xd-admin container.
$ docker exec -it xd-admin /usr/local/nx-deploy-stream.sh --datasetName AVHRR_OI_L4_GHRSST_NCEI --dataDirectory /usr/local/data/nexus/avhrr/2017 --variableName analysed_sst --tilesDesired 1296
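The tilesDesired value is only a hint about granularity, not an exact count. As back-of-the-envelope arithmetic (assuming the 0.25-degree global AVHRR OI grid of 720 x 1440 points; the exact chunking scheme is up to the NEXUS tiler), 1296 tiles corresponds to splitting each axis into 36 chunks:

```python
# Illustrative arithmetic only -- the real tiler may decompose differently.
lat_points, lon_points = 720, 1440   # 0.25-degree global grid (assumption)
tiles_desired = 1296

splits = int(tiles_desired ** 0.5)   # 36 chunks per axis, since 36 * 36 = 1296
tile_lat = lat_points // splits      # 20 latitude points per tile
tile_lon = lon_points // splits      # 40 longitude points per tile
print(splits * splits, tile_lat, tile_lon)  # 1296 20 40
```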
Once the stream is deployed, the data will begin to flow into the system. Progress can be monitored by tailing the log files and monitoring the number of tiles and granules that have been ingested into the system.
In [ ]:
# TODO Run this cell multiple times to watch as the granules are ingested into the system.
import requests
dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017
response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&rows=0&fq=dataset_s:%s&facet.field=granule_s&facet=true&facet.mincount=1&facet.limit=-1&facet.sort=index" % (year, dataset))
data = response.json()
for k in data['facet_counts']['facet_fields']['granule_s']:
    print(k)
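Note that Solr's classic facet_fields response is a flat list that alternates facet value and count (e.g. granule name, tile count, granule name, tile count, ...), which is what the loop above prints. A small helper (illustrative only, not part of NEXUS) makes the pairing explicit:

```python
def pair_facets(flat):
    """Zip Solr's alternating [value, count, value, count, ...] facet list
    into (value, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))

# Stand-in for data['facet_counts']['facet_fields']['granule_s']:
sample = ['20170101-granule.nc', 24, '20170102-granule.nc', 18]
for granule, count in pair_facets(sample):
    print("%s: %d tiles" % (granule, count))
```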
In [ ]:
# TODO Run this cell to get a count of the number of AVHRR granules ingested for the year 2017.
# Ingestion is finished when the total reaches 187
import requests
dataset = 'AVHRR_OI_L4_GHRSST_NCEI'
year = 2017
response = requests.get("http://solr1:8983/solr/nexustiles/query?q=granule_s:%d*&json.facet={granule_s:'unique(granule_s)'}&rows=0&fq=dataset_s:%s" % (year, dataset))
data = response.json()
number_of_granules = data['facets']['granule_s'] if 'granule_s' in data['facets'] else 0
print("Number of granules for %s : %d" % (dataset, number_of_granules))
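The target of 187 granules follows from AVHRR OI being a daily product: it is simply the number of days in the ingestion window, 1 January 2017 through 6 July 2017 inclusive. A quick sanity check:

```python
from datetime import date

# One granule per day from 2017-01-01 through 2017-07-06, inclusive.
expected_granules = (date(2017, 7, 6) - date(2017, 1, 1)).days + 1
print(expected_granules)  # 187
```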
In [ ]:
# TODO Run this cell to get a list of datasets available along with their start and end dates.
import nexuscli
# Target the nexus webapp server
nexuscli.set_target("http://nexus-webapp:8083")
nexuscli.dataset_list()
In [ ]:
# TODO Run this cell to produce a Time Series plot using AVHRR data from 2017.
%matplotlib inline
import matplotlib.pyplot as plt
import time
import nexuscli
from datetime import datetime
from shapely.geometry import box
bbox = box(-150, 40, -120, 55)
datasets = ["AVHRR_OI_L4_GHRSST_NCEI"]
start_time = datetime(2017, 1, 1)
end_time = datetime(2017, 7, 6)
start = time.perf_counter()
ts, = nexuscli.time_series(datasets, bbox, start_time, end_time, spark=True)
print("Time Series took {} seconds to generate".format(time.perf_counter() - start))
plt.figure(figsize=(10,5), dpi=100)
plt.plot(ts.time, ts.mean, 'b-', marker='|', markersize=2.0, mfc='b')
plt.grid(b=True, which='major', color='k', linestyle='-')
plt.xlabel("Time")
plt.ylabel("Sea Surface Temperature (C)")
plt.show()
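Beyond plotting, the ts object's parallel arrays (such as ts.time and ts.mean, used above) can be summarized directly. Sketched here with stand-in values, since the real numbers depend on the live cluster:

```python
import numpy as np

# Stand-in for ts.mean: spatially averaged SST (deg C) per time step.
sst_means = np.array([10.2, 10.8, 11.5, 12.1, 12.9])

print("min %.1f C, max %.1f C, mean %.1f C"
      % (sst_means.min(), sst_means.max(), sst_means.mean()))
```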
You have completed this workshop. You now have a completely functional NEXUS cluster with all containers started.
If you would like, you can go back to the workshop 1 notebooks and verify they are still working. More information about NEXUS is available on our GitHub.
If you are interested in learning more about Docker, Nga Quach will be giving a presentation all about Docker on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.
If you are interested in learning more about Apache Spark, Joe Jacob will be giving a presentation all about Spark on Thursday, July 27 during the Free and Open Source Software (FOSS) and Technologies for the Cloud session.
Thank you for participating!