Interacting with CerebralCortex Data

Cerebral Cortex is MD2K's big data cloud tool designed to support population-scale data analysis, visualization, model development, and intervention design for mobile-sensor data. It provides the ability to do machine learning model development on population scale datasets and provides interoperable interfaces for aggregation of diverse data sources.

This page provides an overview of the core Cerebral Cortex operations to familiarize you with how to discover and interact with the different sources of data that may be contained within the system.

Note: While some of these examples show generated data, they are designed to function on real-world mCerebrum data. The signal generators were built to facilitate testing and evaluation of the Cerebral Cortex platform by those who cannot access the original datasets or do not wish to collect data before evaluating the system.

Import packages

The imports needed for this demonstration are collected in the util/dependencies.py file. The settings import provides the USER_ID used throughout this demonstration. These IDs are the unique user identifiers within Cerebral Cortex.


In [ ]:
%reload_ext autoreload
from util.dependencies import *
from settings import USER_ID

Create CerebralCortex object

The Kernel object is the main entry point to the Cerebral Cortex system. It is necessary to pass a configuration directory that tells it all the different parameters it needs to communicate with its other components. You can examine the details of these configurations for this server by looking at the files contained in the cc_conf folder.


In [ ]:
CC = Kernel("/home/md2k/cc_conf/")

Getting help

These are the typical ways to learn more about the code and objects within Cerebral Cortex.

  1. Intelligent context help: type the object or class into a cell followed by a period (.), then press <tab>. A popup will appear showing additional information about the object or method. Uncomment the first line to try it out.
  2. Prefixing a command with a question mark retrieves its documentation strings and, when available, examples: ? CC.list_streams
  3. Reading the documentation on our site: https://cerebralcortex-kernel.readthedocs.io/en/latest/

In [ ]:
# CC.

? CC.list_streams

Generate some sample data for phone battery

This helper method uses Cerebral Cortex (CC), the USER_ID, and a stream_name to generate fake data for the purposes of these examples. If you have real-world data, this step can be skipped and the stream names adjusted to match your dataset. It is disabled in this demonstration to avoid creating too much data at once.


In [ ]:
#gen_phone_battery_data(CC, user_id=USER_ID, stream_name="BATTERY--org.md2k.phonesensor--PHONE")

List available streams in CC

One of the first things a researcher typically wants to know is what data is available to explore. The kernel offers a couple of methods to facilitate this. The first, list_streams, is shown below and exposes all the available streams within the system.


In [ ]:
streams = CC.list_streams()
for stream in streams:
    print(stream.name)

Search streams by name

For larger deployments, the list of all streams may be too long to easily sort through, or you may be interested in a specific type of information. In this case, the second method search_stream would be more applicable. This search returns streams that have a substring match of the search parameter.


In [ ]:
results = CC.search_stream("battery")
for result in results:
    print(result)
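
As a rough illustration of substring matching, the following sketch filters a plain list of stream names. It does not call Cerebral Cortex: search_names is a hypothetical helper, the sample names other than the battery stream are made up, and the case-insensitive comparison is an assumption rather than documented behavior of search_stream.

```python
def search_names(stream_names, query):
    """Return the names containing `query` as a substring.

    The comparison is case-insensitive, which is an assumption made
    for this sketch and not a statement about search_stream itself.
    """
    needle = query.lower()
    return [name for name in stream_names if needle in name.lower()]


# Hypothetical stream names; only the battery stream appears in this tutorial.
names = [
    "BATTERY--org.md2k.phonesensor--PHONE",
    "ACCELEROMETER--org.md2k.example--WRIST",
    "LOCATION--org.md2k.example--PHONE",
]

matches = search_names(names, "battery")
print(matches)
```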

Get stream data

Once a stream is identified by name, it needs to be loaded into a DataStream object by calling get_stream. This pulls into a single object all the metadata associated with the stream as well as a reference to the data so that it can be accessed as needed.


In [ ]:
battery_data_stream = CC.get_stream("BATTERY--org.md2k.phonesensor--PHONE")

The summary method displays some basic statistics about the datastream such as the number of points as well as max, mean, stdev, and min values. These statistics are shown for each column of data in the stream.


In [ ]:
battery_data_stream.summary()

Any datastream can be printed or visualized on screen; however, it is important to limit the number of rows shown (to 3 in this case). Streams can contain millions to billions of samples depending on the size of the system. Even for a single individual wearing a motion-capture band, this number can exceed 30,000,000 samples for a short two-week study. Cerebral Cortex defaults to settings that avoid loading all the data unless it is needed.

This example prints the first 3 rows of the loaded battery stream, which contains 5 columns:

  • timestamp: This is the time in UTC that the sample was recorded at
  • localtime: This is the time in the local timezone that the sample was recorded at
  • battery_level: This is the battery percentage of the smartphone device
  • version: This is the Cerebral Cortex version code assigned to this stream
  • user: This is the specific UUID that identifies the user that owns this data point

In [ ]:
battery_data_stream.show(3, truncate=False)

Each stream also carries metadata with the following fields:

  • name: The complete string name of this stream
  • description: A text description of this stream
  • data_descriptor: A list of objects that describe the data components of the stream (e.g. battery_level)
    • ...
    • name: data descriptor name
    • type: the object type (e.g. integer, float, string, ...)
    • optional_fields: any number of arbitrary fields can be added when creating a stream and will appear here
    • ...
  • annotations: Currently unused but designed to link streams together, such as a data-quality stream and its corresponding raw stream
  • input_streams: Currently unused but designed to specify which streams were used to generate this stream
  • modules: Metadata about the algorithm/code module that generated this data
    • name: The name of the code module
    • version: The version of the code module
    • attributes: Arbitrary attributes specified by key-value pairs
    • authors: A set of author names and emails

In [ ]:
metadata = battery_data_stream.get_metadata(version=1)
print(metadata)
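
To make the shape of this metadata easier to picture, here is a minimal sketch that mirrors the field list above using standard-library dataclasses. The class names (StreamMetadataSketch, DataDescriptorSketch, ModuleSketch) are hypothetical and are not the classes used by Cerebral Cortex; the description, type, and module values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DataDescriptorSketch:
    """Describes one data component of a stream (hypothetical class)."""
    name: str
    type: str
    optional_fields: dict = field(default_factory=dict)


@dataclass
class ModuleSketch:
    """Describes the code module that produced a stream (hypothetical class)."""
    name: str
    version: str
    attributes: dict = field(default_factory=dict)
    authors: list = field(default_factory=list)


@dataclass
class StreamMetadataSketch:
    """Top-level metadata fields as listed above (hypothetical class)."""
    name: str
    description: str
    data_descriptor: list = field(default_factory=list)
    annotations: list = field(default_factory=list)    # currently unused
    input_streams: list = field(default_factory=list)  # currently unused
    modules: list = field(default_factory=list)


# Placeholder values for illustration only.
metadata_sketch = StreamMetadataSketch(
    name="BATTERY--org.md2k.phonesensor--PHONE",
    description="placeholder description",
    data_descriptor=[DataDescriptorSketch(name="battery_level", type="float")],
    modules=[ModuleSketch(name="placeholder_module", version="1.0")],
)
print(metadata_sketch)
```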

Filter Data

Cerebral Cortex returns all data associated with a stream name, which is great for performing operations and initial exploration; however, it also allows these streams to be filtered to isolate data matching certain criteria, such as value ranges, specific columns, or specific users.

Filter data by data column

The first major filtering capability allows for named columns to have logical operations applied to them. The filter method is applicable to the data stream object and accepts three parameters.

  • column name: (e.g. battery_level)
  • operation: (e.g. >, <, ==, >=, ...)
  • criteria: (e.g. 97)

In [ ]:
filtered_data = battery_data_stream.filter("battery_level", ">", 97)
filtered_data.show(3,truncate=False)
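
The three-parameter pattern (column, operation, criteria) can be sketched in plain Python as follows. This is only an illustration of the idea, not how Cerebral Cortex implements filter: filter_rows and OPERATORS are hypothetical names, the rows are made-up values, and the "<=" and "!=" entries are assumed additions to the operators listed above.

```python
import operator

# Map operator strings to Python comparison functions.
# "<=" and "!=" are assumptions; the tutorial lists >, <, ==, >= explicitly.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
}


def filter_rows(rows, column, op, criteria):
    """Keep the rows whose `column` value satisfies `value <op> criteria`."""
    compare = OPERATORS[op]
    return [row for row in rows if compare(row[column], criteria)]


# Made-up sample rows for illustration only.
rows = [
    {"battery_level": 95},
    {"battery_level": 98},
    {"battery_level": 100},
    {"battery_level": 97},
]

high = filter_rows(rows, "battery_level", ">", 97)
print(high)
```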

Filter data by user

User filtering is a special case due to the way Cerebral Cortex stores data. A dedicated method, filter_user, is provided which accepts a single user ID as input. This example illustrates filtering by the prior user ID.


In [ ]:
filtered_user_data = battery_data_stream.filter_user("00000000-afb8-476e-9872-6472b4e66b68")
filtered_user_data.show(3,truncate=False)

Filter data by version

Version filtering is a special case due to the way Cerebral Cortex stores data. A dedicated method, filter_version, is provided which accepts a single version as input.


In [ ]:
filtered_version_data = battery_data_stream.filter_version(1)
filtered_version_data.show(3,truncate=False)

Convert datastream object into Pandas dataframe

The data representations and visualizations shown so far provide a way to perform basic data inspection; however, they are not directly suitable for more complex interactions or analysis. Cerebral Cortex provides a to_pandas method to transform the datastream data into a Pandas dataframe object. From this point, anything that Pandas can do is supported.


In [ ]:
pdf = battery_data_stream.to_pandas()
pdf.data

Plot stream data

Visualization is a key part of understanding the data and performing data analysis. The datastream provides a set of basic plotting operations that are accessible through the .plot() method or through other direct mechanisms. Please see the plotting_demo tutorial page for a complete set of plotting examples.

This plot is interactive; try using your mouse to explore the data.


In [ ]:
battery_data_stream.plot()

Compute some basic stats

Cerebral Cortex provides computationally efficient helper functions for generating basic statistics over the datastream. These functions include average, sqrt, sum, variance, stdev, min, and max.


In [ ]:
average = battery_data_stream.compute_average()
average.show(4, False)

minimum = battery_data_stream.compute_min()
minimum.show(4, False)
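
For reference, most of the statistics named above can be reproduced on a small list with the Python standard library. This sketch is independent of Cerebral Cortex and uses made-up battery levels; it uses the sample (n - 1) variance and standard deviation, and whether the compute_* helpers use the sample or the population form is not stated here.

```python
import statistics

# Made-up battery levels for illustration only.
levels = [100, 99, 98, 97, 96]

stats = {
    "average": statistics.mean(levels),
    "sum": sum(levels),
    "variance": statistics.variance(levels),  # sample variance (n - 1)
    "stdev": statistics.stdev(levels),        # sample standard deviation
    "min": min(levels),
    "max": max(levels),
}

for name, value in stats.items():
    print(name, value)
```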

Perform windowing operation on data

Many times it is preferable to group the data into windows before applying an algorithm or computation to the data. The basic windowing function groups data into non-overlapping chunks and returns a data stream with each cell containing all the data associated with that particular window.


In [ ]:
windowed_data = battery_data_stream.window(windowDuration=60)
windowed_data.show(3)
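
The idea of non-overlapping windows can be sketched in plain Python as follows. This is a conceptual illustration only: tumbling_windows is a hypothetical helper, the samples are made up, and timestamps and the window length are treated as seconds, which is an assumption about the units of windowDuration.

```python
from collections import defaultdict


def tumbling_windows(samples, window_seconds):
    """Group (timestamp_seconds, value) pairs into non-overlapping windows.

    Returns a dict mapping each window's start time to the list of
    values whose timestamps fall in [start, start + window_seconds).
    """
    windows = defaultdict(list)
    for timestamp, value in samples:
        # Integer division snaps each timestamp to the start of its window.
        start = (timestamp // window_seconds) * window_seconds
        windows[start].append(value)
    return dict(sorted(windows.items()))


# Made-up (timestamp, battery_level) samples.
samples = [(0, 100), (30, 99), (59, 99), (60, 98), (125, 97)]
windows = tumbling_windows(samples, window_seconds=60)
print(windows)
```

Each sample lands in exactly one window, so the windows partition the data.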

Sliding windows

Another common windowing technique is the sliding window. It is accomplished by adding a slideDuration parameter to the parameter list, which causes the windows to advance by a fraction of the window size instead of by a whole window, so that consecutive windows overlap.


In [ ]:
windowed_data = battery_data_stream.window(windowDuration=60, slideDuration=5)
windowed_data.show(3)
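
A plain-Python sketch of overlapping windows is shown below. It is a simplified illustration rather than the Cerebral Cortex implementation: sliding_windows is a hypothetical helper, the samples are made up, units are assumed to be seconds, and windows are started at time zero for simplicity.

```python
def sliding_windows(samples, window_seconds, slide_seconds):
    """Group (timestamp_seconds, value) pairs into overlapping windows.

    Windows start at 0, slide_seconds, 2 * slide_seconds, ... and each
    covers [start, start + window_seconds). Empty windows are omitted.
    """
    if not samples:
        return {}
    last = max(timestamp for timestamp, _ in samples)
    windows = {}
    start = 0
    while start <= last:
        end = start + window_seconds
        values = [value for timestamp, value in samples if start <= timestamp < end]
        if values:
            windows[start] = values
        start += slide_seconds
    return windows


# Made-up (timestamp, battery_level) samples.
samples = [(0, 100), (4, 99), (8, 98), (12, 97)]
windows = sliding_windows(samples, window_seconds=10, slide_seconds=5)
print(windows)
```

Because the slide is smaller than the window, a sample can appear in more than one window.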

Computation over windows

There are two mechanisms by which computation can be applied to windowed data. First, some of the core methods, such as compute_average or compute_variance, contain optimized implementations that compute the desired outcome more efficiently by applying the windowing internally. Second, a more generic approach is to explicitly create the windows and apply a particular computation to each of them. This will be discussed in another panel.


In [ ]:
average = battery_data_stream.compute_average(windowDuration=60)
average.show(4, False)
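
The second, more generic mechanism (create the windows, then apply a computation to each) can be sketched in plain Python as follows. This is a conceptual illustration only: apply_over_windows is a hypothetical helper, the samples are made up, and units are assumed to be seconds.

```python
from statistics import mean


def apply_over_windows(samples, window_seconds, func):
    """Apply `func` to the values of each non-overlapping window.

    `samples` is a list of (timestamp_seconds, value) pairs. Returns a
    dict mapping each window's start time to func(values in that window).
    """
    grouped = {}
    for timestamp, value in samples:
        start = (timestamp // window_seconds) * window_seconds
        grouped.setdefault(start, []).append(value)
    return {start: func(values) for start, values in sorted(grouped.items())}


# Made-up (timestamp, battery_level) samples.
samples = [(0, 100), (30, 98), (60, 96), (90, 94), (120, 90)]

averages = apply_over_windows(samples, 60, mean)
minimums = apply_over_windows(samples, 60, min)
print(averages)
print(minimums)
```

Passing the computation as a function keeps the windowing logic separate from the statistic being computed.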