Cerebral Cortex is MD2K's big data cloud tool designed to support population-scale data analysis, visualization, model development, and intervention design for mobile-sensor data. It supports machine learning model development on population-scale datasets and provides interoperable interfaces for aggregating diverse data sources.
This page provides an overview of the core Cerebral Cortex operations to familiarize you with how to discover and interact with the different sources of data that may be contained within the system.
Note: Although some of these examples use generated data, they are designed to work on real-world mCerebrum data. The signal generators were built so that people who cannot access the original datasets, or who do not wish to collect data first, can still test and evaluate the Cerebral Cortex platform.
In [ ]:
%reload_ext autoreload
from util.dependencies import *
from settings import USER_ID
The Kernel object is the main entry point to the Cerebral Cortex system. It must be given a configuration directory that tells it all the parameters it needs to communicate with its other components. You can examine the details of these configurations for this server by looking at the files contained in the cc_conf folder.
In [ ]:
CC = Kernel("/home/md2k/cc_conf/")
These are the typical ways to learn more about the code and objects within Cerebral Cortex.
Type CC. and then press <tab>; a popup will appear showing additional information about the object or method. Uncomment the first line to try it out.
Prefix a name with ?, as in ? CC.list_streams, to display its documentation.
In [ ]:
# CC.
? CC.list_streams
This helper method uses Cerebral Cortex (CC), the USER_ID, and a stream_name to generate fake data for the purposes of these examples. If you have real-world data, this step can be skipped and the stream names adjusted to match your dataset. It is disabled for this demonstration to avoid creating too much data at once.
In [ ]:
#gen_phone_battery_data(CC, user_id=USER_ID, stream_name="BATTERY--org.md2k.phonesensor--PHONE")
In [ ]:
streams = CC.list_streams()
for stream in streams:
    print(stream.name)
For larger deployments, the list of all streams may be too long to sort through easily, or you may be interested in a specific type of information. In this case, the second method, search_stream, is more applicable. This search returns the streams whose names contain the search parameter as a substring.
In [ ]:
results = CC.search_stream("battery")
for result in results:
    print(result)
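The substring matching described above can be sketched in plain Python. This is a minimal illustration, not the Cerebral Cortex implementation: search_names is a hypothetical helper, the list of stream names is made up, and the case-insensitive comparison is an assumption suggested by the lowercase query in the cell above.

```python
def search_names(stream_names, query):
    """Return the names that contain `query` as a substring, ignoring case."""
    needle = query.lower()
    return [name for name in stream_names if needle in name.lower()]


# Hypothetical stream names, modeled on the naming pattern used in this tutorial.
names = [
    "BATTERY--org.md2k.phonesensor--PHONE",
    "ACCELEROMETER--org.md2k.phonesensor--PHONE",
    "LOCATION--org.md2k.phonesensor--PHONE",
]

matches = search_names(names, "battery")
print(matches)
```

A broader query such as "phonesensor" would match all three names in this sketch, which is why a specific search term is useful on large deployments.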
In [ ]:
battery_data_stream = CC.get_stream("BATTERY--org.md2k.phonesensor--PHONE")
In [ ]:
battery_data_stream.summary()
Any datastream can be printed or visualized on screen; however, it is important to limit the number of rows shown, in this case to 3. Streams can contain millions to billions of samples depending on the size of the system. Even for a single individual wearing a motion-capture band, the count can exceed 30,000,000 samples in a short two-week study. Cerebral Cortex defaults to settings that avoid loading all the data unless needed.
This example prints the first 3 rows of the loaded battery stream, which contains 5 columns.
In [ ]:
battery_data_stream.show(3, truncate=False)
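The reason for limiting rows can be illustrated with a lazy generator. This sketch is independent of Cerebral Cortex: generate_samples and head are hypothetical names, and the generated values are arbitrary. It shows that taking the first few rows of a lazily produced sequence never requires materializing the rest.

```python
from itertools import islice


def generate_samples():
    """Yield (sample_index, battery_level) pairs lazily, one at a time."""
    index = 0
    while True:  # stands in for a stream too large to load in full
        yield index, 100 - (index % 100)
        index += 1


def head(samples, n):
    """Materialize only the first n samples of an iterable."""
    return list(islice(samples, n))


first_rows = head(generate_samples(), 3)
for row in first_rows:
    print(row)
```

Calling list() directly on generate_samples() would never finish, while head stops after three items.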
Each stream contains metadata, which can be retrieved with get_metadata.
In [ ]:
metadata = battery_data_stream.get_metadata(version=1)
print(metadata)
The first major filtering capability allows logical operations to be applied to named columns. The filter method is called on the data stream object and accepts three parameters: the column name, the comparison operator, and the value to compare against.
In [ ]:
filtered_data = battery_data_stream.filter("battery_level", ">", 97)
filtered_data.show(3, truncate=False)
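One way to picture a three-parameter filter of this shape is a lookup from operator strings to comparison functions. The sketch below is an assumption about how such a filter could work, not the Cerebral Cortex implementation; filter_rows, OPERATORS, and the sample rows are hypothetical, and only the battery_level column name comes from the tutorial.

```python
import operator

# Map operator strings to the corresponding comparison functions.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_rows(rows, column, op, value):
    """Keep the rows whose `column` satisfies `column <op> value`."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    compare = OPERATORS[op]
    return [row for row in rows if compare(row[column], value)]


# Hypothetical rows standing in for battery samples.
rows = [
    {"battery_level": 95},
    {"battery_level": 98},
    {"battery_level": 100},
]

high = filter_rows(rows, "battery_level", ">", 97)
print(high)
```

Rejecting unknown operator strings up front keeps a typo from silently returning an empty result.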
In [ ]:
filtered_user_data = battery_data_stream.filter_user("00000000-afb8-476e-9872-6472b4e66b68")
filtered_user_data.show(3, truncate=False)
In [ ]:
filtered_version_data = battery_data_stream.filter_version(1)
filtered_version_data.show(3, truncate=False)
The data representations and visualizations shown so far provide a way to do basic data inspection; however, they are not directly suitable for more complex interactions or analysis. Cerebral Cortex provides a to_pandas method to transform the datastream data into a Pandas dataframe object. From this point, anything that Pandas can do is supported.
In [ ]:
pdf = battery_data_stream.to_pandas()
pdf.data
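As an illustration of what becomes possible once the data is in Pandas, the sketch below builds a small dataframe by hand and applies two standard Pandas operations. The rows and the timestamp column name are assumptions made for this example; the actual columns returned by to_pandas may differ.

```python
import pandas as pd

# Hypothetical samples; the timestamp column name is an assumption.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2019-01-01 00:00:00",
                "2019-01-01 00:00:30",
                "2019-01-01 00:01:00",
                "2019-01-01 00:01:30",
            ]
        ),
        "battery_level": [100, 98, 96, 94],
    }
)

# Summary statistics for a single column.
print(df["battery_level"].describe())

# Average battery level per one-minute bin.
per_minute = df.set_index("timestamp")["battery_level"].resample("1min").mean()
print(per_minute)
```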
Visualization is a key part of understanding the data and performing data analysis. The datastream contains a set of basic plotting operations that are accessible through the .plot() method or through other direct mechanisms. Please see the plotting_demo tutorial page for a complete set of plotting examples.
This plot is interactive; try using your mouse to explore the data.
In [ ]:
battery_data_stream.plot()
In [ ]:
average = battery_data_stream.compute_average()
average.show(4, False)
minimum = battery_data_stream.compute_min()
minimum.show(4, False)
Many times it is preferable to group the data into windows before applying an algorithm or computation to the data. The basic windowing function groups data into non-overlapping chunks and returns a data stream with each cell containing all the data associated with that particular window.
In [ ]:
windowed_data = battery_data_stream.window(windowDuration=60)
windowed_data.show(3)
In [ ]:
windowed_data = battery_data_stream.window(windowDuration=60, slideDuration=5)
windowed_data.show(3)
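The difference between the two windowing calls above can be sketched in plain Python. This is a simplified model, not the Cerebral Cortex implementation: tumbling_windows and sliding_windows are hypothetical helpers, timestamps are integer seconds, and treating the duration parameters as seconds is an assumption.

```python
def tumbling_windows(samples, window_duration):
    """Group (timestamp, value) pairs into non-overlapping windows.

    Returns a dict mapping each window start to the values whose
    timestamps fall in [start, start + window_duration).
    """
    windows = {}
    for timestamp, value in samples:
        start = (timestamp // window_duration) * window_duration
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))


def sliding_windows(samples, window_duration, slide_duration):
    """Group samples into overlapping windows that start every slide_duration.

    A sample belongs to every window whose interval
    [start, start + window_duration) contains its timestamp.
    """
    windows = {}
    for timestamp, value in samples:
        # Latest window start that is at or before this timestamp.
        start = (timestamp // slide_duration) * slide_duration
        while start > timestamp - window_duration:
            windows.setdefault(start, []).append(value)
            start -= slide_duration
    return dict(sorted(windows.items()))


# Hypothetical (timestamp_in_seconds, battery_level) samples.
samples = [(0, 100), (4, 99), (7, 98), (12, 97)]

tumbling = tumbling_windows(samples, window_duration=10)
sliding = sliding_windows(samples, window_duration=10, slide_duration=5)
print(tumbling)
print(sliding)
```

With a slide shorter than the window, each sample lands in several windows, so the sliding result has more windows than the tumbling one and neighboring windows share samples.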
There are two mechanisms for applying computation to windowed data. First, some of the core methods, such as compute_average or compute_variance, contain optimized implementations that compute the desired outcome more efficiently by applying the windowing internally. Second, a more generic approach is to create the windows explicitly and apply a particular computation to them. This will be discussed in another panel.
In [ ]:
average = battery_data_stream.compute_average(windowDuration=60)
average.show(4, False)
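To make the idea of a windowed average concrete, here is a minimal plain-Python sketch that groups samples into non-overlapping windows and averages each one. It is not how compute_average is implemented; windowed_average is a hypothetical helper, the samples are made up, and timestamps are assumed to be integer seconds.

```python
from collections import defaultdict
from statistics import mean


def windowed_average(samples, window_duration):
    """Average the values in each non-overlapping window of window_duration."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        window_start = (timestamp // window_duration) * window_duration
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical (timestamp_in_seconds, battery_level) samples.
samples = [(0, 100), (30, 98), (60, 96), (90, 94), (120, 92)]

averages = windowed_average(samples, window_duration=60)
print(averages)
```

Each key in the result is a window start time, and each value is the mean of the samples that fell inside that 60-second window.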