Timesketch and Jupyter

This is a small notebook built to demonstrate how to interact with Timesketch from a Jupyter notebook in order to do some additional exploration of the data.

Jupyter can greatly complement investigations by giving the analyst access to the power of Python to manipulate the data stored in Timesketch. It also gives developers the ability to do research on the data in order to speed up development of analyzers, aggregators and graphing. The purpose of this notebook is simply to briefly introduce the power of Jupyter notebooks to analysts and developers, with the hope of inspiring more people to take advantage of this platform. It is also possible to use Google Colab to do these explorations, and a Colab copy of this notebook is available.

Each code cell (denoted by the [ ] and grey color) can be run simply by hitting "shift + enter" inside it. In order to run the notebook you'll need to install Jupyter on your machine and start it (with a Python 3 kernel); then you can connect to your local runtime. It is also possible to use mybinder to start up a Docker instance running the notebook. Remember, you can easily add new code cells, or modify the code that is already there, to experiment.
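
If Jupyter isn't installed yet, a minimal setup sketch (assuming pip is available) could look like the cell below; the exact package names are an assumption based on what the notebook imports.


In [ ]:
# Setup sketch, commented out on purpose; uncomment to run it in a notebook cell.
# Package names may differ in your environment.
# !pip install jupyter timesketch-api-client altair pandas numpy
#
# To start the notebook server, run this from a terminal instead (without the "!"):
# jupyter notebook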

README

If you want to have your own copy of the notebook to make some changes or do some other experimentation you can simply use the "File / Save as" menu option.

If you want to connect the notebook to your own Timesketch instance (that is, if it is not publicly reachable) simply run the Jupyter notebook on a machine that can reach your instance, and configure the SERVER/USER/PASSWORD parameters below to match yours.

Once you have your local runtime set up you should be able to reach your local Timesketch instance.

Import Libraries

We need to start by importing some libraries that we'll use in this notebook.


In [ ]:
import altair as alt # For graphing.
import numpy as np   # Never know when this will come in handy.
import pandas as pd  # We will be using pandas quite heavily.

from timesketch_api_client import client

Connect to TS

And now we can create a Timesketch client. The client is the object used to connect to the TS server and it provides the API to interact with it.

This will connect to the public demo of Timesketch; you may want to change these parameters to connect to your own TS instance.


In [ ]:
SERVER = 'https://demo.timesketch.org'
USER = 'demo'
PASSWORD = 'demo'

ts_client = client.TimesketchApi(SERVER, USER, PASSWORD)

If you are running a classic Jupyter notebook and not JupyterLab you'll need to uncomment and run the cell below; otherwise no action is needed.


In [ ]:
# This is only needed in a classic Jupyter notebook setting. Uncomment if you
# are using one (you'll need to have installed vega).
#alt.renderers.enable('notebook')

Let's Explore

And now we can start to explore. The first thing is to get all the sketches that are available. Most of the operations you want to do with TS are available in the sketch API.


In [ ]:
sketches = ts_client.list_sketches()

Now that we've got a list of all available sketches, let's print out the names of the sketches as well as the index into the list, so that we can more easily choose a sketch that interests us.


In [ ]:
for i, sketch in enumerate(sketches):
  print('[{0:d}] {1:s}'.format(i, sketch.name))

Another way is to create a dictionary where the keys are the names of the sketches and the values are the sketch objects.


In [ ]:
sketch_dict = dict((x.name, x) for x in sketches)

In [ ]:
sketch_dict

Let's now take a closer look at some of the data we've got in the "Greendale" investigation.


In [ ]:
gd_sketch = sketch_dict.get('The Greendale incident - 2019', sketches[0])

Now that we've connected to a sketch we can do all sorts of things.

Try doing: gd_sketch.<TAB>

In Colab (and Jupyter) you can use TAB completion to get a list of all attributes of the object you are working with. See a function you may want to call? Try calling it with gd_sketch.function_name? and hit enter. Let's look at an example:


In [ ]:
gd_sketch.explore?

This way you'll get a list of all the parameters you may want or need to use. You can also use tab completion as soon as you start typing: gd_sketch.e<TAB> will give you all options that start with an e, etc.

You can also type gd_sketch.explore(<TAB>) and get a pop-up with a list of what parameters this function provides.
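
Outside of the ? helper, the standard Python help() function prints the same docstring, which also works in a plain Python shell:


In [ ]:
# Standard Python alternative to the "?" helper: print the docstring of explore().
help(gd_sketch.explore)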

But for now, let's look at what views are available to use here:


In [ ]:
views = gd_sketch.list_views()

for index, view in enumerate(views):
  print('[{0:d}] {1:s}'.format(index, view.name))

You can then start to query the API to get back results from these views. Let's try one of them...

A word of caution: try to limit your search so that you don't get too many results back. The API will happily return all the results you ask for, but the more records you get back the longer the API call will take (it fetches 10k events per API call).


In [ ]:
# You can change this number if you would like to test out another view.
# The way the code works is that it first checks if you set "view_text" and uses that to pick a view; otherwise the number is used.
view_number = 1
view_text = '[phishy_domains] Phishy Domains'

if view_text:
  for index, view in enumerate(views):
    if view.name == view_text:
      view_number = index
      break

print('Fetching data from : {0:s}'.format(views[view_number].name))
print('        Query used : {0:s}'.format(
    views[view_number].query_string if views[view_number].query_string
    else views[view_number].query_dsl))

If you want to issue this query you can run the cell below; otherwise change the view_number to try another one.


In [ ]:
greendale_frame = gd_sketch.explore(view=views[view_number], as_pandas=True)

Did you notice the "as_pandas=True" parameter that got passed to the "explore" function? That means the data we get back is a pandas DataFrame that we can now start exploring.

Let's start with seeing how many entries we got back.


In [ ]:
greendale_frame.shape

This tells us that the view returned 670 events with 12 columns. Let's explore the first few entries, just so that we can wrap our heads around what we got back.


In [ ]:
greendale_frame.head(5)

Let's look at what columns we got back... and maybe create a slice that contains fewer columns.


In [ ]:
greendale_frame.columns

In [ ]:
greendale_slice = greendale_frame[['datetime', 'timestamp_desc', 'tag', 'message', 'label']]

greendale_slice.head(4)

Since this is a result from the analyzers we have a few extra fields we can pull in.

When running gd_sketch.explore? did you notice the field called return_fields:

    return_fields: List of fields that should be included in the
        response.

We can use that to specify which fields we would like to get back. Let's add a few more fields (you can see what fields are available in the UI).


In [ ]:
greendale_frame = gd_sketch.explore(
    view=views[view_number],
    return_fields='datetime,message,source_short,tag,timestamp_desc,url,domain,human_readable',
    as_pandas=True)

Let's briefly look at these events.


In [ ]:
greendale_slice = greendale_frame[['datetime', 'timestamp_desc', 'tag', 'human_readable', 'url', 'domain']]

greendale_slice.head(5)

OK... since this is the phishy domains analyzer, and all the results we got back are essentially from that analyzer, let's look at a few things. First of all let's look at the tags that are available.


In [ ]:
greendale_frame['tag_string'] = greendale_frame.tag.str.join('|')

greendale_frame.tag_string.unique()

OK... so we've got some that are tagged as whitelisted-domain... let's look at the domains that are marked as "phishy", excluding those that are whitelisted.


In [ ]:
greendale_frame[~greendale_frame.tag_string.str.contains('whitelisted-domain')].domain.value_counts()

OK... now we get to see all the domains that the domain analyzer considered to be potentially "phishy"... is there a domain that stands out? What about that grendale one?


In [ ]:
greendale_slice[greendale_slice.domain == 'grendale.xyz']

OK... this seems odd... let's look at a few things: the human_readable string as well as the URL.


In [ ]:
greendale_slice[greendale_slice.domain == 'grendale.xyz']

In [ ]:
grendale = greendale_slice[greendale_slice.domain == 'grendale.xyz']

string_set = set()
for string_list in grendale.human_readable:
  new_list = [x for x in string_list if 'phishy_domains' in x]
  _ = list(map(string_set.add, new_list))

for entry in string_set:
  print('Human readable string is: {0:s}'.format(entry))

print('')
print('Counts for URL connections to the grendale domain:')
grendale_count = grendale.url.value_counts()
for index in grendale_count.index:
  print('[{0:d}] {1:s}'.format(grendale_count[index], index))

We can start doing a lot more now if we want to... let's look at when these things occurred...


In [ ]:
grendale_array = grendale.url.unique()

greendale_slice[greendale_slice.url.isin(grendale_array)]

OK... we can then start to look at surrounding events.... let's look at one date in particular... "2015-08-29 12:21:06"


In [ ]:
query_dsl = """
{
	"query": {
		"bool": {
			"filter": {
				"bool": {
					"should": [
						{
							"range": {
								"datetime": {
									"gte": "2015-08-29T12:20:06",
									"lte": "2015-08-29T12:22:06"
								}
							}
						}
					]
				}
			},
			"must": [
				{
					"query_string": {
						"query": "*"
					}
				}
			]
		}
	},
	"size": 10000,
	"sort": {
		"datetime": "asc"
	}
}
"""

data = gd_sketch.explore(
    query_dsl=query_dsl,
    return_fields='message,human_readable,datetime,timestamp_desc,source_short,data_type,tags,url,domain',
    as_pandas=True)

In [ ]:
data[['datetime', 'message', 'human_readable', 'url']].head(4)

Let's find the grendale entries and just look at events two seconds before/after.


In [ ]:
data[(data.datetime > '2015-08-29 12:21:04') & (data.datetime < '2015-08-29 12:21:08')][['datetime', 'message', 'timestamp_desc']]

Let's look at aggregations

Timesketch also has aggregation capabilities that we can call from the client. Let's take a quick look.

Start by checking out whether there are any stored aggregations that we can just take a look at.

You can also store your own aggregations using the .save() function on the aggregation object. We are not going to run that against the demo server in this notebook, but a small sketch is shown further below, after we run our own aggregator.


In [ ]:
gd_sketch.list_aggregations()

OK, so there are some aggregations stored. Let's just pick one of those to take a closer look at.


In [ ]:
aggregation = gd_sketch.list_aggregations()[0]

Now we've got an aggregation object that we can take a closer look at.


In [ ]:
aggregation.name

In [ ]:
aggregation.description

OK, so from the name we can determine that this has to do with the top 10 visited domains. We can also look at all of the stored aggregations.


In [ ]:
pd.DataFrame([{'name': x.name, 'description': x.description} for x in gd_sketch.list_aggregations()])

Let's look at the aggregation visually, both as a table and a chart.


In [ ]:
aggregation.table

In [ ]:
aggregation.chart

We can also take a look at what aggregators can be used, if we want to run our own custom aggregator.


In [ ]:
gd_sketch.list_available_aggregators()

Now we can see that there are at least the "field_bucket" and "query_bucket" aggregators that we can look at. The field_bucket one is a terms bucket aggregation, which means we can take any field in the dataset and aggregate on that.

So if we want to, for instance, see the top 20 domains that were visited, we can just ask for an aggregation on the domain field and limit it to 20 records (which will be the top 20). Let's do that:


In [ ]:
aggregator = gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={'field': 'domain', 'limit': 20, 'supported_charts': 'barchart'})

Now we've got an aggregation object that we can take a closer look at... let's look at the data it stored. What we were trying to get out was the top 20 domains that were visited.


In [ ]:
aggregator.table

Or we can look at this visually... as a chart


In [ ]:
aggregator.chart
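
As mentioned earlier, aggregations can be stored back to the sketch with .save(). We won't run it against the public demo server, but a minimal sketch (assuming the object returned by run_aggregator exposes the same .save() call mentioned above) would be:


In [ ]:
# Commented out on purpose; we don't want to write to the public demo sketch.
# Saving stores the aggregation so that it shows up in gd_sketch.list_aggregations().
# aggregator.save()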

We can also do something a bit more complex. The other aggregator, query_bucket, works in a similar way, except you can filter the results first. We want to aggregate all the domains that have been tagged with the phishy-domain tag.


In [ ]:
tag_aggregator = gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'domain',
        'query_string': 'tag:"phishy-domain"',
        'supported_charts': 'barchart',
    }
)

Let's look at the results.


In [ ]:
tag_aggregator.table

We can also look at all the tags in the timeline, to see what tags have been applied and how frequent they are.


In [ ]:
gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={
        'field': 'tag',
        'limit': 10,
    }
).table

Next, let's see which applications were most frequently executed on the machine.

Since not all of the execution events have the same fields in them we'll have to create a few tables here... let's start by looking at what data types are there.


In [ ]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'data_type',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

And then we can do a summary for each one.


In [ ]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'path',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

In [ ]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'link_target',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

Let's look at logins...

Let's do a search to look at login entries...


In [ ]:
login_data = gd_sketch.explore(
    'data_type:"windows:evtx:record" AND event_identifier:4624', 
    return_fields='datetime,timestamp_desc,human_readable,message,tag,event_identifier,computer_name,record_number,recovered,strings,username',
    as_pandas=True
)

This will produce quite a few events... let's look at how many.


In [ ]:
login_data.shape

Let's look at usernames....


In [ ]:
login_data.username.value_counts()

The login analyzer wasn't checked in when the demo data was processed, and therefore it didn't extract all those usernames. Let's extract them manually for the logon entries.


In [ ]:
# Extract the account name, account domain and process name from the 4624
# message string, and parse the datetime column into a proper timestamp.
login_data['account_name'] = login_data.message.str.extract(r'Account Name:.+Account Name:\\t\\t([^\\]+)\\n', expand=False)
login_data['account_domain'] = login_data.message.str.extract(r'Account Domain:.+Account Domain:\\t\\t([^\\]+)\\n', expand=False)
login_data['process_name'] = login_data.message.str.extract(r'Process Name:.+Process Name:\\t\\t([^\\]+)\\n', expand=False)
login_data['date'] = pd.to_datetime(login_data.datetime)

What accounts have logged in:


In [ ]:
login_data.account_name.value_counts()

Let's look at all the computers in there...


In [ ]:
login_data.computer_name.value_counts()

Let's graph.... and you can then interact with the graph... try zooming in, etc.

First we'll define a graph function that we can then call with parameters...


In [ ]:
def GraphLogins(data_frame, machine_name=None):
  
  if machine_name:
    data_slice = data_frame[data_frame.computer_name == machine_name]
    title = 'Accounts Logged In - {0:s}'.format(machine_name)
  else:
    data_slice = data_frame
    title = 'Accounts Logged In'
    
  data_grouped = data_slice[['account_name', 'date']].groupby('account_name', as_index=False).count()
  data_grouped['count'] = data_grouped.date
  del data_grouped['date']

  return alt.Chart(data_grouped, width=400).mark_bar().encode(
    x='account_name', y='count',
    tooltip=['account_name', 'count']
  ).properties(
    title=title
  ).interactive()

Start by graphing all machines


In [ ]:
GraphLogins(login_data)

Or we can look at this for a particular machine:


In [ ]:
GraphLogins(login_data, 'Student-PC1.internal.greendale.edu')

Or we can look at this as a scatter plot...

First we'll define a function that munches the data for us. This function will essentially graph all logins per day as a scatter plot, using colors to denote the count value.

This graph will be very interactive... try selecting a time period by clicking with the mouse on the upper graph and drawing a selection.


In [ ]:
login_data['day'] = login_data['date'].dt.strftime('%Y-%m-%d')

def GraphScatterLogin(data_frame, machine_name=''):
  if machine_name:
    data_slice = data_frame[data_frame.computer_name == machine_name]
    title = 'Accounts Logged In - {0:s}'.format(machine_name)
  else:
    data_slice = data_frame
    title = 'Accounts Logged In'
  
  login_grouped = data_slice[['day', 'computer_name', 'account_name', 'message']].groupby(['day', 'computer_name', 'account_name'], as_index=False).count()
  login_grouped['count'] = login_grouped.message
  del login_grouped['message']
    
  brush = alt.selection_interval(encodings=['x'])
  click = alt.selection_multi(encodings=['color'])
  color = alt.Color('count:Q')

  chart1 = alt.Chart(login_grouped).mark_point().encode(
      x='day', 
      y='account_name',
      color=alt.condition(brush, color, alt.value('lightgray')),
  ).properties(
      title=title,
      width=600
  ).add_selection(
      brush
  ).transform_filter(
      click
  )
  
  chart2 = alt.Chart(login_grouped).mark_bar().encode(
      x='count',
      y='account_name',
      color=alt.condition(brush, color, alt.value('lightgray')),
      tooltip=['count'],
  ).transform_filter(
      brush
  ).properties(
      width=600
  ).add_selection(
      click
  )
  
  return chart1 & chart2

OK, let's start by graphing all logins...


In [ ]:
GraphScatterLogin(login_data)

And now just for the Student-PC1


In [ ]:
GraphScatterLogin(login_data, 'Student-PC1.internal.greendale.edu')

And now it is your time to shine: experiment with pandas, the graphing library and other data science techniques.
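
For example, here is a minimal starting point, assuming the login_data frame from above is still in memory: count the logon events per day and plot them over time.


In [ ]:
# Count 4624 logon events per day using the login_data frame built earlier,
# then plot the daily counts as an interactive line chart.
daily_counts = login_data.groupby('day')['message'].count().reset_index(name='count')

alt.Chart(daily_counts, width=600).mark_line(point=True).encode(
    x='day',
    y='count',
    tooltip=['day', 'count']
).properties(
    title='Logon events per day'
).interactive()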


In [ ]: