Timesketch and Colab

This is a small colab built to demonstrate how to interact with Timesketch from colab and do some additional exploration of the data.

Colab can greatly complement investigations by giving the analyst the power of Python to manipulate the data stored in Timesketch. It also gives developers the ability to do research on the data in order to speed up development of analyzers, aggregators and graphs. The purpose of this colab is simply to briefly introduce these capabilities to analysts and developers, with the hope of inspiring more people to take advantage of this powerful platform. It is also possible to use a Jupyter notebook instead of colab; both are equally valid options.

Each code cell (denoted by the [] and grey color) can be run simply by hitting "shift + enter" inside it. The first code cell that you execute will automatically connect you to a public colab runtime and to the publicly open Timesketch demo. You can easily add new code cells, or modify the code that is already there, to experiment.

README

If you want to have your own copy of the colab to make changes or do some other experimentation, simply select "File / Save a Copy in Drive" to make your own copy of this colab and start making changes.

If you want to connect colab to your own Timesketch instance (that is, if it is not publicly reachable) you can run your own colab runtime: hit the small triangle right next to the "Connect" button in the upper right corner and select "Connect to local runtime". Instructions on how to set up your local runtime are provided there.

Once you have your local runtime set up you should be able to reach your local Timesketch instance.
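
At the time of writing, a local runtime is typically started on your own machine with commands along these lines; treat the exact packages and flags as an assumption and follow the instructions in the "Connect to local runtime" dialog for the current ones:

    pip install jupyter_http_over_ws
    jupyter serverextension enable --py jupyter_http_over_ws
    jupyter notebook \
      --NotebookApp.allow_origin='https://colab.research.google.com' \
      --port=8888 --NotebookApp.port_retries=0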

Installation

Let's start by installing the TS API client. All commands that start with ! are executed in the shell, so if you are missing any Python packages you can install them with pip.

This colab uses Python 2 as the underlying Python binary.


In [0]:
!pip install --upgrade timesketch-api-client
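
The note above mentions the Python version of the runtime; if you want to confirm which Python your runtime is actually using, a quick check is:


In [0]:
import sys
print(sys.version)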

Then we need to import some libraries that we'll use in this colab.


In [0]:
import altair as alt # For graphing.
import numpy as np   # Never know when this will come in handy.
import pandas as pd  # We will be using pandas quite heavily.

from timesketch_api_client import client

Connect to TS

Now we can create a Timesketch client. The client is the object used to connect to the TS server, and it provides the API to interact with it.

This will connect to the public Timesketch demo; you may want to change these parameters to connect to your own TS instance.


In [0]:
#@title Client Information { run: "auto"}

SERVER = 'https://demo.timesketch.org' #@param {type: "string"}
USER = 'demo' #@param {type: "string"}
PASSWORD = 'demo' #@param {type: "string"}


ts_client = client.TimesketchApi(SERVER, USER, PASSWORD)

(Hint: the above cell is just a small piece of code; you can see the code by clicking the "three dots" menu and selecting Form / Show code.)

Let's Explore

And now we can start to explore. The first thing is to get all the sketches that are available. Most of the operations you want to do with TS are available in the sketch API.


In [0]:
sketches = ts_client.list_sketches()

Now that we've got a list of all available sketches, let's print out the names of the sketches as well as their index in the list, so that we can more easily choose a sketch that interests us.


In [0]:
for i, sketch in enumerate(sketches):
  print('[{0:d}] {1:s}'.format(i, sketch.name))

Another way is to create a dictionary where the keys are the names of the sketches and the values are the sketch objects.


In [0]:
sketch_dict = dict((x.name, x) for x in sketches)

In [0]:
sketch_dict

Let's now take a closer look at some of the data we've got in the "Greendale" investigation.


In [0]:
gd_sketch = sketch_dict.get('The Greendale incident - 2019', sketches[0])

Now that we've connected to a sketch we can do all sorts of things.

Try doing: gd_sketch.<TAB>

In colab you can use TAB completion to get a list of all attributes of the object you are working with. See a function you may want to call? Try calling it with gd_sketch.function_name? and hit enter. Let's look at an example:


In [0]:
gd_sketch.explore?

This way you'll get a list of all the parameters you may want or need to use. You can also use tab completion as you type: gd_sketch.e<TAB> will give you all options that start with an e, and so on.

You can also type gd_sketch.explore(<TAB>) and get a pop-up listing the parameters this function accepts.
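
If you prefer plain Python over the IPython "?" syntax, the built-in help() function will show you the same information:


In [0]:
help(gd_sketch.explore)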

Now let's look at some things we can do with the sketch object and the TS client. For example, if we want to get all starred events in the sketch we can do that by querying the sketch for available labels. You can think of a label as a "sketch specific tag": unlike a tag, which is stored in the Elastic document and therefore shared among all sketches that have that same timeline attached, a label is bound to the actual sketch and is not available outside of it. Labels are used in various places, most notably to indicate which events have comments, are hidden from views, or are starred. The pre-defined labels are:

  • __ts_star: Starred event
  • __ts_comment: Event with a comment
  • __ts_hidden: A hidden event

Let's for instance look at all starred events in the Greendale index:


In [0]:
lines = []

# Fetch all events that carry the __ts_star label.
star_label = gd_sketch.search_by_label('__ts_star')

# Flatten the returned objects into a list of dicts, keeping the document ID and index.
for obj in star_label.get('objects', []):
  event = obj.get('_source', {})
  event['_id'] = obj.get('_id')
  event['_index'] = obj.get('_index')
  lines.append(event)

labeled_events = pd.DataFrame(lines)

labeled_events.shape

As you can see there are quite a few starred events. To limit the output, let's look at just the first 10.


In [0]:
labeled_events.head(10)

Or a single one...


In [0]:
pd.set_option('display.max_colwidth', 100)
labeled_events.iloc[9]

To continue let's look at what views have been stored in the sketch:


In [0]:
views = gd_sketch.list_views()

for index, view in enumerate(views):
  print('[{0:d}] {1:s}'.format(index, view.name))

You can then start to query the API to get back results from these views. Let's try one of them...

A word of caution: try to limit your search so that you don't get too many results back. The API will happily return as many results as you ask for, but the more records you get back the longer the API call will take (the client fetches 10k events per API call).
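
Depending on the version of the API client, explore() may also accept a max_entries parameter that caps how many events are fetched; that parameter name is an assumption here, so check gd_sketch.explore? before relying on it. A capped query would look roughly like this:


In [0]:
# Assumes the installed client exposes max_entries; verify with gd_sketch.explore?.
capped_frame = gd_sketch.explore('*', max_entries=1000, as_pandas=True)
capped_frame.shape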


In [0]:
# You can change this number if you would like to test out another view.
# The code first checks if "view_text" is set and uses that to pick a view;
# otherwise the number is used.
view_number = 1
view_text = '[phishy_domains] Phishy Domains'

if view_text:
  for index, view in enumerate(views):
    if view.name == view_text:
      view_number = index
      break

view = views[view_number]
print('Fetching data from : {0:s}'.format(view.name))
print('        Query used : {0:s}'.format(
    view.query_string if view.query_string else view.query_dsl))

If you want to issue this query you can run the cell below; otherwise change view_number to try another one.


In [0]:
greendale_frame = gd_sketch.explore(view=views[view_number], as_pandas=True)

Did you notice the "as_pandas=True" parameter that got passed to the "explore" function? That means that the data that we'll get back is a pandas DataFrame that we can now start exploring.

Let's start with seeing how many entries we got back.


In [0]:
greendale_frame.shape

This tells us that the view returned 670 events with 12 columns. Let's explore the first few entries, just so that we can wrap our heads around what we got back.


In [0]:
greendale_frame.head(5)

Let's look at what columns we got back... and maybe create a slice that contains fewer columns.


In [0]:
greendale_frame.columns

In [0]:
greendale_slice = greendale_frame[['datetime', 'timestamp_desc', 'tag', 'message', 'label']]

greendale_slice.head(4)

Since this is a result from the analyzers, there are a few extra fields we can pull in.

When you ran gd_sketch.explore?, did you notice the parameter called return_fields:

    return_fields: List of fields that should be included in the
        response.

We can use that to specify which fields we would like to get back. Let's add a few more fields (you can see which fields are available in the UI).


In [0]:
greendale_frame = gd_sketch.explore(
    view=views[view_number],
    return_fields='datetime,message,source_short,tag,timestamp_desc,url,domain,human_readable',
    as_pandas=True)

Let's briefly look at these events.


In [0]:
greendale_slice = greendale_frame[['datetime', 'timestamp_desc', 'tag', 'human_readable', 'url', 'domain']]

greendale_slice.head(5)

OK, since this is the phishy domains analyzer and all the results we got back are essentially from that analyzer, let's look at a few things. First of all, let's look at the tags that are available.


In [0]:
greendale_frame['tag_string'] = greendale_frame.tag.str.join('|')

greendale_frame.tag_string.unique()

OK, so some of these are tagged as whitelisted-domain. Let's look at the domains that are marked as "phishy", excluding those that are whitelisted.


In [0]:
greendale_frame[~greendale_frame.tag_string.str.contains('whitelisted-domain')].domain.value_counts()

Now we get to see all the domains that the analyzer considered potentially "phishy". Is there a domain that stands out? What about that grendale one?


In [0]:
greendale_slice[greendale_slice.domain == 'grendale.xyz']

This seems odd. Let's look at a few things: the human_readable string as well as the URL.


In [0]:
grendale = greendale_slice[greendale_slice.domain == 'grendale.xyz']

# Collect the unique human_readable strings produced by the phishy domains analyzer.
string_set = set()
for string_list in grendale.human_readable:
  string_set.update(x for x in string_list if 'phishy_domain' in x)

for entry in string_set:
  print('Human readable string is: {0:s}'.format(entry))

print('')
print('Counts for URL connections to the grendale domain:')
grendale_count = grendale.url.value_counts()
for index in grendale_count.index:
  print('[{0:d}] {1:s}'.format(grendale_count[index], index))

We can start doing a lot more now if we want to... let's look at when these things occurred...


In [0]:
grendale_array = grendale.url.unique()

greendale_slice[greendale_slice.url.isin(grendale_array)]

We can then start to look at surrounding events. Let's look at one date in particular: "2015-08-29 12:21:06".


In [0]:
query_dsl = """
{
	"query": {
		"bool": {
			"filter": {
				"bool": {
					"should": [
						{
							"range": {
								"datetime": {
									"gte": "2015-08-29T12:20:06",
									"lte": "2015-08-29T12:22:06"
								}
							}
						}
					]
				}
			},
			"must": [
				{
					"query_string": {
						"query": "*"
					}
				}
			]
		}
	},
	"size": 10000,
	"sort": {
		"datetime": "asc"
	}
}
"""

data = gd_sketch.explore(
    query_dsl=query_dsl,
    return_fields='message,human_readable,datetime,timestamp_desc,source_short,data_type,tags,url,domain',
    as_pandas=True)

In [0]:
data[['datetime', 'message', 'human_readable', 'url']].head(4)

Let's find the grendale entries and look at events two seconds before/after.


In [0]:
data[(data.datetime > '2015-08-29 12:21:04') & (data.datetime < '2015-08-29 12:21:08')][['datetime', 'message', 'timestamp_desc']]

Let's look at aggregation

Timesketch also has aggregation capabilities that we can call from the client. Let's take a quick look.

Start by checking out whether there are any stored aggregations that we can just take a look at.

You can also store your own aggregations using the gd_sketch.store_aggregation function. However we are not going to do that in this colab.


In [0]:
gd_sketch.list_aggregations()

OK, so there are some aggregations stored. Let's just pick one of those to take a closer look at.


In [0]:
aggregation = gd_sketch.list_aggregations()[0]

Now we've got an aggregation object that we can take a closer look at.


In [0]:
aggregation.name

In [0]:
aggregation.description

OK, so from the name we can determine that this has to do with the top 10 visited domains. We can also look at all of the stored aggregations.


In [0]:
pd.DataFrame([{'name': x.name, 'description': x.description} for x in gd_sketch.list_aggregations()])

Let's look at the aggregation visually, both as a table and a chart.


In [0]:
aggregation.table

In [0]:
aggregation.chart

We can also take a look at what aggregators can be used, if we want to run our own custom aggregator.


In [0]:
gd_sketch.list_available_aggregators()

Now we can see that there are at least the "field_bucket" and "query_bucket" aggregators that we can look at. The field_bucket one is a terms bucket aggregation, which means we can take any field in the dataset and aggregate on that.

So if we, for instance, want to see the top 20 domains that were visited, we can just ask for an aggregation on the domain field and limit it to 20 records (which gives us the top 20). Let's do that:


In [0]:
aggregator = gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={'field': 'domain', 'limit': 20, 'supported_charts': 'barchart'})

Now we've got an aggregation object that we can take a closer look at... let's look at the data it stored. What we were trying to get out was the top 20 domains that were visited.


In [0]:
aggregator.table

Or we can look at this visually... as a chart


In [0]:
aggregator.chart

We can also do something a bit more complex. The other aggregator, query_bucket, works in a similar way, except that you can filter the results first. Let's aggregate all the domains that have been tagged with the phishy-domain tag.


In [0]:
tag_aggregator = gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'domain',
        'query_string': 'tag:"phishy-domain"',
        'supported_charts': 'barchart',
    }
)

Let's look at the results.


In [0]:
tag_aggregator.table

We can also look at all the tags in the timeline: what tags have been applied and how frequent they are.


In [0]:
gd_sketch.run_aggregator(
    aggregator_name='field_bucket',
    aggregator_parameters={
        'field': 'tag',
        'limit': 10,
    }
).table

And then let's see which applications were executed most frequently on the machine.

Since not all of the execution events have the same fields in them, we'll have to create a few tables here. Let's start by looking at which data types are there.


In [0]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'data_type',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

And then we can do a summary for each one.


In [0]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'path',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

In [0]:
gd_sketch.run_aggregator(
    aggregator_name='query_bucket',
    aggregator_parameters={
        'field': 'link_target',
        'query_string': 'tag:"application_execution"',
        'supported_charts': 'barchart',
    }
).table

Let's look at logins...

Let's do a search to look at login entries...


In [0]:
login_data = gd_sketch.explore(
    'data_type:"windows:evtx:record" AND event_identifier:4624', 
    return_fields='datetime,timestamp_desc,human_readable,message,tag,event_identifier,computer_name,record_number,recovered,strings,username',
    as_pandas=True
)

This will produce quite a few events. Let's look at how many.


In [0]:
login_data.shape

Let's look at usernames....


In [0]:
login_data.username.value_counts()

It seems as if the login analyzer was not working properly, so let's extract these fields manually.


In [0]:
# Extract the account name, account domain and process name from the message string.
login_data['account_name'] = login_data.message.str.extract(
    r'Account Name:.+Account Name:\\t\\t([^\\]+)\\n', expand=False)
login_data['account_domain'] = login_data.message.str.extract(
    r'Account Domain:.+Account Domain:\\t\\t([^\\]+)\\n', expand=False)
login_data['process_name'] = login_data.message.str.extract(
    r'Process Name:.+Process Name:\\t\\t([^\\]+)\\n', expand=False)
# Convert the datetime column into proper pandas timestamps.
login_data['date'] = pd.to_datetime(login_data.datetime)

What accounts have logged in:


In [0]:
login_data.account_name.value_counts()

Let's look at all the computers in there...


In [0]:
login_data.computer_name.value_counts()

Let's graph... you can then interact with the graph: try zooming in, etc.

First we'll define a graph function that we can then call with parameters...


In [0]:
def GraphLogins(data_frame, machine_name=None):
  """Returns a bar chart of login counts per account, optionally for a single machine."""

  if machine_name:
    data_slice = data_frame[data_frame.computer_name == machine_name]
    title = 'Accounts Logged In - {0:s}'.format(machine_name)
  else:
    data_slice = data_frame
    title = 'Accounts Logged In'
    
  data_grouped = data_slice[['account_name', 'date']].groupby('account_name', as_index=False).count()
  data_grouped['count'] = data_grouped.date
  del data_grouped['date']

  return alt.Chart(data_grouped, width=400).mark_bar().encode(
    x='account_name', y='count',
    tooltip=['account_name', 'count']
  ).properties(
    title=title
  ).interactive()

Start by graphing all machines


In [0]:
GraphLogins(login_data)

Or we can look at this for a particular machine:


In [0]:
GraphLogins(login_data, 'Student-PC1.internal.greendale.edu')

Or we can look at this as a scatter plot...

First we'll define a function that munges the data for us. This function will essentially graph all logins per day with a scatter plot, using colors to denote the count value.

This graph will be very interactive: try selecting a time period by clicking on the upper graph and dragging a selection.


In [0]:
login_data['day'] = login_data['date'].dt.strftime('%Y-%m-%d')

def GraphScatterLogin(data_frame, machine_name=''):
  """Returns linked scatter and bar charts of logins per day, account and machine."""
  if machine_name:
    data_slice = data_frame[data_frame.computer_name == machine_name]
    title = 'Accounts Logged In - {0:s}'.format(machine_name)
  else:
    data_slice = data_frame
    title = 'Accounts Logged In'
  
  login_grouped = data_slice[['day', 'computer_name', 'account_name', 'message']].groupby(['day', 'computer_name', 'account_name'], as_index=False).count()
  login_grouped['count'] = login_grouped.message
  del login_grouped['message']
    
  brush = alt.selection_interval(encodings=['x'])
  click = alt.selection_multi(encodings=['color'])
  color = alt.Color('count:Q')

  chart1 = alt.Chart(login_grouped).mark_point().encode(
      x='day', 
      y='account_name',
      color=alt.condition(brush, color, alt.value('lightgray')),
  ).properties(
      title=title,
      width=600
  ).add_selection(
      brush
  ).transform_filter(
      click
  )
  
  chart2 = alt.Chart(login_grouped).mark_bar().encode(
      x='count',
      y='account_name',
      color=alt.condition(brush, color, alt.value('lightgray')),
      tooltip=['count'],
  ).transform_filter(
      brush
  ).properties(
      width=600
  ).add_selection(
      click
  )
  
  return chart1 & chart2

OK, let's start by graphing all logins...


In [0]:
GraphScatterLogin(login_data)

And now just for the Student-PC1


In [0]:
GraphScatterLogin(login_data, 'Student-PC1.internal.greendale.edu')

And now it is your time to shine: experiment with Python, pandas, the graphing library and other data science techniques.
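
To get you started, here is a small, self-contained sketch that only uses the login_data frame built above to count logins per machine and account; tweak it or replace it with your own experiments:


In [0]:
# Count logins per machine and account using the frame built earlier.
login_summary = login_data.groupby(
    ['computer_name', 'account_name']).size().reset_index(name='count')
login_summary.sort_values('count', ascending=False).head(10)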


In [0]: