Dirk Loss, http://dirk-loss.de, @dloss. v1.1, 2013-06-02
This IPython notebook shows how to analyse network traffic using Wireshark's tshark (to parse the packets), Pandas (for the analysis) and Matplotlib (for plotting).
Pandas allows for very flexible analysis, treating your PCAP files as a timeseries of packet data.
So if the statistics provided by Wireshark are not enough, you might want to try this. And it's more fun, of course. :)
First we need a PCAP file. I chose a sample file from the Digital Corpora site that has been used for courses in network forensics:
In [1]:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')
Out[1]:
In [2]:
!mkdir -p pcap
In [3]:
cd pcap
We can download it using curl or pure Python. Just uncomment one of the following cells:
In [4]:
url="http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap"
In [5]:
# If you have curl installed, we can get nice progress bars:
#!curl -o nitroba.pcap $url
In [6]:
# Or use pure Python:
# import urllib
# urllib.urlretrieve(url, "nitroba.pcap")
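The commented-out cell above uses the Python 2 module layout. On Python 3 the same function lives in `urllib.request`. A minimal sketch, in which the helper name `download` and the skip-if-present check are illustrative assumptions rather than part of the original notebook:

```python
import os
import urllib.request


def download(url, dest):
    """Fetch url into dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Usage (needs network access, so it is left commented out here):
# download(url, "nitroba.pcap")
```

Skipping the download when the file is already present makes it safe to re-run the notebook from the top.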
In [7]:
ls -l nitroba.pcap
In [8]:
!md5sum nitroba.pcap
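If `md5sum` is not available on your system, the checksum can also be computed with the standard library. This is a sketch for Python 3; the function name `md5_of_file` and the chunk size are assumptions made for illustration:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large captures need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: md5_of_file("nitroba.pcap")
```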
We can use the tshark command from the Wireshark tool suite to read the PCAP file and convert it into a tab-separated file. This might not be very fast, but it is very flexible, because all of Wireshark's display filters can be used to select the packets that we are interested in.
In [9]:
!tshark -v
For now, I just select the frame number and the frame length and redirect the output to a file:
In [10]:
!tshark -n -r nitroba.pcap -T fields -Eheader=y -e frame.number -e frame.len > frame.len
Let's have a look at the file:
In [11]:
!head -10 frame.len
Two columns, tab-separated. (Not exactly CSV, but who cares. ;-)
Pandas can read those tables into a DataFrame object:
In [12]:
import pandas as pd
In [13]:
df=pd.read_table("frame.len")
The object has a nice default representation that shows the number of values in each column:
In [14]:
df
Out[14]:
Some statistics about the frame length:
In [15]:
df["frame.len"].describe()
Out[15]:
The minimum and maximum frame lengths are plausible for an Ethernet connection.
For a better overview, we plot the frame length over time.
We initialise IPython to show inline graphics:
In [16]:
%pylab inline
Set a figure size in inches:
In [17]:
figsize(10,6)
Pandas automatically uses Matplotlib for plotting. We plot with small dots and an alpha channel of 0.2:
In [18]:
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")
Out[18]:
So there are always lots of small packets (< 100 bytes) and lots of large packets (> 1400 bytes). Some bursts of packets with other sizes (around 400 bytes, 1000 bytes, etc.) can be clearly seen.
Passing all those arguments to tshark is quite cumbersome. Here is a convenience function that reads the given fields into a Pandas DataFrame:
In [19]:
import subprocess
import datetime
import pandas as pd
def read_pcap(filename, fields=[], display_filter="",
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object.

    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields
                    (Default: false)
    timeseries:     Create DatetimeIndex from frame.time_epoch
                    (Default: false)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:
    http://www.wireshark.org/docs/dfref/
    """
    # Copy so that the caller's list (and the default) is never modified.
    fields = list(fields)
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

    # In strict mode, every requested field must be present in the frame.
    display_filters = list(fields) if strict else []
    if display_filter:
        display_filters.append(display_filter)
    filterspec = ""
    if display_filters:
        filterspec = "-R '%s'" % " and ".join(display_filters)

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout,
                           index_col="frame.time_epoch",
                           parse_dates=True,
                           date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df
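The function above assembles a shell command string, which gets fragile as soon as a filename or filter contains quotes or spaces. As a sketch of an alternative, the arguments can be built as a list and passed to `subprocess.Popen` without `shell=True`. The helper name `build_tshark_args` and its `filter_flag` parameter are hypothetical. The notebook uses `-R`; newer tshark releases expect `-Y` for display filters, so the flag is left configurable:

```python
def build_tshark_args(filename, fields=(), display_filter="",
                      strict=False, filter_flag="-R"):
    """Build a tshark argument list (hypothetical helper, no shell involved)."""
    args = ["tshark", "-r", filename, "-n", "-T", "fields", "-E", "header=y"]
    # In strict mode each field name doubles as an "is present" filter.
    filters = list(fields) if strict else []
    if display_filter:
        # Parentheses keep an "or" inside the user filter from
        # changing the meaning of the combined expression.
        filters.append("(%s)" % display_filter)
    if filters:
        args += [filter_flag, " and ".join(filters)]
    for field in fields:
        args += ["-e", field]
    return args


# Usage:
# proc = subprocess.Popen(build_tshark_args("nitroba.pcap", ["frame.len"]),
#                         stdout=subprocess.PIPE)
```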
We will use this function for the rest of the analysis.
By summing up the frame lengths we can calculate the complete (Ethernet) bandwidth used. First use our convenience function to read the PCAP into a DataFrame:
In [20]:
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen
Out[20]:
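With `timeseries=True`, the `frame.time_epoch` column (seconds since 1970) becomes the index. The `date_parser` argument used for this has been deprecated in recent pandas releases, so here is a sketch of the same idea with `pd.to_datetime`. The two sample rows are made up for illustration, and unlike `datetime.fromtimestamp` (local time) this variant yields UTC-based timestamps:

```python
import io

import pandas as pd

# Made-up sample in the shape of tshark's "-T fields -E header=y" output.
sample = (
    "frame.time_epoch\tframe.len\n"
    "1216691468.5\t70\n"
    "1216691469.0\t1514\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t")
# Interpret the epoch column as seconds and use it as the index.
df.index = pd.to_datetime(df.pop("frame.time_epoch"), unit="s")
```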
Then we re-sample the timeseries into buckets of 1 second, summing over the lengths of all frames that were captured in that second:
In [21]:
bytes_per_second=framelen.resample("S", how="sum")
Here are the first 5 rows. We get NaN for those timestamps where no frames were captured:
In [22]:
bytes_per_second.head()
Out[22]:
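The `how="sum"` argument belongs to the pandas version this notebook was written for; later releases dropped it in favour of calling the aggregation on the resampler. Their `sum()` also returns 0 for empty buckets unless `min_count=1` is given. A sketch on a tiny synthetic stand-in for `framelen` (the three frames are invented for illustration):

```python
import pandas as pd

# Three made-up frames; nothing is captured during second 1.
index = pd.to_datetime([0, 500, 2000], unit="ms")
framelen = pd.DataFrame({"frame.len": [60, 1500, 100]}, index=index)

per_second = framelen.resample(pd.Timedelta(seconds=1))
bytes_per_second = per_second.sum(min_count=1)  # empty buckets stay NaN
filled = per_second.sum()                       # empty buckets become 0
```

Which variant fits depends on the question: NaN keeps "nothing captured" visible in the table, while 0 is usually what you want when plotting bandwidth.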
In [23]:
bytes_per_second.plot()
Out[23]:
Let's try to replicate the TCP Time-Sequence Graph that is known from Wireshark (Statistics > TCP Stream Analysis > Time-Sequence Graph (Stevens)).
In [24]:
fields=["tcp.stream", "ip.src", "ip.dst", "tcp.seq", "tcp.ack", "tcp.window_size", "tcp.len"]
ts=read_pcap("nitroba.pcap", fields, timeseries=True, strict=True)
ts
Out[24]:
Now we have to select a TCP stream to analyse. As an example, we just pick stream number 10:
In [25]:
stream=ts[ts["tcp.stream"] == 10]
In [26]:
stream
Out[26]:
Pandas only prints the overview because the table is too wide. So we force a display:
In [27]:
print stream.to_string()
Add a column that shows who sent the packet (client or server).
The fancy lambda expression is a function that distinguishes between the client and the server side of the stream. It compares the source IP address of each packet with the source IP address of the first packet in the stream (for TCP streams, that one should have been sent by the client).
In [28]:
stream["type"] = stream.apply(lambda x: "client" if x["ip.src"] == stream.irow(0)["ip.src"] else "server", axis=1)
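The same labelling can be done without a row-wise `apply`. The sketch below compares the whole `ip.src` column at once and uses `iloc`, since `irow` is no longer available in later pandas releases. The addresses are placeholders from the documentation ranges, not values from the capture:

```python
import pandas as pd

# Made-up stand-in for one TCP stream; the first packet comes from the client.
stream = pd.DataFrame({
    "ip.src": ["192.0.2.10", "198.51.100.7", "192.0.2.10"],
    "tcp.seq": [1, 1, 518],
})

# Work on a copy so that the new column is not written into a slice of `ts`.
stream = stream.copy()
client_ip = stream["ip.src"].iloc[0]
is_client = stream["ip.src"] == client_ip
stream["type"] = is_client.map({True: "client", False: "server"})
```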
In [29]:
print stream.to_string()
In [30]:
client_stream=stream[stream.type == "client"]
In [31]:
client_stream["tcp.seq"].plot(style="r-o")
Out[31]:
Notice that the x-axis shows the real timestamps.
For comparison, change the x-axis to be the packet number in the stream:
In [32]:
client_stream.index = arange(len(client_stream))
client_stream["tcp.seq"].plot(style="r-o")
Out[32]:
Looks different of course.
In [33]:
per_stream=ts.groupby("tcp.stream")
per_stream.head()
Out[33]:
In [34]:
bytes_per_stream = per_stream["tcp.len"].sum()
bytes_per_stream.head()
Out[34]:
In [35]:
bytes_per_stream.plot()
Out[35]:
In [36]:
bytes_per_stream.max()
Out[36]:
In [37]:
biggest_stream=bytes_per_stream.idxmax()
biggest_stream
Out[37]:
In [38]:
bytes_per_stream.ix[biggest_stream]
Out[38]:
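The cells above group the packets by `tcp.stream`, sum the payload lengths and look up the largest stream. Here is the same chain on a small synthetic table (the numbers are invented), written with `.loc`, because the `.ix` indexer was removed from later pandas releases:

```python
import pandas as pd

# Made-up packets: two streams with different payload totals.
ts = pd.DataFrame({
    "tcp.stream": [0, 0, 1, 1, 1],
    "tcp.len": [100, 200, 0, 1460, 1460],
})

bytes_per_stream = ts.groupby("tcp.stream")["tcp.len"].sum()
biggest_stream = bytes_per_stream.idxmax()            # label of the largest sum
biggest_bytes = bytes_per_stream.loc[biggest_stream]  # label-based lookup
```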
Let's have a look at the padding of the Ethernet frames. Some cards have been leaking data in the past. For more details, see http://www.securiteam.com/securitynews/5BP01208UO.html
In [39]:
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df
Out[39]:
In [40]:
trailer=trailer_df["eth.trailer"]
trailer
Out[40]:
Ok. Most frames do not seem to have padding, but some do. Let's count per value to get an overview:
In [41]:
trailer.value_counts()
Out[41]:
Mostly zeros, but some data. Let's decode the hex strings:
In [42]:
import binascii
def unhex(s, sep=":"):
    return binascii.unhexlify("".join(s.split(sep)))
In [43]:
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s
Out[43]:
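On Python 3 the decoded trailer is a `bytes` object rather than a `str`. A sketch of an equivalent decoder using the built-in `bytes.fromhex` (same name and signature as the function above, offered only as an alternative):

```python
def unhex(s, sep=":"):
    """Decode a separator-delimited hex string such as "3b:02" into bytes."""
    return bytes.fromhex(s.replace(sep, ""))


# Iterating over the result yields integers, not one-character strings.
trailer_bytes = unhex("3b:02:a7:19")
```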
In [44]:
padding = trailer_df.dropna()
In [45]:
padding["unhex"]=padding["eth.trailer"].map(unhex)
In [46]:
def printable(s):
    chars = []
    for c in s:
        if c.isalnum():
            chars.append(c)
        else:
            chars.append(".")
    return "".join(chars)
In [47]:
printable("\x95asd\x33")
Out[47]:
In [48]:
padding["printable"]=padding["unhex"].map(printable)
In [49]:
padding["printable"].value_counts()
Out[49]:
In [50]:
def ratio_printable(s):
    printable = sum(1.0 for c in s if c.isalnum())
    return printable / len(s)
In [51]:
ratio_printable("a\x93sdfs")
Out[51]:
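Both helpers assume Python 2 strings, where iterating yields characters. If the padding is held as `bytes` (as it would be on Python 3), iterating yields integers and `c.isalnum()` fails. A sketch of byte-oriented variants follows; the `ALNUM` constant and the decision to return 0.0 for empty input are assumptions made here:

```python
import string

# Byte values of the ASCII letters and digits.
ALNUM = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def printable(data):
    """Replace every non-alphanumeric byte with a dot."""
    return "".join(chr(b) if b in ALNUM else "." for b in data)


def ratio_printable(data):
    """Fraction of ASCII alphanumeric bytes; 0.0 for empty input."""
    if not data:
        return 0.0
    return sum(1 for b in data if b in ALNUM) / len(data)
```

Restricting the check to ASCII also keeps the result independent of the locale.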
In [52]:
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)
In [53]:
padding[padding["ratio_printable"] > 0.5]
Out[53]:
In [54]:
_.printable.value_counts()
Out[54]:
Now find out which Ethernet cards sent those packets with more than 50% alphanumeric characters in their padding:
In [55]:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()
Out[55]:
In [56]:
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')
Out[56]:
That's "Hon Hai Precision" (and "Netopia Inc" for the other MAC address).
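Vendor lookups like the one above typically depend only on the first three octets of the MAC address, the OUI. A small sketch for extracting that prefix, so that many addresses can be reduced to a few vendor prefixes before looking them up (the helper name `oui` is an assumption):

```python
def oui(mac):
    """Return the vendor prefix (first three octets) of a MAC address."""
    octets = mac.replace("-", ":").lower().split(":")
    if len(octets) != 6:
        raise ValueError("expected six octets, got %r" % mac)
    return ":".join(octets[:3])


# Usage: padding["eth.src"].map(oui).drop_duplicates()
```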