Analysing network traffic with Pandas

Dirk Loss, http://dirk-loss.de, @dloss. v1.1, 2013-06-02

This IPython notebook shows how to analyse network traffic using tshark (from the Wireshark tool suite), Pandas and Matplotlib.

Pandas allows for very flexible analysis, treating your PCAP files as a timeseries of packet data.

So if the statistics provided by Wireshark are not enough, you might want to try this. And it's more fun, of course. :)

Get a PCAP file

First we need a PCAP file. I chose a sample file from the Digital Corpora site that has been used for courses in network forensics:


In [1]:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')


Out[1]:

In [2]:
!mkdir -p pcap

In [3]:
cd pcap


/home/dirk/projects/pcap

We can download it using curl or pure Python. Just uncomment one of the following cells:


In [4]:
url="http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap"

In [5]:
# If you have curl installed, we can get nice progress bars:
#!curl -o nitroba.pcap $url

In [6]:
# Or use pure Python:
# import urllib
# urllib.urlretrieve(url, "nitroba.pcap")

In [7]:
ls -l nitroba.pcap


-rw-rw-r-- 1 dirk dirk 56795590 Jun  2 12:10 nitroba.pcap

In [8]:
!md5sum nitroba.pcap


d6b5df10fc572b54ceb9c543d11f10a4  nitroba.pcap
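
On systems without md5sum, the same integrity check can be done in pure Python. This is only a minimal sketch: the helper name md5_of_file and the chunk size are illustrative choices, not part of the original notebook.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large capture files do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (assuming the file has been downloaded as described above):
# print(md5_of_file("nitroba.pcap"))
```

The result can then be compared with the digest printed by md5sum above.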

Convert PCAP to a CSV using tshark

We can use the tshark command from the Wireshark tool suite to read the PCAP file and convert it into a tab-separated file. This might not be very fast, but it is very flexible, because all of Wireshark's display filters can be used to select the packets that we are interested in.


In [9]:
!tshark -v


TShark 1.6.7

Copyright 1998-2012 Gerald Combs <gerald@wireshark.org> and contributors.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Compiled (32-bit) with GLib 2.32.0, with libpcap (version unknown), with libz
1.2.3.4, with POSIX capabilities (Linux), without libpcre, with SMI 0.4.8, with
c-ares 1.7.5, with Lua 5.1, without Python, with GnuTLS 2.12.14, with Gcrypt
1.5.0, with MIT Kerberos, with GeoIP.

Running on Linux 3.2.0-45-generic, with libpcap version 1.1.1, with libz
1.2.3.4.

Built using gcc 4.6.3.

For now, I just select the frame number and the frame length and redirect the output to a file:


In [10]:
!tshark -n -r nitroba.pcap -T fields -Eheader=y -e frame.number -e frame.len > frame.len

Let's have a look at the file:


In [11]:
!head -10 frame.len


frame.number	frame.len
1	70
2	70
3	1421
4	70
5	1284
6	70
7	70
8	70
9	78

Two columns, tab-separated. (Not exactly CSV, but who cares. ;-)

Pandas can read those tables into a DataFrame object:


In [12]:
import pandas as pd

In [13]:
df=pd.read_table("frame.len")
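
To try the parsing step without a PCAP file at hand, the first lines of the tshark output can be fed to Pandas from a string. This is a small sketch, assuming a reasonably recent Pandas; the names sample and df_sample are made up for the illustration.

```python
import io
import pandas as pd

# First lines of the tshark output shown above (tab-separated, with header).
sample = "frame.number\tframe.len\n1\t70\n2\t70\n3\t1421\n"

# read_csv with sep="\t" parses tab-separated text, just like read_table.
df_sample = pd.read_csv(io.StringIO(sample), sep="\t")
print(df_sample.dtypes)
```

Both columns should come out as integers, because every value in them parses as a number.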

The object has a nice default representation that shows the number of values in each column:


In [14]:
df


Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95175 entries, 0 to 95174
Data columns (total 2 columns):
frame.number    95175  non-null values
frame.len       95175  non-null values
dtypes: int64(2)

Some statistics about the frame length:


In [15]:
df["frame.len"].describe()


Out[15]:
count    95175.000000
mean       580.748789
std        625.757017
min         42.000000
25%         70.000000
50%         87.000000
75%       1466.000000
max       1466.000000
dtype: float64

The minimum and maximum frame lengths are plausible for an Ethernet connection.

Plotting

For a better overview, we plot the frame length over time.

We initialise IPython to show inline graphics:


In [16]:
%pylab inline


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Set a figure size in inches:


In [17]:
figsize(10,6)

Pandas automatically uses Matplotlib for plotting. We plot with small dots and an alpha channel of 0.2:


In [18]:
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")


Out[18]:
<matplotlib.text.Text at 0x9aaf58c>

So there are always lots of small packets (< 100 bytes) and lots of large packets (> 1400 bytes). Some bursts of packets with other sizes (around 400 bytes, 1000 bytes, etc.) can be clearly seen.

A Python function to read PCAP files into Pandas DataFrames

Passing all those arguments to tshark is quite cumbersome. Here is a convenience function that reads the given fields into a Pandas DataFrame:


In [19]:
import subprocess
import datetime
import pandas as pd

def read_pcap(filename, fields=None, display_filter="",
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object.
    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields
                    (Default: False)
    timeseries:     Create DatetimeIndex from frame.time_epoch
                    (Default: False)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:

      http://www.wireshark.org/docs/dfref/
    """
    # Copy the list so that the caller's argument is never modified
    # (a mutable default argument would be shared between calls).
    fields = list(fields) if fields else []
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

    # With strict=True every field name doubles as a display filter.
    display_filters = list(fields) if strict else []
    if display_filter:
        display_filters.append(display_filter)
    filterspec = "-R '%s'" % " and ".join(f for f in display_filters)

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell = True, 
                                 stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout, 
                        index_col = "frame.time_epoch", 
                        parse_dates=True, 
                        date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df

We will use this function in the further analysis.
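
The command line can also be assembled as an argument list, so that no shell quoting is needed for file names or filters. The following is a sketch only: the helper tshark_args and its filter_flag parameter are hypothetical, and the default "-Y" assumes a newer tshark release (the 1.6.7 used in this notebook takes the display filter with "-R").

```python
def tshark_args(filename, fields, display_filter="", filter_flag="-Y"):
    """Build the tshark argument list for subprocess (no shell involved).

    filter_flag is an assumption: "-Y" on newer tshark releases,
    "-R" on old ones such as the 1.6.7 used in this notebook.
    """
    args = ["tshark", "-r", filename, "-n", "-T", "fields", "-E", "header=y"]
    for field in fields:
        args += ["-e", field]
    # Only pass a filter option if there is something to filter on.
    if display_filter:
        args += [filter_flag, display_filter]
    return args

print(tshark_args("nitroba.pcap", ["frame.number", "frame.len"], "tcp"))
```

Such a list can be handed to subprocess.Popen(args, stdout=subprocess.PIPE) without shell=True.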

Bandwidth

By summing up the frame lengths we can calculate the complete (Ethernet) bandwidth used. First use our convenience function to read the PCAP into a DataFrame:


In [20]:
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen


Out[20]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 1 columns):
frame.len    95175  non-null values
dtypes: int64(1)
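
Note that datetime.datetime.fromtimestamp converts to the local time zone of the machine, and that recent Pandas releases deprecate the date_parser argument. As an alternative, the epoch column can be converted after reading. This is a sketch with made-up sample data; pd.to_datetime with unit="s" interprets the values as seconds since 1970-01-01 UTC.

```python
import io
import pandas as pd

# Hypothetical tshark output with epoch timestamps (seconds since 1970-01-01 UTC).
sample = "frame.time_epoch\tframe.len\n0.5\t70\n86400.0\t1421\n"

frames = pd.read_csv(io.StringIO(sample), sep="\t")
# Convert the float epoch column into timestamps and use it as the index.
frames.index = pd.to_datetime(frames.pop("frame.time_epoch"), unit="s")
print(frames)
```

The resulting timestamps are in UTC, so they can differ from the ones shown in this notebook by the local UTC offset.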

Then we re-sample the timeseries into buckets of 1 second, summing over the lengths of all frames that were captured in that second:


In [21]:
bytes_per_second=framelen.resample("S", how="sum")

Here are the first 5 rows. We get NaN for those timestamps where no frames were captured:


In [22]:
bytes_per_second.head()


Out[22]:
frame.len
frame.time_epoch
2008-07-22 03:51:07 20729
2008-07-22 03:51:08 8426
2008-07-22 03:51:09 13565
2008-07-22 03:51:10 NaN
2008-07-22 03:51:11 NaN
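
Newer Pandas releases no longer accept the how= argument of resample; the aggregation is called as a method instead. Here is a sketch on synthetic data (the frame lengths and timestamps are invented), assuming a Pandas version that supports min_count in sum():

```python
import pandas as pd

# Synthetic frame lengths: two frames in second 0, one in second 1, one in second 4.
index = pd.to_datetime([0.1, 0.5, 1.2, 4.0], unit="s")
frames = pd.DataFrame({"frame.len": [70, 1421, 70, 78]}, index=index)

# min_count=1 keeps seconds without any frame as NaN instead of 0.
per_second = frames.resample("1s").sum(min_count=1)
print(per_second)
```

Without min_count=1, seconds without traffic would show up as 0, which may actually be preferable for a bandwidth plot.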

In [23]:
bytes_per_second.plot()


Out[23]:
<matplotlib.axes.AxesSubplot at 0x9f40d2c>

TCP Time-Sequence Graph

Let's try to replicate the TCP Time-Sequence Graph that is known from Wireshark (Statistics > TCP Stream Analysis > Time-Sequence Graph (Stevens)).


In [24]:
fields=["tcp.stream", "ip.src", "ip.dst", "tcp.seq", "tcp.ack", "tcp.window_size", "tcp.len"]
ts=read_pcap("nitroba.pcap", fields, timeseries=True, strict=True)
ts


Out[24]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 81451 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 7 columns):
tcp.stream         81451  non-null values
ip.src             81451  non-null values
ip.dst             81451  non-null values
tcp.seq            81451  non-null values
tcp.ack            81451  non-null values
tcp.window_size    81451  non-null values
tcp.len            81451  non-null values
dtypes: int64(5), object(2)

Now we have to select a TCP stream to analyse. As an example, we just pick stream number 10:


In [25]:
stream=ts[ts["tcp.stream"] == 10]

In [26]:
stream


Out[26]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26 entries, 2008-07-22 03:51:08.431406 to 2008-07-22 03:53:29.160668
Data columns (total 7 columns):
tcp.stream         26  non-null values
ip.src             26  non-null values
ip.dst             26  non-null values
tcp.seq            26  non-null values
tcp.ack            26  non-null values
tcp.window_size    26  non-null values
tcp.len            26  non-null values
dtypes: int64(5), object(2)

Pandas only prints the overview because the table is too wide. So we force a display:


In [27]:
print stream.to_string()


                            tcp.stream         ip.src         ip.dst  tcp.seq  tcp.ack  tcp.window_size  tcp.len
frame.time_epoch                                                                                                
2008-07-22 03:51:08.431406          10  209.85.171.97   192.168.1.64        0        1             5672        0
2008-07-22 03:51:08.437600          10   192.168.1.64  209.85.171.97        1        1           524280        0
2008-07-22 03:51:08.438156          10   192.168.1.64  209.85.171.97        1        1           524280      153
2008-07-22 03:51:08.467383          10  209.85.171.97   192.168.1.64        1      154             6784        0
2008-07-22 03:51:08.469846          10  209.85.171.97   192.168.1.64        1      154             6784     1177
2008-07-22 03:51:08.474440          10   192.168.1.64  209.85.171.97      154     1178           523712        0
2008-07-22 03:51:08.547444          10   192.168.1.64  209.85.171.97      154     1178           524280      267
2008-07-22 03:51:08.547498          10   192.168.1.64  209.85.171.97      421     1178           524280        6
2008-07-22 03:51:08.547768          10   192.168.1.64  209.85.171.97      427     1178           524280       41
2008-07-22 03:51:08.589823          10  209.85.171.97   192.168.1.64     1178      468             7872       47
2008-07-22 03:51:08.592029          10   192.168.1.64  209.85.171.97      468     1225           524280        0
2008-07-22 03:51:08.594719          10   192.168.1.64  209.85.171.97      468     1225           524280      604
2008-07-22 03:51:08.633074          10  209.85.171.97   192.168.1.64     1225     1072             9024     1344
2008-07-22 03:51:08.635798          10   192.168.1.64  209.85.171.97     1072     2569           523552        0
2008-07-22 03:51:09.295395          10   192.168.1.64  209.85.171.97     1072     2569           524280     1024
2008-07-22 03:51:09.337628          10  209.85.171.97   192.168.1.64     2569     2096            11072      354
2008-07-22 03:51:09.340889          10   192.168.1.64  209.85.171.97     2096     2923           524280        0
2008-07-22 03:53:09.324698          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:09.561366          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:10.020463          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:10.734440          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:11.956795          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:13.662067          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:15.876856          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:20.305760          10  209.85.171.97   192.168.1.64     2923     2096            11072        0
2008-07-22 03:53:29.160668          10  209.85.171.97   192.168.1.64     2923     2096            11072        0

Add a column that shows who sent the packet (client or server).

The fancy lambda expression is a function that distinguishes between the client and the server side of the stream by comparing the source IP address with the source IP address of the first packet in the stream (for TCP streams, that packet should have been sent by the client). This assumes that the capture contains the very first packet of the connection; otherwise the labels may be swapped.


In [28]:
stream["type"] = stream.apply(lambda x: "client" if x["ip.src"] == stream.irow(0)["ip.src"] else "server", axis=1)
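
The same labelling can be written without apply, which tends to be faster on large streams and avoids irow, which newer Pandas releases no longer provide. This is a sketch on a tiny hand-made frame: stream_demo and first_src are illustrative names, and the addresses are taken from the table above.

```python
import numpy as np
import pandas as pd

# Three packets of a hypothetical stream.
stream_demo = pd.DataFrame({
    "ip.src": ["209.85.171.97", "192.168.1.64", "209.85.171.97"],
    "tcp.len": [0, 153, 1177],
})

# Source address of the first packet, selected by position.
first_src = stream_demo["ip.src"].iloc[0]
# Vectorised comparison: one label per row.
stream_demo["type"] = np.where(stream_demo["ip.src"] == first_src,
                               "client", "server")
print(stream_demo)
```

When the frame is a filtered selection of a larger one, taking a .copy() of it first avoids Pandas' warning about assigning to a slice.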

In [29]:
print stream.to_string()


                            tcp.stream         ip.src         ip.dst  tcp.seq  tcp.ack  tcp.window_size  tcp.len    type
frame.time_epoch                                                                                                        
2008-07-22 03:51:08.431406          10  209.85.171.97   192.168.1.64        0        1             5672        0  client
2008-07-22 03:51:08.437600          10   192.168.1.64  209.85.171.97        1        1           524280        0  server
2008-07-22 03:51:08.438156          10   192.168.1.64  209.85.171.97        1        1           524280      153  server
2008-07-22 03:51:08.467383          10  209.85.171.97   192.168.1.64        1      154             6784        0  client
2008-07-22 03:51:08.469846          10  209.85.171.97   192.168.1.64        1      154             6784     1177  client
2008-07-22 03:51:08.474440          10   192.168.1.64  209.85.171.97      154     1178           523712        0  server
2008-07-22 03:51:08.547444          10   192.168.1.64  209.85.171.97      154     1178           524280      267  server
2008-07-22 03:51:08.547498          10   192.168.1.64  209.85.171.97      421     1178           524280        6  server
2008-07-22 03:51:08.547768          10   192.168.1.64  209.85.171.97      427     1178           524280       41  server
2008-07-22 03:51:08.589823          10  209.85.171.97   192.168.1.64     1178      468             7872       47  client
2008-07-22 03:51:08.592029          10   192.168.1.64  209.85.171.97      468     1225           524280        0  server
2008-07-22 03:51:08.594719          10   192.168.1.64  209.85.171.97      468     1225           524280      604  server
2008-07-22 03:51:08.633074          10  209.85.171.97   192.168.1.64     1225     1072             9024     1344  client
2008-07-22 03:51:08.635798          10   192.168.1.64  209.85.171.97     1072     2569           523552        0  server
2008-07-22 03:51:09.295395          10   192.168.1.64  209.85.171.97     1072     2569           524280     1024  server
2008-07-22 03:51:09.337628          10  209.85.171.97   192.168.1.64     2569     2096            11072      354  client
2008-07-22 03:51:09.340889          10   192.168.1.64  209.85.171.97     2096     2923           524280        0  server
2008-07-22 03:53:09.324698          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:09.561366          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:10.020463          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:10.734440          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:11.956795          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:13.662067          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:15.876856          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:20.305760          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client
2008-07-22 03:53:29.160668          10  209.85.171.97   192.168.1.64     2923     2096            11072        0  client

In [30]:
client_stream=stream[stream.type == "client"]

In [31]:
client_stream["tcp.seq"].plot(style="r-o")


Out[31]:
<matplotlib.axes.AxesSubplot at 0xa1e454c>

Notice that the x-axis shows the real timestamps.

For comparison, change the x-axis to be the packet number in the stream:


In [32]:
client_stream.index = arange(len(client_stream))
client_stream["tcp.seq"].plot(style="r-o")


Out[32]:
<matplotlib.axes.AxesSubplot at 0xa1d91ac>

Looks different of course.

Bytes per stream


In [33]:
per_stream=ts.groupby("tcp.stream")
per_stream.head()


Out[33]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9913 entries, (0, 2008-07-22 03:51:07.095278) to (2765, 2008-07-22 08:11:35.496780)
Data columns (total 7 columns):
tcp.stream         9913  non-null values
ip.src             9913  non-null values
ip.dst             9913  non-null values
tcp.seq            9913  non-null values
tcp.ack            9913  non-null values
tcp.window_size    9913  non-null values
tcp.len            9913  non-null values
dtypes: int64(5), object(2)

In [34]:
bytes_per_stream = per_stream["tcp.len"].sum()
bytes_per_stream.head()


Out[34]:
tcp.stream
0                0
1             2565
5             5158
6             8266
10            5017
Name: tcp.len, dtype: int64

In [35]:
bytes_per_stream.plot()


Out[35]:
<matplotlib.axes.AxesSubplot at 0xac810ac>

In [36]:
bytes_per_stream.max()


Out[36]:
5150771

In [37]:
biggest_stream=bytes_per_stream.idxmax()
biggest_stream


Out[37]:
88

In [38]:
bytes_per_stream.ix[biggest_stream]


Out[38]:
5150771
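
The group-and-sum pattern used here can be tried on a few invented packets. In this sketch (packets, bytes_demo and biggest are made-up names) the lookup uses .loc, because recent Pandas releases no longer provide .ix:

```python
import pandas as pd

# Six hypothetical packets belonging to three TCP streams.
packets = pd.DataFrame({
    "tcp.stream": [0, 1, 1, 5, 5, 5],
    "tcp.len":    [0, 1000, 1565, 158, 2000, 3000],
})

# Sum the payload bytes per stream, then find the stream with the most bytes.
bytes_demo = packets.groupby("tcp.stream")["tcp.len"].sum()
biggest = bytes_demo.idxmax()
print(biggest, bytes_demo.loc[biggest])
```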

Ethernet Padding

Let's have a look at the padding of the Ethernet frames. Some cards have been leaking data in the past. For more details, see http://www.securiteam.com/securitynews/5BP01208UO.html


In [39]:
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df


Out[39]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 95175 entries, 2008-07-22 03:51:07.095278 to 2008-07-22 08:13:47.046029
Data columns (total 2 columns):
eth.src        95175  non-null values
eth.trailer    12851  non-null values
dtypes: object(2)

In [40]:
trailer=trailer_df["eth.trailer"]
trailer


Out[40]:
frame.time_epoch
2008-07-22 03:51:07.095278    NaN
2008-07-22 03:51:07.103728    NaN
2008-07-22 03:51:07.114897    NaN
2008-07-22 03:51:07.139448    NaN
2008-07-22 03:51:07.319680    NaN
2008-07-22 03:51:07.321990    NaN
2008-07-22 03:51:07.326517    NaN
2008-07-22 03:51:07.335554    NaN
2008-07-22 03:51:07.376171    NaN
2008-07-22 03:51:07.378392    NaN
2008-07-22 03:51:07.389299    NaN
2008-07-22 03:51:07.390478    NaN
2008-07-22 03:51:07.404056    NaN
2008-07-22 03:51:07.416518    NaN
2008-07-22 03:51:07.423663    NaN
...
2008-07-22 08:13:44.266370                  NaN
2008-07-22 08:13:44.266638                  NaN
2008-07-22 08:13:44.293692    00:00:00:00:00:00
2008-07-22 08:13:44.585477                  NaN
2008-07-22 08:13:44.863535                  NaN
2008-07-22 08:13:44.873602                  NaN
2008-07-22 08:13:44.883737                  NaN
2008-07-22 08:13:44.893510                  NaN
2008-07-22 08:13:44.903460                  NaN
2008-07-22 08:13:44.913495                  NaN
2008-07-22 08:13:44.923654                  NaN
2008-07-22 08:13:44.933648                  NaN
2008-07-22 08:13:44.943515                  NaN
2008-07-22 08:13:44.953453                  NaN
2008-07-22 08:13:47.046029                  NaN
Name: eth.trailer, Length: 95175, dtype: object

Ok. Most frames do not seem to have padding, but some do. Let's count per value to get an overview:


In [41]:
trailer.value_counts()


Out[41]:
00:00:00:00:00:00                                        7989
3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     913
00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00     606
3b:02:a7:19:00:1d:6b:99:98:6a:88:64:11:00:8f:da:00:42     303
00:00                                                     299
00:00:c0:a8:01:40:00:00:00:00:00:00:00:00:00:1d:d9:2e     259
32:01:67:06:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02     254
2d:66:6f:6f:65:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     253
04:67:6b:64:63:03:75:61:73:03:61:6f:6c:03:63:6f:6d:00     160
70:03:6d:73:67:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     151
73:6b:03:6d:61:63:03:63:6f:6d:00:00:01:00:01:00:01:00     146
2d:66:6f:6f:62:05:79:61:68:6f:6f:03:63:6f:6d:00:00:01     101
73:6b:03:6d:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      66
72:65:76:73:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      54
00:00:00:00:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02      52
...
2d:66:6f:6f:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02    1
00:00:c9:6e:87:fc                                        1
00:00:44:b7:84:43                                        1
00:00:3b:fc:30:86                                        1
00:00:7b:1f:5b:03                                        1
00:00:78:27:f5:37                                        1
00:00:f0:2c:e6:35                                        1
00:00:6e:f5:46:41                                        1
00:00:00:00:00:00:00:00:00:00:00:00:00:16:39:da:a9       1
00:00:7a:e4:d0:27                                        1
00:00:61:c8:85:63                                        1
00:00:e7:99:00:70                                        1
00:00:68:25:eb:a0                                        1
00:00:34:ba:2b:52                                        1
00:00:53:8a:e9:05                                        1
Length: 635, dtype: int64

Mostly zeros, but some data. Let's decode the hex strings:


In [42]:
import binascii

def unhex(s, sep=":"):
    return binascii.unhexlify("".join(s.split(sep)))

In [43]:
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s


Out[43]:
';\x02\xa7\x19\xaa\xaa\x03\x00\x80\xc2\x00\x07\x00\x00\x00\x02;\x02'
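
On Python 3 the decoding can be done with bytes.fromhex, and the result is a bytes object rather than a str. A minimal sketch (the name unhex_bytes is chosen here only to keep it apart from the function above):

```python
def unhex_bytes(s, sep=":"):
    """Decode a separator-delimited hex string such as "3b:02:a7" into bytes."""
    return bytes.fromhex(s.replace(sep, ""))

print(unhex_bytes("3b:02:a7:19"))
```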

In [44]:
padding = trailer_df.dropna()

In [45]:
padding["unhex"]=padding["eth.trailer"].map(unhex)

In [46]:
def printable(s):
    chars = []
    for c in s:
        if c.isalnum():
            chars.append(c)
        else:
            chars.append(".")
    return "".join(chars)

In [47]:
printable("\x95asd\x33")


Out[47]:
'.asd3'

In [48]:
padding["printable"]=padding["unhex"].map(printable)

In [49]:
padding["printable"].value_counts()


Out[49]:
......                8145
..................    1927
......k..j.d.....B     303
..                     299
2.g...............     254
.fooe.yahoo.com...     253
.gkdc.uas.aol.com.     160
p.msg.yahoo.com...     151
sk.mac.com........     148
.foob.yahoo.com...     101
sk.m..............      66
revs..............      54
ge.w..............      45
1.1...............      44
.goo..............      42
...
..........Wz......    1
..M...                1
...i.Z                1
..x...                1
..N...                1
..n.oN                1
....fK                1
....fk                1
..Y8..                1
..n.FA                1
...O.r                1
....Qn                1
..PK.e                1
...w..                1
..1...                1
Length: 375, dtype: int64

In [50]:
def ratio_printable(s):
    printable = sum(1.0 for c in s if c.isalnum())
    return printable / len(s)

In [51]:
ratio_printable("a\x93sdfs")


Out[51]:
0.8333333333333334
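
These two helpers rely on Python 2 byte strings. On Python 3, iterating over bytes yields integers, and str.isalnum() is Unicode-aware, so a character like "\xaa" would count as alphanumeric. The following sketch, with the hypothetical names printable_bytes and ratio_printable_bytes, restricts the check to ASCII by testing each byte with bytes.isalnum():

```python
def printable_bytes(data):
    """Replace every byte that is not an ASCII letter or digit with a dot."""
    return "".join(chr(b) if bytes([b]).isalnum() else "." for b in data)

def ratio_printable_bytes(data):
    """Fraction of bytes that are ASCII letters or digits (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(bytes([b]).isalnum() for b in data) / len(data)

print(printable_bytes(b"\x95asd\x33"))
print(ratio_printable_bytes(b"a\x93sdfs"))
```

Returning 0.0 for empty input is a design choice of this sketch; it avoids a division by zero for frames with an empty trailer.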

In [52]:
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)

In [53]:
padding[padding["ratio_printable"] > 0.5]


Out[53]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 727 entries, 2008-07-22 03:51:20.018817 to 2008-07-22 05:40:13.338449
Data columns (total 5 columns):
eth.src            727  non-null values
eth.trailer        727  non-null values
unhex              727  non-null values
printable          727  non-null values
ratio_printable    727  non-null values
dtypes: float64(1), object(4)

In [54]:
_.printable.value_counts()


Out[54]:
.fooe.yahoo.com...    253
.gkdc.uas.aol.com.    160
p.msg.yahoo.com...    151
.foob.yahoo.com...    101
.weather.com......     31
ge.weather.com....     26
1.1..HOST.239.255.      1
..CDWW                  1
.foof.yahoo.com...      1
..3rbo                  1
..BIKM                  1
dtype: int64

Now find out which Ethernet cards sent those packets with more than 50% ASCII data in their padding:


In [55]:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()


Out[55]:
frame.time_epoch
2008-07-22 03:51:20.018817    00:1d:d9:2e:4f:61
2008-07-22 04:10:14.155085    00:1d:6b:99:98:68
Name: eth.src, dtype: object
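
If you also want to know how many such frames each card sent, value_counts() can be used instead of drop_duplicates(). This is a sketch on invented data: the MAC addresses are the two found above, but the ratio values are made up.

```python
import pandas as pd

# Hypothetical padding table with one ratio per frame.
padding_demo = pd.DataFrame({
    "eth.src": ["00:1d:d9:2e:4f:61", "00:1d:6b:99:98:68",
                "00:1d:d9:2e:4f:61", "00:1d:d9:2e:4f:61"],
    "ratio_printable": [0.72, 0.61, 0.10, 0.83],
})

# Keep frames with mostly alphanumeric padding, then count them per sender.
mostly_text = padding_demo[padding_demo["ratio_printable"] > 0.5]
counts = mostly_text["eth.src"].value_counts()
print(counts)
```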

In [56]:
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')


Out[56]:

That's "Hon Hai Precision" (and "Netopia Inc" for the other MAC address).