What is baiduspider and how would you identify it best with minimal possible errors?
Baiduspider is the web crawler for the Chinese search engine Baidu. According to Baidu, it can be identified by the string 'Baiduspider' in the User-Agent header. However, the User-Agent is trivial to forge, so malicious actors may use it to hide their tracks. To double-check that the origin is actually Baidu, perform a reverse DNS lookup on the source IP (on macOS and Linux this can be done with dig -x <ip> or host <ip>), verify that the returned hostname belongs to Baidu, and then forward-resolve that hostname to confirm it maps back to the same IP.
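As a sketch of the forward-confirmed reverse DNS check described above, using Python's socket module (which performs the same lookups as dig -x / host; the .baidu.com / .baidu.jp suffixes follow Baidu's published guidance):

```python
import socket

def is_baiduspider(ip: str) -> bool:
    """Forward-confirmed reverse DNS: reverse-resolve the IP, check the
    hostname is under a Baidu domain, then forward-resolve the hostname
    and confirm it maps back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # PTR lookup, like `dig -x`
    except OSError:
        return False
    if not hostname.endswith(('.baidu.com', '.baidu.jp')):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward lookup
    except OSError:
        return False
    return ip in addrs
```

A spoofed client that merely sets 'Baiduspider' in its User-Agent fails this check, because its IP will not reverse-resolve into Baidu's domains.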
The file contains recorded traffic from one of our customers after all of the customer's sensitive data has been removed. Your challenge is to find the attackers, describe which attack they performed, and explain how we can identify them amid all the traffic. Please describe the entire process you followed, from the data-processing stage to the actual findings.
Begin by importing pandas and the data and setting up matplotlib.
In [1]:
%matplotlib inline
import pandas as pd
test_df = pd.read_csv('test_20180320.csv')
Then import json and use json_normalize to convert the Headers data into columns of the csv. Merge this back with the original data.
In [2]:
import json
headers_df = pd.json_normalize(test_df['Headers'].apply(json.loads))  # pd.io.json.json_normalize is deprecated
merged_test_df = pd.merge(test_df, headers_df, left_index=True, right_index=True)
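The same pattern on a toy frame (the column names and values here are illustrative, not taken from the real file):

```python
import json
import pandas as pd

# Two fake rows whose 'Headers' column holds JSON strings, mimicking
# the structure assumed above.
df = pd.DataFrame({
    'IP': ['198.51.100.1', '203.0.113.2'],
    'Headers': [
        '{"User-Agent": "Mozilla/5.0", "Accept-Language": "en"}',
        '{"User-Agent": "Baiduspider", "Accept-Language": "zh"}',
    ],
})
# Parse each JSON string into a dict, expand the dicts into columns,
# then merge back on the shared row index.
headers_df = pd.json_normalize(df['Headers'].apply(json.loads))
merged = pd.merge(df, headers_df, left_index=True, right_index=True)
```

After the merge, each header key becomes its own column alongside the original fields, which is what makes the later per-header filtering possible.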
Convert the time from unix epoch to human-readable form and set the index to the Timestamp.
In [3]:
import datetime
merged_test_df['Timestamp'] = pd.to_datetime(merged_test_df['Timestamp'], unit='s')
merged_test_df.index = merged_test_df['Timestamp']
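For example, a raw epoch value converts as follows (1521504000 seconds is 2018-03-20 00:00:00 UTC, matching the date in the filename):

```python
import pandas as pd

# Interpret the integer as seconds since the Unix epoch.
ts = pd.to_datetime(1521504000, unit='s')
print(ts)  # 2018-03-20 00:00:00
```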
For additional clarity during exploration, drop user-agent since it duplicates User-Agent, drop Headers since it has already been parsed, and drop Timestamp since it is identical to the index.
In [4]:
merged_test_df = merged_test_df.drop(columns=['user-agent', 'Headers', 'Timestamp'])  # positional axis argument is deprecated
I'll begin by taking a look at summary statistics for the columns which stand out as most immediately relevant. As I discover insights from these, I will look back to these columns and others for more details and parse further to identify malicious traffic.
I will use merged_test_df['column'].value_counts().head(5)
to identify the most frequent values for each column and look for patterns and outliers (sample(5) would return a random sample, not the top values).
Mozilla/5.0 (Linux i686) AppleWebKit/505.0 (KHTML, like Gecko) Chrome/1.0143.94 Safari/505    8902
Mozilla/5.0 (compatible; MSIE 5.0; Linux x86_64; Tablet PC;Trident/4.0)    2262
Mozilla/5.0 Chrome/156.0    2201
Below I start looking deeper into Content-Length. I picked the top 11 values because there is a significant gap between the number of occurrences of the 11th and 12th.
In [5]:
content_length_df = pd.DataFrame(merged_test_df['Content-Length'].value_counts())
top_content = content_length_df.head(11)
I will use top_content later to graph against overall traffic.
In [6]:
import matplotlib
import matplotlib.pyplot as plt
counts = content_length_df['Content-Length']  # number of occurrences of each length
lengths = content_length_df.index             # the Content-Length values themselves
fig, ax = plt.subplots()
ax.scatter(counts, lengths, marker='.')
ax.set(xlabel='occurrences', ylabel='bytes')
plt.show()
In the chart above, we can see that several Content-Lengths have unusually high rates of occurrence, all grouped in a range between 150 and 160 bytes.
In [7]:
xff_counts = merged_test_df['X-Forwarded-For'].value_counts()
five_ip_arr = xff_counts[xff_counts == 5].index.tolist()  # addresses with exactly 5 requests
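On a toy column (made-up addresses), the "exactly five requests" filter looks like this:

```python
import pandas as pd

# Fake X-Forwarded-For values: two addresses appear 5 times, one appears 3 times.
xff = pd.Series(['10.0.0.1'] * 5 + ['10.0.0.2'] * 3 + ['10.0.0.3'] * 5)
counts = xff.value_counts()
exactly_five = sorted(counts[counts == 5].index)  # boolean mask keeps only count == 5
```

The boolean-mask form replaces the explicit index loop and is both shorter and faster on large frames.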
Since IP and X-Forwarded-For appear to stay consistent in most of the 5-instance patterns I have observed, I will use only X-Forwarded-For to avoid the complication of the one outlier in IP.
The following code checks each X-Forwarded-For address with 5 requests to see whether any pair of its requests is separated by an interval that is a multiple of 1000 seconds. Those that are get added to thousand_arr. Items in this array most likely come from a similar source, given their similarities.
The following code block takes quite a while to run and could be made more efficient.
In [8]:
thousand_arr = []
for ip in five_ip_arr:
    temp_df = merged_test_df.loc[merged_test_df['X-Forwarded-For'] == ip].sort_index()
    for j in range(len(temp_df) - 1):
        gap = (temp_df.index[j + 1] - temp_df.index[j]).total_seconds()
        if gap % 1000 == 0:  # two requests some multiple of 1000 seconds apart
            thousand_arr.append(ip)
            break  # avoid appending the same address more than once
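The nested loop above could be vectorized with groupby and diff; a sketch on a toy frame (illustrative column name and timestamps, not the real data):

```python
import pandas as pd

# Address 'a' has two requests exactly 2000 s apart; 'b' does not.
df = pd.DataFrame(
    {'X-Forwarded-For': ['a', 'b', 'b', 'a']},
    index=pd.to_datetime([0, 100, 350, 2000], unit='s'),
)
tmp = df.sort_index().reset_index()  # 'index' column now holds the timestamps
# Per-address gap between consecutive requests, in seconds.
tmp['gap'] = tmp.groupby('X-Forwarded-For')['index'].diff().dt.total_seconds()
hits = tmp.loc[tmp['gap'] % 1000 == 0, 'X-Forwarded-For'].unique().tolist()
```

This computes every consecutive gap in one pass instead of re-filtering the frame once per address.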
In [9]:
ip_nets = merged_test_df['IP'].str.extract(r'(\d{1,3}\.\d{1,3}\.\d{1,3})')  # escape the dots so '.' matches literally
ip_nets.columns = ['network']
ip_nets['network'].value_counts().head()
In [10]:
forward_nets = merged_test_df['X-Forwarded-For'].str.extract(r'(\d{1,3}\.\d{1,3}\.\d{1,3})')
forward_nets.columns = ['network']
forward_nets['network'].value_counts().head()
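To sanity-check the /24 extraction, the pattern (with the dots escaped, since an unescaped '.' matches any character) applied to made-up addresses:

```python
import pandas as pd

ips = pd.Series(['203.0.113.7', '203.0.113.99', '198.51.100.4'])
# Capture the first three octets, i.e. the /24 network prefix.
nets = ips.str.extract(r'(\d{1,3}\.\d{1,3}\.\d{1,3})')
nets.columns = ['network']
```

Grouping by this prefix is what lets nearby addresses in the same network show up as one heavy source.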
Below I use resample and count to create a per-minute count of requests, and then do the same for several subsets filtered by the various clues to malicious behavior I discovered while exploring the dataset.
In [11]:
overall_by_minute = merged_test_df['IP'].resample('T').count()
top_ip_by_minute = merged_test_df.loc[merged_test_df['IP'] == '225.19.49.85']['IP'].resample('T').count()
auth_by_minute = merged_test_df.loc[merged_test_df['path'] == '/auth']['IP'].resample('T').count()
zh_by_minute = merged_test_df.loc[merged_test_df['Accept-Language'].str.contains('zh', na = False)]['IP'].resample('T').count()
mozillia_by_minute = merged_test_df.loc[merged_test_df['User-Agent'].str.contains('Mozillia', na = False)]['IP'].resample('T').count()
tc_by_minute = merged_test_df.loc[merged_test_df["Content-Length"].isin(top_content.index)]['IP'].resample('T').count()
thousand_by_minute = merged_test_df.loc[merged_test_df["X-Forwarded-For"].isin(thousand_arr)]['IP'].resample('T').count()
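A minimal illustration of the per-minute bucketing used above ('min' is the modern spelling of the 'T' alias):

```python
import pandas as pd

# Four fake requests: two in the first minute, one in each of the next two.
s = pd.Series(1, index=pd.to_datetime([10, 20, 70, 130], unit='s'))
per_minute = s.resample('min').count()  # requests per one-minute bin
```

Each filtered subset resampled this way lines up on the same minute grid, which is what allows them to be concatenated and plotted against overall traffic below.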
In [12]:
graph_df = pd.concat([overall_by_minute, top_ip_by_minute, auth_by_minute, zh_by_minute, mozillia_by_minute, tc_by_minute, thousand_by_minute], axis=1)
In [13]:
graph_df.columns = ['overall', 'top_ip', 'auth', 'zh', 'mozillia', 'top_content', 'thousand']
In [14]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
hours = mdates.HourLocator()
t = graph_df.index
a = graph_df['overall']
b = graph_df['top_ip']
c = graph_df['auth']
d = graph_df['zh']
e = graph_df['mozillia']
f = graph_df['top_content']
g = graph_df['thousand']
fig, ax = plt.subplots(figsize=(20, 10))
ax.plot(t, a, label='overall')
ax.plot(t, b, label='top_ip')
ax.plot(t, c, label='auth')
ax.plot(t, d, label='zh')
ax.plot(t, e, label='mozillia')
ax.plot(t, f, label='top_content')
ax.plot(t, g, label='thousand')
ax.set(xlabel='time (m)', ylabel='traffic')
ax.grid()
ax.xaxis.set_major_locator(hours)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))  # hour:minute labels suit a per-minute series; '%Y' would print only the year
fig.autofmt_xdate()
plt.legend(loc=1)
plt.show()
In [15]:
fig, ax = plt.subplots(figsize=(20, 10))
ax.plot(t, a, label='overall')
ax.plot(t, b, label='top_ip')
ax.plot(t, c, label='auth')
ax.plot(t, d, label='zh')
ax.plot(t, e, label='mozillia')
ax.plot(t, f, label='top_content')
ax.plot(t, g, label='thousand')
ax.set(xlabel='time (m)', ylabel='traffic')
ax.grid()
ax.xaxis.set_major_locator(hours)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
fig.autofmt_xdate()
left = '2015-02-17 20:30:00'
right = '2015-02-18 00:30:00'
bottom = 0
top = 250
ax.set_xlim(left, right)
ax.set_ylim(bottom, top)
plt.legend(loc=1)
plt.show()
There are a number of observations to be made from this graph, and lots of suggestions about how the data might be further explored.
The most difficult part of this question for me is identifying what kind of attack is being performed. With enough research and time, I'm sure this dataset is rich enough to provide all the clues I need. My guess at this stage is that the attack is harvesting sensitive product and image data that is unique to each authenticated user. That would explain why so many bots request the same paths, and why there are so many requests for 0-indexed products and images (such as /images/id/0 and /products/id/0), with request frequency decreasing as the index grows. My theory is that the bots log in as users in the first wave of the attack (/auth), and then rotate through IP addresses while harvesting the sensitive data that requires authentication to access.
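One quick way to test this harvesting hypothesis would be to extract the numeric id from each path and confirm that request counts fall off as the id grows. A sketch on toy paths (the pattern and values are illustrative):

```python
import pandas as pd

paths = pd.Series(['/images/id/0', '/images/id/0', '/images/id/0',
                   '/products/id/0', '/images/id/1', '/images/id/2'])
# Pull the trailing numeric id out of each path and count requests per id.
ids = paths.str.extract(r'/id/(\d+)$')[0].astype(int)
counts = ids.value_counts().sort_index()
```

A sequential scraper enumerating ids from zero would produce exactly this shape: id 0 most requested, with a monotone tail.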
Which of the following 10 objects are part of PerimeterX homepage?
btn btn-default btn-b1 - Yes
jfk-bubble fkbx-chm - No
blog-bottom-borderlink - No
blog-link has-bottom-border - No
spch s2fp-h - No
button headerItem - No
col-header - Yes
reg-fraud - No
_ctl0_frmMarginRightTop1 - No
I first looked at the source and simply did Ctrl+F for each of these. I then thought that this might not cover everything, as some objects may load later, so I also checked each one from my browser's console with a jQuery class selector such as $('.btn.btn-default.btn-b1').length and found the same results.
Please list 5 ways you think about for identifying bots, without collecting private user information.
Start by finding periods of abnormally high traffic, then analyze them to find out what differs between the periods when traffic spikes and 'normal' traffic.
Look for IP addresses or ranges which are unusually active or exist as known bad actors.
Some bots identify themselves, such as Baiduspider, Googlebot, and Bingbot. Look for these and for other clues in the User-Agent header, such as the typos seen in the data analysis exercise.
Look for strange patterns in traffic, such as evenly spaced requests, requests happening faster than a human could possibly submit them, or other repeated patterns.
Pay particular attention when there is relatively heavier traffic in more sensitive areas of the website, such as authentication.
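As one concrete version of the 'evenly spaced requests' signal above: the spread of inter-arrival times is near zero for a timer-driven bot but large for a human (made-up timestamps for illustration):

```python
import pandas as pd

bot = pd.Series(pd.to_datetime([0, 60, 120, 180], unit='s'))    # one request every 60 s
human = pd.Series(pd.to_datetime([0, 7, 95, 181], unit='s'))    # irregular gaps
bot_jitter = bot.diff().dt.total_seconds().std()
human_jitter = human.diff().dt.total_seconds().std()
```

Thresholding on this jitter statistic per source address flags metronomic traffic without inspecting any private user data.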
Write a script in a language of your preference (python, javascript, etc.) that will access the following URL http://www.cool-proxy.net/proxies/http_proxy_list/sort:score/direction:desc and create a CSV of ALL IPs and ports. (not just from the landing page)
In [16]:
import urllib.request
import json
import pandas as pd

# The site's paginated table is populated from a single JSON endpoint,
# so fetching it returns every proxy, not just the landing page.
with urllib.request.urlopen("https://cool-proxy.net/proxies.json") as url:
    data = json.loads(url.read().decode())

proxy_df = pd.json_normalize(data)  # pandas.io.json.json_normalize is deprecated
final_list = proxy_df[['ip', 'port']]
final_list.to_csv("output_ips.csv", encoding='utf-8', index=False)
For each of the scenarios below, please provide (1) a written response to the support issues and (2) your thoughts / questions / additional information you would seek internally to better understand the issue
“I just signed up to your service, and you guys charged my credit card, but I can’t log in. This is pretty terrible service, and I honestly would just consider asking for a refund if this can’t be solved quickly.”
I'm sorry that you're having issues logging into the service. Please provide me with your email address and I will help you get past this as quickly as possible.
When did they sign up? Have they contacted support before? Are there any hints about why they are angry enough to call it a 'terrible service', or are they just frustrated by their login issues?
“An end user of a site protected by PerimeterX service has opened a ticket: “ Hi, you guys blocked my access to my Walmart account and I can’t buy my groceries. I have a party in two days and I won’t be able to buy food. Please let me access”
I would be happy to look into why we may have blocked access to your Walmart account. Could you please provide me with your account details so I can help you try to unblock it?
Why was it blocked? Did it trigger something in the system? Is this an actual end user, or someone looking to unban a malicious account? It seems unusual that a Walmart customer would know about PerimeterX and make the effort to contact us unless they were directly informed that they were blocked by the service.
“A paying customer writes to you saying: “ Your service suddenly stopped working. My site is being scraped and my content is jeopardised.” You look into this and determine that there’s a bug in the system. Write a response to him (bonus points if you can help provide him with a workaround in case the bug isn’t fixed by tomorrow!)
We have looked into your issue and discovered a previously unknown bug in our system. We apologize for any disruption this has caused and our team is working on fixing the bug as quickly as possible. In case the bug is not fixed by tomorrow, you might want to consider some basic measures for temporarily protecting the content on your site. Depending on your situation, you may want to consider the following:
- blocking suspicious IP addresses if you can find any (although this is very simple for attackers to circumvent)
- requiring CAPTCHA authentication for sensitive content
- changing the structure of your webpage to confound previously configured bots which might be scraping your site
- limiting request frequency or the amount of data a user is allowed to download
What is the bug? How long has it been in the codebase? How long will it likely take to fix and update?
An enterprise customer has sent an email to support, complaining about slow response, bad service, and threatening to leave. What would you do?
I'm sorry that you have been frustrated with your customer service experience at PerimeterX in the past. I would like to address this personally and find a way that we can resolve your issues.
Records about their support history: how long have they waited? Have they had particularly hard questions for support? Have they received bad service, and if so, how? Would it help if they talked to someone higher up in the company? How much is the customer worth to the company, and should more attention be paid to their issues?