This notebook shows how to get an index name and value (also for multiple indices) from a Bloomberg page. Reminder: first of all, you need to check the robots.txt file to see whether you are allowed to scrape or not (being nice), and go on only if everything is okay.
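Python's standard library can do this check for us. Below is a minimal sketch using urllib.robotparser; the rules string is hypothetical, for illustration only — in practice you would let the parser fetch the site's actual robots.txt (e.g. https://www.bloomberg.com/robots.txt) and inspect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; a real check would
# download the file from the site instead of using a hard-coded string.
rules = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch() tells us whether the given user agent may request a URL
print(rp.can_fetch("*", "https://www.bloomberg.com/quote/SPX:IND"))  # True
print(rp.can_fetch("*", "https://www.bloomberg.com/private/page"))   # False
```

To check the live file, you would instead call `rp.set_url("https://www.bloomberg.com/robots.txt")` followed by `rp.read()`.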
Let's find the current value of the S&P index. Again, we should import the two libraries for getting the HTML page (requests) and for parsing it (BeautifulSoup).
In [1]:
import requests
from bs4 import BeautifulSoup
In [2]:
url = "http://www.bloomberg.com/quote/SPX:IND"
In [3]:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, "html.parser")
If you use the Inspect element feature of Google Chrome, you will see that the name of the index is inside an <h1>
tag which has a class="name" attribute. Let's first find all the <h1>
tags on the page.
In [4]:
soup.findAll('h1')
Out[4]:
Obviously, this is not what we want (the index name). Thus, we should explicitly mention the attributes we are looking for:
In [5]:
index_name = soup.findAll('h1', attrs={'class': 'name'})
print(index_name)
Now this works: we were able to find the correct tag. Let's get the text out of it.
In [6]:
print(index_name[0].text)
The reason we used the index_name[0] notation is that index_name is a list (consisting of only one element, but still a list). The text attribute works only on tags, so we had to explicitly select the tag from the list and then access its text.
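The difference is easy to see on a toy page (the HTML snippet below is made up to mimic the Bloomberg markup): find_all() always returns a list-like result that must be indexed, while find() returns the first matching tag directly.

```python
from bs4 import BeautifulSoup

# A made-up snippet imitating the structure of the Bloomberg quote page
html = '<h1 class="name">S&amp;P 500 Index</h1><div class="price">2,900.00</div>'
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all("h1", attrs={"class": "name"})
print(type(tags))            # a ResultSet (list-like), so we index it first
print(tags[0].text)          # S&P 500 Index
print(soup.find("h1").text)  # find() returns the tag itself, no indexing needed
```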
Similarly, we can use the Inspect element feature to understand where the index value lives and get it. Then we can extract the text part out of it.
In [7]:
index_value = soup.findAll('div', attrs={'class':'price'})
print(index_value[0].text)
If you are interested in getting data on multiple indices, you just need to specify all the URLs you want to get data from inside a list, then create a for loop that iterates over that list, sends a request, gets the response and converts it to text (as we did above), passes it as an argument to the BeautifulSoup() function, and gets the index name and value as done above.
In [8]:
urls = ["https://www.bloomberg.com/quote/DM1:IND",
"https://www.bloomberg.com/quote/UKX:IND",
"https://www.bloomberg.com/quote/EURUSD:CUR" ]
In [9]:
for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1", attrs={'class': 'name'})
    index_value = soup.find("div", attrs={'class': 'price'})
    print(index_name.text + ": " + index_value.text + "\n")
In [10]:
my_data = {}
for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1", attrs={'class': 'name'})
    index_value = soup.find("div", attrs={'class': 'price'})
    my_data.update({index_name.text: index_value.text})
Let's pretty print the content of the dictionary.
In [11]:
from pprint import pprint
pprint(my_data)
Once you create a for loop that sends a request, gets the data and so on, you should be careful not to overwhelm the website's server. For that purpose you may want the for loop to sleep a bit between iterations (let's say 10 seconds). We can use the time library and its sleep() function to make the for loop sleep.
In [12]:
# This is the same code as above with 2 additional lines
import time  # importing the library

my_data = {}
for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1", attrs={'class': 'name'})
    index_value = soup.find("div", attrs={'class': 'price'})
    my_data.update({index_name.text: index_value.text})
    time.sleep(10)  # asking the for loop to sleep 10 seconds
Usually, websites mention in their documentation how long you need to wait between requests. A delay on the order of 30 seconds is common.
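Some sites state this delay machine-readably in robots.txt via a Crawl-delay directive, which urllib.robotparser can read. A small sketch, again with a hypothetical rules string rather than a real site's file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt declaring a 30-second crawl delay
rules = """
User-agent: *
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

delay = rp.crawl_delay("*")
print(delay)  # 30 -> sleep this many seconds between requests
```

If `crawl_delay()` returns None, the site declares no delay and you fall back on whatever its documentation suggests.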
In [13]:
import csv
from datetime import datetime
with open("index_data.csv", "w", newline="") as file:  # create a new file for writing
    writer = csv.writer(file)
    for i in my_data:
        writer.writerow([i, my_data[i], datetime.now()])
You may want to construct a pandas DataFrame from the dictionary we have. The DataFrame.from_dict() function from the pandas library is useful here. The function takes two arguments: first, the dictionary to get the data from, and second, the orient argument, which shows whether the dictionary keys should become row names (the index) or column names of the DataFrame. Let's make them row names.
In [14]:
import pandas as pd
data = pd.DataFrame.from_dict(my_data, orient="index")
print(data)
In [15]:
data.transpose()
Out[15]:
And of course, if we are dealing with pandas, we can easily save the dataframe to csv as follows.
In [16]:
data.transpose().to_csv("index_dataframe.csv")