Version : 1.0
Date : 2015-05-21
This notebook illustrates the approach used to extract the BMA doctors' registration data. All doctors in Bangladesh receive a registration number at BMA after successfully completing their internship. Using that number they can establish their credibility as a doctor, and anyone can use it to verify that someone is a legitimate doctor.
The BMA search portal only allows searching by registration number. But since it is not common in this country for doctors to routinely publish their BMA number, we need an interface through which we can also search the database by a doctor's name.
Unfortunately, the data on the BMA website is very bare-bones. Only the doctor's name, father's name, address and an official photo are provided against each ID number. But we can create a master table which we can populate from other sources.
This interface provides information on 66000 medical doctors and 4000 dental doctors. Currently we have around 70000 doctors in our country, so we can expect the data to be current up to a couple of years ago.
This is a first attempt to collect and accumulate the data. Several crude hacks were employed to get a working model up and running as soon as possible. Initially the information is dumped into CSV files; once we have all the data, it will be imported into a PostgreSQL database.
A first use of the database might be a mobile app interface where a patient can search for a doctor by name or registration number and see the doctor's photo to verify that he or she is a legitimate doctor.
In [6]:
#Load the necessary modules
from mechanize import Browser
import pandas as pd
from IPython.core.display import HTML
import requests
We need a function to parse the HTML data after extracting the result.
In [ ]:
def extract_sub_string(string, start, finish):
    """
    Extract the substring that begins at the 'start' substring and ends just before
    the first occurrence of the 'finish' substring after that point.
    Note that the returned string includes 'start' itself.
    :param string: main string, to be parsed
    :type string: str
    :param start: starting string
    :type start: str
    :param finish: ending string
    :type finish: str
    """
    new_string_index = string.find(start)
    new_string = string[new_string_index:]
    end_index = new_string.find(finish)
    final_string = string[new_string_index:new_string_index+end_index]
    return final_string
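To make the behaviour of this helper concrete, here is a small offline sketch. The HTML fragment is hypothetical and made up for illustration; it is not real portal output. The function body is repeated only so that the example runs on its own.

```python
def extract_sub_string(string, start, finish):
    # Same logic as the function above, repeated so this example is self-contained.
    new_string_index = string.find(start)
    new_string = string[new_string_index:]
    end_index = new_string.find(finish)
    return string[new_string_index:new_string_index + end_index]

# Hypothetical fragment (an assumption, not copied from the website).
sample = ('<div class="doctor_info"><tr><td>Name</td><td>Dr. Example</td></tr>'
          '</div><p>footer</p>')

# The result starts at the 'start' marker and stops just before 'finish'.
block = extract_sub_string(sample, 'doctor_info', '</div')
print(block)

# The search for 'finish' begins at the start marker itself, so 'finish' must be
# a string that does not occur inside 'start' (a bare '</td>' would match too early).
raw_name = extract_sub_string(block, '<td>Name</td>', '</td></tr>')

# Keeping the text after the last '>' isolates the value.
name = raw_name.split('>')[-1]
print(name)
```

Because the start marker is part of the returned string, the caller still has to strip it off, which is what the `split` calls in the parsing cell are for.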
Now we extract the result page for each ID (1 to 66000) and store the strings in a pandas DataFrame. We will tokenize the resulting strings later.
In [ ]:
start = 'doctor_info'
finish="</div"
extracted_strings = []
extracted_df = pd.DataFrame(columns=['extracted'])
for reg_no in xrange(1,66001):
    browser = Browser()
    browser.open("http://bmdc.org.bd/doctors-info/")
    # We have 2 forms in this page and we are going to select the second form
    browser.select_form(nr=1)
    # This form has 2 input fields: the first field, search_doc_id, takes a number and the second field, type,
    # indicates whether the id is associated with a medical doctor or a dentist
    browser['search_doc_id'] = str(reg_no)
    browser['type'] = ['1']
    # Submit the form and read the result
    response = browser.submit()
    content = response.read()
    str_content = str(content)
    #Extract only the relevant portion
    extracted_str = extract_sub_string(str_content, start, finish)
    extracted_strings.append(extracted_str)
    # Originally these commented-out snippets were run so that each group of 100 doctors was recorded at a time in
    # separate csv files, for testing and stability purposes. Each 100 doctors took around 6-7 minutes to record.
    #if reg_no%100==0:
    #    file_number = reg_no/100
    #    extracted_df = pd.DataFrame(columns=['extracted'])
    #    extracted_df.extracted = extracted_strings
    #    extracted_df.to_csv(str(file_number)+'.csv')
    #    extracted_strings = []
# After the loop, store everything in a single csv file
extracted_df['extracted'] = extracted_strings
extracted_df.to_csv('all_bma_doctor.csv')
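The batch-of-100 idea from the commented-out snippet can be sketched as a small helper, so that a crash late in a long run does not lose everything collected so far. This is only an illustration under stated assumptions: `scrape_in_batches`, `write_batch` and the `fetch` argument are hypothetical names, `fetch` stands in for the form-submission code above, the file naming differs from the original `1.csv`, `2.csv`, ... scheme, and the code is written for Python 3 using the standard `csv` module instead of pandas.

```python
import csv
import os
import tempfile

def write_batch(batch, out_dir):
    # Name the file after the first and last registration number it holds.
    path = os.path.join(out_dir, '%d-%d.csv' % (batch[0][0], batch[-1][0]))
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['reg_no', 'extracted'])
        writer.writerows(batch)
    return path

def scrape_in_batches(reg_numbers, fetch, out_dir, batch_size=100):
    """Call fetch(reg_no) for each number and write one CSV per batch.

    `fetch` is a hypothetical stand-in for the scraping step.
    Returns the list of files written.
    """
    written = []
    batch = []
    for reg_no in reg_numbers:
        batch.append((reg_no, fetch(reg_no)))
        if len(batch) == batch_size:
            written.append(write_batch(batch, out_dir))
            batch = []
    if batch:  # leftover rows that did not fill a whole batch
        written.append(write_batch(batch, out_dir))
    return written

# Offline demonstration with a fake fetch function instead of the real website.
demo_dir = tempfile.mkdtemp()
files = scrape_in_batches(range(1, 251), lambda n: 'record %d' % n, demo_dir)
print([os.path.basename(f) for f in files])
```

With 250 fake records and the default batch size, the demonstration writes three files; the last one holds the 50 leftover rows.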
In [ ]:
tokenized_df = pd.DataFrame(columns=['Registration','Name','Father','Address', 'Division'])
#Since originally we created a number of csv files each containing 100 doctors we parsed them differently.
#file_list = []
#for item in xrange(1,66):
# file_list.append(str(item)+'.csv')
#for file_ in file_list:
df = pd.read_csv('all_bma_doctor.csv')
for index in df.index:
    string = df.loc[index, 'extracted']
    # Registration number: last space-separated token of the extracted cell
    start="Registration Number</td>\r\n"
    finish='</td>\r\n </tr>\r\n\r\n <tr class="odd">\r\n'
    reg_no = extract_sub_string(string , start, finish)
    reg_no = reg_no.strip()
    reg_no = reg_no.split(" ")[-1]
    #reg_no
    # Doctor's name: text after the last '>' of the extracted cell
    start = '<td>Doctor\'s Name</td>\r\n'
    finish = '</td>\r\n </tr>\r\n'
    dr_name = extract_sub_string(string , start, finish)
    dr_name=dr_name.strip()
    dr_name = dr_name.split(">")[-1]
    #dr_name
    # Father's name: same approach as the doctor's name
    start = "<td>Father's Name</td>"
    finish = "</td>\r\n </tr>"
    father = extract_sub_string(string , start, finish)
    father = father.strip()
    father = father.split(">")[-1]
    #father
    # Address: content of the <address> tag, with line breaks turned into spaces
    start = '<td> <address> '
    finish = "</address>"
    address = extract_sub_string(string , start, finish)
    address = address.strip()
    address = address.split("<address>")[-1]
    address = address.replace("<br/>",' ').strip()
    #address
    division = 'Medical'
    # Append the parsed fields as a new row
    values = pd.Series()
    values['Registration'] = reg_no
    values['Name'] = dr_name
    values['Father'] = father
    values['Address'] = address
    values['Division'] = division
    tokenized_df.loc[len(tokenized_df)] = values
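The parsing above depends on exact whitespace in the HTML, so it may be worth comparing with a variant that tolerates whitespace differences. The following is a minimal sketch using regular expressions instead of `extract_sub_string`; it is a swapped-in technique, not the method used in this notebook. The helper names `cell_after` and `parse_record` are hypothetical, and the sample fragment is invented from the labels seen in the cell above, so the real page layout may differ (in particular for the registration number cell).

```python
import re

def cell_after(fragment, label):
    """Return the text of the <td> cell that follows the cell ending with `label`."""
    pattern = re.escape(label) + r'\s*</td>\s*<td[^>]*>(.*?)</td>'
    match = re.search(pattern, fragment, re.DOTALL)
    return match.group(1).strip() if match else None

def parse_record(fragment):
    """Turn one extracted HTML fragment into a dict of fields."""
    address = re.search(r'<address>(.*?)</address>', fragment, re.DOTALL)
    if address:
        # Replace <br/> line breaks with spaces, as the original cell does.
        address_text = re.sub(r'<br\s*/?>', ' ', address.group(1)).strip()
    else:
        address_text = None
    return {
        'Registration': cell_after(fragment, 'Registration Number'),
        'Name': cell_after(fragment, "Doctor's Name"),
        'Father': cell_after(fragment, "Father's Name"),
        'Address': address_text,
        'Division': 'Medical',
    }

# Hypothetical fragment (an assumption about the layout, not real portal output).
sample = (
    '<tr><td>Registration Number</td>\r\n <td>12345</td>\r\n </tr>\r\n'
    '<tr class="odd">\r\n <td>Doctor\'s Name</td>\r\n <td>Dr. Example</td>\r\n </tr>\r\n'
    '<tr><td>Father\'s Name</td><td>Mr. Example</td>\r\n </tr>\r\n'
    '<tr><td> <address> 1 Sample Road<br/>Dhaka </address></td></tr>'
)
record = parse_record(sample)
print(record)
```

A list of such dicts can be passed to `pd.DataFrame` once at the end, which avoids growing the table one row at a time.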
In [17]:
tokenized_df[5000:5010]
Out[17]:
In [15]:
for bma_id in xrange(1,66001):
    # Download each doctor's photo and save it under the registration number
    response = requests.get('http://bmdc.org.bd/dphotos/medical/'+str(bma_id)+'.JPG')
    with open(str(bma_id)+'.jpg','wb') as f:
        f.write(response.content)
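The loop above writes whatever the server returns, including error pages for IDs that have no photo. A hedged sketch of a more defensive download step follows. The names `photo_url`, `save_photo` and the `fetch` hook are hypothetical; the URL pattern is the one used in the cell above, and treating any non-200 status as "no photo" is an assumption about how the server behaves.

```python
import os

BASE_URL = 'http://bmdc.org.bd/dphotos/medical/'  # taken from the cell above

def photo_url(bma_id):
    return BASE_URL + str(bma_id) + '.JPG'

def save_photo(bma_id, out_dir, fetch=None):
    """Download one photo; return the saved path, or None if nothing usable came back.

    `fetch` is a hypothetical hook: a callable taking a URL and returning
    (status_code, body_bytes). It defaults to a thin wrapper around requests.get.
    """
    if fetch is None:
        import requests  # imported lazily so the helper can be exercised offline

        def fetch(url):
            response = requests.get(url, timeout=30)
            return response.status_code, response.content

    status, body = fetch(photo_url(bma_id))
    if status != 200 or not body:
        return None  # skip missing photos instead of saving an error page
    path = os.path.join(out_dir, str(bma_id) + '.jpg')
    with open(path, 'wb') as handle:
        handle.write(body)
    return path
```

Calling `save_photo(bma_id, '.')` inside the loop would then only create files for successful responses, and collecting the IDs that returned `None` gives a list to retry or inspect later.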