Amnesty International Deutschland: Jahresberichte 2015

The following steps download all the Jahresberichte (annual reports) for the year 2015 and extract their content.

The code runs in Python 3 and uses the requests and bs4 libraries (bs4 can be installed with $ conda install beautiful-soup).

Step 1: The index has 4 pages of up to 50 links each:



In [3]:

    
import bs4
import requests



In [4]:

    
# URL of one index page (pages are numbered from 0)
jbindexurl = lambda page: "http://www.amnesty.de/laenderbericht/australien?page=%d&country=&topic=&node_type=ai_annual_report&from_month=0&from_year=&to_month=0&to_year=&submit_x=103&submit_y=13&submit=Auswahl+anzeigen&result_limit=50&form_id=ai_core_search_form" % page
# "html.parser" is named explicitly so that bs4 does not have to guess a parser
jbindices = [bs4.BeautifulSoup(requests.get(jbindexurl(i)).text, "html.parser") for i in range(4)]
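
The long query string can also be assembled from a dictionary, which makes the page parameter easier to spot. This is only a minimal sketch, assuming the same parameters as in the URL above; the names BASE_URL, QUERY and index_url are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Illustrative names; the parameter values are copied from the URL above.
BASE_URL = "http://www.amnesty.de/laenderbericht/australien"
QUERY = {
    "country": "",
    "topic": "",
    "node_type": "ai_annual_report",
    "from_month": 0,
    "from_year": "",
    "to_month": 0,
    "to_year": "",
    "submit_x": 103,
    "submit_y": 13,
    "submit": "Auswahl anzeigen",  # urlencode turns the space into "+"
    "result_limit": 50,
    "form_id": "ai_core_search_form",
}


def index_url(page):
    """Return the URL of one index page (pages are numbered from 0)."""
    return BASE_URL + "?" + urlencode({"page": page, **QUERY})


urls = [index_url(i) for i in range(4)]
```

The list urls could then be passed to requests.get in place of the calls to jbindexurl.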

Step 2: downloading each linked HTML page

For each of the 4 index pages, download every linked page whose link text matches the regular expression below.



In [5]:

    
import re
ar2015 = re.compile("Amnesty Report 2015")



In [6]:

    
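reports = {}
for jbindex in jbindices:
    # all links whose text matches "Amnesty Report 2015"
    a_reports = jbindex.find_all("a", text=ar2015)
    for a in a_reports:
        # the link text has the form "Amnesty Report 2015 <country>": drop the first three words
        country = ' '.join(a.contents[0].split()[3:])
        reports[country] = requests.get("http://www.amnesty.de" + a.get("href")).text
        print(country, end=", ")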

Argentinien, Ägypten, Peru, Kambodscha, Kanada, Kuwait, Österreich, Algerien, Panama, Papua-Neuguinea, Nigeria, Simbabwe, Zentralafrikanische Republik, Saudi-Arabien, Malediven, Tadschikistan, Estland, Trinidad und Tobago, Kirgisistan, Burkina Faso, Tschad, Guinea, Irak, Tschechien, Suriname, Honduras, Thailand, Mali, Taiwan, Turkmenistan, Niederlande, Georgien, Vietnam, Kamerun, Serbien (einschließlich Kosovo), Norwegen, Neuseeland, Burundi, Chile, Mauretanien, Portugal, Tunesien, Oman, Libanon, Angola, Belgien, Palästina, Haiti, Gambia, Australien, Brunei Darussalam, Deutschland, Libyen, China, Lettland, Philippinen, Großbritannien und Nordirland, Usbekistan, Belarus, Israel und besetzte palästinensische Gebiete, Guyana, Kenia, Afghanistan, Montenegro, Bolivien, Südafrika, Nepal, Malta, Ghana, Bosnien und Herzegowina, Vereinigte Staaten von Amerika, Kongo (Demokratische Republik), Mazedonien, Dominikanische Republik, Russische Föderation, Sierra Leone, Dänemark, Spanien, Ukraine, Irland, Aserbaidschan, Zypern, Bulgarien, Mosambik, Timor-Leste, Slowakei, Guatemala, Polen, Ecuador, Mongolei, Puerto Rico, Albanien, Kuba, Mexiko, Äquatorialguinea, Katar, Guinea-Bissau, Bahrain, Pakistan, Moldau, Singapur, Laos, Türkei, Côte d'Ivoire, Armenien, Togo, Venezuela, Litauen, Kasachstan, Iran, Uganda, Schweden, Nicaragua, Indien, Swasiland, Südsudan, Rumänien, Italien, Frankreich, Brasilien, Sri Lanka, Somalia, Malawi, El Salvador, Eritrea, Griechenland, Tansania, Nauru, Myanmar, Jamaika, Finnland, Jemen, Korea (Süd), Bahamas, Kroatien, Syrien, Ungarn, Marokko und Westsahara, Jordanien, Korea (Nord), Kongo (Republik), Benin, Malaysia, Sudan, Namibia, Kolumbien, Senegal, Japan, Sambia, Indonesien, Niger, Uruguay, Slowenien, Paraguay, Fidschi, Vereinigte Arabische Emirate, Ruanda, Bangladesch, Äthiopien,
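
The filter on the link text and the extraction of the country name can be tried out without network access. The sketch below uses made-up link texts (an assumption modelled on the output above, not copied from the site) and applies the regular expression with search(), which is how Beautiful Soup applies a compiled pattern.

```python
import re

ar2015 = re.compile("Amnesty Report 2015")

# Hypothetical link texts, modelled on the index pages.
link_texts = [
    "Amnesty Report 2015 Burkina Faso",
    "Amnesty Report 2014 Burkina Faso",        # wrong year: filtered out
    "Amnesty Report 2015   Kongo (Republik)",  # extra whitespace is collapsed
]

countries = [
    " ".join(text.split()[3:])  # drop "Amnesty", "Report", "2015"
    for text in link_texts
    if ar2015.search(text)
]
print(countries)
```

Because split() without arguments splits on any run of whitespace, line breaks or repeated spaces inside the link text do not end up in the country name.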

Step 3: HTML template and write to HTML file

Only the actual HTML of the report is written to a small HTML file. That HTML is the parent of the parent of the <h3> header "Amnesty Report 2015". The remaining link bars at the top and at the bottom are removed as well.
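
The extraction can be illustrated on a tiny hand-written page. This is a sketch under assumptions: the markup in SAMPLE is invented for the example and only mimics the structure described above (an <h3> two levels below the content container, surrounded by ai_core_service_bar lists); the real pages are larger.

```python
import re

import bs4

# Invented sample markup; "Testland" is not a real report.
SAMPLE = """
<div id="main">
  <ul class="ai_core_service_bar"><li>top</li></ul>
  <div class="header"><h3>Amnesty Report 2015 Testland</h3></div>
  <p>Report text.</p>
  <ul class="ai_core_service_bar"><li>zurück</li></ul>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE, "html.parser")
h3 = soup.find("h3", string=re.compile("Amnesty Report 2015"))

content = h3.parent.parent  # here: <div id="main">
h3.name = "h1"              # renaming the tag changes the serialised HTML

# extract() detaches each bar from the tree
for bar in content.find_all("ul", class_="ai_core_service_bar"):
    bar.extract()

result = str(content)
print(result)
```

The sketch passes the pattern as string=, the current name of the argument in Beautiful Soup 4.4 and later; text= is the older spelling of the same argument.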



In [7]:

    
TMPL = """\
<!DOCTYPE html>
<html>
<head>
<title>Amnesty Report 2015 {country}</title>
</head>
<body>
{content}
</body>
</html>
"""



In [12]:

    
for country, report in reports.items():
    bs = bs4.BeautifulSoup(report, "html.parser")
    h3 = bs.find("h3", text=ar2015)
    
    # parent of parent contains the main content
    content = h3.parent.parent
    
    # changing the h3 header to a proper h1 header
    h3.name = "h1"
    
    # we want neither the bar at the top nor the bar at the bottom ("zurück")
    # (extract() removes it from the DOM)
    for bar in content.find_all("ul", class_="ai_core_service_bar"):
        bar.extract()
    
    # writing to html file (the built-in open() of Python 3 accepts the encoding directly)
    with open(country.lower().replace(" ", "_") + ".html", "w", encoding="utf8") as f:
        f.write(TMPL.format(country=country, content=str(content)))
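
How a country name turns into a file name, and how the template is filled, can be checked in isolation. The following sketch repeats the template from above so that it runs on its own; the helper report_filename and the placeholder content are illustrative assumptions, not part of the original notebook.

```python
TMPL = """\
<!DOCTYPE html>
<html>
<head>
<title>Amnesty Report 2015 {country}</title>
</head>
<body>
{content}
</body>
</html>
"""


def report_filename(country):
    """Lower-case the country name and replace spaces with underscores."""
    return country.lower().replace(" ", "_") + ".html"


# Placeholder content standing in for the extracted report HTML.
page = TMPL.format(country="Burkina Faso", content="<div><h1>Amnesty Report 2015</h1></div>")

print(report_filename("Burkina Faso"))
print(report_filename("Côte d'Ivoire"))
```

Only spaces are replaced, so characters such as apostrophes, parentheses and umlauts remain in the file names.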


