Data Scraping

Scraping Past Government Work Reports

Wang Chengjun (王成军)

wangchengjun@nju.edu.cn

Computational Communication (计算传播网) http://computational-communication.com



In [2]:

    
import requests
from bs4 import BeautifulSoup



In [3]:

    
from IPython.display import display_html, HTML
HTML('<iframe src=http://www.hprc.org.cn/wxzl/wxysl/lczf/ width=1000 height=500></iframe>')
# the webpage we would like to crawl









    Out[3]:

Inspect the link for one report in the browser, for example:

· 2016年政府工作报告 (2016 Government Work Report)

<td width="274" class="bl">· <a href="./d12qgrdzfbg/201603/t20160318_369509.html" target="_blank" title="2016年政府工作报告">2016年政府工作报告</a></td>



In [15]:

    
# get the link for each year
url = "http://www.hprc.org.cn/wxzl/wxysl/lczf/" 
content = requests.get(url)
content.encoding









    Out[15]:





'ISO-8859-1'

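Encoding

ASCII
- A 7-bit character set.
- Short for American Standard Code for Information Interchange; designed for communication in American English.
- It has 128 characters: 95 printable ones (upper- and lower-case letters, the digits 0-9, punctuation, and the space) and 33 control characters (line feed, tab, backspace, bell, etc.).

iso8859-1, usually called Latin-1
- An extension of ASCII: the first 128 codes are identical.
- A single-byte encoding covering the values 0-255, used for Western European languages. For example, the letter a is encoded as 0x61 = 97.
- It cannot represent Chinese characters.
- Because it is single-byte, any byte sequence can be decoded as iso8859-1 without raising an error, and many protocols use it as their default. requests, for instance, falls back to ISO-8859-1 when a text response declares no charset in its HTTP headers, which is probably why 'ISO-8859-1' is reported above.

gb2312/gbk/gb18030
- The Chinese national standard (guobiao) encodings, designed for Chinese characters. They are variable-length: ASCII characters take one byte with the same values as in ASCII, and Chinese characters take two bytes (gb18030 also has four-byte codes). They are compatible with ASCII, but not with the upper half of iso8859-1.
- gbk can represent both traditional and simplified characters. The K is the initial of "Kuo" in the pinyin Kuo Zhan (扩展, "extension").
- gb2312 covers only simplified characters; gbk is backward compatible with gb2312.
- gb18030, in full the national standard GB 18030-2005 "Information technology - Chinese coded character set", extends gbk and covers all of Unicode. (A later revision, GB 18030-2022, has since been published.)

unicode
- The universal character set: it assigns a number (code point) to every character of every language.
- Unicode itself is not a byte encoding. Its UTF-16 encoding uses two bytes for most characters (English letters included) and four bytes for the rest, so English text takes more space than in iso8859-1.
- UTF-16 is not byte-compatible with iso8859-1 or other single-byte encodings. For characters in the Latin-1 range, big-endian UTF-16 simply puts a zero byte in front; the letter a, for example, is "00 61".
- Because unicode can represent all characters, many programs use it internally; Java strings, for example, are UTF-16. Note that UTF-16, like GB2312/GBK, is not strictly fixed-width.

UTF
- Two- and four-byte unicode encodings are wasteful for transmission and storage, which led to the UTF (Unicode Transformation Format) encodings.
- utf8 is compatible with ASCII (not with the upper half of iso8859-1) and can represent the characters of all languages.
- utf8 is variable-length: each character takes 1 to 4 bytes (the original design allowed up to 6; RFC 3629 limits it to 4).
- utf8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode.
  - Created by Ken Thompson and Rob Pike in 1992 and now standardized as RFC 3629.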
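
The byte lengths discussed above can be checked directly in Python. The following is a small sketch that uses only the standard codecs; the sample string and the helper name `byte_lengths` are made up for illustration.

```python
# Encode the same text under several encodings and compare byte lengths.
text = "a中"  # one ASCII letter plus one Chinese character

def byte_lengths(s):
    """Return {encoding: number of bytes} for a handful of encodings."""
    encodings = ("gb2312", "gbk", "gb18030", "utf-8", "utf-16-be")
    return {enc: len(s.encode(enc)) for enc in encodings}

lengths = byte_lengths(text)
print(lengths)

# Big-endian UTF-16 puts a zero byte in front of an ASCII letter.
print("a".encode("utf-16-be"))

# Latin-1 has no code for Chinese characters, so encoding fails.
try:
    text.encode("iso8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

In the GB encodings the letter takes one byte and the Chinese character two; in utf-8 the Chinese character takes three bytes.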

decode

~~urllib2.urlopen(url).read().decode('gb18030')~~

content.encoding = 'gb18030'

content = content.text

content = content.text.encode(content.encoding).decode('gb18030')
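
The re-encode/decode line above can be tried offline. This sketch assumes, for illustration only, that the server sent GB18030 bytes and that they were first decoded as ISO-8859-1; the sample string is made up and no request is sent.

```python
# Bytes as a GB18030 server would send them (made-up sample text).
raw = "政府工作报告".encode("gb18030")

# What .text would hold if the bytes were decoded as Latin-1: mojibake.
wrong = raw.decode("iso8859-1")

# Latin-1 maps every byte to one character, so encoding back recovers
# the original bytes exactly, which can then be decoded correctly.
fixed = wrong.encode("iso8859-1").decode("gb18030")

print(wrong != "政府工作报告")
print(fixed)
```

Setting `content.encoding` before reading `content.text` avoids the round trip altogether, which is what the next cell does.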

html.parser

BeautifulSoup(content, 'html.parser')



In [16]:

    
# Specify the encoding
content.encoding = 'utf8' # 'gb18030'
content = content.text



In [17]:

    
soup = BeautifulSoup(content, 'html.parser')
# links = soup.find_all('td', {'class': 'bl'})
links = soup.select('.bl a')  # every <a> inside an element with class "bl"
print(links[0])









    



<a href="./dssjqgrmdbdh_1/201903/t20190318_4849567.html" target="_blank" title="2019年政府工作报告">2019年政府工作报告</a>



In [18]:

    
len(links)









    Out[18]:





50



In [19]:

    
links[-1]['href']









    Out[19]:





'./dishiyijie_10/200908/t20090818_3955459.html'



In [20]:

    
links[0]['href'].split('./')[1]









    Out[20]:





'dssjqgrmdbdh_1/201903/t20190318_4849567.html'



In [21]:

    
url + links[0]['href'].split('./')[1]









    Out[21]:





'http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201903/t20190318_4849567.html'
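
As an alternative to splitting on './', the standard library's `urllib.parse.urljoin` resolves a relative href against the page URL. A minimal sketch using the values shown above (the variable names are illustrative):

```python
from urllib.parse import urljoin

base = "http://www.hprc.org.cn/wxzl/wxysl/lczf/"
href = "./dssjqgrmdbdh_1/201903/t20190318_4849567.html"

# urljoin handles "./", "../" and absolute hrefs in the same way a browser does.
full = urljoin(base, href)
print(full)

# It also resolves a bare file name against the page that contains the link.
page = "http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html"
next_page = urljoin(page, "t20090818_3955570_1.html")
print(next_page)
```

This stays correct if the site ever mixes relative and absolute links, where string splitting would fail.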



In [22]:

    
hyperlinks = [url + i['href'].split('./')[1] for i in links]
hyperlinks[:5]









    Out[22]:





['http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201903/t20190318_4849567.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dssjqgrmdbdh_1/201803/t20180323_4240852.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201703/t20170317_4144138.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201603/t20160318_4135203.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/d12qgrdzfbg/201503/t20150318_4106347.html']



In [23]:

    
hyperlinks[-5:]









    Out[23]:





['http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_9/200908/t20090818_3955464.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955462.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955461.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955460.html',
 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_10/200908/t20090818_3955459.html']



In [26]:

    
hyperlinks[12] # the 2007 report is split across several pages









    Out[26]:





'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html'



In [30]:

    
from IPython.display import display_html, HTML

HTML('<iframe src=http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html width=1000 height=500></iframe>')
# the 2007 report is split across several pages









    Out[30]:

Inspect the "下一页" (next page) link

<a href="t20090818_27775_1.html"><span style="color:#0033FF;font-weight:bold">下一页</span></a>

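The link is written by a script that sits inside a table cell, hence the selector 'td script' below. Nesting, from the inside out:

a
- script
  - td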



In [39]:

    
url_i = 'http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html'
content = requests.get(url_i)
content.encoding = 'utf8'
content = content.text
#content = content.text.encode(content.encoding).decode('gb18030')
soup = BeautifulSoup(content, 'html.parser') 
#scripts = soup.find_all('script')
#scripts[0]
scripts = soup.select('td script')[0]



In [40]:

    
scripts









    Out[40]:





<script>
	var currentPage = 0;//所在页从0开始
	var prevPage = currentPage-1//上一页
	var 下一页Page = currentPage+1//下一页
	var countPage = 4//共多少页
	//document.write("共"+countPage+"页&nbsp;&nbsp;");
	
	//循环
	var num = 17;
	for(var i=0+(currentPage-1-(currentPage-1)%num) ; i<=(num+(currentPage-1-(currentPage-1)%num))&&(i<countPage) ; i++){
		if(countPage >1){
			if(currentPage==i)
				document.write("【<span style=\"color:#FF0000;\" class=\"hui14_30_h\">"+(i+1)+"</span>】&nbsp;");
			else if(i==0)
				document.write("<a href=\"t20090818_3955570.html\" class=\"hui14_30_h\">【"+(i+1)+"】</a>&nbsp;");
			else
				document.write("<a href=\"t20090818_3955570"+"_" + i + "."+"html\" class=\"hui14_30_h\">【"+(i+1)+"】</a>&nbsp;");
		}	
	}
	
	document.write("<br><br>");
	//设置上一页代码
	if(countPage>1&&currentPage!=0&&currentPage!=1)
		document.write("<a href=\"t20090818_3955570"+"_" + prevPage + "."+"html\"><span style=\"color:#0033FF;font-weight:bold\">上一页</span></a>&nbsp;");
	else if(countPage>1&&currentPage!=0&&currentPage==1)
		document.write("<a href=\"t20090818_3955570.html\"><span style=\"color:#0033FF;font-weight:bold\">上一页</span></a>&nbsp;");
	//else
	//	document.write("上一页 &nbsp;");
	
	
	//设置下一页代码 
	if(countPage>1&&currentPage!=(countPage-1))
		document.write("<a href=\"t20090818_3955570"+"_" + 下一页Page + "."+"html\" ><span style=\"color:#0033FF;font-weight:bold\">下一页</span></a> &nbsp;");
	//else
	//	document.write("下一页 &nbsp;");
					 
	</script>



In [41]:

    
scripts.text









    Out[41]:





'\n\tvar currentPage = 0;//所在页从0开始\n\tvar prevPage = currentPage-1//上一页\n\tvar 下一页Page = currentPage+1//下一页\n\tvar countPage = 4//共多少页\n\t//document.write("共"+countPage+"页&nbsp;&nbsp;");\n\t\n\t//循环\n\tvar num = 17;\n\tfor(var i=0+(currentPage-1-(currentPage-1)%num) ; i<=(num+(currentPage-1-(currentPage-1)%num))&&(i<countPage) ; i++){\n\t\tif(countPage >1){\n\t\t\tif(currentPage==i)\n\t\t\t\tdocument.write("【<span style=\\"color:#FF0000;\\" class=\\"hui14_30_h\\">"+(i+1)+"</span>】&nbsp;");\n\t\t\telse if(i==0)\n\t\t\t\tdocument.write("<a href=\\"t20090818_3955570.html\\" class=\\"hui14_30_h\\">【"+(i+1)+"】</a>&nbsp;");\n\t\t\telse\n\t\t\t\tdocument.write("<a href=\\"t20090818_3955570"+"_" + i + "."+"html\\" class=\\"hui14_30_h\\">【"+(i+1)+"】</a>&nbsp;");\n\t\t}\t\n\t}\n\t\n\tdocument.write("<br><br>");\n\t//设置上一页代码\n\tif(countPage>1&&currentPage!=0&&currentPage!=1)\n\t\tdocument.write("<a href=\\"t20090818_3955570"+"_" + prevPage + "."+"html\\"><span style=\\"color:#0033FF;font-weight:bold\\">上一页</span></a>&nbsp;");\n\telse if(countPage>1&&currentPage!=0&&currentPage==1)\n\t\tdocument.write("<a href=\\"t20090818_3955570.html\\"><span style=\\"color:#0033FF;font-weight:bold\\">上一页</span></a>&nbsp;");\n\t//else\n\t//\tdocument.write("上一页 &nbsp;");\n\t\n\t\n\t//设置下一页代码 \n\tif(countPage>1&&currentPage!=(countPage-1))\n\t\tdocument.write("<a href=\\"t20090818_3955570"+"_" + 下一页Page + "."+"html\\" ><span style=\\"color:#0033FF;font-weight:bold\\">下一页</span></a> &nbsp;");\n\t//else\n\t//\tdocument.write("下一页 &nbsp;");\n\t\t\t\t\t \n\t'



In [42]:

    
# countPage = int(''.join(scripts).split('countPage = ')\
#                 [1].split('//')[0])
# countPage

countPage = int(scripts.text.split('countPage = ')[1].split('//')[0])
countPage









    Out[42]:





4
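
The chained split works, but it raises an IndexError when a page has no `countPage` variable. A regular expression is one possible alternative. In this sketch the helper name `parse_count_page` and the default of 1 page are assumptions, and the sample text is a shortened copy of the script shown above.

```python
import re

def parse_count_page(script_text):
    """Return the integer assigned to countPage, or 1 if none is found."""
    match = re.search(r"countPage\s*=\s*(\d+)", script_text)
    return int(match.group(1)) if match else 1

sample = (
    "\n\tvar currentPage = 0;//所在页从0开始"
    "\n\tvar countPage = 4//共多少页"
    '\n\t//document.write("共"+countPage+"页&nbsp;&nbsp;");'
)

print(parse_count_page(sample))
print(parse_count_page("no pagination here"))
```

Only the assignment matches the pattern; later uses of `countPage` in comparisons are ignored because no `=` follows the name there.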



In [43]:

    
import sys

def flushPrint(s):
    # overwrite the current console line with s (a simple progress display)
    sys.stdout.write('\r')
    sys.stdout.write('%s' % s)
    sys.stdout.flush()

def crawler(url_i):
    content = requests.get(url_i)
    content.encoding = 'utf8'
    soup = BeautifulSoup(content.text, 'html.parser')
    # the title span starts with the four-digit year
    year = int(soup.find('span', {'class': 'huang16c'}).text[:4])
    report = ''.join(s.text for s in soup('p'))
    # find the pagination info in the second script element
    scripts = soup.find_all('script')
    countPage = int(scripts[1].text.split('countPage = ')[1].split('//')[0])
    # pages after the first are named <name>_1.html, <name>_2.html, ...
    for i in range(1, countPage):
        url_child = url_i.split('.html')[0] + '_' + str(i) + '.html'
        content = requests.get(url_child)
        content.encoding = 'utf8'  # same encoding as the first page
        soup = BeautifulSoup(content.text, 'html.parser')
        report += ''.join(s.text for s in soup('p'))
    return year, report
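
The page-naming scheme used inside the crawler can be isolated and checked without any network access. A small sketch follows; the helper name `page_urls` is hypothetical and not part of the notebook.

```python
def page_urls(url_first, count_page):
    """List the URLs of all pages of one report.

    The first page keeps its name; page i (i >= 1) gets the suffix _i
    before .html, following the pagination script shown earlier.
    """
    stem = url_first.rsplit(".html", 1)[0]
    return [url_first] + [f"{stem}_{i}.html" for i in range(1, count_page)]

first = "http://www.hprc.org.cn/wxzl/wxysl/lczf/dishiyijie_1/200908/t20090818_3955570.html"
urls = page_urls(first, 4)
for u in urls:
    print(u)
```

Separating URL generation from downloading makes it easy to inspect the list before sending any requests.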



In [44]:

    
# crawl the text of all 50 government work reports
reports = {}
for link in hyperlinks:
    year, report = crawler(link)
    flushPrint(year)
    reports[year] = report



In [45]:

    
with open('../data/gov_reports1954-2019.txt', 'w', encoding = 'utf8') as f:
    for r in reports:
        line = str(r)+'\t'+reports[r].replace('\n', '\t') +'\n'
        f.write(line)
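
If a report contains tabs or line breaks, a hand-built tab-separated line can end up with extra fields. One possible safeguard is the standard `csv` module, which quotes such fields. The sketch below writes to an in-memory buffer instead of a file, and its `sample_reports` dictionary is made-up data.

```python
import csv
import io

# Made-up sample standing in for the crawled reports.
sample_reports = {2018: "first line\nsecond\tline", 2019: "plain text"}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for year in sorted(sample_reports):
    # fields containing tabs or newlines are quoted automatically
    writer.writerow([year, sample_reports[year]])

# Read the data back and check that nothing was split or lost.
buffer.seek(0)
rows = [(int(y), r) for y, r in csv.reader(buffer, delimiter="\t")]
print(rows)
```

When writing to a real file with `csv.writer`, open it with `newline=''` as the `csv` documentation recommends.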



In [46]:

    
import pandas as pd

df = pd.read_table('../data/gov_reports1954-2019.txt', names = ['year', 'report'])



In [48]:

    
df[-5:]









    Out[48]:







  
    
      
	year	report
45	2015	国务院总理 李克强　　各位代表：　　现在，我代表国务院，向大会报告政府工作，请予审议，并...
46	2016	政府工作报告
47	2017	各位代表：　　现在，我代表国务院，向大会报告政府工作，请予审议，并请全国政协各位委员提出...
48	2018	各位代表：　　现在，我代表国务院，向大会报告过去五年政府工作，对今年工作提出建议，请予审...
49	2019	新华社北京3月16日电　　 政府工作报告　　——2019年3月5日在第十三届全国人民代表...

This is the end.

Thank you for your attention.
