Data Scraping:

An Introduction to BeautifulSoup


王成军

wangchengjun@nju.edu.cn

计算传播网 http://computational-communication.com

Problems to solve

  • Parsing pages
  • Extracting data hidden behind JavaScript
  • Automatic pagination
  • Automatic login
  • Connecting to APIs

In [14]:
import urllib2  # Python 2; in Python 3, use urllib.request instead
from bs4 import BeautifulSoup

Beautiful Soup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  • Beautiful Soup provides a few simple methods for navigating, searching, and modifying a parse tree; it doesn't take much code to write an application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't declare one and Beautiful Soup can't detect it; then you just specify the original encoding (see the sketch below).
  • Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, letting you choose a parsing strategy or trade speed for flexibility.
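
A quick illustration of the second point (a minimal sketch; the GB2312-encoded markup is invented for the example):

# Bytes in a legacy encoding come out as Unicode; the declared/detected
# encoding is recorded on the soup object.
markup = u'<html><body><p>\u4f60\u597d</p></body></html>'.encode('gb2312')
soup_cn = BeautifulSoup(markup, 'html.parser', from_encoding='gb2312')
print(soup_cn.p.string)           # the text, now a Unicode string
print(soup_cn.original_encoding)  # 'gb2312'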

Install beautifulsoup4

Open your terminal (cmd on Windows):

$ pip install beautifulsoup4

Your First Scraper

BeautifulSoup Quick Start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [25]:
url = 'file:///Users/chengjun/GitHub/cjc2016/data/test.html'
content = urllib2.urlopen(url).read() 
soup = BeautifulSoup(content, 'html.parser') 
soup


Out[25]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
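
The cell above reads a local file that you may not have; the same soup can be built from an inline string (the markup below is copied from the output above):

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>"""
soup = BeautifulSoup(html_doc, 'html.parser')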

html.parser

Beautiful Soup supports the html.parser included in Python's standard library.

lxml

But it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

html5lib

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
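
The parsers differ in speed and in how aggressively they repair broken markup; html5lib rebuilds the tree the way a browser would, while html.parser does the least clean-up. A minimal sketch, comparing whichever parsers happen to be installed:

# Feed the same invalid fragment to each parser and compare the repairs
for parser in ['html.parser', 'lxml', 'html5lib']:
    try:
        print('%-12s %s' % (parser, BeautifulSoup('<a></p>', parser)))
    except Exception:  # raised when the parser is not installed
        print('%-12s (not installed)' % parser)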


In [26]:
print(soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

  • html
    • head
      • title
    • body
      • p (class='title' or class='story')
        • a (class='sister')
          • href / id

In [72]:
for tag in soup.find_all(True):
    print(tag.name)


html
head
title
body
p
b
p
a
a
a
p
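
The flat list above hides the nesting; indenting each tag by its depth makes the tree visible (a minimal sketch using the soup from above):

# Indent each tag by its nesting depth; tag.parents walks up to the
# BeautifulSoup object itself, hence the -1.
for tag in soup.find_all(True):
    depth = len(list(tag.parents)) - 1
    print('  ' * depth + tag.name)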

In [58]:
soup('head') # or soup.head


Out[58]:
[<head><title>The Dormouse's story</title></head>]

In [59]:
soup('body') # or soup.body


Out[59]:
[<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>]

In [29]:
soup('title')  # or  soup.title


Out[29]:
[<title>The Dormouse's story</title>]

In [60]:
soup('p')


Out[60]:
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [62]:
soup.p


Out[62]:
<p class="title"><b>The Dormouse's story</b></p>

In [30]:
soup.title.name


Out[30]:
u'title'

In [31]:
soup.title.string


Out[31]:
u"The Dormouse's story"

In [48]:
soup.title.text


Out[48]:
u"The Dormouse's story"

In [32]:
soup.title.parent.name


Out[32]:
u'head'
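
.parent goes up one level; .parents iterates all the way to the root. A minimal sketch walking up from the first link:

# Ancestors of the first <a>: p, body, html, then the soup object itself
for parent in soup.a.parents:
    print(parent.name)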

In [33]:
soup.p


Out[33]:
<p class="title"><b>The Dormouse's story</b></p>

In [34]:
soup.p['class']


Out[34]:
[u'title']

In [50]:
soup.find_all('p', {'class': 'title'})  # attrs dict: tags whose class is 'title'


Out[50]:
[<p class="title"><b>The Dormouse's story</b></p>]

In [78]:
soup.find_all('p', class_='title')  # keyword form; equivalent to the dict above


Out[78]:
[<p class="title"><b>The Dormouse's story</b></p>]

In [49]:
soup.find_all('p', {'class': 'story'})


Out[49]:
[<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [57]:
soup.find_all('p', {'class': 'story'})[0].find_all('a')


Out[57]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [35]:
soup.a


Out[35]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [79]:
soup('a')


Out[79]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [37]:
soup.find(id="link3")


Out[37]:
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [36]:
soup.find_all('a')


Out[36]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [80]:
soup.find_all('a', {'class': 'sister'})  # compare with soup.find_all('a')


Out[80]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [81]:
soup.find_all('a', {'class': 'sister'})[0]


Out[81]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [44]:
soup.find_all('a', {'class': 'sister'})[0].text


Out[44]:
u'Elsie'

In [46]:
soup.find_all('a', {'class': 'sister'})[0]['href']


Out[46]:
u'http://example.com/elsie'

In [47]:
soup.find_all('a', {'class': 'sister'})[0]['id']


Out[47]:
u'link1'
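
Indexing the result list one attribute at a time gets tedious; in practice you loop over the matches and pull out what you need:

# Collect text, href and id for every sister link
for a in soup.find_all('a', {'class': 'sister'}):
    print('%s -> %s (%s)' % (a.text, a['href'], a['id']))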

In [71]:
soup.find_all(["a", "b"])


Out[71]:
[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [38]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
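
get_text() also takes a separator and a strip flag, which keeps text from adjacent tags from running together:

# Mark tag boundaries with '|' and strip whitespace around each piece
print(soup.get_text('|', strip=True))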


Data Scraping:

Scraping a WeChat Public Account Article from Its URL



王成军

wangchengjun@nju.edu.cn

计算传播网 http://computational-communication.com


In [11]:
from IPython.display import HTML
HTML('<iframe src="http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&\
mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd" \
width=500 height=500></iframe>')
# the webpage we would like to crawl


Out[11]:

View the page source

Inspect


In [15]:
url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&\
mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = urllib2.urlopen(url).read() #获取网页的html文本
soup = BeautifulSoup(content, 'html.parser') 
title = soup.title.text
rmml = soup.find('div', {'class', 'rich_media_meta_list'})
date = rmml.find(id = 'post-date').text
rmc = soup.find('div', {'class', 'rich_media_content'})
content = rmc.get_text()
print title
print date
print content


南大新传 | 微议题:地震中民族自豪—“中国人先撤”
2015-05-04

点击上方“微议题排行榜”可以订阅哦!导读2015年4月25日,尼泊尔发生8.1级地震,造成至少7000多人死亡,中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后,祖国派出救援机接国人回家,这一“先撤”行为被大量报道,上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注,远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当,灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词,选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计,我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震,深受人们的关注。面对国外灾难性事件,微媒体的重心却转向“油价”、“发改委”、“祖国先撤”,致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日,有关“地震”议题出现三个峰值,分别是在4月15日内蒙古地震,20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少,而对尼泊尔地震却给予了极大的关注,无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小,关注少,议程时间也比较短,一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差,但规模大,且衍生话题性较强,其讨论热度持续了一周以上。  议题分类 如图,我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道,包括现场视频,地震强度、规模,损失程度、遇难人员介绍等。更进一步的,有对尼泊尔地震原因探析,认为其处在板块交界处,灾难是必然的。因尼泊尔是佛教圣地,也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震,以及20日台湾地震的报道。偏重于对硬新闻的呈现,介绍地震范围、级数、伤亡情况,少数几篇是对甘肃地震的辟谣,称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关,并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”,来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似,纯粹是对发改委的调侃。称其“预测”地震非常准确,只要一上调油价,便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地,地震逃生注意事项,“专家传受活命三角”,如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间,回忆汶川地震中的故事,传递“:地震无情,人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点,尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%,国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差,而且在衍生话题方面也相差甚大。尼泊尔地震中,除了硬新闻报道外,还有对其原因分析、中国救援情况等,而国内地震只是集中于硬新闻。地震常识介绍只占9%,地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说,网民对地震议题关注度较高,自然灾害类话题一旦爆发,很容易引起人们情感共鸣,掀起热潮。但从点赞数来看,“中国救援回应”类的总点赞与平均点赞都是最高的,网民对地震的关注点并非地震本身,而是与之相关的“政府行动”。尼泊尔地震后,祖国派出救援机接国人回家,这一“先撤”行为被大量报道,上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪,产生民族优越感,激起点赞狂潮。 人的关注小于国民尊严的保护另一方面,国内地震的关注度却很少,不仅体现在政府救援的报道量小,网民的兴趣点与评价也较低。我们对“地震”中人的关注,远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当,灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高,网民对发改委和地震的调侃,反映出的是对油价上涨的不满,这种“怨气”也容易产生共鸣。一面是民族优越感,一面是对政策不满,两种情绪虽矛盾,但同时体现了网民心理趋同。  数据附表 微文章排行TOP50:公众号排行TOP20:作者:晏雪菲出品单位:南京大学计算传播学实验中心技术支持:南京大学谷尼舆情监测分析实验室题图鸣谢:谷尼舆情新微榜、图悦词云
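
For the homework below, the cell above can be wrapped into a reusable function (a sketch; it assumes the target page uses the same rich_media_meta_list / rich_media_content layout as this example, which WeChat may change at any time):

def fetch_weixin_article(url):
    """Return (title, date, body) for a WeChat article page."""
    html = urllib2.urlopen(url).read()
    page = BeautifulSoup(html, 'html.parser')
    title = page.title.text
    meta = page.find('div', {'class': 'rich_media_meta_list'})
    date = meta.find(id='post-date').text
    body = page.find('div', {'class': 'rich_media_content'}).get_text()
    return title, date, body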

Homework:

  • Scrape the content of the latest post from the 复旦新媒体 (Fudan New Media) WeChat public account