Data Scraping:

An Introduction to BeautifulSoup


王成军

wangchengjun@nju.edu.cn

计算传播网 http://computational-communication.com

Problems to solve

  • Parsing pages
  • Extracting data hidden behind JavaScript
  • Automatic pagination
  • Automatic login
  • Connecting to APIs

In [14]:
import urllib2  # Python 2; in Python 3, use urllib.request instead
from bs4 import BeautifulSoup

Beautiful Soup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  • Beautiful Soup provides a few simple methods for navigating, searching, and modifying a parse tree; it doesn't take much code to write an application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't declare one and Beautiful Soup can't detect it; then you just specify the original encoding (see the sketch below).
  • Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, letting you choose a parsing strategy or trade speed for flexibility.
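
A quick illustration of the second point (a minimal sketch; the GB2312-encoded markup is invented for the example):

# Bytes in a legacy encoding come out as Unicode; the declared/detected
# encoding is recorded on the soup object.
markup = u'<html><body><p>\u4f60\u597d</p></body></html>'.encode('gb2312')
soup_cn = BeautifulSoup(markup, 'html.parser', from_encoding='gb2312')
print(soup_cn.p.string)           # the text, now a Unicode string
print(soup_cn.original_encoding)  # 'gb2312'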

Install beautifulsoup4

Open your terminal (cmd on Windows):

$ pip install beautifulsoup4

Your First Scraper

BeautifulSoup Quick Start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [25]:
url = 'file:///Users/chengjun/GitHub/cjc2016/data/test.html'
content = urllib2.urlopen(url).read() 
soup = BeautifulSoup(content, 'html.parser') 
soup


Out[25]:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
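
The cell above reads a local file that you may not have; the same soup can be built from an inline string (the markup below is copied from the output above):

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>"""
soup = BeautifulSoup(html_doc, 'html.parser')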

html.parser

Beautiful Soup supports the html.parser included in Python's standard library.

lxml

But it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

html5lib

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
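
The parsers differ in speed and in how aggressively they repair broken markup; html5lib rebuilds the tree the way a browser would, while html.parser does the least clean-up. A minimal sketch, comparing whichever parsers happen to be installed:

# Feed the same invalid fragment to each parser and compare the repairs
for parser in ['html.parser', 'lxml', 'html5lib']:
    try:
        print('%-12s %s' % (parser, BeautifulSoup('<a></p>', parser)))
    except Exception:  # raised when the parser is not installed
        print('%-12s (not installed)' % parser)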


In [26]:
print(soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

  • html
    • head
      • title
    • body
      • p (class='title' or class='story')
        • a (class='sister')
          • href / id

In [72]:
for tag in soup.find_all(True):
    print(tag.name)


html
head
title
body
p
b
p
a
a
a
p
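
The flat list above hides the nesting; indenting each tag by its depth makes the tree visible (a minimal sketch using the soup from above):

# Indent each tag by its nesting depth; tag.parents walks up to the
# BeautifulSoup object itself, hence the -1.
for tag in soup.find_all(True):
    depth = len(list(tag.parents)) - 1
    print('  ' * depth + tag.name)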

In [58]:
soup('head') # or soup.head


Out[58]:
[<head><title>The Dormouse's story</title></head>]

In [59]:
soup('body') # or soup.body


Out[59]:
[<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>]

In [29]:
soup('title')  # or  soup.title


Out[29]:
[<title>The Dormouse's story</title>]

In [60]:
soup('p')


Out[60]:
[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [62]:
soup.p


Out[62]:
<p class="title"><b>The Dormouse's story</b></p>

In [30]:
soup.title.name


Out[30]:
u'title'

In [31]:
soup.title.string


Out[31]:
u"The Dormouse's story"

In [48]:
soup.title.text


Out[48]:
u"The Dormouse's story"

In [32]:
soup.title.parent.name


Out[32]:
u'head'
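
.parent goes up one level; .parents iterates all the way to the root. A minimal sketch walking up from the first link:

# Ancestors of the first <a>: p, body, html, then the soup object itself
for parent in soup.a.parents:
    print(parent.name)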

In [33]:
soup.p


Out[33]:
<p class="title"><b>The Dormouse's story</b></p>

In [34]:
soup.p['class']


Out[34]:
[u'title']

In [50]:
soup.find_all('p', {'class': 'title'})  # attrs dict: tags whose class is 'title'


Out[50]:
[<p class="title"><b>The Dormouse's story</b></p>]

In [78]:
soup.find_all('p', class_='title')  # keyword form; equivalent to the dict above


Out[78]:
[<p class="title"><b>The Dormouse's story</b></p>]

In [49]:
soup.find_all('p', {'class': 'story'})


Out[49]:
[<p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [57]:
soup.find_all('p', {'class': 'story'})[0].find_all('a')


Out[57]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [35]:
soup.a


Out[35]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [79]:
soup('a')


Out[79]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [37]:
soup.find(id="link3")


Out[37]:
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [36]:
soup.find_all('a')


Out[36]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [80]:
soup.find_all('a', {'class': 'sister'})  # compare with soup.find_all('a')


Out[80]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [81]:
soup.find_all('a', {'class': 'sister'})[0]


Out[81]:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [44]:
soup.find_all('a', {'class': 'sister'})[0].text


Out[44]:
u'Elsie'

In [46]:
soup.find_all('a', {'class': 'sister'})[0]['href']


Out[46]:
u'http://example.com/elsie'

In [47]:
soup.find_all('a', {'class': 'sister'})[0]['id']


Out[47]:
u'link1'
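
Indexing the result list one attribute at a time gets tedious; in practice you loop over the matches and pull out what you need:

# Collect text, href and id for every sister link
for a in soup.find_all('a', {'class': 'sister'}):
    print('%s -> %s (%s)' % (a.text, a['href'], a['id']))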

In [71]:
soup.find_all(["a", "b"])


Out[71]:
[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [38]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
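
get_text() also takes a separator and a strip flag, which keeps text from adjacent tags from running together:

# Mark tag boundaries with '|' and strip whitespace around each piece
print(soup.get_text('|', strip=True))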


Data Scraping:

Scraping a WeChat Public Account Article from Its URL



王成军

wangchengjun@nju.edu.cn

计算传播网 http://computational-communication.com


In [11]:
from IPython.display import HTML
HTML('<iframe src="http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&\
mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd" \
width=500 height=500></iframe>')
# the webpage we would like to crawl


Out[11]:

View the page source

Inspect


In [15]:
url = "http://mp.weixin.qq.com/s?__biz=MzA3MjQ5MTE3OA==&\
mid=206241627&idx=1&sn=471e59c6cf7c8dae452245dbea22c8f3&3rd=MzA3MDU4NTYzMw==&scene=6#rd"
content = urllib2.urlopen(url).read() #获取网页的html文本
soup = BeautifulSoup(content, 'html.parser') 
title = soup.title.text
rmml = soup.find('div', {'class', 'rich_media_meta_list'})
date = rmml.find(id = 'post-date').text
rmc = soup.find('div', {'class', 'rich_media_content'})
content = rmc.get_text()
print title
print date
print content


南大新传 | 微议题:地震中民族自豪—“中国人先撤”
2015-05-04

点击上方“微议题排行榜”可以订阅哦!导读2015年4月25日,尼泊尔发生8.1级地震,造成至少7000多人死亡,中国西藏、印度、孟加拉国、不丹等地均出现人员伤亡。尼泊尔地震后,祖国派出救援机接国人回家,这一“先撤”行为被大量报道,上演了一出霸道总裁不由分说爱国民的新闻。我们对“地震”中人的关注,远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当,灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。  热词图现 本文以“地震”为关键词,选取了2015年4月10日至4月30日期间微议题TOP100阅读排行进行分析。根据微议题TOP100标题的词频统计,我们可以看出有关“地震”的话题最热词汇的有“尼泊尔”、“油价”、“发改委”。4月25日尼泊尔发生了8级地震,深受人们的关注。面对国外灾难性事件,微媒体的重心却转向“油价”、“发改委”、“祖国先撤”,致力于将世界重大事件与中国政府关联起来。  微议题演化趋势 总文章数总阅读数从4月10日到4月30日,有关“地震”议题出现三个峰值,分别是在4月15日内蒙古地震,20日台湾地震和25日尼泊尔地震。其中对台湾地震与内蒙古地震报道文章较少,而对尼泊尔地震却给予了极大的关注,无论是在文章量还是阅读量上都空前增多。内蒙古、台湾地震由于级数较小,关注少,议程时间也比较短,一般3天后就会淡出公共视野。而尼泊尔地震虽然接近性较差,但规模大,且衍生话题性较强,其讨论热度持续了一周以上。  议题分类 如图,我们将此议题分为6大类。1尼泊尔地震这类文章是对4月25日尼泊尔地震的新闻报道,包括现场视频,地震强度、规模,损失程度、遇难人员介绍等。更进一步的,有对尼泊尔地震原因探析,认为其处在板块交界处,灾难是必然的。因尼泊尔是佛教圣地,也有从佛学角度解释地震的启示。2国内地震报道主要是对10日内蒙古、甘肃、山西等地的地震,以及20日台湾地震的报道。偏重于对硬新闻的呈现,介绍地震范围、级数、伤亡情况,少数几篇是对甘肃地震的辟谣,称其只是微震。3中国救援回应地震救援的报道大多是与尼泊尔地震相关,并且80%的文章是中国政府做出迅速反应派出救援机接国人回家。以“中国人又先撤了”,来为祖国点赞。少数几篇是滴滴快的、腾讯基金、万达等为尼泊尔捐款的消息。4发改委与地震这类文章内容相似,纯粹是对发改委的调侃。称其“预测”地震非常准确,只要一上调油价,便会发生地震。5地震常识介绍该类文章介绍全国地震带、地震频发地,地震逃生注意事项,“专家传受活命三角”,如何用手机自救等小常识。6地震中的故事讲述地震中的感人瞬间,回忆汶川地震中的故事,传递“:地震无情,人间有情”的正能量。 国内外地震关注差异大关于“地震”本身的报道仍旧是媒体关注的重点,尼泊尔地震与国内地震报道占一半的比例。而关于尼泊尔话题的占了45%,国内地震相关的只有22%。微媒体对国内外地震关注有明显的偏差,而且在衍生话题方面也相差甚大。尼泊尔地震中,除了硬新闻报道外,还有对其原因分析、中国救援情况等,而国内地震只是集中于硬新闻。地震常识介绍只占9%,地震知识普及还比较欠缺。  阅读与点赞分析  爱国新闻容易激起点赞狂潮整体上来说,网民对地震议题关注度较高,自然灾害类话题一旦爆发,很容易引起人们情感共鸣,掀起热潮。但从点赞数来看,“中国救援回应”类的总点赞与平均点赞都是最高的,网民对地震的关注点并非地震本身,而是与之相关的“政府行动”。尼泊尔地震后,祖国派出救援机接国人回家,这一“先撤”行为被大量报道,上演了一出霸道总裁不由分说爱国民的新闻。而爱国新闻则往往是最容易煽动民族情绪,产生民族优越感,激起点赞狂潮。 人的关注小于国民尊严的保护另一方面,国内地震的关注度却很少,不仅体现在政府救援的报道量小,网民的兴趣点与评价也较低。我们对“地震”中人的关注,远远小于国民尊严的保护。通过“撤离”速度来证明中国的影响力也显得有失妥当,灾难应急管理、救援和灾后重建能力才应是“地震”关注焦点。“发改委与地震”的点赞量也相对较高,网民对发改委和地震的调侃,反映出的是对油价上涨的不满,这种“怨气”也容易产生共鸣。一面是民族优越感,一面是对政策不满,两种情绪虽矛盾,但同时体现了网民心理趋同。  数据附表 微文章排行TOP50:公众号排行TOP20:作者:晏雪菲出品单位:南京大学计算传播学实验中心技术支持:南京大学谷尼舆情监测分析实验室题图鸣谢:谷尼舆情新微榜、图悦词云
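
For the homework below, the cell above can be wrapped into a reusable function (a sketch; it assumes the target page uses the same rich_media_meta_list / rich_media_content layout as this example, which WeChat may change at any time):

def fetch_weixin_article(url):
    """Return (title, date, body) for a WeChat article page."""
    html = urllib2.urlopen(url).read()
    page = BeautifulSoup(html, 'html.parser')
    title = page.title.text
    meta = page.find('div', {'class': 'rich_media_meta_list'})
    date = meta.find(id='post-date').text
    body = page.find('div', {'class': 'rich_media_content'}).get_text()
    return title, date, body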

Homework:

  • Scrape the content of the latest post from the 复旦新媒体 (Fudan New Media) WeChat public account