By Terrill Yang (GitHub: https://github.com/yttty)
In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
In [2]:
print(bsObj)
In [3]:
print(bsObj.h1)
In [4]:
print(bsObj.title)
As you can see, the tag we extracted from the page is nested two layers deep inside the BeautifulSoup object. However, when we want the h1 tag, we can call it directly:
bsObj.h1
In fact, all of the following calls produce the same result:
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
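That equivalence can be checked without a network request. The sketch below parses a minimal, hypothetical inline document (not the exercise page itself) and confirms that every navigation path resolves to the same tag:

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for the exercise page
doc = "<html><head><title>t</title></head><body><h1>An Interesting Title</h1></body></html>"
bs = BeautifulSoup(doc, "html.parser")

# All four navigation paths resolve to the same <h1> tag
assert bs.h1 == bs.html.body.h1 == bs.body.h1 == bs.html.h1
print(bs.h1)  # <h1>An Interesting Title</h1>
```

Intermediate tags like html and body can be skipped because BeautifulSoup returns the first matching descendant at any depth.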
In [5]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
When writing scrapers, it is important to think about the overall structure of your code so that it both catches exceptions and stays easy to read. Having a general-purpose function like getTitle, with thorough exception handling built in, makes fast, reliable web scraping much simpler.
Nearly every website uses Cascading Style Sheets (CSS). CSS lets HTML elements be differentiated, so that elements which would otherwise be marked up identically can be styled in distinct ways. For example, some tags might look like this:
<span class="green"></span>
while others look like this:
<span class="red"></span>
A scraper can easily tell these two kinds of tags apart by the value of their class attribute; for example, you could use BeautifulSoup to grab all of the red text on a page while leaving the green text untouched.
Next we will build a scraper for the page http://www.pythonscraping.com/pages/warandpeace.html. On this page, the characters' spoken dialogue is red and the character names are green, as you can see from the span tags in the page source:
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first to arrive at her
reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
<span class="green">St. Petersburg</span>, used only by the elite.
First we grab the whole page, then use the findAll function to extract only the text wrapped in <span class="green"></span> tags:
In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())
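The cell above depends on a live request. The same class-based filtering can be checked offline; the sketch below uses hypothetical inline markup that mimics the page's green/red span convention:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the warandpeace.html color convention
doc = ('<span class="green">Anna Pavlovna</span> said, '
      '<span class="red">&quot;Well, Prince...&quot;</span> '
      '<span class="green">Prince Vasili</span>')
bs = BeautifulSoup(doc, "html.parser")

# Select only the green spans; the red dialogue is left untouched
greens = [tag.get_text() for tag in bs.findAll("span", {"class": "green"})]
print(greens)  # ['Anna Pavlovna', 'Prince Vasili']
```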
In [7]:
bsObj.find({'span'})
Out[7]:
In [8]:
bsObj.findAll({'span'})
Out[8]:
In [9]:
bsObj.find({'h1','h2'})
Out[9]:
In [10]:
bsObj.findAll({'h1','h2'})
Out[10]:
As you can see, find() returns only the first matching element, while findAll() returns every matching element.
In addition, there is a keyword argument that selects tags with a specified attribute. This overlaps with what findAll, as shown above, can already do; the redundancy is deliberate.
In [11]:
allText = bsObj.findAll(id='text')
print(allText[0].get_text())
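The find/findAll difference and the keyword-argument form can also be demonstrated offline, using a small hypothetical document in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two headings and an id'd div
doc = '<h1>Heading</h1><h2>Sub</h2><div id="text">Body text</div>'
bs = BeautifulSoup(doc, "html.parser")

# find() stops at the first match in document order
print(bs.find({"h1", "h2"}))          # <h1>Heading</h1>
# findAll() collects every match
print(len(bs.findAll({"h1", "h2"})))  # 2

# The keyword form selects by attribute, much like findAll("", {"id": "text"})
print(bs.find(id="text").get_text())  # Body text
```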
findAll locates tags by their name and attributes. When you instead need to find a tag by its position in the document, use the navigation properties below to collect tags vertically through the tree.
In [12]:
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
In [13]:
# child tags
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)
In [14]:
# sibling tags
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)
In [15]:
# parent tags
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
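The three navigation properties above all need the live page. The sketch below exercises them against a tiny hypothetical table standing in for the giftList structure on page3.html:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the giftList table
doc = ('<table id="giftList">'
      '<tr><th>Item Title</th><th>Cost</th></tr>'
      '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
      '</table>')
bs = BeautifulSoup(doc, "html.parser")
table = bs.find("table", {"id": "giftList"})

# .children yields direct children only (here, the two <tr> rows)
rows = list(table.children)
print(len(rows))  # 2

# .next_siblings yields everything after the header row, not the row itself
siblings = list(table.tr.next_siblings)
print(len(siblings))  # 1

# .previous_sibling steps back through the tree from a located tag
cost = bs.find("td", string="$15.00")
print(cost.previous_sibling.get_text())  # Vegetable Basket
```

Note that next_siblings excludes the tag it is called on, which is why selecting the header row first conveniently skips it.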
In [16]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
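Regular-expression matching on attribute values can likewise be verified offline. The sketch below uses hypothetical img tags mirroring the relative paths on page3.html:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical img tags: two gift images and one that should not match
doc = ('<img src="../img/gifts/img1.jpg">'
      '<img src="../img/gifts/img2.jpg">'
      '<img src="../img/logo.jpg">')
bs = BeautifulSoup(doc, "html.parser")

# A raw string keeps the backslashes literal; the pattern matches only gift images
images = bs.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
print([img["src"] for img in images])
# ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```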