By Terrill Yang (GitHub: https://github.com/yttty)
In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
In [2]:
print(bsObj)
In [3]:
print(bsObj.h1)
In [4]:
print(bsObj.title)
As you can see, the tag we extracted from the page is nested two layers deep inside the BeautifulSoup object. However, when we want the h1 tag, we can call it directly:
bsObj.h1
In fact, all of the following calls produce the same result:
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
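That equivalence can be checked without a network request. The sketch below parses a minimal, hypothetical inline document (not the exercise page itself) and confirms that every navigation path resolves to the same tag:

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for the exercise page
doc = "<html><head><title>t</title></head><body><h1>An Interesting Title</h1></body></html>"
bs = BeautifulSoup(doc, "html.parser")

# All four navigation paths resolve to the same <h1> tag
assert bs.h1 == bs.html.body.h1 == bs.body.h1 == bs.html.h1
print(bs.h1)  # <h1>An Interesting Title</h1>
```

Intermediate tags like html and body can be skipped because BeautifulSoup returns the first matching descendant at any depth.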
In [5]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/exercises/exercise1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
When writing scrapers, it is important to think about the overall structure of your code so that it both catches exceptions and stays easy to read. Having a general-purpose function like getTitle, with thorough exception handling built in, makes fast, reliable web scraping much simpler.
Nearly every website uses Cascading Style Sheets (CSS). CSS lets HTML elements be differentiated, so that elements which would otherwise be marked up identically can be styled in distinct ways. For example, some tags might look like this:
<span class="green"></span>
while others look like this:
<span class="red"></span>
A scraper can easily tell these two kinds of tags apart by the value of their class attribute; for example, you could use BeautifulSoup to grab all of the red text on a page while leaving the green text untouched.
Next we will build a scraper for the page http://www.pythonscraping.com/pages/warandpeace.html. On this page, the characters' spoken dialogue is red and the character names are green, as you can see from the span tags in the page source:
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first to arrive at her
reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
<span class="green">St. Petersburg</span>, used only by the elite.
First we grab the whole page, then use the findAll function to extract only the text wrapped in <span class="green"></span> tags:
In [6]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())
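The cell above depends on a live request. The same class-based filtering can be checked offline; the sketch below uses hypothetical inline markup that mimics the page's green/red span convention:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the warandpeace.html color convention
doc = ('<span class="green">Anna Pavlovna</span> said, '
      '<span class="red">&quot;Well, Prince...&quot;</span> '
      '<span class="green">Prince Vasili</span>')
bs = BeautifulSoup(doc, "html.parser")

# Select only the green spans; the red dialogue is left untouched
greens = [tag.get_text() for tag in bs.findAll("span", {"class": "green"})]
print(greens)  # ['Anna Pavlovna', 'Prince Vasili']
```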
In [7]:
bsObj.find({'span'})
Out[7]:
In [8]:
bsObj.findAll({'span'})
Out[8]:
In [9]:
bsObj.find({'h1','h2'})
Out[9]:
In [10]:
bsObj.findAll({'h1','h2'})
Out[10]:
As you can see, find() returns only the first matching element, while findAll() returns every matching element.
In addition, there is a keyword argument that selects tags with a specified attribute. This overlaps with what findAll, as shown above, can already do; the redundancy is deliberate.
In [11]:
allText = bsObj.findAll(id='text')
print(allText[0].get_text())
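The find/findAll difference and the keyword-argument form can also be demonstrated offline, using a small hypothetical document in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two headings and an id'd div
doc = '<h1>Heading</h1><h2>Sub</h2><div id="text">Body text</div>'
bs = BeautifulSoup(doc, "html.parser")

# find() stops at the first match in document order
print(bs.find({"h1", "h2"}))          # <h1>Heading</h1>
# findAll() collects every match
print(len(bs.findAll({"h1", "h2"})))  # 2

# The keyword form selects by attribute, much like findAll("", {"id": "text"})
print(bs.find(id="text").get_text())  # Body text
```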
findAll locates tags by their name and attributes. When you instead need to find a tag by its position in the document, use the navigation properties below to collect tags vertically through the tree.
In [12]:
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
In [13]:
# child tags
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)
In [14]:
# sibling tags
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)
In [15]:
# parent tags
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
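The three navigation properties above all need the live page. The sketch below exercises them against a tiny hypothetical table standing in for the giftList structure on page3.html:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the giftList table
doc = ('<table id="giftList">'
      '<tr><th>Item Title</th><th>Cost</th></tr>'
      '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
      '</table>')
bs = BeautifulSoup(doc, "html.parser")
table = bs.find("table", {"id": "giftList"})

# .children yields direct children only (here, the two <tr> rows)
rows = list(table.children)
print(len(rows))  # 2

# .next_siblings yields everything after the header row, not the row itself
siblings = list(table.tr.next_siblings)
print(len(siblings))  # 1

# .previous_sibling steps back through the tree from a located tag
cost = bs.find("td", string="$15.00")
print(cost.previous_sibling.get_text())  # Vegetable Basket
```

Note that next_siblings excludes the tag it is called on, which is why selecting the header row first conveniently skips it.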
In [16]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
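Regular-expression matching on attribute values can likewise be verified offline. The sketch below uses hypothetical img tags mirroring the relative paths on page3.html:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical img tags: two gift images and one that should not match
doc = ('<img src="../img/gifts/img1.jpg">'
      '<img src="../img/gifts/img2.jpg">'
      '<img src="../img/logo.jpg">')
bs = BeautifulSoup(doc, "html.parser")

# A raw string keeps the backslashes literal; the pattern matches only gift images
images = bs.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
print([img["src"] for img in images])
# ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```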