By liupengyuan[at]pku.edu.cn

Project: https://github.com/liupengyuan/

1. 什么是爬虫

简而言之，爬虫就是一段能够获取互联网信息（数据）的程序/工具。

一般需要通过抓取网页来获取互联网的信息与数据。

网页本身就是一个本文文件，只不过这个文本文件是由特定规则和符号标记的(HTML,超文本标记语言)，称为超文本文件，也可称为网页源代码。

这段文本经过浏览器的解析（各类图片视频等在此过程中从网页外部加载），就成为我们日常浏览的网页展示给我们的样子了。

因此，爬虫首先就是要获得网页这个文本文件，并进行解析。

如果所需数据就直接在网页文本文件中，则可直接得到；如果所需数据在网页外部，则通过解析的结果得到数据所在地址，将该数据下载。

2. 最简爬虫

这里我们以优秀的python第三方库requests为例，该库是“唯一一个非转基因的Python HTTP库，人类可以安全享用”。

其中HTTP(HyperText Transfer Protocol)是互联网上应用最为广泛的一种网络协议，超文本文件传输时，必须遵守这个协议。

非转基因是作者幽默的说法，表明该库是原汁原味符合python设计理念，易用易于理解。

输入以下代码，并观察执行结果。



In [1]:

    
import requests
from bs4 import BeautifulSoup
import re

导入必要的模块



In [2]:

    
r = requests.get('https://github.com/liupengyuan/python_tutorial/blob/master/chapter1/0.md')
print(r.text[:1000])









    








<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://assets-cdn.github.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">



  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-c9193575f18b28be82c0a963e144ff6fa7a809dd8ae003a1d1e5979bed3f7f00.css" media="all" rel="stylesheet" />
  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-1214f71e8309b7d4dab489c2765c4de1523517be4c68b861a78d93c8e840648e.css" media="all" rel="stylesheet" />
  
  
  <link crossorigin="anonymous"

import requests，首先导入requests库
r = requests.get('https://github.com/liupengyuan/python_tutorial/blob/master/chapter1/0.md')，调用requests的get方法，向网址https://github.com/liupengyuan/python_tutorial/blob/master/chapter1/0.md 发送获取(get)请求，该方法返回一个应答(Response)对象r，其中包含该网页的所有信息。
print(r.text[:1000])，利用对象r的text属性字符串，打印网页内容，因为内容较多，暂取前1000个字符。
打开https://github.com/liupengyuan/python_tutorial/blob/master/chapter1/0.md网页，在网页正文区域点击右键，选择查看源代码，会发现代码示例D-1确实已经获取/抓取这个网页的文本文件。
至此，一个最简爬虫已经完成。

3. 静态定向爬虫基础

本小节中的爬虫是开始就指定了要抓取网页的地址（定向），要抓取的内容直接存在在网页源代码中（静态），因此称为静态定向爬虫。
由于需要用到Chrome浏览器中的检查功能，因此需要读者下载安装Chrome浏览器。

3.1 抓取网易首页(www.163.com)上的全部新闻

我们打开网易首页，然后用鼠标右键点击页面，用chrome浏览器可选择检查，进入开发者模式。
发者模式中，右侧为开发者工具面板，右上部的Elements标签页面中显示一片代码，是该网页html标记的文本（经过整理对齐的）。
推荐教程：http://www.w3school.com.cn/html/index.asp ，快速浏览，可了解一些html基础知识。



In [3]:

    
r = requests.get('http://www.163.com/')
print(r.text[:1000])









    



 <!DOCTYPE HTML>
<!--[if IE 6 ]> <html class="ne_ua_ie6 ne_ua_ielte8" id="ne_wrap"> <![endif]-->
<!--[if IE 7 ]> <html class="ne_ua_ie7 ne_ua_ielte8" id="ne_wrap"> <![endif]-->
<!--[if IE 8 ]> <html class="ne_ua_ie8 ne_ua_ielte8" id="ne_wrap"> <![endif]-->
<!--[if IE 9 ]> <html class="ne_ua_ie9" id="ne_wrap"> <![endif]-->
<!--[if (gte IE 10)|!(IE)]><!--> <html phone="1" id="ne_wrap"> <!--<![endif]-->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gbk">
<meta name="google-site-verification" content="PXunD38D6Oui1T44OkAPSLyQtFUloFi5plez040mUOc" />
<meta name="baidu-site-verification" content="oiT8OEfzes" />
<meta name="360-site-verification" content="527ad00f66a93c31134d6a20b2246950" />
<meta name="shenma-site-verification" content="12c2d7067c72735f0bd75c8dcd26b0d8_1509937417"/>
<meta name="sogou_site_verification" content="tCLG1xJc76"/>
<meta name="model_url" content="http://www.163.com/special/0077rt/index.html" />
<title>网易</title>
<base target="_blank" />
<meta na

用鼠标右键在新闻标题上点击，并选择检查，可在Elements标签页面找到该新闻标题以及对应的链接在html代码中的位置。

代码大致如下形式：

<li class="cm_fb">
<a href="http://news.163.com/xxxxxx.html">yyyyyyyyy</a>
::after
</li>

其中xxxxx及yyyyyyy随新闻标题不同而不同。前者是新闻链接，后者新闻是标题，因此这两项是我们感兴趣并希望抓取的内容
在Elements标签页面中继续查看各个新闻标题，可以发现各个新闻其实都在标签对<li>及</li>之间，且没有其间没有换行符。



In [4]:

    
p = r'<li>(.+)?</li>'
contents = re.findall(p, r.text)
print('\n'.join(contents[:5]))









    



<a href="http://you.163.com/item/recommend?from=web_fc_menhu_xinrukou_4"><span>999+人气好评品</span></a>
<a href="http://news.163.com/17/1204/20/D4RAVK1N000189FH.html">互联网大会蓝皮书发布 中国数字经济规模全球第二</a>
<a href="http://news.163.com/17/1204/09/D4Q8CECJ000189FH.html">习近平的这几句话 引发外国政党领导人的强烈共鸣</a>
<a href="http://news.cri.cn/special/2017hlwdhfpm.html">互联网精准扶贫</a> <a target="_blank" href="http://www.chinanews.com/m/shipin/spfts/20171202/1277.shtml">大连接时代：创新 智能 变革</a>
<a href="http://news.163.com/17/1204/11/D4QD03RS000189FH.html">上海:对标全球最高 开放之风劲吹</a> <a target="_blank" href="http://gov.163.com/special/S1509677294216/">领航新征程</a>

p = r'<li>(.+)?</li>' 构造了正则表达式，可提取<li>...</li>之间的内容(不包括之间的换行符)
re.findall(p, r.text) 将r.text中所有的符合上正则的提取出来放入列表
最终打印前50行，可以发现后面有一些并非新闻标题与对应链接，并非所有在<li>...</li>之间的内容都是目标内容
有关正则表达式，可参考教程：https://github.com/liupengyuan/python_tutorial/blob/master/chapter3 中的正则表达式快速教程。可以通过构造一些正则表达式来完成网页解析将目标内容提取。
这里我们将介绍一个优秀的第三方网页解析包：Beautifulsoap。该包已经随Anaconda安装，名称为bs4。可以用这个包(必要时联合正则表达式)来进行网页解析。
该包已经通过from bs4 import BeautifulSoup在开始导入



In [5]:

    
soup = BeautifulSoup(r.text, 'html.parser')
contents = soup.find_all('ul', attrs='cm_ul_round')
print('type is:', type(contents))
contents[:1]









    



type is: <class 'bs4.element.ResultSet'>






    Out[5]:





[<ul class="cm_ul_round">
 <li class="cm_fb"><a href="http://news.163.com/17/1204/20/D4RAQ8GD000189FH.html">关于互联网，习近平给出的20条重要论断</a></li>
 <li><a href="http://news.163.com/17/1204/20/D4RAVK1N000189FH.html">互联网大会蓝皮书发布 中国数字经济规模全球第二</a></li>
 <li><a href="http://news.163.com/17/1204/09/D4Q8CECJ000189FH.html">习近平的这几句话 引发外国政党领导人的强烈共鸣</a></li>
 <li><a href="http://news.cri.cn/special/2017hlwdhfpm.html">互联网精准扶贫</a> <a href="http://www.chinanews.com/m/shipin/spfts/20171202/1277.shtml" target="_blank">大连接时代：创新 智能 变革</a></li>
 <li><a href="http://news.163.com/17/1204/11/D4QD03RS000189FH.html">上海:对标全球最高 开放之风劲吹</a> <a href="http://gov.163.com/special/S1509677294216/" target="_blank">领航新征程</a></li>
 </ul>]

函数BeautifulSoup(r.text, 'html.parser')，返回一个BeautifulSoup对象。第一个参数为要解析的网页文本，第二个参数是解析器的选择，暂选择python内置的'html.parser'作为解析器
soup.find_all('ul', attrs='cm_ul_round')，BeautifulSoup对象的方法，可返回指定标签之间的所有内容的ResultSet对象。其中第一个参数为标签，第二个参数为标签的属性。根据新闻标题，在Elements标签页面稍加分析可知：新闻标题及链接均在标签对<ul class="cm_ul_round">及</ul>之间，且其他非新闻标题的内容，均不在这种规定了class的<ul>标签之间



In [6]:

    
for line in contents:
    url_titles = line.find_all('a')
    for url_title in url_titles:
        url = url_title.get('href')
        title = url_title.string
        print(type(url_title), url, title)
    if input()=='b':
        break









    



<class 'bs4.element.Tag'> http://news.163.com/17/1204/20/D4RAQ8GD000189FH.html 关于互联网，习近平给出的20条重要论断
<class 'bs4.element.Tag'> http://news.163.com/17/1204/20/D4RAVK1N000189FH.html 互联网大会蓝皮书发布 中国数字经济规模全球第二
<class 'bs4.element.Tag'> http://news.163.com/17/1204/09/D4Q8CECJ000189FH.html 习近平的这几句话 引发外国政党领导人的强烈共鸣
<class 'bs4.element.Tag'> http://news.cri.cn/special/2017hlwdhfpm.html 互联网精准扶贫
<class 'bs4.element.Tag'> http://www.chinanews.com/m/shipin/spfts/20171202/1277.shtml 大连接时代：创新 智能 变革
<class 'bs4.element.Tag'> http://news.163.com/17/1204/11/D4QD03RS000189FH.html 上海:对标全球最高 开放之风劲吹
<class 'bs4.element.Tag'> http://gov.163.com/special/S1509677294216/ 领航新征程
b

可对ResultSet进行遍历，利用line.find_all('a')取得其中在<a>及</a>标签对(该标签对表示超链接)中间的内容，url_titles的内容仍然是一个ResultSet对象，含有所有的url及title
对url_titles进行遍历，每一个对象均为Tag对象，利用Tag对象的get('href')方法，该方法可以取得标签的属性值，本例中参数为属性href，其值为超链接的URL
利用标签的string属性，取得该超链接的文本



In [7]:

    
url_title_dict = {}
for line in contents:
    url_titles = line.find_all('a')
    for url_title in url_titles:
        url = url_title.get('href')
        title = url_title.string
        if url and url.endswith(r'.html'):
            try:
                url_title_dict[url] = title
            except KeyError:
                continue

这里暂只需要得到首页的标题及对应的新闻页面链接，因此只抽取链接结尾为.html的条目
所有结果存入到词典url_title_dict中

我们还要获取各链接下的新闻文本



In [ ]:

    
title_text_dict = {}
for url, title in url_title_dict.items():
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    html = soup.find('div', attrs = 'post_text')
    if html:
        passages = html.find_all('p')
        text = ''
        for passage in passages:
            if passage.string:
                text += passage.string

        title_text_dict[url] = title, text

通过点击新闻页面，可以发现，所有正文在<div class = 'post_text>...</div>中
通过soup对象的find()方法，可以得到上述标签中的内容，存放在html变量中
内容是由<p>...</p>标签对之间(p即passage)，因此需要利用find_all()方法提取之中所有的段落，放入passages中
遍历passages，其中每一个passage均是Tag对象，利用Tag对象的string方法，得到该标签对应的文本

至此，形成了一个词典，key为新闻的url，value为对应的标题与正文，可将上述程序整合，并将抓取内容保存到文件中



In [1]:

    
import requests
from bs4 import BeautifulSoup
import re

def get_163_home_news(filename):
    try:
        r = requests.get(r'http://www.163.com')
    except:
        print('Can not get the page.\n')
        return False
    

    soup = BeautifulSoup(r.text, 'html.parser')
    contents = soup.find_all('ul', attrs='cm_ul_round')
    
    error_url = []
    f_err = open('error_urls', 'w', encoding = 'utf-8')
    fh = open(filename, 'w', encoding = 'utf-8')
    for line in contents:
        url_titles = line.find_all('a')
        for url_title in url_titles:
            url = url_title.get('href')
            title = url_title.string
            if url and url.endswith(r'.html'):
                try:
                    r = requests.get(url)
                except:
                    print('Error in getting:', url)
                    error_url.append(url)
                    continue
                soup = BeautifulSoup(r.text, 'html.parser')
                html = soup.find('div', attrs = 'post_text')
                if html:
                    passages = html.find_all('p')
                    text = ''
                    for passage in passages:
                        if passage.string:
                            text += passage.string
                    fh.write('{}\n{}\n{}\n'.format(url, title, text))
    
    f_err.write('\n'.join(error_url))
    fh.close()
    f_err.close()
    return True



In [2]:

    
filename = r'news_163_home.txt'
get_163_home_news(filename)









    Out[2]:





True

3.2 抓取有道词典查询词的双语例句

分析：

通过在有道词典(dict.youdao.com)进行单词查询，发现对任意单词xxxxx，给出该单词翻译信息的页面URL为：http://dict.youdao.com/w/xxxxx
由于我要抓取词xxxxx中查询结果的所有双语例句，则需要在页面下部的显示最后一个例句后，点击更多双语例句，这会更新当前页面的URL为：http://dict.youdao.com/example/blng/eng/xxxxx/#keyfrom=dict.main.moreblng 。所有双语例句均在其中，因此，我们只对这个URL进行爬取分析即可
输入一个中文词汇进行查询，选中第一个双语例句，右键点击并选择检查，进入chrome开发者模式，发现所有例句均在<ul class='ol'>...</ul>标签对之间。
每个例句均在<p>...</p>标签对之间



In [ ]:

    
import requests
from bs4 import BeautifulSoup

def get_word_sents(word):
    url = r'http://dict.youdao.com/example/blng/eng/{}/#keyfrom=dict.main.moreblng'.format(word)
    try:
        r = requests.get(url)
    except:
        print('Can not get the page.\n')
        return False
    
    word_sents = []
    soup = BeautifulSoup(r.text, 'html.parser')
    contents = soup.find('ul', attrs = 'ol')
    sents = contents.find_all('p')
    for sent in sents:
        word_sents.append(sent.text.replace('\n','')+'\n')
    
    return word_sents

整个过程与网易新闻抓取的方法类似
word_sents.append(sent.text.replace('\n','')+'\n')，是为了写入文件时候较为整齐



In [ ]:

    
import os
test_word = '爬虫'
sents = get_word_sents(test_word)
with open(test_word+'.txt', 'w', encoding = 'utf-8') as f:
    f.writelines(sents)

4. 动态页面的抓取(以后)

5. Scrapy爬虫框架(以后)