Developing a Web Crawler with Python 3

By Terrill Yang (Github: https://github.com/yttty)

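Compiled from the Zhihu column post "你需要这些：Python3.x爬虫学习资料整理" (What You Need: A Collection of Python 3.x Crawler Learning Resources).

This chapter is taken from "零基础自学用Python 3开发网络爬虫(二)" (Teaching Yourself to Develop a Web Crawler with Python 3 from Scratch, Part 2): a brief introduction to the data structures used, plus crawler Ver1.0 alpha.

Developing a Web Crawler with Python 3 - Chapter 02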

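Last time, we learned how to

sketch the main framework of a crawler in pseudocode;
fetch the page at a given url with Python's urllib.request library;
convert an ordinary string into a url-safe string with Python's urllib.parse library.

This time, we implement every part of the pseudocode in Python. Since the title of this series says "from scratch", I first introduce the two data structures we need, the queue and the set. For regular expressions, I list a few references I like and will add more when I have time.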

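1. Queues in Python

The crawler uses the breadth-first search (BFS) algorithm, and the data structure behind BFS is the queue.

A Python list can already do everything a queue needs: append() adds an element at the tail, indexing reads the head element, and pop(0) removes it. But a list makes an inefficient queue: appending at the tail is fast, while pop(0) (or insert(0, x)) at the head is slow, because every remaining element has to be shifted by one position. The official Python documentation recommends collections.deque for an efficient queue.

(The following example is taken from the official documentation.)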



In [24]:

    
from collections import deque
queue = deque(["Eric", "John", "Michael"])
queue.append("Terry")           # Terry joins the queue
queue.append("Graham")          # Graham joins the queue



In [25]:

    
queue.pop()                     # remove the element at the tail









    Out[25]:





'Graham'



In [26]:

    
queue.popleft()                 # remove the element at the head









    Out[26]:





'Eric'



In [27]:

    
queue                           # elements remaining in the queue









    Out[27]:





deque(['John', 'Michael', 'Terry'])
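
To tie the queue back to breadth-first search, here is a minimal sketch of BFS over a small in-memory graph. The graph, its node names, and the function name bfs_order are made up for illustration; in the crawler, the nodes would be urls and the edges would be links.

```python
from collections import deque

def bfs_order(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    queue = deque([start])
    seen = {start}                 # nodes that have already been enqueued
    order = []
    while queue:
        node = queue.popleft()     # take the oldest node: first in, first out
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

# Hypothetical link graph: each key links to the nodes in its list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_order(graph, "A"))       # ['A', 'B', 'C', 'D']
```

Because popleft() always returns the oldest entry, nodes closer to the start are visited before nodes farther away.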

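2. Sets in Python

To avoid crawling pages we have already crawled, the crawler keeps the url of every visited page in a set. Before crawling a url, we check whether it is already in the set. If it is, we skip the url; if not, we add it to the set and then crawl the page.

Python provides the set data structure for this. A set is an unordered collection with no duplicate elements. It is typically used to test whether an element is already present, or to remove duplicates from a collection. As in mathematical set theory, it supports intersection, union, difference, and symmetric difference.

A set can be created with the set() function or with curly braces {}. An empty set, however, cannot be written as a bare pair of braces, because {} creates an empty dictionary; use set() instead. The examples below also come from the official Python documentation.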



In [28]:

    
basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}



In [29]:

    
print(basket)                      # duplicates have been removed









    



{'banana', 'pear', 'apple', 'orange'}



In [30]:

    
print('orange in basket? ', 'orange' in basket)                # fast membership test
print('crabgrass in basket? ', 'crabgrass' in basket)









    



orange in basket?  True
crabgrass in basket?  False



In [31]:

    
# The following demonstrates operations between two sets.
a = set('abracadabra')
b = set('alacazam')
print(a)
print(b)









    



{'d', 'a', 'b', 'r', 'c'}
{'a', 'z', 'c', 'l', 'm'}



In [32]:

    
print(a & b)  # intersection









    



{'a', 'c'}



In [33]:

    
print(a | b)  # union









    



{'b', 'r', 'm', 'c', 'l', 'a', 'd', 'z'}



In [34]:

    
print(a - b)  # difference









    



{'d', 'b', 'r'}



In [35]:

    
print(a ^ b)  # symmetric difference









    



{'b', 'r', 'm', 'l', 'd', 'z'}

Our crawler uses only the fast membership test and the union operation.
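
The check-then-add pattern described at the start of this section can be sketched as follows. The function name first_visits and the example urls are hypothetical, and the place where a real crawler would fetch the page is only marked by a comment.

```python
def first_visits(urls):
    """Return urls in order, skipping any url that was already seen."""
    visited = set()
    order = []
    for url in urls:
        if url in visited:         # fast membership test
            continue               # already crawled: skip this url
        visited |= {url}           # union with a one-element set; same effect as visited.add(url)
        order.append(url)          # a real crawler would fetch the page here
    return order

# Hypothetical urls, with one duplicate.
print(first_visits(["http://a.example/", "http://b.example/", "http://a.example/"]))
# ['http://a.example/', 'http://b.example/']
```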

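3. Regular expressions in Python

The data the crawler fetches is a string containing the page's HTML. We need to extract every url mentioned on the page from that string. This requires some basic string processing, and regular expressions handle the job easily.

References

Regular expressions are extremely powerful, and many rules used in practice are quite clever, so real fluency takes a good deal of practice. For our purposes, though, it is enough to know how to find all the urls in a string with a regular expression. If you would rather skip this part altogether, plenty of ready-made url-matching expressions can be found online and used as they are.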
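
As a rough illustration of "finding all the urls in a string", the sketch below runs a simple href pattern over a small piece of HTML. The HTML content and the example.com urls are invented for the demonstration, and the pattern only handles double-quoted href attributes; it is not a general HTML parser.

```python
import re

# Hypothetical page content, standing in for HTML fetched by the crawler.
html = (
    '<a href="http://example.com/a">A</a>\n'
    '<a href="/relative/path">B</a>\n'
    '<link href="https://example.com/style.css">\n'
)

link_re = re.compile(r'href="(.+?)"')   # capture the text between the quotes
links = link_re.findall(html)           # one entry per href attribute
absolute = [x for x in links if x.startswith("http")]

print(links)      # ['http://example.com/a', '/relative/path', 'https://example.com/style.css']
print(absolute)   # ['http://example.com/a', 'https://example.com/style.css']
```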

An introduction to regular expressions

Courtesy of AstralWind - cnblogs

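Regular expressions are not part of Python itself. They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than the built-in str methods, but they are far more capable. Because of this independence, regular expression syntax is largely the same across the languages that provide it; what differs is how much of the syntax each implementation supports. There is no need to worry, though: unsupported syntax is usually the rarely used part. If you have used regular expressions in another language, a quick read-through is enough to get started.

Roughly, matching works like this: characters are taken from the expression and from the text in turn and compared. If every character matches, the match succeeds; as soon as one character fails to match, the match fails. When the expression contains quantifiers or boundaries, the process differs slightly, but it is still easy to understand; the examples in the original article's figure and a little hands-on practice make it clear.

Regular expressions are usually used to find matching strings in text. In Python, quantifiers are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy "ab*?" finds "a".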
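
The greedy and non-greedy behaviour can be checked directly. The first half of this sketch reproduces the "ab*" example from the text; the second half applies the same idea to a made-up line of HTML, to show why a crawler would normally want the non-greedy form.

```python
import re

text = "abbbc"
greedy = re.search(r"ab*", text).group()    # b* takes as many b's as it can
lazy = re.search(r"ab*?", text).group()     # b*? takes as few b's as it can (zero)
print(greedy)   # abbb
print(lazy)     # a

# Hypothetical HTML snippet with two links on one line.
html = '<a href="a.html">A</a> <a href="b.html">B</a>'
greedy_links = re.findall(r'href="(.+)"', html)    # greedy: runs on to the last quote
lazy_links = re.findall(r'href="(.+?)"', html)     # non-greedy: stops at the next quote
print(greedy_links)   # ['a.html">A</a> <a href="b.html']
print(lazy_links)     # ['a.html', 'b.html']
```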

The original article includes a figure here that lists the regular expression metacharacters and syntax supported by Python.

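Getting started with re

Python supports regular expressions through the re module. The usual workflow is to compile the string form of a regular expression into a Pattern instance, use the Pattern instance to process text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry out further operations.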



In [36]:

    
import re
 
# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')
 
# Match the text with the Pattern; the result is None when there is no match
match = pattern.match('hello world!')
 
if match:
    # Use the Match to get group information
    print(match.group())









    



hello

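The re module also provides a number of module-level functions that do the same work. Each one has a corresponding Pattern method; their only advantage is saving the re.compile() line, at the cost of not being able to reuse the compiled Pattern object. These functions are described together with the Pattern instance methods below. The example above can be shortened to: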



In [37]:

    
m = re.match(r'hello', 'hello world!')
print(m.group())









    



hello

Compile

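re.compile(strPattern[, flag]): This is the factory function for Pattern objects; it compiles a regular expression given as a string into a Pattern object. The second argument, flag, is the matching mode; several modes can be combined with the bitwise OR operator '|', for example re.I | re.M. You can also specify the mode inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').

The available flags are:

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, likewise below)
re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
re.S (re.DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag can only be used with bytes patterns)
re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)
re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent: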



In [38]:

    
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
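
The effect of the flags is easiest to see side by side. The sketch below is only an illustration with a made-up two-line string; it compares the same pattern with no flags, with re.I | re.M, and with the inline form (?im), and then shows re.S.

```python
import re

text = "Hello\nhello world"

no_flags = re.findall(r"^hello", text)                 # case-sensitive, '^' only at the very start
with_flags = re.findall(r"^hello", text, re.I | re.M)  # ignore case, '^' at every line start
inline = re.findall(r"(?im)^hello", text)              # same flags written inside the pattern

print(no_flags)     # []
print(with_flags)   # ['Hello', 'hello']
print(inline)       # ['Hello', 'hello']

# re.S lets '.' match a newline as well.
plain_dot = re.match(r"a.b", "a\nb")
dotall = re.match(r"a.b", "a\nb", re.S)
print(plain_dot)            # None
print(dotall is not None)   # True
```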

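The re module also provides the function escape(string), which returns string with a backslash inserted before regular expression metacharacters such as *, +, and ?. It comes in handy when you need to match text that contains many metacharacters.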
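
A small sketch of what re.escape() is for, using an invented literal string. The exact escaped string differs slightly between Python versions, so the example checks the matching behaviour and does not print the escaped form.

```python
import re

literal = "1+1=2?"                       # contains the metacharacters + and ?
haystack = "is 1+1=2? yes"

# Unescaped, '+' and '?' act as quantifiers, so the literal text is not found.
unescaped_hit = re.search(literal, haystack)
print(unescaped_hit)                     # None

# Escaped, every metacharacter is matched literally.
m = re.search(re.escape(literal), haystack)
print(m.group())                         # 1+1=2?
```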

Match

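A Match object is the result of one match and carries a lot of information about it; the read-only attributes and methods of Match give access to that information.

Attributes
- string: the text used for matching.
- re: the Pattern object used for matching.
- pos: the index in the text at which the regular expression started searching. Same value as the pos argument of Pattern.match() and Pattern.search().
- endpos: the index in the text at which the regular expression stopped searching. Same value as the endpos argument of Pattern.match() and Pattern.search().
- lastindex: the group number of the last captured group. None if no group was captured.
- lastgroup: the name of the last captured group. None if that group has no name or no group was captured.
Methods
- group([group1, …]): returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a number or a name; number 0 stands for the whole matched substring; with no arguments, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns the last captured substring.
- groups([default]): returns the strings captured by all groups as a tuple, equivalent to calling group(1,2,…last). default is the value used for groups that captured nothing; it defaults to None.
- groupdict([default]): returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included. default has the same meaning as above.
- start([group]): returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
- end([group]): returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
- span([group]): returns (start(group), end(group)).
- expand(template): substitutes the captured groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>; \0 cannot be used for the whole match (write \g<0> instead). \id and \g<id> are equivalent, but \10 is read as group 10, so to express \1 followed by the character '0' you must write \g<1>0.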



In [39]:

    
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
 
print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
 
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))









    



m.string: hello world!
m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
m.pos: 0
m.endpos: 12
m.lastindex: 3
m.lastgroup: sign
m.group(1,2): ('hello', 'world')
m.groups(): ('hello', 'world', '!')
m.groupdict(): {'sign': '!'}
m.start(2): 6
m.end(2): 11
m.span(2): (6, 11)
m.expand(r'\2 \1\3'): world hello!
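
The difference between \g<1>0 and \10 in a template is easy to trip over, so here is a minimal sketch with a one-group pattern chosen just for this illustration. It also shows \g<0> standing for the whole match.

```python
import re

m = re.match(r"(\d)", "7")

print(m.expand(r"\g<1>0"))    # group 1 followed by a literal '0': 70
print(m.expand(r"\g<0>"))     # the whole match: 7

error_seen = False
try:
    m.expand(r"\10")          # read as group 10, which this pattern does not have
except re.error as exc:
    error_seen = True
    print("error:", exc)
```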

Pattern

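A Pattern object is a compiled regular expression; the methods Pattern provides let you match and search text.

Pattern cannot be instantiated directly; it must be constructed with re.compile().

Pattern provides several read-only attributes for retrieving information about the expression:

pattern: the expression string used at compile time.
flags: the matching mode used at compile time, as a number.
groups: the number of groups in the expression.
groupindex: a dictionary whose keys are the names of the named groups in the expression and whose values are the corresponding group numbers; unnamed groups are not included.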



In [40]:

    
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
 
print("p.pattern:", p.pattern)
print("p.flags:", p.flags)
print("p.groups:", p.groups)
print("p.groupindex:", p.groupindex)









    



p.pattern: (\w+) (\w+)(?P<sign>.*)
p.flags: 48
p.groups: 3
p.groupindex: {'sign': 3}

re module methods

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags])

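This method tries to match pattern starting at index pos of string. If the end of pattern is reached with everything matched, it returns a Match object; if pattern fails to match along the way, or endpos is reached before the match is complete, it returns None.

pos and endpos default to 0 and len(string) respectively; re.match() cannot take these two arguments, and its flags argument specifies the matching mode used to compile pattern.

Note: this method does not require a full match. If string still has characters left when pattern ends, the match still counts as successful. For a full match, add the boundary anchor '$' to the end of the expression, or use fullmatch() (Python 3.4 and later).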



In [41]:

    
pattern = re.compile(r'hello')
# Match the text with the Pattern; the result is None when there is no match
match = pattern.match('hello world!')
if match:
    print(match.group())









    



hello
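
To make the "not a full match" point concrete, the following sketch compares match(), match() with a trailing '$', and fullmatch() on the same text. It uses the module-level functions for brevity; the Pattern methods behave the same way.

```python
import re

text = "hello world!"

prefix_ok = re.match(r"hello", text) is not None           # the prefix matches, the rest is ignored
anchored = re.match(r"hello$", text) is not None           # '$' requires the string to end after "hello"
full_short = re.fullmatch(r"hello", text) is not None      # fullmatch must consume the whole string
full_exact = re.fullmatch(r"hello world!", text) is not None

print(prefix_ok, anchored, full_short, full_exact)   # True False False True
```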

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags])

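This method looks for a substring of string that matches. It tries to match pattern starting at index pos of string; if the end of pattern is reached with everything matched, it returns a Match object; if not, it adds 1 to pos and tries again; if there is still no match when pos=endpos, it returns None.

pos and endpos default to 0 and len(string) respectively; re.search() cannot take these two arguments, and its flags argument specifies the matching mode used to compile pattern.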



In [42]:

    
p = re.compile(r'world') 
# search() looks for a matching substring; the result is None if there is none
# In this example match() cannot succeed
match1 = p.search('hello world!') 
match2 = p.match('hello world!') 
if match1: 
    print('pattern.search result: ', match1.group())
if match2: 
    print('pattern.match result: ', match2.group())









    



pattern.search result:  world
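
The pos and endpos arguments are only available on the Pattern methods. This short sketch, reusing the same text as above, shows how they move the start and end of the region being searched.

```python
import re

p = re.compile(r"world")
text = "hello world!"                    # "world" occupies indices 6 to 10

whole = p.search(text).span()            # search the whole string
from_six = p.match(text, 6).span()       # match() succeeds when started at pos=6
late_start = p.search(text, 7)           # the search starts after the 'w'
early_end = p.search(text, 0, 10)        # endpos=10 cuts off the final 'd'

print(whole)        # (6, 11)
print(from_six)     # (6, 11)
print(late_start)   # None
print(early_end)    # None
```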

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit])

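Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if omitted, the string is split at every match.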



In [43]:

    
p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))









    



['one', 'two', 'three', 'four', '']
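
Two variations on the example above, as a quick sketch: limiting the number of splits with maxsplit, and using a capturing group, in which case the separators are kept in the result.

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

limited = p.split(text, maxsplit=2)            # stop after two splits
with_seps = re.split(r"(\d+)", "one1two2")     # a capturing group keeps the separators

print(limited)      # ['one', 'two', 'three3four4']
print(with_seps)    # ['one', '1', 'two', '2', '']
```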

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags])

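Searches string and returns all matching substrings as a list. If the pattern contains capturing groups, the list holds the captured groups instead of the whole matches.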



In [44]:

    
p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))









    



['1', '2', '3', '4']
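
What findall() returns depends on how many capturing groups the pattern has, which matters for the crawler because its link pattern uses one group. A small sketch with a made-up string:

```python
import re

text = "width=80, height=24"

no_group = re.findall(r"\w+=\d+", text)        # no groups: whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # one group: just that group
two_groups = re.findall(r"(\w+)=(\d+)", text)  # several groups: tuples

print(no_group)     # ['width=80', 'height=24']
print(one_group)    # ['80', '24']
print(two_groups)   # [('width', '80'), ('height', '24')]
```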

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags])

Searches string and returns an iterator that yields each match result (a Match object) in order.



In [45]:

    
p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group())
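
Since finditer() yields Match objects, it can report where each match was found, which findall() cannot. A minimal sketch on the same input:

```python
import re

p = re.compile(r"\d+")
hits = [(m.group(), m.span()) for m in p.finditer("one1two2three3four4")]
print(hits)   # [('1', (3, 4)), ('2', (7, 8)), ('3', (13, 14)), ('4', (18, 19))]
```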

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count])

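Returns the string obtained by replacing every matching substring of string with repl.

When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>; \0 cannot be used for the whole match (write \g<0> instead).

When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).

count specifies the maximum number of replacements; if omitted, all matches are replaced.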



In [46]:

    
p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
 
print(p.sub(r'\2 \1', s))
 
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
 
print(p.sub(func, s))









    



say i, world hello!
I Say, Hello World!
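
A few more sub() variations, sketched with made-up inputs: limiting replacements with count, referring to named groups with \g<name>, and wrapping the whole match with \g<0>.

```python
import re

limited = re.sub(r"\d+", "#", "one1two2three3", count=2)   # replace only the first two matches

swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>",
                 "hello world")                            # named group references

wrapped = re.sub(r"\w+", r"[\g<0>]", "hello world")        # \g<0> is the whole match

print(limited)   # one#two#three3
print(swapped)   # world hello
print(wrapped)   # [hello] [world]
```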

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count])

Returns (sub(repl, string[, count]), number of replacements).



In [47]:

    
p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
 
print(p.subn(r'\2 \1', s))
 
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
 
print(p.subn(func, s))









    



('say i, world hello!', 2)
('I Say, Hello World!', 2)

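4. Python web crawler Ver 1.0 alpha

With the groundwork above in place, we can finally write a real crawler. The entry point I chose is Fenng's Startup News. He has just raised 70 million US dollars in funding, so I trust he won't mind our crawlers paying his little site a visit. This crawler barely manages to run: it has no exception handling, can only crawl static pages, cannot tell static pages from dynamic ones, and does not know when a page should be skipped, so it gives up after working for a short while. While running, it prints "已经抓取: N   正在抓取 <---  url" (fetched so far: N, now fetching url) for each page it fetches and "加入队列 --->  url" (added to queue) for each link it enqueues.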



In [48]:

    
import re
import urllib.request
import urllib
 
from collections import deque
 
queue = deque()
visited = set()
 
url = 'http://news.dbanotes.net'  # entry page; can be replaced with another
 
queue.append(url)
cnt = 0
 
while queue:
  url = queue.popleft()  # remove the element at the head of the queue
  visited |= {url}  # mark as visited
 
  print('已经抓取: ' + str(cnt) + '   正在抓取 <---  ' + url)
  cnt += 1
  urlop = urllib.request.urlopen(url)
  if 'html' not in urlop.getheader('Content-Type'):
    continue
 
  # Handle decoding errors with try/except so that the program does not abort
  try:
    data = urlop.read().decode('utf-8')
  except Exception:
    continue
 
  # Use a regular expression to extract every link on the page, check whether it has been visited, and add it to the crawl queue
  linkre = re.compile('href="(.+?)"')
  for x in linkre.findall(data):
    if 'http' in x and x not in visited:
      queue.append(x)
      print('加入队列 --->  ' + x)









    



已经抓取: 0   正在抓取 <---  http://news.dbanotes.net
加入队列 --->  http://news.dbanotes.net/news.css
加入队列 --->  http://dbanotes.net/favicon.ico
加入队列 --->  http://news.dbanotes.net/submit
加入队列 --->  https://blog.coding.net/blog/Introducing-Coding-Enterprise
加入队列 --->  http://tonybai.com/2017/01/11/understanding-linux-network-namespace-for-docker-network/
加入队列 --->  http://www.phpxs.com/post/5569/
加入队列 --->  http://weibo.com/ttarticle/p/show?id=2309404061974152529226
加入队列 --->  https://cp.ivpser.com/buylinode
加入队列 --->  http://cn.oncedoc.com/page/view/helper/ixfumwyocn01ai67
加入队列 --->  http://tonybai.com/2016/12/30/install-kubernetes-on-ubuntu-with-kubeadm/
加入队列 --->  http://tonybai.com/2016/12/27/when-docker-meets-systemd/
加入队列 --->  https://blog.coding.net/blog/coding-review-2016
加入队列 --->  https://laravel-china.org/topics/3432
加入队列 --->  http://cn.oncedoc.com/blog/view/ix4clpo5cn01ao0d
加入队列 --->  https://github.com/baidu/tera/
加入队列 --->  http://28930717.blog.hexun.com/109481607_d.html
加入队列 --->  https://www.evget.com/article/2016/12/22/25339.html
加入队列 --->  https://blog.coding.net/blog/spring-mvc-cors
加入队列 --->  https://zhuanlan.zhihu.com/p/23646569
加入队列 --->  http://gold.xitu.io/entry/58200ec367f3560058a6f8fc/
加入队列 --->  http://tonybai.com/2016/11/25/the-security-settings-for-kubernetes-cluster/
加入队列 --->  http://gold.xitu.io/entry/582492025bbb5000590ef04d/
加入队列 --->  http://gold.xitu.io/entry/58217b84570c350060bc40f8/
加入队列 --->  https://www.fdzh.org/blog/2016/11/22/fedora-25/
加入队列 --->  http://gold.xitu.io/entry/5826ef85570c3500586b241d/
加入队列 --->  http://www.phpxs.com/post/5537/
加入队列 --->  http://insights.thoughtworkers.org/frontend-future-radar/
加入队列 --->  https://lanmaowz.com/open-dht-spider/
加入队列 --->  https://blog.coding.net/blog/forget-crowdsourcing
加入队列 --->  https://github.com/vim/vim/blob/master/runtime/doc/version8.txt
加入队列 --->  http://gold.xitu.io/entry/57fc9ea40bd1d00058d170f9/
加入队列 --->  http://tonybai.com/2016/11/16/how-to-pull-images-from-private-registry-on-kubernetes-cluster/
加入队列 --->  http://gold.xitu.io/entry/5817117967f356005868b8a8/
加入队列 --->  http://gold.xitu.io/entry/58294b222f301e00585ae000/
加入队列 --->  https://github.com/baidu/bfs
加入队列 --->  http://dbanotes.net
加入队列 --->  http://news.dbanotes.net/rss
加入队列 --->  https://jobsdigg.com/
加入队列 --->  http://news.ycombinator.com/
加入队列 --->  https://github.com/nex3/arc/
加入队列 --->  http://arclanguage.org/forum
加入队列 --->  https://itunes.apple.com/us/app/id611072155
加入队列 --->  http://halzhang.github.com/StartupNews/
已经抓取: 1   正在抓取 <---  http://news.dbanotes.net/news.css
已经抓取: 2   正在抓取 <---  http://dbanotes.net/favicon.ico
已经抓取: 3   正在抓取 <---  http://news.dbanotes.net/submit
已经抓取: 4   正在抓取 <---  https://blog.coding.net/blog/Introducing-Coding-Enterprise
加入队列 --->  https://coding.net
加入队列 --->  https://e.coding.net
加入队列 --->  https://e.coding.net
加入队列 --->  https://dn-coding-net-production-pp.qbox.me/6ddf455a-07d4-4d6c-b124-7a242bac62c6.png
加入队列 --->  https://dn-coding-net-production-pp.qbox.me/7c10dede-183f-499f-a8b3-1505696f584c.png
加入队列 --->  https://dn-coding-net-production-pp.qbox.me/a6136ea9-7d81-490c-98ee-53850cc0da14.gif
加入队列 --->  https://e.coding.net/
加入队列 --->  https://e.coding.net
加入队列 --->  https://coding.net/u/coding
加入队列 --->  https://coding.net/u/Diking
加入队列 --->  https://coding.net/u/Diking
加入队列 --->  https://coding.net/u/Diking
加入队列 --->  https://coding.net/u/eazy
加入队列 --->  https://coding.net/u/eazy
加入队列 --->  https://coding.net/u/eazy
加入队列 --->  https://coding.net/u/hs_coding
加入队列 --->  https://coding.net/u/hs_coding
加入队列 --->  https://coding.net/u/hs_coding
加入队列 --->  https://coding.net/u/h_s
加入队列 --->  https://coding.net/u/h_s
加入队列 --->  https://coding.net/u/h_s
加入队列 --->  https://coding.net/login?return_url=https://coding.net/blog/Introducing-Coding-Enterprise
加入队列 --->  https://coding.net/git
加入队列 --->  https://coding.net/pm
加入队列 --->  https://coding.net/webide
加入队列 --->  https://coding.net/app
加入队列 --->  https://blog.coding.net/update/blogs
加入队列 --->  https://coding.net/about
加入队列 --->  https://coding.net/event
加入队列 --->  https://coding.net/shop
加入队列 --->  https://coding.net/jobs
加入队列 --->  https://coding.net/about#partners
加入队列 --->  https://coding.net/help
加入队列 --->  https://status.coding.net
加入队列 --->  https://coding.net/feedback
加入队列 --->  https://open.coding.net
加入队列 --->  https://blog.coding.net/
加入队列 --->  http://weibo.com/n/coding
加入队列 --->  https://zhuanlan.zhihu.com/coding-net
加入队列 --->  https://coding.net/privacy
加入队列 --->  https://coding.net/terms
加入队列 --->  https://coding.net/security
加入队列 --->  http://www.miitbeian.gov.cn/
已经抓取: 5   正在抓取 <---  http://tonybai.com/2017/01/11/understanding-linux-network-namespace-for-docker-network/
加入队列 --->  http://tonybai.com/favicon.ico
加入队列 --->  http://tonybai.com/feed/
加入队列 --->  http://tonybai.com/comments/feed/
加入队列 --->  http://tonybai.com/2017/01/11/understanding-linux-network-namespace-for-docker-network/feed/
加入队列 --->  http://tonybai.com/xmlrpc.php?rsd
加入队列 --->  http://tonybai.com/wp-includes/wlwmanifest.xml
加入队列 --->  http://tonybai.com
加入队列 --->  http://tonybai.com/about/
加入队列 --->  http://tonybai.com/articles/
加入队列 --->  http://tonybai.com/2017/01/11/understanding-linux-network-namespace-for-docker-network/#respond
加入队列 --->  http://tonybai.com/2017/01/03/2016-summary/
加入队列 --->  http://tonybai.com/tag/kubernetes
加入队列 --->  http://tonybai.com/2016/01/15/understanding-container-networking-on-single-host/
加入队列 --->  https://www.docker.com/
加入队列 --->  https://github.com/containernetworking/cni
加入队列 --->  https://github.com/docker/libnetwork/blob/master/docs/design.md
加入队列 --->  http://tonybai.com/2016/01/15/understanding-container-networking-on-single-host/
加入队列 --->  http://tonybai.com/2016/01/18/understanding-binding-docker-container-ports-to-host/
加入队列 --->  https://github.com/docker/libnetwork
加入队列 --->  https://en.wikipedia.org/wiki/Linux_namespaces#Network_.28net.29
加入队列 --->  https://openvz.org/Virtual_Ethernet_device
加入队列 --->  https://wiki.linuxfoundation.org/networking/bridge
加入队列 --->  https://en.wikipedia.org/wiki/Virtual_Extensible_LAN
加入队列 --->  https://wiki.linuxfoundation.org/networking/bridge
加入队列 --->  https://wiki.linuxfoundation.org/networking/iproute2
加入队列 --->  http://www.ibm.com/developerworks/cn/linux/1310_xiawc_networkdevice/
加入队列 --->  https://wiki.linuxfoundation.org/networking/iproute2
加入队列 --->  https://www.netfilter.org/
加入队列 --->  http://tonybai.com/2016/01/18/understanding-binding-docker-container-ports-to-host/
加入队列 --->  http://tonybai.com/2016/11/22/deploy-nginx-service-for-the-services-in-kubernetes-cluster/
加入队列 --->  http://nginx.org/
加入队列 --->  http://nginx.com/
加入队列 --->  http://tonybai.com/2016/12/27/when-docker-meets-systemd
加入队列 --->  https://book.douban.com/subject/26929989/
加入队列 --->  https://book.douban.com/subject/26631435/
加入队列 --->  http://tonybai.com
加入队列 --->  http://www.jiathis.com/share/
加入队列 --->  http://feed.tonybai.com
加入队列 --->  http://weibo.com/bigwhite20xx
加入队列 --->  http://weibo.com/bigwhite20xx
加入队列 --->  https://www.digitalocean.com/?refcode=bff6eed92687
加入队列 --->  https://www.linode.com/?r=ec42f83592f2ddfc8f487c0428a9b74fa1b2984b
加入队列 --->  http://cn.linkedin.com/in/bigwhite
加入队列 --->  http://iwobi.net
加入队列 --->  http://iwobi.net/download/FlickWorldCup.apk
加入队列 --->  http://iwobi.net/download/SlalomKing.apk
加入队列 --->  http://tonybai.com/2017/01/05/leave-hand-made-homework-to-kids/
加入队列 --->  http://tonybai.com/2017/01/03/2016-summary/
加入队列 --->  http://tonybai.com/2016/12/30/install-kubernetes-on-ubuntu-with-kubeadm/
加入队列 --->  http://tonybai.com/2016/12/27/when-docker-meets-systemd/
加入队列 --->  http://tonybai.com/2016/12/23/write-go-code-in-vscode/
加入队列 --->  http://tonybai.com/2016/12/21/how-to-use-timer-reset-in-golang-correctly/
加入队列 --->  http://tonybai.com/2016/12/18/build-a-blog-website-for-my-daughter/
加入队列 --->  http://tonybai.com/2016/12/06/an-intro-to-wukong-fulltext-search-engine/
加入队列 --->  http://tonybai.com/2016/11/25/the-security-settings-for-kubernetes-cluster/
加入队列 --->  http://tonybai.com/category/images-collection/
加入队列 --->  http://tonybai.com/category/media-square/
加入队列 --->  http://tonybai.com/category/thoughts-center/
加入队列 --->  http://tonybai.com/category/technical-notes/
加入队列 --->  http://tonybai.com/category/edu-notes/
加入队列 --->  http://tonybai.com/category/groceries-store/
加入队列 --->  http://tonybai.com/category/living-notes/
加入队列 --->  http://tonybai.com/category/career-notes/
加入队列 --->  http://tonybai.com/category/reading-bar/
加入队列 --->  http://tonybai.com/category/sports-fan/
加入队列 --->  http://tonybai.com/category/tourist-show/
加入队列 --->  http://daughter.tonybai.com/
加入队列 --->  http://douban.com/people/tony_bai
加入队列 --->  http://www.flickr.com/photos/bigwhite/
加入队列 --->  https://github.com/bigwhite
加入队列 --->  http://code.google.com/p/bigwhite-code/
加入队列 --->  http://picasaweb.google.com/bigwhite.cn
加入队列 --->  http://www.slideshare.net/bigwhite20xx
加入队列 --->  http://twitter.com/tony_bai
加入队列 --->  http://weibo.com/bigwhite20xx
加入队列 --->  http://www.hoterran.info/
加入队列 --->  http://leomessi.com
加入队列 --->  http://puras.cn
加入队列 --->  http://dreamhead.blogbus.com
加入队列 --->  http://84tt.com/blog
加入队列 --->  http://code.google.com/p/buildc
加入队列 --->  http://code.google.com/p/cbehave
加入队列 --->  http://code.google.com/p/lcut
加入队列 --->  http://code.google.com/p/recommended-c-style-and-coding-standards-cn/
加入队列 --->  http://code.google.com/p/programming-in-haskell-cn/
加入队列 --->  http://www4.clustrmaps.com/user/8d910698e
加入队列 --->  http://feed.feedsky.com/bigwhite
加入队列 --->  http://statcounter.com/wordpress.org/
加入队列 --->  http://statcounter.com/p7675050/?guest=1
加入队列 --->  http://www.prchecker.info/
加入队列 --->  http://tonybai.com
加入队列 --->  http://wordpress.org
加入队列 --->  http://pagecho.com
已经抓取: 6   正在抓取 <---  http://www.phpxs.com/post/5569/






    



---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-48-89680f898c25> in <module>()
     19   print('已经抓取: ' + str(cnt) + '   正在抓取 <---  ' + url)
     20   cnt += 1
---> 21   urlop = urllib.request.urlopen(url)
     22   if 'html' not in urlop.getheader('Content-Type'):
     23     continue

/usr/lib/python3.5/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    161     else:
    162         opener = _opener
--> 163     return opener.open(url, data, timeout)
    164 
    165 def install_opener(opener):

/usr/lib/python3.5/urllib/request.py in open(self, fullurl, data, timeout)
    470         for processor in self.process_response.get(protocol, []):
    471             meth = getattr(processor, meth_name)
--> 472             response = meth(req, response)
    473 
    474         return response

/usr/lib/python3.5/urllib/request.py in http_response(self, request, response)
    580         if not (200 <= code < 300):
    581             response = self.parent.error(
--> 582                 'http', request, response, code, msg, hdrs)
    583 
    584         return response

/usr/lib/python3.5/urllib/request.py in error(self, proto, *args)
    508         if http_err:
    509             args = (dict, 'default', 'http_error_default') + orig_args
--> 510             return self._call_chain(*args)
    511 
    512 # XXX probably also want an abstract factory that knows when it makes

/usr/lib/python3.5/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    442         for handler in handlers:
    443             func = getattr(handler, meth_name)
--> 444             result = func(*args)
    445             if result is not None:
    446                 return result

/usr/lib/python3.5/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    588 class HTTPDefaultErrorHandler(BaseHandler):
    589     def http_error_default(self, req, fp, code, msg, hdrs):
--> 590         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    591 
    592 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden
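
The run above stops at the first HTTP error because urlopen() is called outside any try block. The following is one possible sketch of a more forgiving loop, not the author's next version: fetch errors are caught and the page is skipped, links are marked as seen when they are enqueued so duplicates do not pile up, and a page limit stops the crawl. The names fetch_html, crawl, max_pages, and the example.com-style pages are all made up for illustration; the demonstration at the end uses a fake fetcher so it runs without network access.

```python
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(.+?)"')

def fetch_html(url):
    """Return the page text, or None if the page cannot be used."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            content_type = resp.getheader('Content-Type') or ''
            if 'html' not in content_type:
                return None
            return resp.read().decode('utf-8')
    except (OSError, ValueError):
        # HTTPError (such as 403 Forbidden) and URLError are OSError subclasses;
        # UnicodeDecodeError is a ValueError subclass.
        return None

def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl; returns the urls in the order they were fetched."""
    queue = deque([start_url])
    seen = {start_url}              # every url that was ever enqueued
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        data = fetch(url)
        if data is None:            # failed or non-HTML page: skip it
            continue
        for link in LINK_RE.findall(data):
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

# Offline demonstration with a fake fetcher over hypothetical pages.
pages = {
    'http://a.example/': '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    'http://b.example/': '<a href="http://a.example/">a</a> <a href="http://c.example/">c</a>',
}
print(crawl('http://a.example/', fetch=pages.get))
# ['http://a.example/', 'http://b.example/', 'http://c.example/']
```

Passing the fetch function as a parameter keeps the queue-and-set logic testable on its own; calling crawl('http://news.dbanotes.net') with the default fetcher would hit the real network.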