Python Standard Library



In [1]:

    
%matplotlib inline

import matplotlib.pyplot as plt

plt.style.use('ggplot')









    




collections



In [2]:

    
import collections

The collections module implements a number of high-performance, easy-to-use container types.

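defaultdict is a subclass of dict. When a key is missing, it calls the factory function you supplied to produce a default value. With a plain dict, looking up a missing key raises a KeyError: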



In [3]:

    
d = {'id': 1, 'content': 'hello world'}

d['author']









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-49a67eb1662d> in <module>()
      1 d = {'id': 1, 'content': 'hello world'}
      2 
----> 3 d['author']

KeyError: 'author'

With a defaultdict the lookup succeeds instead. Note that the lookup also inserts the missing key with its default value:



In [4]:

    
dd = collections.defaultdict(int)
dd.update(d)
dd['author']









    Out[4]:





0

defaultdict can also be used to define a tree structure, by making the factory the tree constructor itself:



In [5]:

    
def Tree():
    return collections.defaultdict(Tree)


def tree_to_dict(tree):
    """Convert the custom tree structure into a plain dict."""
    return {k: tree_to_dict(tree[k]) for k in tree}

tree = Tree()
tree['a']['b']['c']['d']
tree['a']['m']['n']['z']

tree_to_dict(tree)









    Out[5]:





{'a': {'b': {'c': {'d': {}}}, 'm': {'n': {'z': {}}}}}
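Another common use of the factory is grouping. The following is a minimal sketch, assuming some hypothetical (category, word) pairs that are not part of the original notebook; it avoids print so that it runs unchanged on Python 2.7 and Python 3.

```python
import collections

# Hypothetical sample data: (category, word) pairs.
pairs = [('fruit', 'apple'), ('veg', 'leek'), ('fruit', 'pear')]

groups = collections.defaultdict(list)
for category, word in pairs:
    # A missing category is created as an empty list on first access.
    groups[category].append(word)

grouped = dict(groups)

# .get() does not trigger the factory, so it never inserts a key.
missing = groups.get('meat')
```

Here grouped maps each category to the list of its words, in the order they were seen. Using .get() for a read-only lookup is one way to avoid growing the dictionary by accident.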

namedtuple creates tuple subclasses with named fields, which makes code more readable.



In [6]:

    
Point = collections.namedtuple('Point', ['x', 'y'])

Point(3, y=4)









    Out[6]:





Point(x=3, y=4)
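A short sketch of what the named fields buy you in practice: access by name or by index, ordinary tuple unpacking, and the _replace and _asdict helpers. The variable names below are only for illustration.

```python
import collections

Point = collections.namedtuple('Point', ['x', 'y'])
p = Point(3, y=4)

x, y = p                      # unpacks like an ordinary tuple
moved = p._replace(x=10)      # returns a new Point; p itself is unchanged
as_dict = dict(p._asdict())   # field names mapped to values
```

Because a namedtuple is still a tuple, it is immutable, which is why _replace returns a new object instead of modifying p.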

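As the name suggests, Counter is a counter. Counter objects can be added to and subtracted from each other, and the built-in most_common method returns the top-K entries, which makes Counter very convenient for simple statistics.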



In [7]:

    
cats = ['cat'] * 10
dogs = ['dog'] * 20
birds = ['bird'] * 5
dolphins = ['dolphin'] * 9

animal_counter = collections.Counter()
for animal in cats + dogs + birds + dolphins:
    animal_counter[animal] += 1

mammals_counter = collections.Counter()
for animal in cats + dogs + dolphins:
    mammals_counter[animal] += 1

print 'Top 2:', animal_counter.most_common(2)
print 'Animal not mammals:', animal_counter - mammals_counter









    



Top 2: [('dog', 20), ('cat', 10)]
Animal not mammals: Counter({'bird': 5})
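The explicit counting loops above can usually be shortened, since Counter accepts an iterable directly. A minimal sketch, using a made-up sentence as input:

```python
import collections

# Hypothetical input text, chosen only for illustration.
words = 'the quick brown fox jumps over the lazy dog the end'.split()
word_counter = collections.Counter(words)

more = collections.Counter(['fox', 'fox', 'fox', 'dog'])
combined = word_counter + more   # counts are added key by key

top_two = combined.most_common(2)
absent = word_counter['cat']     # missing keys count as 0 and are not inserted
```

In this sketch top_two is [('fox', 4), ('the', 3)].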

deque implements a double-ended queue, with efficient appends and pops at both ends.



In [8]:

    
d = collections.deque([1, 2, 3, 4, 5])
print d

d.pop()
print 'After pop:', d
d.popleft()
print 'After popleft:', d

d.append('a')
print 'After append:', d
d.appendleft('A')
print 'After appendleft:', d

d.extend(['x', 'y', 'z'])
print 'After extend:', d
d.extendleft(['X', 'Y', 'Z'])
print 'After extendleft:', d









    



deque([1, 2, 3, 4, 5])
After pop: deque([1, 2, 3, 4])
After popleft: deque([2, 3, 4])
After append: deque([2, 3, 4, 'a'])
After appendleft: deque(['A', 2, 3, 4, 'a'])
After extend: deque(['A', 2, 3, 4, 'a', 'x', 'y', 'z'])
After extendleft: deque(['Z', 'Y', 'X', 'A', 2, 3, 4, 'a', 'x', 'y', 'z'])
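Two further deque features that may be useful, shown here as a small sketch: a bounded deque created with maxlen, which can serve as a sliding window over the most recent items, and rotate, which shifts elements around the ends.

```python
import collections

recent = collections.deque(maxlen=3)
for value in range(6):
    recent.append(value)   # once full, the oldest item falls off the left

window = list(recent)

ring = collections.deque([1, 2, 3, 4, 5])
ring.rotate(2)             # rotate right by two positions
rotated = list(ring)
```

After the loop, window holds only the last three values, [3, 4, 5], and rotated is [4, 5, 1, 2, 3].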

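Like defaultdict, OrderedDict is a subclass of dict and supports all dict operations. In addition, it remembers the order in which keys were inserted. The plain dict in the Python 2 session below makes no such guarantee: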



In [9]:

    
d1 = dict([
    ('first', 1),
    ('second', 2),
    ('third', 3),
    ('fourth', 4)
])
print d1.keys()

d2 = collections.OrderedDict([
    ('first', 1),
    ('second', 2),
    ('third', 3),
    ('fourth', 4)
])
print d2.keys()









    



['second', 'fourth', 'third', 'first']
['first', 'second', 'third', 'fourth']
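Because the order is tracked, OrderedDict can also be consumed from either end with popitem. A minimal sketch, assuming the same kind of keys as above:

```python
import collections

od = collections.OrderedDict()
od['first'] = 1
od['second'] = 2
od['third'] = 3

od['first'] = 100                 # updating an existing key keeps its position
keys_after_update = list(od)

oldest = od.popitem(last=False)   # remove from the front (FIFO)
newest = od.popitem()             # remove from the back (LIFO, the default)
remaining = list(od)
```

This makes OrderedDict a reasonable building block for simple queues or caches keyed by name.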

functools



In [10]:

    
import functools

functools contains a number of higher-order functions. The two we probably use most are partial and wraps.

partial takes a function object and fixes values for some of its arguments, producing a new callable. For example, suppose we have a function named search_job:



In [11]:

    
def search_job(query, redis_config, sql_config, index_schema):
    """docstring here"""
    # blablabla
    return query

Here redis_config, sql_config and index_schema are already determined at project startup by reading a configuration file. Suppose their values are as follows:



In [12]:

    
REDIS_CONFIG = {'host': 'localhost', 'port': 6379, 'db': 0}
SQL_CONFIG = {'host': 'localhost', 'port': 4000, 'name': 'test'}
INDEX_SCHEMA = {'title': str, 'content': str}

We can then use partial to bind these settings to the corresponding parameters of search_job and produce a new function:



In [13]:

    
search_job_with_default_args = functools.partial(
    search_job,
    redis_config=REDIS_CONFIG,
    sql_config=SQL_CONFIG,
    index_schema=INDEX_SCHEMA,
)

search_job_with_default_args('hello world')









    Out[13]:





'hello world'
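The values bound by partial behave like defaults: keyword arguments can still be overridden at call time, and the partial object exposes what it has bound. A small sketch with a hypothetical power function (not from the original notebook):

```python
import functools

def power(base, exponent):
    """Hypothetical helper used only for this illustration."""
    return base ** exponent

square = functools.partial(power, exponent=2)
cube = functools.partial(power, exponent=3)
two_to_the = functools.partial(power, 2)   # binds the first positional argument

overridden = square(5, exponent=3)         # a bound keyword can be overridden
```

The attributes func, args and keywords on a partial object are handy when debugging which configuration a callable was built with.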

wraps makes a decorated function keep its own attributes (such as __name__ and __doc__). For example, the search_job function defined above has these attributes:



In [14]:

    
search_job.__name__, search_job.__doc__









    Out[14]:





('search_job', 'docstring here')

After applying a decorator to it, the attributes are those of the inner wrapper function instead:



In [15]:

    
def func_wrapper(func):
    def wrap_it(*args, **kwargs):
        """docstring of wrap_it"""
        return func(*args, **kwargs)

    return wrap_it


@func_wrapper
def search_job_wrapped(query, redis_config, sql_config, index_schema):
    """docstring here"""
    # blablabla
    return query


search_job_wrapped.__name__, search_job_wrapped.__doc__









    Out[15]:





('wrap_it', 'docstring of wrap_it')

With wraps, the decorated function reports its own attributes correctly:



In [16]:

    
def func_wrapper(func):
    @functools.wraps(func)
    def wrap_it(*args, **kwargs):
        """docstring of wrap_it"""
        return func(*args, **kwargs)

    return wrap_it


@func_wrapper
def search_job_wrapped(query, redis_config, sql_config, index_schema):
    """docstring here"""
    # blablabla
    return query


search_job_wrapped.__name__, search_job_wrapped.__doc__



    Out[16]:



('search_job_wrapped', 'docstring here')
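As a slightly more realistic sketch of the same pattern, here is a hypothetical call-counting decorator. The names count_calls and greet are assumptions made up for this example.

```python
import functools

def count_calls(func):
    """Decorator that records how many times func has been called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def greet(name):
    """Return a greeting for name."""
    return 'hello ' + name

greet('world')
greet('again')
```

Thanks to wraps, greet.__name__ and greet.__doc__ still describe greet, while greet.calls carries the extra state added by the decorator.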

itertools



In [17]:

    
import itertools

itertools implements many useful iterators, along with functions for processing and manipulating iterables.

chain joins several iterables together into a single new iterator.



In [18]:

    
print list(itertools.chain('1234', 'abc'))
print list(itertools.chain([1, 2, 3, 4], ['a', 'b', 'c']))
print list(itertools.chain([1, 2, 3, 4], 'abc'))









    



['1', '2', '3', '4', 'a', 'b', 'c']
[1, 2, 3, 4, 'a', 'b', 'c']
[1, 2, 3, 4, 'a', 'b', 'c']

chain also has a class method, from_iterable. Unlike the chain constructor, it takes exactly one argument: an iterable whose elements are themselves iterables.



In [19]:

    
print list(itertools.chain.from_iterable(['1234', 'abc']))
print list(itertools.chain.from_iterable([[1, 2, 3, 4], ['a', 'b', 'c']]))
print list(itertools.chain.from_iterable([[1, 2, 3, 4], 'abc']))









    



['1', '2', '3', '4', 'a', 'b', 'c']
[1, 2, 3, 4, 'a', 'b', 'c']
[1, 2, 3, 4, 'a', 'b', 'c']
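One place where from_iterable is convenient is flattening the output of a generator, since the outer iterable is consumed lazily. A minimal sketch, with a hypothetical rows generator standing in for a real data source:

```python
import itertools

def rows():
    # Hypothetical data source that yields one list at a time.
    yield [1, 2]
    yield [3]
    yield [4, 5, 6]

flat = list(itertools.chain.from_iterable(rows()))

# Only one level of nesting is removed.
one_level = list(itertools.chain.from_iterable([[1, [2]], [3]]))
```

Note that the inner list [2] survives in one_level: chain flattens exactly one level and does not recurse.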

Given a finite iterable, combinations generates all subsequences of the specified length (that is, combinations in the mathematical sense).



In [20]:

    
print list(itertools.combinations('abc', 2))
print list(itertools.combinations(xrange(4), 2))









    



[('a', 'b'), ('a', 'c'), ('b', 'c')]
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

Correspondingly, permutations generates all possible orderings of the specified length.



In [21]:

    
print list(itertools.permutations('abc', 2))
print list(itertools.permutations(xrange(4), 2))









    



[('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]
[(0, 1), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (3, 0), (3, 1), (3, 2)]
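The relationship between the two can be checked by counting. For 5 items taken 3 at a time there are 10 combinations and 60 permutations, since each combination can be ordered in 3! = 6 ways. A quick sketch:

```python
import itertools

items = 'abcde'
combos = list(itertools.combinations(items, 3))
perms = list(itertools.permutations(items, 3))

# Collapsing each permutation to sorted order recovers the combinations.
collapsed = set(tuple(sorted(p)) for p in perms)
```

The last step works here because items is already in sorted order, so each combination is itself a sorted tuple.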

compress uses a mask (or selector) to pick the corresponding elements from another iterable, and returns them as a new iterator.



In [22]:

    
print list(itertools.compress('ABCDF', [1, 0, 1, 0, 1]))
print list(itertools.compress('ABCDF', [True, False, True, False, True]))









    



['A', 'C', 'F']
['A', 'C', 'F']
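The selector does not have to be a literal list; it can be computed from a parallel sequence. A small sketch with made-up names and scores:

```python
import itertools

# Hypothetical parallel lists, for illustration only.
names = ['alice', 'bob', 'carol', 'dave']
scores = [82, 47, 91, 60]

passed = list(itertools.compress(names, (s >= 60 for s in scores)))

# Iteration stops as soon as either input is exhausted.
truncated = list(itertools.compress('ABCDEF', [1, 1]))
```

Here passed contains the names whose score is at least 60, and truncated shows that a short selector simply cuts the result short.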

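dropwhile discards the leading run of consecutive elements that satisfy a condition and returns an iterator over the rest. Elements after the first failure are kept even if they satisfy the condition: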



In [23]:

    
x = '0123hello 123 world'
print ''.join(itertools.dropwhile(lambda x: x.isdigit(), x))









    



hello 123 world



In [24]:

    
def is_vowel(ch):
    return ch in set('aeiou')

def is_consonant(ch):
    return not is_vowel(ch)

def pig_latin(word):
    if is_vowel(word[0]):
        return word + 'yay'
    else:
        remain = ''.join(itertools.dropwhile(is_consonant, word))
        removed = word[:len(word)-len(remain)]
        return remain + removed + 'ay'

print pig_latin('hello')
print pig_latin('ok')









    



ellohay
okyay

The counterpart of dropwhile is takewhile, which returns the leading run of consecutive elements that satisfy the condition. The pig_latin function above can be rewritten with it:



In [25]:

    
def another_pig_latin(word):
    if is_vowel(word[0]):
        return word + 'yay'
    else:
        removed = ''.join(itertools.takewhile(is_consonant, word))
        # Everything after the leading consonants moves to the front.
        remain = word[len(removed):]
        return remain + removed + 'ay'

print another_pig_latin('hello')
print another_pig_latin('ok')









    



ellohay
okyay
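takewhile and dropwhile are complementary: applied with the same predicate, they split a sequence at the first element that fails it. A minimal sketch, where split_leading is a hypothetical helper name:

```python
import itertools

def split_leading(predicate, iterable):
    """Split iterable into (leading run satisfying predicate, the rest)."""
    items = list(iterable)   # materialize so both passes see the same data
    head = list(itertools.takewhile(predicate, items))
    tail = list(itertools.dropwhile(predicate, items))
    return head, tail

head, tail = split_leading(str.isdigit, '0123hello 123 world')
rejoined = ''.join(head) + ''.join(tail)
```

Concatenating the two parts gives back the original input, which is a quick way to convince yourself that nothing is lost at the split point.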

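starmap is similar to the built-in map, but each element of the input is a tuple of arguments that is unpacked when calling the function, so it works with functions that take several parameters.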



In [26]:

    
def add(a, b):
    return a + b

list(itertools.starmap(add, [(1, 2), (3, 4), (4, 5), (5, 6)]))









    Out[26]:





[3, 7, 9, 11]
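To make the difference from map concrete, the sketch below computes the same result three ways: with starmap over argument tuples, with an explicit comprehension, and with map over separate argument sequences.

```python
import itertools

pairs = [(2, 5), (3, 2), (10, 3)]

with_starmap = list(itertools.starmap(pow, pairs))
with_comprehension = [pow(*args) for args in pairs]

# map expects one iterable per parameter, so the pairs must be transposed first.
bases, exponents = zip(*pairs)
with_map = list(map(pow, bases, exponents))
```

starmap is the natural choice when the arguments are already grouped into tuples, as they are in pairs.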

As the name suggests, groupby groups its input according to a key function.



In [27]:

    
for key, group in itertools.groupby(xrange(12), lambda x: x // 5):
    print key, list(group)









    



0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]

Note, however, that it only groups consecutive elements that share the same key:



In [28]:

    
for key, group in itertools.groupby(xrange(5), lambda x: x % 2):
    print key, list(group)









    



0 [0]
1 [1]
0 [2]
1 [3]
0 [4]
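If you want one group per key regardless of position, the usual approach is to sort by the same key function first. A minimal sketch, with parity as a hypothetical key function:

```python
import itertools

def parity(n):
    return n % 2

numbers = range(5)
grouped = {
    key: list(group)
    for key, group in itertools.groupby(sorted(numbers, key=parity), key=parity)
}
```

Because sorted is stable, the elements inside each group keep their original relative order.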

There are also ifilter, imap, islice and izip, which behave like filter, map, slicing and zip, except that they return iterators instead of lists.

count, cycle and repeat, on the other hand, can be used to produce infinite iterators.
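Infinite iterators are normally combined with something that bounds them, and islice is the usual tool for that. A short sketch:

```python
import itertools

# Take the first five items of each infinite iterator.
evens = list(itertools.islice(itertools.count(0, 2), 5))
colors = list(itertools.islice(itertools.cycle(['red', 'green']), 5))

# repeat is infinite by default, but accepts an optional repeat count.
pad = list(itertools.repeat(0, 3))
```

Calling list() directly on count() or cycle() would never terminate, which is why the slice comes first.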

random



In [29]:

    
import random

The random module generates random numbers and performs random sampling on existing data. It is another very practical module.

The random function generates random floats uniformly distributed over the interval [0, 1):



In [30]:

    
samples = [random.random() for _ in range(10000)]

plt.hist(samples)
plt.title('Samples of random.random')
plt.xlabel('sample')
plt.ylabel('count')









    Out[30]:





<matplotlib.text.Text at 0x3672c10>

uniform returns a random float uniformly distributed over a given interval. With uniform we can write a function that behaves much like random above:



In [31]:

    
def my_random():
    return random.uniform(0, 1)

randint returns a random integer following a discrete uniform distribution over the given interval, with both endpoints included:



In [32]:

    
[random.randint(0, 10) for _ in range(10)]









    Out[32]:





[0, 7, 0, 6, 3, 0, 0, 6, 10, 8]
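The outputs in this section will differ from run to run. When reproducible results are needed, one option is a dedicated random.Random instance with a fixed seed. The sketch below uses an arbitrary seed of 42 (an assumption for illustration) and also checks that randint can return both endpoints.

```python
import random

rng = random.Random(42)        # arbitrary seed, chosen only for illustration
draws = [rng.randint(0, 10) for _ in range(10000)]

same_rng = random.Random(42)   # same seed, so the same sequence is produced
same_draws = [same_rng.randint(0, 10) for _ in range(10000)]

seen = set(draws)
```

With this many draws, seen should contain every integer from 0 to 10, including both 0 and 10.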

choice draws one element uniformly at random from a given sequence:



In [33]:

    
seq = range(10)
samples = [random.choice(seq) for _ in range(10000)]

plt.hist(samples)
plt.title('Samples of random.choice')
plt.xlabel('sample')
plt.ylabel('count')









    Out[33]:





<matplotlib.text.Text at 0x3615410>

sample draws K elements at random from a given sequence without replacement, so no position in the original sequence is chosen twice.



In [34]:

    
seq = range(10)
random.sample(seq, 3)









    Out[34]:





[7, 1, 3]

shuffle rearranges all the elements of a list, i.e. randomizes their order. It is another commonly used function:



In [35]:

    
seq = range(10)
random.shuffle(seq)

seq









    Out[35]:





[3, 9, 1, 0, 7, 2, 6, 4, 5, 8]

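shuffle has a side effect: it modifies the list in place and returns None. If you want to leave the original list untouched and get the shuffled result back as a new list, the sample function mentioned earlier does the job: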



In [36]:

    
seq = range(10)
new_seq = random.sample(seq, len(seq))

print 'origin sequence:', seq
print 'new sequence:', new_seq









    



origin sequence: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
new sequence: [5, 8, 3, 7, 0, 6, 9, 4, 1, 2]
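The same idea can be wrapped in a small helper. This is a sketch; shuffled is a hypothetical name, not a function provided by the random module.

```python
import random

def shuffled(seq):
    """Return a new list with the items of seq in random order."""
    items = list(seq)   # copy first so any iterable is accepted
    return random.sample(items, len(items))

original = list(range(10))
result = shuffled(original)
```

The result is a permutation of the input, and original is left exactly as it was.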

gauss and normalvariate both generate normally distributed random numbers. Note, however, that gauss is not thread-safe.



In [37]:

    
plt.subplots_adjust(hspace=1.)

plt.subplot(211)
samples_by_gauss = [random.gauss(0, 1) for _ in range(10000)]
plt.hist(samples_by_gauss)
plt.title('Samples of random.gauss')
plt.xlabel('sample')
plt.ylabel('count')

plt.subplot(212)
samples_by_normalvariate = [random.normalvariate(0, 1) for _ in range(10000)]
plt.hist(samples_by_normalvariate)
plt.title('Samples of random.normalvariate')
plt.xlabel('sample')
plt.ylabel('count')









    Out[37]:





<matplotlib.text.Text at 0x3c7a910>
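Besides plotting a histogram, a quick numerical check is to compare the sample mean and standard deviation with the requested parameters. A minimal sketch, assuming a mean of 5.0, a standard deviation of 2.0 and an arbitrary seed of 0:

```python
import math
import random

rng = random.Random(0)   # arbitrary seed, chosen only for illustration
samples = [rng.gauss(5.0, 2.0) for _ in range(10000)]

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
std = math.sqrt(variance)
```

With 10000 samples, mean and std should land close to 5.0 and 2.0 respectively, though the exact values depend on the seed.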