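Python In Practice: Practice Is King

1 File I/O: A Quick Tour

Sightseeing portal: https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook/tree/master/Ch01

Even the Haskell Data Analysis Cookbook, priced at $54.99, can only spend its first chapter on the unglamorous topic of reading text in the following formats:

TXT, DAT (plain text whose layout you already know something about)
CSV, TSV (Comma/Tab Separated Values: data tables delimited by commas or tabs)
JSON (JavaScript Object Notation: a string representation compatible with JavaScript data)
XML (eXtensible Markup Language: a markup language for semi-structured data, with tags in angle brackets)
HTML (HyperText Markup Language: the markup language of web pages, with metadata in angle brackets)

Connecting Python to various databases is covered in later material.

A simple CSV file: S0EP4_fakedata.txt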

first_name,last_name,employee_number
luke,vanderhart,1
ryan,neufeld,2



In [1]:

    
import csv
import codecs
import numpy as np
import pandas as pd

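1.1 Opening Files

If you are sure the file contains no Chinese text, the plain open is enough; to handle Chinese text you also need the codecs library.

Next, walk through the procedure:

Open the file
Create an empty list
Read every line of the file into the list
Stop at the end of the file
Close the file
Print the contents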



In [1]:

    
f = open("S0EP4_fakedata.txt",'r')
file_content0 = []
mark = True
while(mark):
    file_content0.append(f.readline())
    mark = file_content0[-1]
f.close()
print file_content0









    



['first_name,last_name,employee_number\n', 'luke,vanderhart,1\n', 'ryan,neufeld,2', '']

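As you can see, the final empty string was read because the loop reached the end of the file, and the earlier lines still carry their trailing newline characters.

Here we use with ... as in place of f = open() and f.close(). It feels natural: the file is closed automatically when the block exits, although this readline loop still has to detect the end of the file itself.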



In [2]:

    
file_content1 = []

with open("S0EP4_fakedata.txt") as f:
    mark = True
    while(mark):
        file_content1.append(f.readline())
        mark = file_content1[-1]

print file_content1,'\n'









    



['first_name,last_name,employee_number\n', 'luke,vanderhart,1\n', 'ryan,neufeld,2', '']

Nice!

The readline loop above guards against reading an overly large file in one go; for a small file, try reading everything at once with readlines():



In [4]:

    
with open("S0EP4_fakedata.txt") as f:
    file_content2 = f.readlines()

print file_content2









    



['first_name,last_name,employee_number\n', 'luke,vanderhart,1\n', 'ryan,neufeld,2']
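As a side note, a file object can also be iterated directly, which reads one line at a time without a manual end-of-file flag. The sketch below is an illustration, not part of the original notebook: it writes a throwaway copy of the sample data to a temporary file (the names sample_text, sample_path and rows are assumptions) and strips the trailing newlines while reading. It is written so that it should run on both Python 2.7 and Python 3.

```python
import os
import tempfile

# Hypothetical stand-in for S0EP4_fakedata.txt so the example is self-contained.
sample_text = "first_name,last_name,employee_number\nluke,vanderhart,1\nryan,neufeld,2"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write(sample_text)

rows = []
with open(sample_path) as f:
    for line in f:                      # the file object yields one line per iteration
        rows.append(line.rstrip("\n"))  # drop the trailing newline, if any

os.remove(sample_path)                  # clean up the temporary file
print(rows)
```

Iterating this way keeps memory use low for large files, much like the readline loop, while avoiding the empty string at the end of the list.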

The challenge of Chinese text:

CODECS = COders/DECoderS



In [2]:

    
with codecs.open("S0EP4_fakedata2.txt",'r',encoding='utf-8') as f:
    file_content3 = f.readlines()

print file_content3
    
for item in file_content3:
    print item,

with codecs.open("S0EP4_fakedata2_w.txt","w",encoding='utf-8') as f:
    for item in file_content3:
        f.write(item)









    



[u'\u540d,\u59d3,ID\n', u'\u4e8c\u72d7,\u738b,1\n', u'\u7fe0\u82b1,\u674e,2\n']
名,姓,ID
二狗,王,1
翠花,李,2

Cleanup:



In [3]:

    
try:
    del file_content0
    del file_content1
    del file_content2
    del file_content3
except Exception:
    pass

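1.2 Reading CSV Without File Objects

NumPy is one workable option:

Pay attention to the relationship between the byte strings stored in the file and Chinese (Unicode) strings.

encode always converts a text string to a byte string
decode always converts a byte string to a text string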



In [7]:

    
npdata = np.genfromtxt("S0EP4_fakedata2.txt",skip_header=1,dtype=None,delimiter=',')

#print npdata

print npdata[0][0],len(npdata[0][0]),unicode(npdata[0][0].decode('utf-8')),len(npdata[0][0].decode('utf-8'))

#npdata[0][0],npdata[0][0].decode('utf-8')









    



二狗 6 二狗 2
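The byte and character counts printed above can be reproduced without NumPy. The following is a minimal sketch, assuming the same two-character name as in the sample file; the variable names text, raw and back are made up for illustration, and the Unicode escapes are used so that the snippet needs no source-encoding declaration.

```python
text = u"\u4e8c\u72d7"            # the two-character name from the sample file
raw = text.encode("utf-8")        # encode: text string -> byte string
back = raw.decode("utf-8")        # decode: byte string -> text string

print(len(text))                  # number of characters
print(len(raw))                   # number of bytes (three per character for this name)
print(back == text)               # the round trip recovers the original text
```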

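Simplifying further:

Using UTF-8 this way carries an extra transcoding burden. If you need to do statistical analysis on CSV data, Pandas is the more convenient choice.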



In [10]:

    
pddata = pd.read_csv("S0EP4_fakedata2.txt",header=0,sep=",",encoding="utf-8")

print pddata,type(pddata)

pddata.to_csv("S0EP4_fakedata2_pd.txt",sep="\t",index=False,encoding="utf-8")









    



    名  姓  ID
0  二狗  王   1
1  翠花  李   2 <class 'pandas.core.frame.DataFrame'>

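1.3 Reading and Writing JSON (JavaScript Object Notation) Directly

JSON is another common data interchange format, used widely in web development and network data transfer.

The json library handles the simple cases. First, convert a Python data structure to a JSON string: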



In [8]:

    
import json
from pprint import pprint
in_data=[{
         "id": "001",
         "name": "Bulbasaur",
         "type": "Grass",
         "subtype": "Poison"
     },
     {   "id": "002",
         "name": "Ivysaur",
         "type": "Grass",
         "subtype": "Poison"
     },
    {
         "id": "003",
         "subtype": "Poison",
         "name": "Venusaur",
         "type": "Grass"
    }]

data_string = json.dumps(in_data)          # convert a Python data structure to JSON
out_data = json.loads(data_string)         # turn the JSON-encoded string back into a Python data structure

print data_string,'\n',type(data_string),'\n',out_data,'\n',type(out_data)









    



[{"subtype": "Poison", "type": "Grass", "id": "001", "name": "Bulbasaur"}, {"subtype": "Poison", "type": "Grass", "id": "002", "name": "Ivysaur"}, {"subtype": "Poison", "type": "Grass", "id": "003", "name": "Venusaur"}] 
<type 'str'> 
[{u'subtype': u'Poison', u'type': u'Grass', u'id': u'001', u'name': u'Bulbasaur'}, {u'subtype': u'Poison', u'type': u'Grass', u'id': u'002', u'name': u'Ivysaur'}, {u'subtype': u'Poison', u'type': u'Grass', u'id': u'003', u'name': u'Venusaur'}] 
<type 'list'>

If your JSON needs to be carried in a text file:



In [12]:

    
# Writing JSON data
with open('S0EP4_jsondata.json', 'w') as f:
    json.dump(in_data, f)

# Reading data back
with open('S0EP4_jsondata.json', 'r') as f:
    out_data = json.load(f)

print out_data









    



[{u'subtype': u'Poison', u'type': u'Grass', u'id': u'001', u'name': u'Bulbasaur'}, {u'subtype': u'Poison', u'type': u'Grass', u'id': u'002', u'name': u'Ivysaur'}, {u'subtype': u'Poison', u'type': u'Grass', u'id': u'003', u'name': u'Venusaur'}]

pprint inspects the structure of out_data and prints it a little more neatly:



In [13]:

    
pprint(out_data)









    



[{u'id': u'001',
  u'name': u'Bulbasaur',
  u'subtype': u'Poison',
  u'type': u'Grass'},
 {u'id': u'002',
  u'name': u'Ivysaur',
  u'subtype': u'Poison',
  u'type': u'Grass'},
 {u'id': u'003',
  u'name': u'Venusaur',
  u'subtype': u'Poison',
  u'type': u'Grass'}]

The prettified JSON data commonly seen on the web:



In [12]:

    
print json.dumps(out_data, indent=2)









    



[
  {
    "subtype": "Poison", 
    "type": "Grass", 
    "id": "001", 
    "name": "Bulbasaur"
  }, 
  {
    "subtype": "Poison", 
    "type": "Grass", 
    "id": "002", 
    "name": "Ivysaur"
  }, 
  {
    "subtype": "Poison", 
    "type": "Grass", 
    "id": "003", 
    "name": "Venusaur"
  }
]
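The key order in the dumps output above differs from the order in in_data because plain dictionaries in Python 2 do not preserve insertion order. If a reproducible layout matters, one option is the sort_keys parameter of json.dumps. The sketch below uses a single record from the example; the names record, pretty and restored are assumptions for illustration.

```python
import json

record = {"id": "001", "name": "Bulbasaur", "type": "Grass", "subtype": "Poison"}

# sort_keys=True orders the keys alphabetically, so the text is reproducible;
# separators=(",", ": ") avoids trailing spaces at line ends on Python 2.
pretty = json.dumps(record, indent=2, sort_keys=True, separators=(",", ": "))
print(pretty)

# The round trip returns an equal dictionary.
restored = json.loads(pretty)
print(restored == record)
```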

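2 Get Organized: Package Management, Module Organization, and Project Organization

2.1 Package Management Tools

Distribution: Continuum Analytics (continuum.io/downloads). IDE: PyCharm (www.jetbrains.com/pycharm).

conda install numpy (recommended)
pip install numpy (recommended)
easy_install numpy
From a source package: python setup.py install
Flexible deployment: virtualenv

Packages can usually be found under lib/site-packages inside the Python installation folder.

2.2 Import Conventions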



In [10]:

    
import scipy
print scipy.log(5)

import scipy as sp # recommended!
print sp.log(5)

import scipy.special as ssp # recommended!
print ssp.comb(5,2)

from scipy import log
print log(5) # may pollute the namespace (Namespace Contamination)

from scipy import *
print log(5) # strongly discouraged









    



1.60943791243
1.60943791243
10.0
1.60943791243
1.60943791243

Now let's use this somewhat verbose style of import to briefly introduce the datetime library and get a feel for it:



In [15]:

    
# For example, to get the current time
import datetime
print datetime.datetime.now().strftime("%Y-%m-%d %H:%M")     # get the current time
sometime = datetime.datetime.strptime("2015-05-20","%Y-%m-%d")
print sometime,type(sometime)                                # get a value of type datetime.datetime
print (datetime.datetime.now() - datetime.timedelta(hours=1)).strftime("%Y-%m-%d %H:%M") # get the time one hour ago









    



2015-06-06 11:38
2015-05-20 00:00:00 <type 'datetime.datetime'>
2015-06-06 10:38

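2.3 A First Look at Project Organization

Python searches lib/site-packages and the current folder on its own for importable files, and runs a module's code when it is first imported.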
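To make the search behaviour concrete, here is a small sketch, assuming a throwaway folder and a hypothetical module named mytools that is created on the fly. It adds the folder to sys.path, the list of locations Python searches on import, and shows that the module body runs only on the first import.

```python
import os
import sys
import tempfile

# Create a throwaway folder containing a tiny module (hypothetical name: mytools.py).
project_dir = tempfile.mkdtemp()
with open(os.path.join(project_dir, "mytools.py"), "w") as f:
    f.write("print('mytools is being executed')\n")
    f.write("def double(x):\n")
    f.write("    return 2 * x\n")

sys.path.insert(0, project_dir)   # add the folder to the module search path

import mytools                    # first import: the module body runs once
import mytools                    # second import: reuses the cached module, nothing runs

print(mytools.double(21))
print("mytools" in sys.modules)   # imported modules are cached in sys.modules
```

In a real project the folder would normally be the one holding your script, which Python already places on sys.path, so the explicit insert is needed here only because the module lives in a temporary directory.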

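3 Touch the True Programming: Functions and Classes

Now that we know how to organize a project neatly, we turn to functions and classes, touching briefly on Functional Programming; functools is introduced later.

Using functions and classes flexibly is at the core of any programming language.

Programming paradigms:

Procedural Programming
Object-Oriented Programming
Functional Programming

Python claims to support all of these paradigms.

3.1 Functions

A function: arguments go in, a result comes out (its mathematical definition is the same).

Procedural Programming: write the program as a sequence of steps, spelling out everything you want done:

Deduplication: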



In [15]:

    
a=[1,2,3,2,5,8,1,3,10,12,15,5]
b=list(set(a))
print b









    



[1, 2, 3, 5, 8, 10, 12, 15]

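Functions exist to reuse functional code: if you need to do something more than once, use a function.

3.1.1 Basics

Here we deduplicate without calling set() directly on the list. Notice that the function does not even check the type of its argument; could this freedom become a pitfall later?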



In [11]:

    
def dedupe(items):
    collect = set()
    for item in items:
        if item not in collect:
            #yield item
            collect.add(item)
    return collect # use either this or the yield statement above
a=[1,2,3,2,5,8,1,3,10,12,15,5]
b=list(dedupe(a))
print b









    



[1, 2, 3, 5, 8, 10, 12, 15]
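Returning a set discards the original order of the items; the sorted-looking output above is incidental. The commented-out yield hints at an alternative, sketched here under the assumption that preserving first-seen order is what you want. The name dedupe_ordered is made up for this illustration.

```python
def dedupe_ordered(items):
    """Yield each item the first time it is seen, preserving input order."""
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 2, 3, 2, 5, 8, 1, 3, 10, 12, 15, 5]
b = list(dedupe_ordered(a))
print(b)

# Order is kept even when the input is not sorted.
letters = list(dedupe_ordered("banana"))
print(letters)
```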

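Functions can take parameters with default values; callers may omit them or pass them by name as keyword arguments (kwargs).

Note that default values should be immutable! If you use a mutable type such as [] or {} as a default, the same object is created once at definition time and shared by every call, and your results soon become confusing.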



In [12]:

    
import numpy as np

def abs(x=0):                      # note: this name shadows the built-in abs()
    y=x if x >= 0 else -x
    return y

print abs(-10)
print abs()

def addlist(x=[]):
    x.append(np.random.randint(5))
    print x
    return None

addlist()
addlist()
addlist()









    



10
0
[3]
[3, 4]
[3, 4, 2]
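The growing list above shows the shared default in action. A common way around it, sketched here with hypothetical names append_item and bucket, is to use None as the default and create the list inside the function.

```python
def append_item(item, bucket=None):
    """Append item to bucket, creating a fresh list when none is given."""
    if bucket is None:        # a new list is created on every call that omits bucket
        bucket = []
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)
print(first)
print(second)
```

Because None is immutable, nothing carries over between calls unless the caller passes a list explicitly.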

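Point 1: A function must be defined before it is used; otherwise a NameError is raised.

Once defined, a function can be called directly by passing suitable arguments in the parentheses, but calling it before the definition fails, as the first call below shows. Also, never forget return (see Point 2):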



In [13]:

    
#del myabs

try:
    print myabs(10)
except Exception as e:
    print e
    
def myabs(x):                    
    y = x if x >= 0 else -x
    return y

print myabs(10)









    



name 'myabs' is not defined
10

Point 2: Remember to return from a function; print is not a substitute for return.



In [15]:

    
def myabs(x):
    y = x if x >= 0 else -x
    #return y

print myabs(-5)

def test():
    print "A man need a code-Bunk"

print type(test())









    



None
A man need a code-Bunk
<type 'NoneType'>

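Point 3: Put return in the right place

Once the first return is reached, nothing after it in the function is executed; inside a loop it acts like break, and it also leaves the function.

Input 4: both versions hit the return False inside the if.
Input 7: since 7 is prime, the body of the if never runs; one version returns True after a single pass through the loop body, the other returns True only after the loop has finished.
Input 9: one version returns True after a single pass through the loop body, the other returns False when it reaches i=3.

Uncomment print i to output trace information.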



In [17]:

    
def is_prime_wrong(n):
    for i in xrange(2,n):
        #print i,
        if n % i == 0:
            return False
        return True
    
def is_prime_right(n):
    for i in xrange(2,n):
        #print i
        if n % i == 0:
            return False
    return True

print is_prime_wrong(4),is_prime_wrong(7),is_prime_wrong(9)
print is_prime_right(4),is_prime_right(7),is_prime_right(9)









    



False True True
False True False

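Point 4: Finally, when writing a function, it is recommended to attach documentation inside its code block; as described earlier, this documentation usually sits between triple quotes.

What it does (which result it computes and returns)
Parameter types
How the types change across the computation (for example int -> str, int -> int)
A short example of using the function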



In [21]:

    
def myabs(x=0):
    """
    Function: myabs
    Parameters: number -> number
    Purpose: compute the absolute value of a number; if the argument is omitted, it defaults to 0
    Example: print myabs(-42) prints 42
    """
    y = x if x>= 0 else -x
    return y

print myabs(-5)
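The text between the triple quotes is more than a comment: Python stores it on the function object, where tools can read it. A minimal sketch, assuming an English version of the docstring above; inspect.getdoc is a standard-library helper that returns the docstring with its indentation cleaned up, and summary is a name chosen for this example.

```python
import inspect

def myabs(x=0):
    """Return the absolute value of a number.

    number -> number; the argument defaults to 0.
    Example: myabs(-42) returns 42.
    """
    return x if x >= 0 else -x

# The raw docstring lives in the __doc__ attribute of the function.
print(myabs.__doc__ is not None)

# inspect.getdoc strips the common indentation; take the first line as a summary.
summary = inspect.getdoc(myabs).splitlines()[0]
print(summary)
```

In an interactive session, help(myabs) displays the same text.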

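Point 5: A function can call itself, but you must guard against infinite recursion

The definition of factorial:

$N! = N \times (N-1) \times \cdots \times 1 = \prod_{i=1}^{N} i$

The program logic is:

If N cannot be converted to an integer with int, an error is raised immediately
If N is 0 or 1, return 1
If N is negative, return None
Otherwise N is an integer greater than 1, and N factorial is N times the factorial of N-1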



In [18]:

    
def myFactorial(N):
    intN = int(N)
    if intN in [0,1]:
        return 1
    elif intN<0:
        return None
    else:
        return intN*myFactorial(intN-1)

print myFactorial(5.0)
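Recursion depth in Python is limited (see sys.getrecursionlimit()), so for large N a loop can be the safer choice. The sketch below is one possible iterative counterpart, not the notebook's method; factorial_iterative is a hypothetical name, and it mirrors the convention above of returning None for negative input.

```python
import math

def factorial_iterative(n):
    """Compute n! with a loop; returns None for negative input, like myFactorial."""
    n = int(n)
    if n < 0:
        return None
    result = 1
    for k in range(2, n + 1):   # the range is empty for n in (0, 1), so result stays 1
        result *= k
    return result

print(factorial_iterative(5.0))
print(factorial_iterative(20) == math.factorial(20))
```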

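3.1.2 Inception of "Scopes": Stealing Variable Space

Procedural Programming: programming as a sequence of steps
Variables inside a function are called local variables; those outside are called global variables
In nested functions, each level has its own namespace
A function can "see" outside variables when it is defined and when it runs, but functions are generally designed so that the inner scope does not interfere with outer variables: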



In [21]:

    
x="You live and learn. At any rate, you live."

def func():
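    print 'x is', x                  # sees the outer x
    #x = "So long."                  # using this line instead of exec raises a used-before-assignment error
    exec('x = \"So long.\"')         # modifies the local variable x in a way syntax checks do not easily notice
    print 'Changed local x to', x    # prints the local x

func()
print 'x is still', x                # the outer x is unchanged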









    



x is You live and learn. At any rate, you live.
Changed local x to So long.
x is still You live and learn. At any rate, you live.

If you insist on interfering with the outer scope, use global:



In [19]:

    
x = "You live and learn. At any rate, you live."
def func():
    global x
    print('x is', x)
    x = "So Long, and Thanks for All the Fish"
    print('Changed global x to', x)

func()
print('Value of x is', x)









    



('x is', 'You live and learn. At any rate, you live.')
('Changed global x to', 'So Long, and Thanks for All the Fish')
('Value of x is', 'So Long, and Thanks for All the Fish')

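A quick aside: where does Python go to find a variable? The LEGB Rule tells you:

Local: first look in locals(), the dictionary holding the current local variables
Enclosing: then look in the enclosing code block, or the one enclosing that (you will soon see that functions can wrap functions, and each function has its own level of namespace)
Global: then look in globals()
Built-In: finally, Python's built-in namespace

If the name is still not found, Python reports that the variable is not defined; otherwise it uses the variable from the first place where the name is found.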
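The lookup order can be observed with one name defined at several levels. This is a small sketch with made-up function names (outer, inner_local, inner_enclosing, read_global), assuming it is run at module level.

```python
x = "global x"

def outer():
    x = "enclosing x"
    def inner_local():
        x = "local x"          # Local: found first
        return x
    def inner_enclosing():
        return x               # no local x, so the Enclosing scope is used
    return inner_local(), inner_enclosing()

def read_global():
    return x                   # no local or enclosing x, so the Global scope is used

print(outer())
print(read_global())
print(len("abc"))              # len is found last, in the Built-in scope
```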



In [21]:

    
print type(locals()),type(globals())









    



<type 'dict'> <type 'dict'>



In [25]:

    
print __name__, '__name__' in dir(__builtins__)









    



__main__ True

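3.1.3 New Toys: locals() and globals()

Like variables, what about a function inside a function? Unfortunately, it is still local!

The abstract() defined inside concrete() can be used only by concrete().

Whether you are checking for a function or a variable, you can inspect the locals() and globals() dictionaries.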



In [26]:

    
def concrete():
    "None"
    print "concrete() called"
    def abstract():
        print "abstract() called"
    print 'abstract' in locals()
    abstract()

concrete()
print 'abstract' in globals()
abstract()









    



concrete() called
True
abstract() called
False






    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-26-bdc91565636d> in <module>()
      9 concrete()
     10 print 'abstract' in globals()
---> 11 abstract()

NameError: name 'abstract' is not defined

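Now that we have found the new toys locals() and globals(), we have more ways to interfere with the variables and functions in the current namespace...

If the name a exists in the namespace, delete it no matter what it is (not recommended). Then operate on globals() directly and see what happens: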



In [22]:

    
try:
    del a
except Exception:
    pass

globals()['a'] = 1

print a

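3.1.4 The Ferocious eval() and exec()

If the features so far still do not satisfy you, here are two last moves (use with caution):

eval (evaluate) computes the value of an expression
exec (execute) directly runs the statements inside a string

Both are blunt instruments: once the program goes wrong, the IDE will most likely not tell you where the error is.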



In [23]:

    
a = 3
b = eval('a+2')
print b

e = [2]

exec('b'+str(e[0])+' = a**2 ')
print b2

5
9
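When the string comes from outside your program, gentler tools are often enough. The sketch below shows two alternatives under stated assumptions: ast.literal_eval, a standard-library function that accepts only literals, and a plain dictionary (here called results, a made-up name) in place of variable names built with exec.

```python
import ast

# literal_eval accepts only Python literals (numbers, strings, lists, dicts, ...).
values = ast.literal_eval("[1, 2, 3]")
print(values)

try:
    ast.literal_eval("__import__('os').getcwd()")   # not a literal: rejected
    rejected = False
except ValueError:
    rejected = True
print(rejected)

# Instead of exec-ing 'b2 = a**2', keep dynamically named results in a dict.
a = 3
e = [2]
results = {}
results["b" + str(e[0])] = a ** 2
print(results["b2"])
```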

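With so many handy little tools available, let's do an exercise:

Given the digits 1 through 9 in order, insert +, -, or nothing between adjacent digits so that the expression evaluates to 100. Give all the answers.

One of the answers is 123+45-67+8-9=100.

The few short lines of code below combine the following features:

Recursion (a function calling itself)
re: regular expressions
zip: pairing up two sequences
format: tidy string formatting
List Comprehension
eval: evaluating a string expression
xrange: looping
str: type conversion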



In [29]:

    
import re
def trinary(x):
    return trinary(x/3)+str(x%3) if x>=3 else str(x) # calls itself; type conversion
head_str = ''.join([str(i) for i in xrange(1,10)]) # list comprehension
for i in xrange(3**8): # loop
    tail_str = re.sub('0',' ',re.sub('1','+',re.sub('2','-','{0:0>8}'.format(trinary(i)))))+' ' # regular expressions and format
    zip_str = re.sub(' ','',''.join([s1+s2 for s1,s2 in zip(head_str,tail_str)])) # regular expressions and zip
    if 100==eval(zip_str): # eval
        print tail_str,zip_str









    



  + - +-  123+45-67+8-9
  +-+ -   123+4-5+67-89
  - - +   123-45-67+89
  ----+-  123-4-5-6-7+8-9
 +++--+   12+3+4+5-6-7+89
 +-++ ++  12+3-4+5+67+8+9
 --+-++   12-3-4+5-6+7+89
+ -+ +++  1+23-4+56+7+8+9
+ -+++ -  1+23-4+5+6+78-9
++ -+ -+  1+2+34-5+67-8+9
++-+++ +  1+2+3-4+5+6+78+9
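The same search can be written without the base-3 conversion. As an alternative sketch (not the notebook's approach), itertools.product enumerates every choice of operator for the eight gaps directly; the names digits, ops, expr and solutions are chosen for this illustration.

```python
from itertools import product

digits = "123456789"
solutions = []
for ops in product(["", "+", "-"], repeat=8):     # one choice per gap between digits
    # Interleave digits and operators: d1 op1 d2 op2 ... d9
    expr = digits[0] + "".join(op + d for op, d in zip(ops, digits[1:]))
    if eval(expr) == 100:                          # expr is built only from digits, + and -
        solutions.append(expr)

print(len(solutions))
print("123+45-67+8-9" in solutions)
```

Both versions explore the same 3**8 candidates, so they should report the same set of expressions.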

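3.1.5 Anonymous Functions and Passing Functions

Copying with the equals sign: reference passing applies to

variables
data structures
functions
...

"Static" memory references:

small strings
small numbers

Ordinary references:

Assigning the same value directly, twice, gives different IDs
Assigning by passing a reference gives the same ID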



In [31]:

    
a = "abc"
b = "abc"
c = 1
d = 1
e = 4294967295
f = e
g = 4294967295
id(a),id(b),id(c),id(d),id(e),id(f),id(g)









    Out[31]:





(4302395312,
 4302395312,
 4298165816,
 4298165816,
 4302060680,
 4302060680,
 4302060872)
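Because the caching of small strings and numbers is an implementation detail, identity is easier to study with mutable objects. A minimal sketch, using lists so that the outcome does not depend on interning:

```python
a = [1, 2, 3]
b = a              # b is another name for the same list object
c = [1, 2, 3]      # c is a separate list with equal contents

print(a is b)      # same object, same id
print(a is c)      # different objects
print(a == c)      # equal values

b.append(4)        # mutating through b is visible through a
print(a)
```

The is operator compares identities (the values reported by id), while == compares contents.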

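Having seen these, I think we are one step closer to "functionals", that is, functions of functions.

First, look at the ordinary uses of anonymous functions, that is, lambda:

def defines a named function
Give a named function an alias (reference passing)
lambda defines an anonymous function and binds it to a name
lambda defines an anonymous function and uses it directly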



In [32]:

    
def head(L):
    return L[0]

headprime = head

head2 = lambda x:x[0]
head3 = lambda x:x[0]

a = range(3,6)

print a,head(a),id(head),headprime(a),id(headprime)
print head2(a),id(head2),head3(a),id(head3)
print (lambda x:x[0])(a)









    



[3, 4, 5] 3 4416226920 3 4416226920
3 4416227280 3 4416227160
3

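Next, functions can also be arguments to other functions!

If you taught yourself Python, I would guess there is a 71.2% chance that you first met lambda in a sorting problem (the number is made up, of course; do not believe it).

The first argument to sorted is the dictionary's iteritems; the second, key, is a function that assigns a value to each element of iteritems, and sorted orders the elements by that value.

Here the function is passed as an argument: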



In [30]:

    
d = {'a':1.0,'b':1.2,'c':0.8,'d':0.9}
print sorted(d.iteritems(),key=lambda x:x[1],reverse=True)









    



[('b', 1.2), ('a', 1.0), ('d', 0.9), ('c', 0.8)]
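The key argument accepts any callable, not only a lambda. As an illustration, the sketch below passes a named function (by_value, a made-up helper) and operator.itemgetter from the standard library; it uses d.items() so that it runs on both Python 2 and Python 3.

```python
from operator import itemgetter

d = {'a': 1.0, 'b': 1.2, 'c': 0.8, 'd': 0.9}

def by_value(pair):
    """Return the value part of a (key, value) pair."""
    return pair[1]

ranked_def = sorted(d.items(), key=by_value, reverse=True)
ranked_getter = sorted(d.items(), key=itemgetter(1), reverse=True)

print(ranked_def)
print(ranked_def == ranked_getter)
```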

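3.1.6 Advanced Usage: Decorator

If you have a batch of variables to process by one common rule and want to cut down on code, you need a function.

If you have a batch of functions to process by one common rule and want to cut down on code, you need a decorator.

Get these 2 points straight:

A function
- takes arguments
- does something
- returns a result
A decorator
- takes a function as its argument
- does something
- returns a function

Using a decorating function the ordinary way: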



In [24]:

    
def decorated(func):
    print "Decorated first"
    return func

def main_func():
    print "Main called"

dec_main_func=decorated(main_func)

dec_main_func()









    



Decorated first
Main called

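Using a decorator:

Decorator syntax starts with @, followed by the name of the decorator function and its optional arguments. Right after the decorator declaration come the function being decorated and its optional parameters.

The statement main_func=decorated(main_func) is saved.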



In [35]:

    
# decorator definition
def decorated(func):
    print "Decorated first"
    return func
# using the decorator directly

@decorated
def main_func():
    print "Function called"

main_func()









    



Decorated first
Function called

The @ line is essentially equivalent to "main_func=decorated(main_func)".
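Note that decorated runs its print once, at the moment the function is decorated, and then hands back the original function unchanged. To do something on every call, a decorator usually returns a new wrapper function. The sketch below is one common pattern, not the notebook's code: announced and calls are hypothetical names, and functools.wraps is the standard-library helper that copies the original function's name and docstring onto the wrapper.

```python
import functools

calls = []                              # records events so the order is easy to inspect

def announced(func):
    """Hypothetical decorator: note each call before running func."""
    @functools.wraps(func)              # keep func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        calls.append("before " + func.__name__)
        return func(*args, **kwargs)
    return wrapper

@announced
def main_func():
    calls.append("main_func body")
    return "done"

main_func()
main_func()
print(calls)
print(main_func.__name__)
```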

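3.2 Object-Oriented Programming

Python supports object-oriented programming as well.

The term object-oriented was first coined by Alan C. Kay, who was also a designer of the Smalltalk language.

As a computer scientist with a background in biology, Alan borrowed the biological concept of the cell when thinking about programming style. The meaning of object-oriented comes from what it is modeled on: independent biological cells that communicate with one another. A given type of cell, in a particular environment, produces particular proteins in response to particular chemical signals. Likewise, the objects (sometimes called instances) created after a class is defined respond in particular ways to particular computational messages.

Object orientation hides a suitable amount of internal activity: I do not care how your cytoplasm handles osmosis or how it generates electrical signals... just give me the protein!
Building on functions, it helps programmers organize code and reuse it where appropriate

A simple example: company employees need to be abstracted with a class:

They all have a name, an age, a major, and an employee card number
The name is used as the variable for constructing an employee object, while the other variables can be modified and accessed

First define the class, then create an instance of it, then modify its variables.

A class can act as a template: when creating an instance, we can force certain attributes that we consider mandatory to be filled in. Defining a special __init__ method achieves this.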



In [36]:

    
class Employee(object):            # the parentheses name the class this one inherits from; object is the "base" class that every class ultimately inherits from
    def __init__(self,name):
        self.name=name
    def set_age(self,age):
        self.age=age
    def set_major(self,major):
        self.major=major
    def set_card(self,card):
        self.card=card

Francoise=Employee('Francoise')    # construct an instance; a name is required
Francoise.set_age('22')
Francoise.set_major('Physics')
Francoise.set_card('000142')

Try accessing its variables:



In [37]:

    
print Francoise.age
print Francoise.major
print Francoise.name









    



22
Physics
Francoise

Relationships between classes can be implemented through inheritance:



In [38]:

    
class Manager(Employee):
    qualification="5-year Experience Expert"
    
Alexandre=Manager('Alexandre')
Alexandre.set_age('25')
Alexandre.set_major('Mathematics')
Alexandre.set_card('00004512')


print Alexandre.major
print Alexandre.qualification









    



Mathematics
5-year Experience Expert
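A subclass can also extend __init__ rather than only adding class attributes. The following sketch assumes a hypothetical extra attribute team_size and a __repr__ method, neither of which appears in the example above; super(Manager, self) is spelled out in full so that the code runs on both Python 2 and Python 3.

```python
class Employee(object):
    def __init__(self, name):
        self.name = name

class Manager(Employee):
    qualification = "5-year Experience Expert"      # class attribute shared by all managers

    def __init__(self, name, team_size=0):          # team_size is a hypothetical extra attribute
        super(Manager, self).__init__(name)         # reuse Employee's initialisation
        self.team_size = team_size

    def __repr__(self):
        return "Manager(%r, team_size=%d)" % (self.name, self.team_size)

boss = Manager("Alexandre", team_size=3)
print(boss)
print(isinstance(boss, Employee))
```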


