Regular Expressions

1 Basics

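Pipe (|) matches any one of several alternatives: at|home matches at or home
Dot (.) matches any single character except a newline: t.o matches tao, tzo
Anchors for the start and end of a string or word:
- (^) matches the start of the string: ^From matches a string that starts with From
- ($) matches the end of the string: /bin/tsch$ matches a string that ends with /bin/tsch
- (\b) matches a word boundary: \bthe matches a word that starts with the
- (\B) is the opposite of \b: it matches only where there is no word boundary
([]) defines a character set: b[aeiu]t matches bat, bet, bit, but
(-) specifies a range inside a set: [a-z] matches any character from a to z
(^) negates a set when it comes first: [^aeiou] matches any character that is not a lowercase vowel
*: zero or more occurrences; +: one or more occurrences; ?: zero or one occurrence
\d matches a digit, \D the opposite; \w matches an alphanumeric character or underscore, \W the opposite; \s matches whitespace, \S the opposite
(()) groups part of a pattern and captures what it matched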
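
The syntax elements listed above can be tried directly with Python's re module. This is a minimal sketch; the sample strings are made up for illustration and are not part of the original notes.

```python
import re

# Alternation: either alternative may match.
alt = re.findall(r'at|home', 'at home')

# Dot: any single character except a newline.
dot = re.findall(r't.o', 'tao tzo to')

# Character set and negated set.
cls = re.findall(r'b[aeiu]t', 'bat bet bit but bot')
neg = re.findall(r'[^aeiou]', 'audio')

# Quantifiers: * is zero or more, + is one or more, ? is zero or one.
star = re.fullmatch(r'ab*', 'a') is not None   # True: zero b's are allowed
plus = re.fullmatch(r'ab+', 'a') is not None   # False: at least one b is required
opt = re.fullmatch(r'ab?', 'ab') is not None   # True: one b is allowed

print(alt, dot, cls, neg, star, plus, opt)
```

Note that 'to' is not matched by t.o, because the dot must consume exactly one character.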

2 The re module

2.1 Common functions

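compile(pattern, flags=0)
Compiles a regular expression and returns a regex object
match(pattern, string, flags=0)
Tries to match pattern at the start of string; returns a match object on success and None otherwise
search(pattern, string, flags=0)
Searches string for the first occurrence of pattern; returns a match object on success and None otherwise
findall(pattern, string[, flags]) and finditer(pattern, string[, flags])
Return all non-overlapping occurrences of pattern in string, as a list and as an iterator of match objects respectively
split(pattern, string, maxsplit=0)
Splits string into a list, using pattern as the delimiter
sub(pattern, repl, string, count=0)
Replaces the parts of string that match pattern with repl
group(num=0)
Match object method: returns the entire match (or the subgroup numbered num)
groups()
Match object method: returns a tuple containing all subgroups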
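
A short sketch of how these functions fit together, assuming Python 3. The sample text and the variable names are chosen here for illustration only.

```python
import re

# Compile once and reuse the regex object for several calls.
word = re.compile(r'\w+')
text = 'foo, bar; baz'

first = word.match(text).group()                  # match at the start of the string
found = word.findall(text)                        # every occurrence, as a list of strings
spans = [m.span() for m in word.finditer(text)]   # finditer yields match objects

# maxsplit and count limit how many splits or replacements are made.
parts = re.split(r'[,;]\s*', text, maxsplit=1)
swapped = re.sub(r'ba', 'BA', text, count=1)

print(first, found, spans, parts, swapped)
```

A compiled pattern mainly pays off when the same expression is used many times; the module-level functions accept the pattern string directly.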

2.2 match



In [2]:

    
import re
m = re.match('foo', 'foo')
if m is not None: m.group()



In [3]:

    
m









    Out[3]:





<_sre.SRE_Match at 0x104699d30>



In [4]:

    
m = re.match('foo', 'bar')
if m is not None: m.group()



In [5]:

    
re.match('foo', 'foo on the  table').group()









    Out[5]:





'foo'



In [9]:

    
# raise attributeError
re.match('bar', 'foo on the  table').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-c92f4467278c> in <module>()
      1 # raise attributeError
----> 2 re.match('bar', 'foo on the  table').group()

AttributeError: 'NoneType' object has no attribute 'group'
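
The AttributeError arises because re.match returns None when there is no match, so .group() is called on None. One way to guard against it is sketched below; first_match is a hypothetical helper introduced for this example, not a function of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when pattern does not match at the start."""
    m = re.match(pattern, text)
    return m.group() if m is not None else None

hit = first_match('foo', 'foo on the table')
miss = first_match('bar', 'foo on the table')
print(hit, miss)
```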

2.3 search

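match only tries to match at the start of the string, but a pattern is far more likely to occur somewhere in the middle of a string than at its very beginning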



In [11]:

    
m = re.match('foo','seafood')
if m is not None: m.group()

search scans the string from the beginning and returns the first occurrence of the pattern, wherever it is



In [15]:

    
re.search('foo', 'seafood').group()









    Out[15]:





'foo'
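
The difference between the two functions can be seen side by side. A minimal sketch, reusing the 'seafood' example; span() reports where the match was found.

```python
import re

m = re.match('foo', 'seafood')            # None: 'seafood' does not start with 'foo'
s = re.search('foo', 'seafood')           # match object: 'foo' occurs further in
pos = s.span()                            # (start, end) indices of the match

# Anchoring the pattern with ^ makes search behave like match here.
anchored = re.search('^foo', 'seafood')

print(m, pos, anchored)
```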

2.4 Matching multiple strings



In [16]:

    
bt = 'bat|bet|bit'



In [17]:

    
re.match(bt,'bat').group()









    Out[17]:





'bat'



In [18]:

    
re.match(bt, 'blt').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-8f907281483b> in <module>()
----> 1 re.match(bt, 'blt').group()

AttributeError: 'NoneType' object has no attribute 'group'



In [19]:

    
re.match(bt, 'He bit me!').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-76f0c210fc22> in <module>()
----> 1 re.match(bt, 'He bit me!').group()

AttributeError: 'NoneType' object has no attribute 'group'



In [21]:

    
re.search(bt, 'He bit me!').group()









    Out[21]:





'bit'

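2.5 Matching any single character (.)

The dot does not match a newline, and it does not match the empty string: it must consume exactly one character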



In [22]:

    
anyend='.end'



In [23]:

    
re.match(anyend, 'bend').group()









    Out[23]:





'bend'



In [24]:

    
re.match(anyend, 'end').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-587848c6c58a> in <module>()
----> 1 re.match(anyend, 'end').group()

AttributeError: 'NoneType' object has no attribute 'group'



In [25]:

    
re.search(anyend, '\nend').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-e37283bbf310> in <module>()
----> 1 re.search(anyend, '\nend').group()

AttributeError: 'NoneType' object has no attribute 'group'
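
If the dot should also match a newline, the re.DOTALL flag changes that behavior. A minimal sketch using the same pattern:

```python
import re

anyend = '.end'

default = re.search(anyend, '\nend')              # None: '.' skips '\n' by default
dotall = re.search(anyend, '\nend', re.DOTALL)    # with DOTALL, '.' also matches '\n'

print(default, repr(dotall.group()))
```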

2.6 Creating character sets ([ ])



In [26]:

    
pattern = '[cr][23][dp][o2]'



In [27]:

    
re.match(pattern, 'c3po').group()









    Out[27]:





'c3po'



In [28]:

    
re.match(pattern, 'c3do').group()









    Out[28]:





'c3do'



In [29]:

    
re.match('r2d2|c3po', 'c2do').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-b0bdc5760f7c> in <module>()
----> 1 re.match('r2d2|c3po', 'c2do').group()

AttributeError: 'NoneType' object has no attribute 'group'



In [30]:

    
re.match('r2d2|c3po', 'r2d2').group()









    Out[30]:





'r2d2'

2.7 Grouping

2.7.1 Matching email addresses



In [31]:

    
patt = r'\w+@(\w+\.)?\w+\.com'
re.match(patt, 'nobady@xxx.com').group()









    Out[31]:





'nobady@xxx.com'



In [32]:

    
re.match(patt, 'nobody@www.xxx.com').group()









    Out[32]:





'nobody@www.xxx.com'



In [33]:

    
# match any number of subdomains
patt = r'\w+@(\w+\.)*\w+\.com'
re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group()









    Out[33]:





'nobody@www.xxx.yyy.zzz.com'
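
The parentheses in these patterns also capture text, which shows what the optional part actually matched. A small sketch with the single-subdomain pattern; an optional group that takes no part in the match is reported as None.

```python
import re

patt = r'\w+@(\w+\.)?\w+\.com'

without_sub = re.match(patt, 'nobody@xxx.com')
with_sub = re.match(patt, 'nobody@www.xxx.com')

print(without_sub.groups())   # the optional group did not participate
print(with_sub.group(1))      # the captured subdomain, including its dot
```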

2.7.2 Accessing groups



In [37]:

    
patt = r'(\w\w\w)-(\d\d\d)'
m = re.match(patt, 'abc-123')



In [38]:

    
m.group()









    Out[38]:





'abc-123'



In [39]:

    
m.group(1)









    Out[39]:





'abc'



In [40]:

    
m.group(2)









    Out[40]:





'123'



In [41]:

    
m.groups()









    Out[41]:





('abc', '123')



In [42]:

    
m = re.match('ab', 'ab')
m.group()









    Out[42]:





'ab'



In [43]:

    
m.groups()









    Out[43]:





()



In [44]:

    
m = re.match('(ab)','ab')
m.groups()









    Out[44]:





('ab',)



In [45]:

    
m.group(1)









    Out[45]:





'ab'



In [46]:

    
m = re.match('(a(b))', 'ab')
m.group()









    Out[46]:





'ab'



In [48]:

    
m.group(1)









    Out[48]:





'ab'



In [49]:

    
m.group(2)









    Out[49]:





'b'



In [50]:

    
m.groups()









    Out[50]:





('ab', 'b')

2.8 Start of string and word boundaries

2.8.1 Start or end of a string



In [51]:

    
re.match('^The', 'The end.').group()









    Out[51]:





'The'



In [52]:

    
# raise attributeError
re.match('^The', 'end. The').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-52-c084852eeeee> in <module>()
      1 # raise attributeError
----> 2 re.match('^The', 'end. The').group()

AttributeError: 'NoneType' object has no attribute 'group'

2.8.2 Word boundaries



In [53]:

    
re.search(r'\bthe', 'bite the dog').group()









    Out[53]:





'the'



In [54]:

    
re.search(r'\bthe', 'bitethe dog').group()









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-54-8a74aa048c3e> in <module>()
----> 1 re.search(r'\bthe', 'bitethe dog').group()

AttributeError: 'NoneType' object has no attribute 'group'



In [55]:

    
re.search(r'\Bthe', 'bitthe dog').group()









    Out[55]:





'the'
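
To see exactly where \b and \B apply, it helps to look at match positions rather than only the matched text. A minimal sketch; the sample sentence is made up for illustration.

```python
import re

text = 'the other bathe'

# \bthe: 'the' must begin at a word boundary.
starts = [m.start() for m in re.finditer(r'\bthe', text)]

# \Bthe: 'the' must NOT begin at a word boundary, so it sits inside a word.
inside = [m.start() for m in re.finditer(r'\Bthe', text)]

# \b on both sides matches 'the' only as a whole word.
whole = re.findall(r'\bthe\b', text)

print(starts, inside, whole)
```

The raw-string prefix r matters here: in a normal Python string literal, '\b' is the backspace character rather than the regex word-boundary escape.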

2.9 findall



In [56]:

    
re.findall('car', 'car')









    Out[56]:





['car']



In [57]:

    
re.findall('car', 'scary')









    Out[57]:





['car']



In [58]:

    
re.findall('car', 'carry, the barcardi to the car')









    Out[58]:





['car', 'car', 'car']
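
findall returns only strings. When positions are needed as well, finditer gives match objects instead. A short sketch; the second pattern is an illustration added here to show that findall returns the group contents when the pattern contains one capturing group.

```python
import re

text = 'carry, the barcardi to the car'

# finditer yields one match object per occurrence.
positions = [m.start() for m in re.finditer('car', text)]

# With a single capturing group, findall returns what the group captured.
initials = re.findall(r'(\w)ar', 'car bar')

print(positions, initials)
```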

2.10 The sub() and subn() functions



In [62]:

    
(re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X, \n'))









    Out[62]:





'attn: Mr. Smith\n\nDear Mr. Smith, \n'



In [63]:

    
print(re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X, \n'))









    



('attn: Mr. Smith\n\nDear Mr. Smith, \n', 2)



In [64]:

    
re.sub('[ae]', 'X', 'abcdedf')









    Out[64]:





'XbcdXdf'
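
Two further options of sub are often useful: count limits the number of replacements, and the replacement string may refer back to captured groups with \1, \2 and so on. A minimal sketch assuming Python 3; the last example is an illustration added here.

```python
import re

text = 'attn: X\n\nDear X, \n'

# Replace only the first occurrence.
once = re.sub('X', 'Mr. Smith', text, count=1)

# subn returns the new string together with the number of replacements made.
result, n = re.subn('[ae]', 'X', 'abcdedf')

# Backreferences in the replacement swap the two captured groups.
swapped = re.sub(r'(\w+)-(\w+)', r'\2-\1', 'abc-123')

print(repr(once), result, n, swapped)
```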

2.11 Splitting with split



In [66]:

    
re.split(':','str1:str2:str3')









    Out[66]:





['str1', 'str2', 'str3']



In [68]:

    
from os import popen
from re import split
f = popen('who', 'r')
for eachLine in f:
    # split on runs of two or more whitespace characters, or on a tab
    print(split(r'\s\s+|\t', eachLine.strip()))
f.close()









    



['gaufung', 'console', 'Jan 15 12:20']
['gaufung', 'ttys000', 'Jan 29 10:05']
['gaufung', 'ttys001', 'Jan 29 10:06']
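
The same split can be tried without running who. In this sketch the sample line is hand-written to resemble the output above and is an assumption; real who output varies between systems.

```python
import re

# Fields are separated by two or more spaces; the date keeps its single spaces.
line = 'gaufung  console  Jan 15 12:20'
fields = re.split(r'\s\s+|\t', line.strip())

# maxsplit limits the number of splits; the remainder stays in the last item.
limited = re.split(':', 'str1:str2:str3', maxsplit=1)

print(fields, limited)
```

Requiring at least two whitespace characters is what keeps 'Jan 15 12:20' together as one field.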

3 Search versus match, and "greedy" matching



In [70]:

    
string = 'Thu Feb 15 17:46:04 2007::gaufung@cumt.edu.cn::1171590364-6-8'
patt = r'.+\d+-\d+-\d+'
re.match(patt, string).group()









    Out[70]:





'Thu Feb 15 17:46:04 2007::gaufung@cumt.edu.cn::1171590364-6-8'



In [72]:

    
patt = r'.+(\d+-\d+-\d+)'
re.match(patt, string).group(1)









    Out[72]:





'4-6-8'

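Quantifiers such as + are greedy by default, so '.+' consumes as many characters as it can while still letting the rest of the pattern match. As a result,

Thu Feb 15 17:46:04 2007::gaufung@cumt.edu.cn::117159036

is matched by '.+', and the group is left with only '4-6-8'. Appending '?' to the quantifier makes it non-greedy: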



In [73]:

    
patt = r'.+?(\d+-\d+-\d+)'
re.match(patt, string).group(1)









    Out[73]:





'1171590364-6-8'
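
Capturing the prefix as well makes the difference between the two quantifiers visible. A minimal sketch; the extra group around the prefix is added here for illustration.

```python
import re

string = 'Thu Feb 15 17:46:04 2007::gaufung@cumt.edu.cn::1171590364-6-8'

# Greedy: .+ takes as much as possible and leaves one digit for the first \d+.
greedy = re.match(r'(.+)(\d+-\d+-\d+)', string)

# Non-greedy: .+? takes as little as possible, so the digits go to the second group.
lazy = re.match(r'(.+?)(\d+-\d+-\d+)', string)

print(greedy.group(2))   # what is left after the greedy prefix
print(lazy.group(2))     # the full trailing number triple
```

When only the trailing triple is wanted, re.search with the pattern r'\d+-\d+-\d+' alone gives the same result without any prefix.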