A Quick Tutorial on Python Regular Expression Basics

By liupengyuan@pku.edu.cn

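The Chinese term for regular expression, 正则表达式, is hard to interpret at face value (I have not traced how that translation came about). The English term, Regular Expression, translates literally as "an expression with regularity". Such an expression is simply a sequence of characters that describes a pattern in text, and it is used to process strings through string pattern matching. Many high-level languages support string processing with regular expressions.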

Python's official regular expression documentation: https://docs.python.org/3/library/re.html



In [1]:

    
import re

First, import re, Python's regular expression module.

1. Getting started



In [2]:

    
s = 'Blow low, follow in of which low. lower, lmoww oow aow bow cow 23742937 dow kdiieur998.'
p = 'low'

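Suppose we want to find the word low in the string s. Since the pattern of that word is simply low, we can use low as the regular expression and name it p.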



In [3]:

    
m = re.findall(p, s)
m









    Out[3]:





['low', 'low', 'low', 'low', 'low']

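findall(pattern, string) is a function in the re module. It extracts every substring of string that matches the regular expression pattern and returns them as a list. The string is scanned from left to right, and the matches are stored in the list in the order they are found.
The regular expression low matches every standalone word low, but it also matches the low inside longer words such as lower and Blow.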



In [4]:

    
p = r'\blow\b'
m = re.findall(p, s)
m









    Out[4]:





['low', 'low']

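\b (boundary) is a special sequence in regular expressions that marks a word boundary. The regular expression r'\blow\b' matches low only as a standalone word, with a word boundary on each side. A boundary is a position, for example between a letter and a space or a punctuation mark; the neighboring character itself is not part of the match.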
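To make the zero-width nature of \b concrete, here is a small sketch (the sample string `text` is made up for illustration) that compares the match spans with and without \b:

```python
import re

text = 'Blow low, follow lower'

# \b consumes no characters: the span covers only the three letters of 'low'.
m = re.search(r'\blow\b', text)
print(m.group(), m.span())   # low (5, 8)

# Without \b, the first hit is the 'low' inside 'Blow'.
m_plain = re.search(r'low', text)
print(m_plain.span())        # (1, 4)
```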



In [5]:

    
p = r'[lmo]ow'
m = re.findall(p, s)
m









    Out[5]:





['low', 'low', 'low', 'low', 'low', 'mow', 'oow']

[lmo] matches any one of the letters l, m, o.



In [6]:

    
p = r'[a-d]ow'
m = re.findall(p, s)
m









    Out[6]:





['aow', 'bow', 'cow', 'dow']

[a-d] matches any one of the letters a, b, c, d.



In [7]:

    
p = r'\d'
m = re.findall(p, s)
m









    Out[7]:





['2', '3', '7', '4', '2', '9', '3', '7', '9', '9', '8']

\d (digit) matches a single digit.



In [8]:

    
p = r'\d+'
m = re.findall(p, s)
m









    Out[8]:





['23742937', '998']

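+ is a metacharacter meaning one or more repetitions of the pattern immediately before it.
So \d+ matches any run of one or more digits, that is, any non-negative integer written in digits (leading zeros included).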

2. Basic patterns and examples

Pattern	Matches	Equivalent to (for ASCII text)
[a-d]	One character of: a, b, c, d	[abcd]
[^a-d]	One character except: a, b, c, d	[^abcd]
abc|def	abc or def
\d	One digit	[0-9]
\D	One non-digit	[^0-9]
\s	One whitespace	[ \t\n\r\f\v]
\S	One non-whitespace	[^ \t\n\r\f\v]
\w	One word character	[a-zA-Z0-9_]
\W	One non-word character	[^a-zA-Z0-9_]
.	Any character (except newline)	[^\n]
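The ASCII equivalents in the table hold exactly only when the re.ASCII flag is set; for Python 3 str patterns, \d and \w are Unicode-aware by default. A minimal sketch with a made-up sample string:

```python
import re

sample = 'id_42 价格 ３'   # the last character is a full-width digit (U+FF13)

# Default: \w and \d also match non-ASCII word characters and digits.
words_unicode = re.findall(r'\w+', sample)
digits_unicode = re.findall(r'\d', sample)
print(words_unicode)    # ['id_42', '价格', '３']
print(digits_unicode)   # ['4', '2', '３']

# With re.ASCII, \w is [a-zA-Z0-9_] and \d is [0-9].
words_ascii = re.findall(r'\w+', sample, flags=re.ASCII)
digits_ascii = re.findall(r'\d', sample, flags=re.ASCII)
print(words_ascii)      # ['id_42']
print(digits_ascii)     # ['4', '2']
```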

Anchor	Matches
^	Start of the string
$	End of the string
\b	Boundary between word and non-word characters
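As a quick illustration of the anchors above, the following sketch (sample strings chosen for illustration) checks where a pattern is allowed to match:

```python
import re

# ^ pins the match to the start of the string.
starts_1 = bool(re.search(r'^low', 'low tide'))   # True
starts_2 = bool(re.search(r'^low', 'Blow low'))   # False: the string starts with 'Blo'

# $ pins the match to the end of the string.
ends = bool(re.search(r'low$', 'Blow low'))       # True

# \b only requires a word boundary on each side, anywhere in the string.
words = re.findall(r'\blow\b', 'Blow low')        # ['low']

print(starts_1, starts_2, ends, words)
```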

Quantifier	Matches
{5}	Match expression exactly 5 times
{2,5}	Match expression 2 to 5 times
{2,}	Match expression 2 or more times
{,5}	Match expression 0 to 5 times
*	Match expression 0 or more times
{,}	Match expression 0 or more times
?	Match expression 0 or 1 times
{0,1}	Match expression 0 or 1 times
+	Match expression 1 or more times
{1,}	Match expression 1 or more times
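The quantifiers in the table can be tried on a small made-up string; \b is added on both sides so that only whole runs are counted:

```python
import re

s = 'a aa aaa aaaa aaaaa'

exactly_two = re.findall(r'\ba{2}\b', s)      # ['aa']
two_to_three = re.findall(r'\ba{2,3}\b', s)   # ['aa', 'aaa']
four_or_more = re.findall(r'\ba{4,}\b', s)    # ['aaaa', 'aaaaa']

# ? makes the preceding 'u' optional (0 or 1 times).
optional_u = re.findall(r'\bcolou?r\b', 'color colour colouur')  # ['color', 'colour']

print(exactly_two, two_to_three, four_or_more, optional_u)
```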

Escape	Matches
\.	. character
\\	\ character
\*	* character
\+	+ character
\?	? character
\{	{ character
\)	) character
\[	[ character
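A short sketch of why escaping matters, using a made-up price string. Besides escaping by hand, the standard library function re.escape() escapes all special characters in a literal string:

```python
import re

price = '3.50'
text = '3.50 3x50'

# Unescaped, '.' matches any character, so '3x50' is matched as well.
loose = re.findall('3.50', text)               # ['3.50', '3x50']

# Escaped by hand.
by_hand = re.findall(r'3\.50', text)           # ['3.50']

# Escaped automatically with re.escape().
by_escape = re.findall(re.escape(price), text) # ['3.50']

print(loose, by_hand, by_escape)
```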



In [9]:

    
m = re.findall(r'\d{3,4}-?\d{8}', '010-66677788,02166697788, 0451-22882828')
m









    Out[9]:





['010-66677788', '02166697788', '0451-22882828']

This matches phone numbers: the area code has 3 or 4 digits, the number has 8 digits, and the - between them is optional.



In [10]:

    
m = re.findall(r'[\u4e00-\u9fa5]', '测试 汉 字，abc，测试xia，可以')
m









    Out[10]:





['测', '试', '汉', '字', '测', '试', '可', '以']

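This matches Chinese characters one at a time; the range \u4e00-\u9fa5 covers the commonly used CJK ideographs, so the full-width comma ， is not matched.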
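Combining the character range with the + quantifier returns runs of Chinese characters instead of single characters. A minimal sketch on the same input string:

```python
import re

text = '测试 汉 字，abc，测试xia，可以'

# One or more consecutive characters in the range \u4e00-\u9fa5.
runs = re.findall(r'[\u4e00-\u9fa5]+', text)
print(runs)   # ['测试', '汉', '字', '测试', '可以']
```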

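A few examples

Regular expression	Matches
[A-Za-z0-9]	One English letter or digit
[\u4E00-\u9FA5A-Za-z0-9_]	One Chinese character, English letter, digit, or underscore
^[a-zA-Z][a-zA-Z0-9_]{4,15}$	A valid account name: 5 to 16 characters long, containing only letters, digits, and underscores, with a letter in the first position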
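The account-name rule can be wrapped in a small validator. This is only a sketch: the names `ACCOUNT` and `is_valid_account` are hypothetical, and it uses fullmatch() instead of ^...$ because $ also matches just before a trailing newline.

```python
import re

# Hypothetical helper names; the pattern is the account rule from the table.
ACCOUNT = re.compile(r'[a-zA-Z][a-zA-Z0-9_]{4,15}')

def is_valid_account(name):
    """Return True if the whole string is a valid account name."""
    return ACCOUNT.fullmatch(name) is not None

print(is_valid_account('tom_2024'))     # True
print(is_valid_account('tom'))          # False: shorter than 5 characters
print(is_valid_account('1tom_2024'))    # False: first character is not a letter
print(is_valid_account('tom_2024\n'))   # False: fullmatch rejects the trailing newline

# With ^...$ and search(), the trailing newline slips through.
loose = re.search(r'^[a-zA-Z][a-zA-Z0-9_]{4,15}$', 'tom_2024\n') is not None
print(loose)                            # True
```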

3. Going further

3.1 Several functions in Python's re module

Function	Description	Usage
re.search	Return a match object for the first match of pattern in string (None if there is no match)	re.search(r'[pat]tern', 'string')
re.finditer	Return an iterator of match objects (one for each match)	re.finditer(r'[pat]tern', 'string')
re.findall	Return a list of all matched strings (the captured groups instead, when the pattern has capture groups)	re.findall(r'[pat]tern', 'string')
re.split	Split string by regex delimiter & return string list	re.split(r'[ -]', 'st-ri ng')
re.compile	Compile a regular expression pattern for later use	re.compile(r'[pat]tern')



In [11]:

    
m = re.search(r'\d{3,4}-?\d{8}', '010-66677788,02166697788, 0451-22882828')
m









    Out[11]:





<_sre.SRE_Match object; span=(0, 12), match='010-66677788'>



In [12]:

    
m.group()









    Out[12]:





'010-66677788'

Use the group() method to retrieve the matched text from the match object.
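Since re.search() returns None when nothing matches, calling group() directly on the result can raise an AttributeError. A minimal sketch of a guarded lookup (the names `PHONE` and `first_phone` are hypothetical):

```python
import re

PHONE = r'\d{3,4}-?\d{8}'

def first_phone(text):
    """Return the first phone number in text, or None if there is none."""
    m = re.search(PHONE, text)
    return m.group() if m else None

print(first_phone('call 010-66677788'))   # 010-66677788
print(first_phone('no number here'))      # None
```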



In [13]:

    
ms = re.finditer(r'\d{3,4}-?\d{8}', '010-66677788,02166697788, 0451-22882828')
for m in ms:
    print(m.group())









    



010-66677788
02166697788
0451-22882828



In [14]:

    
words = re.split(r'[,-]', '010-66677788,02166697788,0451-22882828')
words









    Out[14]:





['010', '66677788', '02166697788', '0451', '22882828']



In [15]:

    
p = re.compile(r'[,-]')
p.split('010-66677788,02166697788,0451-22882828')









    Out[15]:





['010', '66677788', '02166697788', '0451', '22882828']

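compile() compiles the regular expression into a pattern object that can be reused. When the same pattern is applied many times, this can make the program run faster.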
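A compiled pattern object offers the same operations as the module-level functions (search, findall, split, and so on), so one typical use is to compile once and apply the pattern in a loop. A small sketch with made-up input lines:

```python
import re

phone = re.compile(r'\d{3,4}-?\d{8}')   # compiled once
lines = ['010-66677788', 'n/a', 'tel: 0451-22882828']

found = []
for line in lines:
    m = phone.search(line)              # reused for every line
    if m:
        found.append(m.group())

print(found)   # ['010-66677788', '0451-22882828']
```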

3.2 Groups and references

Group Type	Expression
Capturing	( ... )
Non-capturing	(?: ... )
Capturing group named Y	(?P<Y> ... )
Match the Y'th captured group	\Y
Match the named group Y	(?P=Y)

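(...) groups the part of the pattern inside the parentheses so that it is treated as a single unit, called a group. The group as a whole is then used to match the string.
A group can be referenced later in the same regular expression, either by its position or by its name. This is called a backreference.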



In [16]:

    
p = re.compile('(ab)+')
p.search('ababababab').group()









    Out[16]:





'ababababab'



In [17]:

    
p.search('ababababab').groups()









    Out[17]:





('ab',)

When the pattern contains groups, use the groups() method to retrieve all captured groups.
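The result ('ab',) above contains a single 'ab' even though the group matched five times: a capturing group that is repeated keeps only its last repetition. A small sketch with a made-up input that makes this visible:

```python
import re

# The group (\d) is repeated by +, so it is overwritten on each repetition.
m = re.search(r'(\d)+', '2024')
whole = m.group()        # the entire match
last_only = m.groups()   # only the final repetition is kept

# Putting the quantifier inside the group captures the whole run.
m2 = re.search(r'(\d+)', '2024')
all_digits = m2.groups()

print(whole, last_only, all_digits)   # 2024 ('4',) ('2024',)
```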



In [18]:

    
p = re.compile(r'(\d)-(\d)-(\d)')
p.search('1-2-3').group()









    Out[18]:





'1-2-3'



In [19]:

    
p.search('1-2-3').groups()









    Out[19]:





('1', '2', '3')



In [20]:

    
s = '喜欢/v 你/x 的/u 眼睛/n 和/u 深情/n 。/w'
p = re.compile(r'(\S+)/n')
m = p.findall(s)
m









    Out[20]:





['眼睛', '深情']

This captures the nouns (the words tagged /n) in order of appearance.



In [21]:

    
p = re.compile(r'(?P<first>\d)-(\d)-(\d)')
p.search('1-2-3').group()









    Out[21]:





'1-2-3'

Inside a group, the form ?P<name> assigns a name to the group, where name is the name you choose.



In [22]:

    
p.search('1-2-3').group('first')









    Out[22]:





'1'

group('name') retrieves the text matched by that group directly by its name.
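When several groups are named, the match object's groupdict() method returns all of them at once as a dictionary. A minimal sketch, with the group names `area` and `number` chosen for illustration:

```python
import re

p = re.compile(r'(?P<area>\d{3,4})-(?P<number>\d{8})')
m = p.search('tel: 010-66677788')

area = m.group('area')       # access by name
same_area = m.group(1)       # named groups are still numbered
fields = m.groupdict()       # all named groups as a dict

print(area, same_area, fields)
# 010 010 {'area': '010', 'number': '66677788'}
```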



In [23]:

    
s = 'age:13,name:Tom;age:18,name:John'
p = re.compile(r'age:(\d+),name:(\w+)')
m = p.findall(s)
m









    Out[23]:





[('13', 'Tom'), ('18', 'John')]



In [24]:

    
p = re.compile(r'age:(?:\d+),name:(\w+)')
m = p.findall(s)
m









    Out[24]:





['Tom', 'John']

(?:\d+) matches the pattern but does not capture it as a group, so the digits are not included in the result.



In [25]:

    
s = 'abcdebbcde'
p = re.compile(r'([ab])\1')
m = p.search(s)
print('The match is {},the capture group is {}'.format(m.group(), m.groups()))









    



The match is bb,the capture group is ('b',)

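This is a backreference.
Once the group ([ab]) has matched an a or a b, the engine goes on to match \1, which must match the same text that the group just matched. The regular expression therefore matches aa or bb.
Similarly, r'([a-z])\1{3}' matches the same lowercase English letter repeated 4 times in a row.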
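The pattern r'([a-z])\1{3}' can be checked with a short sketch (the input strings are made up): four different letters in a row do not match, while four copies of the same letter do.

```python
import re

p = re.compile(r'([a-z])\1{3}')

no_match = p.search('abcd')        # four different letters: no match
hit = p.search('xxaaaay')          # 'a' repeated four times

print(no_match)                    # None
print(hit.group(), hit.groups())   # aaaa ('a',)
```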



In [26]:

    
s = '12,56,89,123,56,98, 12'
p = re.compile(r'\b(\d+)\b.*\b\1\b')
m = p.search(s)
m.group(1)









    Out[26]:





'12'

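A backreference can be used to check whether the string contains a repeated number and to extract the first number that appears again later.
Here \1 refers to the text matched by group 1.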



In [27]:

    
s = '12,56,89,123,56,98, 12'
p = re.compile(r'\b(?P<name>\d+)\b.*\b(?P=name)\b')
m = p.search(s)
m.group(1)









    Out[27]:





'12'

This is the same as the previous example, but it uses a backreference by group name.

3.3 Greedy and lazy matching

Quantifier	Matches
{2,5}?	Match 2 to 5 times, as few as possible
{2,}?	Match 2 or more times, as few as possible
{,5}?	Match 0 to 5 times, as few as possible
*?	Match 0 or more times, as few as possible
{,}?	Match 0 or more times, as few as possible
??	Match 0 or 1 times, as few as possible
{0,1}?	Match 0 or 1 times, as few as possible
+?	Match 1 or more times, as few as possible
{1,}?	Match 1 or more times, as few as possible

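When a regular expression contains a quantifier that allows repetition, the default (greedy) behavior is to match as many characters as possible, provided the whole expression can still match.
Lazy matching instead matches as few characters as possible. To make a quantifier lazy, append a ? to it.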



In [28]:

    
p = re.compile('(ab)+')
p.search('ababababab').group()









    Out[28]:





'ababababab'



In [29]:

    
p = re.compile('(ab)+?')
p.search('ababababab').group()









    Out[29]:





'ab'
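The difference between greedy and lazy quantifiers matters most when a delimiter occurs several times. A small sketch on a made-up tagged string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'<.*>', html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Lazy: .*? stops at the first '>' that lets the pattern match.
lazy = re.findall(r'<.*?>', html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```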

For further study, see the official documentation and the book Mastering Regular Expressions (3rd edition).