[Python Knowledge Base] Python Regular Expressions


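What is a regular expression

A regular expression is a special sequence of characters that describes a rule for how a string is composed. By writing such rules, we can search and replace within complex strings with little effort.

How to use regular expressions

Python ships with the re module, which implements regular-expression functionality. After importing it, we can perform all kinds of string processing.

Introduction to the re module

Import the module and call dir() on it to list the names it defines: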

import re
print(dir(re))
# Output (the exact list varies between Python versions):
# ['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M',
#  'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE',
#  'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__',
#  '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__',
#  '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand',
#  '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg',
#  'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools',
#  'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub',
#  'subn', 'template']

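As you can see, the module defines quite a few names. Only the commonly used functions are covered here; for the rest, read the module's source or docstrings directly.

Commonly used functions:

Function	Description
re.match(pattern, string, [flags])	Tries to match the pattern at the start of the string; if the match does not begin at the start, match() returns None.
re.search(pattern, string, [flags])	Scans the whole string and returns the first successful match.
re.finditer(pattern, string, [flags])	Finds all substrings matched by the pattern and returns them as an iterator of Match objects.
re.findall(pattern, string, [flags])	Finds all substrings matched by the pattern and returns them as a list; returns an empty list if nothing matches.
re.fullmatch(pattern, string, [flags])	Full match: returns a result only if the entire string satisfies the pattern; otherwise returns None.
re.compile(pattern, [flags])	Compiles the pattern into a regular expression (Pattern) object, whose methods such as match() and search() can then be called.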

import re
text = "helloworldhelloworld"

"""
re.match(pattern, string, flags=0)
Characteristics: 1. Returns a re.Match object;
                 2. Matches only once and returns a single match;
                 3. Succeeds only if the match begins at the start of the string; otherwise returns None;
"""
m1 = re.match(r'h', text)
# Returns a re.Match object for the single match at the start of the string
print(m1)  # <re.Match object; span=(0, 1), match='h'>
m2 = re.match(r'e', text)
# 'e' does not occur at the start of the string, so match() returns None
print(m2)  # None


"""
re.search(pattern, string, flags=0)
Characteristics: 1. Returns a re.Match object;
                 2. Searches only once and returns a single match (the first one, scanning left to right);
                 3. Not restricted by position: the whole string is searched;
"""
m3 = re.search(r'e', text)
# Returns a re.Match object for the first match from the left; the match may be anywhere in the string
print(m3)  # <re.Match object; span=(1, 2), match='e'>


"""
re.finditer(pattern, string, flags=0)
Characteristics: 1. Returns an iterator;
                 2. The iterator yields every match in the string;
"""
m4 = re.finditer(r'l', text)
print(m4)  # <callable_iterator object at 0x000002444B79EBE0> (the address varies)
for i in m4:
    print(i)
# The iterator yields every match in the string:
# <re.Match object; span=(2, 3), match='l'>
# <re.Match object; span=(3, 4), match='l'>
# <re.Match object; span=(8, 9), match='l'>
# <re.Match object; span=(12, 13), match='l'>
# <re.Match object; span=(13, 14), match='l'>
# <re.Match object; span=(18, 19), match='l'>


"""
re.findall(pattern, string, flags=0)
Characteristics: 1. Returns a list;
                 2. The list contains every match in the string;
"""
m5 = re.findall(r'l', text)
# The result is a list containing every match in the string
print(m5)  # ['l', 'l', 'l', 'l', 'l', 'l']


"""
re.fullmatch(pattern, string, flags=0)
Characteristics: 1. The pattern must match the entire string;
                 2. Returns a re.Match object covering the whole string, or None;
"""
m6 = re.fullmatch(r'h.*d', text)
m7 = re.fullmatch(r'e.*d', text)
# A result is returned only when the pattern matches the entire string; otherwise None
print(m6)  # <re.Match object; span=(0, 20), match='helloworldhelloworld'>
print(m7)  # None


"""
re.compile(pattern[, flags])
Characteristics: 1. Compiles the pattern into a separate regular expression (Pattern) object;
                 2. Convenient when the same pattern is used repeatedly;
"""
r = re.compile(r'e.*d')
m = r.search(text)
print(m)  # <re.Match object; span=(1, 20), match='elloworldhelloworld'>
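
A compiled Pattern object is not limited to search(). As a minimal sketch, the snippet below reuses one compiled object across several strings and calls a few of its other methods. The pattern and the sample strings are made up for illustration and do not come from the examples above.

```python
import re

# Hypothetical pattern for illustration: one or more digits
digits = re.compile(r'\d+')

samples = ["order 66", "2024", "no numbers here"]

# The same compiled object is reused for every string
search_hits = [bool(digits.search(s)) for s in samples]
full_hits = [bool(digits.fullmatch(s)) for s in samples]
print(search_hits)  # [True, True, False]
print(full_hits)    # [False, True, False]

# findall() and finditer() are available as Pattern methods too
all_numbers = digits.findall("a1b22c333")
print(all_numbers)  # ['1', '22', '333']

spans = [m.span() for m in digits.finditer("a1b22c333")]
print(spans)  # [(1, 2), (3, 5), (6, 9)]
```

Compiling once mostly helps readability when a pattern is used in several places: the pattern gets a name and its flags are stated in one spot.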

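The re.Match object

When we call match, search, or fullmatch, or iterate over the result of finditer, what we get back is a re.Match object. What we ultimately want is the matched text, so how do we extract it from this object? Like any object, re.Match has attributes and methods, which dir() can list: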

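import re
m = re.match(r'h', 'hello')
print(m)  # <re.Match object; span=(0, 1), match='h'>
print(dir(m))
# With the dunder (magic) methods omitted:
# ['end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
# start() and end() return the start and end positions of the match
print(m.start(), m.end())  # 0 1
# The pos and endpos attributes give the start and end of the range that was searched
print(m.pos, m.endpos)     # 0 5
# span() returns the start and end positions of the match as a tuple (those of group(0) by default)
print(m.span())   # (0, 1)
# string is the string the pattern was matched against
print(m.string)  # hello
# group() returns the matched text
print(m.group())  # h

# A closer look at group():
# 1. group() accepts an argument; the default is 0, meaning the match of the whole pattern.
# 2. Groups are only available if the pattern as a whole matches one contiguous stretch of text;
#    otherwise no Match object is returned, and calling group() on the result raises an error.

# Each pair of parentheses in a pattern is a group, so parts of the match can be retrieved separately
m1 = re.search(r'(2.*8)(h.*s)', '5s2fg8h4aas3g')
# Group 0 is the match of the whole pattern
print(m1.group(0))  # 2fg8h4aas
# Group 1 is what the first parenthesized sub-pattern matched
print(m1.group(1))  # 2fg8
# Group 2 is what the second parenthesized sub-pattern matched
print(m1.group(2))  # h4aas

# Now change the first group so that it ends with g. The 8 between the text the two
# groups could match is left unmatched, so the pattern as a whole no longer matches.
m2 = re.search(r'(2.*g)(h.*s)', '5s2fg8h4aas3g')
print(m2)  # None
# Calling m2.group(0) here would raise:
# AttributeError: 'NoneType' object has no attribute 'group'

# groups() returns the text matched by each group as a tuple
print(m1.groups())  # ('2fg8', 'h4aas')
# groupdict() returns the text matched by each group as a dict; this requires named groups written as (?P<name>pattern)
m2 = re.search(r'(?P<group1>2.*8)(?P<group2>h.*s)', '5s2fg8h4aas3g')
print(m2.groupdict())  # {'group1': '2fg8', 'group2': 'h4aas'}
# The text matched by each group can then be looked up by name
print(m2.groupdict()['group1'])  # 2fg8
print(m2.groupdict()['group2'])  # h4aas
# lastindex is the index of the last matched group
print(m2.lastindex)  # 2
# lastgroup is the name of the last matched group
print(m2.lastgroup)  # group2
# re is the pattern object that produced this match
print(m2.re)  # re.compile('(?P<group1>2.*8)(?P<group2>h.*s)')
# regs is a tuple of the span() of every group, starting with group 0
print(m2.regs)  # ((2, 11), (2, 6), (6, 11))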
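
Because a failed search returns None rather than a Match object, real code usually checks the result before calling group(). The sketch below shows one way to do that, and also shows that a named group can be read by number, by name, or by subscript. The pattern, the helper name parse_pair, and the sample strings are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern for illustration: a key=value pair with a numeric value
pair = re.compile(r'(?P<key>\w+)=(?P<value>\d+)')

def parse_pair(line):
    """Return (key, value) for the first key=value pair, or None if there is none."""
    m = pair.search(line)
    if m is None:  # nothing matched, so there is no Match object to query
        return None
    return m['key'], int(m['value'])

found = parse_pair("retries=3 timeout=30")
missing = parse_pair("no pairs here")
print(found)    # ('retries', 3)
print(missing)  # None

# A named group can be read by number, by name, or by subscript
m = pair.search("retries=3")
same_text = m.group(1) == m.group('key') == m['key']
print(same_text)  # True

# span(), start() and end() also accept a group number or name
value_span = m.span('value')
print(value_span)  # (8, 9)
```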

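Flags

Flags are optional arguments that control how matching is performed, for example ignoring case or letting "." match newlines. They can be seen as a supplement to the pattern itself.

Flag	Effect
re.I	Makes matching case-insensitive
re.S	Makes . match every character, including newlines (so a single match can span line breaks)
re.M	Multi-line mode, affects ^ and $: they also match at the start and end of each line, not only at the start and end of the whole string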
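
The effect of each flag is easiest to see by running the same pattern with and without it. The following is a small sketch on a made-up two-line string; the variable names are arbitrary.

```python
import re

text = "Hello\nworld"

# re.I: ignore case
case_sensitive = re.search(r'hello', text)
case_insensitive = re.search(r'hello', text, re.I)
print(case_sensitive)            # None
print(case_insensitive.group())  # Hello

# re.S: let "." match the newline as well
dot_default = re.search(r'Hello.world', text)
dot_all = re.search(r'Hello.world', text, re.S)
print(dot_default)            # None
print(repr(dot_all.group()))  # 'Hello\nworld'

# re.M: let ^ and $ match at the start and end of every line
single_line = re.findall(r'^\w+$', text)
multi_line = re.findall(r'^\w+$', text, re.M)
print(single_line)  # []
print(multi_line)   # ['Hello', 'world']

# Flags are combined with the | operator
combined = re.findall(r'^hello$|^world$', text, re.I | re.M)
print(combined)  # ['Hello', 'world']
```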

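Regular expression patterns

The source of the re module contains a detailed description of the pattern syntax. Its docstring reads as follows, with a few notes added: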

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Note: a pattern passed to re may use regular-expression syntax or be a plain literal string
     m = re.search(r'\d{2}.*','he52qwq')
     m = re.search(r'52qwq','he52qwq')
     print(m.group())   # 52qwq (with either pattern)

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

Note: a single ordinary character matches exactly one character, namely itself;
     concatenating several ordinary characters gives a pattern that matches that sequence of characters

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters. A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name) Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched, the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list below.  If
the ordinary character is not on the list, then the resulting RE will match the
second character.
[Note: for example, \@ is not on the list, so it simply matches the character '@'. This holds for non-letter characters; in current Python versions an unknown escape made of a backslash and an ASCII letter raises re.error, and \a is the standard escape for the bell character rather than a literal 'a'.]

    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the whole range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the whole range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the range of Unicode alphanumeric characters (letters plus digits plus underscore). With LOCALE, it will match the set [0-9_] plus characters defined as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a Match object for each match.
    compile   Compile a pattern into a Pattern object.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D match the corresponding ASCII character categories (rather than the whole Unicode categories, which is the default). For bytes patterns, this flag is the only available behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline) as well as the string. "$" matches the end of lines (before a newline) as well as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it is the default), and forbidden for bytes patterns.
"""
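
Several entries in the docstring are terse, so here is a short sketch of four of them: greedy versus non-greedy repetition, lookahead and lookbehind, backreferences, and non-capturing groups. All sample strings and variable names are made up for illustration.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy .* runs to the last possible "</b>"; non-greedy .*? stops at the first
greedy = re.findall(r'<b>.*</b>', html)
lazy = re.findall(r'<b>.*?</b>', html)
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

prices = "apple $5, pear 7, plum $12"

# (?<=...) lookbehind: digits preceded by "$"; the "$" is not part of the match
dollar_amounts = re.findall(r'(?<=\$)\d+', prices)
print(dollar_amounts)  # ['5', '12']

# (?<!...) negative lookbehind: digits preceded by neither "$" nor another digit
plain_amounts = re.findall(r'(?<![$\d])\d+', prices)
print(plain_amounts)  # ['7']

# (?=...) lookahead: digits followed by a comma; the comma is not consumed
before_comma = re.findall(r'\d+(?=,)', prices)
print(before_comma)  # ['5', '7']

# \1 and (?P=name) match the same text that an earlier group matched
doubled = re.search(r'(\w)\1', 'helloworld').group()
print(doubled)  # ll
pairs = [m.group() for m in re.finditer(r'(?P<ch>\w)(?P=ch)', 'bookkeeper')]
print(pairs)  # ['oo', 'kk', 'ee']

# With a capturing group, findall() returns the group's text;
# (?:...) groups without capturing, so findall() returns the whole match
capturing = re.findall(r'(ab)+', 'ababxab')
non_capturing = re.findall(r'(?:ab)+', 'ababxab')
print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['abab', 'ab']
```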

Replacing with regular expressions

The re module provides a sub function, which replaces the parts of a string that match a pattern.

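# In the re module, sub is defined as follows (the exact source differs between Python versions):
def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)
'''
The definition shows that sub needs at least 3 arguments: pattern (the regular expression), repl (the replacement), and string (the input string).
It returns a string.
'''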

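import re
m = re.sub(r'\d+', 'A', 'df456ddf23')
print(m)  # dfAddfA

# The repl argument is special: it may also be a function that receives each Match object
# and returns the replacement text for it
def square(match):
    return str(int(match.group()) ** 2)
# Two type conversions happen here: the matched text is a string and must be converted
# to an integer before it can be squared, and the result must be converted back to a string,
# because the replacement for the original text has to be a string, not a number
m = re.sub(r'\d+', square, 'df456ddf23')
print(m)  # df207936ddf529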
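
Beyond the basic call, sub accepts a count argument and lets the replacement text refer to captured groups. The module's function list also includes subn, split and escape. The sketch below tries each of them once; the date strings and variable names are invented for illustration.

```python
import re

text = "2021-03-04 and 2022-11-30"

# count limits how many matches are replaced (the default 0 means all of them)
first_only = re.sub(r'\d+', '#', text, count=1)
print(first_only)  # #-03-04 and 2022-11-30

# \g<name> (or \1, \2, ...) in repl refers to the text captured by a group
swapped = re.sub(r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})',
                 r'\g<d>/\g<m>/\g<y>', text)
print(swapped)  # 04/03/2021 and 30/11/2022

# subn() returns a tuple: (new string, number of substitutions made)
slashed, n = re.subn(r'-', '/', text)
print(slashed, n)  # 2021/03/04 and 2022/11/30 4

# split() cuts the string wherever the pattern matches
parts = re.split(r'[,;]\s*', "a, b;c,  d")
print(parts)  # ['a', 'b', 'c', 'd']

# escape() backslash-escapes special characters so the text is matched literally
literal = re.escape("1+1=2")
hit = re.search(literal, "so 1+1=2 holds").group()
print(hit)  # 1+1=2
```

Passing count by keyword keeps the call readable and avoids confusing it with flags, which comes right after it in the signature.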