Surprising but correct behavior of a greedy subexpression in a positive lookbehind assertion
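Note:
- The observed behavior is correct, but may be surprising at first; it was to me, and I suspect it may be to others as well, though probably not to those intimately familiar with regex engines.
- The repeatedly suggested duplicate, Regex lookahead, lookbehind and atomic groups, contains general information about lookaround assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.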
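A greedy, by definition variable-width subexpression inside a positive lookbehind assertion can exhibit surprising behavior.
The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:
This command works as I intuitively expect: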
# OK:
# The subexpression matches greedily from the start up to and
# including the last "_", and, by including the matched string ($&)
# in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c
The following command, which uses a positive lookbehind assertion, (?<=...), is seemingly equivalent, but isn't:
# CORRECT, but SURPRISING:
# Use a positive lookbehind assertion to *seemingly* match
# only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c # !! *multiple* insertions were performed
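Why isn't it equivalent? Why are multiple insertions performed?
Answer
tl;dr:
- Inside a lookbehind assertion, a greedy subexpression in effect behaves non-greedily (in global matching, in addition to behaving greedily), because every prefix string of the input string is considered.
My problem was that I hadn't taken into account that, in a lookbehind assertion, each character position in the input string must be checked for whether the preceding text up to that point matches the subexpression in the lookbehind assertion.
This, combined with the always-global replacement that PowerShell's -replace operator performs (that is, all possible matches are replaced), resulted in multiple insertions:
That is, the greedy, anchored subexpression ^.+_ legitimately matched twice when considering the text to the left of the character position currently being considered:
- First, when a_ was the text to the left.
- Again, when a_b_ was the text to the left.
Therefore, two insertions of | resulted.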
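To make the two matches visible, here is a minimal sketch that lists every position at which the lookbehind succeeds, together with the text to the left of that position. The variable names ($text, $positions) are mine, chosen for illustration only.

```powershell
$text = 'a_b_c'

# Enumerate every (empty) match of the lookbehind, as a global replacement would.
$positions = foreach ($m in [regex]::Matches($text, '(?<=^.+_)')) {
  # $m.Index is where "|" would be inserted; the substring before it is the
  # prefix that the lookbehind subexpression matched in full.
  '{0}: {1}' -f $m.Index, $text.Substring(0, $m.Index)
}
$positions
# 2: a_
# 4: a_b_
```

Each reported prefix ends in "_" and is matched in full by ^.+_, which is why both positions qualify.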
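By contrast, without a lookbehind assertion, the greedy expression ^.+_ by definition matches only once, through the last _, because it is only applied to the entire input string.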
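The contrast can be checked directly by counting matches for both forms of the pattern. This is only a sketch; $plain and $behind are illustrative names of my own.

```powershell
$text = 'a_b_c'

# Without the lookbehind: a single match that consumes everything
# through the last "_".
$plain = [regex]::Matches($text, '^.+_')
$plain.Count      # 1
$plain[0].Value   # a_b_

# With the lookbehind: two empty matches, one for each position whose
# preceding text matches ^.+_ in full.
$behind = [regex]::Matches($text, '(?<=^.+_)')
$behind.Count     # 2
```

Since -replace acts on every match, the first pattern yields one insertion and the second yields two.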
- In short: The behavior _isn't_ described in the links you've posted - not in any meaningful way that would clear up the specific misconception at hand. This answer now does describe it, and it will hopefully clear up the misconception for others too.
- @WiktorStribiżew re your links: the FAQ (undoubtedly a good general resource) merely links to the first linked answer, which is about _non-support for variable-width patterns_ in lookbehind assertions in Python. By contrast, _this_ question is precisely about _variable-width_ patterns, and it is precisely that variable-width nature that gave rise to the misconception exhibited in my question. The only reference in the FAQ to variable-width lookbehinds is [this answer](https://stackoverflow.com/a/20994257/45375), which merely states that .NET does support them in general.
- This is a [known lookaround behavior](https://stackoverflow.com/questions/11197608/fixed-length-regex-required/11197672#11197672). No need to repeat it. This is also part of the [Regex FAQ](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean).