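Regular expression to parse a logical expression
I am trying to use a regular expression to parse a logical expression with parentheses.
For example: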
((weight gt 10) OR (weight lt 100)) AND (length lt 50)
I would like it to be parsed as:
Group 1: (weight gt 10) OR (weight lt 100)
Group 2: AND
Group 3: length lt 50
If the order is changed:
(length lt 50) AND ((weight gt 10) OR (weight lt 100))
I would like it to be parsed as:
Group 1: length lt 50
Group 2: AND
Group 3: (weight gt 10) OR (weight lt 100)
The closest I have come is this expression:
(\((?>[^()]+|(?1))*\))
The problem is that it only partially works:
((weight gt 10) OR (weight lt 100)) AND (length lt 50)
Group 1: ((weight gt 10) OR (weight lt 100))
Group 2: (length lt 50)
(length lt 50) AND ((weight gt 10) OR (weight lt 100))
Group 1: (length lt 50)
Group 2: ((weight gt 10) OR (weight lt 100))
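The logical operator is not captured as a group.
How can I fix this so that the logical operator AND is captured?
Answer
With your shown samples, please try the following regex, written and tested in Python 3.8:
^(?:\((\(weight.*?)\)|\((length[^)]*)\))\s+(AND)\s+(?:\((\(weight.*?\).*?)\)|\((length[^)]*)\))$
Or, as a generic solution:
^(?:\((\(.*?)\)|\((\w+[^)]*)\))\s+(\S+)\s+(?:\((\(\w+.*?\).*?)\)|\((\w+[^)]*)\))$
Here is the complete Python 3 code. The results below come from the sample-specific regex; change the regex to the generic one (shown above) and it works for generic values too.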
import re

# Sample-specific pattern from above, anchored with $ so the lazy groups run to the end.
pattern = r'^(?:\((\(weight.*?)\)|\((length[^)]*)\))\s+(AND)\s+(?:\((\(weight.*?\).*?)\)|\((length[^)]*)\))$'

## Scenario 1.
var = """((weight gt 10) OR (weight lt 100)) AND (length lt 50)"""
li = re.findall(pattern, var)
print(li)
# [('(weight gt 10) OR (weight lt 100)', '', 'AND', '', 'length lt 50')]
## Remove the empty elements from the scenario 1 result.
print([string for string in li[0] if string != ""])
# ['(weight gt 10) OR (weight lt 100)', 'AND', 'length lt 50']

## Scenario 2.
var = """(length lt 50) AND ((weight gt 10) OR (weight lt 100))"""
li = re.findall(pattern, var)
print(li)
# [('', 'length lt 50', 'AND', '(weight gt 10) OR (weight lt 100)', '')]
## Remove the empty elements from the scenario 2 result.
print([string for string in li[0] if string != ""])
# ['length lt 50', 'AND', '(weight gt 10) OR (weight lt 100)']
Explanation: a detailed breakdown of the regex above.
^  ##Start matching at the beginning of the value.
(?:  ##Open the 1st non-capturing group.
\(  ##Match a literal ( here.
(\(weight.*?)\)|\((length[^)]*)\)  ##Capturing group 1 lazily matches from (weight up to the ) that precedes the operator; OR capturing group 2 matches length up to just before the next ).
)  ##Close the 1st non-capturing group.
\s+  ##Match 1 or more whitespace characters.
(AND)  ##Match AND and keep it in capturing group 3.
\s+  ##Match 1 or more whitespace characters.
(?:  ##Open the 2nd non-capturing group.
\(  ##Match a literal ( here.
(\(weight.*?\).*?)\)|\((length[^)]*)\)  ##Capturing group 4 matches from (weight past its first ) and on to just before the final ); OR capturing group 5 matches length up to just before the next ).
)$  ##Close the 2nd non-capturing group at the end of the value.
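As a small illustration of the generic pattern, the sketch below wraps it in a helper that returns the three parts directly. The name split_expression is made up for this example, and the helper assumes the input has exactly the "(left) OPERATOR (right)" shape of the samples.

```python
import re

# Generic pattern from the answer above: groups 1/2 hold the left operand,
# group 3 the operator, groups 4/5 the right operand.
GENERIC = re.compile(
    r'^(?:\((\(.*?)\)|\((\w+[^)]*)\))\s+(\S+)\s+'
    r'(?:\((\(\w+.*?\).*?)\)|\((\w+[^)]*)\))$'
)


def split_expression(text):
    """Return [left, operator, right], or None if the pattern does not match."""
    match = GENERIC.match(text)
    if match is None:
        return None
    # Alternatives that did not take part in the match are None; drop them.
    return [part for part in match.groups() if part is not None]


print(split_expression("((weight gt 10) OR (weight lt 100)) AND (length lt 50)"))
# ['(weight gt 10) OR (weight lt 100)', 'AND', 'length lt 50']
print(split_expression("(length lt 50) AND ((weight gt 10) OR (weight lt 100))"))
# ['length lt 50', 'AND', '(weight gt 10) OR (weight lt 100)']
```

Using match.groups() instead of re.findall means unused groups come back as None rather than empty strings, which makes the filtering step a little more explicit.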
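Answer
You were almost there. The one thing missing is that the logical operators, the AND and OR you want to capture, are not enclosed in parentheses. Your regex requires everything to sit between parentheses.
Also, what you call groups appear to actually be matches; your expression matches twice:
- the first match is ((weight gt 10) OR (weight lt 100))
- the second match is (length lt 50)
Your expression has only two groups and they are identical, because group 1 (g1), the outermost parentheses, is actually the whole expression (g0).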
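To see the "two matches" behaviour without a recursion-capable regex engine (Python's built-in re module has no (?1) recursion), here is a rough stand-in that finds the same balanced top-level spans with a depth counter. It is a different technique from the recursive pattern, the function name top_level_groups is invented for this sketch, and it is only meant to agree with the regex on balanced input.

```python
def top_level_groups(text):
    """Return every balanced, top-level parenthesised span in text."""
    spans = []
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index  # a new top-level span opens here
            depth += 1
        elif char == ")":
            if depth == 0:
                continue  # stray closing parenthesis: ignore it
            depth -= 1
            if depth == 0:
                spans.append(text[start:index + 1])
    return spans


print(top_level_groups("((weight gt 10) OR (weight lt 100)) AND (length lt 50)"))
# ['((weight gt 10) OR (weight lt 100))', '(length lt 50)']
```

The operator AND sits between the two spans and belongs to neither, which is why it never shows up in a group.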
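Since your expression matches any enclosed logic, I simply extended it by appending an optional non-capturing group that contains the capturing groups you are after: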
(?:([^()]+)((?1)))?
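Combined, this becomes
(\((?>[^()]+|(?1))*\))(?:([^()]+)((?1)))?
^------- g1 ---------^   ^-g2---^^-g3-^
The (?1) still refers to group 1, as in the original expression. Below are some matches and their respective groups: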
(weight gt 10)
^--- g1 -----^
(weight gt 10) OR (weight lt 100)
^--- g1 -----^ g2 ^--- g3 ------^
((weight gt 10) OR (weight lt 100)) AND (length lt 50)
^-------------- g1 ---------------^ g2  ^--- g3 -----^
(length lt 50) AND ((weight gt 10) OR (weight lt 100))
^--- g1 -----^ g2  ^------------- g3 ----------------^
(length lt 50) nonsense ((weight gt 10) OR (weight lt 100))
^--- g1 -----^^-- g2 --^^-------------- g3 ---------------^
The character class only excludes parentheses, so it will match any nonsense.
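If the nonsense case matters, one option is to replace [^()]+ with an explicit list of operators. The sketch below tries that idea with Python's built-in re module, which cannot recurse, so the GROUP sub-pattern is an assumption that only handles one level of nested parentheses. The names GROUP, STRICT and split_strict are made up for this example.

```python
import re

# A parenthesised group whose contents are plain text or inner groups,
# one level deep only (the built-in `re` module has no recursion).
GROUP = r"\((?:[^()]|\([^()]*\))*\)"
# Accept only AND / OR between the two groups instead of [^()]+.
STRICT = re.compile(rf"({GROUP})\s+(AND|OR)\s+({GROUP})")


def split_strict(text):
    """Return (g1, g2, g3), or None when the operator is not AND/OR."""
    match = STRICT.fullmatch(text)
    return match.groups() if match else None


print(split_strict("(length lt 50) AND ((weight gt 10) OR (weight lt 100))"))
# ('(length lt 50)', 'AND', '((weight gt 10) OR (weight lt 100))')
print(split_strict("(length lt 50) nonsense ((weight gt 10) OR (weight lt 100))"))
# None
```

As in the annotated matches above, g1 and g3 keep their outer parentheses, while g2 is captured without the surrounding spaces.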
Your expression, broken down:
(            # capturing group 1
  \(         # match a `(` literally
  (?>        # atomic/independent, non-capturing group (meaning no backtracking into the group)
    [^()]    # any character that is not `(` nor `)`
    +        # one or more times
    |        # or
    (?1)     # recurse group 1.
             # ..this is like a copy of the expression of group 1 here.
             # ..which also includes this part.
             # ..so it's sort of self-recursing
  )*         # zero or more times
  \)         # match a `)` literally
)
The addition, broken down:
(?:          # non-capturing group
  (          # capturing group 2
    [^()]    # any character that is not `(` nor `)`
    +        # one or more times
  )
  (          # capturing group 3
    (?1)     # recurse group 1.
  )
)?           # zero or one time
See the expression at regex101. There I changed the character class to [^()\n] to avoid problems with line breaks.