Python: splitting a string on multiple delimiters in order of priority

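I want to split a string exactly once on one of several delimiters (such as "and", "&" and "-"), trying them in that order of priority. Examples: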

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']

I tried the following code to do this:

import re
re.split('and|&|-', string, maxsplit=1)

It works for every case except the last one. Because it does not follow the hierarchy, it returns this for the last case:

'dsfsd - adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']

How can I achieve this?
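To see why the single alternation ignores the intended order, it may help to look at where the first match lands. The sketch below only inspects the match with re.search; the variable names are illustrative and not part of the original question.

```python
import re

s = 'dsfsd - adfd and adsfa'

# The regex engine scans left to right and stops at the first position
# where ANY alternative matches, so the order inside 'and|&|-' does not
# give 'and' priority over a '-' that appears earlier in the string.
m = re.search('and|&|-', s)
print(repr(m.group()), m.start())   # '-' 6
print(s.find('and'))                # 13, later than the '-' at index 6
```

The order of alternatives only matters when several of them could match at the same position, which is not the case here.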

Answer

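This is impractical with a single regex. You could get it to work with negative lookaround assertions, but it would get more complicated with each additional delimiter. It is simple to do with plain old str.split() and a few lines of code. All you need to do is check whether splitting on the current delimiter gives you two elements. If it does, that's your answer. If not, move on to the next delimiter: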

def split_new(inp, delims):
    for d in delims:
        result = inp.split(d, maxsplit=1)
        if len(result) == 2: return result

    return [inp] # If nothing worked, return the input

To test this:

teststrs = ['121 34 adsfd' , 'dsfsd and adfd', 'dsfsd & adfd' , 'dsfsd - adfd' , 'dsfsd and adfd and adsfa' , 'dsfsd and adfd - adsfa' , 'dsfsd - adfd and adsfa' ]
for t in teststrs:
    print(repr(t), '->', split_new(t, ['and', '&', '-']))

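Output:

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']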

  • Simple, readable and it's easy to add more delimiters.
  • This. So much better than the regex in the accepted answer, which will make you hate yourself if you have to modify it a year from now.
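As a possible variation on the loop-based approach, str.partition can replace the split-and-check-length step, and the function can also report which delimiter was used. This is only a sketch: split_by_priority and its return shape are hypothetical, not something from the answer above.

```python
def split_by_priority(text, delims):
    """Split text once on the first delimiter in delims that occurs in it.

    Hypothetical helper: returns (parts, delimiter_used), where
    delimiter_used is None if no delimiter was found.
    """
    for delim in delims:
        # partition() returns (head, sep, tail); sep is '' when delim is absent
        head, sep, tail = text.partition(delim)
        if sep:
            return [head, tail], delim
    return [text], None


print(split_by_priority('dsfsd - adfd and adsfa', ['and', '&', '-']))
# (['dsfsd - adfd ', ' adsfa'], 'and')
print(split_by_priority('121 34 adsfd', ['and', '&', '-']))
# (['121 34 adsfd'], None)
```

Adding a delimiter is still just a matter of appending it to the list in the desired priority position.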

Answer

Try:

import re

tests = [
    ["121 34 adsfd", ["121 34 adsfd"]],
    ["dsfsd and adfd", ["dsfsd ", " adfd"]],
    ["dsfsd & adfd", ["dsfsd ", " adfd"]],
    ["dsfsd - adfd", ["dsfsd ", " adfd"]],
    ["dsfsd and adfd and adsfa", ["dsfsd ", " adfd and adsfa"]],
    ["dsfsd and adfd - adsfa", ["dsfsd ", " adfd - adsfa"]],
    ["dsfsd - adfd and adsfa", ["dsfsd - adfd ", " adsfa"]],
]

for s, result in tests:
    res = re.split(r"and|&(?!.*and)|-(?!.*and|.*&)", s, maxsplit=1)
    print(res)
    assert res == result

Prints:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']

Explanation:

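The regex and|&(?!.*and)|-(?!.*and|.*&) uses 3 alternatives:

  1. We always match "and", or:
  2. We match "&" only if there is no "and" ahead of it (using the negative lookahead (?!...)), or:
  3. We match "-" only if there is no "and" or "&" ahead of it.

We use this pattern in re.split with maxsplit=1, so the string is split only at the first match.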
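The same lookahead idea can be generated for any ordered list of delimiters, which also shows how the pattern grows with each one. The helper below is a rough sketch under the assumption that the input is a single line (the dot in the lookahead does not cross newlines) and that delimiters do not overlap one another; build_priority_pattern is a hypothetical name.

```python
import re


def build_priority_pattern(delims):
    """Build a regex string in which earlier delimiters take priority.

    Hypothetical helper: each delimiter after the first gets a negative
    lookahead that rejects the match if any higher-priority delimiter
    appears later in the string.
    """
    alternatives = []
    for i, delim in enumerate(delims):
        alt = re.escape(delim)
        if i > 0:
            higher = "|".join(re.escape(d) for d in delims[:i])
            alt += f"(?!.*(?:{higher}))"
        alternatives.append(alt)
    return "|".join(alternatives)


pattern = build_priority_pattern(["and", "&", "-"])
print(re.split(pattern, "dsfsd - adfd and adsfa", maxsplit=1))
# ['dsfsd - adfd ', ' adsfa']
```

Only lookaheads are needed because the engine takes the leftmost match: a higher-priority delimiter that appears earlier in the string is matched before the lower-priority one is ever tried.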

  • regex used in a loop should be compiled before the loop. The total time will be reduced by about 25%.
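Following the comment about compiling, one way to do it is to build the pattern object once at module level and reuse it. The names SPLIT_RE and split_once are illustrative, and the 25% figure is the commenter's claim, not something measured here; the re module also caches recently used patterns internally, so the actual gain may vary.

```python
import re

# Compiled once, outside any loop (illustrative name)
SPLIT_RE = re.compile(r"and|&(?!.*and)|-(?!.*and|.*&)")


def split_once(s):
    """Split s at the highest-priority delimiter, at most once."""
    return SPLIT_RE.split(s, maxsplit=1)


for s in ["dsfsd & adfd", "dsfsd - adfd and adsfa"]:
    print(split_once(s))
# ['dsfsd ', ' adfd']
# ['dsfsd - adfd ', ' adsfa']
```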
