How to build a simple tokenizer

html5 • October 31, 2022, 9:26 pm • Q&A

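I would like to know how to build a very simple tokenizer. Given a dictionary d (a list, in this case) and a sentence s, I want to return all possible tokens (= words) of the sentence. Here is what I tried: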

l = ["the","snow","ball","snowball","is","cold"]
sentence = "thesnowballisverycold"

def subs(string, ret=['']):
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    ret = ret + list(map(lambda x: x+head, ret))
    return subs(tail, ret)
    
print(list(set(subs(sentence)) & set(l)))

But this returns:

["snow","ball","cold","is","snowball","the"]

I could compare the substrings against each other, but there must be a better way to do this, right? What I want is:

["the","snowball","is","cold"]

Answer

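You can use a regular expression here. Join the dictionary words into a single alternation, sorted by length in descending order, so that a longer word such as "snowball" is tried before its shorter parts "snow" and "ball". Characters that do not begin any dictionary word (here, "very") are simply skipped: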

import re

l = ["the", "snow", "ball", "snowball", "is", "cold"]
# Longest words first; re.escape guards against regex metacharacters in the words.
pattern = "|".join(map(re.escape, sorted(l, key=len, reverse=True)))
sentence = "thesnowballisverycold"
print(re.findall(pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']
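
If you would rather not build a regex, the same longest-match idea can be written as a plain loop. The following is only a minimal sketch, not part of the original answer: the function name `tokenize` is hypothetical, and skipping characters that start no dictionary word is an assumption chosen to mirror what `re.findall` does above.

```python
def tokenize(sentence, words):
    """Greedy longest-match tokenizer (illustrative sketch).

    At each position, take the longest dictionary word that starts there;
    if none does, skip one character and continue.
    """
    vocab = set(words)
    max_len = max(map(len, vocab), default=0)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if candidate in vocab:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts at this position; skip the character.
            i += 1
    return tokens


words = ["the", "snow", "ball", "snowball", "is", "cold"]
print(tokenize("thesnowballisverycold", words))
# => ['the', 'snowball', 'is', 'cold']
```

Looking up candidates in a set keeps each check cheap, which may matter for a large dictionary. Note that both this loop and the regex are greedy: they commit to the longest word at each position and never backtrack to try a different split.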

