Negating a named regex, or interpolating it into a character class, in Raku
I'm trying to parse a quoted string. Something like this:
say '"in quotes"' ~~ / '"' <-[ " ]> * '"'/;
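(from https://docs.raku.org/language/regexes, "Enumerated character classes and ranges") But... I want to allow more than one type of quote. Something like this made-up syntax, which doesn't work: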
token attribute_value { <quote> ($<-quote>) $<quote> };
token quote { <["']> };
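I found this discussion, which takes another approach, but it doesn't seem to have gone anywhere: https://github.com/Raku/problem-solving/issues/97. Is there any way to do this kind of thing? Thanks!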
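Update 1
I wasn't able to get @user0721090601's "multi token" solution to work. My first attempt produced:
$ ./multi-token.raku
No such method 'quoted_string' for invocant of type 'QuotedString'
in block <unit> at ./multi-token.raku line 16
After doing some research, I added proto token quoted_string {*}: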
#!/usr/bin/env raku
use Grammar::Tracer;
grammar QuotedString {
proto token quoted_string {*}
multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']> }
multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]> }
token quote { <["']> }
}
my $string = '"foo"';
my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;
I'm still learning Raku, so I may be doing something wrong.
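Update 2
D'oh! Thanks to @raiph for pointing this out. I forgot to put a quantifier on <-[']> and <-["]>. That's what I get for copy/pasting without thinking! It works fine when you do it right: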
#!/usr/bin/env raku
use Grammar::Tracer;
grammar QuotedString {
proto token quoted_string (|) {*}
multi token quoted_string:sym<'> { <sym> ~ <sym> <-[']>+ }
multi token quoted_string:sym<"> { <sym> ~ <sym> <-["]>+ }
token quote { <["']> }
}
my $string = '"foo"';
my $quoted-string = QuotedString.parse($string, :rule<quoted_string>);
say $quoted-string;
Update 3
Just to put a bow on this...
#!/usr/bin/env raku
grammar NegativeLookahead {
token quoted_string { <quote> $<string>=([<!quote> .]+) $<quote> }
token quote { <["']> }
}
grammar MultiToken {
proto token quoted_string (|) {*}
multi token quoted_string:sym<'> { <sym> ~ <sym> $<string>=(<-[']>+) }
multi token quoted_string:sym<"> { <sym> ~ <sym> $<string>=(<-["]>+) }
}
use Bench;
my $string = "'foo'";
my $bench = Bench.new;
$bench.cmpthese(10000, {
negative-lookahead =>
sub { NegativeLookahead.parse($string, :rule<quoted_string>); },
multi-token =>
sub { MultiToken.parse($string, :rule<quoted_string>); },
});
I'm going with the "multi token" solution. Thanks, everyone!
Answer
There are a few different approaches that you can take — which one is best will probably depend on the rest of the structure you're employing.
But first an observation on your current solution and why opening it up to others won't work this way. Consider the string 'value". Should that parse? The structure you laid out actually would match it! That's because each <quote> token will match either a single or double quote.
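To make the mismatch problem concrete, here is a minimal sketch. The grammar name Loose and its TOP token are made up for illustration; they only mirror the shape of the question's attribute_value, where any quote may open and any quote may close.

```raku
# Hypothetical grammar: <quote> accepts either kind of quote on both sides,
# so nothing ties the closing quote to the opening one.
grammar Loose {
    token TOP   { <quote> <-["']>* <quote> }
    token quote { <["']> }
}

say so Loose.parse(q{"value"});   # matching quotes
say so Loose.parse(q{'value"});   # mismatched quotes
```

Both lines should print True, which is the problem the rest of this answer works around.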
Dealing with the inner
The simplest solution is to make your inner part a non-greedy wildcard:
<quote> (.*?) <quote>
This will stop the match as soon as you reach quote again. Also note the alternative syntax using a tilde that lets the two terminal bits be closer together:
<quote> ~ <quote> (.*?)
Your initial attempt wanted to use a sort of non-match. This does exist in the form of an assertion, <!quote>, which fails if a <quote> is found (and <quote> needn't be just a single character; it can be anything arbitrarily complex). It doesn't consume anything, though, so you need to provide that separately. For instance
[<!quote> .]*
Will check that something is NOT a quote, and then consume the next character.
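As a rough side-by-side of the lazy wildcard and the negative lookahead, the sketch below runs both against the same input. The helper names inner-lazy and inner-lookahead are invented for this example, and quote is declared as a lexical regex so that plain regexes outside a grammar can call it.

```raku
# The character class from the question, as a lexical named regex.
my regex quote { <["']> }

# Lazy wildcard: take as little as possible before the next quote.
sub inner-lazy(Str $s) {
    $s ~~ / <quote> (.*?) <quote> / ?? ~$0 !! Nil
}

# Negative lookahead: consume one character at a time while it is not a quote.
sub inner-lookahead(Str $s) {
    $s ~~ / <quote> ( [ <!quote> . ]* ) <quote> / ?? ~$0 !! Nil
}

my $input = q{say "in quotes" and more};
say inner-lazy($input);
say inner-lookahead($input);
```

Both helpers should capture the same inner text here. The lookahead form is the one that generalizes when <quote> grows into something longer than one character.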
Lastly, you could take either of the two approaches and put it in a <content> token that handles the inside. This is actually a great approach if you intend to later do more complex things (e.g. escape characters).
Avoiding a mismatch
As I noted, your solution would parse mismatched quotes. So we need to have a way to ensure that the quote we are (not) matching is the same as the start one. One way to do this is using a multi token:
proto token attribute_value (|) { * }
multi token attribute_value:sym<'> { <sym> ~ <sym> <-[']>* }
multi token attribute_value:sym<"> { <sym> ~ <sym> <-["]>* }
(Using the actual token <sym> is not required; you could write it as { \' <-[']>* \' } if you wanted.)
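Here is a small, self-contained sketch of the multi token idea. The grammar name Quoted and its TOP token are assumptions added so the example can be run on its own, and it writes the delimiters as a plain sequence rather than with the tilde form.

```raku
grammar Quoted {
    token TOP { <attribute_value> }

    # One candidate per quote character; <sym> is the text inside :sym<...>.
    proto token attribute_value {*}
    multi token attribute_value:sym<'> { <sym> <-[']>* <sym> }
    multi token attribute_value:sym<"> { <sym> <-["]>* <sym> }
}

for q{"foo"}, q{'foo'}, q{'foo"} -> $s {
    say "$s => ", so Quoted.parse($s);
}
```

The first two inputs should parse and the mismatched third one should not, because each candidate only accepts its own quote character as the closer.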
Another way you could do this is by passing a parameter (either literally, or via dynamic variables). For example, you could write the attribute_value as
token attribute_value {
$<start-quote>=<quote> # your actual start quote
:my $*end-quote; # define the variable in the regex scope
{ $*end-quote = ... } # determine the requisite end quote (e.g. ” for “)
<attribute_value_contents> # handle actual content
$*end-quote # fancy end quote
}
token attribute_value_contents {
# We have access to $*end-quote here, so we can use
# either of the techniques we've described before
# (a) using a look ahead
[<!before $*end-quote> .]*
# (b) being lazy (the easier)
.*?
# (c) using another token (described below)
<attr_value_content_char>+
}
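To show how the pieces above might fit together, here is one possible filled-in version using the lookahead variant (a). The %closing table, the grammar name AttrValue, the TOP token, and the contents helper are all assumptions made for this sketch; the original leaves the choice of end quote open.

```raku
# Hypothetical mapping from an opening quote to the closing quote it requires.
my %closing = '"' => '"', "'" => "'", '“' => '”';

grammar AttrValue {
    token TOP { <attribute_value> }

    token attribute_value {
        :my $*end-quote;                                 # visible to tokens called from here
        $<start-quote>=<quote>
        { $*end-quote = %closing{ ~$<start-quote> } }    # pick the matching closer
        <attribute_value_contents>
        $*end-quote
    }

    # Variant (a): consume characters until the required end quote is next.
    token attribute_value_contents { [ <!before $*end-quote> . ]* }

    token quote { <["'“]> }
}

# Returns the inner text, or Nil when the input does not parse.
sub contents(Str $s) {
    my $m = AttrValue.parse($s);
    $m ?? ~$m<attribute_value><attribute_value_contents> !! Nil
}

say contents(q{“curly”});
say contents(q{'it"s'});
say contents(q{"oops'}).defined;
```

Because the end quote lives in a dynamic variable, asymmetric pairs such as “ and ” need no extra tokens, only another entry in the table.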
I mention the last one because you can even further delegate if you ultimately decide to allow for escape characters. For example, you could then do
proto token attr_value_content_char (|) { * }
multi token attr_value_content_char:sym<escaped> { \\ $*end-quote }
multi token attr_value_content_char:sym<literal> { . <?{ $/ ne $*end-quote }> }
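For a runnable feel of the escape idea, here is a sketch under a few assumptions: the grammar name EscapedString, the token name content-char, and the choice of a backslash as the escape character are mine, and the literal case uses a lookahead instead of the code assertion above.

```raku
grammar EscapedString {
    token TOP {
        :my $*end-quote;
        $<open>=<["']>
        { $*end-quote = ~$<open> }      # the closer is the same character as the opener
        <content-char>*
        $*end-quote
    }

    proto token content-char {*}
    # A backslash followed by the end quote counts as content, not as the end.
    multi token content-char:sym<escaped> { \\ $*end-quote }
    # Any other single character, as long as it is not the end quote.
    multi token content-char:sym<literal> { <!before $*end-quote> . }
}

say so EscapedString.parse(q{"a\"b"});   # escaped quote inside the string
say so EscapedString.parse(q{"a"b"});    # unescaped quote ends the string early
```

The first input should parse and the second should not, since the unescaped quote closes the string and leaves trailing text.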
But if that's overkill for what you're doing, ah well 🙂
Anyways, there are probably other ways that didn't jump to my mind that others can think of, but that should hopefully put you on the right path. (also some of this code is untested, so there may be slight errors, apologies for that)
- `<?{ $/ ne $*end-quote }>` is better written as `<!after "$*end-quote">`