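Regex to replace non-numeric characters inside parentheses within a string in a dplyr workflow
My question is somewhat related to an already answered question, "Need to extract individual characters from a string column using R".
I tried to solve this with what I know, and I need to know how to remove the non-numeric characters inside parentheses in a string.
Here is the data frame, with the column x: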
team linescore ondate x
1 NYM 010000000 2020-08-01 0, 1, 0, 0, 0, 0, 0, 0, 0
2 NYM (10)1140006x) 2020-08-02 (, 1, 0, ), 1, 1, 4, 0, 0, 0, 6, x, )
3 BOS 002200010 2020-08-13 0, 0, 2, 2, 0, 0, 0, 1, 0
4 NYM 00000(11)01x 2020-08-15 0, 0, 0, 0, 0, (, 1, 1, ), 0, 1, x
5 BOS 311200 2020-08-20 3, 1, 1, 2, 0, 0
structure(list(team = c("NYM", "NYM", "BOS", "NYM", "BOS"), linescore = c("010000000",
"(10)1140006x)", "002200010", "00000(11)01x", "311200"), ondate = structure(c(18475,
18476, 18487, 18489, 18494), class = "Date"), x = list(c("0",
"1", "0", "0", "0", "0", "0", "0", "0"), c("(", "1", "0", ")",
"1", "1", "4", "0", "0", "0", "6", "x", ")"), c("0", "0", "2",
"2", "0", "0", "0", "1", "0"), c("0", "0", "0", "0", "0", "(",
"1", "1", ")", "0", "1", "x"), c("3", "1", "1", "2", "0", "0"
))), class = "data.frame", row.names = c(NA, -5L))
Desired output:
team linescore ondate x
1 NYM 010000000 2020-08-01 0, 1, 0, 0, 0, 0, 0, 0, 0
2 NYM (10)1140006x) 2020-08-02 10, 1, 1, 4, 0, 0, 0, 6, x, )
3 BOS 002200010 2020-08-13 0, 0, 2, 2, 0, 0, 0, 1, 0
4 NYM 00000(11)01x 2020-08-15 0, 0, 0, 0, 0, 11, 0, 1, x
5 BOS 311200 2020-08-20 3, 1, 1, 2, 0, 0
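How can I change (, 1, 0, ) to 10 and (, 1, 1, ) to 11, leaving the rest as is?
So far I have received some help:
- "Regex for replacing certain characters outside of parentheses only", thanks to AnilGoyal
- gsub("\\D+", "", str1), thanks to akrun
- gsub("[(,) ]", "", "(, 1, 0, )"), thanks to Anoushirvan
Thanks!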
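Answer
We can do this in base R. One option is to insert a delimiter after each character that lies outside a (...) group, using (*SKIP)(*FAIL) to leave the groups untouched. Then we remove the paired parentheses while keeping the digits between them by capturing them as a group. Finally, we return a list by splitting at , with strsplit.
df1$x <- strsplit(gsub("\\((\\d+)\\)", "\\1,",
    gsub("\\([^)]+\\)(*SKIP)(*FAIL)|(.)", "\\1,",
         df1$linescore, perl = TRUE)), ",")
Output: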
df1$x
[[1]]
[1] "0" "1" "0" "0" "0" "0" "0" "0" "0"
[[2]]
[1] "10" "1" "1" "4" "0" "0" "0" "6" "x" ")"
[[3]]
[1] "0" "0" "2" "2" "0" "0" "0" "1" "0"
[[4]]
[1] "0" "0" "0" "0" "0" "11" "0" "1" "x"
[[5]]
[1] "3" "1" "1" "2" "0" "0"
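To make the nested call easier to follow, here is a minimal sketch that runs the same two substitutions one at a time on a single string and keeps the intermediate results. The helper name split_linescore is hypothetical and introduced only for illustration; the patterns are the ones described above.

```r
# split_linescore is a hypothetical helper, not part of any package.
split_linescore <- function(s) {
  # Step 1: skip each "(...)" group, append "," to every other character.
  step1 <- gsub("\\([^)]+\\)(*SKIP)(*FAIL)|(.)", "\\1,", s, perl = TRUE)
  # Step 2: drop the paired parentheses, keep the digits, append ",".
  step2 <- gsub("\\((\\d+)\\)", "\\1,", step1)
  # Step 3: split on ","; strsplit drops the trailing empty field.
  list(step1 = step1, step2 = step2, tokens = strsplit(step2, ",")[[1]])
}

res <- split_linescore("00000(11)01x")
res$step1   # "0,0,0,0,0,(11)0,1,x,"
res$step2   # "0,0,0,0,0,11,0,1,x,"
res$tokens  # "0" "0" "0" "0" "0" "11" "0" "1" "x"
```

The unmatched closing parenthesis in "(10)1140006x)" is not part of any (...) group, so step 1 treats it like any other character and it survives as its own token, which matches the desired output.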