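How can I count the number of matching strings in R when the patterns to match come from a column of another data frame?
I have two very large data frames. The first consists of a single column, body, which is a list of comments; the second consists of a column, names. I want to count how many elements of body contain each element of names. Here is a small reproducible dataset (the original has about 2,000 names, each of which is the name of a car):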
library(tidyverse)  # provides tibble() and str_detect()
df1 <- tibble(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- tibble(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
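As mentioned above, I am trying to count how many times each value in names appears in body, and ideally add the result as a column to df2. This is what I have tried: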
counter = c()
for (i in df2$names) {
counter[i] = sum(ifelse(str_detect(df1$body, i),1, 0))
}
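Although this approach works, it takes an extremely long time and returns a named vector whose names are the values of names. I then unstack it and join it to df2 using names as the key. This is the only approach that has worked, apart from an attempt with str_count which, at my current level of R proficiency, produced terrible code and got me nowhere.
Is there a more efficient way to find the matching strings? I tried to find a similar question on Stack Overflow, but to no avail!
Thanks in advance :)
Answer
Something like this?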
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
library(tidyverse)
df2 %>%
mutate(des_count = map_int(tolower(names), ~ sum(str_detect(tolower(df1$body), .x))))
#> names des_count
#> 1 FORD 0
#> 2 TESLA 3
#> 3 MERCEDES 1
#> 4 FERRARI 1
#> 5 JAGUAR 0
#> 6 HYUNDAI 0
Created on 2021-05-13 by the reprex package (v2.0.0)
Or, if you prefer base R:
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
df2$desired_count <- sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = TRUE)))
df2
#> names desired_count
#> 1 FORD 0
#> 2 TESLA 3
#> 3 MERCEDES 1
#> 4 FERRARI 1
#> 5 JAGUAR 0
#> 6 HYUNDAI 0
Created on 2021-05-13 by the reprex package (v2.0.0)
Answer
You can use rowwise and grepl, which I think is faster than str_detect:
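As a hedged variation on the base R answer above: sapply() returns a vector named after the patterns, which is what the question describes having to unstack afterwards. The sketch below swaps in vapply() with USE.NAMES = FALSE so the counts come back as a plain integer vector that can be assigned straight into df2. The switch to vapply() is my assumption about what is convenient here, not part of the original answer.

```r
# Same toy data as in the question
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                           "ferrari needs to make an electric car",
                           "How much does a tesla cost?",
                           "When is the new Mercedes releasing?",
                           "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

# vapply() checks that every result is a single integer;
# USE.NAMES = FALSE returns an unnamed vector, so no unstacking or join is needed
df2$desired_count <- vapply(
  df2$names,
  function(x) sum(grepl(x, df1$body, ignore.case = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)
df2
```

The result should match the table above; the only difference is that the new column carries no names attribute.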
df1 <- df1 %>%
mutate(body = tolower(body))
df2 %>%
mutate(names = tolower(names)) %>%
rowwise() %>%
mutate(counter = sum(grepl(names,tolower(df1$body),fixed = TRUE )))
# A tibble: 6 x 2
# Rowwise:
names counter
<chr> <int>
1 ford 0
2 tesla 3
3 mercedes 1
4 ferrari 1
5 jaguar 0
6 hyundai 0
Since the question is about speed, here is a benchmark:
df1 <- df1 %>%
mutate(body = tolower(body))
df2 <- df2 %>%
mutate(names = tolower(names))
anilgoyal = function(){
df2 %>%
mutate(des_count = map_int(names, ~ sum(str_detect(df1$body, .x))))
}
anigoyal2 = function(){
sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))
}
denis = function(){
df2 %>%
rowwise() %>%
mutate(counter = sum(grepl(names,df1$body ,fixed = T)))
}
Anoushiravan = function(){
df1 %>%
rowwise() %>%
mutate(match = df2$names[which(str_detect(body, fixed(df2$names,
ignore_case = TRUE)))]) -> df3
df2 %>%
mutate(cnt = map_chr(names, ~ sum(str_detect(df3$match, .x))))
}
chris = function(){
df2 %>%
rowwise() %>%
mutate(count = sum(grepl(paste0("(?i)", names), df1$body)))
}
Results
library(microbenchmark)
microbenchmark(denis(),anilgoyal(),anigoyal2(),Anoushiravan(),chris(),times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
denis() 5960.6 7059.85 10644.711 8692.50 11533.90 49709.7 100 c
anilgoyal() 3614.2 4385.55 6660.244 4886.60 7195.65 31088.9 100 b
anigoyal2() 153.4 203.00 315.966 239.35 285.45 2010.8 100 a
Anoushiravan() 10083.4 12522.40 19994.135 15355.85 20469.60 100866.2 100 d
chris() 5971.7 7060.55 11353.754 8356.35 10727.10 98319.3 100 c
Base R is much more efficient! Well done @AnilGoyal
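- Adding `fixed = TRUE` to `grepl` can speed up matching further, because each name is then treated as a literal string rather than a regular expression. Note that `fixed = TRUE` ignores `ignore.case` (R warns about this), so lowercase both the names and the text first.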
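A minimal sketch of the fixed-string variant, assuming the names contain no characters that are meant to act as regex syntax. The helper variables body_lower and names_lower are introduced here for illustration only, and this variant is not among the functions timed in the benchmark above.

```r
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                           "ferrari needs to make an electric car",
                           "How much does a tesla cost?",
                           "When is the new Mercedes releasing?",
                           "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

# fixed = TRUE matches literal strings and ignores ignore.case,
# so lowercase both sides once, outside the loop
body_lower  <- tolower(df1$body)
names_lower <- tolower(df2$names)

# One pass over the comments per name; each result is a single integer count
df2$count <- vapply(
  names_lower,
  function(x) sum(grepl(x, body_lower, fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)
df2
```

Lowercasing once up front also avoids repeating the case conversion for each of the roughly 2,000 names.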