Dangers of mixing [tidyverse] and [data.table] syntax in R?

html5 • September 19, 2022, 2:53 pm • Q&A

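I'm getting some very weird behavior from mixing tidyverse and data.table syntax. For context, I often find myself writing tidyverse syntax and then piping back into data.table, depending on whether I need speed or code readability. I know Hadley is working on a new package that offers tidyverse syntax with data.table speed, but from what I can see it is still in its early stages, so I haven't been using it.

Anyone care to explain what's going on here? This is very scary for me, as I've probably done this thousands of times without thinking.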

library(dplyr); library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

# now, what happens if I use a different tidyverse function (arrange) 
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC

# but these work (DT is sorted by iso3c at this point, so ALB comes first):
DT[(iso3c %in% codes), ]
# iso3c country income
# 1:   ALB Albania   UMIC
# 2:   ZMB  Zambia   LMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1:   ALB Albania   UMIC
# 2:   ZMB  Zambia   LMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC
# 2:   ZMB  Zambia   LMIC

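Answer

I've run into the same problem a few times, which led me to avoid mixing dplyr with data.table syntax, because I never took the time to find out the reason. So thanks for providing an MRE.

It looks like dplyr::arrange is interfering with data.table auto-indexing: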

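An index will be used when subsetting a dataset with == or %in% on a single variable.

By default, if no index exists for a variable when filtering, one is automatically created and used.

Indices are lost if you change the order of the data.

You can check whether an index is being used with options(datatable.verbose=TRUE).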
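As a complement to the verbose option, the indices attached to a table can be listed directly. The following is a minimal sketch, assuming data.table's default options (auto-indexing and index use enabled); the three-row table and the variable names before, hits and after are made up for illustration.

```r
library(data.table)

# Auto-indexing is on by default; it is set here only to make the assumption explicit.
options(datatable.auto.index = TRUE)

DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

before <- indices(DT)            # indices present before any filtering
hits   <- DT[iso3c %in% codes]   # a single-column %in% filter can trigger auto-indexing
after  <- indices(DT)            # indices present after the filter

print(before)
print(after)
```

Comparing before and after shows whether the filter left an index behind on DT, which is the attribute that later goes stale when the rows are reordered outside data.table.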

If we explicitly turn auto-indexing on:

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = TRUE)

DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table possibly still uses the old index even though the rows were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because wrapping the condition in (...) prevents data.table from using the auto-index
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

To avoid this problem, you can disable auto-indexing:

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = FALSE) # Disabled

DT <- distinct(DT) %>%   as.data.table()

# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange(iso3c) %>% as.data.table()

# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
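If turning auto-indexing off globally is undesirable, another option is to drop the indices from the table right after the dplyr step. This is only a sketch of that idea, not something proposed in the original answer: it assumes the stale index is the sole cause of the wrong result, and it uses setindex(DT, NULL), which removes all indices from a data.table. The small table is made up for illustration.

```r
library(dplyr)
library(data.table)

DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# The first filter may create an auto-index on iso3c.
DT[iso3c %in% codes]

# Reorder with dplyr, then convert back to data.table.
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# Drop any index that survived the round trip so it cannot be reused.
setindex(DT, NULL)

# A fresh index (if any) is now built against the current row order.
res <- DT[iso3c %in% codes]
print(res)
```

The cost is one extra line after every dplyr verb that reorders rows, so it is easy to forget; it mainly suits code that returns to data.table at a small number of well-defined points.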

I reported this issue at data.table/issues/5042 and dtplyr/issues/259: integrated in the 1.4.11 milestone.
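Until a fix is available in the installed versions, one way to sidestep the interaction entirely is to do the deduplication and sorting with data.table's own functions. The sketch below assumes that unique() and setorder() are acceptable stand-ins for distinct() and arrange() in this workflow; setorder() reorders the table by reference and data.table manages its own indices when it does so.

```r
library(data.table)

DT <- data.table(
  iso3c   = rep(c("MOZ", "ZMB", "ALB"), times = 2),
  country = rep(c("Mozambique", "Zambia", "Albania"), times = 2),
  income  = rep(c("LIC", "LMIC", "UMIC"), times = 2)
)
codes <- c("ALB", "ZMB")

DT <- unique(DT)        # data.table method, used here in place of dplyr::distinct()
DT[iso3c %in% codes]    # may create an auto-index on iso3c

setorder(DT, iso3c)     # in-place sort, used here in place of dplyr::arrange()

res <- DT[iso3c %in% codes]
print(res)
```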
