Is there an R function that reads a text file with \n as the (column) delimiter?

html5 • September 13, 2022, 1:34 pm • Q&A

Question

I am trying to come up with a concise and fast way to read a file whose fields are separated by newline (\n) characters into multiple columns.

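Essentially, several lines of the input file should become a single row of the output. Most file-reading functions sensibly interpret a newline as the start of a new row, however, so everything ends up in a data frame with a single column. Here is an example: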

The input file looks like this:

Header Info
2021-01-01
text
...
@
2021-01-02
text
...
@
...

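Here ... stands for potentially many more lines in the input file, and @ marks the end of what should really be one row in the output data frame. When this file is read, it should become a data frame like this (ignoring the header):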

X1	X2	...	Xn
2021-01-01	text	...	...
2021-01-02	text	...	...
...	...	...	...

Answer

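1) Read in the data and locate the @ lines, which gives a logical vector. From that, create a grouping variable g that takes a distinct value for each desired row. Finally, use tapply with paste to rework the data into lines that read.table can parse, and read them. (If the data contains commas, use some other separator character.)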

L <- readLines("input.txt")[-1]   # drop the header line
at <- grepl("@", L)               # TRUE on the @ lines that end a row
g <- cumsum(at)                   # group id: one value per output row
read.table(text = tapply(L[!at], g[!at], paste, collapse = ","),
  sep = ",")

giving this data frame:

          V1   V2
1 2021-01-01 text
2 2021-01-02 text
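As a minimal sketch of the "use some other separator" remark, the same steps can join each record with a tab instead of a comma. The sample lines below (including the comma inside a field) are made up for illustration and stand in for readLines("input.txt")[-1]; the sketch assumes the data itself contains no tabs.

```r
# Hypothetical sample data: note the comma inside the second line.
L <- c("2021-01-01", "hello, world", "@",
       "2021-01-02", "text", "@")

at <- grepl("@", L, fixed = TRUE)   # TRUE on the @ lines that end a row
g <- cumsum(at)                     # group id: one value per output row

# Join each record with a tab, then parse on tabs. Quote and comment
# handling are switched off so the text fields are kept literally.
rows <- as.character(tapply(L[!at], g[!at], paste, collapse = "\t"))
DF <- read.table(text = rows, sep = "\t", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)
DF
```

Any character that cannot occur in the data would do equally well as the separator.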

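2) Another approach is to rework the data into dcf form by removing the @ symbols and prefixing every other line with a column name and a colon, and then to read it with read.dcf. cnames is a character vector of the column names you want to use.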

cnames <- c("Date", "Text")

L <- readLines("input.txt")[-1]
LL <- sub("@", "", paste0(c(paste0(cnames, ": "), ""), L))
DF <- as.data.frame(read.dcf(textConnection(LL)))
DF[] <- lapply(DF, type.convert, as.is = TRUE)
DF

giving this data frame:

        Date Text
1 2021-01-01 text
2 2021-01-02 text
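To see why this works, it may help to look at the intermediate dcf lines. The sketch below uses an in-memory vector (a stand-in for readLines("input.txt")[-1]) and assumes, as the answer does, that every record has exactly length(cnames) lines followed by an @ line.

```r
cnames <- c("Date", "Text")
# Stand-in for readLines("input.txt")[-1]
L <- c("2021-01-01", "text", "@", "2021-01-02", "text", "@")

# The prefixes "Date: ", "Text: ", "" are recycled along L, so every @ line
# gets the empty prefix; sub() then turns it into a blank line.
LL <- sub("@", "", paste0(c(paste0(cnames, ": "), ""), L))
print(LL)

# read.dcf treats blank lines as record separators and returns a
# character matrix with one column per field name.
con <- textConnection(LL)
m <- read.dcf(con)
close(con)
m
```

Because the prefixes are simply recycled, a record with a missing or extra line would shift every later field into the wrong column.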

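3) This approach simply reshapes the data into a matrix and then converts it to a data frame. It relies on every record having the same number of lines. Note that (1) converts numeric columns to numeric, whereas this approach leaves every column as character.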

L <- readLines("input.txt")[-1]
k <- grep("@", L)[1]
as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
##           V1   V2
## 1 2021-01-01 text
## 2 2021-01-02 text
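If converted column types are wanted here as well, one option is to add the same type.convert step that (2) uses. This is only a sketch: the sample lines (with a made-up numeric field) stand in for readLines("input.txt")[-1], and the guard is an addition that checks the fixed record length the matrix reshape depends on.

```r
# Hypothetical sample data with a numeric second field.
L <- c("2021-01-01", "1.5", "@", "2021-01-02", "2.5", "@")

k <- grep("@", L, fixed = TRUE)[1]   # lines per record, including the @ line

# Guard: the input must split into whole records, and every k-th line
# must be the @ terminator.
stopifnot(length(L) %% k == 0,
          all(L[seq(k, length(L), by = k)] == "@"))

DF <- as.data.frame(matrix(L, ncol = k, byrow = TRUE),
                    stringsAsFactors = FALSE)[, -k]
DF[] <- lapply(DF, type.convert, as.is = TRUE)   # same step as in (2)
str(DF)
```

The conversion adds some work on top of the reshape, so it is not reflected in the timing shown for (3) in the benchmark.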

Benchmark

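The question did not mention speed as a consideration, but it came up later in the comments. On the data used in the benchmark below, comparing median times, (1) runs about twice as fast as the code in the question and (3) runs roughly 23 times as fast.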

library(microbenchmark)

writeLines(c("Header Info", 
   rep(c("2021-01-01", "text", "@", "2021-01-02", "text", "@"), 10000)), 
   "input.txt")

microbenchmark(times = 10,
ques = {
  input <- readLines("input.txt")
  input <- paste(input[2:length(input)], collapse = ";") # Skip the header
  input <- gsub(";@;*", replacement = "\n", x = input)
  input <- strsplit(unlist(strsplit(input, "\n")), ";")
  input <- do.call(rbind.data.frame, input)
},
ans1 = {
  L <- readLines("input.txt")[-1]
  at <- grepl("@", L)
  g <- cumsum(at)
  read.table(text = tapply(L[!at], g[!at], paste, collapse = ","), sep = ",")
},
ans3 = {
  L <- readLines("input.txt")[-1]
  k <- grep("@", L)[1]
  as.data.frame(matrix(L, ncol = k, byrow = TRUE))[, -k]
})
## Unit: milliseconds
##  expr     min      lq    mean  median      uq     max neval cld
##  ques 1146.62 1179.65 1188.74 1194.78 1200.11 1219.01    10   c
##  ans1  518.95  522.75  548.33  532.59  561.55  647.14    10  b 
##  ans3   50.47   51.19   51.68   51.69   52.25   52.52    10 a
