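I'm using the data.table package to work with very large data sets, and I value its speed and clarity. But I'm new to it and am struggling with chaining, especially when working with a mix of data.table and base R functions. My question is: how do I chain the example statements below into one seamless string of code that defines the target data object?
Below is the correct output, generated by running each line of code separately (unchained), with the generating code shown immediately beneath the output: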
> data
    ID Period State Values
 1:  1      1    X0      5
 2:  1      2    X1      0
 3:  1      3    X2      0
 4:  1      4    X1      0
 5:  2      1    X0      1
 6:  2      2    HI      0
 7:  2      3    HI      0
 8:  2      4    HI      0
 9:  3      1    X2      0
10:  3      2    X1      0
11:  3      3    X9      0
12:  3      4    X3      0
13:  4      1    X2      1
14:  4      2    X1      2
15:  4      3    X9      3
16:  4      4    HI      0
library(data.table)
data <-
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  )
# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]
# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1]
# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?
# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?
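For what it's worth, below is my attempt to chain everything together using '%>%' pipe operators. It failed (error message: Error in data$State : object of type 'closure' is not subsettable), and in any case I would rather chain using data.table operators: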
data <-
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>%
  setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
  setDT(data)[,Values := Values_1] %>%
  subset(data, select = -c(Values_1,Values_2)) %>%
  data$State <- gsub('XX','HI', data$State)
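Answer 1
If I understand correctly, the OP wants to
- rename column Values_1 to Values (or, in the OP's words: create new column "Values", which equals "Values_1"),
- drop column Values_2,
- replace all occurrences of XX with HI in column State.
Here is what I would do in data.table syntax: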
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
    ID Period Values State
 1:  1      1      5    X0
 2:  1      2      0    X1
 3:  1      3      0    X2
 4:  1      4      0    X1
 5:  2      1      1    X0
 6:  2      2      0    HI
 7:  2      3      0    HI
 8:  2      4      0    HI
 9:  3      1      0    X2
10:  3      2      0    X1
11:  3      3      0    X9
12:  3      4      0    X3
13:  4      1      1    X2
14:  4      2      2    X1
15:  4      3      3    X9
16:  4      4      0    HI
setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and then delete Values_1 later on.
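To make the by-reference point concrete, here is a minimal sketch on a small made-up table (DT and its values are illustrative assumptions, not the OP's data). It uses data.table's address() helper to check that the object is the same one before and after renaming:

```r
library(data.table)

# Hypothetical toy table, only for illustration
DT <- data.table(ID = c(1, 1, 2), Values_1 = c(5, 0, 1))

addr_before <- address(DT)            # memory address of the data.table
setnames(DT, "Values_1", "Values")    # rename in place, no assignment needed
addr_after <- address(DT)

identical(addr_before, addr_after)    # the table itself was not copied
names(DT)
```

Because nothing is assigned back, this call fits naturally after a chain of `[` operations on the same object.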
Also, [State == "XX", State := "HI"] replaces XX with HI by reference only in the affected rows, whereas [, State := gsub('XX','HI', State)] replaces the whole column.
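The two forms can be compared side by side. This is a small sketch on made-up data (DT1 and DT2 are hypothetical names); it only checks that both give the same State column when every match is an exact "XX":

```r
library(data.table)

# Hypothetical toy data, only for illustration
DT1 <- data.table(ID = 1:4, State = c("X0", "XX", "X1", "XX"))
DT2 <- copy(DT1)   # independent copy so the two approaches do not interfere

# Sub-assignment: only the rows where State is exactly "XX" are written
DT1[State == "XX", State := "HI"]

# gsub(): every element is processed and the whole column is reassigned
DT2[, State := gsub("XX", "HI", State)]

identical(DT1$State, DT2$State)
```

One difference worth keeping in mind: gsub() matches substrings, so a value such as "XXL" would become "HIL", while the == comparison only touches exact matches.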
data.table chaining is used where appropriate.
BTW: I wonder why XX cannot be replaced by HI right away in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")
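Answer 2
You can chain using bracket notation, [. That way you only need to call setDT() once: all subsequent operations stay in the data.table universe, so data never stops being a data.table. Also, setDT() modifies in place, so it does not need an assignment (although piping into it and assigning its return value to data, as done here, is fine too).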
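As a quick illustration of the in-place behaviour, here is a minimal sketch on a made-up data frame (df and its columns are assumptions for the example, not the OP's data):

```r
library(data.table)

# Hypothetical toy data frame, only for illustration
df <- data.frame(ID = c(1, 1, 2), Values_1 = c(5, 0, 1))

setDT(df)            # converts in place; no `df <- ...` needed
is.data.table(df)    # TRUE

# Chained `[` calls also update df by reference
df[, Values := Values_1][, Values_1 := NULL]
names(df)
```

Since each `:=` step returns the same data.table, further brackets can be appended to the chain without reassigning df.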
First define the data and make it a data.table:
library(data.table)
data <-
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
  ) |>
  setDT()
Then define the columns you need. Note the functional notation of `:=`, described at https://atrebas.github.io/post/2020-06-17-datatable-introduction/#computation-on-columns.
data[, `:=`(
  State = ifelse(
    rev(cumsum(rev(Values_1 + Values_2))),
    State, "XX"
  )
),
by = ID
][
  ,
  `:=`(
    Values = Values_1,
    Values_1 = NULL,
    Values_2 = NULL,
    State = gsub("XX", "HI", State)
  )
]
Output:
data
# ID Period State Values
# 1: 1 1 X0 5
# 2: 1 2 X1 0
# 3: 1 3 X2 0
# 4: 1 4 X1 0
# 5: 2 1 X0 1
# 6: 2 2 HI 0
# 7: 2 3 HI 0
# 8: 2 4 HI 0
# 9: 3 1 X2 0
# 10: 3 2 X1 0
# 11: 3 3 X9 0
# 12: 3 4 X3 0
# 13: 4 1 X2 1
# 14: 4 2 X1 2
# 15: 4 3 X9 3
# 16: 4 4 HI 0
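You may want to read further about chaining commands in data.table: https://atrebas.github.io/post/2020-06-17-datatable-introduction/#chaining-commands. I think that page is an excellent summary of the package's syntax and features, and it is well worth a read.
Answer 3
You can use the magrittr package to chain data.tables by putting . before [. Try the code below: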
library(dplyr)
library(magrittr)
library(data.table)
data <-
  data.frame(
    ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
    Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
    Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
    Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
    State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
  ) %>%
setDT() %>%  # the piped data frame is already the first argument; passing `data` again would land in keep.rownames
.[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
.[,Values := Values_1] %>%
select(-c(Values_1, Values_2)) %>%
mutate(State = gsub('XX','HI', State))
Output:
    ID Period State Values
 1:  1      1    X0      5
 2:  1      2    X1      0
 3:  1      3    X2      0
 4:  1      4    X1      0
 5:  2      1    X0      1
 6:  2      2    HI      0
 7:  2      3    HI      0
 8:  2      4    HI      0
 9:  3      1    X2      0
10:  3      2    X1      0
11:  3      3    X9      0
12:  3      4    X3      0
13:  4      1    X2      1
14:  4      2    X1      2
15:  4      3    X9      3
16:  4      4    HI      0