**Update**
My dataset contains 314,090 observations in the following format:
UPDATEDID | BRIEF_ID | gamma | LDR_SUM | LDR_Topic | LDR_7Code |
---|---|---|---|---|---|
16 | 04999120040277 | 2.879744e-03 | 0.15326902 | supervises collective followers very closely | 1 |
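Note: BRIEF_ID numbers repeat (3,205 unique IDs) and so do LDR_Topics (15 unique LDR topics, each with a corresponding LDR_7Code), which is why the data is so long.
I want to reshape this data so that each row is a unique ID (3,205 rows) and each of the 15 LDR_Topics is its own column (20 columns in total), with the corresponding LDR_SUM as the values. For example: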
UPDATEDID | BRIEF_ID | supervises collective followers very closely |
---|---|---|
16 | 04999120040277 | 0.15326902 |
So far, I have tried:
BriefingGammas4 <- reshape(data = BriefingGammas3,
                           idvar = c("UPDATEDID", "BRIEF_ID"),
                           timevar = "LDR_Topic",
                           direction = "wide")
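But it aborts the R session and starts a new one.
Any suggestions? Thanks!
**Update**
I tried the following approaches, but none of them produced the correct table.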
install.packages("data.table")
library(data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
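This gives the correct 3,205 rows, but the values for each LDR_Topic are wrong (they should not all be the same, and they should be decimals; the numbers seem to reflect an LDR_7Code that is not in the dataset). See the example below: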
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 2 | 4 | 15 |
2 | 01999120041284 | 2 | 4 | 15 |
3 | 01999300213 | 2 | 4 | 15 |
Then I tried this:
install.packages("tidyverse")
library(tidyverse)
BriefingGammas6 <- BriefingGammas5 |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))
This gives the correct values for each LDR_Topic, but the wrong number of rows: it stays at 314,090 rows instead of 3,205. See the example:
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | NA | NA |
2 | 01999110036250 | NA | 0.010892233 | NA |
3 | 01999110036250 | NA | 0.010892233 | 0.006081761 |
4 | 01999110036250 | 0.02843241 | NA | 0.006081761 |
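Basically, it fills in the values for 3,205 rows for each topic (repeated many times) and then starts filling in the values for the next topic. But I want the 3,205 rows to look like this: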
UPDATEDID | BRIEF_ID | acquired resources | distributed resources | enhanced |
---|---|---|---|---|
1 | 01999110036250 | 0.02843241 | 0.010892233 | 0.006081761 |
2 | 01999120041284 | 0.1594207 | 0.005315201 | 0.004850703 |
3 | 01999300213 | 0.4374699 | 0.01607505 | 0.003971634 |
The last thing I tried was this:
BriefingGammas7 <- reshape(data = BriefingGammas6,
                           idvar = c("UPDATEDID", "BRIEF_ID"),
                           timevar = "LDR_Topic",
                           v.names = "LDR_SUM",
                           direction = "wide")
The result was this:
UPDATEDID | BRIEF_ID | "acquired resources", "distributed"... |
---|---|---|
1 | 01999110036250 | NA |
2 | 01999120041284 | NA |
No other rows came out.
**Solution update**
Step 1. Reduce the number of variables.
Step 2. Remove duplicate observations:
BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select=c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))
Step 3. Use the tidyverse approach from the answer below:
BriefingGammas8 <- BriefingGammas7 |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)
Answer 1
To make the situation clearer, I created a second row of dummy data that follows the pattern of your first row:
dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277",
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902,
0.25326902), LDR_Topic = c("supervises collective followers very closely",
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA,
-2L))
dat
UPDATEDID BRIEF_ID gamma LDR_SUM LDR_Topic LDR_7Code
1 16 04999120040277 879.744 0.153269 supervises collective followers very closely 1
2 17 14999120040277 779.744 0.253269 does something else 2
The base R way
dat |>
  reshape(direction = "wide",
          idvar = "UPDATEDID",
          timevar = "LDR_Topic",
          v.names = "LDR_SUM") |>
  subset(select = -c(gamma, LDR_7Code))
# The result
# UPDATEDID BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1 16 04999120040277 0.153269 NA
#2 17 14999120040277 NA 0.253269
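On the full data it may help to trim the table before reshaping, since `reshape()` treats every column other than `idvar`, `timevar` and `v.names` as constant within an ID and takes the first row when an ID/topic pair repeats. The following is only a sketch on made-up data (`toy`, `slim` and the values are hypothetical, not taken from the question):

```r
# Hypothetical toy data shaped like the question's long table
toy <- data.frame(
  UPDATEDID = c(1, 1, 2, 2),
  BRIEF_ID  = c("A", "A", "B", "B"),
  gamma     = c(0.001, 0.002, 0.003, 0.004),
  LDR_SUM   = c(0.25, 0.5, 0.75, 0.125),
  LDR_Topic = c("t1", "t2", "t1", "t2"),
  LDR_7Code = c(1, 2, 1, 2)
)

# Keep only the identifiers, the topic and the value, then drop exact duplicates
slim <- unique(toy[, c("UPDATEDID", "BRIEF_ID", "LDR_Topic", "LDR_SUM")])

# One row per UPDATEDID/BRIEF_ID pair, one LDR_SUM.<topic> column per topic
wide <- reshape(slim,
                direction = "wide",
                idvar     = c("UPDATEDID", "BRIEF_ID"),
                timevar   = "LDR_Topic",
                v.names   = "LDR_SUM")
```

Trimming first also means there is nothing left to remove with `subset()` afterwards.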
The tidyverse way
library(tidyverse)
dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  select(-c(gamma, LDR_7Code))
#The result
# A tibble: 2 × 4
# UPDATEDID BRIEF_ID `supervises collective followers very closely` `does something else`
# <dbl> <chr> <dbl> <dbl>
#1 16 04999120040277 0.153 NA
#2 17 14999120040277 NA 0.253
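One thing to keep in mind with real data: `pivot_wider()` uses every column not named in `names_from` or `values_from` as a row identifier, so `gamma` and `LDR_7Code` still define rows at the time of the pivot, and dropping them afterwards with `select()` cannot merge those rows. A minimal sketch on made-up data (`toy` is hypothetical) that names the identifiers explicitly through `id_cols`:

```r
library(tidyr)

# Hypothetical toy data: gamma and LDR_7Code differ on every row
toy <- data.frame(
  UPDATEDID = c(1, 1, 2, 2),
  BRIEF_ID  = c("A", "A", "B", "B"),
  gamma     = c(0.001, 0.002, 0.003, 0.004),
  LDR_SUM   = c(0.25, 0.5, 0.75, 0.125),
  LDR_Topic = c("t1", "t2", "t1", "t2"),
  LDR_7Code = c(1, 2, 1, 2)
)

# gamma and LDR_7Code act as identifiers: still 4 rows, padded with NA
long_result <- pivot_wider(toy, names_from = LDR_Topic, values_from = LDR_SUM)

# Only UPDATEDID and BRIEF_ID identify a row: 2 rows, no NA
wide_result <- pivot_wider(toy,
                           id_cols     = c(UPDATEDID, BRIEF_ID),
                           names_from  = LDR_Topic,
                           values_from = LDR_SUM)
```

If the same ID/topic pair occurs more than once, `pivot_wider()` warns and returns list-columns, so duplicates would have to be removed first or summarised with `values_fn`.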
The data.table way (recommended for memory efficiency)
library(data.table)
dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')
# The result
# UPDATEDID BRIEF_ID does something else supervises collective followers very closely
#1: 16 04999120040277 NA 0.153269
#2: 17 14999120040277 0.253269 NA
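If `dcast()` returns identical integers instead of the LDR_SUM decimals, one likely cause is that the same UPDATEDID/BRIEF_ID/LDR_Topic combination appears on several rows: when no `fun.aggregate` is given and a cell has more than one value, `dcast()` falls back to `length` and reports row counts. A small sketch on made-up data (`toy` and its values are hypothetical):

```r
library(data.table)

# Hypothetical toy data: every id/topic pair appears twice with the same LDR_SUM
toy <- data.table(
  UPDATEDID = c(1, 1, 1, 1),
  BRIEF_ID  = c("A", "A", "A", "A"),
  LDR_Topic = c("t1", "t1", "t2", "t2"),
  LDR_SUM   = c(0.25, 0.25, 0.5, 0.5)
)

# No fun.aggregate: dcast() defaults to length, so the cells hold counts (2 and 2)
counts <- dcast(toy, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")

# Dropping exact duplicate rows first leaves one value per cell
wide <- dcast(unique(toy), UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")
```

When the repeated rows are not exact duplicates, passing an explicit `fun.aggregate` (for example `mean`) is another option, at the cost of deciding how conflicting values should be combined.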
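Update
Based on your explanation, the tidyverse way is basically heading in the right direction. The only problem is that there are duplicate rows with NA in some columns, and you want them collapsed into a single row. This is easy to do with the fill() and distinct() functions. The one unclear point in your example is that UPDATEDID changes from 1,2,3,4 to 1 without any explanation. So, for now, I assume we can ignore UPDATEDID (you can create a new column for it later) and only need to consider BRIEF_ID.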
yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250,
1999110036250, 1999110036250), acquired_resources = c(0.02843241,
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233,
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))
yourdf # I changed the spaces to '_' to make the names easier to work with
UPDATEDID BRIEF_ID acquired_resources distributed_resources enhanced
1 1 1.99911e+12 0.02843241 NA NA
2 2 1.99911e+12 NA 0.01089223 NA
3 3 1.99911e+12 NA 0.01089223 0.006081761
4 4 1.99911e+12 0.02843241 NA 0.006081761
yourdf[, -1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()
# The result
BRIEF_ID acquired_resources distributed_resources enhanced
1 1.99911e+12 0.02843241 0.01089223 0.006081761
Then, the complete steps would be:
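dat |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
  # drop UPDATEDID as well, otherwise distinct() cannot collapse the rows
  select(-c(UPDATEDID, gamma, LDR_7Code)) |>
  # fill within each BRIEF_ID so values are never carried over to another brief
  group_by(BRIEF_ID) |>
  # pivot_wider() keeps the spaces in topic names, hence the backticks
  fill(`acquired resources`, `distributed resources`, enhanced,
       .direction = 'downup') |>
  ungroup() |>
  distinct()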
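As an alternative to fill() followed by distinct(), the NA-padded rows can be collapsed by taking the first non-missing value of each topic column per BRIEF_ID. This is only a sketch on made-up data (the two-brief `yourdf` below and the name `collapsed` are hypothetical), and it assumes that all non-missing values within a brief and topic agree:

```r
library(dplyr)

# Hypothetical NA-padded result, as produced by the pivot above
yourdf <- data.frame(
  BRIEF_ID           = c("A", "A", "B", "B"),
  acquired_resources = c(0.25, NA, NA, 0.75),
  enhanced           = c(NA, 0.5, 0.125, NA)
)

# For each BRIEF_ID, keep the first non-missing value of every topic column
collapsed <- yourdf |>
  group_by(BRIEF_ID) |>
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")
```

Because the summary is computed per group, values cannot leak from one BRIEF_ID into another, and a topic that is missing for a whole brief simply stays NA.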