r - How to reshape a long dataset into a short (wide) dataset with multiple variables

**Updated**

My dataset contains 314,090 observations in the following format:

UPDATEDID  BRIEF_ID        gamma         LDR_SUM     LDR_Topic                                     LDR_7Code
16         04999120040277  2.879744e-03  0.15326902  supervises collective followers very closely  1

Note: there are repeated Brief_ID numbers (3205 unique #s) and repeated LDR_Topics (15 unique LDR topics with corresponding LDR_7Codes), which is why the data is so long.

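I want to reshape this data so that each row is one of the unique #s (3205 rows) and each LDR_Topic (15) is its own unique column (20 columns in total), with the corresponding LDR_SUM as the values in that column. For example: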

UPDATEDID  BRIEF_ID        supervises collective followers very closely
16         04999120040277  0.15326902

So far I have tried:

BriefingGammas4<-reshape(data = BriefingGammas3, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic", 
                         direction = "wide")

But the R session aborts and restarts as a new session.

Any suggestions? Thanks!

***** Update

I tried the following approaches, but none of them produced the correct table.

install.packages("data.table")
library (data.table)
BriefingGammas7 <- as.data.table(BriefingGammas6)
BriefingGammas7 <- dcast(BriefingGammas7, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

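This gave the correct 3205 rows, but the values for each LDR_Topic are wrong (they should not all be the same, and they should be decimals. The numbers seem to reflect the LDR_7Code, which is not in the dataset). See the example below: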

UPDATEDID  BRIEF_ID        acquired resources  distributed resources  enhanced
1          01999110036250  2                   4                      15
2          01999120041284  2                   4                      15
3          01999300213     2                   4                      15
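
One possible explanation for the whole-number values, offered as an assumption rather than something confirmed for this dataset: when an id/topic combination occurs more than once and no aggregate function is supplied, dcast() falls back to counting rows with length. The sketch below uses a small made-up table (`toy` and its values are hypothetical) to show that behaviour, and how dropping exact duplicates first keeps the real LDR_SUM values.

```r
library(data.table)

# Hypothetical toy data: each (BRIEF_ID, LDR_Topic) pair appears twice,
# mimicking repeated rows in the long dataset.
toy <- data.table(
  UPDATEDID = c(1, 1, 1, 1),
  BRIEF_ID  = c("A", "A", "A", "A"),
  LDR_Topic = c("acquired resources", "acquired resources",
                "enhanced", "enhanced"),
  LDR_SUM   = c(0.028, 0.028, 0.006, 0.006)
)

# No aggregate function and more than one row per cell:
# dcast() counts the rows instead of returning LDR_SUM.
counts <- dcast(toy, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = "LDR_SUM")

# After removing exact duplicate rows there is one value per cell,
# so the original LDR_SUM values come through.
wide <- dcast(unique(toy), UPDATEDID + BRIEF_ID ~ LDR_Topic,
              value.var = "LDR_SUM")
```

If the duplicates are not exact copies, an explicit `fun.aggregate` (for example `mean`) would be needed, and which function is appropriate depends on what the repeated rows represent.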

Then I tried this:

install.packages("tidyverse")
library (tidyverse)
BriefingGammas6 <- BriefingGammas5 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

This gave the correct values for each LDR_Topic, but the wrong number of rows. It stayed at 314,090 rows instead of 3205. See the example:

UPDATEDID  BRIEF_ID        acquired resources  distributed resources  enhanced
1          01999110036250  0.02843241          NA                     NA
2          01999110036250  NA                  0.010892233            NA
3          01999110036250  NA                  0.010892233            0.006081761
4          01999110036250  0.02843241          NA                     0.006081761

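Basically, it fills in the values for the 3205 rows for one topic (repeated many times) and then starts filling in the values for the next topic. But I want the 3205 rows to look like this: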

UPDATEDID  BRIEF_ID        acquired resources  distributed resources  enhanced
1          01999110036250  0.02843241          0.010892233            0.006081761
2          01999120041284  0.1594207           0.005315201            0.004850703
3          01999300213     0.4374699           0.01607505             0.003971634
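
A likely reason the rows do not collapse is that pivot_wider() treats every column not named in `names_from` or `values_from` as an identifier, so a column such as gamma that differs from row to row keeps each row separate. The sketch below illustrates this on made-up data (`toy` and its values are hypothetical) and shows the `id_cols` argument as one way to restrict the identifiers. It assumes each id/topic pair occurs only once.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: gamma differs on every row, as in the long dataset.
toy <- tibble(
  UPDATEDID = c(1, 1, 2, 2),
  BRIEF_ID  = c("A", "A", "B", "B"),
  gamma     = c(0.1, 0.2, 0.3, 0.4),
  LDR_Topic = c("acquired resources", "enhanced",
                "acquired resources", "enhanced"),
  LDR_SUM   = c(0.028, 0.006, 0.159, 0.005)
)

# gamma is used as an identifier here, so no rows collapse (still 4 rows).
not_collapsed <- toy |>
  pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)

# Only UPDATEDID and BRIEF_ID identify a row, so there is one row per
# briefing (2 rows); gamma is dropped from the result.
collapsed <- toy |>
  pivot_wider(id_cols = c(UPDATEDID, BRIEF_ID),
              names_from = LDR_Topic, values_from = LDR_SUM)
```

Dropping gamma and LDR_7Code with select() before the pivot, instead of after it, would have a similar effect.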

The last thing I tried was this:

BriefingGammas7<-reshape(data = BriefingGammas6, 
                         idvar = c("UPDATEDID", "BRIEF_ID"),
                         timevar = "LDR_Topic",
                         v.names = "LDR_SUM",
                         direction = "wide")

The result was this:

UPDATEDID BRIEF_ID "acquired resources", "distributed"...
1 01999110036250 NA
2 01999120041284 NA

No other rows came out.

Solution update*

Step 1. Reduce the number of variables.
Step 2. Remove duplicate observations.

BriefingGammas7 <- subset(BriefingGammas6, !duplicated(subset(BriefingGammas6, select=c(UPDATEDID, BRIEF_ID, LDR_SUM, LDR_Topic))))

Step 3. Use the tidyverse way from the comments below.

BriefingGammas8 <- BriefingGammas7 |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM)
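
Before pivoting, it can help to confirm that each briefing/topic pair has only one distinct LDR_SUM left, since any remaining conflict would not fit into a single cell. The check below is a minimal sketch on made-up data (`toy` and its values are hypothetical); an empty result would mean the pivot is safe.

```r
library(dplyr)

# Hypothetical toy data: briefing "A" has two different LDR_SUM values
# for the same topic, briefing "B" has one.
toy <- tibble(
  BRIEF_ID  = c("A", "A", "B"),
  LDR_Topic = c("enhanced", "enhanced", "enhanced"),
  LDR_SUM   = c(0.006, 0.007, 0.005)
)

# Each row returned is a (BRIEF_ID, LDR_Topic) pair that still has
# more than one distinct LDR_SUM value.
conflicts <- toy |>
  distinct(BRIEF_ID, LDR_Topic, LDR_SUM) |>
  count(BRIEF_ID, LDR_Topic) |>
  filter(n > 1)
```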

Answer 1

To make the situation clearer, I created a second row of dummy data that follows the pattern of the first row:

dput(dat)
structure(list(UPDATEDID = c(16, 17), BRIEF_ID = c("04999120040277", 
"14999120040277"), gamma = c(879.744, 779.744), LDR_SUM = c(0.15326902, 
0.25326902), LDR_Topic = c("supervises collective followers very closely", 
"does something else"), LDR_7Code = c(1, 2)), class = "data.frame", row.names = c(NA, 
-2L))

dat
  UPDATEDID       BRIEF_ID   gamma  LDR_SUM                                    LDR_Topic LDR_7Code
1        16 04999120040277 879.744 0.153269 supervises collective followers very closely         1
2        17 14999120040277 779.744 0.253269                          does something else         2

The base R way

dat |> 
  reshape(direction = "wide", 
          idvar  = "UPDATEDID",
          timevar ="LDR_Topic",
          v.names = "LDR_SUM")|>
  subset(select = -c(gamma, LDR_7Code))

# The result

#  UPDATEDID       BRIEF_ID LDR_SUM.supervises collective followers very closely LDR_SUM.does something else
#1        16 04999120040277                                             0.153269                          NA
#2        17 14999120040277                                                   NA                    0.253269

A tidyverse way

library(tidyverse)

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code))

#The result

# A tibble: 2 × 4
#  UPDATEDID BRIEF_ID       `supervises collective followers very closely` `does something else`
#      <dbl> <chr>                                                   <dbl>                 <dbl>
#1        16 04999120040277                                          0.153                NA    
#2        17 14999120040277                                         NA                     0.253

A data.table way (recommended for memory efficiency)

library(data.table)

dat.dt <- as.data.table(dat)
dcast(dat.dt, UPDATEDID + BRIEF_ID ~ LDR_Topic, value.var = 'LDR_SUM')

# The result

#   UPDATEDID       BRIEF_ID does something else supervises collective followers very closely
#1:        16 04999120040277                  NA                                     0.153269
#2:        17 14999120040277            0.253269                                           NA

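Update

Based on your explanation, the tidyverse way is basically heading in the right direction. The only problem is that there are repeated rows with NA in some columns, and you want them collapsed into a single row. This is easy to do with the fill() and distinct() functions. The only issue in your example is that UPDATEDID changes from 1, 2, 3, 4 to 1 without any explanation. So, for now, I assume that we can ignore UPDATEDID (you can create a new column for it later) and that we only need to consider BRIEF_ID.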

yourdf <- structure(list(UPDATEDID = 1:4, BRIEF_ID = c(1999110036250, 1999110036250, 
1999110036250, 1999110036250), acquired_resources = c(0.02843241, 
NA, NA, 0.02843241), distributed_resources = c(NA, 0.010892233, 
0.010892233, NA), enhanced = c(NA, NA, 0.006081761, 0.006081761
)), class = "data.frame", row.names = c(NA, -4L))

yourdf   # I changed the spaces to '_' to make the columns easier to handle

  UPDATEDID    BRIEF_ID acquired_resources distributed_resources    enhanced
1         1 1.99911e+12         0.02843241                    NA          NA
2         2 1.99911e+12                 NA            0.01089223          NA
3         3 1.99911e+12                 NA            0.01089223 0.006081761
4         4 1.99911e+12         0.02843241                    NA 0.006081761

yourdf[,-1] |>
  fill(acquired_resources, distributed_resources, enhanced,
       .direction = 'downup') |>
  distinct()

# The result
     BRIEF_ID acquired_resources distributed_resources    enhanced
1 1.99911e+12         0.02843241            0.01089223 0.006081761

Then the complete steps would be:

dat |> 
 pivot_wider(names_from = LDR_Topic, values_from = LDR_SUM) |>
 select(-c(gamma, LDR_7Code)) |>
 fill(acquired_resources,distributed_resources,enhanced, 
     .direction = 'downup') |> 
     distinct()
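
One caveat when there is more than one BRIEF_ID: fill() on an ungrouped data frame carries values across briefing boundaries, so one briefing's value can leak into the next. A possible refinement, sketched here on made-up data (`toy` and its values are hypothetical), is to group by BRIEF_ID before filling.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the partially filled shape shown above,
# extended with a second briefing.
toy <- tibble(
  BRIEF_ID           = c("A", "A", "B", "B"),
  acquired_resources = c(0.028, NA, NA, 0.159),
  enhanced           = c(NA, 0.006, 0.005, NA)
)

# Fill only within each briefing, then drop the now-identical rows.
collapsed <- toy |>
  group_by(BRIEF_ID) |>
  fill(acquired_resources, enhanced, .direction = "downup") |>
  ungroup() |>
  distinct()
```

Without the group_by() step, the first row of briefing "B" would pick up the `acquired_resources` value from briefing "A", and distinct() would no longer reduce "B" to a single row.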
