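I am trying to find identical strings across as many columns, and combinations of columns, as possible. For example, I have data like this: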
df<-structure(list(first = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1", NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), second = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "SNTLK", "STTTFSG", "STOIU", "STOMQ", "STR25",
"SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1", NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), second2 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, "SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-37L))
It has 3 columns, and I want to find the strings shared between columns 1 and 2, then 2 and 3, then 1, 2 and 3 together. So the answer would look like this:
C1-C2    C2-C3    C1-C3    C1-C2-C3
STOMQ    STOMQ    STOMQ    STOMQ
TMEIL1   YIPF1    YIPF1    YIPF1
YIPF1
This means that C1 (column 1) and C2 (column 2) share the following unique identical strings:
STOMQ
TMEIL1
YIPF1
The same applies to the other column combinations.
Answer 1
a <- combn(unname(df),2, do.call, what=intersect, simplify=FALSE)
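The list a above contains the intersections of columns 1 and 2, 1 and 3, and 2 and 3, in that order. To add the intersection of all three columns, intersect the first two results: a string shared by columns 1 and 2 and also by columns 1 and 3 is shared by all three. Note that c() returns a new list and leaves a itself unchanged, so assign the result if you want to keep it: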
c(a, list(intersect(a[[1]],a[[2]])))
[[1]]
[1] "STOMQ" "TMEIL1" "YIPF1" NA
[[2]]
[1] "STOMQ" "YIPF1" NA
[[3]]
[1] NA "STOMQ" "YIPF1"
[[4]]
[1] "STOMQ" "YIPF1" NA
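The NA entries in the output above appear because NA occurs in every column, so intersect() treats it as a shared value. If that is not wanted, one option is to drop NA first and label each result with the columns it came from. The following is a minimal base R sketch, not the answer author's code; it uses a shortened copy of the question's data, and the names cols, combos, shared and wide are introduced here for illustration only.

```r
# Shortened copy of the question's data (values taken from the question).
df <- data.frame(
  first   = c("STOMQ", "STR2", "TMEIL1", "YIPF1", NA),
  second  = c("STOMQ", "STR25", "TMEIL1", "YIPF1", NA),
  second2 = c("GFD", "STOMQ", "TRS", "YIPF1", NA),
  stringsAsFactors = FALSE
)

# Drop NA (and duplicates) from every column so NA cannot show up as "shared".
cols <- lapply(df, function(x) unique(x[!is.na(x)]))

# Every combination of column names, from pairs up to all columns at once.
combos <- unlist(
  lapply(2:length(cols), function(m) combn(names(cols), m, simplify = FALSE)),
  recursive = FALSE
)

# Intersect the columns within each combination; Reduce() folds intersect()
# over however many columns the combination contains.
shared <- lapply(combos, function(nm) Reduce(intersect, cols[nm]))
names(shared) <- vapply(combos, paste, character(1), collapse = "-")

# Optional: pad each result with NA so they fit side by side,
# similar to the layout shown in the question.
n <- max(lengths(shared))
padded <- lapply(shared, function(x) c(x, rep(NA_character_, n - length(x))))
wide <- do.call(
  data.frame,
  c(padded, list(check.names = FALSE, stringsAsFactors = FALSE))
)
print(wide)
```

Because the combinations are generated from the column names, the same sketch should extend to more than three columns without changes, although the number of combinations grows quickly.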
Answer 2
You can do this with accumulate() from the purrr package together with intersect() from base R. Something like:
library(purrr)
# first remove NA values so they don't show up in the intersect results
df <- map(df, ~ discard(.x, is.na))
accumulate(df, ~ base::intersect(.x, .y))
# output
List of 3
$first
"SNTM1" "STTTT2" "STOLA" "STOMQ" "STR2" "SUPTY1"
"TBNHSG" "TEYAH" "TMEIL1" "TMEIL2"
"TMEIL3" "TNIL" "TREUK" "TTRK" "TRRFK" "UBA52" "YIPF1"
$second
"STOMQ" "TMEIL1" "YIPF1"
$second2
"STOMQ" "YIPF1"
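$second is the result of intersecting the first and second columns, and corresponds to the C1-C2 column in the example above. $second2 is the result of intersecting that result with second2, and corresponds to C1-C2-C3 above. $first is simply the first column unchanged. accumulate() only produces these running intersections, so the C2-C3 and C1-C3 pairs are not part of its output.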
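The running intersection that accumulate() computes can be reproduced in base R with Reduce(..., accumulate = TRUE), which may help when purrr is not available. This is a minimal sketch using a shortened copy of the question's data with NA already removed; the labels C1, C1-C2 and C1-C2-C3 are added here for readability and are not produced by Reduce() itself.

```r
# Shortened copy of the question's columns, with NA already removed.
cols <- list(
  first   = c("STOMQ", "STR2", "TMEIL1", "YIPF1"),
  second  = c("STOMQ", "STR25", "TMEIL1", "YIPF1"),
  second2 = c("GFD", "STOMQ", "TRS", "YIPF1")
)

# Left fold that keeps every intermediate result:
#   step 1: first
#   step 2: intersect(first, second)
#   step 3: intersect(step 2, second2)
running <- Reduce(intersect, cols, accumulate = TRUE)

# Reduce() does not carry the list names over, so label the steps by hand.
names(running) <- c("C1", "C1-C2", "C1-C2-C3")
str(running)
```

Each step is intersected only with the previous result, so the output has one entry per column.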