r - 导入的数据未正确分离

所以我有一个问题,我的 .CSV 文件中的数据看起来像这样,这就是我需要它的方式,(唯一的区别是我需要 Time 和 Date 在 2 个单独的列中)。 (编辑)

这是我需要我的数据在 R Studio 中看起来像的目标格式,而不是将时间戳列分隔为 Date 和时间列。 (这是从EXCEL复制粘贴的)

timestamp   axis1   axis2   axis3   VM  standing    stepping    sitting
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1
2022-03-17 11:29    0   0   0   0   0   0   1

但是我一开始导入的数据都集中在 1 列中,

data1 <- read_csv("AE1_07 DBG (2022-03-17)1sec.csv")
1
2022-03-17 11:29:00,0,0,0,0,0,0,1
2
2022-03-17 11:29:01,0,0,0,0,0,0,1
3
2022-03-17 11:29:02,0,0,0,0,0,0,1
4
2022-03-17 11:29:03,0,0,0,0,0,0,1
5
2022-03-17 11:29:04,0,0,0,0,0,0,1
6
2022-03-17 11:29:05,0,0,0,0,0,0,1
7
2022-03-17 11:29:06,0,0,0,0,0,0,1
8
2022-03-17 11:29:07,0,0,0,0,0,0,1
9
2022-03-17 11:29:08,0,0,0,0,0,0,1
10
2022-03-17 11:29:09,0,0,0,0,0,0,1
11
2022-03-17 11:29:10,0,0,0,0,0,0,1
12
2022-03-17 11:29:11,124,72,61,155.82,0,0,1
13
2022-03-17 11:29:12,195,121,101,250.73,0,0,1
14
2022-03-17 11:29:13,36,21,0,41.68,0,0,1
15
2022-03-17 11:29:14,72,184,91,217.53,0,0,1
16
2022-03-17 11:29:15,80,145,53,173.88,0,0,1

这是我在记事本中打开文件时的“原始”数据

timestamp,axis1,axis2,axis3,vectormagnitude,inclinestanding,inclinestepping,inclinesittinglying
2022-03-11 17:00:00,139,15,19,141.09,0,0,1
2022-03-11 17:00:01,183,15,10,183.89,0,0,1
2022-03-11 17:00:02,185,0,21,186.19,0,0,1
2022-03-11 17:00:03,184,12,20,185.47,0,0,1
2022-03-11 17:00:04,183,13,19,184.44,0,0,1
2022-03-11 17:00:05,179,12,21,180.63,0,0,1

回答1

如果您的第二个代码块是真正的原始文件,那么我们会遇到一些问题,因为在正确解析之前,交错的单数行将需要外部操作或 R 中性能较低的 readLines(.) 步骤。我将暂时跳过这种可能性......

如果第一个代码块是真正的原始文件,那么我们有两种可能性:

  1. 如果它真的是制表符分隔的(只有空格在 date 和时间之间的 timestamp 字段中),那么我们可以简单地使用 read.table。我将从一个主要包含标签的文件 quux.tsv 开始。

    dat <- read.table("quux.tsv", header = TRUE, sep = "\t")
    dat
    #          timestamp axis1 axis2 axis3 VM standing stepping sitting
    # 1 2022-03-17 11:29     0     0     0  0        0        0       1
    # 2 2022-03-17 11:29     0     0     0  0        0        0       1
    # 3 2022-03-17 11:29     0     0     0  0        0        0       1
    # 4 2022-03-17 11:29     0     0     0  0        0        0       1
    # 5 2022-03-17 11:29     0     0     0  0        0        0       1
    # 6 2022-03-17 11:29     0     0     0  0        0        0       1
    # 7 2022-03-17 11:29     0     0     0  0        0        0       1

    我们可以使用以下方法拆分 timestamp

    dat$date <- sub(" .*", "", dat$timestamp)
    dat$time <- sub(".* ", "", dat$timestamp)
    dat$timestamp <- NULL
    dat
    #   axis1 axis2 axis3 VM standing stepping sitting       date  time
    # 1     0     0     0  0        0        0       1 2022-03-17 11:29
    # 2     0     0     0  0        0        0       1 2022-03-17 11:29
    # 3     0     0     0  0        0        0       1 2022-03-17 11:29
    # 4     0     0     0  0        0        0       1 2022-03-17 11:29
    # 5     0     0     0  0        0        0       1 2022-03-17 11:29
    # 6     0     0     0  0        0        0       1 2022-03-17 11:29
    # 7     0     0     0  0        0        0       1 2022-03-17 11:29
  2. 但是,如果它是(多)空格分隔的,我将使用一个没有制表符的 quux.txt 文件,只有在您的第一个块中的多个空格。

    dat <- fread("quux.txt")
    # Warning in fread("quux.txt") :
    #   Detected 8 column names but the data has 9 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
    dat
    #            V1 timestamp axis1 axis2 axis3    VM standing stepping sitting
    #        <IDat>    <char> <int> <int> <int> <int>    <int>    <int>   <int>
    # 1: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 2: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 3: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 4: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 5: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 6: 2022-03-17     11:29     0     0     0     0        0        0       1
    # 7: 2022-03-17     11:29     0     0     0     0        0        0       1

    那个警告有点烦人,但在这种情况下,它可以被忽略,甚至可以使用 suppressWarnings(fread("quux.txt")) 抑制。从那里,它已经拆分了 timestamp (方便,但表明格式存在“问题”,因此警告),您只需要重命名列。

    另一种方法使用单行预处理以在标题行中拆分单词 timestamp 以便大多数解析器可以正确读取它。

    Sys.which("sed")
    #                               sed 
    # "c:\\rtools40\\usr\\bin\\sed.exe" 
    system("sed -i -e '1 s/timestamp/time stamp/' quux.txt")
    # [1] 0

    (大多数 unix-y 操作系统在路径中有 sed,而 windows 则凭借此优点安装 Rtools ...您的位置可能会有所不同。)

    细分:

    -i -e '1 s/timestamp/time stamp/'
     ^                               edit/change in-place
        ^                            evaluate the next expression
           ^                         this expression only on line 1
             ^ `-------' `--------'  search/replace

    每个文件只需执行一次;再次运行是无害的,但是应该没有变化,因为 timestamp 已经被改变了,再也找不到了。

    从这里,大多数读者会看到它就好了,例如

    read.table("quux.txt", header = TRUE)
    #         time stamp axis1 axis2 axis3 VM standing stepping sitting
    # 1 2022-03-17 11:29     0     0     0  0        0        0       1
    # 2 2022-03-17 11:29     0     0     0  0        0        0       1
    # 3 2022-03-17 11:29     0     0     0  0        0        0       1
    # 4 2022-03-17 11:29     0     0     0  0        0        0       1
    # 5 2022-03-17 11:29     0     0     0  0        0        0       1
    # 6 2022-03-17 11:29     0     0     0  0        0        0       1
    # 7 2022-03-17 11:29     0     0     0  0        0        0       1
    readr::read_table("quux.txt")
    # -- Column specification ----------------------------------------------------------------------------------------------------------
    # cols(
    #   time = col_date(format = ""),
    #   stamp = col_time(format = ""),
    #   axis1 = col_double(),
    #   axis2 = col_double(),
    #   axis3 = col_double(),
    #   VM = col_double(),
    #   standing = col_double(),
    #   stepping = col_double(),
    #   sitting = col_double()
    # )
    # # A tibble: 7 x 9
    #   time       stamp  axis1 axis2 axis3    VM standing stepping sitting
    #   <date>     <time> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>   <dbl>
    # 1 2022-03-17 11:29      0     0     0     0        0        0       1
    # 2 2022-03-17 11:29      0     0     0     0        0        0       1
    # 3 2022-03-17 11:29      0     0     0     0        0        0       1
    # 4 2022-03-17 11:29      0     0     0     0        0        0       1
    # 5 2022-03-17 11:29      0     0     0     0        0        0       1
    # 6 2022-03-17 11:29      0     0     0     0        0        0       1
    # 7 2022-03-17 11:29      0     0     0     0        0        0       1

相似文章

javascript - 伊朗平均成绩 Regex

我需要一个regex来处理伊朗平均成绩,regex应该涵盖所有这些规则数字范围应在0到20之间。数字可以是十进制,但20除外。整数部分和小数部分应包含一或两个数字。如果您使用“。”你应该至少有一个十进...