Storing data in long format is efficient and flexible. For example, suppose you have been recording data with some sort of logger, such as a temperature or water level logger. Each logger may be on a different recording interval, or the time intervals might not line up between loggers. The example below demonstrates how to convert the data into a wide format where each row represents a date and time and each column represents a logger. Wide format is easier to work with for analysis, while long format is much better suited for storage.
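To make the distinction concrete, here is a quick toy example of my own (values made up) showing the same two readings in each format:
# long format: one row per logger per timestamp
long <- data.frame(time = c("00:00", "00:00"), logger = c(1, 2),
                   water_level = c(0.42, 0.57)) # made-up values
# wide format: one row per timestamp, one column per logger
wide <- data.frame(time = "00:00", logger_1 = 0.42, logger_2 = 0.57)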
Suppose you have 3 loggers that record water level where
- Logger 1: on a 5-minute interval that starts at 3 minutes past the hour,
- Logger 2: on a 15-minute interval that starts at the top of the hour, and
- Logger 3: on a 20-minute interval that starts at 5 minutes past the hour.
The code below simulates a data stream for logger 1.
# 3 years of readings every 5 minutes, offset 3 minutes past the hour
dat <- expand.grid(
  year = 2018:2020,
  doy = 1:365,
  hour_fractional = seq(3, 1440, by = 5) / 60, # minutes past midnight as fractional hours
  logger = 1)
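Since expand.grid() builds every combination of its arguments, this one call produces the full record for logger 1. As a quick sanity check:
nrow(dat) # 3 years * 365 days * 288 readings per day = 315360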
Now some data for logger 2 to add to the mix.
dat <- rbind(dat, expand.grid(
  year = 2017:2020, # logger 2 covers an extra year
  doy = 1:365,
  hour_fractional = seq(0, 1425, by = 15) / 60, # stop at 23:45 so no reading spills into hour 24
  logger = 2))
And some for logger 3 to finish up.
dat <- rbind(dat, expand.grid(
  year = 2018:2020,
  doy = 1:365,
  hour_fractional = seq(5, 1440, by = 20) / 60,
  logger = 3))
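With all three loggers stacked, the row counts per logger should match the interval arithmetic:
table(dat$logger)
# logger 1: 3 years * 365 days * 288 readings/day = 315360
# logger 2: 4 years * 365 days * 96 readings/day  = 140160
# logger 3: 3 years * 365 days * 72 readings/day  = 78840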
Here I am going to convert the data.frame to a data.table using the data.table package. A data.table can be manipulated with base R syntax or with the data.table syntax shown below. Relative to other packages for summarizing and wrangling data, data.table checks all my boxes: it is fast, the code is compact, and it does not require external package dependencies.
library(data.table)
dat <- as.data.table(dat)
# simulate water level readings (no seed is set, so values will differ between runs)
dat$water_level_elevation <- runif(nrow(dat)) # base R syntax
dat[, water_level_elevation := runif(.N)] # data.table syntax; overwrites the line above
Now the challenge is getting the time onto a common scale. In this example logger 3 is on a 20-minute interval, while loggers 1 and 2 are on 5- and 15-minute intervals respectively. In short, we need to round the fractional hour so we can aggregate the water level elevations to the 20-minute interval. This would be a ton easier if every logger kept to the top of the hour, and most loggers do that now, but we are up for the challenge here.
So we will summarize the data such that each row represents the 1-20, 21-40, or 41-60 minute interval of an hour. The challenge there is converting the fractional hours to those intervals. The interval break points are
# break points at 0, 20, and 40 minutes of every hour, plus 24 to close the last interval
intervals <- expand.grid(hour = 0:23, minute_interval = c(0, 20/60, 40/60))
intervals <- c(sort(intervals$hour + intervals$minute_interval), 24)
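A quick check on the break points: three per hour across 24 hours plus the closing value gives 73 breaks, which define the 72 intervals per day.
length(intervals) # 24 * 3 + 1 = 73 break points, defining 72 intervals
head(intervals)   # 0.0000000 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667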
The cut function links those break points to the data by assigning each fractional hour to an interval. Note the include.lowest argument: with cut's default right = TRUE, each interval is closed on the right, so a value exactly equal to a break point falls in the interval that ends there; include.lowest = TRUE additionally puts a value equal to the lowest break point (here 0) into the first interval instead of returning NA.
dat[, interval := cut(hour_fractional, intervals,
                      labels = 1:(length(intervals) - 1),
                      include.lowest = TRUE)]
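To illustrate the boundary behavior with a small example of my own: readings at exactly 0 and 20 minutes past the first hour fall in interval 1 (0 via include.lowest, 20 because the intervals are right-closed), while a reading at 25 minutes falls in interval 2.
cut(c(0, 15, 20, 25) / 60, intervals,
    labels = 1:(length(intervals) - 1), include.lowest = TRUE)
# the four values map to intervals 1, 1, 1, 2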
Now we can summarize within each time interval.
# cast to wide format: one row per year/doy/interval, one column per logger,
# averaging any readings that fall within the same interval
tmp <- dcast(dat, year + doy + interval ~ logger,
             value.var = "water_level_elevation", fun.aggregate = mean)
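One note on the NaN values you will see below: when a logger has no readings in a given interval, dcast fills the cell by applying fun.aggregate to a zero-length vector, and the mean of an empty vector is NaN.
mean(numeric(0)) # NaN, the fill value for loggers with no data in an interval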
The result is illustrated below. The NaN values make it easy to see where a logger has no data for a given interval (loggers 1 and 3 do not start until 2018), and each logger now has a column aligned to the common time interval.
head(tmp)
## year doy interval 1 2 3
## 1: 2017 1 1 NaN 0.27053709 NaN
## 2: 2017 1 2 NaN 0.65718921 NaN
## 3: 2017 1 3 NaN 0.36914897 NaN
## 4: 2017 1 4 NaN 0.59974370 NaN
## 5: 2017 1 5 NaN 0.06386676 NaN
## 6: 2017 1 6 NaN 0.48007127 NaN
And here are the last 6 rows of data.
tail(tmp)
## year doy interval 1 2 3
## 1: 2020 365 67 0.4147117 0.7213023 0.44568067
## 2: 2020 365 68 0.4494164 0.9966311 0.05178791
## 3: 2020 365 69 0.8021701 0.5101723 0.99904165
## 4: 2020 365 70 0.4218598 0.7817812 0.76833789
## 5: 2020 365 71 0.1888997 0.8379302 0.68012699
## 6: 2020 365 72 0.7194562 0.4693931 0.51111675
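Finally, since the introduction argued that long format is better for storage, it is worth noting that melt() from data.table reverses the cast. Here is a sketch of the round trip, where na.rm = TRUE drops the NaN cells (is.na() treats NaN as missing):
long <- melt(tmp, id.vars = c("year", "doy", "interval"),
             variable.name = "logger", value.name = "water_level_elevation",
             na.rm = TRUE) # drops intervals where a logger has no data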