Thursday, July 19, 2012

Time zones

Say we have the following raw data. It consists of a timestamp and a corresponding value. There is a peak at exactly midnight (00:00:00). Each timestamp is fully specified: it contains a date, a time of day, and a time zone offset. In this case the offset is +0000, meaning the data is 0 hours away from UTC.

"timestamp","value"
"25-04-2012 22:00:00 +0000",0
"25-04-2012 22:15:00 +0000",0
"25-04-2012 22:30:00 +0000",1
"25-04-2012 22:45:00 +0000",2
"25-04-2012 23:00:00 +0000",5
"25-04-2012 23:15:00 +0000",11
"25-04-2012 23:30:00 +0000",17
"25-04-2012 23:45:00 +0000",19
"26-04-2012 00:00:00 +0000",20
"26-04-2012 00:15:00 +0000",19
"26-04-2012 00:30:00 +0000",17
"26-04-2012 00:45:00 +0000",11
"26-04-2012 01:00:00 +0000",5
"26-04-2012 01:15:00 +0000",2
"26-04-2012 01:30:00 +0000",1
"26-04-2012 01:45:00 +0000",0
"26-04-2012 02:00:00 +0000",0

This data is stored in a file called peak2.dat and we read it as follows:

dataset <- read.csv( file="peak2.dat",as.is=TRUE)

Then we convert the timestamps to date-time objects with the aid of strptime, which returns POSIXlt objects. Here we use the %z field to also read the time zone offset:

# Convert timestamps
dataset$timestamp2 <- strptime( dataset$timestamp,
                                format="%d-%m-%Y %H:%M:%S %z",
                                tz="UTC" )
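As a quick check of what %z does, here is a minimal sketch. The names a and b are only for illustration and are not part of the data set above. Two strings that describe the same instant with different offsets should compare equal after parsing:

```r
fmt <- "%d-%m-%Y %H:%M:%S %z"

# Midnight at offset +0000 and 02:00 at offset +0200 are the same instant
a <- strptime("26-04-2012 00:00:00 +0000", format = fmt, tz = "UTC")
b <- strptime("26-04-2012 02:00:00 +0200", format = fmt, tz = "UTC")

class(a)                           # strptime returns a POSIXlt object
same_instant <- as.POSIXct(a) == as.POSIXct(b)
same_instant
```

Comparing via as.POSIXct makes it explicit that we compare points in time rather than the printed clock times.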

Then we use ggplot to make a nice plot of the data:

library( ggplot2 )
p1 <- ggplot( dataset, aes( timestamp2, value ) ) +
      geom_point() + scale_x_datetime()
print( p1 )

Something odd happened. The peak that was at 00:00 is now at 02:00 hours.

The reason for this is that the timestamps in the graph are displayed in the timezone of the machine R runs on. In my case this was CET, which is two hours ahead of UTC during summer time (CEST).
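The shift can be reproduced without any plotting. The following is a small sketch that assumes the zone name Europe/Berlin as a stand-in for "my machine's CET"; it formats the same instant for two display time zones:

```r
# One instant in time: midnight UTC on 26 April 2012
peak <- as.POSIXct("2012-04-26 00:00:00", tz = "UTC")

# The instant does not change, only the way it is displayed
in_utc    <- format(peak, "%Y-%m-%d %H:%M", tz = "UTC", usetz = TRUE)
in_berlin <- format(peak, "%Y-%m-%d %H:%M", tz = "Europe/Berlin", usetz = TRUE)

in_utc      # "2012-04-26 00:00 UTC"
in_berlin   # "2012-04-26 02:00 CEST"
```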

Notice that if you make the same plot with plot() instead of ggplot() the result is different.

plot( dataset$timestamp2, dataset$value, main="plot()", 
      xlab='timestamp2',
      ylab='value' )

The peak now shows at 00:00 hours instead of 02:00.

So which one is correct? It depends; sometimes you want 02:00, sometimes you want 00:00. Let me give you an example. Say you live in Germany. You have a colleague living in Iceland. She did an interesting experiment and needs your help analyzing the data. She sends you the data with timestamps in UTC, which is also the timezone of Iceland (GMT). Unlike Germany, Iceland does not have daylight saving time.

You analyse the data and make a nice plot with ggplot. You call her and say, "Well, I see a strange spike at 2 o'clock in your data". Then you had better tell her it was 2 o'clock your time, or she might go on a wild goose chase trying to figure out what happened during her experiment at 2 o'clock her time, which is 3 o'clock or 4 o'clock your time (depending on whether you have winter or summer time). In such a case it is much easier to view a graph of the data in the timezone the data comes from. That way you avoid having to constantly convert back and forth between the timezones.
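The arithmetic in this example can be checked with a short sketch. The two dates are arbitrary picks for a winter and a summer day, and Atlantic/Reykjavik and Europe/Berlin are assumed as the zone names for Iceland and Germany:

```r
# 2 o'clock Icelandic time, once in winter and once in summer
winter <- as.POSIXct("2012-01-15 02:00:00", tz = "Atlantic/Reykjavik")
summer <- as.POSIXct("2012-07-15 02:00:00", tz = "Atlantic/Reykjavik")

# The same instants on a German clock
winter_de <- format(winter, "%H:%M %Z", tz = "Europe/Berlin")
summer_de <- format(summer, "%H:%M %Z", tz = "Europe/Berlin")

winter_de   # "03:00 CET"
summer_de   # "04:00 CEST"
```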

On the other hand, if you have some measuring device that records all timestamps in UTC, and it is located in the same time zone as you, you probably want all timestamps shown in your time zone.

So sometimes you want the data to be shown in your timezone, sometimes in its original timezone. How can this be accomplished? There are a number of variables that influence how data is shown.

  • The time zone offset of the data (%z field),
  • The parameter tz of the strptime function,
  • The timezone set in your operating system's clock,
  • The environment variable TZ.

The most important variable is the time zone offset. It indicates the offset of your timestamp from UTC. It does not indicate the exact timezone, as several timezones can have the same offset from UTC. However, if your data includes a time zone offset, use it. With this offset the timestamp defines a single point in time. Without it, timestamps are ambiguous and the time zone your data is interpreted in depends on other variables.
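To illustrate that an offset does not pin down a timezone, here is a rough sketch. The zones Europe/Berlin and Africa/Lagos and the two dates are my own picks for illustration, and %z on output is assumed to be supported on your platform:

```r
winter <- as.POSIXct("2012-01-15 12:00:00", tz = "UTC")
summer <- as.POSIXct("2012-07-15 12:00:00", tz = "UTC")
zones  <- c("Europe/Berlin", "Africa/Lagos")

# UTC offset of each zone on a winter day and on a summer day
offsets <- sapply(zones, function(z) {
  c(winter = format(winter, "%z", tz = z),
    summer = format(summer, "%z", tz = z))
})
offsets
```

Both zones share +0100 in winter, but only Berlin moves to +0200 in summer, so an offset of +0100 alone cannot tell you which of the two zones a timestamp came from.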

One of these is the tz parameter of the strptime function. It lets you specify the name of a timezone. This parameter does several things. If your timestamp does not include a time zone offset, tz is used to interpret your timestamp.

> x <- strptime( "25-03-2012 02:23:00", 
               format="%d-%m-%Y %H:%M:%S", tz="EST" )
> x
[1] "2012-03-25 02:23:00 EST"

It is also possible to use both a time zone offset and tz. In this case the offset determines the point in time, and tz is used when displaying your data.

> x <- strptime( "25-03-2012 02:23:00 +0000", 
                 format="%d-%m-%Y %H:%M:%S %z", tz="EST" )
> x
[1] "2012-03-24 21:23:00 EST"

The value of tz is stored together with the converted timestamp.

> dput(x)
structure(list(sec = 0, min = 23L, hour = 21L, mday = 24L, mon = 2L, 
    year = 112L, wday = 6L, yday = 83L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("EST", "EST", "EST"
))

This information is used by some functions that display data.
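The stored timezone can be inspected and changed through the tzone attribute. The sketch below converts to POSIXct first, which is my own choice because it keeps the example short; changing the attribute only changes how the value is displayed, not the instant it represents:

```r
x <- strptime("25-03-2012 02:23:00 +0000",
              format = "%d-%m-%Y %H:%M:%S %z", tz = "EST")

y <- as.POSIXct(x)
attr(y, "tzone")                    # the tz given to strptime is carried along

# Same instant, different display time zone
z <- y
attr(z, "tzone") <- "UTC"

shown_est <- format(y, "%Y-%m-%d %H:%M", usetz = TRUE)
shown_utc <- format(z, "%Y-%m-%d %H:%M", usetz = TRUE)
shown_est   # "2012-03-24 21:23 EST"
shown_utc   # "2012-03-25 02:23 UTC"
```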

If you specify neither a time zone offset nor tz, R uses the time zone of your OS, but it does not store the timezone information.

> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00"
> dput(x)
structure(list(sec = 0, min = 23L, hour = 2L, mday = 25L, mon = 2L, 
    year = 112L, wday = 0L, yday = 84L, isdst = 1L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"))

On Unix-like systems you can override this with the environment variable TZ. It puts R in a different timezone for the duration of the session. You would use it as follows:

$ export TZ=EST
$ R 
> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00 EST"
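TZ can also be changed from inside a running R session with Sys.setenv. This is a minimal sketch, assuming your platform picks up the change immediately; the names old_tz and secs are only for illustration:

```r
# Remember the current setting so it can be restored afterwards
old_tz <- Sys.getenv("TZ", unset = NA)

Sys.setenv(TZ = "EST")
x <- strptime("25-03-2012 02:23:00", format = "%d-%m-%Y %H:%M:%S")
format(x, "%Y-%m-%d %H:%M:%S", usetz = TRUE)

# Seconds since the epoch: 02:23 EST corresponds to 07:23 UTC
secs <- as.numeric(as.POSIXct(x))

# Restore the previous setting
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```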

The table below shows the effect of these variables on how plot() and ggplot() show the data from the example above. The table shows where the peak of the graph is located for various combinations of the variables. The timezone of the operating system's clock is fixed to CET. The other variables are varied.

"-" means the variable is not set.

                 time zone offset: -          time zone offset: UTC
TZ   Plot kind   tz: -   tz: UTC   tz: EST    tz: -   tz: UTC   tz: EST
-    ggplot      00:00   02:00     07:00      02:00   02:00     02:00
-    plot        00:00   00:00     00:00      00:00   00:00     19:00
UTC  ggplot      00:00   00:00     05:00      00:00   00:00     00:00
UTC  plot        00:00   00:00     00:00      00:00   00:00     19:00
CET  ggplot      00:00   02:00     07:00      02:00   02:00     02:00
CET  plot        00:00   00:00     00:00      02:00   00:00     19:00

When we look at the table it is clear that plot() and ggplot() behave quite differently. When the timestamps have a time zone offset, the tz parameter does not have any influence on where ggplot() shows the maximum. For plot() it is the opposite: if no time zone offset is specified, plot() always shows the peak at 00:00.

Conclusion

  • When you make a graph with a time axis, be aware of the time zone in which the breaks on the axis are displayed. Otherwise points of interest might not be where you think they are.
  • Use timestamps with a time zone offset indication.
  • If you want plot() and ggplot() to behave the same, do not use tz but do set TZ.
  • The only(*) way to make ggplot display your data in a different time zone than your OS's is to set TZ.

(*) ggplot2 used to have a tz parameter for its scale_x_datetime but that seems to be gone in the current release.
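If you only want TZ changed while one plot is drawn, a small helper can set and restore it. This is only a sketch: with_TZ is a hypothetical function of my own, not part of R or ggplot2, and whether a given plotting function honours TZ is something to verify on your own setup.

```r
# Hypothetical helper: evaluate expr with the TZ environment variable set to tz
with_TZ <- function(tz, expr) {
  old <- Sys.getenv("TZ", unset = NA)
  # Restore the previous state when the function exits, even on error
  on.exit(if (is.na(old)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old),
          add = TRUE)
  Sys.setenv(TZ = tz)
  force(expr)   # expr is evaluated lazily, so this runs under the new TZ
}

# A timestamp without stored time zone information: midnight UTC
peak <- as.POSIXct("2012-04-26 00:00:00", tz = "UTC")
attr(peak, "tzone") <- NULL

with_TZ("UTC", format(peak, "%H:%M"))   # "00:00"
with_TZ("EST", format(peak, "%H:%M"))   # "19:00"
```

For the plots in this post the call would look like with_TZ( "UTC", print( p1 ) ).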
