Say we have the following raw data.
It consists of a timestamp and a corresponding value.
There is a peak at exactly midnight (00:00:00).
Each timestamp is fully specified: it contains a date, a time of day,
and a time zone offset. In this case the offset is +0000, meaning the data
is 0 hours away from UTC.
"timestamp","value"
"25-04-2012 22:00:00 +0000",0
"25-04-2012 22:15:00 +0000",0
"25-04-2012 22:30:00 +0000",1
"25-04-2012 22:45:00 +0000",2
"25-04-2012 23:00:00 +0000",5
"25-04-2012 23:15:00 +0000",11
"25-04-2012 23:30:00 +0000",17
"25-04-2012 23:45:00 +0000",19
"26-04-2012 00:00:00 +0000",20
"26-04-2012 00:15:00 +0000",19
"26-04-2012 00:30:00 +0000",17
"26-04-2012 00:45:00 +0000",11
"26-04-2012 01:00:00 +0000",5
"26-04-2012 01:15:00 +0000",2
"26-04-2012 01:30:00 +0000",1
"26-04-2012 01:45:00 +0000",0
"26-04-2012 02:00:00 +0000",0
This data is stored in a file called peak2.dat and we read it as
follows:
dataset <- read.csv( file="peak2.dat",as.is=TRUE)
Then we convert the timestamps to date-time objects with the aid of
strptime (which returns POSIXlt objects). Here we use the %z field to also read
the time zone offset:
# Convert timestamps
dataset$timestamp2 <- strptime( format="%d-%m-%Y %H:%M:%S %z",
dataset$timestamp,
tz="UTC" )
We then use ggplot to make a plot of the data and look at the resulting graph:
library(ggplot2)
p1 <- ggplot( dataset, aes( timestamp2, value ) ) +
geom_point() + scale_x_datetime()
print(p1)
Something odd happened. The peak that was at 00:00 is now at 02:00 hours.
The reason for this is that the timestamps in the graph are displayed in
the timezone of the machine R runs on. In my case this was CET, which is
two hours ahead of UTC (during summertime).
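The shift can be reproduced without drawing anything. The following is only a sketch: it assumes the machine's zone corresponds to the Olson name Europe/Berlin (the text above only says CET), and the variable names are made up for illustration. It formats the peak timestamp once in UTC and once in the assumed local zone.

```r
# Parse the peak timestamp from the example data, including its +0000 offset.
peak <- as.POSIXct(strptime("26-04-2012 00:00:00 +0000",
                            format = "%d-%m-%Y %H:%M:%S %z", tz = "UTC"))

# The same instant, rendered in two different time zones.
# "Europe/Berlin" is an assumed stand-in for the author's CET machine.
peak_utc    <- format(peak, "%H:%M", tz = "UTC")
peak_berlin <- format(peak, "%H:%M", tz = "Europe/Berlin")

print(c(UTC = peak_utc, Berlin = peak_berlin))
```

The instant stored in peak never changes; only the zone used to render it does.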
Notice that if you make the same plot with plot() instead of ggplot(),
the result is different.
plot( dataset$timestamp2, dataset$value, main="plot()",
xlab='timestamp2',
ylab='value' )
The peak now shows at 00:00 hours instead of 02:00.
So which one is correct? It depends; sometimes you want 02:00, sometimes you want 00:00. Let me give you an example. Say you live in Germany.
You have a colleague living in Iceland. She did an interesting experiment
and needs your help analyzing the data.
She sends you the data with timestamps in UTC, the time zone of Iceland (GMT). Unlike Germany, Iceland does not have daylight saving time.
You analyse the data and make a nice plot with ggplot.
You call her and say, "Well, I see a strange spike at 2 o'clock in your data".
Then you had better tell her it was 2 o'clock your time, or
she might go on a wild goose chase trying to figure out what happened
during her experiment at 2 o'clock her time, which is 3 o'clock or 4 o'clock your time, depending on whether you have winter or summer time.
In such a case it is much easier to view a graph of the data
in the time zone the data comes from. That way you avoid having
to constantly convert back and forth between the time zones.
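The arithmetic in this story can be checked with a few lines of R. This is a sketch only: the zone names Atlantic/Reykjavik and Europe/Berlin and the two sample dates are assumptions chosen to stand in for the two colleagues.

```r
# 2 o'clock in Iceland on a winter day and on a summer day.
# "Atlantic/Reykjavik" and "Europe/Berlin" are assumed zone names for the
# two people in the story; the dates are arbitrary examples.
winter <- as.POSIXct("2012-01-15 02:00:00", tz = "Atlantic/Reykjavik")
summer <- as.POSIXct("2012-07-15 02:00:00", tz = "Atlantic/Reykjavik")

# The same instants as seen on a clock in Germany.
german_winter <- format(winter, "%H:%M", tz = "Europe/Berlin")
german_summer <- format(summer, "%H:%M", tz = "Europe/Berlin")

print(c(winter = german_winter, summer = german_summer))
```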
On the other hand, if you have some measuring device that records all timestamps
in UTC, and it is located in the same time zone as you, you probably want all
timestamps shown in your time zone.
So sometimes you want the data to be shown in your time zone, sometimes
in its original time zone. So how can this be accomplished?
There are a number of variables that influence how data is shown.
- The time zone offset of the data (%z field),
- The parameter tz of the strptime function,
- The timezone set in your operating system's clock,
- The environment variable TZ.
The most important variable is the time zone offset.
It indicates how far your timestamp is offset from UTC.
It does not indicate the exact timezone, as several timezones can have the same offset from UTC. However, if your data includes a time zone offset, use it.
With this offset the timestamp defines a single point in time.
Without this offset timestamps are ambiguous and the time zone your data
is in depends on other variables.
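To make the ambiguity concrete, here is a small sketch (the variable names are made up for illustration). It parses the same text with two different tz values of strptime, first without and then with a +0000 offset, and measures how far apart the resulting instants are.

```r
fmt <- "%d-%m-%Y %H:%M:%S"

# Without an offset the same text means different instants,
# depending on the tz used to interpret it.
a <- as.POSIXct(strptime("26-04-2012 00:00:00", format = fmt, tz = "UTC"))
b <- as.POSIXct(strptime("26-04-2012 00:00:00", format = fmt, tz = "EST"))
gap_without_offset <- as.numeric(difftime(b, a, units = "hours"))

# With an explicit +0000 offset both calls describe the same instant;
# tz only changes how that instant is displayed.
fmt_z <- paste(fmt, "%z")
c1 <- as.POSIXct(strptime("26-04-2012 00:00:00 +0000", format = fmt_z, tz = "UTC"))
c2 <- as.POSIXct(strptime("26-04-2012 00:00:00 +0000", format = fmt_z, tz = "EST"))
gap_with_offset <- as.numeric(difftime(c2, c1, units = "hours"))

print(c(without = gap_without_offset, with = gap_with_offset))
```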
One of these is the tz parameter of the strptime function.
It lets you specify the name of a timezone.
This parameter does several things. If your timestamp does not include
a time zone offset, tz is used to interpret your timestamp.
> x <- strptime( "25-03-2012 02:23:00",
format="%d-%m-%Y %H:%M:%S", tz="EST" )
> x
[1] "2012-03-25 02:23:00 EST"
It is also possible to use both a time zone offset and tz.
In this case tz is used when displaying your data.
> x <- strptime( "25-03-2012 02:23:00 +0000",
format="%d-%m-%Y %H:%M:%S %z", tz="EST" )
> x
[1] "2012-03-24 21:23:00 EST"
The value of tz is stored together with the converted timestamp.
> dput(x)
structure(list(sec = 0, min = 23L, hour = 21L, mday = 24L, mon = 2L,
year = 112L, wday = 6L, yday = 83L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("EST", "EST", "EST"
))
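The stored zone can also be read directly instead of via dput(). The sketch below is illustrative only, with made-up variable names. It reads the tzone attribute, converts the value to POSIXct, and confirms that the underlying instant is still the original 02:23 UTC.

```r
x <- strptime("25-03-2012 02:23:00 +0000",
              format = "%d-%m-%Y %H:%M:%S %z", tz = "EST")

# The zone name travels with the POSIXlt object ...
zone_lt <- attr(x, "tzone")[1]

# ... and is kept when converting to POSIXct.
y <- as.POSIXct(x)
zone_ct <- attr(y, "tzone")

# The instant itself is unchanged: rendered in UTC it is the original time.
in_utc <- format(y, "%Y-%m-%d %H:%M", tz = "UTC")

print(c(zone_lt, zone_ct, in_utc))
```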
This information is used by some functions that display data.
If you specify neither a time zone offset nor tz, R uses the time zone
of your OS, but it does not store the timezone information.
> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00"
> dput(x)
structure(list(sec = 0, min = 23L, hour = 2L, mday = 25L, mon = 2L,
year = 112L, wday = 0L, yday = 84L, isdst = 1L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"))
On Unix-like systems you can override this with the environment variable TZ.
It temporarily puts R in a different timezone.
You would use it as follows:
$ export TZ=EST
$ R
> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00 EST"
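TZ can also be changed from inside a running session with Sys.setenv(), which on most platforms takes effect immediately. The following is a sketch under that assumption, with made-up variable names. It renders one fixed instant under two TZ settings and then restores the previous environment.

```r
# One fixed instant: midnight UTC on 26 April 2012.
peak <- as.POSIXct("2012-04-26 00:00:00", tz = "UTC")

# Remember the current setting so it can be restored afterwards.
old_tz <- Sys.getenv("TZ", unset = NA)

# Passing tz = "" asks format() to use the session's current time zone.
Sys.setenv(TZ = "UTC")
shown_utc <- format(peak, "%H:%M", tz = "")
Sys.setenv(TZ = "EST")
shown_est <- format(peak, "%H:%M", tz = "")

# Restore the previous environment.
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)

print(c(UTC = shown_utc, EST = shown_est))
```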
The table below shows the effect of these variables on how
plot() and ggplot() show the data from the
example above. The table shows where the peak of the graph is
located for various combinations of the variables.
The timezone of the operating system's clock is fixed to CET.
The other variables are varied.
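Each column combines a time zone offset in the data with a value of the tz parameter of strptime; a dash (-) means the variable is not set.

| TZ  | Plot kind | offset -, tz - | offset -, tz UTC | offset -, tz EST | offset UTC, tz - | offset UTC, tz UTC | offset UTC, tz EST |
|-----|-----------|----------------|------------------|------------------|------------------|--------------------|--------------------|
| -   | ggplot    | 00:00          | 02:00            | 07:00            | 02:00            | 02:00              | 02:00              |
| -   | plot      | 00:00          | 00:00            | 00:00            | 00:00            | 00:00              | 19:00              |
| UTC | ggplot    | 00:00          | 00:00            | 05:00            | 00:00            | 00:00              | 00:00              |
| UTC | plot      | 00:00          | 00:00            | 00:00            | 00:00            | 00:00              | 19:00              |
| CET | ggplot    | 00:00          | 02:00            | 07:00            | 02:00            | 02:00              | 02:00              |
| CET | plot      | 00:00          | 00:00            | 00:00            | 02:00            | 00:00              | 19:00              |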
When we look at the table it is clear that plot() and ggplot()
behave quite differently.
When the timestamps have a time zone offset, the tz
parameter does not have any influence on where ggplot()
shows the maximum. For plot() it is the opposite.
If no time zone offset is specified, plot() always shows
the peak at 00:00.
Conclusion
- When you make a graph with a time axis, be aware of the time zone in which the
breaks on the axis are displayed. Otherwise points of interest might not be
where you think they are.
- Use timestamps with a time zone offset indication.
- If you want plot() and ggplot() to
behave the same, do not use tz but do set TZ.
- The only(*) way to make ggplot display your data in a
different time zone than your OS's is to set TZ.
(*) ggplot2 used to have a tz parameter for its scale_x_datetime,
but that seems to be gone in the current release.
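The footnote describes the ggplot2 release that was current when this was written. More recent ggplot2 releases document a timezone argument on scale_x_datetime(). The sketch below assumes such a release is installed and has not been checked against the version used above; the data frame d is a made-up, three-point excerpt of the example data.

```r
# Only build the plot when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Three points around the peak, stored as POSIXct in UTC.
  d <- data.frame(
    timestamp2 = as.POSIXct(c("2012-04-25 23:45:00",
                              "2012-04-26 00:00:00",
                              "2012-04-26 00:15:00"), tz = "UTC"),
    value = c(19, 20, 19)
  )

  # timezone = "UTC" asks the scale to label the axis in UTC
  # (assumes a ggplot2 release that provides this argument).
  p <- ggplot(d, aes(timestamp2, value)) +
    geom_point() +
    scale_x_datetime(timezone = "UTC", date_labels = "%H:%M")
}
```

Calling print(p) draws the plot. If the argument behaves as documented, the peak should be labelled 00:00 whatever TZ is set to.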