Saturday, July 21, 2012

The magic of the year 1901

The year 1901 is rather magical. Well, it is for R, provided you run it under Linux. Let me show you why. I have four data points: one from 1900, two from 1901, and one from 1902.

dates  <- c("11/11/1900", "01/01/1901", "30/05/1901", "01/01/1902")
values <- c(      1,         2,           0.7,              0.1 )

I convert them in two different ways: as a Date and as a POSIXct. The same format string is used for both conversions.

date1 <- as.Date(    dates, format="%d/%m/%Y" )
date2 <- as.POSIXct( dates, format="%d/%m/%Y" )

I plot them like this:

plot( date1, values )
plot( date2, values )

Now you try and spot the difference.

Both graphs have the same shape but different axis breaks. In the first graph the maximum appears to be in 1901; in the second graph it appears to be in 1900. This is caused by a bug in the conversion from a string to R's POSIXct class.
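The same discrepancy can be checked without plotting. The sketch below compares the calendar year that each class reports for the same input string. The helper name same_year and its tz argument are my own additions for illustration, not part of the original analysis; whether any element comes back FALSE depends on the platform and the time zone in effect.

```r
dates <- c("11/11/1900", "01/01/1901", "30/05/1901", "01/01/1902")

# Hypothetical helper: TRUE where the Date and the POSIXct conversion
# of the same string fall in the same calendar year.
same_year <- function(x, fmt = "%d/%m/%Y", tz = "") {
  as_date    <- as.Date(x, format = fmt)
  as_posixct <- as.POSIXct(x, format = fmt, tz = tz)
  format(as_date, "%Y") == format(as_posixct, "%Y")
}

# tz = "" means "use the current time zone", which is where the surprise
# can show up; the result is platform-dependent.
same_year(dates)
```

Passing an explicit zone, for example same_year(dates, tz = "UTC"), should give TRUE for every element.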

> as.POSIXct( "1901-01-01", format="%Y-%m-%d" ) 
[1] "1900-12-31 23:59:28 AMT"

Two things are wrong here.

  1. We somehow shifted 32 seconds into the past, thereby moving from 1901 to 1900, which causes the difference between the two graphs.
  2. We also moved from the CET time zone, where I live, to a zone abbreviated AMT, which looks like Amazon time but may well be Amsterdam Mean Time, a historical zone in the time zone database.

The conversion works fine for dates in the more recent past.

> as.POSIXct("2012-01-01", format="%Y-%m-%d" )
[1] "2012-01-01 CET"

It even works properly for dates before the Unix epoch 1970-01-01.

> as.POSIXct("1957-01-01", format="%Y-%m-%d" )
[1] "1957-01-01 CET"

But around 1940 strange things happen to the time zone, and in December 1901 the 32-second time shift happens.

> as.POSIXct("1940-01-01", format="%Y-%m-%d" )
[1] "1940-01-01 NET"
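To find where the abbreviation or the seconds field changes, one could scan a range of years instead of probing single dates. This is only a sketch: zone_by_year is a hypothetical helper name, and the output depends entirely on the platform and on the time zone database installed, so no expected output is shown.

```r
# Hypothetical helper: for 1 January of each year, report the time zone
# abbreviation and the seconds field that R assigns to local midnight.
zone_by_year <- function(years, tz = "") {
  stamps <- as.POSIXct(sprintf("%d-01-01", years),
                       format = "%Y-%m-%d", tz = tz)
  data.frame(year    = years,
             zone    = format(stamps, "%Z"),
             seconds = format(stamps, "%S"),
             stringsAsFactors = FALSE)
}

# tz = "" uses the current time zone, as in the examples above.
zone_by_year(c(1900, 1901, 1902, 1940, 1957, 2012))
```

A seconds value other than "00", or an unexpected abbreviation, marks a year where the conversion goes astray.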

How to fix this

Be explicit: don't leave R guessing which time zone to use. Set the environment variable TZ to a time zone of your liking before you start R.

$ export TZ=CET
$ R
> as.POSIXct( "1901-01-01", format="%Y-%m-%d" ) 
[1] "1901-01-01 CET"
> as.POSIXct( "1940-01-01", format="%Y-%m-%d" ) 
[1] "1940-01-01 CET"
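If changing the shell environment is not convenient, a similar effect can probably be had from within R. The sketch below shows two options that I have not tested on the affected Linux setup: passing the tz argument to as.POSIXct directly, and setting TZ for the running session with Sys.setenv. I use "UTC" here only because it has no historical offsets; substitute the zone you actually want.

```r
# Option 1: name the time zone on the conversion itself.
x <- as.POSIXct("1901-01-01", format = "%Y-%m-%d", tz = "UTC")
format(x, "%Y-%m-%d %H:%M:%S")

# Option 2: set TZ for the current session, then convert as before.
# Assumption: the platform picks up a TZ change made after R has started.
Sys.setenv(TZ = "UTC")
y <- as.POSIXct("1940-01-01", format = "%Y-%m-%d")
format(y, "%Y-%m-%d %H:%M:%S")
```

Option 1 has the advantage that the choice of zone travels with the code instead of depending on how R was launched.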
