Monday, August 06, 2012

It is good to be explicit

Being careful not to repeat the year 1901 mistake, I set the TZ variable before I run R. I have the same set of data that I convert as follows:

dates  <- c("11/11/1900", "01/01/1901", "30/05/1901", "01/01/1902")
values <- c(      1,         2,           0.7,              0.1 )

date1 <- as.Date(    dates, "%d/%m/%Y" )
date2 <- as.POSIXct( dates, "%d/%m/%Y" )

and then plot

plot( date1, values )
plot( date2, values )

To my surprise I end up with the following two graphs.

A number of things conspired against me here:

  • positional parameters,
  • default parameters.

Well, the main cause is that I did not read the manual and assumed that the parameters for as.Date() are similar to those of as.POSIXct(). But ask yourself how many times you have used a function without consulting the manual because you thought you knew what the parameters were.

So let's look at the other causes. The help for as.Date() shows the following possible parameters.

as.Date(x, ...)
## S3 method for class 'character'
as.Date(x, format = "", ...)
## S3 method for class 'numeric'
as.Date(x, origin, ...)
## S3 method for class 'POSIXct'
as.Date(x, tz = "UTC", ...)

Depending on the type of object you are trying to convert, different parameter lists apply. Let's focus on character objects.

as.Date(x, format = "", ...)

Two parameters are expected:

  • x an object to convert,
  • format a format string that specifies how the dates are formatted.

You can provide values for these parameters positionally, that is, in the order they are listed, or by name.

An example of the former.

as.Date('1492-11-29', '%Y-%m-%d')
[1] "1492-11-29"

Notice that format also has a default value, "", which means we do not have to provide it. This indeed works.

as.Date('1492-11-29')
[1] "1492-11-29"

Well, sort of. If format equals "", R tries %Y-%m-%d and then %Y/%m/%d. It stops with an error when neither matches:

as.Date('29-Nov-1492')
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

But it silently returns the wrong date for

as.Date('29-11-1492')
[1] "29-11-14"

29-11-1492 is interpreted as the 29th year, 11th month and 14th day. The remaining string "92" is not used, but this is not reported.

Default parameters can save time, but it is better to be explicit: say what you mean and specify the format of your data, so R does not have to guess and you won't end up being surprised.
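Even with an explicit format, leftover characters such as the "92" above go unreported. One way to catch them is a round trip: parse the string, format the result with the same format, and compare it with the input. The following is only a sketch; parse_date_strict() and its fmt argument are names I made up for illustration, not part of base R.

```r
# Hypothetical helper (not part of base R): parse with an explicit format and
# verify that formatting the result reproduces the input exactly.
parse_date_strict <- function(x, fmt) {
  parsed <- as.Date(x = x, format = fmt)
  round_trip <- format(parsed, fmt)
  # A value is bad if it did not parse, or if it does not round-trip.
  bad <- is.na(parsed) | round_trip != x
  if (any(bad)) {
    stop("input does not match format '", fmt, "': ",
         paste(x[bad], collapse = ", "))
  }
  parsed
}

# Day-month-year input parsed with the matching format succeeds.
ok <- parse_date_strict("29-11-1492", "%d-%m-%Y")
```

Calling parse_date_strict("29-11-1492", "%Y-%m-%d") stops with an error instead of quietly returning a date in the year 29. The check is deliberately strict: it also rejects input without zero padding, such as "1-1-1901" for %d-%m-%Y, which may or may not be what you want.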

Back to how to provide the values of the parameters. We have seen the positional method; the other one is by name. This would be:

as.Date(x='1492-11-29', format='%Y-%m-%d')
[1] "1492-11-29"

It even works the other way around now.

as.Date(format='%Y/%m/%d', x='1492/11/29')
[1] "1492-11-29"

It is more work to type this, and probably not worth it when you are just using R interactively. But if you are writing a script that is to be reused, this is the best way. It is very explicit, in a good way. You tell R exactly what you mean. In addition, you tell your future self what you meant when you wrote it. It helps others to understand what you are trying to say. This is good for reproducible research.

Now what went wrong in the original plot? My mistake was to assume that the parameters for as.POSIXct() appear in the same order as the ones for as.Date(). This is not the case, however, as can be seen from the help pages. (For character input, as.POSIXct() hands the work to as.POSIXlt(), so that is the signature that matters.)

## S3 method for class 'character'
as.Date(x, format = "", ...)

## S3 method for class 'character'
as.POSIXlt(x, tz = "", format, ...)

Therefore the conversion

date2 <- as.POSIXct( dates, "%d/%m/%Y" )

ended up using "%d/%m/%Y" as the tz parameter. R did not find a value for the format parameter and therefore guessed one (%Y/%m/%d, the default that matches the slashes in the data). The day numbers got interpreted as years, and the graph therefore shows the years 1, 11 and 30: about 2000 years ago, instead of the roughly 100 years ago given by the data.
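The difference in argument order can also be checked from the R prompt, without opening the help pages. A small sketch, assuming the character methods are reachable by name as as.Date.character and as.POSIXlt.character (the variable names are mine):

```r
# Names of the formal arguments of the two character methods.
date_args  <- names(formals(as.Date.character))
posix_args <- names(formals(as.POSIXlt.character))

# The second positional argument is what an unnamed format string binds to.
date_args[2]   # "format"
posix_args[2]  # "tz"
```

The rest of the argument list differs between R versions, so I only look at the second position here.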

I would have avoided the surprising graph had I used named parameters, even without reading the manual.
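For completeness, here is roughly what the conversion looks like with named parameters. The tz = "UTC" is an assumption for the example; use whatever time zone your data was recorded in.

```r
dates <- c("11/11/1900", "01/01/1901", "30/05/1901", "01/01/1902")

# Every argument is named, so the order no longer matters.
date1 <- as.Date(x = dates, format = "%d/%m/%Y")
date2 <- as.POSIXct(x = dates, format = "%d/%m/%Y", tz = "UTC")  # tz is assumed

# Both conversions should now agree on the calendar day.
stopifnot(format(date1, "%Y-%m-%d") == format(date2, "%Y-%m-%d"))
```

With this, both plots should show the same four points between 1900 and 1902.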

Conclusion

Be explicit and beware of implicit defaults in R.
