
Q - incorporate faster (though inexact due to leap centuries) datetime functions? #2463

@MichaelChirico

Description

I have the following functions in my package that we might consider adding to data.table: quick_year, quick_yday, quick_mday (code here).

Until now I haven't filed these as a PR because they're not perfectly correct -- any user who deals with dates before March 1st, 1900 or after February 28th, 2100 will get incorrect results due to leap centuries.
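For context (my illustration, not part of the package code): under the Gregorian rule, century years are leap years only when divisible by 400, so 1900 and 2100 are not leap years while 2000 is -- which is exactly why a fixed every-fourth-year assumption only holds between those two dates:

```r
# Gregorian rule: leap if divisible by 4, unless divisible by 100,
# unless also divisible by 400.
is_leap = function(y) (y %% 4L == 0L & y %% 100L != 0L) | y %% 400L == 0L

is_leap(c(1900L, 2000L, 2100L, 2016L))
# [1] FALSE  TRUE FALSE  TRUE
```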

Only now have I considered adding these methods to data.table with an argument (say, exact, perhaaps TRUE by default?) controlling which path is taken: the default would keep the exact methods for users dealing with distant dates, while users dealing with (currently) more typical dates would get the benefit of these faster methods.
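To make the idea concrete, here's a rough sketch of what the signature could look like. The fast path below is my own stand-in based on the fixed 1461-day four-year cycle, not funchir's actual implementation, and all names and defaults are illustrative only:

```r
# Stand-in fast path: only exact between 1900-03-01 and 2100-02-28,
# where every fourth year really is a leap year.
quick_year_impl = function(x) {
  d = as.integer(x) + 731L   # days since 1968-01-01 (start of a leap year)
  r = d %% 1461L             # position within the 1461-day four-year cycle
  1968L + 4L * (d %/% 1461L) + ((r >= 366L) + (r >= 731L) + (r >= 1096L))
}

# Hypothetical signature: exact = TRUE keeps current behaviour.
year2 = function(x, exact = TRUE) {
  if (exact) as.POSIXlt(x)$year + 1900L   # always-correct path
  else quick_year_impl(x)                 # fast path for "typical" dates
}
```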

To get a sense for the improvement, here's a benchmark:

NN = 1e6
set.seed(21)  # arbitrary seed, for reproducibility
# sample between Jan 1, 1901 and Dec 31, 2099
sample_dates = as.IDate(sample(-25202:47481, NN, replace = TRUE),
                        origin = '1970-01-01')

library(microbenchmark)
library(data.table)
library(funchir) #must be installed from my GitHub for now
microbenchmark(times = 100L,
               year_exact_TRUE = year(sample_dates),
               year_exact_FALSE = quick_year(sample_dates),
               yday_exact_TRUE = yday(sample_dates),
               yday_exact_FALSE = quick_yday(sample_dates),
               mday_exact_TRUE = mday(sample_dates),
               mday_exact_FALSE = quick_mday(sample_dates))
# Unit: milliseconds
#              expr      min       lq     mean   median       uq       max
#   year_exact_TRUE 20.64397 57.99979 62.94209 68.10336 70.69258  87.86857
#  year_exact_FALSE 37.25068 38.42864 40.07562 39.44084 40.68865  86.23708
#   yday_exact_TRUE 20.15585 56.33888 61.68985 64.37842 69.01061 135.93603
#  yday_exact_FALSE 29.93036 31.09082 33.88315 31.85358 32.78074  80.44249
#   mday_exact_TRUE 19.05430 54.83927 57.80232 65.01923 67.61623  82.47341
#  mday_exact_FALSE 16.10283 16.88485 18.81903 17.38729 17.90674  65.08335

So: roughly a 43% improvement for year, a 52% improvement for yday, and a 74% improvement for mday over the current methods.

And to confirm:

identical(year(sample_dates), quick_year(sample_dates))
# [1] TRUE
identical(yday(sample_dates), quick_yday(sample_dates))
# [1] TRUE
identical(mday(sample_dates), quick_mday(sample_dates))
# [1] TRUE

Any thoughts? I would also lean towards making the faster method the default, but that is a potentially breaking change. We could also check whether the underlying integer falls inside the valid boundary and branch from there, or consider expanding this to just work for all dates forever...
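For the branching idea, something like the following predicate could decide the dispatch (a sketch only; `use_fast_path` is a name I made up, and the window endpoints are the leap-century limits mentioned above):

```r
# TRUE when every input date lies where the every-fourth-year rule is
# exact, i.e. between 1900-03-01 and 2100-02-28 (the leap-century limits).
use_fast_path = function(x) {
  d = as.integer(x)
  all(d >= as.integer(as.Date("1900-03-01")) &
      d <= as.integer(as.Date("2100-02-28")))
}

# Dispatch sketch: if (use_fast_path(x)) quick_year(x) else year(x)
use_fast_path(as.Date(c("1999-12-31", "2077-07-07")))  # TRUE
use_fast_path(as.Date("1899-01-01"))                   # FALSE
```

One wrinkle worth noting: the check itself is an extra O(n) pass over the data, so it would eat into the speedup a bit, though `all()` can short-circuit on the first out-of-range value.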
