-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
I have the following functions in my package that we might consider adding to data.table: quick_year, quick_yday, quick_mday (code here).
Till now I haven't filed these as a PR because they're not perfectly correct -- any user that deals with dates before March 1st, 1900 or after February 28, 2100 will get incorrect results due to leap centuries.
Only now I considered the possibility of adding these methods to data.table, but including an argument (say, exact = TRUE, ?activated by default?) that would default to the exact methods for users dealing with distant dates; users dealing with (currently) more typical dates would get the benefit of these faster methods.
To get a sense for the improvement, here's a benchmark:
NN = 1e6
# sample between Jan 1, 1901 and Dec 31, 2099
sample_dates = as.IDate(sample(-25202:47481, NN, replace = TRUE),
origin = '1970-01-01')
library(microbenchmark)
library(data.table)
library(funchir) #must be installed from my GitHub for now
microbenchmark(times = 100L,
year_exact_TRUE = year(sample_dates),
year_exact_FALSE = quick_year(sample_dates),
yday_exact_TRUE = yday(sample_dates),
yday_exact_FALSE = quick_yday(sample_dates),
mday_exact_TRUE = mday(sample_dates),
mday_exact_FALSE = quick_mday(sample_dates))
# Unit: milliseconds
# expr min lq mean median uq max
# year_exact_TRUE 20.64397 57.99979 62.94209 68.10336 70.69258 87.86857
# year_exact_FALSE 37.25068 38.42864 40.07562 39.44084 40.68865 86.23708
# yday_exact_TRUE 20.15585 56.33888 61.68985 64.37842 69.01061 135.93603
# yday_exact_FALSE 29.93036 31.09082 33.88315 31.85358 32.78074 80.44249
# mday_exact_TRUE 19.05430 54.83927 57.80232 65.01923 67.61623 82.47341
# mday_exact_FALSE 16.10283 16.88485 18.81903 17.38729 17.90674 65.08335
So a 43% improvement, 52% improvement, and a 74% improvement over current methods.
And to confirm:
identical(year(sample_dates), quick_year(sample_dates))
# [1] TRUE
identical(yday(sample_dates), quick_yday(sample_dates))
# [1] TRUE
identical(mday(sample_dates), quick_mday(sample_dates))
# [1] TRUE
Any thoughts? I would also lean towards setting the default to use the faster method, but this is a potentially breaking change. Could also consider checking whether the underlying integer is inside the correct boundary and then branching from there, or consider expanding this to just work for all dates forever...