Replace TRUELENGTH markers with a hash
#6694
Conversation
Codecov Report
❌ Patch coverage is […]
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6694      +/-   ##
==========================================
- Coverage   99.06%   99.04%   -0.03%
==========================================
  Files          86       87       +1
  Lines       16616    16668      +52
==========================================
+ Hits        16461    16508      +47
- Misses        155      160       +5
|
|
Generated via commit 3119135. Download link for the artifact containing the test results: ↓ atime-results.zip
|
Also avoid crashing when creating a 0-size hash.
This will likely require a dynamically growing hash of TRUELENGTHs instead of the current pre-allocation approach with a very conservative over-estimate.
The hash needs O(n) memory (more precisely, 2*n/load_factor entries), which isn't great.
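For illustration only (not the PR's actual sizing code), a minimal C sketch of that estimate: round 2*n/load_factor up to a power of two so slots can be selected by masking, guarding the n == 0 case mentioned above.

#include <stddef.h>

/* Illustrative sketch, not data.table's code: smallest power-of-two
 * table holding n entries at the stated 2*n/load_factor estimate. */
static size_t hash_size_for(size_t n, double load_factor) {
  size_t need = (size_t)(2.0 * n / load_factor) + 1; /* +1 guards n == 0 */
  size_t size = 1;
  while (size < need)
    size <<= 1; /* power of two => index by (hash & (size - 1)) */
  return size;
}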
|
Hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use […]? |
|
I'll try giving many distinct strings to forderv() and rbindlist(). |
|
Perhaps you can also try a fast 3rd party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash |
|
It's pretty bad. For typical cases, the current hash table eats >1 order of magnitude more R memory*, and it's similarly slower in […]. The hash table is only on par by time in the worst case for […].

Details

# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)
N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

* Edit: Some of the memory use in […]. I'll try profiling the code.

Thanks @SebKrantz for the link, a newer benchmark by the same author is also very instructive. |
|
Thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order of magnitude constant factor increase in time/memory usage. |
In forder() and rbindlist(), there is no good upper bound on the number of elements in the hash known ahead of time, so grow the hash table dynamically. Since R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
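A minimal sketch of the flush-based publication this message describes, with illustrative names rather than data.table's actual code; as the message itself concedes, flushes only make the new pointer visible eventually and do not make the scheme safe on their own.

#include <stdlib.h>

typedef struct { void **slots; size_t nslots; } hashtab;

static hashtab *tab; /* shared between OpenMP threads */

static void grow_and_publish(void) {
  #pragma omp critical(hash_grow)
  {
    hashtab *bigger = malloc(sizeof *bigger);
    bigger->nslots = tab->nslots * 2;
    bigger->slots = calloc(bigger->nslots, sizeof *bigger->slots);
    /* ... rehash tab->slots into bigger->slots ... */
    tab = bigger;     /* publish the grown table */
    #pragma omp flush /* readers need a matching flush to observe it */
  }
  /* The old table must stay allocated while any thread may still be
   * probing it, which is why this "isn't really a solution". */
}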
|
Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to bring the slowdown factor closer to 2 by dynamically re-allocating the hash table. The memory use is significantly reduced (except for the worst cases), but cannot be measured with […].

Details

library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

The main problem with the current approach is that the parallel loop in […] may still be probing the old hash table while another thread reallocates it. The current code keeps one previous hash table alive until the next reallocation cycle and hopes that no reader still holds an even older pointer.
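A simplified sketch of that "keep one previous table" scheme (hypothetical names, single grower assumed): the old allocation is only freed on the next grow, giving in-flight readers one full reallocation cycle to finish.

#include <stdlib.h>

typedef struct { void **slots; size_t nslots; } hashtab;

static hashtab *current;  /* table readers are probing now */
static hashtab *previous; /* kept alive for straggling readers */

static void grow(void) {
  hashtab *bigger = malloc(sizeof *bigger);
  bigger->nslots = current->nslots * 2;
  bigger->slots = calloc(bigger->nslots, sizeof *bigger->slots);
  /* ... rehash current->slots into bigger->slots ... */
  if (previous) { free(previous->slots); free(previous); }
  previous = current; /* hope nobody holds an even older pointer */
  current = bigger;
}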
|
|
In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well, inspired by base R: multiplication hashing with an unsigned integer prime. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function, curious to see if it can do much better. PS: you can test […] |
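For reference, a textbook sketch of the multiplicative-hash idiom being described; the constant is Knuth's golden-ratio prime, and this is an illustration rather than collapse's exact code.

#include <stdint.h>

/* Multiplicative hashing: multiply by a large prime (~2^32/phi) and
 * keep the top k bits, indexing a table of size 2^k (0 < k < 32). */
static inline uint32_t mult_hash(uint32_t key, int k) {
  return (key * 2654435761u) >> (32 - k);
}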
|
The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.

Details

library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
Details

library(atime)
library(data.table)
limit <- 1
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')

And the real memory cost isn't even that large:

Details

library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)
# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
struct rusage ru;
int ret = getrusage(RUSAGE_SELF, &ru);
*kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))
limit <- 1
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
versions <- expression(
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
# use a disposable child process
mccollect(mcparallel({
eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
.C('maxrss', kb = double(1))$kb
}))
})
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
auto.key = TRUE, scales = list(log = 10),
par.settings=list(superpose.symbol=list(pch=19))
) |
|
Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + a large table still wins. Anyway, if you want to adopt it, feel free to copy it under your MPL license. Just mention me at the top of the file and as a contributor. |
PROTECT() the corresponding EXTPTRSXP while it is in use. Introduce a separate hash_set_shared() operation that avoids long jumps. Deallocate the previous hash table when growing a non-shared hash table.
Explicitly check the return value and update the shared pointer when necessary. If a reallocation attempt fails, signal an error when possible.
Avoid memory leaks from potential long jumps in hash_set().
This avoids a memory leak in case growVector or hash_set fails.
Prevent memory leak in case hash_set() causes a long jump.
Turns out it wasn't used.
Where C-heap allocations also exist, catch and handle potential allocation failures from the hash table.
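A minimal sketch of the EXTPTRSXP lifecycle these messages imply, with hypothetical names rather than the PR's internal API: the hash lives on the C heap, an external pointer with a finalizer owns it, and callers PROTECT() the pointer while the hash is in use.

#include <R.h>
#include <Rinternals.h>
#include <stdlib.h>

static void hash_xp_finalizer(SEXP xp) {
  void *h = R_ExternalPtrAddr(xp);
  if (h) {
    free(h); /* stand-in for the real hash destructor */
    R_ClearExternalPtr(xp);
  }
}

static SEXP hash_create_xp(size_t bytes) {
  void *h = calloc(1, bytes);
  if (!h) error("failed to allocate the hash table");
  SEXP xp = PROTECT(R_MakeExternalPtr(h, R_NilValue, R_NilValue));
  R_RegisterCFinalizerEx(xp, hash_xp_finalizer, TRUE);
  UNPROTECT(1); /* safe to return; caller PROTECTs while the hash is in use */
  return xp;
}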
|
We should probably also add a NOTE regarding all the changes. The "Internal use of declared non-API R functions" NOTE probably also […]. One last benchmark (since we look on par with 1.17.8):
Details

library(atime)
library(data.table)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
'1.17.8' = '4152c6cff80c60896e86b80f8b1c901a4ae93354',
'truehash' = 'a7f5c6c0c0edd098793837525205244f49c41aa2'
)
set.seed(3)
sample_strings = function(N=10, len=4) {
do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}
N <- 10^seq(2, 7.5, .25)
# Benchmark 1: Large table
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(tab_full, N, TRUE)
)
chmatch1 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work1[[as.character(N)]], tab_small),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch1)
# Benchmark 2: Small table (expected case: a few distinct strings - 26 letters)
chmatch_work2 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch2 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work2[[as.character(N)]], letters),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch2)
# Benchmark 3: Medium table (10K unique strings)
medium_table <- sample_strings(10000, len=4)
chmatch_work3 <- lapply(setNames(nm = N), \(N)
sample(medium_table, N, TRUE)
)
chmatch3 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work3[[as.character(N)]], medium_table),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch3)
# Benchmark 4: High miss rate (most items not found)
miss_table <- sample_strings(100, len=4)
chmatch_work4 <- lapply(setNames(nm = N), \(N)
sample_strings(N, len=4)
)
chmatch4 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work4[[as.character(N)]], miss_table),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch4)
# Benchmark 5: All hits (every item found)
hit_table <- sample_strings(1000, len=4)
chmatch_work5 <- lapply(setNames(nm = N), \(N)
sample(hit_table, N, TRUE)
)
chmatch5 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work5[[as.character(N)]], hit_table),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch5)
# Benchmark 6: chin (logical return) with small table
chin_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chin_bench <- atime_versions(
pkg.path, N,
expr = data.table::`%chin%`(chin_work[[as.character(N)]], letters),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chin_bench)
# Benchmark 7: Long strings (16 chars)
long_table <- sample_strings(1000, len=16)
chmatch_work7 <- lapply(setNames(nm = N), \(N)
sample(long_table, N, TRUE)
)
chmatch7 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work7[[as.character(N)]], long_table),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch7)
# Benchmark 8: Realistic join keys (90% hit rate)
join_table <- sample_strings(10000, len=8)
chmatch_work8 <- lapply(setNames(nm = N), \(N) {
# 90% from join_table (will be found), 10% new (won't be found)
c(
sample(join_table, as.integer(N * 0.9), TRUE),
sample_strings(as.integer(N * 0.1), len=8)
)
})
chmatch8 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work8[[as.character(N)]], join_table),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch8)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
plot(forderv1)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
plot(forderv2)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist1)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist2)
rm(rbindlist2_work); gc(full = TRUE)
# ===== SAVE BENCHMARK RESULTS =====
results_dir <- "benchmark_results"
dir.create(results_dir, showWarnings = FALSE)
# Save R objects
save(
chmatch1, chmatch2, chmatch3, chmatch4, chmatch5,
chin_bench, chmatch7, chmatch8,
forderv1, forderv2,
rbindlist1, rbindlist2,
file = file.path(results_dir, "benchmarks_hash.rda")
)
library(atime)
library(data.table)
results_dir <- "benchmark_results"
load(file.path(results_dir, "benchmarks_hash.rda"))
library(ggplot2)
library(patchwork) # install.packages("patchwork") if needed
# helper: ensure we get a ggplot and add a title
g <- function(p, title) p + ggtitle(title)
p <- list(
g(plot(chmatch1), "Large table (900K unique)"),
g(plot(chmatch2), "Small table (26 letters)"),
g(plot(chmatch3), "Medium table (10K unique)"),
g(plot(chmatch4), "High miss rate"),
g(plot(chmatch5), "All hits"),
g(plot(chin_bench), "chin (logical return)"),
g(plot(chmatch7), "Long strings (16 chars)"),
g(plot(chmatch8), "Join keys (90% hit rate)"),
g(plot(forderv1), "forderv: few distinct strings"),
g(plot(forderv2), "forderv: all unique strings"),
g(plot(rbindlist1), "rbindlist: same columns"),
g(plot(rbindlist2), "rbindlist: different columns")
)
# write a multipage PDF with 2x2 panels per page
pdf("benchmark_results.pdf", width = 10, height = 8)
for (i in seq(1, length(p), by = 4)) {
pg <- wrap_plots(p[i:min(i+3, length(p))], ncol = 2, nrow = 2, guides = "collect")
print(pg)
}
dev.off()
|
|
Probably also prudent to add something along the lines of "We've benchmarked the refactored code and find the performance satisfactory, but do report any edge case performance regressions we may have missed." |
|
Oh no.

# use 6144 elements instead of 6145 and it doesn't trigger
strings = apply(outer(1:6145, 95^(0:2), function(x, y) as.raw(0xa1 + floor(x / y) %% 95)), 1, rawToChar)
Encoding(strings) = "latin1"
DT = data.table(x = strings)
setorder(DT, x)
plot(match(DT$x, sort.int(strings, method='radix')))
Bisect points at c933247. I'm also investigating, but does anyone else have any pointers? 6145 is, of course, just past the threshold at which the table gets rehashed:

Line 611 in c933247

...and Line 322 in c933247

(And according to coverage data, we never rehash in place via […].) Rehashing itself seems completely fine. The newly created table is empty, as it should be. Once the elements are set, I can walk both tables and verify that the other one looks up the same values.

It doesn't seem to be due to threading either, because this still happens after […].

Naturally, neither Valgrind nor the sanitizers see any problems. |
|
Found the problem: changes to […] |
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: HughParsonage <[email protected]>
Co-authored-by: Jan Gorecki <[email protected]>
|
Really incredible work here. Looks like there are some conflicts to resolve before merging. I believe this is the last necessary merge before 1.18.0? |
|
Yes, this is the last PR needed to get rid of the non-API NOTE (for now). Looks like rehashing is unreachable from […] |
|
Changes to GLCI config tested on GitLab:
https://gitlab.com/Rdatatable/data.table/-/pipelines/2214728801
UBSan report also addressed:
https://gitlab.com/Rdatatable/data.table/-/jobs/12435568788
We should probably specify the number of arguments when registering our
callMethods to avoid surprises like this from UBSan.
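For context, the argument count is the third field of each R_CallMethodDef entry; a generic sketch (illustrative names, not data.table's init.c) looks like this.

#include <Rinternals.h>
#include <R_ext/Rdynload.h>

SEXP Cexample(SEXP a, SEXP b); /* some .Call entry point */

static const R_CallMethodDef callMethods[] = {
  /* -1 would skip arity checking; an explicit 2 lets R validate
   * calls against the registered signature. */
  {"Cexample", (DL_FUNC) &Cexample, 2},
  {NULL, NULL, 0}
};

void R_init_example(DllInfo *dll) {
  R_registerRoutines(dll, NULL, callMethods, NULL, NULL);
  R_useDynamicSymbols(dll, FALSE);
}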
|
I would not cover it with […] |
With apologies to Matt Dowle, who had poured a lot of effort into making data.table go fast.

Ongoing work towards #6180. Unfortunately, this doesn't completely remove any uses of non-API entry points by itself. Detailed motivation here in a pending blog post. Can't start implementing stretchy ALTREP vectors until data.table stops using TRUELENGTH to mark them.

Currently implemented:
- TRUELENGTH to mark CHARSXPs or columns replaced with a hash

Needs more work:
- rbindlist() and forder() pre-allocate memory for the worst-case usage
- forder.c, the last remaining user of savetl
- SET_TRUELENGTH is atomic, hash_set is not; will need additional care in multi-threaded environments
- savetl machinery in assign.c

Let's just see how much worse the performance is going to get.
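To make the core idea concrete, here is a self-contained toy (single-threaded, fixed size; the names are placeholders, not the PR's API): instead of stamping a mark into each CHARSXP via SET_TRUELENGTH, the mark lives in an external open-addressing table keyed by the object's address.

#include <stdint.h>
#include <stdlib.h>

typedef struct { uintptr_t key; int val; } slot;
typedef struct { slot *s; size_t mask; } marks;

static uint32_t ptr_hash(uintptr_t k) {
  return (uint32_t)((k >> 4) * 2654435761u); /* low pointer bits are mostly zero */
}

static marks *marks_new(size_t pow2_size) {
  marks *m = malloc(sizeof *m);
  m->s = calloc(pow2_size, sizeof *m->s);
  m->mask = pow2_size - 1;
  return m;
}

/* Replaces SET_TRUELENGTH(s, -id): the object itself stays untouched. */
static void marks_set(marks *m, uintptr_t key, int val) {
  size_t i = ptr_hash(key) & m->mask;
  while (m->s[i].key && m->s[i].key != key)
    i = (i + 1) & m->mask; /* linear probing */
  m->s[i].key = key;
  m->s[i].val = val;
}

/* Replaces reading TRUELENGTH(s); 0 means "unmarked". */
static int marks_get(const marks *m, uintptr_t key) {
  size_t i = ptr_hash(key) & m->mask;
  while (m->s[i].key && m->s[i].key != key)
    i = (i + 1) & m->mask;
  return m->s[i].key ? m->s[i].val : 0;
}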