
Conversation

@aitap (Member) commented Dec 26, 2024

With apologies to Matt Dowle, who had poured a lot of effort into making data.table go fast.

Ongoing work towards #6180. Unfortunately, this doesn't completely remove any of the non-API entry points by itself. Detailed motivation is here, in a pending blog post. Can't start implementing stretchy ALTREP vectors until data.table stops using TRUELENGTH to mark them.

Currently implemented:

  • An open-addressing hash table using Knuth's multiplicative hashing with linear probing (see the sketch after this list)
    • Deliberately no removal operation, only insertion and update
    • Can be made more performant (collisions much less likely) by switching to tables of power-of-two length and double hashing
  • Almost all uses of TRUELENGTH to mark CHARSXPs or columns replaced with a hash
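
A minimal sketch of what such a table looks like (illustrative only, not the actual src/hash.c: the ht_* names, the pointer-keyed cells and the power-of-two/shift details are assumptions made for this example):

/* Open addressing keyed by pointer (e.g. a CHARSXP address): Knuth's
 * multiplicative hashing to pick the starting slot, linear probing on
 * collision, insertion and update only -- no removal. Illustrative names. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { const void *key; long value; } ht_cell;
typedef struct {
  ht_cell *cells;
  size_t   size;   /* always a power of two */
  int      shift;  /* 64 - log2(size): keep the top bits of the 64-bit product */
} ht;

static size_t ht_index(const ht *h, const void *key) {
  /* multiply by a 64-bit odd constant derived from the golden ratio,
   * then use the high bits as the starting slot */
  return (size_t)(((uint64_t)(uintptr_t)key * UINT64_C(0x9E3779B97F4A7C15)) >> h->shift);
}

static ht *ht_create(size_t n) {  /* room for n keys at <= 50% load factor */
  ht *h = malloc(sizeof *h);
  size_t size = 2; int shift = 63;
  while (size < 2 * n) { size <<= 1; shift--; }
  h->cells = calloc(size, sizeof(ht_cell));
  h->size = size; h->shift = shift;
  return h;
}

static void ht_set(ht *h, const void *key, long value) {
  size_t i = ht_index(h, key);
  for (;;) {
    if (h->cells[i].key == key)  { h->cells[i].value = value; return; }                        /* update */
    if (h->cells[i].key == NULL) { h->cells[i].key = key; h->cells[i].value = value; return; } /* insert */
    i = (i + 1) & (h->size - 1);  /* linear probing: try the next slot */
  }
}

static long ht_lookup(const ht *h, const void *key, long absent) {
  size_t i = ht_index(h, key);
  while (h->cells[i].key != NULL) {
    if (h->cells[i].key == key) return h->cells[i].value;
    i = (i + 1) & (h->size - 1);
  }
  return absent;  /* key was never inserted */
}

Setting a CHARSXP's mark with ht_set and reading it back with ht_lookup plays the role that SET_TRUELENGTH()/TRUELENGTH() marking used to play, without touching the string's header.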

Needs more work:

  • the uses in rbindlist() and forder() pre-allocate memory for the worst-case usage
    • may be noticeably bad for cache performance
    • a dynamically growing hash table would take even more CPU time and cache misses and maybe system calls to resize
  • forder.c, the last remaining user of savetl
    • SET_TRUELENGTH is atomic, hash_set is not, so it will need additional care in a multi-threaded environment
  • savetl machinery in assign.c

Let's just see how much worse the performance is going to get.

@aitap marked this pull request as ready for review December 26, 2024 15:52
codecov bot commented Dec 26, 2024

Codecov Report

❌ Patch coverage is 97.72727% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.04%. Comparing base (6329d94) to head (3119135).
⚠️ Report is 6 commits behind head on master.

Files with missing lines   Patch %   Lines
src/hash.c                 94.50%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6694      +/-   ##
==========================================
- Coverage   99.06%   99.04%   -0.03%     
==========================================
  Files          86       87       +1     
  Lines       16616    16668      +52     
==========================================
+ Hits        16461    16508      +47     
- Misses        155      160       +5     

☔ View full report in Codecov by Sentry.

github-actions bot commented Dec 26, 2024

  • HEAD=truehash slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit 3119135

Download link for the artifact containing the test results: ↓ atime-results.zip

Task                                    Duration
R setup and installing dependencies     2 minutes and 55 seconds
Installing different package versions   43 seconds
Running and plotting the test cases     2 minutes and 40 seconds

@tdhock (Member) commented Dec 29, 2024

hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use := or set to add a lot more columns?

@aitap mentioned this pull request Dec 29, 2024

@aitap (Member Author) commented Dec 29, 2024

The rbindlist() code pre-allocates O(sum(lengths(lapply(l, names)))) memory. In the best case, only O(length(names(l[[1]]))) is needed, a factor of length(l) less. In the worst case, all names are different and the code really does need all that memory for fill = TRUE.

The forder() code pre-allocates O(nrow(x)) memory.
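
In symbols, with n = length(l), the gap between what the code pre-allocates and what is actually needed is

$$\underbrace{\sum_{i=1}^{n} \bigl|\mathrm{names}(l_i)\bigr|}_{\text{pre-allocated}} \;\ge\; \underbrace{\Bigl|\bigcup_{i=1}^{n} \mathrm{names}(l_i)\Bigr|}_{\text{distinct names actually needed}}$$

with equality only when every name is different; when all tables share the same names, the right-hand side collapses to |names(l[[1]])|, a factor of length(l) smaller.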

I'll try giving many data.tables to rbindlist() and long data.tables with string columns to setorder().

@SebKrantz (Member) commented:

Perhaps you can also try a fast 3rd party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/

In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash

@aitap (Member Author) commented Dec 31, 2024

It's pretty bad. For typical cases, the current hash table eats more than an order of magnitude more R memory*, and it is similarly slower in forder():
forder, typical case
rbindlist, typical case

Details

The hash table is only on par by time in the worst case for forder() (all strings different) and might be on par in the worst case for rbindlist(fill = TRUE) (all column names different), but the latter is completely unrealistic as a use case.

forder, worst case
rbindlist, worst case

# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master   = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)

N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

* Edit: Some of the memory use in data.table is invisible to Rprofmem(). For forder(), data.table is said to allocate as much memory as one extra column, i.e. 8 bytes per CHARSXP element on a 64-bit computer. The hash table preallocates enough space for 2*N (original + UTF-8) cells at a 50% load factor (so twice as many again), with each cell consisting of a SEXP pointer and an R_xlen_t mark, i.e. 2 * 2 * 16 = 64 extra bytes per element on top of the normal memory use. Measuring forder()'s memory use as getrusage()'s ru_maxrss and subtracting the value before the function call gives coef(lm(forderuse_kb*1024 ~ N:version, x))[-1] as N:versionmaster = 8.993967 and N:versiontruehash = 72.178404, which is close to the theory.
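
Spelling that estimate out, the predicted per-element slopes (under the assumptions above) are

$$\text{master: } 8 \text{ bytes}, \qquad \text{truehash: } 8 + \underbrace{2}_{\text{original + UTF-8}} \times \underbrace{2}_{50\%\ \text{load}} \times \underbrace{(8+8)}_{\text{SEXP + R\_xlen\_t}} = 72 \text{ bytes},$$

against the fitted 8.99 and 72.18 bytes per element respectively.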

I'll try profiling the code.

Thanks @SebKrantz for the link, a newer benchmark by the same author is also very instructive.

@tdhock (Member) commented Dec 31, 2024

thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order of magnitude constant factor increase in time/memory usage.

In forder() and rbindlist(), there is no good upper boundary on the number of elements in the hash known ahead of time. Grow the hash table dynamically. Since the R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
@aitap (Member Author) commented Jan 1, 2025

Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to make the slowdown factor closer to 2 by dynamically re-allocating the hash table:
dynamic_hash

The memory use is significantly reduced (except for the worst cases), but cannot be measured with atime (just like other uses of calloc/malloc/realloc in data.table are invisible to Rprofmem).

Details
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master   = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
  dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

The main problem with the current approach is that since the parallel loop in src/forder.c:range_str(...) needs to read the hash table from outside the critical section while hash elements are being set inside the critical section, it's possible that dhash_set() will need to reallocate the table while dhash_lookup() is using it. I have already tried using pthread R/W locks and OpenMP critical sections; they were too slow. I wasn't able to use OpenMP atomics to make atomic reads/writes through a pointer (maybe that's asking for too much). Any third-party hash table will also need to be outfitted with appropriate locking.

The current code keeps one previous hash table around until the next reallocation cycle and hopes that dhash_enlarge writes the new table pointer and the metadata into the object close enough together that dhash_lookup will read a consistent view of it. This is not good enough for production because (1) the race window still exists and (2) there's no proof that there won't be a stale reader still using the previous table array when the current writer deallocates it. What can be done about it:

  • Don't deallocate the previous hash tables until the function is completely done with them (a minimal sketch follows this list). This will double the memory use (because the code doubles the buffer when re-allocating it and $\sum_{i=0}^{n} 2^i = 2^{n+1} - 1$). We'll be able to switch back from malloc to R_alloc and make the allocations visible to R's allocator.
  • Devise a read-copy-update scheme (see also: 1 2) to properly expire the previous versions of the hash table. These may require synchronisation primitives not easily available through OpenMP. Reference-counting the hash tables may result in contention and slowdowns similar to pthread R/W locks.
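
A minimal sketch of the first option, with hypothetical names (this is not the PR's dhash code): a superseded cell array is retired instead of freed, so a stale reader can keep probing it safely, and everything is released only once the whole operation is over.

/* Grow by doubling, but retire the old array instead of freeing it, because
 * a reader outside the critical section may still be probing it.  All
 * retired arrays are freed together once the parallel work is finished. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { const void *key; long value; } dht_cell;
typedef struct {
  dht_cell *cells;
  size_t    size;         /* power of two */
  int       shift;        /* 64 - log2(size) */
  dht_cell *retired[64];  /* superseded arrays; 64 doublings is far more than enough */
  int       n_retired;
} dht;

static size_t dht_start(const void *key, int shift) {
  return (size_t)(((uint64_t)(uintptr_t)key * UINT64_C(0x9E3779B97F4A7C15)) >> shift);
}

static int dht_enlarge(dht *h) {
  size_t newsize = h->size * 2;
  int newshift = h->shift - 1;
  dht_cell *bigger = calloc(newsize, sizeof(dht_cell));
  if (!bigger) return 0;                      /* let the caller signal the error */
  for (size_t i = 0; i < h->size; i++) {      /* rehash every occupied old cell */
    if (!h->cells[i].key) continue;
    size_t j = dht_start(h->cells[i].key, newshift);
    while (bigger[j].key) j = (j + 1) & (newsize - 1);
    bigger[j] = h->cells[i];
  }
  h->retired[h->n_retired++] = h->cells;      /* do NOT free: stale readers may remain */
  h->cells = bigger; h->size = newsize; h->shift = newshift;
  return 1;
}

static void dht_free(dht *h) {                /* call only after the parallel loop is done */
  for (int i = 0; i < h->n_retired; i++) free(h->retired[i]);
  free(h->cells);
}

Since each retired array is half the size of its successor, the total memory kept alive stays below twice the final table, which is the factor-of-two cost estimated above.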

@SebKrantz (Member) commented Jan 1, 2025

In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well - inspired by base R: a multiplication hash using an unsigned integer prime number. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function, curious to see if it can do much better.

PS: you can test collapse::fmatch() against your current attempts at rewriting chmatch().

@aitap (Member Author) commented Jan 8, 2025

The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.

forder() performance on a load with a few unique strings

forder() performance on a pathological load with all strings unique

Details
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
  'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

collapse::fmatch() works very well, faster than data.table::chmatch() even in its current form:

chmatch() vs. fmatch() on a best case load

Details
library(atime)
library(data.table)

limit <- 1

# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chmatch1 <- atime(
  N,
  seconds.limit = limit, verbose = TRUE,
  master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
  Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
  collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)

save(chmatch1, file = 'times_collapse.rda')

And the real memory cost isn't even that large:
memory use as returned by getrusage()

Details
library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)

# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
  struct rusage ru;
  int ret = getrusage(RUSAGE_SELF, &ru);
  *kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))

limit <- 1

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
versions <- expression(
  master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
  Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
  collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)

plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
  # use a disposable child process
  mccollect(mcparallel({
    eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
    .C('maxrss', kb = double(1))$kb
  }))
})

rm(chmatch_work1); gc(full = TRUE)

save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
  maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
  auto.key = TRUE, scales = list(log = 10),
  par.settings=list(superpose.symbol=list(pch=19))
)

@SebKrantz (Member) commented:

Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + a large table still wins. Anyway, if you want to adopt it, feel free to copy it under your MPL license. Just mention me at the top of the file and as a contributor.

  • PROTECT() the corresponding EXTPTRSXP while used.
  • Introduce a separate hash_set_shared() operation that avoids long jumps.
  • Deallocate the previous hash table when growing a non-shared hash table.
  • Explicitly check the return value and update the shared pointer when necessary. If a reallocation attempt fails, signal an error when possible.
  • Avoid memory leaks from potential long jumps in hash_set().
  • This avoids a memory leak in case growVector or hash_set fails.
  • Prevent memory leak in case hash_set() causes a long jump.
  • Turns out it wasn't used.
  • Where C-heap allocations also exist, catch and handle potential allocation failures from the hash table.
@aitap changed the title from "Start replacing TRUELENGTH markers with a hash" to "Replace TRUELENGTH markers with a hash" Dec 12, 2025
@ben-schwen (Member) commented:

We should probably also add a NOTE regarding all the changes

Internal use of declared non-API R functions SETLENGTH, TRUELENGTH, SET_TRUELENGTH, and SET_GROWABLE_BIT has been eliminated. Most usages have been migrated to R's experimental resizable vectors API (thanks to @ltierney, introduced in R 4.6.0, backported for older R versions), #7451. Uses of TRUELENGTH for marking seen items during grouping and binding operations (aka free hash table trick) have been replaced with proper hash tables, #6694. The new hash table implementation uses linear probing with power of 2 tables and automatic resizing. Additionally, chmatch() now hashes the needle (x) instead of the haystack (table) when length(table) >> length(x), significantly improving performance for lookups into large tables. TODO ADD MORE INFO about tables?
Thanks to @aitap, @ben-schwen, @jangorecki and @HughParsonage for implementation and reviews.
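
The chmatch() change mentioned in the NOTE works roughly as follows (an illustrative, self-contained sketch with plain C strings and made-up names, not the actual chmatch code): hash the short needle vector x once, then resolve first-occurrence positions in a single pass over the much longer table.

/* ans[i] = 1-based index in table of the first occurrence of x[i], or 0 if absent.
 * The hash holds only the distinct needle strings, so memory stays O(length(x))
 * even when the table is huge. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *key; int first_pos; } ncell;  /* first_pos: 1-based, 0 = unresolved */

static uint64_t str_hash(const char *s) {                  /* FNV-1a, a stand-in for the real hash */
  uint64_t h = UINT64_C(0xcbf29ce484222325);
  for (const unsigned char *p = (const unsigned char *)s; *p; p++)
    h = (h ^ *p) * UINT64_C(0x100000001b3);
  return h;
}

void match_short_needle(const char **x, int nx, const char **table, int ntab, int *ans) {
  size_t size = 2;                                         /* power of two, >= 2*nx (<= 50% load) */
  while (size < 2 * (size_t)nx) size <<= 1;
  size_t mask = size - 1;
  ncell *h = calloc(size, sizeof(ncell));

  for (int i = 0; i < nx; i++) {                           /* 1. hash the (short) needle */
    size_t s = str_hash(x[i]) & mask;
    while (h[s].key && strcmp(h[s].key, x[i])) s = (s + 1) & mask;
    if (!h[s].key) h[s].key = x[i];                        /* first_pos stays 0: not seen in table yet */
  }
  for (int j = 0; j < ntab; j++) {                         /* 2. one pass over the huge table */
    size_t s = str_hash(table[j]) & mask;
    while (h[s].key && strcmp(h[s].key, table[j])) s = (s + 1) & mask;
    if (h[s].key && h[s].first_pos == 0) h[s].first_pos = j + 1;
  }
  for (int i = 0; i < nx; i++) {                           /* 3. read the answers back */
    size_t s = str_hash(x[i]) & mask;
    while (h[s].key && strcmp(h[s].key, x[i])) s = (s + 1) & mask;
    ans[i] = h[s].first_pos;                               /* 0 means "no match" (NA upstream) */
  }
  free(h);
}

Keeping the hash proportional to length(x) rather than length(table) is where the speed-up for lookups into large tables comes from.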

probably also the last benchmark (since we look on par with 1.17.8):

(benchmark comparison plots)
Details
library(atime)
library(data.table)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  '1.17.8' = '4152c6cff80c60896e86b80f8b1c901a4ae93354',
  'truehash' = 'a7f5c6c0c0edd098793837525205244f49c41aa2'
)

set.seed(3)
sample_strings = function(N=10, len=4) {
   do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}

N <- 10^seq(2, 7.5, .25)

# Benchmark 1: Large table 
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(tab_full, N, TRUE)
)

chmatch1 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work1[[as.character(N)]], tab_small),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch1)

# Benchmark 2: Small table (expected case: a few distinct strings - 26 letters)
chmatch_work2 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chmatch2 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work2[[as.character(N)]], letters),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch2)

# Benchmark 3: Medium table (10K unique strings)
medium_table <- sample_strings(10000, len=4)
chmatch_work3 <- lapply(setNames(nm = N), \(N)
  sample(medium_table, N, TRUE)
)
chmatch3 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work3[[as.character(N)]], medium_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch3)

# Benchmark 4: High miss rate (most items not found)
miss_table <- sample_strings(100, len=4)
chmatch_work4 <- lapply(setNames(nm = N), \(N)
  sample_strings(N, len=4)
)
chmatch4 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work4[[as.character(N)]], miss_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch4)

# Benchmark 5: All hits (every item found)
hit_table <- sample_strings(1000, len=4)
chmatch_work5 <- lapply(setNames(nm = N), \(N)
  sample(hit_table, N, TRUE)
)
chmatch5 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work5[[as.character(N)]], hit_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch5)

# Benchmark 6: chin (logical return) with small table
chin_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chin_bench <- atime_versions(
  pkg.path, N,
  expr = data.table::`%chin%`(chin_work[[as.character(N)]], letters),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chin_bench)

# Benchmark 7: Long strings (16 chars)
long_table <- sample_strings(1000, len=16)
chmatch_work7 <- lapply(setNames(nm = N), \(N)
  sample(long_table, N, TRUE)
)
chmatch7 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work7[[as.character(N)]], long_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch7)

# Benchmark 8: Realistic join keys (90% hit rate)
join_table <- sample_strings(10000, len=8)
chmatch_work8 <- lapply(setNames(nm = N), \(N) {
  # 90% from join_table (will be found), 10% new (won't be found)
  c(
    sample(join_table, as.integer(N * 0.9), TRUE),
    sample_strings(as.integer(N * 0.1), len=8)
  )
})
chmatch8 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work8[[as.character(N)]], join_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch8)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(forderv1)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(forderv2)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist1)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist2)
rm(rbindlist2_work); gc(full = TRUE)

# ===== SAVE BENCHMARK RESULTS =====
results_dir <- "benchmark_results"
dir.create(results_dir, showWarnings = FALSE)

# Save R objects
save(
  chmatch1, chmatch2, chmatch3, chmatch4, chmatch5,
  chin_bench, chmatch7, chmatch8,
  forderv1, forderv2,
  rbindlist1, rbindlist2,
  file = file.path(results_dir, "benchmarks_hash.rda")
)

library(atime)
library(data.table)
results_dir <- "benchmark_results"
load(file.path(results_dir, "benchmarks_hash.rda"))

library(ggplot2)
library(patchwork)   # install.packages("patchwork") if needed

# helper: ensure we get a ggplot and add a title
g <- function(p, title) p + ggtitle(title)

p <- list(
  g(plot(chmatch1), "Large table (900K unique)"),
  g(plot(chmatch2), "Small table (26 letters)"),
  g(plot(chmatch3), "Medium table (10K unique)"),
  g(plot(chmatch4), "High miss rate"),
  g(plot(chmatch5), "All hits"),
  g(plot(chin_bench), "chin (logical return)"),
  g(plot(chmatch7), "Long strings (16 chars)"),
  g(plot(chmatch8), "Join keys (90% hit rate)"),
  g(plot(forderv1), "forderv: few distinct strings"),
  g(plot(forderv2), "forderv: all unique strings"),
  g(plot(rbindlist1), "rbindlist: same columns"),
  g(plot(rbindlist2), "rbindlist: different columns")
)

# write a multipage PDF with 2x2 panels per page
pdf("benchmark_results.pdf", width = 10, height = 8)
for (i in seq(1, length(p), by = 4)) {
  pg <- wrap_plots(p[i:min(i+3, length(p))], ncol = 2, nrow = 2, guides = "collect")
  print(pg)
}
dev.off()

@MichaelChirico (Member) commented:

Probably also prudent to add something along the lines of "We've benchmarked the refactored code and find the performance satisfactory, but do report any edge case performance regressions we may have missed."

@aitap (Member Author) commented Dec 14, 2025

Oh no.

# use 6144 elements instead of 6145 and it doesn't trigger
strings = apply(outer(1:6145, 95^(0:2), function(x, y) as.raw(0xa1 + floor(x / y) %% 95)), 1, rawToChar)
Encoding(strings) = "latin1"
DT = data.table(x = strings)
setorder(DT, x)
plot(match(DT$x, sort.int(strings, method='radix')))
it's not a diagonal line at all

Bisect points at c933247. I'm also investigating, but does anyone else have any pointers?


6145 is, of course, 8192*.75+1, where 8192 is the initial true size of the hash table allocated for 4096 elements here:

marks = hash_create(4096); // relatively small to allocate, can grow exponentially later

...and 0.75 is its load factor. The 6145th element triggers a rehash in range_str, at this hash_set_shared call:

marks = hash_set_shared(marks, s, -ustr_n); // unique in any order is fine. first-appearance order is achieved later in count_group

(And according to coverage data, we never rehash in place via hash_set.)


Rehashing itself seems completely fine. The newly created table is empty, as it should be. Once the elements are set, I can walk both tables and verify that the other one looks up the same values.

It doesn't seem to be due to threading either, because this still happens after setDTthreads(1).

Naturally, neither Valgrind nor the sanitizers see any problems.

@aitap (Member Author) commented Dec 14, 2025

Found the problem: changes to marks that happened inside range_str did not propagate to its caller. My bad. Sorry.
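
For the record, the general shape of that bug in a tiny illustrative sketch (hypothetical names, not the actual forder.c code): the callee reassigns its local copy of the table pointer when the hash grows, while the caller keeps using the old one.

#include <stdlib.h>

typedef struct { long *cells; size_t size; } tbl;

static tbl *set_and_maybe_grow(tbl *h) {   /* may replace the table when it must grow */
  tbl *bigger = malloc(sizeof *bigger);
  bigger->size = h->size * 2;
  bigger->cells = calloc(bigger->size, sizeof(long));
  /* ...re-insert old entries, set the new key... */
  return bigger;
}

static void range_str_buggy(tbl *marks) {  /* BUG: the assignment below is local to this frame */
  marks = set_and_maybe_grow(marks);       /* the caller's 'marks' still points at the old table */
}

static void range_str_fixed(tbl **marks) { /* FIX: write the new pointer back to the caller */
  *marks = set_and_maybe_grow(*marks);
}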

aitap and others added 3 commits December 15, 2025 00:05
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: HughParsonage <[email protected]>
Co-authored-by: Jan Gorecki <[email protected]>
@TysonStanley (Member) commented:

Really incredible work here. Looks like some conflicts to resolve before merging. Believe this is the last necessary merge before 1.18.0?

@TysonStanley added this to the 1.18.0 milestone Dec 15, 2025
@aitap (Member Author) commented Dec 15, 2025

Yes, this is the last PR needed to get rid of the non-API NOTE (for now).

Looks like rehashing is unreachable from hash_set because we always either pre-allocate a large enough table, or (in case of forder) pre-populate it with all the elements it needs. What to do, # nocov it for now?

@aitap requested a review from jangorecki as a code owner December 15, 2025 09:46
@aitap (Member Author) commented Dec 15, 2025 via email

@ben-schwen (Member) commented Dec 16, 2025

> Looks like rehashing is unreachable from hash_set because we always either pre-allocate a large enough table, or (in case of forder) pre-populate it with all the elements it needs. What to do, # nocov it for now?

I would not cover it with # nocov, so we become aware when we finally cover it (in the future), but I guess that's just personal preference.

@ben-schwen merged commit e35dec5 into master Dec 17, 2025
11 of 13 checks passed
@ben-schwen deleted the truehash branch December 17, 2025 12:42