
Conversation

@aitap (Member) commented Dec 26, 2024

With apologies to Matt Dowle, who had poured a lot of effort into making data.table go fast.

Ongoing work towards #6180. Unfortunately, this doesn't completely remove any of the non-API entry points by itself. Detailed motivation is here, in a pending blog post. Can't start implementing stretchy ALTREP vectors until data.table stops using TRUELENGTH to mark them.

Currently implemented:

  • An open-addressing hash table using Knuth's multiplicative hashing with linear probing (see the sketch after this list)
    • Deliberately no removal operation, only insertion and update
    • Can be made more performant (collisions much less likely) by switching to tables of power-of-two length and double hashing
  • Almost all uses of TRUELENGTH to mark CHARSXPs or columns replaced with a hash
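
A minimal sketch of what such a table looks like (illustrative only, not the actual src/hash.c: the ht_* names, the pointer-keyed cells and the power-of-two/shift details are assumptions made for this example):

/* Open addressing keyed by pointer (e.g. a CHARSXP address): Knuth's
 * multiplicative hashing to pick the starting slot, linear probing on
 * collision, insertion and update only -- no removal. Illustrative names. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { const void *key; long value; } ht_cell;
typedef struct {
  ht_cell *cells;
  size_t   size;   /* always a power of two */
  int      shift;  /* 64 - log2(size): keep the top bits of the 64-bit product */
} ht;

static size_t ht_index(const ht *h, const void *key) {
  /* multiply by a 64-bit odd constant derived from the golden ratio,
   * then use the high bits as the starting slot */
  return (size_t)(((uint64_t)(uintptr_t)key * UINT64_C(0x9E3779B97F4A7C15)) >> h->shift);
}

static ht *ht_create(size_t n) {  /* room for n keys at <= 50% load factor */
  ht *h = malloc(sizeof *h);
  size_t size = 2; int shift = 63;
  while (size < 2 * n) { size <<= 1; shift--; }
  h->cells = calloc(size, sizeof(ht_cell));
  h->size = size; h->shift = shift;
  return h;
}

static void ht_set(ht *h, const void *key, long value) {
  size_t i = ht_index(h, key);
  for (;;) {
    if (h->cells[i].key == key)  { h->cells[i].value = value; return; }                        /* update */
    if (h->cells[i].key == NULL) { h->cells[i].key = key; h->cells[i].value = value; return; } /* insert */
    i = (i + 1) & (h->size - 1);  /* linear probing: try the next slot */
  }
}

static long ht_lookup(const ht *h, const void *key, long absent) {
  size_t i = ht_index(h, key);
  while (h->cells[i].key != NULL) {
    if (h->cells[i].key == key) return h->cells[i].value;
    i = (i + 1) & (h->size - 1);
  }
  return absent;  /* key was never inserted */
}

Setting a CHARSXP's mark with ht_set and reading it back with ht_lookup plays the role that SET_TRUELENGTH()/TRUELENGTH() marking used to play, without touching the string's header.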

Needs more work:

  • the uses in rbindlist() and forder() pre-allocate memory for the worst-case usage
    • may be noticeably bad for cache performance
    • a dynamically growing hash table would take even more CPU time and cache misses and maybe system calls to resize
  • forder.c, the last remaining user of savetl
    • SET_TRUELENGTH is atomic, hash_set is not, so it will need additional care in a multi-threaded environment
  • savetl machinery in assign.c

Let's just see how much worse the performance is going to get.

@aitap marked this pull request as ready for review December 26, 2024 15:52
codecov bot commented Dec 26, 2024

Codecov Report

❌ Patch coverage is 97.72727% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.04%. Comparing base (6329d94) to head (3119135).
⚠️ Report is 6 commits behind head on master.

Files with missing lines   Patch %   Lines
src/hash.c                 94.50%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6694      +/-   ##
==========================================
- Coverage   99.06%   99.04%   -0.03%     
==========================================
  Files          86       87       +1     
  Lines       16616    16668      +52     
==========================================
+ Hits        16461    16508      +47     
- Misses        155      160       +5     

☔ View full report in Codecov by Sentry.

github-actions bot commented Dec 26, 2024

  • HEAD=truehash slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit 3119135

Download link for the artifact containing the test results: ↓ atime-results.zip

Task                                    Duration
R setup and installing dependencies     2 minutes and 55 seconds
Installing different package versions   43 seconds
Running and plotting the test cases     2 minutes and 40 seconds

@tdhock (Member) commented Dec 29, 2024

hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use := or set to add a lot more columns?

@aitap mentioned this pull request Dec 29, 2024

@aitap (Member Author) commented Dec 29, 2024

The rbindlist() code pre-allocates O(sum(lengths(lapply(l, names)))) memory. In the best case, only O(length(names(l[[1]]))) is needed, a factor of length(l) less. In the worst case, all names are different and the code really does need all that memory for fill = TRUE.

The forder() code pre-allocates O(nrow(x)) memory.
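
In symbols, with n = length(l), the gap between what the code pre-allocates and what is actually needed is

$$\underbrace{\sum_{i=1}^{n} \bigl|\mathrm{names}(l_i)\bigr|}_{\text{pre-allocated}} \;\ge\; \underbrace{\Bigl|\bigcup_{i=1}^{n} \mathrm{names}(l_i)\Bigr|}_{\text{distinct names actually needed}}$$

with equality only when every name is different; when all tables share the same names, the right-hand side collapses to |names(l[[1]])|, a factor of length(l) smaller.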

I'll try giving many data.tables to rbindlist() and long data.tables with string columns to setorder().

@SebKrantz (Member) commented:

Perhaps you can also try a fast 3rd party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/

In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash

@aitap (Member Author) commented Dec 31, 2024

It's pretty bad. For typical cases, the current hash table eats more than an order of magnitude more R memory*, and it is similarly slower in forder():
forder, typical case
rbindlist, typical case

Details

The hash table is only on par by time in the worst case for forder() (all strings different) and might be on par in the worst case for rbindlist(fill = TRUE) (all column names different), but the latter is completely unrealistic as a use case.

forder, worst case
rbindlist, worst case

# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master   = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)

N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

* Edit: Some of the memory use in data.table is invisible to Rprofmem(). For forder(), data.table is said to allocate as much memory as one extra column, i.e. 8 bytes per CHARSXP element on a 64-bit computer. The hash table preallocates enough space for 2*N (original + UTF-8) cells at a 50% load factor (so twice as many again), with each cell consisting of a SEXP pointer and an R_xlen_t mark, i.e. 2 * 2 * 16 = 64 extra bytes per element on top of the normal memory use. Measuring forder()'s memory use as getrusage()'s ru_maxrss and subtracting the value before the function call gives coef(lm(forderuse_kb*1024 ~ N:version, x))[-1] as N:versionmaster = 8.993967 and N:versiontruehash = 72.178404, which is close to the theory.
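
Spelling that estimate out, the predicted per-element slopes (under the assumptions above) are

$$\text{master: } 8 \text{ bytes}, \qquad \text{truehash: } 8 + \underbrace{2}_{\text{original + UTF-8}} \times \underbrace{2}_{50\%\ \text{load}} \times \underbrace{(8+8)}_{\text{SEXP + R\_xlen\_t}} = 72 \text{ bytes},$$

against the fitted 8.99 and 72.18 bytes per element respectively.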

I'll try profiling the code.

Thanks @SebKrantz for the link, a newer benchmark by the same author is also very instructive.

@tdhock (Member) commented Dec 31, 2024

thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order of magnitude constant factor increase in time/memory usage.

In forder() and rbindlist(), there is no good upper boundary on the number of elements in the hash known ahead of time. Grow the hash table dynamically. Since the R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
@aitap (Member Author) commented Jan 1, 2025

Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to make the slowdown factor closer to 2 by dynamically re-allocating the hash table:
dynamic_hash

The memory use is significantly reduced (except for the worst cases), but cannot be measured with atime (just like other uses of calloc/malloc/realloc in data.table are invisible to Rprofmem).

Details
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master   = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
  dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

The main problem with the current approach is that since the parallel loop in src/forder.c:range_str(...) needs to read the hash table from outside the critical section while hash elements are being set inside the critical section, it's possible that dhash_set() will need to reallocate the table while dhash_lookup() is using it. I have already tried using pthread R/W locks and OpenMP critical sections; they were too slow. I wasn't able to use OpenMP atomics to make atomic reads/writes through a pointer (maybe that's asking for too much). Any third-party hash table will also need to be outfitted with appropriate locking.

The current code keeps one previous hash table around until the next reallocation cycle and hopes that dhash_enlarge writes the new table pointer and the metadata into the object close enough together that dhash_lookup will read a consistent view of it. This is not good enough for production because (1) the race window still exists and (2) there's no proof that there won't be a stale reader still using the previous table array when the current writer deallocates it. What can be done about it:

  • Don't deallocate the previous hash tables until the function is completely done with them (a minimal sketch follows this list). This will double the memory use (because the code doubles the buffer when re-allocating it and $\sum_{i=0}^{n} 2^i = 2^{n+1} - 1$). We'll be able to switch back from malloc to R_alloc and make the allocations visible to R's allocator.
  • Devise a read-copy-update scheme (see also: 1 2) to properly expire the previous versions of the hash table. These may require synchronisation primitives not easily available through OpenMP. Reference-counting the hash tables may result in contention and slowdowns similar to pthread R/W locks.
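
A minimal sketch of the first option, with hypothetical names (this is not the PR's dhash code): a superseded cell array is retired instead of freed, so a stale reader can keep probing it safely, and everything is released only once the whole operation is over.

/* Grow by doubling, but retire the old array instead of freeing it, because
 * a reader outside the critical section may still be probing it.  All
 * retired arrays are freed together once the parallel work is finished. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { const void *key; long value; } dht_cell;
typedef struct {
  dht_cell *cells;
  size_t    size;         /* power of two */
  int       shift;        /* 64 - log2(size) */
  dht_cell *retired[64];  /* superseded arrays; 64 doublings is far more than enough */
  int       n_retired;
} dht;

static size_t dht_start(const void *key, int shift) {
  return (size_t)(((uint64_t)(uintptr_t)key * UINT64_C(0x9E3779B97F4A7C15)) >> shift);
}

static int dht_enlarge(dht *h) {
  size_t newsize = h->size * 2;
  int newshift = h->shift - 1;
  dht_cell *bigger = calloc(newsize, sizeof(dht_cell));
  if (!bigger) return 0;                      /* let the caller signal the error */
  for (size_t i = 0; i < h->size; i++) {      /* rehash every occupied old cell */
    if (!h->cells[i].key) continue;
    size_t j = dht_start(h->cells[i].key, newshift);
    while (bigger[j].key) j = (j + 1) & (newsize - 1);
    bigger[j] = h->cells[i];
  }
  h->retired[h->n_retired++] = h->cells;      /* do NOT free: stale readers may remain */
  h->cells = bigger; h->size = newsize; h->shift = newshift;
  return 1;
}

static void dht_free(dht *h) {                /* call only after the parallel loop is done */
  for (int i = 0; i < h->n_retired; i++) free(h->retired[i]);
  free(h->cells);
}

Since each retired array is half the size of its successor, the total memory kept alive stays below twice the final table, which is the factor-of-two cost estimated above.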

@SebKrantz (Member) commented Jan 1, 2025

In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well - inspired by base R: a multiplication hash using an unsigned integer prime number. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function, curious to see if it can do much better.

PS: you can test collapse::fmatch() against your current attempts at rewriting chmatch().

@aitap (Member Author) commented Jan 8, 2025

The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.

forder() performance on a load with a few unique strings

forder() performance on a pathological load with all strings unique

Details
library(atime)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
  'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
  'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)

save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')

collapse::fmatch() works very well, faster than data.table::chmatch() even in its current form:

chmatch() vs. fmatch() on a best case load

Details
library(atime)
library(data.table)

limit <- 1

# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chmatch1 <- atime(
  N,
  seconds.limit = limit, verbose = TRUE,
  master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
  Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
  collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)

save(chmatch1, file = 'times_collapse.rda')

And the real memory cost isn't even that large:
memory use as returned by getrusage()

Details
library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)

# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
  struct rusage ru;
  int ret = getrusage(RUSAGE_SELF, &ru);
  *kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))

limit <- 1

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
versions <- expression(
  master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
  Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
  collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)

plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
  # use a disposable child process
  mccollect(mcparallel({
    eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
    .C('maxrss', kb = double(1))$kb
  }))
})

rm(chmatch_work1); gc(full = TRUE)

save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
  maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
  auto.key = TRUE, scales = list(log = 10),
  par.settings=list(superpose.symbol=list(pch=19))
)

@SebKrantz (Member) commented:

Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + a large table still wins. Anyway, if you want to adopt it, feel free to copy it under your MPL license. Just mention me at the top of the file and as a contributor.

  • PROTECT() the corresponding EXTPTRSXP while used.
  • Introduce a separate hash_set_shared() operation that avoids long jumps.
  • Deallocate the previous hash table when growing a non-shared hash table.
  • Explicitly check the return value and update the shared pointer when necessary. If a reallocation attempt fails, signal an error when possible.
  • Avoid memory leaks from potential long jumps in hash_set().
  • This avoids a memory leak in case growVector or hash_set fails.
  • Prevent memory leak in case hash_set() causes a long jump.
  • Turns out it wasn't used.
  • Where C-heap allocations also exist, catch and handle potential allocation failures from the hash table.
@aitap changed the title from "Start replacing TRUELENGTH markers with a hash" to "Replace TRUELENGTH markers with a hash" Dec 12, 2025
@ben-schwen (Member) commented:

We should probably also add a NOTE regarding all the changes

Internal use of declared non-API R functions SETLENGTH, TRUELENGTH, SET_TRUELENGTH, and SET_GROWABLE_BIT has been eliminated. Most usages have been migrated to R's experimental resizable vectors API (thanks to @ltierney, introduced in R 4.6.0, backported for older R versions), #7451. Uses of TRUELENGTH for marking seen items during grouping and binding operations (aka free hash table trick) have been replaced with proper hash tables, #6694. The new hash table implementation uses linear probing with power of 2 tables and automatic resizing. Additionally, chmatch() now hashes the needle (x) instead of the haystack (table) when length(table) >> length(x), significantly improving performance for lookups into large tables. TODO ADD MORE INFO about tables?
Thanks to @aitap, @ben-schwen, @jangorecki and @HughParsonage for implementation and reviews.
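
The chmatch() change mentioned in the NOTE works roughly as follows (an illustrative, self-contained sketch with plain C strings and made-up names, not the actual chmatch code): hash the short needle vector x once, then resolve first-occurrence positions in a single pass over the much longer table.

/* ans[i] = 1-based index in table of the first occurrence of x[i], or 0 if absent.
 * The hash holds only the distinct needle strings, so memory stays O(length(x))
 * even when the table is huge. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *key; int first_pos; } ncell;  /* first_pos: 1-based, 0 = unresolved */

static uint64_t str_hash(const char *s) {                  /* FNV-1a, a stand-in for the real hash */
  uint64_t h = UINT64_C(0xcbf29ce484222325);
  for (const unsigned char *p = (const unsigned char *)s; *p; p++)
    h = (h ^ *p) * UINT64_C(0x100000001b3);
  return h;
}

void match_short_needle(const char **x, int nx, const char **table, int ntab, int *ans) {
  size_t size = 2;                                         /* power of two, >= 2*nx (<= 50% load) */
  while (size < 2 * (size_t)nx) size <<= 1;
  size_t mask = size - 1;
  ncell *h = calloc(size, sizeof(ncell));

  for (int i = 0; i < nx; i++) {                           /* 1. hash the (short) needle */
    size_t s = str_hash(x[i]) & mask;
    while (h[s].key && strcmp(h[s].key, x[i])) s = (s + 1) & mask;
    if (!h[s].key) h[s].key = x[i];                        /* first_pos stays 0: not seen in table yet */
  }
  for (int j = 0; j < ntab; j++) {                         /* 2. one pass over the huge table */
    size_t s = str_hash(table[j]) & mask;
    while (h[s].key && strcmp(h[s].key, table[j])) s = (s + 1) & mask;
    if (h[s].key && h[s].first_pos == 0) h[s].first_pos = j + 1;
  }
  for (int i = 0; i < nx; i++) {                           /* 3. read the answers back */
    size_t s = str_hash(x[i]) & mask;
    while (h[s].key && strcmp(h[s].key, x[i])) s = (s + 1) & mask;
    ans[i] = h[s].first_pos;                               /* 0 means "no match" (NA upstream) */
  }
  free(h);
}

Keeping the hash proportional to length(x) rather than length(table) is where the speed-up for lookups into large tables comes from.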

probably also the last benchmark (since we look on par with 1.17.8):

(benchmark comparison plots)
Details
library(atime)
library(data.table)

pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}

versions <- c(
  '1.17.8' = '4152c6cff80c60896e86b80f8b1c901a4ae93354',
  'truehash' = 'a7f5c6c0c0edd098793837525205244f49c41aa2'
)

set.seed(3)
sample_strings = function(N=10, len=4) {
   do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}

N <- 10^seq(2, 7.5, .25)

# Benchmark 1: Large table 
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
chmatch_work1 <- lapply(setNames(nm = N), \(N)
  sample(tab_full, N, TRUE)
)

chmatch1 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work1[[as.character(N)]], tab_small),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch1)

# Benchmark 2: Small table (expected case: a few distinct strings - 26 letters)
chmatch_work2 <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chmatch2 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work2[[as.character(N)]], letters),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch2)

# Benchmark 3: Medium table (10K unique strings)
medium_table <- sample_strings(10000, len=4)
chmatch_work3 <- lapply(setNames(nm = N), \(N)
  sample(medium_table, N, TRUE)
)
chmatch3 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work3[[as.character(N)]], medium_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch3)

# Benchmark 4: High miss rate (most items not found)
miss_table <- sample_strings(100, len=4)
chmatch_work4 <- lapply(setNames(nm = N), \(N)
  sample_strings(N, len=4)
)
chmatch4 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work4[[as.character(N)]], miss_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch4)

# Benchmark 5: All hits (every item found)
hit_table <- sample_strings(1000, len=4)
chmatch_work5 <- lapply(setNames(nm = N), \(N)
  sample(hit_table, N, TRUE)
)
chmatch5 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work5[[as.character(N)]], hit_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch5)

# Benchmark 6: chin (logical return) with small table
chin_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
chin_bench <- atime_versions(
  pkg.path, N,
  expr = data.table::`%chin%`(chin_work[[as.character(N)]], letters),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chin_bench)

# Benchmark 7: Long strings (16 chars)
long_table <- sample_strings(1000, len=16)
chmatch_work7 <- lapply(setNames(nm = N), \(N)
  sample(long_table, N, TRUE)
)
chmatch7 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work7[[as.character(N)]], long_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch7)

# Benchmark 8: Realistic join keys (90% hit rate)
join_table <- sample_strings(10000, len=8)
chmatch_work8 <- lapply(setNames(nm = N), \(N) {
  # 90% from join_table (will be found), 10% new (won't be found)
  c(
    sample(join_table, as.integer(N * 0.9), TRUE),
    sample_strings(as.integer(N * 0.1), len=8)
  )
})
chmatch8 <- atime_versions(
  pkg.path, N,
  expr = data.table::chmatch(chmatch_work8[[as.character(N)]], join_table),
  seconds.limit = limit, verbose = TRUE, sha.vec = versions,
  pkg.edit.fun = pkg.edit.fun
)
plot(chmatch8)

N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
  sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(forderv1)
rm(forderv1_work); gc(full = TRUE)

# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
  format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
  pkg.path, N,
  expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(forderv2)
rm(forderv2_work); gc(full = TRUE)

# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
  rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist1)
rm(rbindlist1_work); gc(full = TRUE)

# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
  replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
  pkg.path, N,
  expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
  sha.vec = versions, seconds.limit = limit, verbose = TRUE,
  pkg.edit.fun = pkg.edit.fun
)
plot(rbindlist2)
rm(rbindlist2_work); gc(full = TRUE)

# ===== SAVE BENCHMARK RESULTS =====
results_dir <- "benchmark_results"
dir.create(results_dir, showWarnings = FALSE)

# Save R objects
save(
  chmatch1, chmatch2, chmatch3, chmatch4, chmatch5,
  chin_bench, chmatch7, chmatch8,
  forderv1, forderv2,
  rbindlist1, rbindlist2,
  file = file.path(results_dir, "benchmarks_hash.rda")
)

library(atime)
library(data.table)
results_dir <- "benchmark_results"
load(file.path(results_dir, "benchmarks_hash.rda"))

library(ggplot2)
library(patchwork)   # install.packages("patchwork") if needed

# helper: ensure we get a ggplot and add a title
g <- function(p, title) p + ggtitle(title)

p <- list(
  g(plot(chmatch1), "Large table (900K unique)"),
  g(plot(chmatch2), "Small table (26 letters)"),
  g(plot(chmatch3), "Medium table (10K unique)"),
  g(plot(chmatch4), "High miss rate"),
  g(plot(chmatch5), "All hits"),
  g(plot(chin_bench), "chin (logical return)"),
  g(plot(chmatch7), "Long strings (16 chars)"),
  g(plot(chmatch8), "Join keys (90% hit rate)"),
  g(plot(forderv1), "forderv: few distinct strings"),
  g(plot(forderv2), "forderv: all unique strings"),
  g(plot(rbindlist1), "rbindlist: same columns"),
  g(plot(rbindlist2), "rbindlist: different columns")
)

# write a multipage PDF with 2x2 panels per page
pdf("benchmark_results.pdf", width = 10, height = 8)
for (i in seq(1, length(p), by = 4)) {
  pg <- wrap_plots(p[i:min(i+3, length(p))], ncol = 2, nrow = 2, guides = "collect")
  print(pg)
}
dev.off()

@MichaelChirico (Member) commented:

Probably also prudent to add something along the lines of "We've benchmarked the refactored code and find the performance satisfactory, but do report any edge case performance regressions we may have missed."

@aitap (Member Author) commented Dec 14, 2025

Oh no.

# use 6144 elements instead of 6145 and it doesn't trigger
strings = apply(outer(1:6145, 95^(0:2), function(x, y) as.raw(0xa1 + floor(x / y) %% 95)), 1, rawToChar)
Encoding(strings) = "latin1"
DT = data.table(x = strings)
setorder(DT, x)
plot(match(DT$x, sort.int(strings, method='radix')))
it's not a diagonal line at all

Bisect points at c933247. I'm also investigating, but does anyone else have any pointers?


6145 is, of course, 8192*.75+1, where 8192 is the initial true size of the hash table allocated for 4096 elements here:

marks = hash_create(4096); // relatively small to allocate, can grow exponentially later

...and 0.75 is its load factor. The 6145th element triggers a rehash in range_str, at this hash_set_shared call:

marks = hash_set_shared(marks, s, -ustr_n); // unique in any order is fine. first-appearance order is achieved later in count_group

(And according to coverage data, we never rehash in place via hash_set.)


Rehashing itself seems completely fine. The newly created table is empty, as it should be. Once the elements are set, I can walk both tables and verify that the other one looks up the same values.

It doesn't seem to be due to threading either, because this still happens after setDTthreads(1).

Naturally, neither Valgrind nor the sanitizers see any problems.

@aitap (Member Author) commented Dec 14, 2025

Found the problem: changes to marks that happened inside range_str did not propagate to its caller. My bad. Sorry.
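
For the record, the general shape of that bug in a tiny illustrative sketch (hypothetical names, not the actual forder.c code): the callee reassigns its local copy of the table pointer when the hash grows, while the caller keeps using the old one.

#include <stdlib.h>

typedef struct { long *cells; size_t size; } tbl;

static tbl *set_and_maybe_grow(tbl *h) {   /* may replace the table when it must grow */
  tbl *bigger = malloc(sizeof *bigger);
  bigger->size = h->size * 2;
  bigger->cells = calloc(bigger->size, sizeof(long));
  /* ...re-insert old entries, set the new key... */
  return bigger;
}

static void range_str_buggy(tbl *marks) {  /* BUG: the assignment below is local to this frame */
  marks = set_and_maybe_grow(marks);       /* the caller's 'marks' still points at the old table */
}

static void range_str_fixed(tbl **marks) { /* FIX: write the new pointer back to the caller */
  *marks = set_and_maybe_grow(*marks);
}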

aitap and others added 3 commits December 15, 2025 00:05
Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: HughParsonage <[email protected]>
Co-authored-by: Jan Gorecki <[email protected]>
@TysonStanley (Member) commented:

Really incredible work here. Looks like some conflicts to resolve before merging. Believe this is the last necessary merge before 1.18.0?

@TysonStanley added this to the 1.18.0 milestone Dec 15, 2025
@aitap (Member Author) commented Dec 15, 2025

Yes, this is the last PR needed to get rid of the non-API NOTE (for now).

Looks like rehashing is unreachable from hash_set because we always either pre-allocate a large enough table, or (in case of forder) pre-populate it with all the elements it needs. What to do, # nocov it for now?

@aitap requested a review from jangorecki as a code owner December 15, 2025 09:46
@aitap (Member Author) commented Dec 15, 2025 via email

@ben-schwen (Member) commented Dec 16, 2025

> Looks like rehashing is unreachable from hash_set because we always either pre-allocate a large enough table, or (in case of forder) pre-populate it with all the elements it needs. What to do, # nocov it for now?

I would not cover it with # nocov, so we become aware when we finally cover it (in the future), but I guess that's just personal preference.

@ben-schwen merged commit e35dec5 into master Dec 17, 2025
11 of 13 checks passed
@ben-schwen deleted the truehash branch December 17, 2025 12:42