t3404: adjust collision test case for SHA-256 by dscho · Pull Request #1 · bk2204/git

dscho · 2020-01-16T15:03:15Z

This PR first cherry-picks gitgitgadget#529 and then adjusts the test case so that it succeeds with SHA-256, too.

@bk2204 I am not quite sure how we want to handle this: your work is further along than mine. So maybe you want to go first?

We could also slip in this fix as a "preparatory pair of patches". What would be your preference?

Instead of hard-coding the length of an object ID when creating a tree, generate it based on $ZERO_OID. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Since the object ID used in the index line will differ between different algorithms, compute these values instead of hard-coding them. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Instead of using a specific invalid hard-coded object ID, generate one of the appropriate length by looking one up in the translation tables. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Instead of hard-coding a fixed length example object ID in the test, look one up using the translation tables. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Adjust the test so that it computes values for object IDs instead of using hard-coded hashes. Additionally, update the sanitize_output function to sanitize the index lines in diff output, since it's clear from the assertions in question that we are not interested in the specific object IDs. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Use $OID_REGEX instead of a hard-coded regular expression. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Adjust the test so that it computes values for object IDs instead of using hard-coded hashes. Move the heredocs later in the tests so we can take advantage of computed values. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

When running with SHA-256 as the hash algorithm, the hash version octet is 2 instead of 1. Pick the right value depending on the hash algorithm and use it where we look for the existing value. To ensure the test checking for invalid data passes, use 3 as the test value for an invalid hash version. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

When using SHA-1, the existing value of the byte we use is 0x13, so writing the byte 0x07 serves to corrupt the test and verify that we detect corruption. However, when we use SHA-256, the value at that offset is already 0x07, so our "corruption" doesn't work and the test fails to detect it. To provide a value that is truly out of range, let's use 0xff, which is not likely to be a valid value as the high byte of a two-byte offset in a multi-pack index this small. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

This test corrupts various locations in a multi-pack index to test various error responses. However, these offsets differ between SHA-1 indexes and SHA-256 indexes due to differences in object length. Use test_oid to look up the correct offsets based on the algorithm.

There are some offsets in the commit graph files used to corrupt data. Compute these offsets for both SHA-1 and SHA-256 so that the test works with either. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Instead of hard-coding invalid object IDs in this test, use test_oid to look up ones of the appropriate length. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Use $OID_REGEX instead of hard-coding 40-based regular expressions. Change invocations of cut with a hard-coded constant to split using a delimiter instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Adjust the test so that it computes variables for object IDs instead of using hard-coded hashes. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Compute the various pkt-line values based on the length of the object IDs in use. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

This test modifies a pkt-line stream with sed to change a line with "shallow" to "unshallow". However, this modification is dependent on the size of the hash in use; with SHA-256, the pkt-line length is different, leading to the sed command having no effect. Use test_oid_cache to specify the correct values for each hash so that the test continues to work. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Use regex values based on $OID_REGEX instead of hard-coding them based on expected object ID lengths. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

This test uses $_z40 to express an all-zeros object ID, which doesn't work for SHA-256. Use $ZERO_OID instead, which is the right size for all hash values. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

This test performs a clone from outside any repository. Consequently, the hash algorithm used defaults to SHA-1. When the test is running with SHA-256, this results in an object ID that is not usable by the rest of the test. In order to ensure that we provide a usable value, switch into the source repository before hashing. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

To make our values hash independent, we turn the directory of the object into "Y" and the file name into "Z" after having sorted items by their name. However, when using SHA-256, one of our file names begins with an "a" character, which means it sorts into the wrong place in the list, causing the test to fail. Since we don't care about the order of these items, just sort them after stripping actual hash contents, which means they'll work with any hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Update the support routines for generating packs to support both SHA-1 and SHA-256. Compute the trailing pack checksum and its length correctly depending on the algorithm, and look up the object names based on the algorithm as well. Ensure we initialize the algorithm facts so that our callers need not do so. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Fix the one assertion in this test that still uses SHA-1 to use test_oid to be independent of the hash. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

This test relies on a roughly equal distribution of hashes for notes in order to ensure that fanouts are compressed. If there are subtrees with only one item left after removing notes, they'll end up still with one level of fanout, causing the test to fail. The test happens to pass with SHA-1, but doesn't necessarily with other hash algorithms, so annotate it with the SHA1 prerequisite. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Replace the hard-coded SHA-1 constants with the use of test_oid to look up an appropriate constant for each hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Replace the hard-coded SHA-1 constants with the use of test_oid to look up an appropriate constant for each hash algorithm. In addition, adjust the fanout checks to look for either zero or one slashes in the filename without needing to check for an explicit length. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Despite the description in the test, the test does not actually test for an SHA-1 collision; none of the object IDs used in the test collide. As such, it does not need to be adjusted in any way for SHA-256, so remove the SHA1 prerequisite. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

cache_entry contains an object_id, and compare_ce_content() would include that field when calling memcmp on a subset of the cache_entry. Depending on which hashing algorithm is being used, only part of object_id.hash is actually being used, therefore including it in a memcmp() is incorrect. Instead we choose to exclude the object_id when calling memcmp(), and call oideq() separately. This issue was found when running t1700-split-index with MSAN, see MSAN output below (on my machine, offset 76 corresponds to 4 bytes after the start of object_id.hash). Uninitialized bytes in MemcmpInterceptorCommon at offset 76 inside [0x7f60e7c00118, 92) ==27914==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x4524ee in memcmp /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/msan/../sanitizer_common/sanitizer_common_interceptors.inc:873:10 #1 0xc867ae in compare_ce_content /home/ahunt/git/git/split-index.c:208:8 #2 0xc859fb in prepare_to_write_split_index /home/ahunt/git/git/split-index.c:336:9 #3 0xb4bbca in write_split_index /home/ahunt/git/git/read-cache.c:3107:2 #4 0xb42b4d in write_locked_index /home/ahunt/git/git/read-cache.c:3295:8 #5 0x638058 in try_merge_strategy /home/ahunt/git/git/builtin/merge.c:758:7 git#6 0x63057f in cmd_merge /home/ahunt/git/git/builtin/merge.c:1663:9 #7 0x4a1e76 in run_builtin /home/ahunt/git/git/git.c:461:11 git#8 0x49e1e7 in handle_builtin /home/ahunt/git/git/git.c:714:3 git#9 0x4a0c08 in run_argv /home/ahunt/git/git/git.c:781:4 git#10 0x49d5a8 in cmd_main /home/ahunt/git/git/git.c:912:19 git#11 0x7974da in main /home/ahunt/git/git/common-main.c:52:11 git#12 0x7f60e928e349 in __libc_start_main (/lib64/libc.so.6+0x24349) git#13 0x421bd9 in _start /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120 Uninitialized value was stored to memory at #0 0x447eb9 in __msan_memcpy /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3 #1 0xb4d1e6 in dup_cache_entry /home/ahunt/git/git/read-cache.c:3457:2 #2 0xd214fa in add_entry /home/ahunt/git/git/unpack-trees.c:215:18 #3 0xd1fae0 in keep_entry /home/ahunt/git/git/unpack-trees.c:2276:2 #4 0xd1ff9e in twoway_merge /home/ahunt/git/git/unpack-trees.c:2504:11 #5 0xd27028 in call_unpack_fn /home/ahunt/git/git/unpack-trees.c:593:12 git#6 0xd2443d in unpack_nondirectories /home/ahunt/git/git/unpack-trees.c:1106:12 #7 0xd19435 in unpack_callback /home/ahunt/git/git/unpack-trees.c:1306:6 git#8 0xd0d7ff in traverse_trees /home/ahunt/git/git/tree-walk.c:532:17 git#9 0xd1773a in unpack_trees /home/ahunt/git/git/unpack-trees.c:1683:9 git#10 0xdc6370 in checkout /home/ahunt/git/git/merge-ort.c:3590:8 git#11 0xdc51c3 in merge_switch_to_result /home/ahunt/git/git/merge-ort.c:3728:7 git#12 0xa195a9 in merge_ort_recursive /home/ahunt/git/git/merge-ort-wrappers.c:58:2 git#13 0x637fff in try_merge_strategy /home/ahunt/git/git/builtin/merge.c:751:12 git#14 0x63057f in cmd_merge /home/ahunt/git/git/builtin/merge.c:1663:9 git#15 0x4a1e76 in run_builtin /home/ahunt/git/git/git.c:461:11 git#16 0x49e1e7 in handle_builtin /home/ahunt/git/git/git.c:714:3 git#17 0x4a0c08 in run_argv /home/ahunt/git/git/git.c:781:4 git#18 0x49d5a8 in cmd_main /home/ahunt/git/git/git.c:912:19 git#19 0x7974da in main /home/ahunt/git/git/common-main.c:52:11 Uninitialized value was created by a heap allocation #0 0x44e73d in malloc /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/msan/msan_interceptors.cpp:901:3 #1 0xd592f6 in do_xmalloc /home/ahunt/git/git/wrapper.c:41:8 #2 0xd59248 in xmalloc /home/ahunt/git/git/wrapper.c:62:9 #3 0xa17088 in mem_pool_alloc_block /home/ahunt/git/git/mem-pool.c:22:6 #4 0xa16f78 in mem_pool_init /home/ahunt/git/git/mem-pool.c:44:3 #5 0xb481b8 in load_all_cache_entries /home/ahunt/git/git/read-cache.c git#6 0xb44d40 in do_read_index /home/ahunt/git/git/read-cache.c:2298:17 #7 0xb48a1b in read_index_from /home/ahunt/git/git/read-cache.c:2389:8 git#8 0xbd5a0b in repo_read_index /home/ahunt/git/git/repository.c:276:8 git#9 0xb4bcaf in repo_read_index_unmerged /home/ahunt/git/git/read-cache.c:3326:2 git#10 0x62ed26 in cmd_merge /home/ahunt/git/git/builtin/merge.c:1362:6 git#11 0x4a1e76 in run_builtin /home/ahunt/git/git/git.c:461:11 git#12 0x49e1e7 in handle_builtin /home/ahunt/git/git/git.c:714:3 git#13 0x4a0c08 in run_argv /home/ahunt/git/git/git.c:781:4 git#14 0x49d5a8 in cmd_main /home/ahunt/git/git/git.c:912:19 git#15 0x7974da in main /home/ahunt/git/git/common-main.c:52:11 git#16 0x7f60e928e349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/abuild/rpmbuild/BUILD/llvm-11.0.0.src/build/../projects/compiler-rt/lib/msan/../sanitizer_common/sanitizer_common_interceptors.inc:873:10 in memcmp Exiting Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

…ints report_result() sends a struct to the parent process, but that struct would contain uninitialised padding bytes. Running this code under MSAN rightly triggers a warning - but we don't particularly care about this warning because we control the receiving code, and we therefore know that those padding bytes won't be read on the receiving end. We could simply suppress this warning under MSAN with the approporiate ifdef'd attributes, but a less intrusive solution is to 0-initialise the struct, which guarantees that the padding will also be initialised. Interestingly, in the error-case branch, we only try to copy the first two members of pc_item_result, by copying only PC_ITEM_RESULT_BASE_SIZE bytes. However PC_ITEM_RESULT_BASE_SIZE is defined as 'offsetof(the_last_member)', which means that we're copying padding bytes after the end of the second last member. We could avoid doing this by redefining PC_ITEM_RESULT_BASE_SIZE as 'offsetof(second_last_member) + sizeof(second_last_member)', but there's no huge benefit to doing so (and this patch silences the MSAN warning in this scenario either way). MSAN output from t2080 (partially interleaved due to the parallel work :) ): Uninitialized bytes in __interceptor_write at offset 12 inside [0x7fff37d83408, 160) ==23279==WARNING: MemorySanitizer: use-of-uninitialized-value Uninitialized bytes in __interceptor_write at offset 12 inside [0x7ffdb8a07ec8, 160) ==23280==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0xd5ac28 in xwrite /home/ahunt/git/git/wrapper.c:256:8 #1 0xd5b327 in write_in_full /home/ahunt/git/git/wrapper.c:311:21 #2 0xb0a8c4 in do_packet_write /home/ahunt/git/git/pkt-line.c:221:6 #3 0xb0a5fd in packet_write /home/ahunt/git/git/pkt-line.c:242:6 #4 0x4f7441 in report_result /home/ahunt/git/git/builtin/checkout--worker.c:69:2 #5 0x4f6be6 in worker_loop /home/ahunt/git/git/builtin/checkout--worker.c:100:3 git#6 0x4f68d3 in cmd_checkout__worker /home/ahunt/git/git/builtin/checkout--worker.c:143:2 #7 0x4a1e76 in run_builtin /home/ahunt/git/git/git.c:461:11 git#8 0x49e1e7 in handle_builtin /home/ahunt/git/git/git.c:714:3 git#9 0x4a0c08 in run_argv /home/ahunt/git/git/git.c:781:4 git#10 0x49d5a8 in cmd_main /home/ahunt/git/git/git.c:912:19 git#11 0x7974da in main /home/ahunt/git/git/common-main.c:52:11 git#12 0x7f8778114349 in __libc_start_main (/lib64/libc.so.6+0x24349) git#13 0x421bd9 in _start /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120 Uninitialized value was created by an allocation of 'res' in the stack frame of function 'report_result' #0 0x4f72c0 in report_result /home/ahunt/git/git/builtin/checkout--worker.c:55 SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/ahunt/git/git/wrapper.c:256:8 in xwrite Exiting #0 0xd5ac28 in xwrite /home/ahunt/git/git/wrapper.c:256:8 #1 0xd5b327 in write_in_full /home/ahunt/git/git/wrapper.c:311:21 #2 0xb0a8c4 in do_packet_write /home/ahunt/git/git/pkt-line.c:221:6 #3 0xb0a5fd in packet_write /home/ahunt/git/git/pkt-line.c:242:6 #4 0x4f7441 in report_result /home/ahunt/git/git/builtin/checkout--worker.c:69:2 #5 0x4f6be6 in worker_loop /home/ahunt/git/git/builtin/checkout--worker.c:100:3 git#6 0x4f68d3 in cmd_checkout__worker /home/ahunt/git/git/builtin/checkout--worker.c:143:2 #7 0x4a1e76 in run_builtin /home/ahunt/git/git/git.c:461:11 git#8 0x49e1e7 in handle_builtin /home/ahunt/git/git/git.c:714:3 git#9 0x4a0c08 in run_argv /home/ahunt/git/git/git.c:781:4 git#10 0x49d5a8 in cmd_main /home/ahunt/git/git/git.c:912:19 git#11 0x7974da in main /home/ahunt/git/git/common-main.c:52:11 git#12 0x7f2749a0e349 in __libc_start_main (/lib64/libc.so.6+0x24349) git#13 0x421bd9 in _start /home/abuild/rpmbuild/BUILD/glibc-2.26/csu/../sysdeps/x86_64/start.S:120 Uninitialized value was created by an allocation of 'res' in the stack frame of function 'report_result' #0 0x4f72c0 in report_result /home/ahunt/git/git/builtin/checkout--worker.c:55 SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/ahunt/git/git/wrapper.c:256:8 in xwrite Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

origin starts off pointing to somewhere within line, which is owned by the caller. Later we might allocate a new string using xmemdupz() or xstrfmt(). To avoid leaking these new strings, we introduce a to_free pointer - which allows us to safely free the newly allocated string when we're done (we cannot just free origin directly as it might still be pointing to line). LSAN output from t0090: Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x49a82d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0xa71f49 in do_xmalloc wrapper.c:41:8 #2 0xa720b0 in do_xmallocz wrapper.c:75:8 #3 0xa720b0 in xmallocz wrapper.c:83:9 #4 0xa720b0 in xmemdupz wrapper.c:99:16 #5 0x8092ba in handle_line fmt-merge-msg.c:187:23 git#6 0x8092ba in fmt_merge_msg fmt-merge-msg.c:666:7 #7 0x5ce2e6 in prepare_merge_message builtin/merge.c:1119:2 git#8 0x5ce2e6 in collect_parents builtin/merge.c:1215:3 git#9 0x5c9c1e in cmd_merge builtin/merge.c:1454:16 git#10 0x4ce83e in run_builtin git.c:475:11 git#11 0x4ccafe in handle_builtin git.c:729:3 git#12 0x4cb01c in run_argv git.c:818:4 git#13 0x4cb01c in cmd_main git.c:949:19 git#14 0x6b3fad in main common-main.c:52:11 git#15 0x7fb929620349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

realpath is only populated if we execute the git_work_tree_initialized block. However that block also causes us to return early, meaning we never actually release the strbuf in the case where we populated it. Therefore we move all strbuf related code into the block to guarantee that we can't leak it. LSAN output from t0095: Direct leak of 129 byte(s) in 1 object(s) allocated from: #0 0x49a9b9 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x78f585 in xrealloc wrapper.c:126:8 #2 0x713ff4 in strbuf_grow strbuf.c:98:2 #3 0x713ff4 in strbuf_getcwd strbuf.c:597:3 #4 0x4f0c18 in strbuf_realpath_1 abspath.c:99:7 #5 0x5ae4a4 in set_git_work_tree environment.c:259:3 git#6 0x6fdd8a in setup_discovered_git_dir setup.c:931:2 #7 0x6fdd8a in setup_git_directory_gently setup.c:1235:12 git#8 0x4cb50d in get_bloom_filter_for_commit t/helper/test-bloom.c:41:2 git#9 0x4cb50d in cmd__bloom t/helper/test-bloom.c:95:3 git#10 0x4caa1f in cmd_main t/helper/test-tool.c:124:11 git#11 0x4caded in main common-main.c:52:11 git#12 0x7f0869f02349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 129 byte(s) leaked in 1 allocation(s). It looks like this leak has existed since realpath was first added to set_git_work_tree() in: 3d7747e (real_path: remove unsafe API, 2020-03-10) Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

relative_url() populates sb. In the normal return path, its buffer is detached using strbuf_detach(). However the early return path does nothing with sb, which means that sb's memory is leaked - therefore we add a release to avoid this leak. The reset is also only necessary for the normal return path, hence we move it down to after the early-return to avoid unnecessary work. LSAN output from t0060: Direct leak of 121 byte(s) in 1 object(s) allocated from: #0 0x7f31246f28b0 in realloc (/usr/lib64/libasan.so.4+0xdc8b0) #1 0x98d7d6 in xrealloc wrapper.c:126 #2 0x909a60 in strbuf_grow strbuf.c:98 #3 0x90bf00 in strbuf_vaddf strbuf.c:401 #4 0x90c321 in strbuf_addf strbuf.c:335 #5 0x5cb78d in relative_url builtin/submodule--helper.c:182 git#6 0x5cbe46 in resolve_relative_url_test builtin/submodule--helper.c:248 #7 0x410dcd in run_builtin git.c:475 git#8 0x410dcd in handle_builtin git.c:729 git#9 0x414087 in run_argv git.c:818 git#10 0x414087 in cmd_main git.c:949 git#11 0x40e9ec in main common-main.c:52 git#12 0x7f3123c41349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 121 byte(s) leaked in 1 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

cmd_for_each_repo() copies argv into args (a strvec), which is later passed into run_command_on_repo(), which in turn copies that strvec onto the end of child.args. The initial copy is unnecessary (we never modify args). We therefore choose to just pass argv directly into run_command_on_repo(), which lets us avoid the copy and fixes the leak. LSAN output from t0068: Direct leak of 192 byte(s) in 1 object(s) allocated from: #0 0x7f63bd4ab8b0 in realloc (/usr/lib64/libasan.so.4+0xdc8b0) #1 0x98d7e6 in xrealloc wrapper.c:126 #2 0x916914 in strvec_push_nodup strvec.c:19 #3 0x916a6e in strvec_push strvec.c:26 #4 0x4be4eb in cmd_for_each_repo builtin/for-each-repo.c:49 #5 0x410dcd in run_builtin git.c:475 git#6 0x410dcd in handle_builtin git.c:729 #7 0x414087 in run_argv git.c:818 git#8 0x414087 in cmd_main git.c:949 git#9 0x40e9ec in main common-main.c:52 git#10 0x7f63bc9fa349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 22 byte(s) in 2 object(s) allocated from: #0 0x7f63bd445e30 in __interceptor_strdup (/usr/lib64/libasan.so.4+0x76e30) #1 0x98d698 in xstrdup wrapper.c:29 #2 0x916a63 in strvec_push strvec.c:26 #3 0x4be4eb in cmd_for_each_repo builtin/for-each-repo.c:49 #4 0x410dcd in run_builtin git.c:475 #5 0x410dcd in handle_builtin git.c:729 git#6 0x414087 in run_argv git.c:818 #7 0x414087 in cmd_main git.c:949 git#8 0x40e9ec in main common-main.c:52 git#9 0x7f63bc9fa349 in __libc_start_main (/lib64/libc.so.6+0x24349) See also discussion about the original implementation below - this code appears to have evolved from a callback explaining the double-strvec-copy pattern, but there's no strong reason to keep that now: https://lore.kernel.org/git/68bbeca5-314b-08ee-ef36-040e3f3814e9@gmail.com/ Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

old_dir/new_dir are free()'d at the end of update_dir_rename_counts, however if we return early we'll never free those strings. Therefore we should move all new allocations after the possible early return, avoiding a leak. This seems like a fairly recent leak, that started happening since the early-return was added in: 1ad69eb (diffcore-rename: compute dir_rename_counts in stages, 2021-02-27) LSAN output from t0022: Direct leak of 7 byte(s) in 1 object(s) allocated from: #0 0x486804 in strdup ../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0xa71e48 in xstrdup wrapper.c:29:14 #2 0x7db9c7 in update_dir_rename_counts diffcore-rename.c:464:12 #3 0x7db6ae in find_renames diffcore-rename.c:1062:3 #4 0x7d76c3 in diffcore_rename_extended diffcore-rename.c:1472:18 #5 0x7b4cfc in diffcore_std diff.c:6705:4 git#6 0x855e46 in log_tree_diff_flush log-tree.c:846:2 #7 0x856574 in log_tree_diff log-tree.c:955:3 git#8 0x856574 in log_tree_commit log-tree.c:986:10 git#9 0x9a9c67 in print_commit_summary sequencer.c:1329:7 git#10 0x52e623 in cmd_commit builtin/commit.c:1862:3 git#11 0x4ce83e in run_builtin git.c:475:11 git#12 0x4ccafe in handle_builtin git.c:729:3 git#13 0x4cb01c in run_argv git.c:818:4 git#14 0x4cb01c in cmd_main git.c:949:19 git#15 0x6b3f3d in main common-main.c:52:11 git#16 0x7fe397c7a349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 7 byte(s) in 1 object(s) allocated from: #0 0x486804 in strdup ../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0xa71e48 in xstrdup wrapper.c:29:14 #2 0x7db9bc in update_dir_rename_counts diffcore-rename.c:463:12 #3 0x7db6ae in find_renames diffcore-rename.c:1062:3 #4 0x7d76c3 in diffcore_rename_extended diffcore-rename.c:1472:18 #5 0x7b4cfc in diffcore_std diff.c:6705:4 git#6 0x855e46 in log_tree_diff_flush log-tree.c:846:2 #7 0x856574 in log_tree_diff log-tree.c:955:3 git#8 0x856574 in log_tree_commit log-tree.c:986:10 git#9 0x9a9c67 in print_commit_summary sequencer.c:1329:7 git#10 0x52e623 in cmd_commit builtin/commit.c:1862:3 git#11 0x4ce83e in run_builtin git.c:475:11 git#12 0x4ccafe in handle_builtin git.c:729:3 git#13 0x4cb01c in run_argv git.c:818:4 git#14 0x4cb01c in cmd_main git.c:949:19 git#15 0x6b3f3d in main common-main.c:52:11 git#16 0x7fe397c7a349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 14 byte(s) leaked in 2 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

u.head is populated using resolve_refdup(), which returns a newly allocated string - hence we also need to free() it. Found while running t0041 with LSAN: Direct leak of 16 byte(s) in 1 object(s) allocated from: #0 0x486804 in strdup ../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0xa8be98 in xstrdup wrapper.c:29:14 #2 0x9481db in head_atom_parser ref-filter.c:549:17 #3 0x9408c7 in parse_ref_filter_atom ref-filter.c:703:30 #4 0x9400e3 in verify_ref_format ref-filter.c:974:8 #5 0x4f9e8b in print_ref_list builtin/branch.c:439:6 git#6 0x4f9e8b in cmd_branch builtin/branch.c:757:3 #7 0x4ce83e in run_builtin git.c:475:11 git#8 0x4ccafe in handle_builtin git.c:729:3 git#9 0x4cb01c in run_argv git.c:818:4 git#10 0x4cb01c in cmd_main git.c:949:19 git#11 0x6bdc2d in main common-main.c:52:11 git#12 0x7f96edf86349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

repo_diff_setup() calls through to diff.c's static prep_parse_options(), which in turn allocates a new array into diff_opts.parseopts. diff_setup_done() is responsible for freeing that array, and has the benefit of verifying diff_opts too - hence we add a call to diff_setup_done() to avoid leaking parseopts. Output from the leak as found while running t0090 with LSAN: Direct leak of 7120 byte(s) in 1 object(s) allocated from: #0 0x49a82d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0xa8bf89 in do_xmalloc wrapper.c:41:8 #2 0x7a7bae in prep_parse_options diff.c:5636:2 #3 0x7a7bae in repo_diff_setup diff.c:4611:2 #4 0x93716c in repo_index_has_changes read-cache.c:2518:3 #5 0x872233 in unclean merge-ort-wrappers.c:12:14 git#6 0x872233 in merge_ort_recursive merge-ort-wrappers.c:53:6 #7 0x5d5b11 in try_merge_strategy builtin/merge.c:752:12 git#8 0x5d0b6b in cmd_merge builtin/merge.c:1666:9 git#9 0x4ce83e in run_builtin git.c:475:11 git#10 0x4ccafe in handle_builtin git.c:729:3 git#11 0x4cb01c in run_argv git.c:818:4 git#12 0x4cb01c in cmd_main git.c:949:19 git#13 0x6bdc2d in main common-main.c:52:11 git#14 0x7f551eb51349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 7120 byte(s) leaked in 1 allocation(s) Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

apply_multi_file_filter and async_query_available_blobs both query subprocess output using subprocess_read_status, which writes data into the identically named filter_status strbuf. We add a strbuf_release to avoid leaking their contents. Leak output seen when running t0021 with LSAN: Direct leak of 24 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa8c2b5 in xrealloc wrapper.c:126:8 #2 0x9ff99d in strbuf_grow strbuf.c:98:2 #3 0x9ff99d in strbuf_addbuf strbuf.c:304:2 #4 0xa101d6 in subprocess_read_status sub-process.c:45:5 #5 0x77793c in apply_multi_file_filter convert.c:886:8 git#6 0x77793c in apply_filter convert.c:1042:10 #7 0x77a0b5 in convert_to_git_filter_fd convert.c:1492:7 git#8 0x8b48cd in index_stream_convert_blob object-file.c:2156:2 git#9 0x8b48cd in index_fd object-file.c:2248:9 git#10 0x597411 in hash_fd builtin/hash-object.c:43:9 git#11 0x596be1 in hash_object builtin/hash-object.c:59:2 git#12 0x596be1 in cmd_hash_object builtin/hash-object.c:153:3 git#13 0x4ce83e in run_builtin git.c:475:11 git#14 0x4ccafe in handle_builtin git.c:729:3 git#15 0x4cb01c in run_argv git.c:818:4 git#16 0x4cb01c in cmd_main git.c:949:19 git#17 0x6bdc2d in main common-main.c:52:11 git#18 0x7f42acf79349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 24 byte(s) leaked in 1 allocation(s). Direct leak of 120 byte(s) in 5 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa8c295 in xrealloc wrapper.c:126:8 #2 0x9ff97d in strbuf_grow strbuf.c:98:2 #3 0x9ff97d in strbuf_addbuf strbuf.c:304:2 #4 0xa101b6 in subprocess_read_status sub-process.c:45:5 #5 0x775c73 in async_query_available_blobs convert.c:960:8 git#6 0x80029d in finish_delayed_checkout entry.c:183:9 #7 0xa65d1e in check_updates unpack-trees.c:493:10 git#8 0xa5f469 in unpack_trees unpack-trees.c:1747:8 git#9 0x525971 in checkout builtin/clone.c:815:6 git#10 0x525971 in cmd_clone builtin/clone.c:1409:8 git#11 0x4ce83e in run_builtin git.c:475:11 git#12 0x4ccafe in handle_builtin git.c:729:3 git#13 0x4cb01c in run_argv git.c:818:4 git#14 0x4cb01c in cmd_main git.c:949:19 git#15 0x6bdc2d in main common-main.c:52:11 git#16 0x7fa253fce349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 120 byte(s) leaked in 5 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

These leaks all happen at the end of cmd_mv, hence don't matter in any way. But we still fix the easy ones and squash the rest to get us closer to being able to run tests without leaks. LSAN output from t0050: Direct leak of 384 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa8c015 in xrealloc wrapper.c:126:8 #2 0xa0a7e1 in add_entry string-list.c:44:2 #3 0xa0a7e1 in string_list_insert string-list.c:58:14 #4 0x5dac03 in cmd_mv builtin/mv.c:248:4 #5 0x4ce83e in run_builtin git.c:475:11 git#6 0x4ccafe in handle_builtin git.c:729:3 #7 0x4cb01c in run_argv git.c:818:4 git#8 0x4cb01c in cmd_main git.c:949:19 git#9 0x6bd9ad in main common-main.c:52:11 git#10 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 16 byte(s) in 1 object(s) allocated from: #0 0x49a82d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0xa8bd09 in do_xmalloc wrapper.c:41:8 #2 0x5dbc34 in internal_prefix_pathspec builtin/mv.c:32:2 #3 0x5da575 in cmd_mv builtin/mv.c:158:14 #4 0x4ce83e in run_builtin git.c:475:11 #5 0x4ccafe in handle_builtin git.c:729:3 git#6 0x4cb01c in run_argv git.c:818:4 #7 0x4cb01c in cmd_main git.c:949:19 git#8 0x6bd9ad in main common-main.c:52:11 git#9 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 16 byte(s) in 1 object(s) allocated from: #0 0x49a82d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3 #1 0xa8bd09 in do_xmalloc wrapper.c:41:8 #2 0x5dbc34 in internal_prefix_pathspec builtin/mv.c:32:2 #3 0x5da4e4 in cmd_mv builtin/mv.c:148:11 #4 0x4ce83e in run_builtin git.c:475:11 #5 0x4ccafe in handle_builtin git.c:729:3 git#6 0x4cb01c in run_argv git.c:818:4 #7 0x4cb01c in cmd_main git.c:949:19 git#8 0x6bd9ad in main common-main.c:52:11 git#9 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x49a9a2 in calloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0xa8c119 in xcalloc wrapper.c:140:8 #2 0x5da585 in cmd_mv builtin/mv.c:159:22 #3 0x4ce83e in run_builtin git.c:475:11 #4 0x4ccafe in handle_builtin git.c:729:3 #5 0x4cb01c in run_argv git.c:818:4 git#6 0x4cb01c in cmd_main git.c:949:19 #7 0x6bd9ad in main common-main.c:52:11 git#8 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Direct leak of 4 byte(s) in 1 object(s) allocated from: #0 0x49a9a2 in calloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0xa8c119 in xcalloc wrapper.c:140:8 #2 0x5da4f8 in cmd_mv builtin/mv.c:149:10 #3 0x4ce83e in run_builtin git.c:475:11 #4 0x4ccafe in handle_builtin git.c:729:3 #5 0x4cb01c in run_argv git.c:818:4 git#6 0x4cb01c in cmd_main git.c:949:19 #7 0x6bd9ad in main common-main.c:52:11 git#8 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 65 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa8c015 in xrealloc wrapper.c:126:8 #2 0xa00226 in strbuf_grow strbuf.c:98:2 #3 0xa00226 in strbuf_vaddf strbuf.c:394:3 #4 0xa065c7 in xstrvfmt strbuf.c:981:2 #5 0xa065c7 in xstrfmt strbuf.c:991:8 git#6 0x9e7ce7 in prefix_path_gently setup.c:115:15 #7 0x9e7fa6 in prefix_path setup.c:128:12 git#8 0x5dbdbf in internal_prefix_pathspec builtin/mv.c:55:23 git#9 0x5da575 in cmd_mv builtin/mv.c:158:14 git#10 0x4ce83e in run_builtin git.c:475:11 git#11 0x4ccafe in handle_builtin git.c:729:3 git#12 0x4cb01c in run_argv git.c:818:4 git#13 0x4cb01c in cmd_main git.c:949:19 git#14 0x6bd9ad in main common-main.c:52:11 git#15 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 65 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa8c015 in xrealloc wrapper.c:126:8 #2 0xa00226 in strbuf_grow strbuf.c:98:2 #3 0xa00226 in strbuf_vaddf strbuf.c:394:3 #4 0xa065c7 in xstrvfmt strbuf.c:981:2 #5 0xa065c7 in xstrfmt strbuf.c:991:8 git#6 0x9e7ce7 in prefix_path_gently setup.c:115:15 #7 0x9e7fa6 in prefix_path setup.c:128:12 git#8 0x5dbdbf in internal_prefix_pathspec builtin/mv.c:55:23 git#9 0x5da4e4 in cmd_mv builtin/mv.c:148:11 git#10 0x4ce83e in run_builtin git.c:475:11 git#11 0x4ccafe in handle_builtin git.c:729:3 git#12 0x4cb01c in run_argv git.c:818:4 git#13 0x4cb01c in cmd_main git.c:949:19 git#14 0x6bd9ad in main common-main.c:52:11 git#15 0x7fbfeffc4349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 558 byte(s) leaked in 7 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

merge_name() calls dwim_ref(), which allocates a new string into found_ref. Therefore add a free() to avoid leaking found_ref. LSAN output from t0021: Direct leak of 16 byte(s) in 1 object(s) allocated from: #0 0x486804 in strdup ../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0xa8beb8 in xstrdup wrapper.c:29:14 #2 0x954054 in expand_ref refs.c:671:12 #3 0x953cb6 in repo_dwim_ref refs.c:644:22 #4 0x5d3759 in dwim_ref refs.h:162:9 #5 0x5d3759 in merge_name builtin/merge.c:517:6 git#6 0x5d3759 in collect_parents builtin/merge.c:1214:5 #7 0x5cf60d in cmd_merge builtin/merge.c:1458:16 git#8 0x4ce83e in run_builtin git.c:475:11 git#9 0x4ccafe in handle_builtin git.c:729:3 git#10 0x4cb01c in run_argv git.c:818:4 git#11 0x4cb01c in cmd_main git.c:949:19 git#12 0x6bdbfd in main common-main.c:52:11 git#13 0x7f0430502349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 16 byte(s) leaked in 1 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

- cmd_rebase populates rebase_options.strategy with newly allocated strings, hence we need to free those strings at the end of cmd_rebase to avoid a leak. - In some cases: get_replay_opts() is called, which prepares replay_opts using data from rebase_options. We used to simply copy the pointer from rebase_options.strategy, however that would now result in a double-free because sequencer_remove_state() is eventually used to free replay_opts.strategy. To avoid this we xstrdup() strategy when adding it to replay_opts. The original leak happens because we always populate rebase_options.strategy, but we don't always enter the path that calls get_replay_opts() and later sequencer_remove_state() - in other words we'd always allocate a new string into rebase_options.strategy but only sometimes did we free it. We now make sure that rebase_options and replay_opts both own their own copies of strategy, and each copy is free'd independently. This was first seen when running t0021 with LSAN, but t2012 helped catch the fact that we can't just free(options.strategy) at the end of cmd_rebase (as that can cause a double-free). LSAN output from t0021: LSAN output from t0021: Direct leak of 4 byte(s) in 1 object(s) allocated from: #0 0x486804 in strdup ../projects/compiler-rt/lib/asan/asan_interceptors.cpp:452:3 #1 0xa71eb8 in xstrdup wrapper.c:29:14 #2 0x61b1cc in cmd_rebase builtin/rebase.c:1779:22 #3 0x4ce83e in run_builtin git.c:475:11 #4 0x4ccafe in handle_builtin git.c:729:3 #5 0x4cb01c in run_argv git.c:818:4 git#6 0x4cb01c in cmd_main git.c:949:19 #7 0x6b3fad in main common-main.c:52:11 git#8 0x7f267b512349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 4 byte(s) leaked in 1 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

setup_unpack_trees_porcelain() populates various fields on unpack_tree_opts, we need to call clear_unpack_trees_porcelain() to avoid leaking them. Specifically, we used to leak unpack_tree_opts.msgs_to_free. We have to do this in leave_reset_head because there are multiple scenarios where unpack_tree_opts has already been configured, followed by a 'goto leave_reset_head'. But we can also 'goto leave_reset_head' prior to having initialised unpack_tree_opts via memset(..., 0, ...). Therefore we also move unpack_tree_opts initialisation to the start of reset_head(), and convert it to use brace initialisation - which guarantees that we can never clear an uninitialised unpack_tree_opts. clear_unpack_tree_opts() is always safe to call as long as unpack_tree_opts is at least zero-initialised, i.e. it does not depend on a previous call to setup_unpack_trees_porcelain(). LSAN output from t0021: Direct leak of 192 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa721e5 in xrealloc wrapper.c:126:8 #2 0x9f7861 in strvec_push_nodup strvec.c:19:2 #3 0x9f7861 in strvec_pushf strvec.c:39:2 #4 0xa43e14 in setup_unpack_trees_porcelain unpack-trees.c:129:3 #5 0x97e011 in reset_head reset.c:53:2 git#6 0x61dfa5 in cmd_rebase builtin/rebase.c:1991:9 #7 0x4ce83e in run_builtin git.c:475:11 git#8 0x4ccafe in handle_builtin git.c:729:3 git#9 0x4cb01c in run_argv git.c:818:4 git#10 0x4cb01c in cmd_main git.c:949:19 git#11 0x6b3f3d in main common-main.c:52:11 git#12 0x7fa8addf3349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 147 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa721e5 in xrealloc wrapper.c:126:8 #2 0x9e8d54 in strbuf_grow strbuf.c:98:2 #3 0x9e8d54 in strbuf_vaddf strbuf.c:401:3 #4 0x9f7774 in strvec_pushf strvec.c:36:2 #5 0xa43e14 in setup_unpack_trees_porcelain unpack-trees.c:129:3 git#6 0x97e011 in reset_head reset.c:53:2 #7 0x61dfa5 in cmd_rebase builtin/rebase.c:1991:9 git#8 0x4ce83e in run_builtin git.c:475:11 git#9 0x4ccafe in handle_builtin git.c:729:3 git#10 0x4cb01c in run_argv git.c:818:4 git#11 0x4cb01c in cmd_main git.c:949:19 git#12 0x6b3f3d in main common-main.c:52:11 git#13 0x7fa8addf3349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 134 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa721e5 in xrealloc wrapper.c:126:8 #2 0x9e8d54 in strbuf_grow strbuf.c:98:2 #3 0x9e8d54 in strbuf_vaddf strbuf.c:401:3 #4 0x9f7774 in strvec_pushf strvec.c:36:2 #5 0xa43fe4 in setup_unpack_trees_porcelain unpack-trees.c:168:3 git#6 0x97e011 in reset_head reset.c:53:2 #7 0x61dfa5 in cmd_rebase builtin/rebase.c:1991:9 git#8 0x4ce83e in run_builtin git.c:475:11 git#9 0x4ccafe in handle_builtin git.c:729:3 git#10 0x4cb01c in run_argv git.c:818:4 git#11 0x4cb01c in cmd_main git.c:949:19 git#12 0x6b3f3d in main common-main.c:52:11 git#13 0x7fa8addf3349 in __libc_start_main (/lib64/libc.so.6+0x24349) Indirect leak of 130 byte(s) in 1 object(s) allocated from: #0 0x49ab49 in realloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xa721e5 in xrealloc wrapper.c:126:8 #2 0x9e8d54 in strbuf_grow strbuf.c:98:2 #3 0x9e8d54 in strbuf_vaddf strbuf.c:401:3 #4 0x9f7774 in strvec_pushf strvec.c:36:2 #5 0xa43f20 in setup_unpack_trees_porcelain unpack-trees.c:150:3 git#6 0x97e011 in reset_head reset.c:53:2 #7 0x61dfa5 in cmd_rebase builtin/rebase.c:1991:9 git#8 0x4ce83e in run_builtin git.c:475:11 git#9 0x4ccafe in handle_builtin git.c:729:3 git#10 0x4cb01c in run_argv git.c:818:4 git#11 0x4cb01c in cmd_main git.c:949:19 git#12 0x6b3f3d in main common-main.c:52:11 git#13 0x7fa8addf3349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 603 byte(s) leaked in 4 allocation(s). Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When doing reference negotiation, git-fetch-pack(1) is loading all refs from disk in order to determine which commits it has in common with the remote repository. This can be quite expensive in repositories with many references though: in a real-world repository with around 2.2 million refs, fetching a single commit by its ID takes around 44 seconds. Dominating the loading time is decompression and parsing of the objects which are referenced by commits. Given the fact that we only care about commits (or tags which can be peeled to one) in this context, there is thus an easy performance win by switching the parsing logic to make use of the commit graph in case we have one available. Like this, we avoid hitting the object database to parse these commits but instead only load them from the commit-graph. This results in a significant performance boost when executing git-fetch in said repository with 2.2 million refs: Benchmark #1: HEAD~: git fetch $remote $commit Time (mean ± σ): 44.168 s ± 0.341 s [User: 42.985 s, System: 1.106 s] Range (min … max): 43.565 s … 44.577 s 10 runs Benchmark #2: HEAD: git fetch $remote $commit Time (mean ± σ): 19.498 s ± 0.724 s [User: 18.751 s, System: 0.690 s] Range (min … max): 18.629 s … 20.454 s 10 runs Summary 'HEAD: git fetch $remote $commit' ran 2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In order to compute whether objects reachable from a set of tips are all connected, we do a revision walk with these tips as positive references and `--not --all`. `--not --all` will cause the revision walk to load all preexisting references as uninteresting, which can be very expensive in repositories with many references. Benchmarking the git-rev-list(1) command highlights that by far the most expensive single phase is initial sorting of the input revisions: after all references have been loaded, we first sort commits by author date. In a real-world repository with about 2.2 million references, it makes up about 40% of the total runtime of git-rev-list(1). Ultimately, the connectivity check shouldn't really bother about the order of input revisions at all. We only care whether we can actually walk all objects until we hit the cut-off point. So sorting the input is a complete waste of time. Introduce a new "--unsorted-input" flag to git-rev-list(1) which will cause it to not sort the commits and adjust the connectivity check to always pass the flag. This results in the following speedups, executed in a clone of gitlab-org/gitlab [1]: Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev) Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s] Range (min … max): 7.543 s … 7.742 s 10 runs Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s] Range (min … max): 4.909 s … 5.048 s 10 runs Summary 'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran 1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev' [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs are visible to clients. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When queueing up references for the revision walk, `handle_one_ref()` will resolve the reference's object ID via `get_reference()` and then queue the ID as pending object via `add_pending_oid()`. But given that `add_pending_oid()` is only a thin wrapper around `add_pending_object()` which fist calls `get_reference()`, we effectively resolve the reference twice and thus duplicate some of the work. Fix the issue by instead calling `add_pending_object()` directly, which takes the already-resolved object as input. In a repository with lots of refs, this translates into a near 10% speedup: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 5.015 s ± 0.038 s [User: 4.698 s, System: 0.316 s] Range (min … max): 4.970 s … 5.089 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.606 s ± 0.029 s [User: 4.260 s, System: 0.345 s] Range (min … max): 4.565 s … 4.657 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.09 ± 0.01 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When fetching, Git will by default print a list of all updated refs in a nicely formatted table. In order to come up with this table, Git needs to iterate refs twice: first to determine the maximum column width, and a second time to actually format these changed refs. While this table will not be printed in case the user passes `--quiet`, we still go out of our way and do all these steps. In fact, we even do more work compared to not passing `--quiet`: without the flag, we will skip all references in the column width computation which have not been updated, but if it is set we will now compute widths for all refs. Fix this issue by completely skipping both preparation of the format and formatting data for display in case the user passes `--quiet`, improving performance especially with many refs. The following benchmark shows a nice speedup for a quiet mirror-fetch in a repository with 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 26.929 s ± 0.145 s [User: 24.194 s, System: 4.656 s] Range (min … max): 26.692 s … 27.068 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 25.189 s ± 0.094 s [User: 22.556 s, System: 4.606 s] Range (min … max): 25.070 s … 25.314 s 5 runs Summary 'HEAD: git-fetch' ran 1.07 ± 0.01 times faster than 'HEAD~: git-fetch' While at it, this patch also fixes `adjust_refcol_width()` such that it skips unchanged refs in case the user passed `--quiet`, where verbosity will be negative. While this function won't be called anymore if so, this brings the comment in line with actual code. Furthermore, needless `verbosity >= 0` checks are now removed in `store_updated_refs()`: we never print to the `note` buffer anymore in case `verbosity < 0`, so we won't end up in that code block anyway. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When updating our local refs based on the refs fetched from the remote, we need to iterate through all requested refs and load their respective commits such that we can determine whether they need to be appended to FETCH_HEAD or not. In cases where we're fetching from a remote with exceedingly many refs, resolving these refs can be quite expensive given that we repeatedly need to unpack object headers for each of the referenced objects. Speed this up by opportunistically trying to resolve object IDs via the commit graph. We only do so for any refs which are not in "refs/tags": more likely than not, these are going to be a commit anyway, and this lets us avoid having to unpack object headers completely in case the object is a commit that is part of the commit-graph. This significantly speeds up mirror-fetches in a real-world repository with 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 56.482 s ± 0.384 s [User: 53.340 s, System: 5.365 s] Range (min … max): 56.050 s … 57.045 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 33.727 s ± 0.170 s [User: 30.252 s, System: 5.194 s] Range (min … max): 33.452 s … 33.871 s 5 runs Summary 'HEAD: git-fetch' ran 1.67 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When updating local refs after the fetch has transferred all objects, we do an object existence test as a safety guard to avoid updating a ref to an object which we don't have. We do so via `oid_object_info()`: if it returns an error, then we know the object does not exist. One side effect of `oid_object_info()` is that it parses the object's type, and to do so it must unpack the object header. This is completely pointless: we don't care for the type, but only want to assert that the object exists. Refactor the code to use `repo_has_object_file()`, which both makes the code's intent clearer and is also faster because it does not unpack object headers. In a real-world repo with 2.3M refs, this results in a small speedup when doing a mirror-fetch: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 33.686 s ± 0.176 s [User: 30.119 s, System: 5.262 s] Range (min … max): 33.512 s … 33.944 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 31.247 s ± 0.195 s [User: 28.135 s, System: 5.066 s] Range (min … max): 30.948 s … 31.472 s 5 runs Summary 'HEAD: git-fetch' ran 1.08 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

The object ID iterator used by the connectivity checks returns the next object ID via an out-parameter and then uses a return code to indicate whether an item was found. This is a bit roundabout: instead of a separate error code, we can just return the next object ID directly and use `NULL` pointers as indicator that the iterator got no items left. Furthermore, this avoids a copy of the object ID. Refactor the iterator and all its implementations to return object IDs directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs: Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch Time (mean ± σ): 30.110 s ± 0.148 s [User: 27.161 s, System: 5.075 s] Range (min … max): 29.934 s … 30.406 s 10 runs Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch Time (mean ± σ): 29.899 s ± 0.109 s [User: 26.916 s, System: 5.104 s] Range (min … max): 29.696 s … 29.996 s 10 runs Summary '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran 1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch' While this 1% speedup could be labelled as statistically insignificant, the speedup is consistent on my machine. Furthermore, this is an end to end test, so it is expected that the improvement in the connectivity check itself is more significant. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In order to negotiate a packfile, we need to dereference refs to see which commits we have in common with the remote. To do so, we first look up the object's type -- if it's a tag, we peel until we hit a non-tag object. If we hit a commit eventually, then we return that commit. In case the object ID points to a commit directly, we can avoid the initial lookup of the object type by opportunistically looking up the commit via the commit-graph, if available, which gives us a slight speed bump of about 2% in a huge repository with about 2.3M refs: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 31.634 s ± 0.258 s [User: 28.400 s, System: 5.090 s] Range (min … max): 31.280 s … 31.896 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 31.129 s ± 0.543 s [User: 27.976 s, System: 5.056 s] Range (min … max): 30.172 s … 31.479 s 5 runs Summary 'HEAD: git-fetch' ran 1.02 ± 0.02 times faster than 'HEAD~: git-fetch' In case this fails, we fall back to the old code which peels the objects to a commit. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When fetching refs, we are doing two connectivity checks: - The first one is done such that we can skip fetching refs in the case where we already have all objects referenced by the updated set of refs. - The second one verifies that we have all objects after we have fetched objects. We always execute both connectivity checks, but this is wasteful in case the first connectivity check already notices that we have all objects locally available. Skip the second connectivity check in case we already had all objects available. This gives us a nice speedup when doing a mirror-fetch in a repository with about 2.3M refs where the fetching repo already has all objects: Benchmark #1: HEAD~: git-fetch Time (mean ± σ): 30.025 s ± 0.081 s [User: 27.070 s, System: 4.933 s] Range (min … max): 29.900 s … 30.111 s 5 runs Benchmark #2: HEAD: git-fetch Time (mean ± σ): 25.574 s ± 0.177 s [User: 22.855 s, System: 4.683 s] Range (min … max): 25.399 s … 25.765 s 5 runs Summary 'HEAD: git-fetch' ran 1.17 ± 0.01 times faster than 'HEAD~: git-fetch' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>

During reflog expiry, the cmd_reflog_expire() function first iterates over all reflogs in logs/*, and then one-by-one acquires the lock for each one and expires it. This behavior has been with us since this command was implemented in 4264dc1 ("git reflog expire", 2006-12-19). Change this to stop calling lock_ref_oid_basic() with the OID we saw when we looped over the logs, instead have it pass the OID it managed to lock. This mostly mitigates a race condition where e.g. "git gc" will fail in a concurrently updated repository because the branch moved since "git reflog expire --all" was started. I.e. with: error: cannot lock ref '<refname>': ref '<refname>' is at <OID-A> but expected <OID-B> This behavior of passing in an "oid" was needed for an edge-case that I've untangled in this and preceding commits though, namely that we needed this OID because we'd: 1. Lookup the reflog name/OID via dwim_log() 2. With that OID, lock the reflog 3. Later in builtin/reflog.c we use the OID we looked as input to lookup_commit_reference_gently(), assured that it's equal to the OID we got from dwim_log(). We can be sure that this change is safe to make because between dwim_log (step #1) and lock_ref_oid_basic (step #2) there was no other logic relevant to the OID or expiry run in the cmd_reflog_expire() caller. We can thus treat that code as a black box, before and after this change it would get an OID that's been locked, the only difference is that now we mostly won't be failing to get the lock due to the TOCTOU race[0]. That failure was purely an implementation detail in how the "current OID" was looked up, it was divorced from the locking mechanism. What do we mean with "mostly"? It mostly mitigates it because we'll still run into cases where the ref is locked and being updated as we want to expire it, and other git processes wanting to update the refs will in turn race with us as we expire the reflog. That remaining race can in turn be mitigated with the core.filesRefLockTimeout setting, see 4ff0f01 ("refs: retry acquiring reference locks for 100ms", 2017-08-21). In practice if that value is high enough we'll probably never have ref updates or reflog expiry failing, since the clients involved will retry for far longer than the time any of those operations could take. See [1] for an initial report of how this impacted "git gc" and a large discussion about this change in early 2019. In particular patch looked good to Michael Haggerty, see his[2]. That message seems to not have made it to the ML archive, its content is quoted in full in my [3]. I'm leaving behind now-unused code the refs API etc. that takes the now-NULL "unused_oid" argument, and other code that can be simplified now that we never have on OID in that context, that'll be cleaned up in subsequent commits, but for now let's narrowly focus on fixing the "git gc" issue. As the modified assert() shows we always pass a NULL oid to reflog_expire() now. Unfortunately this sort of probabilistic contention is hard to turn into a test. I've tested this by running the following three subshells in concurrent terminals: ( rm -rf /tmp/git && git init /tmp/git && while true do head -c 10 /dev/urandom | hexdump >/tmp/git/out && git -C /tmp/git add out && git -C /tmp/git commit -m"out" done ) ( rm -rf /tmp/git-clone && git clone file:///tmp/git /tmp/git-clone && while git -C /tmp/git-clone pull do date done ) ( while git -C /tmp/git-clone reflog expire --all do date done ) Before this change the "reflog expire" would fail really quickly with the "but expected" error noted above. After this change both the "pull" and "reflog expire" will run for a while, but eventually fail because I get unlucky with core.filesRefLockTimeout (the "reflog expire" is in a really tight loop). As noted above that can in turn be mitigated with higher values of core.filesRefLockTimeout than the 100ms default. As noted in the commentary added in the preceding commit there's also the case of branches being racily deleted, that can be tested by adding this to the above: ( while git -C /tmp/git-clone branch topic master && git -C /tmp/git-clone branch -D topic do date done ) With core.filesRefLockTimeout set to 10 seconds (it can probably be a lot lower) I managed to run all four of these concurrently for about an hour, and accumulated ~125k commits, auto-gc's and all, and didn't have a single failure. The loops visibly stall while waiting for the lock, but that's expected and desired behavior. 0. https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use 1. https://lore.kernel.org/git/87tvg7brlm.fsf@evledraar.gmail.com/ 2. http://lore.kernel.org/git/b870a17d-2103-41b8-3cbc-7389d5fff33a@alum.mit.edu 3. https://lore.kernel.org/git/87pnqkco8v.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

In a sparse index it is possible for the tree that is being verified to be freed while it is being verified. This happens when the index is sparse but the cache tree is not and index_name_pos() looks up a path from the cache tree that is a descendant of a sparse index entry. That triggers a call to ensure_full_index() which frees the cache tree that is being verified. Carrying on trying to verify the tree after this results in a use-after-free bug. Instead restart the verification if a sparse index is converted to a full index. This bug is triggered by a call to reset_head() in "git rebase --apply". Thanks to René Scharfe and Derrick Stolee for their help analyzing the problem. ==74345==ERROR: AddressSanitizer: heap-use-after-free on address 0x606000001b20 at pc 0x557cbe82d3a2 bp 0x7ffdfee08090 sp 0x7ffdfee08080 READ of size 4 at 0x606000001b20 thread T0 #0 0x557cbe82d3a1 in verify_one /home/phil/src/git/cache-tree.c:863 #1 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #2 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #3 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 #4 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 #5 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 git#6 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 #7 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 git#8 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 git#9 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 git#10 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 git#11 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 git#12 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git#13 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) git#14 0x557cbe5bcb8d in _start (/home/phil/src/git/git+0x1b9b8d) 0x606000001b20 is located 0 bytes inside of 56-byte region [0x606000001b20,0x606000001b58) freed by thread T0 here: #0 0x7fdd4bacff19 in __interceptor_free /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:127 #1 0x557cbe82af60 in cache_tree_free /home/phil/src/git/cache-tree.c:35 #2 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #3 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #4 0x557cbe82aee5 in cache_tree_free /home/phil/src/git/cache-tree.c:31 #5 0x557cbeb2557a in ensure_full_index /home/phil/src/git/sparse-index.c:310 git#6 0x557cbea45c4a in index_name_stage_pos /home/phil/src/git/read-cache.c:588 #7 0x557cbe82ce37 in verify_one /home/phil/src/git/cache-tree.c:850 git#8 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 git#9 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 git#10 0x557cbe82ca9d in verify_one /home/phil/src/git/cache-tree.c:840 git#11 0x557cbe830a2b in cache_tree_verify /home/phil/src/git/cache-tree.c:910 git#12 0x557cbea53741 in write_locked_index /home/phil/src/git/read-cache.c:3250 git#13 0x557cbeab7fdd in reset_head /home/phil/src/git/reset.c:87 git#14 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 git#15 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 git#16 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 git#17 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 git#18 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 git#19 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git#20 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) previously allocated by thread T0 here: #0 0x7fdd4bad0459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154 #1 0x557cbebc1807 in xcalloc /home/phil/src/git/wrapper.c:140 #2 0x557cbe82b7d8 in cache_tree /home/phil/src/git/cache-tree.c:17 #3 0x557cbe82b7d8 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:763 #4 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 #5 0x557cbe82b837 in prime_cache_tree_rec /home/phil/src/git/cache-tree.c:764 git#6 0x557cbe8304e1 in prime_cache_tree /home/phil/src/git/cache-tree.c:779 #7 0x557cbeab7fa7 in reset_head /home/phil/src/git/reset.c:85 git#8 0x557cbe72147f in cmd_rebase builtin/rebase.c:2074 git#9 0x557cbe5bd151 in run_builtin /home/phil/src/git/git.c:461 git#10 0x557cbe5bd151 in handle_builtin /home/phil/src/git/git.c:714 git#11 0x557cbe5c0503 in run_argv /home/phil/src/git/git.c:781 git#12 0x557cbe5c0503 in cmd_main /home/phil/src/git/git.c:912 git#13 0x557cbe5bad28 in main /home/phil/src/git/common-main.c:52 git#14 0x7fdd4b82eb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24) Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>

When "cat-file --batch-all-objects" iterates over each object, it knows where to find each one. But when we look up details of the object, we don't use that information at all. This patch teaches it to use the pack/offset pair when we're iterating over objects in a pack. This yields a measurable speed improvement (timings on a fully packed clone of linux.git): Benchmark #1: ./git.old cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" Time (mean ± σ): 8.128 s ± 0.118 s [User: 7.968 s, System: 0.156 s] Range (min … max): 8.007 s … 8.301 s 10 runs Benchmark #2: ./git.new cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" Time (mean ± σ): 4.294 s ± 0.064 s [User: 4.167 s, System: 0.125 s] Range (min … max): 4.227 s … 4.457 s 10 runs Summary './git.new cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)"' ran 1.89 ± 0.04 times faster than './git.old cat-file --batch-all-objects --unordered --batch-check="%(objecttype) %(objectname)" The implementation is pretty simple: we just call packed_object_info() instead of oid_object_info_extended() when we can. Most of the changes are just plumbing the pack/offset pair through the callstack. There is one subtlety: replace lookups are not handled by packed_object_info(). But since those are disabled for --batch-all-objects, and since we'll only have pack info when that option is in effect, we don't have to worry about that. There are a few limitations to this optimization which we could address with further work: - I didn't bother recording when we found an object loose. Technically this could save us doing a fruitless lookup in the pack index. But opening and mmap-ing a loose object is so expensive in the first place that this doesn't matter much. And if your repository is large enough to care about per-object performance, most objects are going to be packed anyway. - This works only in --unordered mode. For the sorted mode, we'd have to record the pack/offset pair as part of our oid-collection. That's more code, plus at least 16 extra bytes of heap per object. It would probably still be a net win in runtime, but we'd need to measure. - For --batch, this still helps us with getting the object metadata, but we still do a from-scratch lookup for the object contents. This probably doesn't matter that much, because the lookup cost will be much smaller relative to the cost of actually unpacking and printing the objects. For small objects, we could probably swap out read_object_file() for using packed_object_info() with a "object_info.contentp" to get the contents. But we'd still need to deal with streaming for larger objects. A better path forward here is to teach the initial oid_object_info_extended() / packed_object_info() calls to retrieve the contents of smaller objects while they are already being accessed. That would save the extra lookup entirely. But it's a non-trivial feature to add to the object_info code, so I left it for now. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Extend the trick we use to speed up the "clean" target to also extend to the "lint-docs" target. See 54df875 (Documentation/Makefile: conditionally include doc.dep, 2020-12-08) for the "clean" implementation. The "doc-lint" target only depends on *.txt files, so we don't need to generate GIT-VERSION-FILE etc. if that's all we're doing. This makes the "make lint-docs" target more than 2x as fast: $ git show HEAD~:Documentation/Makefile >Makefile.old $ hyperfine -L f ",.old" 'make -f Makefile{f} lint-docs' Benchmark #1: make -f Makefile lint-docs Time (mean ± σ): 100.2 ms ± 1.3 ms [User: 93.7 ms, System: 6.7 ms] Range (min … max): 98.4 ms … 103.1 ms 29 runs Benchmark #2: make -f Makefile.old lint-docs Time (mean ± σ): 220.0 ms ± 20.0 ms [User: 206.0 ms, System: 18.0 ms] Range (min … max): 206.6 ms … 267.5 ms 11 runs Summary 'make -f Makefile lint-docs' ran 2.19 ± 0.20 times faster than 'make -f Makefile.old lint-docs' Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Speed up the "lint-docs" target by making it non-.PHONY. Similar to my c234e8a (Makefile: make the "sparse" target non-.PHONY, 2021-09-23). We'll now create empty files corresponding to a dependency graph for each of these lint scripts. This speeds things up a bit[1], and makes the output correspond to any in-tree changes we have: $ touch git-add.txt; make lint-docs; make lint-docs GEN cmd-list.made GEN doc.dep LINT GITLINK git-add.txt LINT MAN END git-add.txt LINT MAN SEC git-add.txt make: Nothing to be done for 'lint-docs'. As with the "sparse" target changes this has a hard dependency on the use of ".DELETE_ON_ERROR" in the Makefile, added here in db10fc6 (doc: simplify Makefile using .DELETE_ON_ERROR, 2021-05-21). This method also depends on the output for us emitting any errors on STDERR (fixed in a preceding commit), as well us these scripts exiting with non-zero on any errors (which they were already doing). 1. $ git show HEAD~:Documentation/Makefile >Makefile.old $ hyperfine --warmup 2 -L f ",.old" 'make -j1 -f Makefile{f} lint-docs' Benchmark #1: make -j1 -f Makefile lint-docs Time (mean ± σ): 60.8 ms ± 1.4 ms [User: 58.7 ms, System: 2.5 ms] Range (min … max): 58.9 ms … 64.0 ms 48 runs Benchmark #2: make -j1 -f Makefile.old lint-docs Time (mean ± σ): 84.0 ms ± 1.5 ms [User: 78.6 ms, System: 5.7 ms] Range (min … max): 81.8 ms … 87.8 ms 35 runs Summary 'make -j1 -f Makefile lint-docs' ran 1.38 ± 0.04 times faster than 'make -j1 -f Makefile.old lint-docs' Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>

* cleaning duplicate incorrect translations part 2 * update translation table * unfuzzy all entries * typos check in git commands and git flags * random line translations * updating all phrases #1 Signed-off-by: Daniel Santos <daniel@brilhante.top>

bk2204 added 30 commits January 8, 2020 23:33

:test-fixes-part7

476b0e5

t4054: make hash-size independent

1472b73

Instead of hard-coding the length of an object ID when creating a tree, generate it based on $ZERO_OID. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t4066: compute index line in diffs

1acbbff

Since the object ID used in the index line will differ between different algorithms, compute these values instead of hard-coding them. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t4134: compute appropriate length constant

d8feb2b

Instead of using a specific invalid hard-coded object ID, generate one of the appropriate length by looking one up in the translation tables. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t4200: make hash size independent

ff1721d

Instead of hard-coding a fixed length example object ID in the test, look one up using the translation tables. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t4204: make hash size independent

121f390

Use $OID_REGEX instead of a hard-coded regular expression. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t4300: abstract away SHA-1-specific constants

2702b74

Adjust the test so that it computes values for object IDs instead of using hard-coded hashes. Move the heredocs later in the tests so we can take advantage of computed values. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5324: make hash size independent

c446194

There are some offsets in the commit graph files used to corrupt data. Compute these offsets for both SHA-1 and SHA-256 so that the test works with either. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5504: make hash algorithm independent

b9521f7

Instead of hard-coding invalid object IDs in this test, use test_oid to look up ones of the appropriate length. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5510: make hash size independent

789f839

Use $OID_REGEX instead of hard-coding 40-based regular expressions. Change invocations of cut with a hard-coded constant to split using a delimiter instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5512: abstract away SHA-1-specific constants

6a1939e

Adjust the test so that it computes variables for object IDs instead of using hard-coded hashes. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5530: compute results based on object length

6f0c076

Compute the various pkt-line values based on the length of the object IDs in use. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5540: make hash size independent

3f27c99

Use regex values based on $OID_REGEX instead of hard-coding them based on expected object ID lengths. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t5562: use $ZERO_OID

380f6ba

This test uses $_z40 to express an all-zeros object ID, which doesn't work for SHA-256. Use $ZERO_OID instead, which is the right size for all hash values. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

:test-fixes-part8

5694591

t3206: make hash size independent

2b72731

Fix the one assertion in this test that still uses SHA-1 to use test_oid to be independent of the hash. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t3308: make test work with SHA-256

1023b26

Replace the hard-coded SHA-1 constants with the use of test_oid to look up an appropriate constant for each hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t3309: make test work with SHA-256

71a0c05

Replace the hard-coded SHA-1 constants with the use of test_oid to look up an appropriate constant for each hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

t3310: make test work with SHA-256

b085a43

Replace the hard-coded SHA-1 constants with the use of test_oid to look up an appropriate constant for each hash algorithm. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

t3404: adjust collision test case for SHA-256#1

t3404: adjust collision test case for SHA-256#1
dscho wants to merge 148 commits intobk2204:transition-stage-4from
dscho:t3404-sha256-collision

dscho commented Jan 16, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dscho commented Jan 16, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants