GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader #49060

Merged
pitrou merged 1 commit into apache:main from pitrou:ipc-oss-fuzz-fixes
Jan 29, 2026

Conversation

@pitrou
Member

@pitrou pitrou commented Jan 29, 2026

Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix": it fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

@pitrou pitrou marked this pull request as ready for review January 29, 2026 15:23
@pitrou pitrou added the backport-candidate and Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) labels Jan 29, 2026
@pitrou pitrou requested a review from bkietz January 29, 2026 15:28
@pitrou
Member Author

pitrou commented Jan 29, 2026

@lidavidm @WillAyd Would you like to take a look at this?

@pitrou
Member Author

pitrou commented Jan 29, 2026

@github-actions crossbow submit -g cpp

@github-actions

Revision: 593fe47

Submitted crossbow builds: ursacomputing/crossbow @ actions-a646fef723

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-debian-13-cpp-amd64 GitHub Actions
test-debian-13-cpp-i386 GitHub Actions
test-debian-experimental-cpp-gcc-15 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Member Author

pitrou commented Jan 29, 2026

cc @raulcd

Contributor

@WillAyd WillAyd left a comment


lgtm

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 29, 2026
Member

@raulcd raulcd left a comment


Thanks @pitrou! Will cherry-pick as part of 23.0.1.

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Jan 29, 2026
@pitrou pitrou merged commit 3e6182a into apache:main Jan 29, 2026
88 of 89 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Jan 29, 2026
@pitrou pitrou deleted the ipc-oss-fuzz-fixes branch January 29, 2026 17:31
raulcd pushed a commit that referenced this pull request Feb 3, 2026
### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** It fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd pushed a commit that referenced this pull request Feb 4, 2026
cbb330 added a commit to cbb330/arrow that referenced this pull request Feb 20, 2026
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the issue of duplicate cookie keys.

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all null before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions are missing references to their arguments. Without those references, the arguments may be freed by GC.

### What changes are included in this PR?

* Keep references to the arguments of `garrow_filter_node_options_new()`
* Keep references to the arguments of `garrow_project_node_options_new()`
* Keep references to the arguments of `garrow_aggregate_node_options_new()`
* Keep references to the arguments of `garrow_literal_expression_new()`
* Keep references to the arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
Timing out the job and failing.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs.

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` has been obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left unremoved.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which allows better parsing of line numbers.

I could not find a relevant example to demonstrate this within the project, but assume we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

urllib now sends requests with `"user-agent": "pyarrow"`.

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** It fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only the decimal128/256 variants of decimal arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`), and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change, 3 TargetMachines were created:

1. In `Engine::Make()`, `MakeTargetMachineBuilder()` is called, then `BuildJIT()` is called. Inside `LLJITBuilder::create()`, when `prepareForConstruction()` runs, if no DataLayout was set, it calls `JTMB->getDefaultDataLayoutForTarget()`, which creates a temporary TargetMachine just to get the DataLayout.

2. Inside `BuildJIT()`, when `setCompileFunctionCreator` is used with the lambda, that lambda calls `JTMB.createTargetMachine()` to create a TargetMachine for the `TMOwningSimpleCompiler`.

3. Back in `Engine::Make()`, after `BuildJIT()` returns, there is an explicit call to `jtmb.createTargetMachine()` to create `target_machine_` for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of `Engine::Make()`. That machine is passed to `BuildJIT()`, where its DataLayout is given to `LLJITBuilder`, which prevents `prepareForConstruction()` from calling `getDefaultDataLayoutForTarget()` (which would create a temporary TargetMachine).

Use `SimpleCompiler` instead of `TMOwningSimpleCompiler`: `SimpleCompiler` takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created. A `shared_ptr` is used to ensure the TargetMachine stays alive for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
 
### Are these changes tested?
Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With that, I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce the issue in any checked-in tests. I tried copying a chunk of data around our prod reproduction into Azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low probability bugs will be gone. No interface changes. 
* GitHub Issue: #49043

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null ptr being returned and the function interprets that as an error.

### What changes are included in this PR?
- Add `kCanReturnErrors` to the function definition to match other string functions.
- Move the check for zero-length input earlier in the `binary_string` function to prevent the zero-byte allocation.
- Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year month/day time/month day nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset size.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has unit parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documents for libarrow-cuda-glib are generated but they aren't packaged.

### What changes are included in this PR?

Package documents for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenized versions used

### What changes are included in this PR?
Move to xsimd 14 to benefit from latest improvements relevant for improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue we don't seem to have run asv benchmarks on Python for the last years. It is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No new tests; validated via CI and a preview-docs run to check the docs.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

a bug that caused incorrect or invalid data to be produced:

```python
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases.

### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the floating-point parsing behavior prior to version 23 for overflow and subnormal values.

`fast_float` did not assign an error code on overflow in version `3.10.1`; it assigned `±Inf` on overflow and `0.0` on subnormal values. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.

### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
Yes. The CSV reader in `libarrow==23` parsed these values as strings, while earlier versions parsed them as `0` or `±Inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
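For illustration only, the same idea (uniform sampling over valid Unicode scalar values, i.e. all code points except the surrogate range, so the result is RFC 3629-encodable) can be sketched in Python; this is a hypothetical re-creation, not the C++ utility itself, and `random_utf8_string` is an invented name:

```python
import random

def random_utf8_string(n, rng=random.Random(0)):
    """Generate n uniformly random Unicode scalar values:
    code points U+0000..U+10FFFF excluding surrogates U+D800..U+DFFF."""
    chars = []
    for _ in range(n):
        # Draw from a range shrunk by the 2048 surrogate code points,
        # then shift past the surrogate block when needed.
        cp = rng.randrange(0x110000 - 0x800)
        if cp >= 0xD800:
            cp += 0x800
        chars.append(chr(cp))
    s = "".join(chars)
    s.encode("utf-8")  # must encode cleanly; a surrogate would raise here
    return s
```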

### Are these changes tested?

There are existent tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 


* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <thisisnic@gmail.com>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times; we need to tell people who want to work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request by saying that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since the Impl is forward-declared in the header file but the destructor was defined inline (via `= default`), we get compilation issues with MSVC, which requires the complete type earlier than GCC/Clang do.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use nightlies version of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.

* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note:  `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050

### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0a0 this failed to build. After some discussion, it seems this should always have required the GIL.

### What changes are included in this PR?

Moving statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct variant extension according to arrow's specification.

### What changes are included in this PR?

Modified variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix `unknown job 'odbc' error` caused by typo

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect…"></a></li>
</ul>
</blockquote>
</details>
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks Flight SQL servers that refresh the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a second `ClientCookieMiddlewareFactory` to the client options (the first one is registered in `BuildFlightClientOptions`). This fixes the duplicate cookie header fields.
2. In the Flight client layer, use a case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the duplicate cookie keys.
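The second fix can be illustrated with a minimal Python sketch (the real code is C++ and `CookieCache` is an illustrative name): keying the cache case-insensitively means a refreshed cookie replaces the old entry instead of duplicating it.

```python
# Illustrative sketch only -- the actual cache is a C++ unordered map.
class CookieCache:
    def __init__(self):
        self._cookies = {}  # lowercase name -> (original name, value)

    def set(self, name: str, value: str) -> None:
        # Case-insensitive key: a refreshed token with different
        # casing replaces the existing entry instead of adding one.
        self._cookies[name.lower()] = (name, value)

    def header(self) -> str:
        return "; ".join(f"{n}={v}" for n, v in self._cookies.values())

cache = CookieCache()
cache.set("Auth-Token", "abc")
cache.set("auth-token", "def")  # server refreshed the token
```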

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
`WriteArrowSerialize` could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all nulls before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install the Python version instead of `pymanager.msix`, which has problems on Docker.
- Update the `pymanager install` command to use the newer API (the old command fails with missing flags)
- Update the default python command to use the required free-threaded suffix when building free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of the UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions don't keep references to their arguments. Without those references, the arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel, the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.
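A minimal sketch of the corrected logic (`response_status` is a hypothetical helper, not the actual script code):

```python
import os

def response_status(root: str, requested: str) -> int:
    # Only answer 404 when the requested file genuinely doesn't exist;
    # the bug was falling through to 404 even on a hit.
    path = os.path.join(root, requested.lstrip("/"))
    return 200 if os.path.isfile(path) else 404
```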

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
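The flush-then-clear pattern can be sketched in Python (the actual fix is in the C++ writer; names here loosely mirror the helpers described above):

```python
import io

class CsvWriterSketch:
    def __init__(self, sink, header):
        self.sink = sink
        self.buf = [header + "\n"]
        self._write_and_clear_buffer()  # flush the header immediately

    def _write_and_clear_buffer(self):
        self.sink.write("".join(self.buf))
        self.buf.clear()  # clearing here is the step the bug was missing

    def write_batch(self, rows):
        for row in rows:
            self.buf.append(",".join(row) + "\n")
        self._write_and_clear_buffer()

sink = io.StringIO()
writer = CsvWriterSketch(sink, '"col1"')
writer.write_batch([])                    # empty first batch: no-op now
writer.write_batch([["a"], ["b"], ["c"]])
```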

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs.

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default, which can be very slow on our docs given the number of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left in place.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

The default Debian version in `.env` now maps to oldstable; we should use stable instead.
Also prune entries that are no longer used.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed-size variant of the binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
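The truncation rule is easy to sketch in Python (Gandiva's implementation is C++; `parse_subseconds_to_millis` is an illustrative name):

```python
def parse_subseconds_to_millis(frac: str) -> int:
    # Keep at most 3 subsecond digits (millisecond precision);
    # extra digits are truncated instead of raising "Invalid millis".
    return int(frac[:3].ljust(3, "0"))
```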

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, allowing better parsing of line numbers.

I could not find a relevant example to demonstrate within this project, but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
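The described behavior can be approximated in Python (the real implementation is in `status.cc`; this sketch only strips trailing lines that start with a `filename:line` prefix):

```python
import re

# Matches a context line such as "parser_test.cc:974  Parse(...)".
_CONTEXT_LINE = re.compile(r"^\S+:\d+\s")

def to_string_without_context_lines(msg: str) -> str:
    lines = msg.split("\n")
    # Pop trailing context lines; lines with other colons survive.
    while lines and _CONTEXT_LINE.match(lines[-1]):
        lines.pop()
    return "\n".join(lines)
```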

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

urllib requests now include the header `"user-agent": "pyarrow"`.
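For illustration, a request carrying that header looks like this (the URL is a placeholder, not the actual tzdata endpoint):

```python
import urllib.request

# Construct the request with an explicit user-agent header;
# no network traffic happens until the request is opened.
req = urllib.request.Request(
    "https://example.com/tzdata.zip",  # placeholder URL
    headers={"user-agent": "pyarrow"},
)
```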

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Both issues cause a crash (a controlled abort and a nullptr dereference) when reading invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only decimal128/256 arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide, code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`), and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of Engine::Make() and passes it to BuildJIT(). In BuildJIT(), that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
 
### Are these changes tested?
Ran the tests in the codebase, including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With this I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data around our prod reproduction into azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low-probability bugs will be fixed. No interface changes.
* GitHub Issue: #49043

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null pointer being returned; the function then interprets that as an error.

### What changes are included in this PR?
Add `kCanReturnErrors` to the function definition to match other string functions.
Move the check for zero-length input earlier in the binary_string function to prevent the zero-byte allocation.
Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year-month, day-time, and month-day-nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset sizes.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has a `unit` parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documentation for libarrow-cuda-glib is generated but isn't packaged.

### What changes are included in this PR?

Package the documentation for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenize the xsimd versions used.

### What changes are included in this PR?
Move to xsimd 14 to benefit from latest improvements relevant for improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with the current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue, we don't seem to have run asv benchmarks on Python for the last few years. The setup is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No; validated via CI and a preview-docs run to check the documentation.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

Yes; this fixes a bug that caused incorrect data to be produced:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list-based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases.

### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the behavior previous to version 23 for floating-point parsing on overflow and subnormal.

`fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases. 

### What changes are included in this PR?
Ignores `std::errc::result_out_of_range` and produces `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
It's a user-facing change. The CSV reader in `libarrow==23` was treating these values as strings, while before it parsed them as `0` or `±inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
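A minimal sketch of the generation strategy, in Python for illustration (the actual utility is C++): draw random code points, skip the surrogate range U+D800-U+DFFF so only Unicode scalar values remain, and let the encoder produce RFC 3629-conformant bytes.

```python
import random

def random_utf8_string(length, rng=None):
    """Generate a string of valid Unicode scalar values (no surrogates)."""
    if rng is None:
        rng = random.Random(42)
    chars = []
    while len(chars) < length:
        cp = rng.randrange(0, 0x110000)
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not scalar values
            continue
        chars.append(chr(cp))
    return "".join(chars)

s = random_utf8_string(16)
encoded = s.encode("utf-8")  # valid per RFC 3629 by construction
assert encoded.decode("utf-8") == s
```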

### Are these changes tested?

There are existing tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 

**This PR includes breaking changes to public APIs.** (If there are any
breaking changes to public APIs, please explain which changes are
breaking. If not, you can remove this.)

**This PR contains a "Critical Fix".** (If the changes fix either (a) a
security vulnerability, (b) a bug that caused incorrect or invalid data
to be produced, or (c) a bug that causes a crash (even when the API
contract is upheld), please provide explanation. If not, you can remove
this.)

* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <thisisnic@gmail.com>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times; we need to tell people who want to work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request by saying that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since the Impl is forward-declared in the header file but the destructor was defined inline (via `= default`), we get compilation errors with MSVC because it requires the complete type earlier than GCC/Clang do.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though, since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR; they're currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use a nightly version of Cython for the free-threaded PyArrow wheels and those builds are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

The nightly Cython install is removed and Cython is installed via the [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec. 

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a point of friction, and confusing for users who are aware of the difference.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.
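
The alias mapping can be illustrated with a minimal sketch (hypothetical helper names, not PyArrow's actual internals): both `lz4` and `lz4_raw` now resolve to the `LZ4_RAW` codec.

```python
# Sketch of the codec-name normalization described above.
# _CODEC_ALIASES and normalize_codec are illustrative names only.
_CODEC_ALIASES = {"lz4": "lz4_raw", "lz4_raw": "lz4_raw"}

def normalize_codec(name: str) -> str:
    # Previously only "lz4" was accepted; "lz4_raw" raised
    # "Unsupported compression: lz4_raw".
    name = name.lower()
    return _CODEC_ALIASES.get(name, name)

print(normalize_codec("LZ4"))      # lz4_raw
print(normalize_codec("lz4_raw"))  # lz4_raw
```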

* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data in any of the formats can have non-trivial security implications; this is an attempt at documenting them.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note: the `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050.

### Are these changes tested?

Yes, in CI.

### Are there any user-facing changes?

N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0.a0 this failed. After some discussion, it seems this should always have required the GIL.

### What changes are included in this PR?

Move the string comparison statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add a new ODBC workflow for macOS 15 (Intel) and macOS 14 (arm64).
- Add ODBC build fixes to enable building on macOS CI.

### Are these changes tested?

Tested in CI and in local macOS Intel and M1 environments.

### Are there any user-facing changes?

N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in the footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct the variant extension name according to Arrow's specification.

### What changes are included in this PR?

Modify the variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in a series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix the `unknown job 'odbc'` error, which was caused by a typo.

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect…