GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader #49060

Merged
pitrou merged 1 commit into apache:main from pitrou:ipc-oss-fuzz-fixes
Jan 29, 2026

Conversation

@pitrou
Member

@pitrou pitrou commented Jan 29, 2026

Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix": it fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

@pitrou pitrou marked this pull request as ready for review January 29, 2026 15:23
@pitrou pitrou added the backport-candidate and Critical Fix (bugfixes for security vulnerabilities, crashes, or invalid data) labels Jan 29, 2026
@pitrou pitrou requested a review from bkietz January 29, 2026 15:28
@pitrou
Member Author

pitrou commented Jan 29, 2026

@lidavidm @WillAyd Would you like to take a look at this?

@pitrou
Member Author

pitrou commented Jan 29, 2026

@github-actions crossbow submit -g cpp

@github-actions

Revision: 593fe47

Submitted crossbow builds: ursacomputing/crossbow @ actions-a646fef723

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-debian-13-cpp-amd64 GitHub Actions
test-debian-13-cpp-i386 GitHub Actions
test-debian-experimental-cpp-gcc-15 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Member Author

pitrou commented Jan 29, 2026

cc @raulcd

Contributor

@WillAyd WillAyd left a comment


lgtm

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 29, 2026
Member

@raulcd raulcd left a comment


Thanks @pitrou! Will cherry-pick as part of 23.0.1.

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Jan 29, 2026
@pitrou pitrou merged commit 3e6182a into apache:main Jan 29, 2026
88 of 89 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Jan 29, 2026
@pitrou pitrou deleted the ipc-oss-fuzz-fixes branch January 29, 2026 17:31
raulcd pushed a commit that referenced this pull request Feb 3, 2026
### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** It fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
raulcd pushed a commit that referenced this pull request Feb 4, 2026
cbb330 added a commit to cbb330/arrow that referenced this pull request Feb 20, 2026
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
      |                            
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks a Flight SQL server that refreshes the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a 2nd ClientCookieMiddlewareFactory in the client options (the 1st one is registered in `BuildFlightClientOptions`). This fixes the issue of the duplicate header cookie fields.
2. In the flight client layer, use the case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the issue of duplicate cookie keys.

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
WriteArrowSerialize could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all null before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install python version instead of `pymanager.msix` which has problems on Docker.
- Update `pymanager install` command to use newer API (old command fails with missing flags)
- Update default python command to use the free-threaded required suffix if free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions are missing references to their arguments. Without those references, the arguments may be freed by GC.

### What changes are included in this PR?

* Keep references to the arguments of `garrow_filter_node_options_new()`
* Keep references to the arguments of `garrow_project_node_options_new()`
* Keep references to the arguments of `garrow_aggregate_node_options_new()`
* Keep references to the arguments of `garrow_literal_expression_new()`
* Keep references to the arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
Timing out the job and failing.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs.

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default and that can be very slow on our docs given the amount of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` has been obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left unremoved.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

Default Debian version in `.env` now maps to oldstable, we should use stable instead.
Also prune entries that are not used anymore.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed size variant of binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, which allows better parsing of line numbers.

I could not find a relevant example to demonstrate this within the project, but assume we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

urllib now sends requests with `"user-agent": "pyarrow"`.

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** It fixes crashes (a controlled abort and a nullptr dereference) triggered by invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only the decimal128/256 variants of decimal arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide the code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`), and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change, 3 TargetMachines were created:

1. In `Engine::Make()`, `MakeTargetMachineBuilder()` is called, then `BuildJIT()` is called. Inside `LLJITBuilder::create()`, when `prepareForConstruction()` runs, if no DataLayout was set, it calls `JTMB->getDefaultDataLayoutForTarget()`, which creates a temporary TargetMachine just to get the DataLayout.

2. Inside `BuildJIT()`, when `setCompileFunctionCreator` is used with the lambda, that lambda calls `JTMB.createTargetMachine()` to create a TargetMachine for the `TMOwningSimpleCompiler`.

3. Back in `Engine::Make()`, after `BuildJIT()` returns, there is an explicit call to `jtmb.createTargetMachine()` to create `target_machine_` for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of `Engine::Make()`. That machine is passed to `BuildJIT()`, where its DataLayout is given to `LLJITBuilder`, which prevents `prepareForConstruction()` from calling `getDefaultDataLayoutForTarget()` (which would create a temporary TargetMachine).

Use `SimpleCompiler` instead of `TMOwningSimpleCompiler`: `SimpleCompiler` takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created. A `shared_ptr` is used to ensure the TargetMachine stays alive for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
 
### Are these changes tested?
Ran the tests in the codebase including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With that, I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce the issue in any checked-in tests. I tried copying a chunk of data around our prod reproduction into Azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low probability bugs will be gone. No interface changes. 
* GitHub Issue: #49043

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null ptr being returned and the function interprets that as an error.

### What changes are included in this PR?
- Add `kCanReturnErrors` to the function definition to match other string functions.
- Move the check for zero-length input earlier in the `binary_string` function to prevent the zero-byte allocation.
- Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year month/day time/month day nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset size.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has unit parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documents for libarrow-cuda-glib are generated but they aren't packaged.

### What changes are included in this PR?

Package documents for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenized versions used

### What changes are included in this PR?
Move to xsimd 14 to benefit from latest improvements relevant for improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue we don't seem to have run asv benchmarks on Python for the last years. It is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No new tests; validated via CI and a preview-docs run to check the docs.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

a bug that caused incorrect or invalid data to be produced:

```python
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases.

### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the floating-point parsing behavior prior to version 23 for overflow and subnormal values.

`fast_float` did not assign an error code on overflow in version `3.10.1`; it assigned `±Inf` on overflow and `0.0` on subnormal values. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases.

### What changes are included in this PR?
Ignore `std::errc::result_out_of_range` and produce `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
Yes. The CSV reader in `libarrow==23` parsed these values as strings, while earlier versions parsed them as `0` or `±Inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
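For illustration only, the same idea (uniform sampling over valid Unicode scalar values, i.e. all code points except the surrogate range, so the result is RFC 3629-encodable) can be sketched in Python; this is a hypothetical re-creation, not the C++ utility itself, and `random_utf8_string` is an invented name:

```python
import random

def random_utf8_string(n, rng=random.Random(0)):
    """Generate n uniformly random Unicode scalar values:
    code points U+0000..U+10FFFF excluding surrogates U+D800..U+DFFF."""
    chars = []
    for _ in range(n):
        # Draw from a range shrunk by the 2048 surrogate code points,
        # then shift past the surrogate block when needed.
        cp = rng.randrange(0x110000 - 0x800)
        if cp >= 0xD800:
            cp += 0x800
        chars.append(chr(cp))
    s = "".join(chars)
    s.encode("utf-8")  # must encode cleanly; a surrogate would raise here
    return s
```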

### Are these changes tested?

There are existent tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 


* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <thisisnic@gmail.com>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times; we need to tell people who want to work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request by saying that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since the Impl is forward-declared in the header file but the destructor was defined inline (via `= default`), we get compilation issues with MSVC, which requires the complete type earlier than GCC/Clang do.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR, it's currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use nightlies version of Cython for free-threaded PyArrow wheels and they are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

Nightly Cython install is removed and Cython is installed via [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec.

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a friction issue, and confusing for some users who are aware of the differences.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.

* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note:  `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050

### Are these changes tested?
Yes, in CI
### Are there any user-facing changes?
N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0a0 this failed to build. After some discussion, it seems this should always have required the GIL.

### What changes are included in this PR?

Moving statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add new ODBC workflow for macOS Intel 15 and 14 arm64.
- Added ODBC build fixes to enable build on macOS CI.
### Are these changes tested?
Tested in CI and local macOS Intel and M1 environments.
### Are there any user-facing changes?
N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct variant extension according to arrow's specification.

### What changes are included in this PR?

Modified variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix `unknown job 'odbc' error` caused by typo

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@crazy-max</code></a> in <a href="https://redirect…"></a></li>
</ul>
</blockquote>
</details>
* GH-48965: [Python][C++] Compare unique_ptr for CFlightResult or CFlightInfo to nullptr instead of NULL (#48968)

### Rationale for this change

Cython built code is currently failing to compile on free threaded wheels due to:
```
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp: In function ‘PyObject* __pyx_gb_7pyarrow_7_flight_12FlightClient_9do_action_2generator2(__pyx_CoroutineObject*, PyThreadState*, PyObject*)’:
/arrow/python/build/temp.linux-x86_64-cpython-313t/_flight.cpp:43068:110: error: call of overloaded ‘unique_ptr(NULL)’ is ambiguous
43068 |           __pyx_t_3 = (__pyx_cur_scope->__pyx_v_result->result == ((std::unique_ptr< arrow::flight::Result> )NULL));
```

### What changes are included in this PR?

Update comparing `unique_ptr[CFlightResult]` and `unique_ptr[CFlightInfo]` from `NULL` to `nullptr`.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48965

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48924: [C++][CI] Fix pre-buffering issues in IPC file reader (#48925)

### What changes are included in this PR?

Bug fixes and robustness improvements in the IPC file reader:
* Fix bug reading variadic buffers with pre-buffering enabled
* Fix bug reading dictionaries with pre-buffering enabled
* Validate IPC buffer offsets and lengths

Testing improvements:
* Exercise pre-buffering in IPC tests
* Actually exercise variadic buffers in IPC tests, by ensuring non-inline binary views are generated
* Run fuzz targets on golden IPC integration files in ASAN/UBSAN CI job
* Exercise pre-buffering in the IPC file fuzz target

Miscellaneous:
* Add convenience functions for integer overflow checking

### Are these changes tested?

Yes, by existing and improved tests.

### Are there any user-facing changes?

Bug fixes.

**This PR contains a "Critical Fix".** Fixes a potential crash reading variadic buffers with pre-buffering enabled.

* GitHub Issue: #48924

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48966: [C++] Fix cookie duplication in the Flight SQL ODBC driver and the Flight Client (#48967)


### Rationale for this change

The bug breaks Flight SQL servers that refresh the auth token when cookie authentication is enabled.

### What changes are included in this PR?

1. In the ODBC layer, removed the code that adds a second `ClientCookieMiddlewareFactory` to the client options (the first one is registered in `BuildFlightClientOptions`). This fixes the duplicate cookie header fields.
2. In the Flight client layer, use a case-insensitive equality comparator instead of the case-insensitive less-than comparator for the cookie cache, which is an unordered map. This fixes the duplicate cookie keys.
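The second fix can be illustrated with a minimal Python sketch (the real code is C++ and `CookieCache` is an illustrative name): keying the cache case-insensitively means a refreshed cookie replaces the old entry instead of duplicating it.

```python
# Illustrative sketch only -- the actual cache is a C++ unordered map.
class CookieCache:
    def __init__(self):
        self._cookies = {}  # lowercase name -> (original name, value)

    def set(self, name: str, value: str) -> None:
        # Case-insensitive key: a refreshed token with different
        # casing replaces the existing entry instead of adding one.
        self._cookies[name.lower()] = (name, value)

    def header(self) -> str:
        return "; ".join(f"{n}={v}" for n, v in self._cookies.values())

cache = CookieCache()
cache.set("Auth-Token", "abc")
cache.set("auth-token", "def")  # server refreshed the token
```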

### Are these changes tested?
Manually on Windows, and CI

### Are there any user-facing changes?

No
* GitHub Issue: #48966

Authored-by: jianfengmao <jianfengmao@deephaven.io>
Signed-off-by: David Li <li.davidm96@gmail.com>

* GH-48691: [C++][Parquet] Write serializer may crash if the value buffer is empty (#48692)

### Rationale for this change
`WriteArrowSerialize` could unconditionally read values from the Arrow array even for null rows. Since the caller could provide a zero-sized dummy buffer for all-null arrays, this caused an ASAN heap-buffer-overflow.

### What changes are included in this PR?
Check early that the array is not all nulls before serializing it.

### Are these changes tested?

Added tests.
### Are there any user-facing changes?

No

* GitHub Issue: #48691

Authored-by: rexan <rexan@apache.org>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48947 [CI][Python] Install pymanager.msi instead of pymanager.msix to fix docker rebuild on Windows wheels (#48948)

### Rationale for this change

As soon as we have to rebuild our Windows docker images they will fail installing python-manager-25.0.msix

### What changes are included in this PR?

- Use `pymanager.msi` to install the Python version instead of `pymanager.msix`, which has problems on Docker.
- Update the `pymanager install` command to use the newer API (the old command fails with missing flags)
- Update the default python command to use the required free-threaded suffix when building free-threaded wheels

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #48947

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48990: [Ruby] Add support for writing date arrays (#48991)

### Rationale for this change

There are date32 and date64 variants for date arrays.

### What changes are included in this PR?

* Add `ArrowFormat::DateType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48990

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48992: [Ruby] Add support for writing large UTF-8 array (#48993)

### Rationale for this change

It's a large variant of the UTF-8 array.

### What changes are included in this PR?

* Add `ArrowFormat::LargeUTF8Type#to_flatbuffers`
* Add support for large UTF-8 array of `#values` and `#raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48992

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48949: [C++][Parquet] Add Result versions for parquet::arrow::FileReader::ReadRowGroup(s) (#48982)

### Rationale for this change
`FileReader::ReadRowGroup(s)` previously returned `Status` and required callers to pass an `out` parameter.
### What changes are included in this PR?
Introduce `Result<std::shared_ptr<Table>>` returning APIs to allow clearer error propagation:
  - Add new Result-returning `ReadRowGroup()` / `ReadRowGroups()` methods
  - Deprecate the old Status/out-parameter overloads
  - Update C++ callers and R/Python/GLib bindings to use the new API
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
Status versions of FileReader::ReadRowGroup(s) have been deprecated.
```cpp
virtual ::arrow::Status ReadRowGroup(int i, const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroup(int i, std::shared_ptr<::arrow::Table>* out);

virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      const std::vector<int>& column_indices,
                                      std::shared_ptr<::arrow::Table>* out);
virtual ::arrow::Status ReadRowGroups(const std::vector<int>& row_groups,
                                      std::shared_ptr<::arrow::Table>* out);
```
* GitHub Issue: #48949

Lead-authored-by: fenfeng9 <fenfeng9@qq.com>
Co-authored-by: fenfeng9 <36840213+fenfeng9@users.noreply.github.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48985: [GLib][Ruby] Fix GC problems in node options and expressions (#48989)

### Rationale for this change

Some node options and expressions don't keep references to their arguments. Without those references, the arguments may be freed by GC.

### What changes are included in this PR?

* Refer arguments of `garrow_filter_node_options_new()`
* Refer arguments of `garrow_project_node_options_new()`
* Refer arguments of `garrow_aggregate_node_options_new()`
* Refer arguments of `garrow_literal_expression_new()`
* Refer arguments of `garrow_call_expression_new()`
 
### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48985

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-47692: [CI][Python] Do not fallback to return 404 if wheel is found on emscripten jobs (#49007)

### Rationale for this change

When looking for the wheel, the script was falling back to returning a 404 even when the wheel was found:
```
 + python scripts/run_emscripten_tests.py dist/pyarrow-24.0.0.dev31-cp312-cp312-pyodide_2024_0_wasm32.whl --dist-dir=/pyodide --runtime=chrome
127.0.0.1 - - [27/Jan/2026 01:14:50] code 404, message File not found
```
This caused the job to time out and fail.

### What changes are included in this PR?

Correct logic and only return 404 if the file requested wasn't found.
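A minimal sketch of the corrected logic (`response_status` is a hypothetical helper, not the actual script code):

```python
import os

def response_status(root: str, requested: str) -> int:
    # Only answer 404 when the requested file genuinely doesn't exist;
    # the bug was falling through to 404 even on a hit.
    path = os.path.join(root, requested.lstrip("/"))
    return 200 if os.path.isfile(path) else 404
```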

### Are these changes tested?

Yes via archery

### Are there any user-facing changes?

No
* GitHub Issue: #47692

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48912: [R] Configure C++20 in conda R on continuous benchmarking (#48974)

### Rationale for this change

Benchmark failing since C++20 upgrade due to lack of C++20 configuration

### What changes are included in this PR?

Changes entirely from :robot: (Claude) with discussion from me regarding optimal approach.  

Description as follows:

> conda-forge's R package doesn't have CXX20 configured in Makeconf, even though the compiler (gcc 14.3.0) supports C++20. This causes Arrow R package installation to fail with "a C++20 compiler is required" because `R CMD config CXX20` returns empty. 
>
> This PR adds CXX20 configuration to R's Makeconf before building the Arrow R package in the benchmark hooks, if not already present.                                                               

### Are these changes tested?

I got :robot:  to try it locally in a container but I'm not convinced we'll know for sure til we try it out properly.

>  Tested in Docker container with Amazon Linux 2023 + conda-forge R - confirmed `R CMD config CXX20` returns empty before patch and `g++` after patch.
>
> The only thing we didn't test end-to-end was actually building Arrow R, but that would have taken much longer and the configure check (R CMD config CXX20 returning non-empty) is exactly what Arrow's configure script tests before proceeding.                                       

### Are there any user-facing changes?

Nope
* GitHub Issue: #48912

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-36889: [C++][Python] Fix duplicate CSV header when first batch is empty (#48718)

### Rationale for this change

Fixes https://github.com/apache/arrow/issues/36889

When writing CSV from a table where the first batch is empty, the header gets written twice:

```python
table = pa.table({"col1": ["a", "b", "c"]})
combined = pa.concat_tables([table.schema.empty_table(), table])
write_csv(combined, buf)
# Result: "col1"\n"col1"\n"a"\n"b"\n"c"\n  <-- header appears twice
```

### What changes are included in this PR?

The bug happens because:
1. Header is written to `data_buffer_` and flushed during `CSVWriterImpl` initialization
2. The buffer is not cleared after flush
3. When the next batch is empty, `TranslateMinimalBatch` returns early without modifying `data_buffer_`
4. The write loop then writes `data_buffer_` which still contains stale content

The fix introduces a `WriteAndClearBuffer()` helper that writes the buffer to sink and clears it. This helper is used in all write paths:
- `WriteHeader()`
- `WriteRecordBatch()`
- `WriteTable()`

This ensures the buffer is always clean after any flush, making it impossible for stale content to be written again.
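The flush-then-clear pattern can be sketched in Python (the actual fix is in the C++ writer; names here loosely mirror the helpers described above):

```python
import io

class CsvWriterSketch:
    def __init__(self, sink, header):
        self.sink = sink
        self.buf = [header + "\n"]
        self._write_and_clear_buffer()  # flush the header immediately

    def _write_and_clear_buffer(self):
        self.sink.write("".join(self.buf))
        self.buf.clear()  # clearing here is the step the bug was missing

    def write_batch(self, rows):
        for row in rows:
            self.buf.append(",".join(row) + "\n")
        self._write_and_clear_buffer()

sink = io.StringIO()
writer = CsvWriterSketch(sink, '"col1"')
writer.write_batch([])                    # empty first batch: no-op now
writer.write_batch([["a"], ["b"], ["c"]])
```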

### Are these changes tested?

Yes. Added C++ tests in `writer_test.cc` and Python tests in `test_csv.py`:
- Empty batch at start of table
- Empty batch in middle of table

### Are there any user-facing changes?

No API changes. This is a bug fix that prevents duplicate headers when writing CSV from tables with empty batches.

* GitHub Issue: #36889

Lead-authored-by: Ruiyang Wang <ruiyang@anthropic.com>
Co-authored-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-48932: [C++][Packaging][FlightRPC] Fix `rsync` build error ODBC Nightly Package (#48933)

### Rationale for this change
#48932
### What changes are included in this PR?
- Fix `rsync` build error ODBC Nightly Package 
### Are these changes tested?
- tested in CI
### Are there any user-facing changes?
- After fix, users should be able to get Nightly ODBC package release

* GitHub Issue: #48932

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48951: [Docs] Add documentation relating to AI tooling (#48952)

### Rationale for this change

Add guidance re AI tooling

### What changes are included in this PR?

Updates to main docs and links to it from new contributor's guide

### Are these changes tested?

No, but I'll build the docs.

### Are there any user-facing changes?

Just docs

:robot: Changes generated using Claude Code - I took the discussion from the mailing list, asked it to add the original text and then apply suggested changes one at a time, made a few of my own tweaks, and then instructed it to edit things down a bit for clarity and conciseness.
* GitHub Issue: #48951

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49029: [Doc] Run sphinx-build in parallel (#49026)

### Rationale for this change

`sphinx-build` allows for parallel operation, but it builds serially by default, which can be very slow on our docs given the number of documents (many of them auto-generated from API docs).

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49029

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-33450: [C++] Remove GlobalForkSafeMutex (#49033)

### Rationale for this change

This functionality is unused now that we have a proper atfork facility.

### Are these changes tested?

By existing CI tests.

### Are there any user-facing changes?

Removing an API that was always meant for internal use (though we didn't flag it explicitly as internal).

* GitHub Issue: #33450

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-35437: [C++] Remove obsolete TODO about DictionaryArray const& return types (#48956)

### Rationale for this change

The TODO comment in `vector_array_sort.cc` asking whether `DictionaryArray::dictionary()` and `DictionaryArray::indices()` should return `const&` is obsolete.

It was added in commit 6ceb12f700a when dictionary array sorting was implemented. At that time, these methods returned `std::shared_ptr<Array>` by value, causing unnecessary copies.

The issue was fixed in commit 95a8bfb319b, which changed both methods to return `const std::shared_ptr<Array>&`, removing the copies. However, the TODO comment was left in place.

### What changes are included in this PR?

Removed the outdated TODO comment that referenced GH-35437.

### Are these changes tested?

I did not test.

### Are there any user-facing changes?

No.
* GitHub Issue: #35437

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48586: [Python][CI] Upload artifact to python-sdist job (#49008)

### Rationale for this change

When running the python-sdist job we are currently not uploading the build artifact to the job.

### What changes are included in this PR?

Upload artifact as part of building the job so it's easier to test and validate contents if necessary.

### Are these changes tested?

Yes via archery.

### Are there any user-facing changes?

No

* GitHub Issue: #48586

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* MINOR: [R] Add 22.0.0.1 to compatibility matrix (#49039)

### Rationale for this change

CI needs updating to test old R package versions

### What changes are included in this PR?

Add 22.0.0.1

### Are these changes tested?

Nah, it's CI stuff

### Are there any user-facing changes?

No

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48961: [Docs][Python] Doctest fails on pandas 3.0 (#48969)

### Rationale for this change
See issue #48961
Pandas 3.0.0 string storage type changes https://github.com/pandas-dev/pandas/pull/62118/changes
and https://pandas.pydata.org/docs/whatsnew/v3.0.0.html#dedicated-string-data-type-by-default

### What changes are included in this PR?
Updating several doctest examples from `string` to `large_string`.

### Are these changes tested?
Yes, locally.

### Are there any user-facing changes?
No.

Closes #48961 
* GitHub Issue: #48961

Authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49037: [Benchmarking] Install R from non-conda source for benchmarking  (#49038)

### Rationale for this change

Slow benchmarks due to conda duckdb building from source

### What changes are included in this PR?

Try ditching conda and installing R via rig and using PPM binaries

### Are these changes tested?

I'll try running

### Are there any user-facing changes?
 
Nope
* GitHub Issue: #49037

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49042: [C++] Remove mimalloc patch (#49041)

### Rationale for this change

This patch was integrated upstream in https://github.com/microsoft/mimalloc/pull/1139

### Are these changes tested?

By existing CI.

### Are there any user-facing changes?

No.
* GitHub Issue: #49042

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49024: [CI] Update Debian version in `.env` (#49032)

### Rationale for this change

The default Debian version in `.env` now maps to oldstable; we should use stable instead.
Also prune entries that are no longer used.

### Are these changes tested?

By existing CI jobs.

### Are there any user-facing changes?

No.
* GitHub Issue: #49024

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49027: [Ruby] Add support for writing time arrays (#49028)

### Rationale for this change

There are 32/64 bit and second/millisecond/microsecond/nanosecond variants for time arrays.

### What changes are included in this PR?

* Add `ArrowFormat::TimeType#to_flatbuffers`
* Add bit width information to `ArrowFormat::TimeType`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #49027

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49030: [Ruby] Add support for writing fixed size binary array (#49031)

### Rationale for this change

It's a fixed-size variant of the binary array.

### What changes are included in this PR?

* Add `ArrowFormat::FixedSizeBinaryType#to_flatbuffers`
* Add `ArrowFormat::FixedSizeBinaryArray#each_buffer`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49030

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48866: [C++][Gandiva] Truncate subseconds beyond milliseconds in `castTIMESTAMP_utf8` and `castTIME_utf8` (#48867)

### Rationale for this change

Fixes #48866. The Gandiva precompiled time functions `castTIMESTAMP_utf8` and `castTIME_utf8` currently reject timestamp and time string literals with more than 3 subsecond digits (beyond millisecond precision), throwing an "Invalid millis" error. This behavior is inconsistent with other implementations.

### What changes are included in this PR?

- Fixed `castTIMESTAMP_utf8` and `castTIME_utf8` functions to truncate subseconds beyond 3 digits instead of throwing an error
- Updated tests. Replaced error-expecting tests with truncation verification tests and added edge cases
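The truncation rule is easy to sketch in Python (Gandiva's implementation is C++; `parse_subseconds_to_millis` is an illustrative name):

```python
def parse_subseconds_to_millis(frac: str) -> int:
    # Keep at most 3 subsecond digits (millisecond precision);
    # extra digits are truncated instead of raising "Invalid millis".
    return int(frac[:3].ljust(3, "0"))
```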

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #48866

Authored-by: Arkadii Kravchuk <arkadii.kravchuk@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48673: [C++] Fix ToStringWithoutContextLines to check for :\d+ pattern before removing lines (#48674)

### Rationale for this change

This PR proposes to fix the TODO at https://github.com/apache/arrow/blob/7ebc88c8fae62ed97bc30865c845c8061132af7e/cpp/src/arrow/status.cc#L131-L134, allowing better parsing of line numbers.

I could not find a relevant example to demonstrate within this project, but assume that we have a test such as:

(Generated by ChatGPT)

```cpp
TEST(BlockParser, ErrorMessageWithColonsPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
            "Error details: Time format: 12:34:56, Key: value\n"
            "parser_test.cc:940  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 2 columns, got 3: 12:34:56,key:value,data\n"
      "Error details: Time format: 12:34:56, Key: value";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}

// Test with URL-like data (another common case with colons)
TEST(BlockParser, ErrorMessageWithURLPreserved) {
  Status st(StatusCode::Invalid,
            "CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
            "URL: http://arrow.apache.org:8080/api\n"
            "parser_test.cc:974  Parse(parser, csv, &out_size)");

  std::string expected_msg =
      "Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: http://arrow.apache.org:8080/api,data\n"
      "URL: http://arrow.apache.org:8080/api";

  ASSERT_RAISES_WITH_MESSAGE(Invalid, expected_msg, st);
}
```

then it fails.

### What changes are included in this PR?

Fixed `Status::ToStringWithoutContextLines()` to only remove context lines matching the `filename:line` pattern (`:\d+`), preventing legitimate error messages containing colons from being incorrectly stripped.
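The described behavior can be approximated in Python (the real implementation is in `status.cc`; this sketch only strips trailing lines that start with a `filename:line` prefix):

```python
import re

# Matches a context line such as "parser_test.cc:974  Parse(...)".
_CONTEXT_LINE = re.compile(r"^\S+:\d+\s")

def to_string_without_context_lines(msg: str) -> str:
    lines = msg.split("\n")
    # Pop trailing context lines; lines with other colons survive.
    while lines and _CONTEXT_LINE.match(lines[-1]):
        lines.pop()
    return "\n".join(lines)
```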

### Are these changes tested?

Manually tested, and unittests were added, with `cmake .. --preset ninja-debug -DARROW_EXTRA_ERROR_CONTEXT=ON`.

### Are there any user-facing changes?

No, test-only.

* GitHub Issue: #48673

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49044: [CI][Python] Fix test_download_tzdata_on_windows by adding required user-agent on urllib request (#49052)

### Rationale for this change

See: #49044

### What changes are included in this PR?

urllib requests now include the header `"user-agent": "pyarrow"`.
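For illustration, a request carrying that header looks like this (the URL is a placeholder, not the actual tzdata endpoint):

```python
import urllib.request

# Construct the request with an explicit user-agent header;
# no network traffic happens until the request is opened.
req = urllib.request.Request(
    "https://example.com/tzdata.zip",  # placeholder URL
    headers={"user-agent": "pyarrow"},
)
```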

### Are these changes tested?

It's a CI fix.

### Are there any user-facing changes?

No, just a CI test fix.
* GitHub Issue: #49044

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48983: [Packaging][Python] Build wheel from sdist using build and add check to validate LICENSE.txt and NOTICE.txt are part of the wheel contents (#48988)

### Rationale for this change

Currently the files are missing from the published wheels.

### What changes are included in this PR?

- Ensure the license and notice files are part of the wheels
- Use build frontend to build wheels
- Build wheel from sdist

### Are these changes tested?

Yes, via archery.
I've validated all wheels will fail with the new check if LICENSE.txt or NOTICE.txt are missing:
```
 AssertionError: LICENSE.txt is missing from the wheel.
```

### Are there any user-facing changes?

No

* GitHub Issue: #48983

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49059: [C++] Fix issues found by OSS-Fuzz in IPC reader (#49060)

### Rationale for this change

Fix two issues found by OSS-Fuzz in the IPC reader:

* a controlled abort on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5301064831401984
* a nullptr dereference on invalid IPC metadata: https://oss-fuzz.com/testcase-detail/5091511766417408

Neither of these issues is a security issue.

### Are these changes tested?

Yes, by new unit tests and new fuzz regression files.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Both issues cause a crash (a controlled abort and a nullptr dereference) when reading invalid IPC metadata.

* GitHub Issue: #49059

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49055: [Ruby] Add support for writing decimal128/256 arrays (#49056)

### Rationale for this change

Only decimal128/256 arrays are supported.

### What changes are included in this PR?

Add `ArrowFormat::DecimalType#to_flatbuffers`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49055

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49053: [Ruby] Add support for writing timestamp array (#49054)

### Rationale for this change

It has `unit` and `time_zone` parameters.

### What changes are included in this PR?

* Add `ArrowFormat::TimestampType#to_flatbuffers`
* Set time zone when GLib timestamp type is converted from C++ timestamp type
* Use `time_zone` not `timezone`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49053

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-28859: [Doc][Python] Use only code-block directive and set up doctest for the python user guide (#48619)

### Rationale for this change

In many places in the Python User Guide, code examples are written with the IPython directive (elsewhere code-block is used). IPython directives are converted to IPython format (`In` and `Out`) during the doc build. This can lead to slower builds.

### What changes are included in this PR?

IPython directives are converted to runnable code-blocks (with `>>>` and `...`), and pytest doctest support for `.rst` files is added to the `conda-python-docs` CI job. This means the code in the Python User Guide is tested separately from the building of the documentation.

### Are these changes tested?

Yes, with the CI.

### Are there any user-facing changes?

Changes to the Python User Guide examples will have to be tested with `pytest --doctest-glob='*.rst' docs/source/python/file.rst`

* GitHub Issue: #28859

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-49065: [C++] Remove unnecessary copies of shared_ptr in Type::BOOL and Type::NA at GrouperImpl (#49066)

### Rationale for this change

The grouper code was creating a `shared_ptr<DataType>` for every key type, even when it wasn't needed. This resulted in unnecessary reference counting operations. For example, `BooleanKeyEncoder` and `NullKeyEncoder` don't require a `shared_ptr` in their constructors, yet we were creating one for every key of those types.

### What changes are included in this PR?

Changed `GrouperImpl::Make()` to use `TypeHolder` references directly and only call `GetSharedPtr()` when needed by encoder constructors. This eliminates `shared_ptr` creation for `Type::BOOL` and `Type::NA` cases. Other encoder types (dictionary, fixed-width, binary) still require `shared_ptr` since their constructors take `shared_ptr<DataType>` parameters for ownership.

### Are these changes tested?

Yes, existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #49065

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48159 [C++][Gandiva] Projector make is significantly slower after move to OrcJIT (#49063)

### Rationale for this change
Reduces LLVM TargetMachine object creation from 3 to 1. This object is expensive to create and the extra copies weren't needed.

### What changes are included in this PR?
Refactor the Engine class to only create one target machine and pass that to the necessary functions.

Before the change (3 TargetMachines created):

First TargetMachine: In Engine::Make(), MakeTargetMachineBuilder() is called, then BuildJIT() is called. Inside LLJITBuilder::create(), when prepareForConstruction() runs, if no DataLayout was set, it calls JTMB->getDefaultDataLayoutForTarget() which creates a temporary TargetMachine just to get the DataLayout.

Second TargetMachine: Inside BuildJIT(), when setCompileFunctionCreator is used with the lambda, that lambda calls JTMB.createTargetMachine() to create a TargetMachine for the TMOwningSimpleCompiler.

Third TargetMachine: Back in Engine::Make(), after BuildJIT() returns, there's an explicit call to jtmb.createTargetMachine() to create target_machine_ for the Engine.

After the change (1 TargetMachine created):

The key changes are:

Create the TargetMachine first: the code now creates the TargetMachine explicitly at the start of Engine::Make() and passes it to BuildJIT(). In BuildJIT(), that machine's DataLayout is passed to LLJITBuilder, which prevents prepareForConstruction() from calling getDefaultDataLayoutForTarget() (which would create a temporary TargetMachine).

Use SimpleCompiler instead of TMOwningSimpleCompiler:
SimpleCompiler takes a reference to an existing TargetMachine rather than owning one, so no new TargetMachine is created.
A shared_ptr is used to ensure that TargetMachine stays around for the lifetime of the LLJIT instance.

### Are these changes tested?
Yes, unit and integration.

### Are there any user-facing changes?
No.

* GitHub Issue: #48159

Lead-authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Co-authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49043: [C++][FS][Azure] Avoid bugs caused by empty first page(s) followed by non-empty subsequent page(s) (#49049)

### Rationale for this change
Prevent bugs similar to https://github.com/apache/arrow/issues/49043

### What changes are included in this PR?
- Implement `SkipStartingEmptyPages` for various types of PagedResponses used in the `AzureFileSystem`.
- Apply `SkipStartingEmptyPages` on the response from every list operation that returns a paged response.
 
### Are these changes tested?
Ran the tests in the codebase, including the ones that need to connect to real blob storage. This makes me fairly confident that I haven't introduced a regression.

The only reproduction I've found involves reading a production Azure blob storage account. With this I've verified that this PR solves https://github.com/apache/arrow/issues/49043, but I haven't been able to reproduce it in any checked-in tests. I tried copying a chunk of data around our prod reproduction into azurite, but still can't reproduce it.

### Are there any user-facing changes?
Some low-probability bugs will be fixed. No interface changes.
* GitHub Issue: #49043

Authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49034 [C++][Gandiva] Fix binary_string to not trigger error for null strings (#49035)

### Rationale for this change

The binary_string function will attempt to allocate 0 bytes of memory, which results in a null pointer being returned; the function then interprets that as an error.

### What changes are included in this PR?
Add `kCanReturnErrors` to the function definition to match other string functions.
Move the check for zero-length input earlier in the binary_string function to prevent the zero-byte allocation.
Add a unit test.

### Are these changes tested?
Yes, unit and integration testing.

### Are there any user-facing changes?
No.

* GitHub Issue: #49034

Authored-by: Logan Riggs <logan.riggs@dremio.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48980: [C++] Use COMPILE_OPTIONS instead of deprecated COMPILE_FLAGS (#48981)

### Rationale for this change

Arrow requires CMake 3.25 but was still using the deprecated `COMPILE_FLAGS` property. It is recommended to use `COMPILE_OPTIONS` (introduced in CMake 3.11) instead.

### What changes are included in this PR?

Replaced `COMPILE_FLAGS` with `COMPILE_OPTIONS` across `CMakeLists.txt` files, converted space-separated strings to semicolon-separated lists, and removed obsolete TODO comments.

### Are these changes tested?

Yes, through CI build and existing tests.

### Are there any user-facing changes?

No.
* GitHub Issue: #48980

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49069: [C++] Share Trie instances across CSV value decoders (#49070)

### Rationale for this change

The CSV converter was building identical Trie data structures (for null/true/false values) in every decoder instance, causing duplicate memory allocation and initialization overhead.

### What changes are included in this PR?

- Introduced `TrieCache` struct to hold shared Trie instances (null_trie, true_trie, false_trie)
- Updated `ValueDecoder` and all decoder subclasses to accept and reference a shared `TrieCache` instead of building their own Tries
- Updated `Converter` base class to create one `TrieCache` per converter and pass it to all decoders

### Are these changes tested?

Yes, all existing tests. I ran a simple benchmark showing roughly 2-4% faster converter creation, and obviously less memory usage.

### Are there any user-facing changes?

No.
* GitHub Issue: #49069

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49076: [CI] Update vcpkg baseline to newer version (#49062)

### Rationale for this change

The current version of vcpkg used is from April 2025.

### What changes are included in this PR?

Update baseline to newer version.

### Are these changes tested?

Yes on CI. I've validated for example that xsimd 14 will be pulled.

### Are there any user-facing changes?
No

* GitHub Issue: #49076

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49074: [Ruby] Add support for writing interval arrays (#49075)

### Rationale for this change

There are year-month, day-time, and month-day-nano variants.

### What changes are included in this PR?

* Add `ArrowFormat::IntervalType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49074

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49071: [Ruby] Add support for writing list and large list arrays (#49072)

### Rationale for this change

They use different offset sizes.

### What changes are included in this PR?

* Add `ArrowFormat::ListType#to_flatbuffers`
* Add `ArrowFormat::LargeListType#to_flatbuffers`
* Add `ArrowFormat::VariableSizeListArray#child`
* Add `ArrowFormat::VariableSizeListArray#each_buffer`
* `garrow_array_get_null_bitmap()` returns `NULL` when null bitmap doesn't exist
* Add `garrow_list_array_get_value_offsets_buffer()`
* Add `garrow_large_list_array_get_value_offsets_buffer()`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49071

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49087 [CI][Packaging][Gandiva] Add support for LLVM 15 or earlier again (#49091)

### Rationale for this change

LLVM 15 or earlier uses `llvm::Optional` not `std::optional`.

### What changes are included in this PR?

Use `llvm::Optional` with LLVM 15 or earlier.

### Are these changes tested?

Yes, compiling.

### Are there any user-facing changes?

No

* GitHub Issue: #49087

Authored-by: logan.riggs@gmail.com <logan.riggs@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49100: [Docs] Broken link to Swift page in implementations.rst (#49101)

### Rationale for this change

The Swift documentation link in the implementations.rst file was broken and returned a 404 error.

### What changes are included in this PR?

Updated the Swift documentation link in https://github.com/apache/arrow/blob/235841d644d5454f7067c44f580f301446ba1cc0/docs/source/implementations.rst?plain=1#L124 from the [broken GitHub README link](https://github.com/apache/arrow-swift/blob/main/Arrow/README.md) to the [Swift Package documentation](https://swiftpackageindex.com/apache/arrow-swift/main/documentation/arrow)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49100

Lead-authored-by: ChiLin Chiu <chilin.chiou@gmail.com>
Co-authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49096: [Ruby] Add support for writing struct array (#49097)

### Rationale for this change

It's a nested array.

### What changes are included in this PR?

* Add `ArrowFormat::StructType#to_flatbuffers`
* Add `ArrowFormat::StructArray#each_buffer`
* Add `ArrowFormat::StructArray#children`
* Fix `ArrowFormat::Array#n_nulls`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49096

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49093: [Ruby] Add support for writing duration array (#49094)

### Rationale for this change

It has a `unit` parameter.

### What changes are included in this PR?

* Add `ArrowFormat::DurationType#to_flatbuffers`
* Add duration support to `#values` and `raw_records`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49093

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49098: [Packaging][deb] Add missing libarrow-cuda-glib-doc (#49099)

### Rationale for this change

Documentation for libarrow-cuda-glib is generated but isn't packaged.

### What changes are included in this PR?

Package the documentation for libarrow-cuda-glib.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49098

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48764: [C++] Update xsimd (#48765)

### Rationale for this change
Homogenize the xsimd versions used.

### What changes are included in this PR?
Move to xsimd 14 to benefit from latest improvements relevant for improvements to the integer unpacking routines.

### Are these changes tested?
Yes, with the current CI.
In fact, due to the absence of a pin, part of the CI already runs xsimd 14.

### Are there any user-facing changes?
No.

* GitHub Issue: #48764

Authored-by: AntoinePrv <AntoinePrv@users.noreply.github.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-46008: [Python][Benchmarking] Remove unused asv benchmarking files (#49047)

### Rationale for this change

As discussed on the issue, we don't seem to have run asv benchmarks on Python for the last few years. The setup is probably broken.

### What changes are included in this PR?

Remove asv benchmarking related files and docs.

### Are these changes tested?

No; validated via CI and a preview-docs run to check the documentation.

### Are there any user-facing changes?

No
* GitHub Issue: #46008

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49108: [Python] SparseCOOTensor.__repr__ missing f-string prefix (#49109)

### Rationale for this change

`SparseCOOTensor.__repr__` outputs literal `{self.type}` and `{self.shape}` instead of actual values due to missing f-string prefix.

### What changes are included in this PR?

Add f prefix to the string in `SparseCOOTensor.__repr__`.

### Are these changes tested?

Yes, it works after adding the f-string prefix:
```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: float
shape: (2, 3)
```

### Are there any user-facing changes?

Yes; this fixes a bug that caused incorrect data to be produced:

```python3
>>> import pyarrow as pa
>>> import numpy as np
>>> dense_tensor = np.array([[0, 1, 0], [2, 0, 3]], dtype=np.float32)
>>> sparse_coo = pa.SparseCOOTensor.from_dense_numpy(dense_tensor)
>>> sparse_coo
<pyarrow.SparseCOOTensor>
type: {self.type}
shape: {self.shape}
```

* GitHub Issue: #49108

Authored-by: Chilin <chilin.cs07@nycu.edu.tw>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49083: [CI][Python] Remove dask-contrib/dask-expr from the nightly dask test builds (#49126)

### Rationale for this change
Failing nightly job for dask (test-conda-python-3.11-dask-upstream_devel).

### What changes are included in this PR?
Removal of dask-contrib/dask-expr package as it is included in the dask dataframe module since January 2025.

### Are these changes tested?
Yes, with the extended dask build.

### Are there any user-facing changes?
No.
* GitHub Issue: #49083

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-49117: [Ruby] Add support for writing union arrays (#49118)

### Rationale for this change

There are dense and sparse variants.

### What changes are included in this PR?

* Add `garrow_union_array_get_n_fields()`
* Add `ArrowFormat::UnionArray#children`
* Add `ArrowFormat::DenseUnionArray#each_buffer`
* Add `ArrowFormat::SparseUnionArray#each_buffer`
* Add `ArrowFormat::UnionType#to_flatbuffers`
* Add `Arrow::UnionArray#fields`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49117

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49119: [Ruby] Add support for writing map array (#49120)

### Rationale for this change

It's a list-based array.

### What changes are included in this PR?

* Add `ArrowFormat::MapType#to_flatbuffers`

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #49119

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48922: [C++] Support Status-returning callables in Result::Map (#49127)

### Rationale for this change
Currently, Result::Map fails to compile when the mapping function returns a Status because it tries to instantiate Result, which is prohibited. This change allows Map to return Status directly in such cases.

### What changes are included in this PR?
- Added EnsureResult specialization to allow Map to return Status directly.
- Added unit tests to verify success/error propagation and return type resolution.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No
* GitHub Issue: #48922

Authored-by: Abhishek Bansal <abhibansal593@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49003: [C++] Don't consider `out_of_range` an error in float parsing (#49095)

### Rationale for this change
This PR restores the behavior previous to version 23 for floating-point parsing on overflow and subnormal.

`fast_float` didn't assign an error code on overflow in version `3.10.1` and assigned `±Inf` on overflow and `0.0` on subnormal. With the update to version `8.1`, it started to assign `std::errc::result_out_of_range` in such cases. 

### What changes are included in this PR?
Ignores `std::errc::result_out_of_range` and produces `±Inf` / `0.0` as appropriate instead of failing the conversion.

### Are these changes tested?
Yes. Created tests for overflow with positive and negative signed mantissa, and also created tests for subnormal, all of them for binary{16,32,64}.

### Are there any user-facing changes?
It's a user-facing change. The CSV reader in `libarrow==23` was treating these values as strings, while before it parsed them as `0` or `±inf`.

With this patch, the CSV reader in PyArrow outputs:

```python
>>> import pyarrow
>>> import pyarrow.csv
>>> import io
>>> table = pyarrow.csv.read_csv(io.BytesIO(f"data\n10E-617\n10E617\n-10E617".encode()))
>>> print(table)
pyarrow.Table
data: double
----
data: [[0,inf,-inf]]
```

Closes #49003 

* GitHub Issue: #49003

Authored-by: Alvaro-Kothe <kothe65@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-48941: [C++] Generate proper UTF-8 strings in JSON test utilities (#48943)

### Rationale for this change

The JSON test utility `GenerateAscii` was only generating ASCII characters. We should have better test coverage for proper UTF-8 and Unicode handling.

### What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
Added that function as a utility.
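A minimal sketch of the generation strategy, in Python for illustration (the actual utility is C++): draw random code points, skip the surrogate range U+D800-U+DFFF so only Unicode scalar values remain, and let the encoder produce RFC 3629-conformant bytes.

```python
import random

def random_utf8_string(length, rng=None):
    """Generate a string of valid Unicode scalar values (no surrogates)."""
    if rng is None:
        rng = random.Random(42)
    chars = []
    while len(chars) < length:
        cp = rng.randrange(0, 0x110000)
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not scalar values
            continue
        chars.append(chr(cp))
    return "".join(chars)

s = random_utf8_string(16)
encoded = s.encode("utf-8")  # valid per RFC 3629 by construction
assert encoded.decode("utf-8") == s
```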

### Are these changes tested?

There are existing tests for JSON.

### Are there any user-facing changes?

No, test-only.
* GitHub Issue: #48941

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49067: [R] Disable GCS on macos (#49068)

### Rationale for this change
Builds that complete on CRAN

### What changes are included in this PR?
Disable GCS by default

### Are these changes tested?

### Are there any user-facing changes?
Hopefully not 

**This PR includes breaking changes to public APIs.** (If there are any
breaking changes to public APIs, please explain which changes are
breaking. If not, you can remove this.)

**This PR contains a "Critical Fix".** (If the changes fix either (a) a
security vulnerability, (b) a bug that caused incorrect or invalid data
to be produced, or (c) a bug that causes a crash (even when the API
contract is upheld), please provide explanation. If not, you can remove
this.)

* GitHub Issue: #49067

---------

Co-authored-by: Nic Crane <thisisnic@gmail.com>

* GH-49115: [CI][Packaging][Python] Update vcpkg baseline for our wheels (#49116)

### Rationale for this change

Current wheels are failing to be built due to old version of vcpkg failing with our latest main.

### What changes are included in this PR?

- Update vcpkg version.
- Update patches
- Add `perl-Time-Piece` to some images as required to build newer OpenSSL.

### Are these changes tested?

Yes on CI

### Are there any user-facing changes?

No

* GitHub Issue: #49115

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48954: [C++] Add test for null-type dictionary sorting and clarify XXX comment (#48955)

### Rationale for this change

Null-type dictionaries (e.g., `dictionary(int8(), null())`) are valid Arrow constructs supported from day one, but the sorting code had an uncertain `XXX Should this support Type::NA?` comment. We should explicitly support and test this because other functions already support this:

```python
import pyarrow as pa
import pyarrow.compute as pc

pc.array_sort_indices(pa.array([None, None, None, None], type=pa.int32()))
# [0, 1, 2, 3]
pc.array_sort_indices(pa.DictionaryArray.from_arrays(
    indices=pa.array([None, None, None, None], type=pa.int8()),
    dictionary=pa.array([], type=pa.null())
))
# [0, 1, 2, 3]
```

I believe it does not make sense to specifically disallow this in dictionaries at this point.

### What changes are included in this PR?

Added a unittest for null sorting behaviour.

### Are these changes tested?

Yes, the unittest was added.

### Are there any user-facing changes?

No.
* GitHub Issue: #48954

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-36193: [R] arm64 binaries for R  (#48574)

### Rationale for this change

Issues building on ARM

### What changes are included in this PR?

CI job and nixlibs update

### Are these changes tested?

On CI

### Are there any user-facing changes?

No

AI changes :robot:: Claude decided where to make the changes and helped debug failing builds, but I updated most of it (e.g. rstudio -> posit, choice of runners etc) 

* GitHub Issue: #36193

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-48397: [R] Update docs on how to get our libarrow builds (#48995)

### Rationale for this change

Turning off GCS on CRAN to prevent excessive build times; we need to tell people who want to work with GCS how to do that.

### What changes are included in this PR?

Update docs.

### Are these changes tested?

Will preview docs build.

### Are there any user-facing changes?

Just docs.
* GitHub Issue: #48397

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49104: [C++] Fix Segfault in SparseCSFIndex::Equals with mismatched dimensions (#49105)

### Rationale for this change

The `SparseCSFIndex::Equals` method can crash when comparing two sparse indices that have a different number of dimensions. The method iterates over the `indices()` and `indptr()` vectors of the current object and accesses the corresponding elements in the `other` object without first verifying that both objects have matching vector sizes. This can lead to out-of-bounds access and a segmentation fault when the dimension counts differ.

### What changes are included in this PR?

This change adds explicit size equality checks for the `indices()` and `indptr()` vectors at the beginning of the `SparseCSFIndex::Equals` method. If the dimensions do not match, the method now safely returns `false` instead of attempting invalid memory access.

### Are these changes tested?

Yes. The fix has been validated through targeted reproduction of the crash scenario using mismatched dimension counts, ensuring the method behaves safely and deterministically.

### Are there any user-facing changes?

No. This change improves internal safety and robustness without altering public APIs or observable user behavior.

* GitHub Issue: #49104

Lead-authored-by: Alirana2829 <alimahmoodrana00@gmail.com>
Co-authored-by: Ali Mahmood Rana <159713825+AliRana30@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* MINOR: [Docs] Add links to AI-generated code guidance (#49131)

### Rationale for this change

Add link to AI-generated code guidance - we should make sure the docs are updated before we merge this though

### What changes are included in this PR?

Add link to AI-generated code guidance

### Are these changes tested?

No

### Are there any user-facing changes?

No

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* MINOR: [R] Add new vignette to pkgdown config (#49145)

### Rationale for this change

CI failing on preview-docs; see #49141

### What changes are included in this PR?

Add the vignette created in #49068 to pkgdown config

### Are these changes tested?

I'll trigger CI

### Are there any user-facing changes?

Nah

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

* GH-49150: [Doc][CI][Python] Doctests failing on rst files due to pandas 3+ (#49088)

Fixes: #49150
See https://github.com/apache/arrow/pull/48619#issuecomment-3823269381

### Rationale for this change

Fix CI failures

### What changes are included in this PR?

Tests are made more general to allow for Pandas 2 and Pandas 3 style string types

### Are these changes tested?

By CI

### Are there any user-facing changes?

No
* GitHub Issue: #49150

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-41990: [C++] Fix AzureFileSystem compilation on Windows (#48971)

Let me preface this pull request by saying that I have not worked in C++ in quite a while. Apologies if this is missing modern idioms or is an obtuse fix.

### Rationale for this change

I encountered an issue trying to compile the AzureFileSystem backend in C++ on Windows. Searching the issue tracker, it appears this is already a [known](https://github.com/apache/arrow/issues/41990) but unresolved problem. This is an attempt to either address the issue or move the conversation forward for someone more experienced.

### What changes are included in this PR?

AzureFileSystem uses `unique_ptr` while the other cloud file system implementations rely on `shared_ptr`. Since the Impl is forward-declared in the header file but the destructor was defined inline (via `= default`), we get compilation errors with MSVC because it requires the complete type earlier than GCC/Clang do.

This change removes the defaulted definition from the header file and moves it into the .cc file where we have a complete type.

Unrelated, I've also wrapped 2 exception variables in `ARROW_UNUSED`. These are warnings treated as errors by MSVC at compile time. This was revealed in CI after resolving the issue above.

### Are these changes tested?

I've enabled building and running the test suite in GHA in 8dd62d62a9af022813e9c8662956740340a9473f. I believe a large portion of those tests may be skipped though, since Azurite isn't present from what I can see. I'm not tied to the GHA updates being included in the PR; they're currently here for demonstration purposes. I noticed the other FS implementations are also not built and tested on Windows.

One quirk of this PR is getting WIL in place to compile the Azure C++ SDK was not intuitive for me. I've placed a dummy `wilConfig.cmake` to get the Azure SDK to build, but I'd assume there's a better way to do this. I'm happy to refine the build setup if we choose to keep it.

### Are there any user-facing changes?

Nothing here should affect user-facing code beyond fixing the compilation issues. If there are concerns for things I'm missing, I'm happy to discuss those.

* GitHub Issue: #41990

Lead-authored-by: Nate Prewitt <nateprewitt@microsoft.com>
Co-authored-by: Nate Prewitt <nate.prewitt@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49138: [Packaging][Python] Remove nightly cython install from manylinux wheel dockerfile (#49139)

### Rationale for this change

We use a nightly version of Cython for the free-threaded PyArrow wheels and those builds are currently failing, see https://github.com/apache/arrow/issues/49138

### What changes are included in this PR?

The nightly Cython install is removed and Cython is installed via the [requirements file](https://github.com/apache/arrow/blob/main/python/requirements-wheel-build.txt#L2).

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* GitHub Issue: #49138

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-33459: [C++][Python] Support step >= 1 in list_slice kernel (#48769)

### Rationale for this change

Closes ARROW-18281, which has been open since 2022. The `list_slice` kernel currently rejects `start == stop`, but should return empty lists instead (following Python slicing semantics).

The implementation already handles this case correctly. When ARROW-18282 added step support, `bit_util::CeilDiv(stop - start, step)` naturally returns 0 for `start == stop`, producing empty lists. The only issue was the validation check (`start >= stop`) that prevented this from working.

### What changes are included in this PR?

- Changed validation from `start >= stop` to `start > stop` 
- Updated error message
- Added test cases

### Are these changes tested?

Yes, tests were added.

### Are there any user-facing changes?

Yes.

```python
import pyarrow.compute as pc
pc.list_slice([[1,2,3]], 0, 0)
```

Before:

```
pyarrow.lib.ArrowInvalid: `start`(0) should be greater than 0 and smaller than `stop`(0)
```

After:

```
<pyarrow.lib.ListArray object at 0x1a01b8b20>
[
  []
]
```
* GitHub Issue: #33459

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-41863: [Python][Parquet] Support lz4_raw as a compression name alias (#49135)

Closes https://github.com/apache/arrow/issues/41863

### Rationale for this change

Other tools in the parquet ecosystem distinguish between `LZ4` and `LZ4_RAW`, matching the specification: https://parquet.apache.org/docs/file-format/data-pages/compression/

`LZ4` (framing) is of course deprecated. PyArrow does not support it, and instead simplifies the user-facing API, using `LZ4` as an alias for the `LZ4_RAW` codec. 

However, PyArrow does not accept `LZ4_RAW` as a valid alias for the `LZ4_RAW` codec:

```
ArrowException: Unsupported compression: lz4_raw
```

This is a point of friction, and confusing for users who are aware of the difference.

### What changes are included in this PR?

- Adding `LZ4_RAW` to the acceptable codec names list.
- Modifying the `LZ4->LZ4_RAW` mapping to also accept `LZ4_RAW->LZ4_RAW`.
- Adding a test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, an additive change to the accepted codec names.
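
The alias mapping can be illustrated with a minimal sketch (hypothetical helper names, not PyArrow's actual internals): both `lz4` and `lz4_raw` now resolve to the `LZ4_RAW` codec.

```python
# Sketch of the codec-name normalization described above.
# _CODEC_ALIASES and normalize_codec are illustrative names only.
_CODEC_ALIASES = {"lz4": "lz4_raw", "lz4_raw": "lz4_raw"}

def normalize_codec(name: str) -> str:
    # Previously only "lz4" was accepted; "lz4_raw" raised
    # "Unsupported compression: lz4_raw".
    name = name.lower()
    return _CODEC_ALIASES.get(name, name)

print(normalize_codec("LZ4"))      # lz4_raw
print(normalize_codec("lz4_raw"))  # lz4_raw
```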

* GitHub Issue: #41863

Authored-by: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

* GH-48868: [Doc] Document security model for the Arrow formats (#48870)

### Rationale for this change

Accessing Arrow data in any of the formats can have non-trivial security implications; this is an attempt at documenting them.

### What changes are included in this PR?

Add a Security Considerations page in the Format section.

**Doc preview:** https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

### Are these changes tested?

N/A

### Are there any user-facing changes?

No.
* GitHub Issue: #48868

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

* GH-49004: [C++][FlightRPC] Run ODBC tests in workflow using `cpp_test.sh` (#49005)

### Rationale for this change
#49004 

### What changes are included in this PR?
- Run tests using `cpp_test.sh` in the ODBC job of C++ Extra CI.
Note: the `find_package(Arrow)` check in `cpp_test.sh` is disabled due to blocker GH-49050.

### Are these changes tested?

Yes, in CI.

### Are there any user-facing changes?

N/A
* GitHub Issue: #49004

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: Alina (Xi) Li <96995091+alinaliBQ@users.noreply.github.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49092: [C++][FlightRPC][CI] Nightly Packaging: Add `dev-yyyy-mm-dd` to ODBC MSI name (#49151)

### Rationale for this change
#49092

### What changes are included in this PR?

-  Add `dev-yyyy-mm-dd` to ODBC MSI name. This is a similar approach to R nightly.

Before: `Apache Arrow Flight SQL ODBC-1.0.0-win64.msi`. After: `Apache Arrow Flight SQL ODBC-1.0.0-dev-2026-02-04-win64.msi`.

### Are these changes tested?

Tested in CI. Successfully renamed file: https://github.com/apache/arrow/actions/runs/21686252848/job/62534629714?pr=49151#step:3:26

### Are there any user-facing changes?

Yes, the nightly ODBC file names will be changed as described above. 

* GitHub Issue: #49092

Authored-by: Alina (Xi) Li <alina.li@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49156: [Python] Require GIL for string comparison (#49161)

### Rationale for this change

With Cython 3.3.0.a0 this failed. After some discussion, it seems this should always have required the GIL.

### What changes are included in this PR?

Move the string comparison statement out of the `with nogil` context manager.

### Are these changes tested?

Existing CI builds pyarrow.

### Are there any user-facing changes?

No
* GitHub Issue: #49156

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com>

* GH-48575: [C++][FlightRPC] Standalone ODBC macOS CI (#48577)

### Rationale for this change
#48575

### What changes are included in this PR?
- Add a new ODBC workflow for macOS 15 (Intel) and macOS 14 (arm64).
- Add ODBC build fixes to enable building on macOS CI.

### Are these changes tested?

Tested in CI and in local macOS Intel and M1 environments.

### Are there any user-facing changes?

N/A

* GitHub Issue: #48575

Lead-authored-by: Alina (Xi) Li <alina.li@improving.com>
Co-authored-by: justing-bq <62349012+justing-bq@users.noreply.github.com>
Co-authored-by: Victor Tsang <victor.tsang@improving.com>
Co-authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Co-authored-by: vic-tsang <victor.tsang@improving.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49164: [C++] Avoid invalid if() args in cmake when arrow is a subproject (#49165)

### Rationale for this change

Ref #49164: In subproject builds, `DefineOptions.cmake` sets `ARROW_DEFINE_OPTIONS_DEFAULT` to OFF, so `ARROW_SIMD_LEVEL` is never defined. The `if()` at `cpp/src/arrow/io/CMakeLists.txt:48` uses `${ARROW_SIMD_LEVEL}` and expands to empty, leading to invalid `if()` arguments.

### What changes are included in this PR?

Use the variable name directly (no `${}`).

### Are these changes tested?

Yes.

### Are there any user-facing changes?

None.
* GitHub Issue: #49164

Authored-by: Rossi Sun <zanmato1984@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-48132: [Ruby] Add support for writing dictionary array (#49175)

### Rationale for this change

Delta dictionary message support is out of scope.

### What changes are included in this PR?

* Add `ArrowFormat::DictionaryArray#each_buffer`
* Add `ArrowFormat::DictionaryType#build_fb_type`
* Add support for dictionary message in `ArrowFormat::StreamingWriter`
* Add support for writing dictionary message blocks in the footer in `ArrowFormat::FileWriter`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #48132

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* GH-49081: [C++][Parquet] Correct variant's extension name (#49082)

### Rationale for this change

Correct the variant extension name according to Arrow's specification.

### What changes are included in this PR?

Modify the variant's hardcoded extension name.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #49081

Authored-by: Zehua Zou <zehuazou2000@gmail.com>
Signed-off-by: Gang Wu <ustcwg@gmail.com>

* GH-49102: [CI] Add type checking infrastructure and CI workflow for type annotations (#48618)

### Rationale for this change

This is the first in a series of PRs adding type annotations to pyarrow and resolving #32609.

### What changes are included in this PR?

This PR establishes infrastructure for type checking:

- Adds CI workflow for running mypy, pyright, and ty type checkers on linux, macos and windows
- Configures type checkers to validate stub files (excluding source files for now)
- Adds PEP 561 `py.typed` marker to enable type checking
- Updates wheel build scripts to include stub files in distributions
- Creates initial minimal stub directory structure
- Updates developer documentation with type checking workflow

### Are these changes tested?

No. This is mostly a CI change.

### Are there any user-facing changes?

This does not add any actual annotations (only the `py.typed` marker), so users should not be affected.
* GitHub Issue: #32609
* GitHub Issue: #49102

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Rok Mihevc <rok@mihevc.org>

* GH-49190: [C++][CI] Fix `unknown job 'odbc' error` in C++ Extra Workflow (#49192)

### Rationale for this change
See #49190

### What changes are included in this PR?

Fix the `unknown job 'odbc'` error, which was caused by a typo.

### Are these changes tested?

Tested in CI

### Are there any user-facing changes?

N/A

* GitHub Issue: #49190

Authored-by: Alina (Xi) Li <alinal@bitquilltech.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

* MINOR: [CI] Bump docker/login-action from 3.6.0 to 3.7.0 (#49191)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3.6.0 to 3.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/docker/login-action/releases">docker/login-action's releases</a>.</em></p>
<blockquote>
<h2>v3.7.0</h2>
<ul>
<li>Add <code>scope</code> input to set scopes for the authentication token by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/912">docker/login-action#912</a></li>
<li>Add support for AWS European Sovereign Cloud ECR by <a href="https://github.com/dphi"><code>@​dphi</code></a> in <a href="https://redirect.github.com/docker/login-action/pull/914">docker/login-action#914</a></li>
<li>Ensure passwords are redacted with <code>registry-auth</code> input by <a href="https://github.com/crazy-max"><code>@​crazy-max</code></a> in <a href="https://redirect…