PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data#10627
PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data#10627shanhuuang wants to merge 9 commits intoapache:masterfrom
Conversation
c0b68b2 to
e01f698
Compare
…ACKED. Only works for int32/int64 when delta_bit_width_<=32
0512004 to
058bc0e
Compare
2b3b1d1 to
69fbac2
Compare
| explicit DeltaBitPackDecoder(const ColumnDescriptor* descr, | ||
| MemoryPool* pool = ::arrow::default_memory_pool()) | ||
| : DecoderImpl(descr, Encoding::DELTA_BINARY_PACKED), pool_(pool) { | ||
| if (DType::type_num != Type::INT32 && DType::type_num != Type::INT64) { |
There was a problem hiding this comment.
I thought that DeltaBitPackDecoder object would only be created in function parquet::MakeDecoder, so no more check is needed here. It's ok to restore this code.
|
Is it possible for delta_bit_width_<=32? if so it seems like we should probably cover that case while updating? |
|
CC @zeroshade note bugs fixed in ZigZag/VLQ int, not sure if these got propagated to Go. |
|
@emkornfield I added the tests that were added here to the Go side, including the comparison of the actual bytes generated and the Golang implementation passes all the tests as currently written without any modifications needed. it appears that the Golang version doesn't have this bug (I remember coming across a bug in the ZigZag/VLQ int stuff a while back where I had to manage the signs and such to get the right output so I likely fixed it without realizing). |
OK, Thanks a lot :) |
|
Could we create a separate PR for the test files directly to the test file repo? |
|
I might be missing it but is there a test demonstrating current behavior for bit_widths >= 32? |
|
Mostly looks OK, I'm going to try to do a second more careful pass this week. Also CC @pitrou in case he wants to take a look. |
|
Thank you for your patience. |
pitrou
left a comment
There was a problem hiding this comment.
Thank you for this PR. Are you planning to support encoding in another PR?
| print(" w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k)) | ||
| print(" in += 8;".format(k)) | ||
| else: | ||
| print(" uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k)) |
There was a problem hiding this comment.
I find this implementation weird. Why does it not unpack 64 entries at a time? I'm afraid this may make writing the SIMD implementations more difficult.
There was a problem hiding this comment.
Ok, is this because DELTA_BINARY_PACKED outputs values in multiples of 32? Slightly annoying :-/
There was a problem hiding this comment.
Yes. The number of values(including padding) in a miniblock is always a multiple of 32.
cpp/src/parquet/encoding.cc
Outdated
| values_current_mini_block_ = values_per_mini_block_; | ||
| } else { | ||
| // mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in | ||
| // header |
There was a problem hiding this comment.
I don't understand this comment. last_value_ is always stored in header, right?
What do you mean? Are delta bit widths > 32 not supported? I don't see check in the code. |
shanhuuang
left a comment
There was a problem hiding this comment.
Only works for int32/int64 when delta_bit_width_<=32.
What do you mean? Are delta bit widths > 32 not supported? I don't see check in the code.
Sorry, this is the commit message when the code was submitted in the first time. delta bit widths > 32 have also been supported now in this PR.
| print(" w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k)) | ||
| print(" in += 8;".format(k)) | ||
| else: | ||
| print(" uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k)) |
There was a problem hiding this comment.
Yes. The number of values(including padding) in a miniblock is always a multiple of 32.
Yes. I'm also working on PARQUET-491 and PARQUET-492 which depend on code in this PR. And I would like to create a new PR after this one is merged. |
A boolean member named "block_initialized_" is added in class DeltaLengthByteArrayDecoder to indicate whether function "InitBlock" has been called once.
| add_definitions(-DARROW_WITH_ZSTD) | ||
| endif() | ||
|
|
||
| if(ARROW_CSV) |
There was a problem hiding this comment.
this seems out of place?
There was a problem hiding this comment.
Do you means that I should add the code in function ADD_PARQUET_TEST? Something like:
if(REL_TEST_NAME STREQUAL "arrow-test" AND ARROW_CSV)
add_definitions(-DARROW_CSV)
endif()
This one also works :)
pitrou
left a comment
There was a problem hiding this comment.
This is starting to look quite good. Can you update the parquet-testing submodule now that apache/parquet-testing#19 has been merged?
Also, here are a couple more comments.
| i += num_unpacked; | ||
| byte_offset += num_unpacked * num_bits / 8; | ||
| } else if (sizeof(T) == 8 && num_bits > 32) { | ||
| // Use unpack64 only if num_bits is larger then 32 |
There was a problem hiding this comment.
I hadn't noticed previously, but why? Is it just a performance issue?
There was a problem hiding this comment.
Yes, I think it is just a performance issue. If num_bits is no larger than 32, I guess that using unpack32 will achieve better performance with function such as unpack32_avx2/unpack32_avx512.
The result of using unpack32 or unpack64 are the same if num_bits <= 32
| add_definitions(-DARROW_WITH_ZSTD) | ||
| endif() | ||
|
|
||
| if(ARROW_CSV) |
|
@shanhuuang Do you have a user id on the Apache JIRA tracker, so that I can assign the issue to you? |
Yes, my username on JIRA is "shanhuang". Thanks! :) |
… only one value (#15124) This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in #10627 and amol-@d982bed I will add some test tonight * Closes: #15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…eading only one value (apache#15124) This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in apache#10627 and amol-@d982bed I will add some test tonight * Closes: apache#15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Only works for int32/int64 when delta_bit_width_<=32.