Skip to content

Comments

PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data#10627

Closed
shanhuuang wants to merge 9 commits intoapache:masterfrom
shanhuuang:ARROW-13206
Closed

PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data#10627
shanhuuang wants to merge 9 commits intoapache:masterfrom
shanhuuang:ARROW-13206

Conversation

@shanhuuang
Copy link
Contributor

@shanhuuang shanhuuang commented Jun 30, 2021

Only works for int32/int64 when delta_bit_width_<=32.

  1. Update the implement of DeltaBitPackDecoder
  2. Revise the code of putting/getting ZigZag int and the unit test
  3. Add the implement of putting/getting ZigZag int64 and unit test

@github-actions
Copy link

…ACKED. Only works for int32/int64 when delta_bit_width_<=32
@shanhuuang shanhuuang force-pushed the ARROW-13206 branch 6 times, most recently from 0512004 to 058bc0e Compare July 5, 2021 07:52
@jorisvandenbossche jorisvandenbossche changed the title ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_P… ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED Jul 5, 2021
@shanhuuang shanhuuang force-pushed the ARROW-13206 branch 2 times, most recently from 2b3b1d1 to 69fbac2 Compare July 6, 2021 13:34
explicit DeltaBitPackDecoder(const ColumnDescriptor* descr,
MemoryPool* pool = ::arrow::default_memory_pool())
: DecoderImpl(descr, Encoding::DELTA_BINARY_PACKED), pool_(pool) {
if (DType::type_num != Type::INT32 && DType::type_num != Type::INT64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that DeltaBitPackDecoder object would only be created in function parquet::MakeDecoder, so no more check is needed here. It's ok to restore this code.

@emkornfield emkornfield changed the title ARROW-13206: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED Jul 6, 2021
@github-actions
Copy link

github-actions bot commented Jul 6, 2021

@emkornfield
Copy link
Contributor

Is it possible for delta_bit_width_<=32? if so it seems like we should probably cover that case while updating?

@emkornfield
Copy link
Contributor

CC @zeroshade note bugs fixed in ZigZag/VLQ int, not sure if these got propagated to Go.

@zeroshade
Copy link
Member

@emkornfield I added the tests that were added here to the Go side, including the comparison of the actual bytes generated and the Golang implementation passes all the tests as currently written without any modifications needed. it appears that the Golang version doesn't have this bug (I remember coming across a bug in the ZigZag/VLQ int stuff a while back where I had to manage the signs and such to get the right output so I likely fixed it without realizing).

@shanhuuang
Copy link
Contributor Author

sorry I have had less time then I would have liked recently for Arrow reviews, will try to get to this soon.

OK, Thanks a lot :)

@emkornfield
Copy link
Contributor

Could we create a separate PR for the test files directly to the test file repo?

@emkornfield
Copy link
Contributor

I might be missing it but is there a test demonstrating current behavior for bit_widths >= 32?

@emkornfield
Copy link
Contributor

Mostly looks OK, I'm going to try to do a second more careful pass this week. Also CC @pitrou in case he wants to take a look.

@emkornfield
Copy link
Contributor

Thank you for your patience.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. Are you planning to support encoding in another PR?

print(" w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
print(" in += 8;".format(k))
else:
print(" uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this implementation weird. Why does it not unpack 64 entries at a time? I'm afraid this may make writing the SIMD implementations more difficult.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, is this because DELTA_BINARY_PACKED outputs values in multiples of 32? Slightly annoying :-/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The number of values(including padding) in a miniblock is always a multiple of 32.

values_current_mini_block_ = values_per_mini_block_;
} else {
// mini_block_idx_ > mini_blocks_per_block_ only if last_value_ is stored in
// header
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. last_value_ is always stored in header, right?

@pitrou pitrou changed the title PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED PARQUET-490: [C++][Parquet] Basic support for reading DELTA_BINARY_PACKED data Aug 17, 2021
@pitrou
Copy link
Member

pitrou commented Aug 17, 2021

Only works for int32/int64 when delta_bit_width_<=32.

What do you mean? Are delta bit widths > 32 not supported? I don't see check in the code.

Copy link
Contributor Author

@shanhuuang shanhuuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only works for int32/int64 when delta_bit_width_<=32.

What do you mean? Are delta bit widths > 32 not supported? I don't see check in the code.

Sorry, this is the commit message when the code was submitted in the first time. delta bit widths > 32 have also been supported now in this PR.

print(" w{0} = arrow::BitUtil::FromLittleEndian(w{0});".format(k))
print(" in += 8;".format(k))
else:
print(" uint64_t w{0} = util::SafeLoadAs<uint32_t>(in);".format(k))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The number of values(including padding) in a miniblock is always a multiple of 32.

@shanhuuang
Copy link
Contributor Author

Thank you for this PR. Are you planning to support encoding in another PR?

Yes. I'm also working on PARQUET-491 and PARQUET-492 which depend on code in this PR. And I would like to create a new PR after this one is merged.

A boolean member named "block_initialized_" is added in class DeltaLengthByteArrayDecoder to indicate whether function "InitBlock" has been called once.
add_definitions(-DARROW_WITH_ZSTD)
endif()

if(ARROW_CSV)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems out of place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you means that I should add the code in function ADD_PARQUET_TEST? Something like:

  if(REL_TEST_NAME STREQUAL "arrow-test" AND ARROW_CSV)
    add_definitions(-DARROW_CSV)
  endif()

This one also works :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ok to me currently.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look quite good. Can you update the parquet-testing submodule now that apache/parquet-testing#19 has been merged?

Also, here are a couple more comments.

i += num_unpacked;
byte_offset += num_unpacked * num_bits / 8;
} else if (sizeof(T) == 8 && num_bits > 32) {
// Use unpack64 only if num_bits is larger then 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't noticed previously, but why? Is it just a performance issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is just a performance issue. If num_bits is no larger than 32, I guess that using unpack32 will achieve better performance with function such as unpack32_avx2/unpack32_avx512.
The result of using unpack32 or unpack64 are the same if num_bits <= 32

add_definitions(-DARROW_WITH_ZSTD)
endif()

if(ARROW_CSV)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ok to me currently.

@pitrou pitrou closed this in f406b53 Aug 19, 2021
@pitrou
Copy link
Member

pitrou commented Aug 19, 2021

@shanhuuang Do you have a user id on the Apache JIRA tracker, so that I can assign the issue to you?

@shanhuuang
Copy link
Contributor Author

@shanhuuang Do you have a user id on the Apache JIRA tracker, so that I can assign the issue to you?

Yes, my username on JIRA is "shanhuang". Thanks! :)

@shanhuuang shanhuuang deleted the ARROW-13206 branch August 20, 2021 02:02
pitrou added a commit that referenced this pull request Jan 4, 2023
… only one value (#15124)

This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in #10627 and amol-@d982bed

I will add some test tonight
* Closes: #15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…eading only one value (apache#15124)

This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment)

When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`.

Seems the problem is introduced in apache#10627 and amol-@d982bed

I will add some test tonight
* Closes: apache#15052

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants