PARQUET-492: [C++][Parquet] Basic support for reading DELTA_BYTE_ARRAY data #10978
shanhuuang wants to merge 12 commits into apache:master
Conversation
…YTE_ARRAY and DELTA_BYTE_ARRAY. TODO: read corrupted files written with bug (PARQUET-246)
pitrou left a comment
Thank you very much. I haven't read everything in detail but here are some things that stand out.
cpp/src/parquet/encoding.cc (outdated)
```cpp
prefix_len_offset_ = 0;
num_valid_values_ = num_prefix;
suffix_decoder_.SetData(num_values, decoder_);
```
Here as well, it's weird to have the same BitReader shared by the two decoders.
The spec says:
This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
This means you should ideally compute the position of the encoded suffixes and then call suffix_decoder_.SetData with the right (data, len) values.
I think we cannot get the end position of the DELTA_BINARY_PACKED data unless we've read and initialized all the blocks distributed in that data space, because in DELTA_BINARY_PACKED encoding the bit width of each miniblock is stored inside the blocks rather than in the header.
Do you mean that a method like DeltaBitPackDecoder::InitAllBlocks should be added? In this method we could initialize all the blocks of the DELTA_BINARY_PACKED data, compute the total byte count as the return value, and finally reset decoder_ for the follow-up DeltaBitPackDecoder::Decode calls.
I don't know how it should be done exactly, but as I said Decode(T* buffer, int max_values) can be called incrementally (and you're taking care to support that already in the DeltaBitPackDecoder). You can compute the total bytes upfront in Init, or you can do it lazily...
Ideally, we would have unit tests for incremental decoding...
Or as an alternative, you can raise an NYI when max_values is not equal to total_value_count_.
> Here as well, it's weird to have the same `BitReader` shared by the two decoders. The spec says:
> This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
> This means you should ideally compute the position of the encoded suffixes and then call `suffix_decoder_.SetData` with the right `(data, len)` values.
I agree that this approach would make the code in DeltaByteArrayDecoder more straightforward. However, the current implementation also works when max_values is not equal to num_valid_values_. I added a unit test for incremental decoding and some notes :)
Maybe I should raise an NYI to compute the start position of the encoded suffixes upfront?
I'm not sure I understand your question. Did you check that your incremental test passes with a smaller max_values?
Yes. I printed max_values and num_valid_values_ in DeltaByteArrayDecoder::GetInternal during a local test: max_values is close to 100, while num_valid_values_ starts out close to 1000.
pitrou left a comment
Thanks for the update again.
- Can you look at the following comments?
- Can you rebase on a more recent git master?
```cpp
suffix_decoder_.SetDecoder(num_values, decoder_);
// TODO: read corrupted files written with bug (PARQUET-246). last_value_ should be set
// to last_value_in_previous_page_ when decoding a new page (except the first page)
```
Do you intend to work on this? If so, can you open a new JIRA for it?
Yes, I'm willing to solve this problem. But I'm not sure how to solve it. I opened a new JIRA here
@shanhuuang Sorry, I had missed that you had requested another review! I'll try to do that soon.
pitrou left a comment
LGTM. Sorry for the delay and thank you very much for working on this!
I'll wait for CI now.

Benchmark runs are scheduled for baseline = b97965a and contender = 00c94e0. 00c94e0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Can the list of supported codecs at https://arrow.apache.org/docs/cpp/parquet.html#encodings be updated?

It is on git master, but unfortunately only docs for the released versions are published online.

OK, thanks @pitrou . Will all three DELTA_ codecs be in the next release?

I have no idea if @shanhuuang intends to implement

Got it, thanks again.