GH-15052: [C++][Parquet] Fix DELTA_BINARY_PACKED decoder when reading only one value#15124
GH-15052: [C++][Parquet] Fix DELTA_BINARY_PACKED decoder when reading only one value#15124pitrou merged 12 commits intoapache:masterfrom
Conversation
|
|
|
|
|
|
| buffer[i++] = last_value_; | ||
| if (ARROW_PREDICT_FALSE(i == max_values)) break; | ||
| if (ARROW_PREDICT_FALSE(i == max_values)) { | ||
| if (i != static_cast<int>(total_value_count_)) { | ||
| InitBlock(); | ||
| } | ||
| break; | ||
| } |
There was a problem hiding this comment.
I find this difficult to read, could you add a comment about what this logic is doing?
There was a problem hiding this comment.
To be honest, the logic is a bit trickey here, I will try to explain it. I'm bad in English, so maybe you can edit it directly if you can explain it better.
31f7558 to
0abb863
Compare
|
@tachyonwill I think this patch might fix |
Change the comments for fixing Co-authored-by: Rok Mihevc <rok@mihevc.org>
|
Retrigger CI because I think the fail is not caused by me, but still it looks like a bug: https://github.com/apache/arrow/actions/runs/3808810467/jobs/6479699758 By the way, is there any method that I can retrigger the CI? |
642d779 to
a019574
Compare
|
@pitrou Mind take a look here? |
In a few days certainly :-) |
Once your first PR is merged CI will start automatically. |
|
oops, I got a Seems that maybe we should rerun it? |
No worries, this is unrelated to his PR. |
So how can I organize that code? I didn't make it clear. Is it like: case 1template <typename Type>
class TestEncodingBase : public ::testing::Test {
public:
...
private:
std::vector<int> small_read_batch_size_;
};and TYPED_TEST(TestDeltaBitPackEncoding, SmallBatchRoundTrip) {
// .. set draws_ 1, 3
ASSERT_NO_FATAL_FAILURE(this->Execute(1, 1));
ASSERT_NO_FATAL_FAILURE(this->Execute(1, 2));
ASSERT_NO_FATAL_FAILURE(this->Execute(2, 2));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(10, 10));
ASSERT_NO_FATAL_FAILURE(this->ExecuteSteps(1, 10));
}Or case 2template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
public:
using c_type = typename Type::c_type;
static constexpr int TYPE = Type::type_num;
static constexpr size_t ROUND_TRIP_TIMES = 3;
void InitReadBatchSizedData(...);
};Can you help me make it clear? @pitrou |
|
You could have something like: const std::vector<int> kBatchSizes = {...};
void CheckRoundtrip() override {
auto encoder =
MakeTypedEncoder<Type>(Encoding::DELTA_BINARY_PACKED, false, descr_.get());
auto decoder = MakeTypedDecoder<Type>(Encoding::DELTA_BINARY_PACKED, descr_.get());
for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
for (const auto batch_size : kBatchSizes} {
... |
|
Seems that would causing our testing run a long time, should I make template <typename Type>
class TestDeltaBitPackEncoding : public TestEncodingBase<Type> {
public:
using c_type = typename Type::c_type;
static constexpr int TYPE = Type::type_num;
static constexpr size_t ROUND_TRIP_TIMES = 3;
void InitReadBatchSizedData(...);
};
void CheckRoundtrip() override {
...
auto read_batch_sizes = extra_read_batch_sizes;
read_batch_sizes.push_back(num_values_);
for (size_t i = 0; i < ROUND_TRIP_TIMES; ++i) {
for (int read_batch_size: read_batch_sizes) {
encoder->Put(draws_, num_values_);
encode_buffer_ = encoder->FlushValues();
decoder->SetData(num_values_, encode_buffer_->data(),
static_cast<int>(encode_buffer_->size()));
int values_decoded_sum = 0;
while (values_decoded_sum < num_values_) {
int values_decoded =
decoder->Decode(decode_buf_ + values_decoded_sum, read_batch_size);
values_decoded_sum += values_decoded;
}
ASSERT_EQ(num_values_, values_decoded_sum);
ASSERT_NO_FATAL_FAILURE(VerifyResults<c_type>(decode_buf_, draws_, num_values_));
}
}
}and I'm arguing this, because when I set |
|
@pitrou comment fixed and merged your patch which could make test faster. Mind take a look? |
|
Benchmark runs are scheduled for baseline = ec9a8a3 and contender = 53d73f8. 53d73f8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
…eading only one value (apache#15124) This patch trying to fix apache#15052 . The problem is mentioned here: apache#15052 (comment) When read 1 value, DeltaBitPackDecoder will not call `InitBlock`, causing it always read `last_value_`. Seems the problem is introduced in apache#10627 and amol-@d982bed I will add some test tonight * Closes: apache#15052 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Rok Mihevc <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This patch trying to fix #15052 . The problem is mentioned here: #15052 (comment)
When read 1 value, DeltaBitPackDecoder will not call
InitBlock, causing it always readlast_value_.Seems the problem is introduced in #10627 and amol-@d982bed
I will add some test tonight