-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48741: [C++] Fix deadlock in CSV AsyncThreadedTableReader destructor #48742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
932821c to
cdf530b
Compare
|
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
|
Why is the deadlock only happening when early error? Also, is it possible to construct a dedicated test case? |
Because otherwise the task group is already finished (
Hmm, that should hopefully be possible. |
I see, thanks.
Yeah, let's have one please. |
zanmato1984
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix LGTM, only that a test might be needed.
a8c352d to
12a8dea
Compare
|
@github-actions crossbow submit -g cpp |
|
@zanmato1984 I added a test and also had to strengthen object ownership to avoid another regression, could you take another look? |
|
Revision: 12a8dea Submitted crossbow builds: ursacomputing/crossbow @ actions-7019b48426 |
| return Status::OK(); | ||
| }; | ||
|
|
||
| return VisitAsyncGenerator(std::move(block_generator), block_visitor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I have to double check the deadlock reasoning.
Seems that self is well captured by the returned future. So the deadlock happens only when: 1) early error happens and FinishAsync() is skipped; 2) the returned future is marked finished (by error) and released in its owning thread (the "main" thread), self ref count decreased but not to 0 (because of); 3) some task in the task group is still running, and the task holds a strong ref to self captured in block_visitor (so far the task group, tasks, and the self it owned are fully detached from the main thread?); 4) as the remaining tasks finish, self ref count goes to 0 and invokes the AsyncThreadedTableReader dtor, which in turn drains the task group in which the dtor is being invoked - deadlock?
If my understanding is correct, it doesn't seem that we have ownership issues:
- If no error, then the main thread owns the future, which owns
self, until all tasks are finished. - If early error, then the tasks own
self. selfowns all the things: options, task group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as the remaining tasks finish,
selfref count goes to0and invokes theAsyncThreadedTableReaderdtor, which in turn drains the task group in which the dtor is being invoked - deadlock?
Exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding is correct, it doesn't seem that we have ownership issues:
- If no error, then the main thread owns the future, which owns
self, until all tasks are finished.- If early error, then the tasks own
self.selfowns all the things: options, task group.
Please also see my this question :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what is the this question? I'm probably dense, but I can't find anywhere in the comments page :-S
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, don't know what's going on. Let me ask this: Given that my assumption of the cause of deadlock is correct, it doesn't seem to me that we have ownership issues:
- If no error, then the main thread owns the future, which owns
self, until all tasks are finished. - If early error, then the tasks own
self. - In either case,
selfowns all the things: options, task group.
So why do we have to make all the ownership related changes in the rest of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I may have an answer to my last question.
- If early error, then the tasks own
self.
self will be destructed in the last task, by when the task_group must be still alive. This explains the changes to the task_group_ ownership.
But why the changes to the options_ ownership? It doesn't have to out-live self, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why the changes to the
options_ownership? It doesn't have to out-liveself, does it?
It does, because the ColumnBuilder can schedule its own tasks without owning the table reader:
arrow/cpp/src/arrow/csv/column_builder.cc
Lines 259 to 261 in fe5e0e5
| void InferringColumnBuilder::ScheduleConvertChunk(int64_t chunk_index) { | |
| task_group_->Append([this, chunk_index]() { return TryConvertChunk(chunk_index); }); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. Thank you.
12a8dea to
530942d
Compare
zanmato1984
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 with some nits.
| protected: | ||
| // Make column builders from conversion schema | ||
| Status MakeColumnBuilders() { | ||
| // This is making a single copy of ConvertOptions, which is reasonable even if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is n shared_ptr VS. n copies of ConvertOptions? (Which makes sense.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right.
| return Status::OK(); | ||
| }; | ||
|
|
||
| return VisitAsyncGenerator(std::move(block_generator), block_visitor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. Thank you.
530942d to
87316ce
Compare
|
I'm going to merge as CI is green. Thank you for the reviews @zanmato1984 ! |
Rationale for this change
When reading a CSV file encounters an early error, and the
AsyncThreadedTableReaderis only kept alive through the chain of futures and their callbacks, the~AsyncThreadedTableReaderdestructor will be executed at the end of aTaskGrouptask and try to wait for theTaskGroupitself for finish. This will obviously deadlock.This issue was discovered by OSS-Fuzz in https://issues.oss-fuzz.com/issues/467451924
What changes are included in this PR?
Instead of waiting for the
TaskGroupto finish in theAsyncThreadedTableReader, make sure that all async callbacks involved in CSV reading own their captured variables, to avoid use-after-free problems.This has the side effect of keeping the
AsyncThreadedTableReaderalive until all relevant async callbacks have executed.Are these changes tested?
Yes, by existing tests and a new fuzz regression test.
Are there any user-facing changes?
No.