Add NT_SIGINFO NOTE to ELF dumps #83059
Linux Watson needs this to better triage ELF dumps. Add a CreateDumpOptions helper struct to pass all the command options around. Add the "--code", "--errno", and "--address" command line options, which the runtime passes to createdump on a crash and which are used to fill the NT_SIGINFO NOTE. Add an "ExceptionType" field to the "Parameters" section of the Linux crash report json.
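As an illustration of what filling the NT_SIGINFO NOTE involves, here is a minimal Python sketch of the note's byte layout. It is not createdump's implementation. It assumes a little-endian 64-bit Linux target where `siginfo_t` is 128 bytes with `si_addr` at offset 16, and it assumes the "--code", "--errno", and "--address" values map to `si_code`, `si_errno`, and `si_addr`. The function name `build_siginfo_note` and its `signo` parameter are hypothetical.

```python
import struct

NT_SIGINFO = 0x53494749  # note type from Linux <elf.h> (ASCII "SIGI")
SIGINFO_SIZE = 128       # sizeof(siginfo_t) on Linux (assumed 64-bit target)


def pad4(data: bytes) -> bytes:
    """Pad a note field with zero bytes up to a 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)


def build_siginfo_note(signo: int, code: int, errno: int, address: int) -> bytes:
    """Hypothetical helper: serialize one ELF note carrying a siginfo_t."""
    # Assumed layout: si_signo, si_errno, si_code, 4 bytes of padding,
    # then the fault address (si_addr) at offset 16.
    siginfo = struct.pack("<iiiiQ", signo, errno, code, 0, address)
    siginfo = siginfo.ljust(SIGINFO_SIZE, b"\0")
    name = b"CORE\0"  # owner name conventionally used for core file notes
    # Note header: namesz, descsz, type (each a 32-bit word).
    header = struct.pack("<III", len(name), len(siginfo), NT_SIGINFO)
    return header + pad4(name) + pad4(siginfo)


# Example: SIGSEGV (11) with si_code 1 (SEGV_MAPERR) at address 0.
note = build_siginfo_note(signo=11, code=1, errno=0, address=0)
```

The resulting bytes would sit alongside the other notes in the core file's PT_NOTE segment.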
Tagging subscribers to this area: @tommcdon
Issue Details
Customer Impact
Linux Watson needs this to better triage ELF dumps. 1st party teams have asked for this. Add a CreateDumpOptions helper struct to pass all the command options around. Add the "--code", "--errno", and "--address" command line options, which the runtime passes to createdump on a crash and which are used to fill the NT_SIGINFO NOTE. Add an "ExceptionType" field to the "Parameters" section of the Linux crash report json.
Testing
All the SOS diagnostics tests pass with these changes.
Risk
Low. Createdump/core generation only.
jeffschwMSFT
left a comment
approved. we will take for consideration in 7.0.x. please get a code review
Approved by Tactics.
Customer Impact
Linux Watson needs this to better triage ELF dumps. 1st party teams have asked for this.
Issue: #40958
This change updates createdump so that windbg/Watson can determine which thread actually crashed (via the .lastevent command). The NT_SIGINFO record has been missing from Linux core dumps, causing the wrong thread (the startup thread) to be blamed for the crash. This breaks Watson bucketing.
The underlying issue is that we do not put enough data in Linux core dumps: the “crashing thread” isn’t marked as the one of interest. Without this information, the debugger assumes that the 0th thread (usually the startup thread) is the guilty party.
When an automated debugging service such as Watson/!analyze comes along, it cannot properly triage the bug. Instead of blaming the correct thread (with the correct exception), it tries to blame the non-crashing “crashing thread” (usually just the main function doing nothing, sitting in a wait call). In effect, this renders all of our Azure Watson bucketing for all of our partner teams and customers useless.
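To show the consumer side of this, the following Python sketch walks the raw bytes of a PT_NOTE segment and extracts the signal details from an NT_SIGINFO record if one is present. It does not reflect how windbg or Watson actually read dumps. It assumes little-endian notes with 4-byte alignment and a 64-bit `siginfo_t` with `si_addr` at offset 16. The names `find_siginfo` and `sample_notes` are hypothetical.

```python
import struct

NT_PRSTATUS = 1          # per-thread status note type from <elf.h>
NT_SIGINFO = 0x53494749  # siginfo note type from <elf.h>


def align4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def find_siginfo(notes: bytes):
    """Hypothetical helper: return the signal fields of the first
    CORE/NT_SIGINFO note in the buffer, or None if there is none."""
    offset = 0
    while offset + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from("<III", notes, offset)
        name_start = offset + 12
        desc_start = name_start + align4(namesz)
        if desc_start + descsz > len(notes):
            break  # truncated or malformed note; stop scanning
        name = notes[name_start:name_start + namesz]
        if ntype == NT_SIGINFO and name == b"CORE\0" and descsz >= 24:
            signo, errno, code = struct.unpack_from("<iii", notes, desc_start)
            (address,) = struct.unpack_from("<Q", notes, desc_start + 16)
            return {"signo": signo, "errno": errno, "code": code, "address": address}
        offset = desc_start + align4(descsz)
    return None


# Synthetic input: a dummy 8-byte NT_PRSTATUS note followed by an
# NT_SIGINFO note for SIGSEGV (11), si_code 1, fault address 0x10.
siginfo = struct.pack("<iiiiQ", 11, 0, 1, 0, 0x10).ljust(128, b"\0")
sample_notes = (
    struct.pack("<III", 5, 8, NT_PRSTATUS) + b"CORE\0\0\0\0" + b"\0" * 8
    + struct.pack("<III", 5, 128, NT_SIGINFO) + b"CORE\0\0\0\0" + siginfo
)
result = find_siginfo(sample_notes)
```

A dump without the record makes `find_siginfo` return `None`, which corresponds to the situation described above where the consumer has no signal information to go on.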
Added "ExceptionType" field to "Parameters" section of the Linux crash report json.
Testing
All the SOS diagnostics tests pass with these changes.
Risk
Low. Createdump/core generation only.