| Summary: | Problem with breakpoint condition calling a function in multi-threaded program | | |
|---|---|---|---|
| Product: | gdb | Reporter: | Simon Marchi <simon.marchi> |
| Component: | gdb | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW --- | | |
| Severity: | normal | CC: | aburgess, mingwei.zhang, ppluzhnikov, simark, ssbssa, tankut.baris.aktemur, tromey |
| Priority: | P2 | | |
| Version: | HEAD | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Last reconfirmed: | 2022-03-04 00:00:00 |
| Attachments: | A WIP patch | | |
Description
Simon Marchi
2022-03-03 19:40:58 UTC
Comment 1
Andrew Burgess

Wow, it's a small world. I literally just started looking at this same issue this week.

The whole "thread not marked resumed" issue is fixed by this excellent patch:

https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html

Which you know as you already posted a link to this bug to that thread.

However, there are so many other problems related to this issue.

The first thing I noticed is that run_inferior_call calls clear_proceed_status, which in all-stop mode calls clear_proceed_status_thread for each thread.

Once the above patch is merged I plan to add an assert to clear_proceed_status_thread that the thread we are clearing is not resumed and not executing.

Currently the not-executing assert will fail, but (due to the above patch being missing) the not-resumed assert will only fail sometimes.

If we ignore the clear_proceed_status issue, then with the above patch the resumed flag will be correct, and GDB will not try to start the already resumed threads as part of the inferior call.

However, after the call, as we're in all-stop mode, GDB will stop all threads.

However, if the breakpoint condition doesn't segfault, but instead just returns false, then GDB will resume the single thread that stopped for the breakpoint - leaving all the other threads stopped.

I'm currently working on the idea that when we evaluate the breakpoint condition we temporarily place GDB into non-stop mode. This would mean that, when we evaluate the b/p condition we only restart the one thread, and afterwards, we only expect the one thread to stop, but I need to do lots more testing yet - maybe this is a really bad idea.

The only other option I can think of is to somehow have the infcall code figure out that we are in all-stop mode, but some threads are already running. Then, after making the inferior call we only stop the set of threads that we started. However, this has a massive problem: how to handle new threads?

I'll clean up my current patch and post it to this bug later today in case anyone wants to try it. I'll also add your crashing function test to my working branch to make sure that is handled too.

Created attachment 14005
A WIP patch
Here's the patch I'm currently working on. This should apply to current master and resolves the issue in this bug, as well as the original issue I was working on. I've run the complete testsuite on GNU/Linux x86-64 with no regressions.
I still need to do lots more testing, especially around things like handling targets that don't support non-stop mode, and what happens if some other thread stops while we are evaluating the breakpoint condition.
But any initial thoughts are welcome.
(In reply to Andrew Burgess from comment #1)
> Wow, it's a small world. I literally just started looking at this same
> issue this week.
>
> The whole thread not marked resumed issue is fixed by this excellent patch:
>
> https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html
>
> Which you know as you already posted a link to this bug to that thread.
>
> However, there are so many other problems related to this issue.
>
> The first thing I noticed is that run_inferior_call calls
> clear_proceed_status, which in all-stop mode calls
> clear_proceed_status_thread for each thread.
>
> Once the above patch is merged I plan to add an assert to
> clear_proceed_status_thread that the thread we are clearing is not resumed
> and not executing.
>
> Currently the not-executing assert will fail, but (due to the above patch
> being missing) the not-resumed assert will only fail sometimes.
>
> If we ignore the clear_proceed_status issue, then with the above patch the
> resumed flag will be correct, and GDB will not try to start the already
> resumed threads as part of the inferior call.
>
> However, after the call, as we're in all-stop mode, GDB will stop all
> threads.
>
> However, if the breakpoint condition doesn't segfault, but instead just
> returns false, then GDB will resume the single thread that stopped for the
> breakpoint - leaving all the other threads stopped.

Yeah, the fact that the breakpoint condition function caused a segfault is just another difficulty on top. You can ignore that part.

> I'm currently working on the idea that when we evaluate the breakpoint
> condition we temporarily place GDB into non-stop mode, this would mean that,
> when we evaluate the b/p condition we only restart the one thread, and
> afterwards, we only expect the one thread to stop, but I need to do lots
> more testing yet - maybe this is a really bad idea.
>
> The only other option I can think of is to somehow have the infcall code
> figure out that we are in all-stop mode, but some threads are already
> running. Then, after making the inferior call we only stop the set of
> threads that we started. However, this has a massive problem; how to handle
> new threads?

When thinking about this, my intuition was more like the latter.

In all-stop over a non-stop target:

1. A thread hits a breakpoint, only that thread is stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, only that thread is resumed (the other threads already are)
3. When the infcall is done, only that thread is stopped
4a. If the condition is true, then GDB stops all threads
4b. If the condition is false, that thread is resumed

In all-stop over an all-stop target:

1. A thread hits a breakpoint, all threads are stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, all threads are resumed (is this what would happen if the user were to do a manual infcall?)
3. When the infcall is done, all threads are stopped
4a. If the condition is true, all threads remain stopped
4b. If the condition is false, all threads are resumed

In non-stop over a non-stop target, it looks like "all-stop-on-top-of-non-stop", except that not all threads are stopped in step 4a.

I didn't really think through what would happen to new threads, I suppose they would just keep running.

> I'll clean up my current patch and post it to this bug later today in case
> anyone wants to try it. I'll also add your crashing function test to my
> working branch to make sure that is handled too.

Thanks, that's some really quick customer service.

A highly-related patch series was this:

https://sourceware.org/pipermail/gdb-patches/2021-March/176654.html

Perhaps there are a few useful things that still apply to the current master.

> In all-stop over an all-stop target:
>
> 1. A thread hits a breakpoint, all threads are stopped while we process
> the breakpoint hit
> 2. When doing the infcall in the breakpoint condition, all threads are
> resumed (is this what would happen if the user were to do a manual infcall?)

I think GDB should act like the "scheduler-locking on" mode in this case, because if another thread has a pending event, the condition evaluation could be dismissed. This is what distinguishes an infcall in condition evaluation from a manual infcall. The series linked above introduced an `in_cond_eval` flag to make this distinction.

*** Bug 23191 has been marked as a duplicate of this bug. ***

*** Bug 28911 has been marked as a duplicate of this bug. ***

The master branch has been updated by Andrew Burgess <aburgess@sourceware.org>:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=3df7843699ff3610f89ac880685396b531d8ec1b

commit 3df7843699ff3610f89ac880685396b531d8ec1b
Author: Andrew Burgess <aburgess@redhat.com>
Date: Fri Oct 9 13:27:13 2020 +0200

    gdb: fix b/p conditions with infcalls in multi-threaded inferiors

    This commit fixes bug PR 28942, that is, creating a conditional
    breakpoint in a multi-threaded inferior, where the breakpoint
    condition includes an inferior function call.

    Currently, when a user tries to create such a breakpoint, then GDB
    will fail with:

      (gdb) break infcall-from-bp-cond-single.c:61 if (return_true ())
      Breakpoint 2 at 0x4011fa: file /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/infcall-from-bp-cond-single.c, line 61.
      (gdb) continue
      Continuing.
      [New Thread 0x7ffff7c5d700 (LWP 2460150)]
      [New Thread 0x7ffff745c700 (LWP 2460151)]
      [New Thread 0x7ffff6c5b700 (LWP 2460152)]
      [New Thread 0x7ffff645a700 (LWP 2460153)]
      [New Thread 0x7ffff5c59700 (LWP 2460154)]
      Error in testing breakpoint condition:
      Couldn't get registers: No such process.
      An error occurred while in a function called from GDB.
      Evaluation of the expression containing the function
      (return_true) will be abandoned.
      When the function is done executing, GDB will silently stop.
      Selected thread is running.
      (gdb)

    Or, in some cases, like this:

      (gdb) break infcall-from-bp-cond-simple.c:56 if (is_matching_tid (arg, 1))
      Breakpoint 2 at 0x401194: file /tmp/build/gdb/testsuite/../../../src/gdb/testsuite/gdb.threads/infcall-from-bp-cond-simple.c, line 56.
      (gdb) continue
      Continuing.
      [New Thread 0x7ffff7c5d700 (LWP 2461106)]
      [New Thread 0x7ffff745c700 (LWP 2461107)]
      ../../src.release/gdb/nat/x86-linux-dregs.c:146: internal-error: x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.
      A problem internal to GDB has been detected,
      further debugging may prove unreliable.

    The precise error depends on the exact thread state; so there's race
    conditions depending on which threads have fully started, and which
    have not. But the underlying problem is always the same; when GDB
    tries to execute the inferior function call from within the
    breakpoint condition, GDB will, incorrectly, try to resume threads
    that are already running - GDB doesn't realise that some threads
    might already be running.

    The solution proposed in this patch requires an additional member
    variable thread_info::in_cond_eval. This flag is set to true (in
    breakpoint.c) when GDB is evaluating a breakpoint condition.

    In user_visible_resume_ptid (infrun.c), when the in_cond_eval flag is
    true, then GDB will only try to resume the current thread, that is,
    the thread for which the breakpoint condition is being evaluated.
    This solves the problem of GDB trying to resume threads that are
    already running.

    The next problem is that inferior function calls are assumed to be
    synchronous, that is, GDB doesn't expect to start an inferior
    function call in thread #1, then receive a stop from thread #2 for
    some other, unrelated reason. To prevent GDB responding to an event
    from another thread, we update fetch_inferior_event and
    do_target_wait in infrun.c, so that, when an inferior function call
    (on behalf of a breakpoint condition) is in progress, we only wait
    for events from the current thread (the one evaluating the
    condition).

    In do_target_wait I had to change the inferior_matches lambda
    function, which is used to select which inferior to wait on.
    Previously the logic was this:

      auto inferior_matches = [&wait_ptid] (inferior *inf)
        {
          return (inf->process_target () != nullptr
                  && ptid_t (inf->pid).matches (wait_ptid));
        };

    This compares the pid of the inferior against the complete ptid we
    want to wait on. Before this commit wait_ptid was only ever
    minus_one_ptid (which is special, and means any process), and so
    every inferior would match.

    After this commit though wait_ptid might represent a specific thread
    in a specific inferior. If we compare the pid of the inferior to a
    specific ptid then these will not match. The fix is to compare
    against the pid extracted from the wait_ptid, not against the
    complete wait_ptid itself.

    In fetch_inferior_event, after receiving the event, we only want to
    stop all the other threads, and call inferior_event_handler with
    INF_EXEC_COMPLETE, if we are not evaluating a conditional breakpoint.
    If we are, then all the other threads should be left doing whatever
    they were before. The inferior_event_handler call will be performed
    once the breakpoint condition has finished being evaluated, and GDB
    decides to stop or not.

    The final problem that needs solving relates to GDB's commit-resume
    mechanism, which allows GDB to collect resume requests into a single
    packet in order to reduce traffic to a remote target.

    The problem is that the commit-resume mechanism will not send any
    resume requests for an inferior if there are already events pending
    on the GDB side.

    Imagine an inferior with two threads. Both threads hit a breakpoint,
    maybe the same conditional breakpoint. At this point there are two
    pending events, one for each thread.

    GDB selects one of the events and spots that this is a conditional
    breakpoint, GDB evaluates the condition.

    The condition includes an inferior function call, so GDB sets up for
    the call and resumes the one thread, the resume request is added to
    the commit-resume queue.

    When the commit-resume queue is committed GDB sees that there is a
    pending event from another thread, and so doesn't send any resume
    requests to the actual target, GDB is assuming that when we wait we
    will select the event from the other thread.

    However, as this is an inferior function call for a condition
    evaluation, we will not select the event from the other thread, we
    only care about events from the thread that is evaluating the
    condition - and the resume for this thread was never sent to the
    target.

    And so, GDB hangs, waiting for an event from a thread that was never
    fully resumed.

    To fix this issue I have added the concept of "forcing" the
    commit-resume queue. When enabling commit resume, if the force flag
    is true, then any resumes will be committed to the target, even if
    there are other threads with pending events.

    A note on authorship: this patch was based on some work done by
    Natalia Saiapova and Tankut Baris Aktemur from Intel[1]. I have made
    some changes to their work in this version.
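
To make the inferior_matches change described in the commit message easier to follow, here is a minimal, self-contained sketch of the matching rule. It is not the code from the commit: `sketch_ptid`, `any_ptid`, `inferior_matches_old` and `inferior_matches_new` are hypothetical names, and `sketch_ptid` is only a simplified stand-in for GDB's `ptid_t` (it models a pid and an lwp, nothing else). The assumption is that a filter matches if it is the "any" value, if it names a whole process with the same pid, or if it is exactly equal.

```cpp
// Simplified stand-in for GDB's ptid_t; NOT the real type (assumption).
struct sketch_ptid
{
  int pid;   // process id, -1 means "any process"
  long lwp;  // thread (LWP) id, 0 means "no specific thread"

  bool operator== (const sketch_ptid &other) const
  {
    return pid == other.pid && lwp == other.lwp;
  }

  // True when this value names a whole process rather than one thread.
  bool is_pid () const
  {
    return pid > 0 && lwp == 0;
  }

  // Does this ptid match FILTER?
  bool matches (const sketch_ptid &filter) const
  {
    if (filter == sketch_ptid {-1, 0})
      return true;                      // "any" filter matches everything
    if (filter.is_pid () && pid == filter.pid)
      return true;                      // process-wide filter, same pid
    return *this == filter;             // otherwise require exact equality
  }
};

// Hypothetical equivalent of minus_one_ptid.
constexpr sketch_ptid any_ptid {-1, 0};

// Old logic: compare the inferior's pid against the complete wait_ptid.
bool inferior_matches_old (int inf_pid, sketch_ptid wait_ptid)
{
  return sketch_ptid {inf_pid, 0}.matches (wait_ptid);
}

// New logic: compare against only the pid extracted from wait_ptid.
bool inferior_matches_new (int inf_pid, sketch_ptid wait_ptid)
{
  return sketch_ptid {inf_pid, 0}.matches (sketch_ptid {wait_ptid.pid, 0});
}
```

With a thread-specific wait_ptid such as `{100, 42}`, the old comparison rejects inferior 100 because a pid-only value is never exactly equal to a thread-specific one, while the new comparison accepts it and still rejects an unrelated inferior such as 200.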

    Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28942

    [1] https://sourceware.org/pipermail/gdb-patches/2020-October/172454.html

    Co-authored-by: Natalia Saiapova <natalia.saiapova@intel.com>
    Co-authored-by: Tankut Baris Aktemur <tankut.baris.aktemur@intel.com>
    Reviewed-By: Tankut Baris Aktemur <tankut.baris.aktemur@intel.com>
    Tested-By: Luis Machado <luis.machado@arm.com>
    Tested-By: Keith Seitz <keiths@redhat.com>

(In reply to Sourceware Commits from comment #8)
> The master branch has been updated by Andrew Burgess
> <aburgess@sourceware.org>:
>
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;
> h=3df7843699ff3610f89ac880685396b531d8ec1b
>
> commit 3df7843699ff3610f89ac880685396b531d8ec1b
> Author: Andrew Burgess <aburgess@redhat.com>
> Date: Fri Oct 9 13:27:13 2020 +0200
>
>     gdb: fix b/p conditions with infcalls in multi-threaded inferiors
>
>     This commit fixes bug PR 28942, that is, creating a conditional
>     breakpoint in a multi-threaded inferior, where the breakpoint
>     condition includes an inferior function call.

Is there still something left to do here?