Bug 28942

Summary: Problem with breakpoint condition calling a function in multi-threaded program
Product: gdb
Component: gdb
Version: HEAD
Status: NEW
Severity: normal
Priority: P2
Target Milestone: ---
Reporter: Simon Marchi <simon.marchi>
Assignee: Not yet assigned to anyone <unassigned>
CC: aburgess, mingwei.zhang, ppluzhnikov, simark, tankut.baris.aktemur, tromey
Host:
Target:
Build:
Last reconfirmed: 2022-03-04 00:00:00
Attachments: A WIP patch

Description Simon Marchi 2022-03-03 19:40:58 UTC
This program:

---8<---
#include <pthread.h>
#include <unistd.h>

static void
function_that_segfaults (void)
{
  int *p = 0;
  *p = 1;
}

static void
break_here (void)
{}

static void *
thread_func (void *p)
{
  for (;;)
    sleep (1);
  return NULL;
}

static void *
thread_func2 (void *p)
{
  sleep (1);
  break_here ();
  return NULL;
}

int
main (void)
{
  pthread_t threads[10];
  pthread_create (&threads[0], NULL, thread_func, NULL);
  pthread_create (&threads[1], NULL, thread_func, NULL);
  pthread_create (&threads[2], NULL, thread_func, NULL);
  pthread_create (&threads[3], NULL, thread_func, NULL);
  pthread_create (&threads[5], NULL, thread_func, NULL);
  pthread_create (&threads[6], NULL, thread_func, NULL);
  pthread_create (&threads[4], NULL, thread_func2, NULL);
  sleep (60);
  return function_that_segfaults != 0;
}

--->8---


$ gcc test.c  -g3 -O0 -pthread
$ ./gdb -q -nx --data-directory=data-directory a.out -ex "b break_here if function_that_segfaults()"
Reading symbols from a.out...
Breakpoint 1 at 0x11ae: file test.c, line 13.
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb/gdb/a.out 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7d99700 (LWP 3567019)]
[New Thread 0x7ffff7598700 (LWP 3567020)]
[New Thread 0x7ffff6d97700 (LWP 3567021)]
[New Thread 0x7ffff6596700 (LWP 3567022)]
[New Thread 0x7ffff5d95700 (LWP 3567023)]
[New Thread 0x7ffff5594700 (LWP 3567024)]
[New Thread 0x7ffff4d93700 (LWP 3567025)]
Error in testing breakpoint condition:
Couldn't get registers: No such process.
An error occurred while in a function called from GDB.
Evaluation of the expression containing the function
(function_that_segfaults) will be abandoned.
When the function is done executing, GDB will silently stop.
Selected thread is running.
(gdb) 

The "Couldn't get registers: No such process." is very strange.  We expect GDB to say that the thread received a signal (SIGSEGV) while running the hand-called function.

And then if you continue with:

(gdb) kill
Kill the program being debugged? (y or n) y
[Inferior 1 (process 3567034) killed]
(gdb) r
Starting program: /home/smarchi/build/binutils-gdb/gdb/a.out
/home/smarchi/src/binutils-gdb/gdb/target.c:2607: internal-error: target_wait: Assertion `!proc_target->commit_resumed_state' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

Looking at the proceed call here:

(top-gdb) bt
#0  proceed (addr=0x555555555189, siggnal=GDB_SIGNAL_0) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:3046
#1  0x0000558e5d95a128 in run_inferior_call (sm=std::unique_ptr<call_thread_fsm> = {...}, call_thread=0x61700009e680, real_pc=0x555555555189) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:610
#2  0x0000558e5d95ff6e in call_function_by_hand_dummy (function=0x611000489d00, default_return_type=0x0, args=..., dummy_dtor=0x0, dummy_dtor_data=0x0) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:1279
#3  0x0000558e5d95b4be in call_function_by_hand (function=0x611000489d00, default_return_type=0x0, args=...) at /home/smarchi/src/binutils-gdb/gdb/infcall.c:741
#4  0x0000558e5d609a2e in evaluate_subexp_do_call (exp=0x6030001579f0, noside=EVAL_NORMAL, callee=0x611000489d00, argvec=..., function_name=0x0, default_return_type=0x0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:674
#5  0x0000558e5d60a7c5 in expr::operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, function_name=0x0, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:702
#6  0x0000558e5c4090aa in expr::operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/expression.h:136
#7  0x0000558e5d60ad63 in expr::var_value_operation::evaluate_funcall (this=0x603000157ab0, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL, args=std::__debug::vector of length 0, capacity 0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:714
#8  0x0000558e5cb8d2be in expr::funcall_operation::evaluate (this=0x607000083f80, expect_type=0x0, exp=0x6030001579f0, noside=EVAL_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/expop.h:2178
#9  0x0000558e5d604e00 in expression::evaluate (During symbol reading: Child DIE 0x8d876c and its abstract origin 0x8f9b2b have different parents
sthis=0x6030001579f0, expect_type=0x0, noside=EVAL_NORMAL) at /home/smarchi/src/binutils-gdb/gdb/eval.c:101
#10 0x0000558e5d604f71 in evaluate_expression (exp=0x6030001579f0, expect_type=0x0) at /home/smarchi/src/binutils-gdb/gdb/eval.c:115
#11 0x0000558e5c8c99b9 in breakpoint_cond_eval (exp=0x6030001579f0) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:4739
#12 0x0000558e5c8d1f11 in bpstat_check_breakpoint_conditions (bs=0x6060001b29c0, thread=0x61700009e680) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:5303
#13 0x0000558e5c8d4b45 in bpstat_stop_status (aspace=0x603000045a00, bp_addr=0x5555555551ae, thread=0x61700009e680, ws=..., stop_chain=0x6060001b29c0) at /home/smarchi/src/binutils-gdb/gdb/breakpoint.c:5475
#14 0x0000558e5da1f939 in handle_signal_stop (ecs=0x7fff97a4bd50) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:6200
#15 0x0000558e5da19441 in handle_inferior_event (ecs=0x7fff97a4bd50) at /home/smarchi/src/binutils-gdb/gdb/infrun.c:5690
#16 0x0000558e5da05206 in fetch_inferior_event () at /home/smarchi/src/binutils-gdb/gdb/infrun.c:4091
#17 0x0000558e5d94fad4 in inferior_event_handler (event_type=INF_REG_EVENT) at /home/smarchi/src/binutils-gdb/gdb/inf-loop.c:41
#18 0x0000558e5dc29bdd in handle_target_event (error=0, client_data=0x0) at /home/smarchi/src/binutils-gdb/gdb/linux-nat.c:4096
#19 0x0000558e5f4e4dd1 in handle_file_event (file_ptr=0x607000016050, ready_mask=1) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:574
#20 0x0000558e5f4e562c in gdb_wait_for_event (block=0) at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:700
#21 0x0000558e5f4e343c in gdb_do_one_event () at /home/smarchi/src/binutils-gdb/gdbsupport/event-loop.cc:212
#22 0x0000558e5dd29d99 in start_event_loop () at /home/smarchi/src/binutils-gdb/gdb/main.c:421
#23 0x0000558e5dd2a1df in captured_command_loop () at /home/smarchi/src/binutils-gdb/gdb/main.c:481
#24 0x0000558e5dd2fad9 in captured_main (data=0x7fff97a4c200) at /home/smarchi/src/binutils-gdb/gdb/main.c:1348
#25 0x0000558e5dd2fbc2 in gdb_main (args=0x7fff97a4c200) at /home/smarchi/src/binutils-gdb/gdb/main.c:1363
#26 0x0000558e5c3e1ddd in main (argc=7, argv=0x7fff97a4c378) at /home/smarchi/src/binutils-gdb/gdb/gdb.c:32


We find that GDB tries to resume threads other than the event thread (the one for which we evaluate the breakpoint condition), because it thinks they are not resumed.  This is probably because the linux-nat target added them in the non-resumed state and they stayed that way.
Comment 1 Andrew Burgess 2022-03-04 11:15:27 UTC
Wow, it's a small world.  I literally just started looking at this same issue this week.

The whole thread not marked resumed issue is fixed by this excellent patch:

  https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html

Which you know as you already posted a link to this bug to that thread.

However, there are so many other problems related to this issue.

The first thing I noticed is that run_inferior_call calls clear_proceed_status, which in all-stop mode calls clear_proceed_status_thread for each thread.

Once the above patch is merged I plan to add an assert to clear_proceed_status_thread that the thread we are clearing is not resumed and not executing.

Currently the not-executing assert will fail, but (due to the above patch being missing) the not-resumed assert will only fail sometimes.
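
To make the proposed invariant concrete, here is a toy model in C.  It is only a sketch and not GDB code: `toy_thread`, `toy_may_clear_proceed_status` and `toy_clear_proceed_status_thread` are hypothetical stand-ins, and `proceed_to_finish` is used merely as a placeholder for the per-thread state that gets reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names mimic, but are not, GDB's thread API.  */
struct toy_thread
{
  bool resumed;            /* Thread is considered resumed.  */
  bool executing;          /* Thread is running on the target.  */
  bool proceed_to_finish;  /* Placeholder for state that gets cleared.  */
};

/* Return true if T satisfies the proposed invariant: it is neither
   resumed nor executing.  */
bool
toy_may_clear_proceed_status (const struct toy_thread *t)
{
  return !t->resumed && !t->executing;
}

/* Clear T's proceed state, asserting the invariant first.  */
void
toy_clear_proceed_status_thread (struct toy_thread *t)
{
  assert (toy_may_clear_proceed_status (t));
  t->proceed_to_finish = false;
}
```

A thread that is executing but was never marked resumed (the situation described above) fails the predicate through the not-executing half only, which is why the not-resumed assert alone would not catch it reliably.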

If we ignore the clear_proceed_status issue, then with the above patch the resumed flag will be correct, and GDB will not try to start the already resumed threads as part of the inferior call.

However, after the call, as we're in all-stop mode, GDB will stop all threads.

However, if the breakpoint condition doesn't segfault, but instead just returns false, then GDB will resume the single thread that stopped for the breakpoint - leaving all the other threads stopped.

I'm currently working on the idea that when we evaluate the breakpoint condition we temporarily place GDB into non-stop mode.  This would mean that, when we evaluate the b/p condition, we only restart the one thread and, afterwards, we only expect the one thread to stop.  I need to do lots more testing yet, though - maybe this is a really bad idea.

The only other option I can think of is to somehow have the infcall code figure out that we are in all-stop mode, but some threads are already running.  Then, after making the inferior call we only stop the set of threads that we started.  However, this has a massive problem; how to handle new threads?

I'll clean up my current patch and post it to this bug later today in case anyone wants to try it.  I'll also add your crashing function test to my working branch to make sure that is handled too.
Comment 2 Andrew Burgess 2022-03-04 14:01:08 UTC
Created attachment 14005 [details]
A WIP patch

Here's the patch I'm currently working on.  This should apply to current master and resolves the issue in this bug, as well as the original issue I was working on.  I've run the complete testsuite on GNU/Linux x86-64 with no regressions.

I still need to do lots more testing, especially around things like handling targets that don't support non-stop mode, and what happens if some other thread stops while we are evaluating the breakpoint condition.

But any initial thoughts are welcome.
Comment 3 Simon Marchi 2022-03-04 14:44:08 UTC
(In reply to Andrew Burgess from comment #1)
> Wow, it's a small world.  I literally just started looking at this same
> issue this week.
> 
> The whole thread not marked resumed issue is fixed by this excellent patch:
> 
>   https://sourceware.org/pipermail/gdb-patches/2022-January/185109.html
> 
> Which you know as you already posted a link to this bug to that thread.
> 
> However, there are so many other problems related to this issue.
> 
> The first thing I noticed is that run_inferior_call calls
> clear_proceed_status, which in all-stop mode calls
> clear_proceed_status_thread for each thread.
> 
> Once the above patch is merged I plan to add an assert to
> clear_proceed_status_thread that the thread we are clearing is not resumed
> and not executing.
> 
> Currently the not-executing assert will fail, but (due to the above patch
> being missing) the not-resumed assert will only fail sometimes.
> 
> If we ignore the clear_proceed_status issue, then with the above patch the
> resumed flag will be correct, and GDB will not try to start the already
> resumed threads as part of the inferior call.
> 
> However, after the call, as we're in all-stop mode, GDB will stop all
> threads.
> 
> However, if the breakpoint condition doesn't segfault, but instead just
> returns false, then GDB will resume the single thread that stopped for the
> breakpoint - leaving all the other threads stopped.

Yeah, the fact that the breakpoint condition function caused a segfault is just another difficulty on top.  You can ignore that part.

> I'm currently working on the idea that when we evaluate the breakpoint
> condition we temporarily place GDB into non-stop mode, this would mean that,
> when we evaluate the b/p condition we only restart the one thread, and
> afterwards, we only expect the one thread to stop, but I need to do lots
> more testing yet - maybe this is a really bad idea.
> 
> The only other option I can think of is to somehow have the infcall code
> figure out that we are in all-stop mode, but some threads are already
> running.  Then, after making the inferior call we only stop the set of
> threads that we started.  However, this has a massive problem; how to handle
> new threads?

When thinking about this, my intuition was more like the latter.

In all-stop over a non-stop target:

1. A thread hits a breakpoint, only that thread is stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, only that thread is resumed (the other threads already are)
3. When the infcall is done, only that thread is stopped
4a. If the condition is true, then GDB stops all threads
4b. If the condition is false, that thread is resumed

In all-stop over an all-stop target:

1. A thread hits a breakpoint, all threads are stopped while we process the breakpoint hit
2. When doing the infcall in the breakpoint condition, all threads are resumed (is this what would happen if the user were to do a manual infcall?)
3. When the infcall is done, all threads are stopped
4a. If the condition is true, all threads remain stopped
4b. If the condition is false, all threads are resumed

Non-stop over a non-stop target looks like "all-stop-on-top-of-non-stop", except that not all threads are stopped in step 4a.

I didn't really think through what would happen to new threads, I suppose they would just keep running.
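
The three cases above can be written down as a small state simulation.  This is only a sketch of the intended behaviour, not GDB code: every name in it (`toy_mode`, `toy_conditional_breakpoint_hit`, and so on) is hypothetical, all threads are assumed to be running before the breakpoint hit, and threads created during the infcall are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only.  */
enum toy_mode
{
  TOY_ALL_STOP_ON_NON_STOP,  /* All-stop over a non-stop target.  */
  TOY_ALL_STOP_ON_ALL_STOP,  /* All-stop over an all-stop target.  */
  TOY_NON_STOP_ON_NON_STOP   /* Non-stop over a non-stop target.  */
};

/* Set every entry of RUNNING[0..N-1] to VALUE.  */
void
toy_set_all (bool running[], size_t n, bool value)
{
  for (size_t i = 0; i < n; i++)
    running[i] = value;
}

/* Stop or resume the threads affected by an event on thread EVENT:
   every thread on an all-stop target, otherwise only EVENT itself.  */
void
toy_set_affected (enum toy_mode mode, bool running[], size_t n,
                  size_t event, bool value)
{
  if (mode == TOY_ALL_STOP_ON_ALL_STOP)
    toy_set_all (running, n, value);
  else
    running[event] = value;
}

/* Walk through steps 1 to 4 for a conditional breakpoint hit by thread
   EVENT whose condition evaluates to COND.  On return, RUNNING[i] says
   whether thread i is left running.  */
void
toy_conditional_breakpoint_hit (enum toy_mode mode, bool running[],
                                size_t n, size_t event, bool cond)
{
  toy_set_all (running, n, true);                      /* Before the hit.  */
  toy_set_affected (mode, running, n, event, false);   /* Step 1.  */
  toy_set_affected (mode, running, n, event, true);    /* Step 2.  */
  toy_set_affected (mode, running, n, event, false);   /* Step 3.  */

  if (!cond)
    toy_set_affected (mode, running, n, event, true);  /* Step 4b.  */
  else if (mode != TOY_NON_STOP_ON_NON_STOP)
    toy_set_all (running, n, false);                   /* Step 4a.  */
  /* Non-stop, step 4a: only the event thread stays stopped.  */
}

/* Count the threads that are still running.  */
size_t
toy_count_running (const bool running[], size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (running[i])
      count++;
  return count;
}
```

With three threads and a false condition, every mode ends with all three threads running.  With a true condition, both all-stop modes end with none running, while non-stop leaves only the event thread stopped.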

> 
> I'll clean up my current patch and post it to this bug later today in case
> anyone wants to try it.  I'll also add your crashing function test to my
> working branch to make sure that is handled too.

Thanks, that's some really quick customer service.
Comment 4 Baris Aktemur 2022-03-07 07:34:57 UTC
A highly-related patch series was this:

  https://sourceware.org/pipermail/gdb-patches/2021-March/176654.html

Perhaps there are a few useful things that still apply to the current master.

> In all-stop over an all-stop target:
>
> 1. A thread hits a breakpoint, all threads are stopped while we process
> the breakpoint hit
> 2. When doing the infcall in the breakpoint condition, all threads are
> resumed (is this what would happen if the user were to do a manual infcall?)

I think GDB should act like the "scheduler-locking on" mode in this case,
because if another thread has a pending event, the condition evaluation
could be dismissed.  This is what distinguishes an infcall in condition
evaluation from a manual infcall.  The series linked above introduced an
`in_cond_eval` flag to make this distinction.
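
As a rough illustration of that distinction, the decision could look like the toy C function below.  This is a sketch under assumptions, not code from the series: only the `in_cond_eval` name comes from it, while `toy_threads_to_resume` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: mark in RESUME[0..N-1] which threads an infcall
   made by thread CALLER would resume, and return how many that is.
   During condition evaluation only CALLER runs, as if scheduler-locking
   were on; for a manual infcall every thread is resumed.  */
size_t
toy_threads_to_resume (bool in_cond_eval, size_t n, size_t caller,
                       bool resume[])
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    {
      resume[i] = in_cond_eval ? (i == caller) : true;
      if (resume[i])
        count++;
    }

  return count;
}
```

With four threads and the caller at index 2, the function marks one thread during condition evaluation and all four for a manual infcall.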
Comment 6 Tom Tromey 2022-10-21 17:57:30 UTC
*** Bug 23191 has been marked as a duplicate of this bug. ***
Comment 7 Tom Tromey 2022-10-21 17:58:28 UTC
*** Bug 28911 has been marked as a duplicate of this bug. ***