This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.


Re: collecting data from a coring process

From: Dmitry Samersoff <dms at samersoff dot net>
To: Paul Marquess <Paul dot Marquess at owmobility dot com>, "gdb at sourceware dot org" <gdb at sourceware dot org>
Date: Thu, 8 Sep 2016 16:14:34 +0300
Subject: Re: collecting data from a coring process
Authentication-results: sourceware.org; auth=none
References: <CY1PR0501MB11783F479AF7D639A82FE02F95EC0@CY1PR0501MB1178.namprd05.prod.outlook.com> <CAKhyrx_9GnLTBDKkhW_y4QG+f3xV_SL-Vtg0WN+vU6UXnY-qLA@mail.gmail.com> <CY1PR0501MB1178A955FBE2AAAE65655EAB95EC0@CY1PR0501MB1178.namprd05.prod.outlook.com> <87b59611-f5d1-628d-fd41-85ce6c6eb50b@samersoff.net> <CY1PR0501MB117800AACB41115C303EB9D495E60@CY1PR0501MB1178.namprd05.prod.outlook.com>

Paul,

> Thanks, will take a look at that. When you say "more or less safely",
> I'm reading that as saying there will be issues with it.  :-)

I don't know of a way to do anything with a crashing process with 100%
reliability. Not even a core dump. Custom code in a signal handler doesn't
make the situation worse.

In complicated apps the crash is quite often the result of something
that happened long before the crash point. E.g. when you see memory
corruption, you are typically interested in where the memory was
corrupted, not in where the app hit the corrupted memory.

So a signal handler that knows the application's data structures and can
print meaningful information is quite useful and saves a lot of debugging
time.

Also, it might be necessary to free some resources before the process
starts dumping core, to allow a faster restart.

> Trouble is I soon will not allow a core file to be written -- the
> process is reaching a size where I cannot allow it to be out of
> action for the amount of time it takes to write that to disk.

One possible solution is to add a keep-alive protocol between child and
parent (e.g. the child keeps touching a file on disk or sending UDP
packets). If the keep-alive doesn't arrive in time, the parent considers
the child dead, sends it an abort, and starts a new process.

This solution also covers the situation where a child process hangs or
deadlocks.
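
A rough sketch of the parent side, assuming Linux/POSIX. For brevity it
uses a pipe as the heartbeat channel instead of a touched file or UDP
packets; the function names, the two demo workers and the timeouts are
all made up for illustration:

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Demo worker: sends three heartbeats, then exits normally. */
static void healthy_worker(int hb_fd)
{
    for (int i = 0; i < 3; i++) {
        (void)write(hb_fd, "x", 1);
        sleep_ms(50);
    }
}

/* Demo worker: sends one heartbeat, then hangs forever. */
static void hung_worker(int hb_fd)
{
    (void)write(hb_fd, "x", 1);
    for (;;)
        pause();
}

/* Fork a worker; the child gets the write end of the heartbeat pipe,
 * the parent gets the read end in *hb_read_fd. Returns the child pid or -1. */
static pid_t spawn_worker(void (*worker)(int hb_fd), int *hb_read_fd)
{
    int p[2];
    pid_t pid;

    if (pipe(p) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        worker(p[1]);
        _exit(0);
    }
    close(p[1]);
    *hb_read_fd = p[0];
    return pid;
}

/* Watch the heartbeat channel until the worker is gone.
 * Returns 1 if the worker missed a heartbeat and was sent SIGABRT,
 * 0 if it closed the channel by exiting on its own, -1 on poll error.
 * The wait status of the child is stored in *status. */
static int watch_worker(pid_t pid, int hb_read_fd, int timeout_ms, int *status)
{
    struct pollfd pfd = { .fd = hb_read_fd, .events = POLLIN };
    char buf[64];
    int result;

    for (;;) {
        int r = poll(&pfd, 1, timeout_ms);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            result = -1;
            break;
        }
        if (r == 0) {                 /* no heartbeat in time: consider the child dead */
            kill(pid, SIGABRT);
            result = 1;
            break;
        }
        if (read(hb_read_fd, buf, sizeof buf) <= 0) {
            result = 0;               /* EOF: the worker exited by itself */
            break;
        }
    }
    close(hb_read_fd);
    waitpid(pid, status, 0);
    return result;
}
```

A supervisor would call spawn_worker() and watch_worker() in a loop,
starting a fresh worker each time watch_worker() returns. Whether the
SIGABRT leaves a core file behind depends on the core size limit in
effect for the child.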

-Dmitry

On 2016-09-05 14:09, Paul Marquess wrote:
> From: Dmitry Samersoff [mailto:dms@samersoff.net]
> 
>> Paul,
>> 
>>>> 1) Why not dump the information that you are looking for into a
>>>> file in the process signal handler ?
>>> 
>>> Would love to, but I have no idea what state the process is in
>>> once the SEGV has been triggered.
>> 
>> If you use altstack and avoid malloc you can dump bunch of
>> information from the signal handler more or less safely.
>> 
>> e.g.
>> 
>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/tip/src/share/vm/utilities/vmError.cpp
>
>> 
> Thanks, will take a look at that. When you say "more or less safely",
> I'm reading that as saying there will be issues with it.  :-)
> 
> I know we've had problems with signal handlers causing problems, thus
> my preference to find a way to have the signal handler code do as
> little as possible and get all the data collection handled at arm's
> length by gdb.
> 
>>>>> My first thought was to add a script in 
>>>>> /proc/sys/kernel/core_pattern to catch the process as it is
>>>>> coring. Then I get gdb to attach to the PID of the process
>>>>> that is about to core. Unfortunately, when I tried that, gdb
>>>>> gives me this error
>> 
>> One of possible solution is:
>> 
>> 1. Change /proc/sys/kernel/core_pattern to have all coredumps from
>> your app in a separate directory, something like
>> /var/dumps/%e/core.%p
>> 
>> 2. Have a cron job that looks over this directory and run gdb <exe
>> image name> <core_name> < gdb_script > core.%p.out on demand.
> 
> That is exactly what I'm doing at the moment. Trouble is I soon will
> not allow a core file to be written -- the process is reaching a size
> where I cannot allow it to be out of action for the amount of time it
> takes to write that to disk.
> 
> Paul
> 
> P.S. Sorry for the delay in following up. Had no internet access for
> about 10 days.
> 

-- 
Dmitry Samersoff
Saint Petersburg, Russia, http://devnull.samersoff.net
* There will come soft rains  ...
