This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH 4/4] Remove broken posix_fallocate, posix_fallocate64 fallback code [BZ#15661]


On 05/06/2015 03:19 AM, Florian Weimer wrote:
> On 05/05/2015 10:28 PM, Carlos O'Donell wrote:
>> On 04/24/2015 08:53 AM, Florian Weimer wrote:
>>> The previous implementation could result in silent data corruption,
>>> and this has been observed to happen with application code.
>>
>> In principle I agree with the removal of all of the fallback fallocate
>> code, it simply can't work reliably, and a reliable solution is ridiculously
>> expensive (see Rich's comments in the BZ about CAS over all the mmap'd pages).
> 
> It's also not covered by the memory model, I think.
> 
>> The bug with O_APPEND files is real, and yet another reason to remove the
>> fallback code.
> 
> We should handle that better at the very least.
> 
> We could clear O_APPEND, but only in single-threaded mode; I don't think
> it's worth the effort.  Re-opening the descriptor through /proc/self/fd
> does not work because closing that descriptor would release POSIX
> advisory locks.

I do not think we need to do that, and I agree with some of your comments
below.
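
For reference, the O_APPEND problem can be reproduced outside of glibc.
As I understand it, it comes from the documented Linux behaviour that
pwrite on a descriptor opened with O_APPEND appends to the file and
ignores the offset argument, so a pwrite-based write loop ends up growing
the file. The sketch below only demonstrates that kernel behaviour; the
helper name is made up for illustration and this is not the glibc code.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo helper (not part of glibc): create PATH with
   O_APPEND, write four bytes, then pwrite one byte at offset 0.
   Returns the resulting file size, or -1 on error.  POSIX expects the
   size to stay at 4; on Linux the pwrite is appended, giving 5.  */
static off_t
size_after_pwrite_with_append (const char *path)
{
  int fd = open (path, O_RDWR | O_CREAT | O_TRUNC | O_APPEND, 0600);
  if (fd < 0)
    return -1;

  off_t size = -1;
  struct stat st;
  if (write (fd, "abcd", 4) == 4
      && pwrite (fd, "X", 1, 0) == 1   /* Offset 0 is ignored on Linux.  */
      && fstat (fd, &st) == 0)
    size = st.st_size;

  close (fd);
  unlink (path);
  return size;
}
```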

Keep in mind that we need only ensure that subsequent writes succeed
and that the file is the right length on the filesystem. This in my mind
means we need only call `ftruncate` successfully.
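
A rough sketch of what such an ftruncate-only fallback could look like,
for the case where the kernel's fallocate has already reported that the
file system does not support the operation. The function name and the
exact error choices are my assumptions, not existing glibc code, and
`__builtin_add_overflow` is a GCC/Clang builtin used here only to keep
the example short.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fallback (not glibc code): make sure the file is at
   least OFFSET + LEN bytes long without writing any data.  Returns 0
   or an errno value, in the style of posix_fallocate.  No blocks are
   reserved, so a later write can still fail with ENOSPC.  */
static int
extend_only_fallback (int fd, off_t offset, off_t len)
{
  off_t end;
  struct stat st;

  if (offset < 0 || len <= 0)
    return EINVAL;
  if (__builtin_add_overflow (offset, len, &end))
    return EFBIG;

  if (fstat (fd, &st) != 0)
    return errno;
  if (st.st_size >= end)
    return 0;                   /* Long enough already; never shrink.  */

  /* There is still a window between fstat and ftruncate in which
     another writer could extend the file past END.  */
  return ftruncate (fd, end) == 0 ? 0 : errno;
}
```

Since nothing is written, O_APPEND no longer matters and existing data
is never touched; the remaining race is limited to a concurrent size
change between the two calls.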

>> What worries me though is that this change could break existing systems
>> that relied on this emulation to do something sensible for filesystems
>> that don't support fallocate. These binaries could easily be single threaded
>> systems with no other process touching their files and writing to filesystems
>> that don't support fallocate. If that is a sensible class of users, then we
>> need to version the interface, with the old version continuing to call the
>> fallback code and the new version not calling the fallback code.
> 
> After sleeping over your comments, I actually did my homework.  The gist
> is that we cannot remove fallback, I think not even with the
> compatibility symbol.
> 
> Various file systems do not support fallocate.  This includes NFS, where
> even the most recent version makes it optional to implement in the server.

OK.

> SQLite ignores the posix_fallocate return value, but MariaDB does not.
> A recompiled MariaDB would suddenly start to fail, and the DBA would
> have to disable pre-allocation in the configuration.  If I read the
> source correctly, systemd-journald will stop logging, and there is no
> knob to turn off fallocate.  Same for libvirt, it will fail to create
> backing files for storage devices.

OK.

> Both MariaDB and libvirt are often run on NFS storage, so a glibc change
> which removes fallback would actually affect them.  For the code we
> ship, we can move the fallback to the applications, but there is no good
> way to make sure that happens with third-party applications.  I do not
> believe the compatibility symbol mechanism is a good alternative because
> the breakage will be file-system-dependent and may not be noticed during
> testing.  (I'm generally skeptical of using compatibility symbols this way.)

That is a difference of opinion, but I buy your analysis. Despite our best
efforts with compatibility symbols, the NFS use case would remain, and users
would see failures everywhere after a recompilation. It would not be prudent
of us to do this, and it is exactly what I worried about.
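
For the code we do control, moving the fallback into the application
could be as small as the sketch below: treat "not supported" as a soft
failure and carry on without pre-allocation. The helper name and the
`preallocated` out-parameter are hypothetical; EOPNOTSUPP is the error
I would expect from a C library that does not emulate the call.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical application-side helper: try to pre-allocate, but do
   not treat a file system without fallocate support as fatal.
   Returns 0 or an errno value; *PREALLOCATED tells the caller whether
   posix_fallocate itself reported success.  */
static int
preallocate_best_effort (int fd, off_t offset, off_t len,
                         bool *preallocated)
{
  int err = posix_fallocate (fd, offset, len);

  *preallocated = (err == 0);
  if (err == EOPNOTSUPP)
    return 0;           /* Not supported here: continue without it.  */
  return err;           /* 0, or a real error such as EBADF, ENOSPC.  */
}
```

The problem described above remains, of course: this only helps for
applications that someone actually changes.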

> Maybe we could remove the write loop and perform only an ftruncate call
> which (hopefully) increases the file size.  This would take care of the
> O_APPEND issue and remove most of the races.  Using posix_fallocate to
> avoid ENOSPC later would not work, but with thin provisioning,
> deduplicating storage and compression going around these days, I don't
> think writing zero blocks has that effect in practice anyway
> (particularly not on NFS).  I'll ask around.

I agree. I was thinking exactly the same thing when I saw the write loop.
Unfortunately, only fallocate implemented at the kernel fs layer is going
to guarantee that you never see ENOSPC in all reasonable situations.

Cheers,
Carlos.

