This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 4/4] Remove broken posix_fallocate, posix_falllocate64 fallback code [BZ#15661]


On 05/18/2015 06:43 AM, Florian Weimer wrote:
> On 05/07/2015 09:05 PM, Florian Weimer wrote:
>> > On 05/07/2015 08:19 PM, Roland McGrath wrote:
>>>> >>> If I'm not mistaken ftruncate could still reduce the file size if it
>>>> >>> races with another operation that would extend the file. This is also
>>>> >>> a data loss bug.
>>> >>
>>> >> I concur.
>> > 
>> > It happens with length == 0.  We could error out with EINVAL instead of
>> > calling ftruncate.
>> > 
>> > Daniel Berrange pointed me to these bugs:
>> > 
>> >   https://sourceware.org/bugzilla/show_bug.cgi?id=17322
>> >   https://bugzilla.redhat.com/show_bug.cgi?id=1140250
>> >   https://bugzilla.redhat.com/show_bug.cgi?id=1077068
> Another very recent example is here:
> 
> 
> https://lists.fedorahosted.org/pipermail/elfutils-devel/2015-May/004868.html
> 
>> > This suggests that people actually rely on the current allocation
>> > behavior.  Combined with my previous analysis that applications will
>> > start to fail if we remove the fallback and return EINVAL, I now think
>> > we need to keep the allocation loop.
> This is the patch I currently have.  It fixes the avoidable bugs.  I
> still think we are in a bad situation here, that even a compatibility
> symbol cannot fix.
> 
> -- Florian Weimer / Red Hat Product Security

OK to checkin.

I believe this is the best patch.

It fixes the O_APPEND case (real bug) and makes the NFS usages of
posix_fallocate sensible.

It leaves the fallback code in place for applications that need it
and do not wish to write their own allocation loops, as all
applications would need to do.

Lastly I believe the race conditions outlined in bug 15661 are 
application bugs.

Given that posix_fallocate is not listed in POSIX section 2.9.7 
"Thread Interactions with Regular File Operations" the call is not
required to be atomic with respect to any other operation. Therefore
it is the responsibility of the "other thread" to synchronize the
file operations with the thread calling posix_fallocate to ensure
posix_fallocate has returned before it starts writing to the file.

Despite the fact that posix_fallocate does not explicitly say that
it will write to the newly allocated storage, it also doesn't say
that it won't, and thus if you are extending the file with posix_fallocate
and also writing to the file, it's safest to be conservative and
synchronize until and when POSIX clarifies that posix_fallocate does
not modify the contents (as it does in at least according to FreeBSD[1]).

[1] https://www.freebsd.org/cgi/man.cgi?query=posix_fallocate&sektion=2&manpath=FreeBSD+8.3-RELEASE
"... Any existing file data in the specified range is unmodified. ..."


> 0001-posix_fallocate-Emulation-fixes-and-documentation-BZ.patch
> 
> 
> From f25f46c0e0223d9f6e9e8ead27a37caa34f33631 Mon Sep 17 00:00:00 2001
> Message-Id: <f25f46c0e0223d9f6e9e8ead27a37caa34f33631.1431945166.git.fweimer@redhat.com>
> From: Florian Weimer <fweimer@redhat.com>
> Date: Mon, 18 May 2015 11:32:44 +0100
> Subject: [PATCH] posix_fallocate: Emulation fixes and documentation [BZ
>  #15661]
> To: libc-alpha@sourceware.org
> 
> Handle signed integer overflow correctly.  Detect and reject O_APPEND.
> Document drawbacks of emulation.
> 
> This does not completely address bug 15661, but improves the situation
> somewhat.
> ---
>  ChangeLog                         |  9 ++++
>  manual/filesys.texi               | 94 +++++++++++++++++++++++++++++++++++++++
>  sysdeps/posix/posix_fallocate.c   | 67 ++++++++++++++++++++--------
>  sysdeps/posix/posix_fallocate64.c | 67 ++++++++++++++++++++--------
>  4 files changed, 199 insertions(+), 38 deletions(-)
> 
> diff --git a/ChangeLog b/ChangeLog
> index 4de8a25..603847b 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,12 @@
> +2015-05-18  Florian Weimer  <fweimer@redhat.com>
> +
> +	[BZ #15661]
> +	* sysdeps/posix/posix_fallocate64.c (__posix_fallocate64_l64):
> +	Check for overflow properly.  Check for O_APPEND.  Ignore large
> +	file system block sizes.  Add comments about problems.
> +	* sysdeps/posix/posix_fallocate.c (posix_fallocate): Likewise.
> +	* manual/filesys.texi (Storage Allocation): New node.
> +
>  2015-05-18  Arjun Shankar  <arjun.is@lostca.se>
>  
>  	* include/stdio.h: Define __need_wint_t.
> diff --git a/manual/filesys.texi b/manual/filesys.texi
> index 7d55b43..0f2e3dc 100644
> --- a/manual/filesys.texi
> +++ b/manual/filesys.texi
> @@ -1723,6 +1723,7 @@ modify the attributes of a file.
>                                   access a file.
>  * File Times::                  About the time attributes of a file.
>  * File Size::			Manually changing the size of a file.
> +* Storage Allocation::          Allocate backing storage for files.

OK.

>  @end menu
>  
>  @node Attribute Meanings
> @@ -3233,6 +3234,99 @@ is a requirement of @code{mmap}.  The program has to keep track of the
>  real size, and when it has finished a final @code{ftruncate} call should
>  set the real size of the file.
>  
> +@node Storage Allocation
> +@subsection Storage Allocation
> +@cindex allocating file storage
> +@cindex file allocation
> +@cindex storage allocating
> +
> +@cindex file fragmentation
> +@cindex fragmentation of files
> +@cindex sparse files
> +@cindex files, sparse
> +Most file systems support allocating large files in a non-contiguous
> +fashion: the file is split into @emph{fragments} which are allocated
> +sequentially, but the fragments themselves can be scattered across the
> +disk.  File systems generally try to avoid such fragmentation because it
> +decreases performance, but if a file gradually increases in size, there
> +might be no other option than to fragment it.  In addition, many file
> +systems support @emph{sparse files} with @emph{holes}: regions of null
> +bytes for which no backing storage has been allocated by the file
> +system.  When the holes are finally overwritten with data, fragmentation
> +can occur as well.
> +
> +Explicit allocation of storage for yet-unwritten parts of the file can
> +help the system to avoid fragmentation.  Additionally, if storage
> +pre-allocation fails, it is possible to report the out-of-disk error
> +early, often without filling up the entire disk.  However, due to
> +deduplication, copy-on-write semantics, and file compression, such
> +pre-allocation may not reliably prevent the out-of-disk-space error from
> +occurring later.  Checking for write errors is still required, and
> +writes to memory-mapped regions created with @code{mmap} can still
> +result in @code{SIGBUS}.
> +
> +@deftypefun int posix_fallocate (int @var{fd}, off_t @var{offset}, off_t @var{length})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}

OK.

> +@c If the file system does not support allocation,
> +@c @code{posix_fallocate} has a race with file extension (if
> +@c @var{length} is zero) or with concurrent writes of non-NUL bytes (if
> +@c @var{length} is positive).
> +
> +Allocate backing store for the region of @var{length} bytes starting at
> +byte @var{offset} in the file for the descriptor @var{fd}.  The file
> +length is increased to @samp{@var{length} + @var{offset}} if necessary.
> +
> +@var{fd} must be a regular file opened for writing, or @code{EBADF} is
> +returned.  If there is insufficient disk space to fulfill the allocation
> +request, @code{ENOSPC} is returned.
> +
> +@strong{Note:} If @code{fallocate} is not available (because the file
> +system does not support it), @code{posix_fallocate} is emulated, which
> +has the following drawbacks:
> +
> +@itemize @bullet
> +@item
> +It is very inefficient because all file system blocks in the requested
> +range need to be examined (even if they have been allocated before) and
> +potentially rewritten.  In contrast, with proper @code{fallocate}
> +support (see below), the file system can examine the internal file
> +allocation data structures and eliminate holes directly, maybe even
> +using unwritten extents (which are pre-allocated but uninitialized on
> +disk).
> +
> +@item
> +There is a race condition if another thread or process modifies the
> +underlying file in the to-be-allocated area.  Non-null bytes could be
> +overwritten with null bytes.
> +
> +@item
> +If @var{fd} has been opened with the @code{O_APPEND} flag, the function
> +will fail with an @code{errno} value of @code{EBADF}.
> +
> +@item
> +If @var{length} is zero, @code{ftruncate} is used to increase the file
> +size as requested, without allocating file system blocks.  There is a
> +race condition which means that @code{ftruncate} can accidentally
> +truncate the file if it has been extended concurrently.
> +@end itemize
> +
> +On Linux, if an application does not benefit from emulation or if the
> +emulation is harmful due to its inherent race conditions, the
> +application can use the Linux-specific @code{fallocate} function, with a
> +zero flag argument.  For the @code{fallocate} function, @theglibc{} does
> +not perform allocation emulation if the file system does not support
> +allocation.  Instead, an @code{EOPNOTSUPP} is returned to the caller.
> +
> +@end deftypefun

OK.

> +
> +@deftypefun int posix_fallocate64 (int @var{fd}, off64_t @var{length}, off64_t @var{offset})
> +@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}

OK.

> +
> +This function is a variant of @code{posix_fallocate64} which accepts
> +64-bit file offsets on all platforms.
> +
> +@end deftypefun
> +
>  @node Making Special Files
>  @section Making Special Files
>  @cindex creating special files
> diff --git a/sysdeps/posix/posix_fallocate.c b/sysdeps/posix/posix_fallocate.c
> index d15d603..e7fe201 100644
> --- a/sysdeps/posix/posix_fallocate.c
> +++ b/sysdeps/posix/posix_fallocate.c
> @@ -18,26 +18,36 @@
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
>  #include <sys/stat.h>
>  #include <sys/statfs.h>
>  
> -/* Reserve storage for the data of the file associated with FD.  */
> +/* Reserve storage for the data of the file associated with FD.  This
> +   emulation is far from perfect, but the kernel cannot do not much
> +   better for network file systems, either.  */
>  
>  int
>  posix_fallocate (int fd, __off_t offset, __off_t len)
>  {
>    struct stat64 st;
> -  struct statfs f;
>  
> -  /* `off_t' is a signed type.  Therefore we can determine whether
> -     OFFSET + LEN is too large if it is a negative value.  */
>    if (offset < 0 || len < 0)
>      return EINVAL;
> -  if (offset + len < 0)
> +
> +  /* Perform overflow check.  The outer cast relies on a GCC
> +     extension.  */
> +  if ((__off_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
>      return EFBIG;
>  
> -  /* First thing we have to make sure is that this is really a regular
> -     file.  */
> +  /* pwrite below will not do the right thing in O_APPEND mode.  */
> +  {
> +    int flags = __fcntl (fd, F_GETFL, 0);
> +    if (flags < 0 || (flags & O_APPEND) != 0)
> +      return EBADF;
> +  }
> +
> +  /* We have to make sure that this is really a regular file.  */
>    if (__fxstat64 (_STAT_VER, fd, &st) != 0)
>      return EBADF;
>    if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
>  
>    if (len == 0)
>      {
> +      /* This is racy, but there is no good way to satisfy a
> +	 zero-length allocation request.  */
>        if (st.st_size < offset)
>  	{
>  	  int ret = __ftruncate (fd, offset);
> @@ -58,19 +70,36 @@ posix_fallocate (int fd, __off_t offset, __off_t len)
>        return 0;
>      }
>  
> -  /* We have to know the block size of the filesystem to get at least some
> -     sort of performance.  */
> -  if (__fstatfs (fd, &f) != 0)
> -    return errno;
> -
> -  /* Try to play safe.  */
> -  if (f.f_bsize == 0)
> -    f.f_bsize = 512;
> -
> -  /* Write something to every block.  */
> -  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> +  /* Minimize data transfer for network file systems, by issuing
> +     single-byte write requests spaced by the file system block size.
> +     (Most local file systems have fallocate support, so this fallback
> +     code is not used there.)  */
> +
> +  unsigned increment;
> +  {
> +    struct statfs64 f;
> +
> +    if (__fstatfs64 (fd, &f) != 0)
> +      return errno;
> +    if (f.f_bsize == 0)
> +      increment = 512;
> +    else if (f.f_bsize < 4096)
> +      increment = f.f_bsize;
> +    else
> +      /* NFS does not propagate the block size of the underlying
> +	 storage and may report a much larger value which would still
> +	 leave holes after the loop below, so we cap the increment at
> +	 4096.  */
> +      increment = 4096;

OK.

> +  }
> +
> +  /* Write a null byte to every block.  This is racy; we currently
> +     lack a better option.  Compare-and-swap against a file mapping
> +     might additional local races, but requires interposition of a
> +     signal handler to catch SIGBUS.  */
> +  for (offset += (len - 1) % increment; len > 0; offset += increment)
>      {
> -      len -= f.f_bsize;
> +      len -= increment;
>  
>        if (offset < st.st_size)
>  	{
> diff --git a/sysdeps/posix/posix_fallocate64.c b/sysdeps/posix/posix_fallocate64.c
> index b845df7..ee32679 100644
> --- a/sysdeps/posix/posix_fallocate64.c
> +++ b/sysdeps/posix/posix_fallocate64.c
> @@ -18,26 +18,36 @@
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <unistd.h>
> +#include <stdint.h>
> +#include <sys/fcntl.h>
>  #include <sys/stat.h>
>  #include <sys/statfs.h>
>  
> -/* Reserve storage for the data of the file associated with FD.  */
> +/* Reserve storage for the data of the file associated with FD.  This
> +   emulation is far from perfect, but the kernel cannot do not much
> +   better for network file systems, either.  */
>  
>  int
>  __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>  {
>    struct stat64 st;
> -  struct statfs64 f;
>  
> -  /* `off64_t' is a signed type.  Therefore we can determine whether
> -     OFFSET + LEN is too large if it is a negative value.  */
>    if (offset < 0 || len < 0)
>      return EINVAL;
> -  if (offset + len < 0)
> +
> +  /* Perform overflow check.  The outer cast relies on a GCC
> +     extension.  */
> +  if ((__off64_t) ((uint64_t) offset) + ((uint64_t) len) < 0)
>      return EFBIG;
>  
> -  /* First thing we have to make sure is that this is really a regular
> -     file.  */
> +  /* pwrite64 below will not do the right thing in O_APPEND mode.  */
> +  {
> +    int flags = __fcntl (fd, F_GETFL, 0);
> +    if (flags < 0 || (flags & O_APPEND) != 0)
> +      return EBADF;
> +  }
> +
> +  /* We have to make sure that this is really a regular file.  */
>    if (__fxstat64 (_STAT_VER, fd, &st) != 0)
>      return EBADF;
>    if (S_ISFIFO (st.st_mode))
> @@ -47,6 +57,8 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>  
>    if (len == 0)
>      {
> +      /* This is racy, but there is no good way to satisfy a
> +	 zero-length allocation request.  */
>        if (st.st_size < offset)
>  	{
>  	  int ret = __ftruncate64 (fd, offset);
> @@ -58,19 +70,36 @@ __posix_fallocate64_l64 (int fd, __off64_t offset, __off64_t len)
>        return 0;
>      }
>  
> -  /* We have to know the block size of the filesystem to get at least some
> -     sort of performance.  */
> -  if (__fstatfs64 (fd, &f) != 0)
> -    return errno;
> -
> -  /* Try to play safe.  */
> -  if (f.f_bsize == 0)
> -    f.f_bsize = 512;
> -
> -  /* Write something to every block.  */
> -  for (offset += (len - 1) % f.f_bsize; len > 0; offset += f.f_bsize)
> +  /* Minimize data transfer for network file systems, by issuing
> +     single-byte write requests spaced by the file system block size.
> +     (Most local file systems have fallocate support, so this fallback
> +     code is not used there.)  */
> +
> +  unsigned increment;
> +  {
> +    struct statfs64 f;
> +
> +    if (__fstatfs64 (fd, &f) != 0)
> +      return errno;
> +    if (f.f_bsize == 0)
> +      increment = 512;
> +    else if (f.f_bsize < 4096)
> +      increment = f.f_bsize;
> +    else
> +      /* NFS clients do not propagate the block size of the underlying
> +	 storage and may report a much larger value which would still
> +	 leave holes after the loop below, so we cap the increment at
> +	 4096.  */
> +      increment = 4096;
> +  }
> +
> +  /* Write a null byte to every block.  This is racy; we currently
> +     lack a better option.  Compare-and-swap against a file mapping
> +     might address local races, but requires interposition of a signal
> +     handler to catch SIGBUS.  */
> +  for (offset += (len - 1) % increment; len > 0; offset += increment)
>      {
> -      len -= f.f_bsize;
> +      len -= increment;
>  
>        if (offset < st.st_size)
>  	{
> -- 2.1.0

OK.

Cheers,
Carlos.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]