This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: cant access to files more than 128 utf-8 symbol long names


On Dec 10 11:15, Nikolay Ilychev wrote:
> Hello!
> 
> When using Cygwin, I can't list, copy, or remove files and directories
> whose names are 128 UTF-8 symbols long.
> 
> Useless examples that illustrate the problem:
> [...]
> Same problem with other tools (find, perl, rsync) from the Cygwin repo.
> 
> Please make MAX_PATH 260 UTF-8 symbols rather than 260 bytes.

Easier said than done.

First of all, this is NOT about MAX_PATH.  MAX_PATH (260 chars) is the
number of characters allowed in the Win32 ANSI file API for a complete
path, including the terminating null.  Cygwin is using the native NT API
and, occasionally, the Win32 UNICODE file API, which allows paths of up
to 32767 chars.

The problem here is about NAME_MAX.  NAME_MAX is per POSIX[1] the
"maximum number of bytes in a filename (not including the terminating
null)."
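
As a rough illustration (a Python sketch, not Cygwin code), the limit can
be queried per filesystem through pathconf.  The helper name `name_max`
is hypothetical; `os.pathconf` is only available on Unix-like platforms,
and the value it reports is a byte count, not a character count.

```python
import os

def name_max(path="."):
    """Return NAME_MAX (in bytes) for the filesystem holding `path`,
    or None if the platform cannot report it."""
    if not hasattr(os, "pathconf"):
        return None  # os.pathconf does not exist on this platform
    try:
        return os.pathconf(path, "PC_NAME_MAX")
    except (OSError, ValueError):
        # OSError: the query failed; ValueError: name not known here
        return None

print("NAME_MAX for the current directory:", name_max("."))
```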

Note the word *bytes*.  Not characters, bytes. UTF-8 chars are 1 to 4
bytes in length.  Thus, the maximum number of UTF-8 chars in a filename
is potentially less than NAME_MAX:

A filename made up only of chars from the basic latin charset (1 byte
each in UTF-8) may consist of NAME_MAX characters; a filename made up
solely of chars from the latin-1 supplement (2 bytes each) may consist
of NAME_MAX / 2 characters; and a filename made up only of emoticons
(4 bytes each) may consist of NAME_MAX / 4 chars.
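
The arithmetic can be checked with a short Python sketch.  NAME_MAX is
taken as 255, the value discussed in this thread; the helper names are
hypothetical.  The Cyrillic sample is an assumption, since the original
examples are elided, but it would explain why a name of 128 two-byte
chars (256 bytes) is already too long.

```python
NAME_MAX = 255  # bytes, the value discussed in this thread

def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def max_chars(ch: str, limit: int = NAME_MAX) -> int:
    """How many repetitions of `ch` fit into `limit` bytes."""
    return limit // utf8_len(ch)

samples = {
    "basic latin": "a",                    # 1 byte  -> 255 chars
    "latin-1 supplement": "\u00e9",        # 2 bytes -> 127 chars
    "cyrillic (assumed case)": "\u0436",   # 2 bytes -> 127 chars
    "emoticon": "\U0001F600",              # 4 bytes -> 63 chars
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s) per char, "
          f"at most {max_chars(ch)} chars")

# A name of 128 two-byte chars needs 256 bytes, one more than NAME_MAX.
print(utf8_len("\u0436" * 128))
```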

Ok, so we all know that Windows does not use a byte representation of
filenames; rather, the OS uses UTF-16 to store and handle filenames
internally.  A filename on a Windows filesystem may consist of up to
255 UTF-16 chars[2].

How do you represent this in a byte-oriented POSIX system?  What do you
set NAME_MAX to?  You can't get it right due to the unfortunate multibyte
vs. UTF-16 encoding issue.
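
The mismatch can be made concrete by measuring the same name both ways.
This is only a sketch: the two limits are the figures from this mail,
the helper names are hypothetical, and the Windows limit is assumed to
count UTF-16 code units.

```python
WINDOWS_NAME_UNITS = 255  # UTF-16 code units (assumed unit of the limit)
POSIX_NAME_MAX = 255      # bytes

def utf16_units(name: str) -> int:
    """Length of `name` in UTF-16 code units."""
    return len(name.encode("utf-16-le")) // 2

def utf8_bytes(name: str) -> int:
    """Length of `name` in UTF-8 bytes."""
    return len(name.encode("utf-8"))

def fits_windows(name: str) -> bool:
    return utf16_units(name) <= WINDOWS_NAME_UNITS

def fits_posix(name: str) -> bool:
    return utf8_bytes(name) <= POSIX_NAME_MAX

ascii_name = "a" * 255          # 255 units, 255 bytes: fits both
cyrillic_name = "\u0436" * 200  # 200 units, 400 bytes: Windows only

for name in (ascii_name, cyrillic_name):
    print(utf16_units(name), utf8_bytes(name),
          fits_windows(name), fits_posix(name))
```

No single byte limit makes both checks agree for every name, which is
the dilemma described above.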

To cover all UTF-8 chars at up to 4 bytes each, NAME_MAX would have to
be 1020 (255 * 4).  But then, applications relying on NAME_MAX will be
surprised by ENAMETOOLONG errors for perfectly valid POSIX filenames.
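
The exact worst case depends on what the 255 counts.  The Python sketch
below assumes the Windows limit is in UTF-16 code units; the helper name
is hypothetical.  Under that assumption a 4-byte UTF-8 char uses two
code units, so the densest names are built from 3-byte chars and need
765 bytes, while 255 * 4 = 1020 is the conservative bound that budgets
4 bytes for every unit.  Either figure is far above 255.

```python
UTF16_LIMIT = 255  # UTF-16 code units per filename (assumed unit)

def bytes_per_unit(ch: str) -> float:
    """UTF-8 bytes needed per UTF-16 code unit for a single char."""
    units = len(ch.encode("utf-16-le")) // 2
    return len(ch.encode("utf-8")) / units

# One sample char per UTF-8 length class.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
densest = max(samples, key=bytes_per_unit)  # the 3-byte euro sign

# Longest name the assumed Windows limit allows, built from that char.
worst_name = densest * UTF16_LIMIT
worst_bytes = len(worst_name.encode("utf-8"))  # 255 * 3 = 765

conservative_bound = UTF16_LIMIT * 4  # 1020, 4 bytes for every unit

print(worst_bytes, conservative_bound)
```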

If you make it 255, applications will be surprised by ENAMETOOLONG
errors for perfectly valid Windows filenames.

If you make it 255 on the application level but then return filenames
longer than 255 bytes to the application, applications may crash due to
buffer overflow issues.  After all, NAME_MAX is a contractual
obligation.

There was also the backward compatibility issue.  Back in the pre-Cygwin
1.7 days, when Cygwin used the ANSI file API, NAME_MAX was already 255.
Changing that to a bigger value might have resulted in the
aforementioned application crashes due to buffer overflows as well.

So we decided to keep NAME_MAX at the same value as it always was, 255.
This restricts the actual filename length when using multibyte
characters just as on any other POSIX system with the downside that,
occasionally, a Windows filename will be too long to handle.

Sorry if that is frustrating in your current situation, but this
isn't something we can just change on a whim.  It would break
compatibility with all existing Cygwin executables.


Corinna


[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/limits.h.html
[2] However, this does *not* cover NFS or other filesystems using a
    byte representation for storing filenames.


-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


