This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Fw: File name too long problem -- maybe fix coming?


Thank you so much for posting this. It's really nice to get some idea of what has already been discussed and of the proposed directions.

That said, my knowledge of everything that has gone before me is lacking, so I hope you'll cut me some slack in my comments.

> From: Corinna Vinschen corinna-cygwin@cygwin.com
>
> My idea what should be done in Cygwin goes roughly like this:
>
> 1. POSIX paths should be handled in the current codepage as before.
> Potentially this is a multibyte codepage like UTF-8. Make sure
> that we handle multibyte paths correctly.
> 
> TBD: Always use UTF-8? What about existing installations with
> symlinks/mount points using arbitrary codepages?

Since Cygwin bills itself as "A DLL (cygwin1.dll) which acts as a Linux API emulation layer providing substantial Linux API functionality", it would be great if there were a way to support UTF-8. A config file, an environment variable, the LC_CTYPE env var: anything that would allow Cygwin to treat the 8-bit strings passed in and out of the APIs as UTF-8. This would solve a variety of problems, all of which come down to that same goal of being "a Linux API emulation layer providing substantial Linux API functionality".

Obviously no one wants to see anyone's current installation break. At the same time, AFAIK, many current Linux distributions run in UTF-8 mode; I know FC6 does, since that's what I'm running. When passing data to and from Linux (rsync, unison, ftp, ...), being able to put Cygwin in UTF-8 mode would make the two more compatible. Filenames in different languages passed from one OS to another would start working.
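To make the LC_CTYPE option concrete, here is a minimal sketch of how an ordinary POSIX program can find out which codeset its locale selects. It only uses the standard setlocale() and nl_langinfo() calls. The helper names current_codeset() and codeset_is_utf8() are made up for this example, and nothing here says how Cygwin itself would implement such a switch.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the locale from the environment
   (LC_ALL, LC_CTYPE, LANG) and return the name of its codeset. */
const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");      /* "" means: take the setting from the environment */
    return nl_langinfo(CODESET);  /* e.g. "UTF-8" in a UTF-8 locale */
}

/* Hypothetical helper: 1 if the environment selects a UTF-8 codeset, else 0.
   The spelling of the codeset name varies between systems, so two common
   spellings are checked; this list is an assumption, not exhaustive. */
int codeset_is_utf8(void)
{
    const char *cs = current_codeset();
    assert(cs != NULL);
    return strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0;
}
```

A program run with something like LC_CTYPE=en_US.UTF-8 in its environment would normally see codeset_is_utf8() return 1, provided that locale is installed on the system.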


>
> III.Long path names using the above syntax are obviously always
> absolute paths. Since all other paths are restricted to
> MAX_PATH == 260 chars in Win32, any relative path is restricted
> to 260 chars as well.

Something that appears to have been missed is the distinction between 260 wide characters and 260 multibyte (8-bit) characters. Here is a perfectly fine multibyte filename. It's in Japanese, so it may or may not display in your email client.

const char* filename = "ãããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããããããææèãèéãããããåãããããããã.txt";

It is 418 bytes long in iso-2022-jp encoding. If I call MultiByteToWideChar and convert it to UTF-16, it will be 211 wide characters. That same string in UTF-8 is 625 bytes.
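The arithmetic behind those three figures can be sketched in a few lines of C. This is only an illustration under stated assumptions: 211 wide characters and 625 UTF-8 bytes are consistent with a name made of 207 characters that each take 3 bytes in UTF-8 plus the 4 ASCII characters of ".txt", so the stand-in name below uses 207 copies of U+3042 rather than the real filename. The double-byte count is a simplified model (1 byte for ASCII, 2 bytes for everything else, escape sequences ignored). All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode n UTF-16 code units as UTF-8 (no terminating NUL).
   An unpaired surrogate is counted as 3 bytes, the size of U+FFFD. */
size_t utf8_len_from_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;           /* surrogate pair: two code units, four bytes */
            i++;
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

/* Simplified double-byte model (an assumption): 1 byte per ASCII code unit,
   2 bytes per non-ASCII code unit, escape sequences ignored. */
size_t dbcs_len_from_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++)
        bytes += (s[i] < 0x80) ? 1 : 2;
    return bytes;
}

/* Hypothetical stand-in for the sample name: 207 copies of U+3042 followed
   by ".txt". Returns the number of code units written, or 0 if cap is too small. */
size_t build_sample_name(uint16_t *buf, size_t cap)
{
    static const char ext[] = ".txt";
    const size_t kana = 207, total = kana + 4;
    if (cap < total)
        return 0;
    for (size_t i = 0; i < kana; i++)
        buf[i] = 0x3042;
    for (size_t i = 0; i < 4; i++)
        buf[kana + i] = (uint16_t)ext[i];
    return total;
}
```

With a 260-element buffer, build_sample_name() yields 211 code units, which fits under the 260 wide-character limit, while the same name needs 418 bytes under the double-byte model and 625 bytes in UTF-8.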

The point I'd like to make is that Cygwin/Linux/Unix/POSIX defines a public interface for the number of 8-bit characters passed in to functions like open() and out of functions like readdir(). Those functions take PATH_MAX characters. PATH_MAX is the PUBLIC size, in 8-bit characters, for those functions to use. Setting it to 260 is too small, because if I pass the filename above to open() it should succeed. If I call readdir() and it reads a file with that name, the buffer in dirent.d_name[PATH_MAX] needs to be big enough to receive the filename; in the case above that size is 418 if the current codepage is iso-2022-jp and 625 if it's CP_UTF8.

As far as I know, the most any wide character will expand to is 4 bytes. That means PATH_MAX has to be set to either (260 * 4) or, if you're going to go for the 32767 NT limit, (32767 * 4), because that is the Linux-compatible public interface limit that will make Cygwin work regardless of the current codepage.

That also means Cygwin seems to need a different internal constant, something like CYGWIN_INTERNAL_PATH_MAX, if it's going to use wide characters internally. Internally it would use filename buffers of 32767 wide characters. Externally it would accept and supply filename buffers of 128k 8-bit characters.
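As a rough sketch of that split, assuming the 4-bytes-per-wide-character worst case used above: the constant names below are illustrative only (CYGWIN_INTERNAL_PATH_MAX is the name suggested in this message; PUBLIC_PATH_MAX and MAX_BYTES_PER_WCHAR are made up here), and none of them are actual Cygwin definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration; not actual Cygwin definitions. */
#define CYGWIN_INTERNAL_PATH_MAX ((size_t)32767)  /* wide characters (NT limit)     */
#define MAX_BYTES_PER_WCHAR      ((size_t)4)      /* worst case assumed in the text */
#define PUBLIC_PATH_MAX          (CYGWIN_INTERNAL_PATH_MAX * MAX_BYTES_PER_WCHAR)

/* 32767 * 4 = 131068 bytes, i.e. just under 128k. */
static_assert(PUBLIC_PATH_MAX == 131068, "public limit is 4x the internal limit");

/* 1 if a multibyte path of byte_len bytes plus its NUL fits the public buffer. */
int fits_public_buffer(size_t byte_len)
{
    return byte_len + 1 <= PUBLIC_PATH_MAX;
}

/* 1 if a wide path of wchar_len characters plus its NUL fits the internal buffer. */
int fits_internal_buffer(size_t wchar_len)
{
    return wchar_len + 1 <= CYGWIN_INTERNAL_PATH_MAX;
}
```

Under these assumptions the 625-byte UTF-8 name from the example fits the public buffer and its 211 wide characters fit the internal one, whereas a public limit of 260 bytes would reject it.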

