This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



UTF-8: Invalid multibyte sequence


Hello,

I am experimenting with UTF-8 in glibc 2.13 (Debian testing).
For this purpose, I created a simple multibyte UTF-8 sequence using
gedit:
----------
aäaä
----------
(a followed by a-umlaut followed by a followed by a-umlaut)

The following program:
----------
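#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
  setlocale(LC_ALL, "en_US.UTF-8");

  FILE *f = fopen("utf-8.txt", "r");
  if (f == NULL)
    return 1;
  char buffer[1024];
  fscanf(f, "%1023s", buffer);  /* bound the read to the buffer size */
  //buffer[6] = 0xC0;
  //buffer[7] = 0x80;
  buffer[6] = '\0';
  fclose(f);  /* was close(f), which expects a file descriptor */

  /* strlen() and mbstowcs() return size_t; the cast makes (size_t)-1 print as -1 */
  printf("buffer='%s' strlen(buffer)=%zu, numChars=%ld\n",
         buffer,
         strlen(buffer),
         (long)mbstowcs(NULL, buffer, 0));

  return 0;
}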
----------

outputs:
----------
buffer='aäaä' strlen(buffer)=6, numChars=-1
----------

mbstowcs(NULL, buffer, 0) is a standard way to get the number of
characters in a multibyte string. A return value of -1, i.e.
(size_t)-1, means "An invalid multibyte sequence has been encountered".
Could the problem be the termination sequence? I tried both 0x00 and
0xC0,0x80...
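
For reference, here is a minimal sketch of the counting idiom with explicit
error checks. The helper names count_mb_chars() and use_utf8_locale() are
hypothetical and exist only for illustration. The sketch assumes the glibc
and POSIX behaviour of mbstowcs() with a NULL destination (count only), and
it tries "C.UTF-8" as a fallback locale name, which is an assumption about
what may be installed. setlocale() returns NULL when the requested locale
is not available, in which case the program stays in the "C" locale.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters in the multibyte string s
   under the current LC_CTYPE locale, or -1 if s contains a sequence
   that is invalid in that locale. */
long count_mb_chars(const char *s)
{
  assert(s != NULL);
  size_t n = mbstowcs(NULL, s, 0);  /* NULL destination: count only */
  return n == (size_t)-1 ? -1L : (long)n;
}

/* Hypothetical helper: try to select a UTF-8 locale.
   Returns 1 on success and 0 if none of the names is installed. */
int use_utf8_locale(void)
{
  static const char *const names[] = { "en_US.UTF-8", "C.UTF-8" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
    if (setlocale(LC_ALL, names[i]) != NULL)
      return 1;
  return 0;
}
```

A caller would check use_utf8_locale() once at startup and report a failure
before counting, so that a -1 from count_mb_chars() can be attributed to the
data rather than to the locale.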

Next, I tried to create a wide-character string using L"..." and
convert it to a multibyte sequence with wcsrtombs():
----------
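#include <wchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <errno.h>

int main(void)
{
  setlocale(LC_ALL, "en_US.UTF-8");

  /* zero-initialized, so it stays terminated if the conversion fails */
  char buffer[1024] = "";

  const wchar_t *WCS = L"aäaä";
  errno = 0;
  size_t result = wcsrtombs(buffer, &WCS, sizeof buffer, NULL);
  /* size_t values are cast so that (size_t)-1 prints as -1 */
  printf("result=%ld, errno=%d\n", (long)result, errno);
  wprintf(WCS);  /* wide output after printf(); prints nothing in the run below */

  printf("buffer='%s' strlen(buffer)=%zu, numChars=%ld\n",
         buffer,
         strlen(buffer),
         (long)mbstowcs(NULL, buffer, 0));

  return 0;
}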
----------
The output is:

----------
result=-1, errno=84
buffer='a' strlen(buffer)=1, numChars=1
----------

errno=84 means EILSEQ = "Illegal byte sequence (POSIX.1, C99)"
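
As a point of comparison, the conversion can be wrapped with explicit checks
on the wcsrtombs() result. This is only a sketch: wide_to_mb() is a
hypothetical helper, and reporting a too-small buffer through ERANGE is a
choice made here, not something wcsrtombs() does itself. Writing the umlaut
as \u00e4 in the wide literal avoids depending on the encoding of the source
file.

```c
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the wide string ws into out (cap bytes)
   using the current locale. Returns the number of bytes written, not
   counting the terminating NUL, or -1 with errno set: EILSEQ if a
   character cannot be represented, ERANGE (set here) if out is too small. */
long wide_to_mb(const wchar_t *ws, char *out, size_t cap)
{
  assert(ws != NULL && out != NULL && cap > 0);

  mbstate_t st;
  memset(&st, 0, sizeof st);        /* start in the initial shift state */

  const wchar_t *src = ws;
  size_t n = wcsrtombs(out, &src, cap, &st);
  if (n == (size_t)-1) {            /* errno is EILSEQ */
    out[0] = '\0';
    return -1;
  }
  if (src != NULL) {                /* stopped at cap, no terminator written */
    out[0] = '\0';
    errno = ERANGE;
    return -1;
  }
  return (long)n;                   /* src == NULL: whole string converted */
}
```

With a UTF-8 locale in effect, wide_to_mb(L"a\u00e4", out, sizeof out) would
be expected to yield the three bytes 0x61 0xC3 0xA4.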

What am I doing wrong? What's the best way to generate a valid
multibyte sequence?

Thanks,
-- 
Felix Natter

