This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: binary-io, opposable-thumb, pack/unpack (was Re: binary-io (was Re: rfc 2045 base64 encoding/decoding module))


> From: Per Bothner <per@bothner.com>
> Date: 16 Feb 2000 18:12:38 -0800
> 
> > Do you think combined input/output ports are more trouble than they're
> > worth?
> 
> I don't understand the question:  Do you mean ports that are combined
> input/output or combined byte/character?

Actually I meant what I wrote: combined input/output ports. I should
have given more warning that the question was somewhat off-topic.  The
answer you gave is still valuable though.

>  Assuming you mean a port
> that is at the same time a byte-sequence and a character-sequence,
> I think that encourages sloppy programming and typing discipline.
> 
> In any case, I think we are stuck with the standard Scheme ports being
> character sequences, not byte sequences.  We also have the fact that
> low-level files and network protocols need to work with bytes.  So it
> seems inescapable that a file port is something that works on an
> underlying byte-sequence, and decodes that to an appropriate
> character-sequence.

Yes, this seems reasonable to me.  I thought it would be helpful to
introduce a new primitive that could read a byte from a port and
return an integer, but it doesn't seem to be necessary.

> We need to allow people to use the existing
> functions for binary I/O.  I think the right model is that in that
> case the character sequence would be using the "trivial" encoding
> (integer->char and char->integer).  If the internal character set is
> Unicode or a superset thereof, then that is equivalent to using the
> ISO-Latin-1 encoding (plus disabling CR/LF processing).

Yes.  It would be best to give it a name that's an abbreviation for
"the encoding that maps 8-bit bytes to the first 256 characters in the
internal character set" rather than just calling it ISO-Latin-1.  But
defining this as a completely different port type seems like overkill.

> The real argument I think is whether programmers should be allowed
> to change the encoding function of a port *on the fly*.  My point
> is you don't really need it, and it has some troubling semantics.
> However, I won't necessarily say it's the wrong thing to do.  It
> might be the simplest extension to Scheme.  You do have to be very
> careful about how you define this extension:  What happens in the
> various cases, shift states, etc.

This seems reasonable to me.  It only becomes much of an issue when
dealing with stateful encodings, doesn't it?

Would it seem reasonable, when defining "unpack" routines like
(read-foobar-16 port), to throw an error if the port were not in "the
encoding that maps 8-bit bytes to the first 256 characters in the
internal character set"?
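
The check being proposed might look something like this (sketched in
Python; the name read_u16_be and the use of Python's binary-stream
classes as a stand-in for "port in the trivial encoding" are both
assumptions for illustration):

```python
import io

def read_u16_be(port):
    """Read a big-endian unsigned 16-bit integer from a byte port.

    Raises an error unless the port really delivers raw bytes -- the
    analogue of requiring the byte-transparent ("trivial") encoding
    before running an unpack routine like (read-foobar-16 port).
    """
    if not isinstance(port, (io.RawIOBase, io.BufferedIOBase)):
        raise TypeError("read_u16_be requires a binary (byte) port")
    data = port.read(2)
    if len(data) != 2:
        raise EOFError("unexpected end of stream")
    return (data[0] << 8) | data[1]

port = io.BytesIO(b"\x12\x34")
assert read_u16_be(port) == 0x1234
```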

> > It seems a bit restrictive to allow only meaningful and reliable
> > formats.  Examples would be things like reading a binary database
> > record with string fields or decoding network protocols (I'm not
> > sure which ones off hand.
> 
> You're confusing files and ports.  A file contains bytes.  Reading
> a file is essentially parsing it, which requires knowing the grammar
> and encoding of the file.  Reading a file that contains a mix of
> binary data and strings requires being able to delimit the strings.
> This has to be well-defined in the file format.  One clean solution
> when reading a string in a binary file is to read the bytes until the
> termination condition (either a count or a delimiter) and then
> converting the bytes read to a string, using the appropriate conversion
> function.  You only use a byte-input-port, and not a char-input-port.

I guess it would make sense, if you had a vast byte/character
conversion library accessible to the port implementation, to make it
available in Scheme too.

> > Doesn't HTTP start with an ASCII header and
> > switch to a character set specified in the header?)
> 
> The clean way to do that is open a byte-input-port, and read enough
> to determine the encoding.  At that point, you create a char-input-port
> that indirects to the byte-input-port to read the rest of the response
> or file.  You can either rewind the byte-input-port, or (better)
> have the char-input-port start reading bytes at a well-defined point
> in the byte stream.

In this case, though, ASCII isn't stateful, so I don't see the harm
in simply switching the port's encoding.
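
Per's two-step approach maps neatly onto Python's standard I/O
layering, which can serve as a sketch (the miniature header format
below is an assumed stand-in for a real HTTP response):

```python
import io

# Assumed miniature HTTP-like response: an ASCII header line naming
# the charset, a blank line, then the body in that charset.
raw = io.BytesIO(
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"\r\n"
    b"caf\xe9"
)

# Step 1: read the header through the byte port only.  readline() on a
# byte port consumes bytes up to the delimiter, so afterwards the port
# sits at a well-defined point in the byte stream.
header = raw.readline().decode("ascii")
charset = header.rstrip().split("charset=")[1]
raw.readline()  # consume the blank line

# Step 2: wrap a character port around the byte port, starting at the
# current byte offset -- no rewinding needed.
chars = io.TextIOWrapper(raw, encoding=charset)
assert chars.read() == "café"
```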

> In any case, if you have a mix between binary and character data, or
> between different character encodings, you have to be careful about
> properly synchronizing the character stream with the underlying byte
> stream.  You have to do this whether you have a single combined
> port object, or you use character ports that forward requests to
> a byte port.

True, when dealing with stateful encodings.

> > Your system could make read and read-line simpler or more efficient, I
> > think, by allowing them to scan the buffer without needing to decode
> > the bytes.
> 
> But you can't do that!  The scanning is defined in terms of character
> delimiters, not bytes.  For certain encodings, and certain sets of
> delimiter characters, you can make some optimizations, but those are
> special-case hacks.  The system could do that *behind the scenes*,
> but the published semantics need to handle the general case in a
> clean and consistent manner.

I'm trying to get a grasp on how your "dual port" system actually
works.  I guessed you meant the character port has a buffer using the
internal encoding: the conversion from the byte stream into the
internal encoding is done when the buffer is filled.
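
That guessed buffering model can be sketched with an incremental
decoder (in Python, as an analogue only; whether the "dual port"
system actually works this way is exactly the open question here):

```python
import codecs

# Guessed model: the character port keeps a buffer in the internal
# encoding, and conversion from the byte stream happens when that
# buffer is refilled.  An incremental decoder carries the state needed
# when a multi-byte sequence is split across two refills.
decoder = codecs.getincrementaldecoder("utf-8")()

byte_chunks = [b"caf", b"\xc3", b"\xa9!"]  # the é is split across chunks
char_buffer = ""
for chunk in byte_chunks:
    char_buffer += decoder.decode(chunk)

assert char_buffer == "café!"
```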

> > Maybe not in general, it would be up to the user not to mess it up.
> > Banning it completely seems like overkill.
> 
> Guile is meant to be a scripting language, not a systems-programming
> language.  It should make it easy to write correct and general code,
> and harder to write possibly-faster but incorrect code, not the
> other way around.

Guile is intended primarily to be embeddable in C applications,
I believe.  But in any case it would be inconvenient to need to
switch to a completely different I/O system when moving between
Guile and a Scheme intended for systems-programming.

> > I was thinking of where strings are passed to various system call and
> > gh_ interfaces, so reading a string (of arbitrary bytes) with read-line
> > and writing it to the interface would end up modifying the bytes.
> 
> Passing a string containing characters to system routines that expect
> bytes is not something you can expect to work.  It may work for strings
> that use appropriate stateless multi-byte encodings, such as UTF8,
> which I believe has been proposed for Guile.

Yes, that's what I was saying: I wouldn't expect it to work.  But it
is interesting that quite a few things could work with UTF-8 without
any special conversion, provided people would be happy with UTF-8
filenames etc.
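
A quick sketch of why UTF-8 behaves so well here (in Python, as an
illustration of the encoding's properties rather than of any Guile
interface):

```python
# UTF-8 is stateless, so a string encoded once can be handed to
# byte-oriented interfaces unchanged -- the reason UTF-8 filenames
# etc. can "just work" without special conversion.
name = "café.txt"
encoded = name.encode("utf-8")

# ASCII characters keep their single-byte values, so byte-oriented
# code that only looks for ASCII delimiters still works:
assert encoded.split(b".")[1] == b"txt"

# ...and the round trip back to characters is lossless:
assert encoded.decode("utf-8") == name
```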
