This is the mail archive of the kawa@sourceware.org mailing list for the Kawa project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Questions about SRFI-38

From: Jamison Hope <jrh at theptrgroup dot com>
To: Per Bothner <per at bothner dot com>
Cc: Charles Turner <chturne at gmail dot com>, kawa at sourceware dot org
Date: Wed, 11 May 2011 18:06:48 -0400
Subject: Re: Questions about SRFI-38
References: <4DBDFDBB.3000704@gmail.com> <4DBE06D4.6040300@bothner.com> <4DBF49DA.9050408@gmail.com> <4DBFACD6.50100@bothner.com> <4DC9CE45.2020306@gmail.com> <4DC96E75.2050400@bothner.com>

On May 10, 2011, at 12:57 PM, Per Bothner wrote:

On 05/10/2011 04:46 PM, Charles Turner wrote:
Sorry for the late reply, I've been away. My response beneath Per's
email below is mostly irrelevant to the main thread of conversation.
I'm struggling to understand how the pretty printer works. I've done the easy part of traversing and gathering information from a possibly cyclic list, but I'm having a hard time choosing how to modify the pretty printer's data structures (and hence, what information I should keep around in Info objects, I don't really get why your example was putting a hash table into them).

I think there are at least two reasons for this. 1.) I don't understand how the queueInts information is used in tandem with the buffer. My impression is that it provides information about various "formatting" tokens, such as newlines tabs and "blocks" to aid the writer in populating the buffer.
Actually, once you figure this all out, if you could add comments etc
so other people can understand it, that would be cool ...

+1

Per, a comment near the top of gnu.text.PrettyWriter mentions that it's based upon SBCL's pprint.lisp; how slight is that connection at this point? Would reading the analogous Lisp source help Charles to understand what's going on, or just confuse him more?

I haven't thought this all the way through, due to point 1, but I suppose I could add two more QITEM's, one for position markers, and one for back references with offsets into the queueInts (or maybe use the queueStrings) array suggesting to the buffer how much space each might occupy.
Some variation of that, yes.

I don't follow the logic as to how all the QITEM's offsets are computed. Why is QITEM_BASE_SIZE = 2? Are 0, 1 used for the QITEM type and size information or something similar?
Correct.
Think of queueInts as a sequence of "formatting token".  There are
various kinds of token, of different sizes.  For example, one type
is QITEM_TAB_TYPE - this is a typecode.  Each token of type
QITEM_TAB_TYPE has "size" QITEM_TAB_SIZE (which is 5).  This means
it uses 5 ints in the queueInts buffer.  Also consider there is an
inheritance tree between the various kinds, so QITEM_TAB_TYPE
inherits from the base type.  The size of the base is QITEM_BASE_SIZE
because all QITEM take at least two items: First a QITEM_TYPE_AND_SIZE
(at offset 0) then a QITEM_POSN (at offset 1).
2.) I could abstract away from a character buffer to an object buffer. I'm worried this will be rejected due to the extra overhead of object allocation, but it would be easier to stuff position marker / back reference objects into the buffer, and they might then have a toString method's which either return the empty string or print-circle notation.
I'd rather not.  Objects are going to be quite a bit more expensive:
object allocation, worse memory locality, and needing more space for
object overhead.
OTOH you could argue this is a symptom of my penchent for micro- optimization, and there is no evidence it buys much. However, you can imagine printing out a large data structure and having to keep the entire "formatting token" structure in memory, so it is good to not be too wasteful about it. Especially given that the existing implementation works this way.

The blocks array doesn't seem to concern me, though there are various calls like out.startLogicalBlock("(", false, ")") which would have to be changed.
Even if you change the internal data structures, there is no reason why these calls would need to be changed.
 It's quite unfortunate that the mechanics aren't in place to do a
generic dispatch here. There seems to be many uses of the instanceof
operator in Kawa for dispatches like this. Is it just because Kawa
pre-dates Generics? Or is there a deeper reason?
Generics (in the Java sense) is a pure compile-time thing.  Generics
(in the Common Lisp sense) or gnu.expr.GenericProc might help, but
they're not really optimized, and regardless of performance it's
still a non-trivial problem: How do you specify and select the
most specific formatting method?
What one could do is have various objects implement gnu.text.Printable or some other interface. However, that doesn't help with standard Java classes, which we can't modify. Plus you'd really want to customize formatting depending on context and programming language. So some kind of dynamic dispatch table would be preferable. But I haven't figured out how to do that efficiently. Maybe Java7 MethodHandles and dynamic dispatch may come to the rescue.

Also consider the Swing ReplPane console. We'd like to re-do the line-breaking of *prior* output lines, if the window width is changed. We do that by repeating the second pass mentioned above. So the buffer data structure needs to be preserved. This is similar to re-breaking a paragraph when a browser window is re-sized - but the pretty-printing is a bit more complicated than the typical line-break algorithm that a browser uses. (The TeX line-break algorithm is a bit more clever - and could be enhanced to do full pretty-printing.) I don't know if anyone does this kind of re-flowable pretty-printing - I think it would be quite cool. (Perhaps DrRacket does this?)
It certainly would be cool! Though, it would require a complete rewrite of the current implementation and a fair amount of work modifying the traditional algorithm.
I don't think it would be that much of a change. In fact, I think it might simplify the algorithm: The first pass has the formatting methods call methods like startLogicalBlock - which just append formatting tokens to queueInt. No attempt to do line-breaking during this pass, which would happen in a completely different section of the code.


Will the output of the first pass be saved somewhere for the entire
contents of the ReplDocument, or will the pretty printer just take
the document contents as previously displayed (i.e. one huge String),
remove all the line breaks, and insert new line breaks appropriate for
the new width?

I suppose my real question is: what information is there in the output
of the first pass which is lost by the time s-expressions (or whatever)
are written to the Document?

I don't think with my skills I'd have time to do that and work on serialisation/editing commands + error linkage for the REPL later. It depends what's more critical, having a customisable (see below) re-flowable pretty-printer, or acquiring more superficial improvements to the REPL. I could always look at it after my proposed project, unless someone else gets interested in it :-).

Having now perused DrRacket's pretty printer[1], I'm under the impression that nothing particularly fancy is going on. One nice feature is a user-editable hash for changing various parameters of the pretty printer, like maximum line length, what level of list nesting to print out (if the level is exceeded, a "..." is printed instead of the extra nesting) and provisions are provided to change white-space surrounding various symbols (e.g, 'lambda, 'let, etc). IIUC, if the output becomes to small to fit within a reduced window width, DrRacket just breaks the line, similar to vanilla Emacs when you resize a text buffer.

DrRacket does enforce a minimum REPL window width of around 68 characters, which seems sensible. I've been reading Knuth's "Breaking paragraphs into lines" paper to try and see how such an algorithm for Scheme code could be derived. He makes note of the tremendous difficulty in generating non-ragged paragraphs when printing thin columns for newspapers and such. If it's hard for English, it would be pointless to try and get fancy with Scheme IMO, just enforce a minimum width.
Note "non-ragged paragraphs" is the goal of "justification". We're not attempting that - just "line-breaking".

Also note that enforcing a minimal width isn't really possible unless you also support something akin to "hyphenation". I think it's to tolerate over-ong lines when the "words" are over-long, though one can imagine an option which also splits words. (I don't think Kawa actually supports that at the moment: You'd need to allow "hyphenated" numeric tokens, probably using some backslash-newline sequences.)

Ugh, even if Kawa supported such a thing, I think I would rather leave words intact. Just have the pretty printer adjust white space; if some- really-long-symbol extends beyond the desired width, either move the whole thing to the next line, or just make sure the next token is on the next line. (Of those two options, the second is probably simpler, as you might otherwise end up attempting to insert an infinite number of newlines before an 81- character symbol...

Jamie

--
Jamison Hope
The PTR Group
www.theptrgroup.com

Follow-Ups:
- Re: Questions about SRFI-38
  - From: Per Bothner

References:
- Questions about SRFI-38
  - From: Charles Turner
- Re: Questions about SRFI-38
  - From: Per Bothner
- Re: Questions about SRFI-38
  - From: Charles Turner
- Re: Questions about SRFI-38
  - From: Per Bothner
- Re: Questions about SRFI-38
  - From: Per Bothner

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]