This is the mail archive of the kawa@sourceware.org mailing list for the Kawa project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Questions about SRFI-38


On May 10, 2011, at 12:57 PM, Per Bothner wrote:

On 05/10/2011 04:46 PM, Charles Turner wrote:
Sorry for the late reply, I've been away. My response beneath Per's
email below is mostly irrelevant to the main thread of conversation.

I'm struggling to understand how the pretty printer works. I've done the
easy part of traversing and gathering information from a possibly cyclic
list, but I'm having a hard time choosing how to modify the pretty
printer's data structures (and hence, what information I should keep
around in Info objects, I don't really get why your example was putting
a hash table into them).


I think there are at least two reasons for this.
1.) I don't understand how the queueInts information is used in tandem
with the buffer. My impression is that it provides information about
various "formatting" tokens, such as newlines tabs and "blocks" to aid
the writer in populating the buffer.

Actually, once you figure this all out, if you could add comments etc so other people can understand it, that would be cool ...

+1


Per, a comment near the top of gnu.text.PrettyWriter mentions that it's
based upon SBCL's pprint.lisp; how slight is that connection at this point?
Would reading the analogous Lisp source help Charles to understand what's
going on, or just confuse him more?


I haven't thought this all the way through, due to point 1, but I
suppose I could add two more QITEM's, one for position markers, and one
for back references with offsets into the queueInts (or maybe use the
queueStrings) array suggesting to the buffer how much space each might
occupy.

Some variation of that, yes.


I don't follow the logic as to how all the QITEM's offsets are computed.
Why is QITEM_BASE_SIZE = 2? Are 0, 1 used for the QITEM type and size
information or something similar?

Correct.


Think of queueInts as a sequence of "formatting token".  There are
various kinds of token, of different sizes.  For example, one type
is QITEM_TAB_TYPE - this is a typecode.  Each token of type
QITEM_TAB_TYPE has "size" QITEM_TAB_SIZE (which is 5).  This means
it uses 5 ints in the queueInts buffer.  Also consider there is an
inheritance tree between the various kinds, so QITEM_TAB_TYPE
inherits from the base type.  The size of the base is QITEM_BASE_SIZE
because all QITEM take at least two items: First a QITEM_TYPE_AND_SIZE
(at offset 0) then a QITEM_POSN (at offset 1).

2.) I could abstract away from a character buffer to an object buffer.
I'm worried this will be rejected due to the extra overhead of object
allocation, but it would be easier to stuff position marker / back
reference objects into the buffer, and they might then have a toString
method's which either return the empty string or print-circle notation.

I'd rather not. Objects are going to be quite a bit more expensive: object allocation, worse memory locality, and needing more space for object overhead.

OTOH you could argue this is a symptom of my penchent for micro- optimization,
and there is no evidence it buys much. However, you can imagine printing out
a large data structure and having to keep the entire "formatting token"
structure in memory, so it is good to not be too wasteful about it.
Especially given that the existing implementation works this way.


The blocks array doesn't seem to concern me, though there are various
calls like out.startLogicalBlock("(", false, ")") which would have to be
changed.

Even if you change the internal data structures, there is no reason why
these calls would need to be changed.


 It's quite unfortunate that the mechanics aren't in place to do a
generic dispatch here. There seems to be many uses of the instanceof
operator in Kawa for dispatches like this. Is it just because Kawa
pre-dates Generics? Or is there a deeper reason?

Generics (in the Java sense) is a pure compile-time thing. Generics (in the Common Lisp sense) or gnu.expr.GenericProc might help, but they're not really optimized, and regardless of performance it's still a non-trivial problem: How do you specify and select the most specific formatting method?

What one could do is have various objects implement gnu.text.Printable
or some other interface. However, that doesn't help with standard
Java classes, which we can't modify. Plus you'd really want to customize
formatting depending on context and programming language. So some
kind of dynamic dispatch table would be preferable. But I haven't
figured out how to do that efficiently. Maybe Java7 MethodHandles
and dynamic dispatch may come to the rescue.



Also consider the Swing ReplPane console. We'd like to re-do the
line-breaking of *prior* output lines, if the window width is changed.
We do that by repeating the second pass mentioned above. So the
buffer data structure needs to be preserved. This is similar to
re-breaking a paragraph when a browser window is re-sized - but
the pretty-printing is a bit more complicated than the typical
line-break algorithm that a browser uses. (The TeX line-break
algorithm is a bit more clever - and could be enhanced to do
full pretty-printing.) I don't know if anyone does this
kind of re-flowable pretty-printing - I think it would be
quite cool. (Perhaps DrRacket does this?)

It certainly would be cool! Though, it would require a complete rewrite
of the current implementation and a fair amount of work modifying the
traditional algorithm.

I don't think it would be that much of a change. In fact, I think it
might simplify the algorithm: The first pass has the formatting methods
call methods like startLogicalBlock - which just append formatting tokens
to queueInt. No attempt to do line-breaking during this pass, which
would happen in a completely different section of the code.

Will the output of the first pass be saved somewhere for the entire contents of the ReplDocument, or will the pretty printer just take the document contents as previously displayed (i.e. one huge String), remove all the line breaks, and insert new line breaks appropriate for the new width?

I suppose my real question is: what information is there in the output
of the first pass which is lost by the time s-expressions (or whatever)
are written to the Document?

I don't think with my skills I'd have time to do
that and work on serialisation/editing commands + error linkage for the
REPL later. It depends what's more critical, having a customisable (see
below) re-flowable pretty-printer, or acquiring more superficial
improvements to the REPL. I could always look at it after my proposed
project, unless someone else gets interested in it :-).


Having now perused DrRacket's pretty printer[1], I'm under the
impression that nothing particularly fancy is going on. One nice feature
is a user-editable hash for changing various parameters of the pretty
printer, like maximum line length, what level of list nesting to print
out (if the level is exceeded, a "..." is printed instead of the extra
nesting) and provisions are provided to change white-space surrounding
various symbols (e.g, 'lambda, 'let, etc). IIUC, if the output becomes
to small to fit within a reduced window width, DrRacket just breaks the
line, similar to vanilla Emacs when you resize a text buffer.


DrRacket does enforce a minimum REPL window width of around 68
characters, which seems sensible. I've been reading Knuth's "Breaking
paragraphs into lines" paper to try and see how such an algorithm for
Scheme code could be derived. He makes note of the tremendous difficulty
in generating non-ragged paragraphs when printing thin columns for
newspapers and such. If it's hard for English, it would be pointless to
try and get fancy with Scheme IMO, just enforce a minimum width.

Note "non-ragged paragraphs" is the goal of "justification". We're not
attempting that - just "line-breaking".


Also note that enforcing a minimal width isn't really possible unless
you also support something akin to "hyphenation". I think it's to tolerate
over-ong lines when the "words" are over-long, though one can imagine an
option which also splits words. (I don't think Kawa actually supports that
at the moment: You'd need to allow "hyphenated" numeric tokens, probably
using some backslash-newline sequences.)

Ugh, even if Kawa supported such a thing, I think I would rather leave words
intact. Just have the pretty printer adjust white space; if some- really-long-symbol
extends beyond the desired width, either move the whole thing to the next
line, or just make sure the next token is on the next line. (Of those two
options, the second is probably simpler, as you might otherwise end up
attempting to insert an infinite number of newlines before an 81- character
symbol...


Jamie

--
Jamison Hope
The PTR Group
www.theptrgroup.com




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]