This is the mail archive of the guile@cygnus.com mailing list for the Guile project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]

Re: New shared substring implementation

To: forcer <forcer@mindless.com>
Subject: Re: New shared substring implementation
From: Greg Harvey <Greg.Harvey@thezone.net>
Date: 31 May 1999 20:47:21 -0230
Cc: guile@cygnus.com
References: <199905310759.JAA17866@forcix.roof.lan>

forcer <forcer@mindless.com> writes:

> Lately, i was wondering why there was the need for (explicitely)
> shared substrings at all.
> As hinted in guile-ref, a copy-on-write strategy might make
> make-shared-substring useless, since normal substring could
> return a shared substring which will be copied on the first
> write. This would add a simple check on each string mutation
> function, but could increase performance drastically otherwise
> and retain a very simple interface.
> 
> I liked the idea and wanted to implement it in guile. Taking a
> look at strings.[ch] (and symbol.h), i noticed that the length of
> a string is encoded in it's SCM object, while the SCM obj's CDR
> has to point to the char* directly.
> 
> If i wanted to change this, i had to change the implementation of
> symbols as well (and probably take in alot of incompatibilities
> with other parts of guile that use strings)
> 
> I assume that such pointer plays - negative indices to store
> information, etc. - are common in guile. One question arises:
> why?
> 
> Is (SCM_CAR(val)>>8) that much faster than SCM_CDR(val)->length?

Depends an awful lot on the usage patterns of strings in the
program. It shouldn't be too much of a pain to keep with that, though.

> Is SCM_CDR(val)[-1] that much faster than SCM_CDR(val)->something?

Should be pretty much the same thing.

> Well, my thoughts about the copy-on-write idea stumbled on one
> problem.
> (define str1 "foo")		   ; a string is allocated as usual
> (define str2 (substring str1 0 1)) ; str2 is now a shared
> 			           ; substring of str1
> (string-set! str1 0 #\b)	   ; since str1 is not a shared
> 				   ; substring, it's not copied,
> 				   ; and thus str2 => "b"
> 
> A solution to this problem would be to have (substring) mark the
> string it's applied to to be a "substring" as well. The problem
> then is that:
> (define str1 "foo")
> (define str2 (substring str1 2 3)) ; str1 and str2 are marked as
> 			           ; shared substrings
> (string-set! str1 0 #\b)	   ; str1 now points to a newly
> 				   ; malloc'd copy of the string,
> 				   ; and str2 points to the
> 				   ; original string[2]
> (string-set! str2 0 #\b)	   ; str2 is now a copy of the
> 				   ; string as well, and there's
> 				   ; no pointer to the original
> 				   ; string left :]
> 
> Does anyone have an idea on how to solve this elegantly?

Assuming that it's using a structure rather than the current
cdr->(char *) (probably not a bad idea, though you'd ideally like to
have the structure as a block that contained the string as well, but
that's probably a bit messy ;), you could just keep track of
substrings that are pointing to it. Then a modification function
checks the string for shared substrings, and if it does have 'em, do
the copying (then chuck all references). ref counting can't really
help out here, since you'd have no way of finding those strings
pointing in without scanning the entire heap (you probably don't want
to be doing that ;). One thing you'd want to look out for here is to
copy all very small strings (<=4 or 8 bytes of string data... aren't
those already kept on the cdr, tho, or did I dream that?), rather than
creating a shared string.

-- 
Greg

References:
- New shared substring implementation
  - From: forcer <forcer@mindless.com>

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]