This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: avl-trees vs. hashes


On 28 Oct 1998, Harvey J. Stein wrote:

> Jay Glascoe <jglascoe@jay.giss.nasa.gov> writes:
> 
>  > 1.  not storing the key's hash in the entry would be a big win for
>  > deletion in certain, relatively common situations.  For example, if the
> 
> What do you mean, "for deletion"?  Because of the GC, more memory =
> more CPU = slower in general (because more memory gets scanned), so
> including the hash value makes the program slower in general.
> 

All I know is that my hash tables outperform (speed-wise) Guile's hash
tables at everything except deletion.  If I test

          (set! mytab (make-proc))
	  (for-each (insert-entry mytab insert-proc) lines) 
	  (for-each (find-entry mytab find-proc) lines)     
 	  (for-each (delete-entry mytab delete-proc) lines) 

auto-shrink enabled
*time: 557* 
(gc-time-taken . 332)
(cells-allocated . 35957)
(cell-heap-size . 98304)
(bytes-malloced . 93495)
(gc-malloc-threshold . 150000)
(cell-heap-segments (537251512 . 536989368) (537909288 . 537385000))

auto-shrink disabled
*time: 564* 
(gc-time-taken . 323)
(cells-allocated . 35945)
(cell-heap-size . 98304)
(bytes-malloced . 93497)
(gc-malloc-threshold . 150000)
(cell-heap-segments (537251512 . 536989368) (537963016 . 537438728))

Guile hash
*time: 448* 
(gc-time-taken . 241)
(cells-allocated . 36002)
(cell-heap-size . 98304)
(bytes-malloced . 126346)
(gc-malloc-threshold . 238159)
(cell-heap-segments (537251512 . 536989368) (537962712 . 537438424))

in fact, Guile's hashes are faster at deletion than they are at insertion!
if I test

	  (set! mytab (make-proc))
	  (for-each (insert-entry mytab insert-proc) lines) 
	  (for-each (find-entry mytab find-proc) lines)     
	  (for-each (insert-entry mytab insert-proc) lines) 

auto-shrink enabled
*time: 398* 
(gc-time-taken . 185)
(cells-allocated . 65968)
(cell-heap-size . 294912)
(bytes-malloced . 109871)
(gc-malloc-threshold . 150000)
(cell-heap-segments (537251512 . 536989368) (537909288 . 537385000)
                    (539510888 . 537938024))

Guile hash
*time: 635* 
(gc-time-taken . 433)
(cells-allocated . 56009)
(cell-heap-size . 98304)
(bytes-malloced . 126346)
(gc-malloc-threshold . 238159)
(cell-heap-segments (537251512 . 536989368) (537962712 . 537438424))


> OTOH, all hash operations requiring looking for items might be faster
> because you can test for hash equality before testing key equality &
> the hash equality test might be faster.

yes, in almost all cases.  it's faster, e.g. to check whether two symbols
are eq? because then, in the C code, you're comparing two long's,
whereas testing two SCM numbers requires that you do two SCM_INUM()'s

#define SCM_INUM(x) (((x)>>1)>>1)

> I can't imagine the hasher being faster than just looking up the
> number, so it should always be faster to look them up than to
> recompute all the hashes.  However, it might be a minor point for
> small keys because of the other work needed when rehashing.

the problem is that a full size long won't fit into a SCM INUM.  I'm
currently masking (and'ing) my hash with 2,097,157 which does fit, but
obviously won't work work for extremely huge tables (>2,000,000 entries). 
If I insist on a range (- 2^31, 2^31 - 1) for my hashes, then they'll be
stored as "big numbers".  Converting a big number to a long is like
converting a string to a number (not very fast).