Hmmm... I think what is going on is that clustering is not letting the
dictionary do its job quickly. In other words, the hash of y being left
alone results in chunks of 1000 points wanting to be together, and the
position of these chunks in the hashed collection is controlled by x
hash hashMultiply.
There are a number of things to observe. First is that doing this,
Time millisecondsToRun:[
d := Dictionary new: 1300000.
1 to: 3 do:[:x | 1 to: 1000 do: [:y |
d at: x@y put: #foo]].
1 to: 3 do:[:x | 1 to: 1000 do: [:y |
d removeKey: x@y]].
]
cuts down the running time to under 100 ms. The second thing is that
the suggestion, now being symmetric on x and y, has other issues as
well. For example, now it will always be the case that
x@y hash = y@x hash
no matter how well the hashes of x and y are mixed / post processed.
This is the motivation for using non symmetric hash functions when order
is important for comparison. In fact, using the Hash Analysis Tool
reveals that the current implementation has these properties:
Collision rate: 1.
Chi square: 0.
Chi square mod p average: 0.343364.
On the other hand, the suggested hash function has these properties:
Collision rate: 2.346245 (anything over 1.01 is bad)
Chi square: 2.384693 (anything significantly over 0 is bad)
Chi square mod p average: 2.856256 (anything significantly over 0.5 is
bad)
For the sake of completeness, the original VW 7.5 implementation that
was replaced produced these numbers.
Collision rate: 244.140625
Chi square: 242.524372
Chi square mod p average: 242.524372 (bonus bad points for not
distributing well mod p)
When doing the hash book, one of my data sets was indeed the 1000x1000
point data set. It gave me headaches like you wouldn't believe. I
tried your suggestion, as well as many others, and discarded them
because of their collision rate. Of the hash functions that gave
perfect collision rate, chi square, and chi square mod p averages, what
happened was that individual chi square mod p values had issues. This
is indicative of clustering, and this is what you found.
While I ended up creating hash functions that had nice chi square mod p
values at the cost of about 25k collisions in a data set of 10^6 points,
the issue is that they require considerable more work than a single line
of code. Rather, they are about 5 long lines of stuff that *looks* like
magic constants and unwarranted complication. The value judgment here
was that in practice it would be rare to see a dictionary of all the
integer points between 0@0 and 999@999 (or 1@1 and 1000@1000), and so
the simpler hash function already provides far more value than the old
one.
The last observation to make is that nothing pays more than knowing your
data set. For example, if one really wanted to put the 1000x1000 square
point data set in a dictionary, then one could actually *use* the
characteristics of the data set. What could be done? Something like
subclassing Point and reimplementing hash like this:
hash
^x * 1000 + y
In the face of this implementation, you could even claim the 5 lines of
densely packed code look ridiculous for this specific application. On
the other hand, the hash function above uses information that is not
available to those that try to provide a hash function that will
actually work in most generic circumstances.
Andres.
PS: did you get to the section of the hash book that talks about
Point>>hash yet? :)
-----Original Message-----
From:
vwnc-bounces@... [mailto:
vwnc-bounces@...] On
Behalf Of Reinout Heeck
Sent: Thursday, May 08, 2008 5:57 AM
To: 'VWNC'
Cc: Andres Valloud
Subject: [vwnc] [vw76] Point>>hash still not good enough
Point>>hash is not defined symmetrically over its x and y variables:
^x hash hashMultiply bitXor: y hash
As a result performance of hashed collections containing points shows
very asymmetrical run times:
Time millisecondsToRun:[
d := Dictionary new.
1 to: 3 do:[:x | 1 to: 1000 do: [:y |
d at: x@y put: #foo]].
1 to: 3 do:[:x | 1 to: 1000 do: [:y |
d removeKey: x@y]].
]
"30532 30022 29883"
Time millisecondsToRun:[
d := Dictionary new.
1 to: 3 do:[:y | 1 to: 1000 do: [:x |
d at: x@y put: #foo]].
1 to: 3 do:[:y | 1 to: 1000 do: [:x |
d removeKey: x@y]].
]
"8 8 8"
If I change the implementation of Point>>hash to be symmetric:
^x hash hashMultiply bitXor: y hash hashMultiply
the run times reported are
" 9 10 9 "
and
" 9 10 10"
HTH,
Reinout
--------
_______________________________________________
vwnc mailing list
vwnc@...
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc_______________________________________________
vwnc mailing list
vwnc@...
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc