Reinout,
In the past, your comments made me conclude that you prefer hash
functions that take little time to evaluate over hash functions that
distribute hash values very well but take their time to do so.
However, now that Point>>hash substantially improves what was shipped
with VW 7.5 and before in terms of distribution while keeping
computational complexity at some sort of bare minimum, it is also a
problem because it does not distribute well enough in cases in which it
seems that a 2d array (or hash bucket strategy as opposed to open
addressing linear probing) would do better for you.
What I am basically hearing in response to this situation is "make the
hash function more expensive so my problem goes away". However, if
Point>>hash is modified to do more work in order to distribute better,
then it starts becoming a target of your previous admonitions regarding
execution speed.
While I do not necessarily disagree with improving the distribution of
Point>>hash further at some expense of hash value computation, my
concern is that it is not possible to win.
Regarding the hash analysis tool, the stated measurements are direct
predictors of hashed collection performance. See chapter 2 of the hash
book.
Andres.
Reinout Heeck wrote:
> On May 9, 2008, at 9:33 PM, Andres Valloud wrote:
>
>
>> Reinout Heeck wrote:
>>
>>> Valloud, Andres wrote:
>>>
>>>
>>>> Hmmm... I think what is going on is that clustering is not letting
>>>> the
>>>> dictionary do its job quickly.
>>>>
>>>>
>>> Yes, it surprised me that this case was not fixed in the hashing
>>> overhaul.
>>>
>>>
>>>
>>>
>> For all hash functions there will be datasets that cause them to
>> behave
>> badly. This cannot be avoided.
>>
>>
>>> I had hoped these [value judgment] days would be over...
>>>
>
> Heh :-)
>
> What I meant there was the days of worrying about clustered values.
>
>
>
> More generally, before the hashing overhaul I had run into three
> problems repeatedly:
> 1) Poor performance of hashed collections holding strings.
> 2) Poor performance of hashed collections holding clustered integer
> values.
> 3) Poor abstraction of hash combination 'for dummies' leading to poor
> #hash implementations on domain objects.
>
> An earlier conversation we had extinguished my hopes for any solution
> to 3) in the overhaul but I had fully expected both 1) and 2) to be
> resolved by the redesign. Now I get the impression only 1) has been
> addressed adequately.
>
> You seem to be dismissing 2) as being a special case that needs not
> perform with the base collection implementations. Since this is a
> value judgment we can keep debating this, instead I can provide you a
> simple datapoint: my shop would be relieved when 2) is addressed.
>
>
> I had a quick look (for the first time) at the hash analysis tool in
> order to measure performance of some hashed collection access
> patterns. I only found quality assessments of #hash implementations
> for integer based values but no benchmark tool to measure run time of
> collection accesses - it seems this is not supported. Is there some
> other tool I need to load in order to measure the performance of
> hashed collections?
>
>
>
> Thanks for your suggestions regarding Point>>hash, I'll experiment
> with them when I'm back at work (bank holiday here today).
>
> Cheers,
>
> Reinout
> -------
>
> _______________________________________________
> vwnc mailing list
>
vwnc@...
>
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc>
>
_______________________________________________
vwnc mailing list
vwnc@...
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc