NavigableMap implementation bugs

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 | Next >

NavigableMap implementation bugs

by Martin Buchholz-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Doug and Josh and Chris,

(Chris, please file a bug)

TreeMap.navigableKeySet().subSet(x,y).remove(o)
fails because TreeMap.KeySet.subset calls the
TreeSet(NavigableMap<E,Object>)
constructor, and that requires an argument map
not containing arbitrary value objects for correctness.
This correctness is probably only
noticed in the return value from remove being incorrect.

In the case of ConcurrentSkipListMap, the correctness
issue is even more serious, since remove(existing element)
fails to remove it.

It's not obvious how to best fix it.  The TreeSet(NavigableMap)
and ConcurrentSkipListMap(NavigableMap) constructors
simply cannot be used to get views of non-empty maps.
It's clear that the usual straightforward but tedious solution
will work, but is there an elegant solution?

(Curiously, this is one of the rare times where
generics compiler warnings caught a genuine bug!
It's a mistake in this case to force a NavigableMap<K,V>
into a NavigableMap<K,Object>)

Most of the hits below are bugs:

 $ find -name '*Map.java' | xargs grep -E 'new (TreeSet|ConcurrentSkipListSet)'
./TreeMap.java:            return new TreeSet<E>(m.subMap(fromElement,
fromInclusive,
./TreeMap.java:            return new TreeSet<E>(m.headMap(toElement,
inclusive));
./TreeMap.java:            return new
TreeSet<E>(m.tailMap(fromElement, inclusive));
./TreeMap.java:            return new TreeSet(m.descendingMap());
./concurrent/ConcurrentSkipListMap.java:            return new
ConcurrentSkipListSet<E>
./concurrent/ConcurrentSkipListMap.java:            return new
ConcurrentSkipListSet<E>(m.headMap(toElement, inclusive));
./concurrent/ConcurrentSkipListMap.java:            return new
ConcurrentSkipListSet<E>(m.tailMap(fromElement, inclusive));
./concurrent/ConcurrentSkipListMap.java:            return new
ConcurrentSkipListSet(m.descendingMap());

Test case follows:

/*
 * Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

/*
 * @test
 * @summary navigableMap.keyset().subSet().remove(x)
 * @author  Martin Buchholz
 */

import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class Bug12 {
    void test(String[] args) throws Throwable {
        testNavigableMap(new TreeMap<Integer,Integer>());
        testNavigableMap(new ConcurrentSkipListMap<Integer,Integer>());
    }

    void checkEmpty(Collection<?> c) {
        check(c.isEmpty());
        check(c.size() == 0);
        check(! c.iterator().hasNext());
    }

    void checkEmpty(NavigableMap<?,?> m) {
        check(m.isEmpty());
        check(m.size() == 0);
        checkEmpty(m.keySet());
        checkEmpty(m.values());
        checkEmpty(m.entrySet());
    }

    void testNavigableMap(NavigableMap<Integer,Integer> m) {
        System.out.printf("Testing %s%n", m.getClass());
        checkEmpty(m);
        equal(m.put(1,2),null);
        equal(m.put(1,3),2);
        equal(m.remove(1),3);
        checkEmpty(m);
        equal(m.put(1,2),null);
        check(m.navigableKeySet().remove(1));
        checkEmpty(m);
        equal(m.put(1,2),null);
        check(m.navigableKeySet().headSet(3).remove(1));
        checkEmpty(m);
        equal(m.put(1,2),null);
        check(m.navigableKeySet().tailSet(-3).remove(1));
        checkEmpty(m);
        equal(m.put(1,2),null);
        check(m.navigableKeySet().subSet(-3,3).remove(1));
        checkEmpty(m);
        equal(m.put(1,2),null);
        check(m.isEmpty());
    }

    //--------------------- Infrastructure ---------------------------
    volatile int passed = 0, failed = 0;
    void pass() {passed++;}
    void fail() {failed++; Thread.dumpStack();}
    void fail(String msg) {System.err.println(msg); fail();}
    void unexpected(Throwable t) {failed++; t.printStackTrace();}
    void check(boolean cond) {if (cond) pass(); else fail();}
    void equal(Object x, Object y) {
        if (x == null ? y == null : x.equals(y)) pass();
        else fail(x + " not equal to " + y);}
    public static void main(String[] args) throws Throwable {
        Class<?> k = new Object(){}.getClass().getEnclosingClass();
        try {k.getMethod("instanceMain",String[].class)
                .invoke( k.newInstance(), (Object) args);}
        catch (Throwable e) {throw e.getCause();}}
    public void instanceMain(String[] args) throws Throwable {
        try {test(args);} catch (Throwable t) {unexpected(t);}
        System.out.printf("%nPassed = %d, failed = %d%n%n", passed, failed);
        if (failed > 0) throw new AssertionError("Some tests failed");}
}
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: NavigableMap implementation bugs

by Doug Lea :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Martin Buchholz wrote:

> TreeMap.navigableKeySet().subSet(x,y).remove(o)
> fails because TreeMap.KeySet.subset calls the
> TreeSet(NavigableMap<E,Object>)
> constructor

Thanks for finding this!

> In the case of ConcurrentSkipListMap, the correctness
> issue is even more serious, since remove(existing element)
> fails to remove it.
>
> It's not obvious how to best fix it.  The TreeSet(NavigableMap)
> and ConcurrentSkipListMap(NavigableMap) constructors
> simply cannot be used to get views of non-empty maps.
> It's clear that the usual straightforward but tedious solution
> will work, but is there an elegant solution?

I don't think so. Relaying the subset methods to the *Set classes
was just an implementation expediency under the mis-thought that they
would take care of recursive subsets etc rather than needing
special implementations for KeySets. But the byproduct for remove()
is clearly wrong so they do need separate implementations.

-Doug



_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: NavigableMap implementation bugs

by Osvaldo Pinali Doederlein :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Doug Lea wrote:
> I don't think so. Relaying the subset methods to the *Set classes
> was just an implementation expediency under the mis-thought that they
> would take care of recursive subsets etc rather than needing
> special implementations for KeySets. But the byproduct for remove()
> is clearly wrong so they do need separate implementations.
>    
This remembers me from an old pet peeve, the fact that java.util.HashSet
is implemented as a wrapper over java.util.HashMap. This sucks
performance-wise, due to the additional indirections everywhere, and
even more important, because every entry single object is larger than
necessary - it contains a key and a value fields, but a single field
would be necessary for HashSet. If you have a Set with millions of
entries, the overhead of one useless pointer per entry is significant.
Plus, the wrapping implementation uses a dummy, fixed Object instance as
the value for all entries - and this may create another subtle
performance problem: large numbers of "cross-space" references in the
heap. I know that in most GCs only old-to-young refs are evil, but it
seems that in some GCs any cross-space reference is bad (requires remset
tracking), like Detlef et al's new "Garbage-First" collector (if I
understood it).

Back in J2SE1.2 time, the runtime savings from the wrapping
implementation could be significant... but, today? The HashMap* classes
inside the latest rt.jar are 18,5Kb total (uncompressed), versus 3Kb
from HashSet. That's a 12Kb of space saving, plus some small saving in
JIT time too. I'd say that this is an irrelevant advantage, even in
JavaME. The performance bonus of a tuned HashSet, however modest, would
certainly be much more significant. I know it's ugly from a maintenance
POV to have two large classes that will be 90% identical, but to
paraphrase the movie Some Like It Hot, nothing is perfect...

A+
Osvaldo

--
-----------------------------------------------------------------------
Osvaldo Pinali Doederlein                   Visionnaire Informática S/A
osvaldo@...                http://www.visionnaire.com.br
Arquiteto de Tecnologia                          +55 (41) 337-1000 #223

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: NavigableMap implementation bugs

by Joshua Bloch-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Osvaldo,
 
Yes, I must say that I'm surprised that my original implementation of HashSet still stands.  It would be interesting to see what sort of performance improvement (time and space) we could get with a freestanding HashSet implementation.
 
     Josh

On Mon, Jun 9, 2008 at 2:09 PM, Osvaldo Pinali Doederlein <osvaldo@...> wrote:
Doug Lea wrote:
> I don't think so. Relaying the subset methods to the *Set classes
> was just an implementation expediency under the mis-thought that they
> would take care of recursive subsets etc rather than needing
> special implementations for KeySets. But the byproduct for remove()
> is clearly wrong so they do need separate implementations.
>
This remembers me from an old pet peeve, the fact that java.util.HashSet
is implemented as a wrapper over java.util.HashMap. This sucks
performance-wise, due to the additional indirections everywhere, and
even more important, because every entry single object is larger than
necessary - it contains a key and a value fields, but a single field
would be necessary for HashSet. If you have a Set with millions of
entries, the overhead of one useless pointer per entry is significant.
Plus, the wrapping implementation uses a dummy, fixed Object instance as
the value for all entries - and this may create another subtle
performance problem: large numbers of "cross-space" references in the
heap. I know that in most GCs only old-to-young refs are evil, but it
seems that in some GCs any cross-space reference is bad (requires remset
tracking), like Detlef et al's new "Garbage-First" collector (if I
understood it).

Back in J2SE1.2 time, the runtime savings from the wrapping
implementation could be significant... but, today? The HashMap* classes
inside the latest rt.jar are 18,5Kb total (uncompressed), versus 3Kb
from HashSet. That's a 12Kb of space saving, plus some small saving in
JIT time too. I'd say that this is an irrelevant advantage, even in
JavaME. The performance bonus of a tuned HashSet, however modest, would
certainly be much more significant. I know it's ugly from a maintenance
POV to have two large classes that will be 90% identical, but to
paraphrase the movie Some Like It Hot, nothing is perfect...

A+
Osvaldo

--
-----------------------------------------------------------------------
Osvaldo Pinali Doederlein                   Visionnaire Informática S/A
osvaldo@...                http://www.visionnaire.com.br
Arquiteto de Tecnologia                          +55 (41) 337-1000 #223

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest


_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: NavigableMap implementation bugs

by Martin Buchholz-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Osvaldo, I fundamentally agree with you that
optimizing HashSet in the manner you describe
is worthwhile, but because the implementation
needs to "parallel" HashMap, it would be good
to do this after the current hacking activity on
HashMap quiesces.

Unfortunately, this kind of programming
is pure tedium - who will volunteer?

Martin

On Mon, Jun 9, 2008 at 2:09 PM, Osvaldo Pinali Doederlein
<osvaldo@...> wrote:

> Doug Lea wrote:
>>
>> I don't think so. Relaying the subset methods to the *Set classes
>> was just an implementation expediency under the mis-thought that they
>> would take care of recursive subsets etc rather than needing
>> special implementations for KeySets. But the byproduct for remove()
>> is clearly wrong so they do need separate implementations.
>>
>
> This remembers me from an old pet peeve, the fact that java.util.HashSet is
> implemented as a wrapper over java.util.HashMap. This sucks
> performance-wise, due to the additional indirections everywhere, and even
> more important, because every entry single object is larger than necessary -
> it contains a key and a value fields, but a single field would be necessary
> for HashSet. If you have a Set with millions of entries, the overhead of one
> useless pointer per entry is significant. Plus, the wrapping implementation
> uses a dummy, fixed Object instance as the value for all entries - and this
> may create another subtle performance problem: large numbers of
> "cross-space" references in the heap. I know that in most GCs only
> old-to-young refs are evil, but it seems that in some GCs any cross-space
> reference is bad (requires remset tracking), like Detlef et al's new
> "Garbage-First" collector (if I understood it).
>
> Back in J2SE1.2 time, the runtime savings from the wrapping implementation
> could be significant... but, today? The HashMap* classes inside the latest
> rt.jar are 18,5Kb total (uncompressed), versus 3Kb from HashSet. That's a
> 12Kb of space saving, plus some small saving in JIT time too. I'd say that
> this is an irrelevant advantage, even in JavaME. The performance bonus of a
> tuned HashSet, however modest, would certainly be much more significant. I
> know it's ugly from a maintenance POV to have two large classes that will be
> 90% identical, but to paraphrase the movie Some Like It Hot, nothing is
> perfect...
>
> A+
> Osvaldo
>
> --
> -----------------------------------------------------------------------
> Osvaldo Pinali Doederlein                   Visionnaire Informática S/A
> osvaldo@...                http://www.visionnaire.com.br
> Arquiteto de Tecnologia                          +55 (41) 337-1000 #223
>
>

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: NavigableMap implementation bugs

by Joshua Bloch-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Oooh, ooh, pick me!
 
       Josh
 
P.S. It isn't as tedious as you think.  Doing the obvious transformation on HashMap gives you a benchmark.  Then you try to beat it.


 
On Mon, Jun 9, 2008 at 3:01 PM, Martin Buchholz <martinrb@...> wrote:
Osvaldo, I fundamentally agree with you that
optimizing HashSet in the manner you describe
is worthwhile, but because the implementation
needs to "parallel" HashMap, it would be good
to do this after the current hacking activity on
HashMap quiesces.

Unfortunately, this kind of programming
is pure tedium - who will volunteer?

Martin

On Mon, Jun 9, 2008 at 2:09 PM, Osvaldo Pinali Doederlein
<osvaldo@...> wrote:
> Doug Lea wrote:
>>
>> I don't think so. Relaying the subset methods to the *Set classes
>> was just an implementation expediency under the mis-thought that they
>> would take care of recursive subsets etc rather than needing
>> special implementations for KeySets. But the byproduct for remove()
>> is clearly wrong so they do need separate implementations.
>>
>
> This remembers me from an old pet peeve, the fact that java.util.HashSet is
> implemented as a wrapper over java.util.HashMap. This sucks
> performance-wise, due to the additional indirections everywhere, and even
> more important, because every entry single object is larger than necessary -
> it contains a key and a value fields, but a single field would be necessary
> for HashSet. If you have a Set with millions of entries, the overhead of one
> useless pointer per entry is significant. Plus, the wrapping implementation
> uses a dummy, fixed Object instance as the value for all entries - and this
> may create another subtle performance problem: large numbers of
> "cross-space" references in the heap. I know that in most GCs only
> old-to-young refs are evil, but it seems that in some GCs any cross-space
> reference is bad (requires remset tracking), like Detlef et al's new
> "Garbage-First" collector (if I understood it).
>
> Back in J2SE1.2 time, the runtime savings from the wrapping implementation
> could be significant... but, today? The HashMap* classes inside the latest
> rt.jar are 18,5Kb total (uncompressed), versus 3Kb from HashSet. That's a
> 12Kb of space saving, plus some small saving in JIT time too. I'd say that
> this is an irrelevant advantage, even in JavaME. The performance bonus of a
> tuned HashSet, however modest, would certainly be much more significant. I
> know it's ugly from a maintenance POV to have two large classes that will be
> 90% identical, but to paraphrase the movie Some Like It Hot, nothing is
> perfect...
>
> A+
> Osvaldo
>
> --
> -----------------------------------------------------------------------
> Osvaldo Pinali Doederlein                   Visionnaire Informática S/A
> osvaldo@...                http://www.visionnaire.com.br
> Arquiteto de Tecnologia                          +55 (41) 337-1000 #223
>
>

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest


_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

HashSet performance

by Mark Thornton :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Osvaldo Pinali Doederlein wrote:
> This remembers me from an old pet peeve, the fact that java.util.HashSet
> is implemented as a wrapper over java.util.HashMap. This sucks
> performance-wise, due to the additional indirections everywhere, and
> even more important, because every entry single object is larger than
> necessary - it contains a key and a value fields, but a single field
> would be necessary for HashSet. If you have a Set with millions of
> entries, the overhead of one useless pointer per entry is significant.
>  

When I have millions of entries I usually use an implementation with
open addressing and double hashing. This eliminates all the Entry
objects and thus saves a LOT more space. It is tempting to implement
Map's in the same way, except for the need to present Map.Entry objects
in the interface. I haven't compared performance for a while, but when I
did there wasn't much difference in time (those indirections aren't
particularly expensive). Space use is reduced and for large maps this
has an indirect impact on speed.

Mark Thornton

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Martin Buchholz-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I believe it is imposssible to use double hashing
in a general-purpose Map implementation,
because we are basically stuck with the one hash
function provided by hashCode().  Right or wrong?

Martin

On Tue, Jun 10, 2008 at 12:48 AM, Mark Thornton > When I have millions
of entries I usually use an implementation with
> open addressing and double hashing. This eliminates all the Entry
> objects and thus saves a LOT more space. It is tempting to implement
> Map's in the same way, except for the need to present Map.Entry objects
> in the interface.
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Marcelo Fukushima :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

you can apply some function over the result of the hashCode, if the
hashCode's of colliding entries are not identical

now regarding the Map.Entry, theres a library (i forgot witch) that
builds the Set<Entry> lazily when its required, hence extra memory is
required only while the set is being used.

On 6/10/08, Martin Buchholz <martinrb@...> wrote:

> I believe it is imposssible to use double hashing
>  in a general-purpose Map implementation,
>  because we are basically stuck with the one hash
>  function provided by hashCode().  Right or wrong?
>
>  Martin
>
>  On Tue, Jun 10, 2008 at 12:48 AM, Mark Thornton > When I have millions
>
> of entries I usually use an implementation with
>  > open addressing and double hashing. This eliminates all the Entry
>  > objects and thus saves a LOT more space. It is tempting to implement
>  > Map's in the same way, except for the need to present Map.Entry objects
>  > in the interface.
>
> _______________________________________________
>  Concurrency-interest mailing list
>  Concurrency-interest@...
>  http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest
>


--
[]'s
Marcelo Takeshi Fukushima
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Dave Griffith :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The GNU Trove library is probably what you're thinking of.  I've used it successfully in several performance-critical circumstances.

http://trove4j.sourceforge.net/



On Tue, Jun 10, 2008 at 1:28 PM, Marcelo Fukushima <takeshi10@...> wrote:
you can apply some function over the result of the hashCode, if the
hashCode's of colliding entries are not identical

now regarding the Map.Entry, theres a library (i forgot witch) that
builds the Set<Entry> lazily when its required, hence extra memory is
required only while the set is being used.

On 6/10/08, Martin Buchholz <martinrb@...> wrote:
> I believe it is imposssible to use double hashing
>  in a general-purpose Map implementation,
>  because we are basically stuck with the one hash
>  function provided by hashCode().  Right or wrong?
>
>  Martin
>
>  On Tue, Jun 10, 2008 at 12:48 AM, Mark Thornton > When I have millions
>
> of entries I usually use an implementation with
>  > open addressing and double hashing. This eliminates all the Entry
>  > objects and thus saves a LOT more space. It is tempting to implement
>  > Map's in the same way, except for the need to present Map.Entry objects
>  > in the interface.
>
> _______________________________________________
>  Concurrency-interest mailing list
>  Concurrency-interest@...
>  http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest
>


--
[]'s
Marcelo Takeshi Fukushima
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest


_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Martin Buchholz-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Jun 10, 2008 at 10:28 AM, Marcelo Fukushima <takeshi10@...> wrote:
> you can apply some function over the result of the hashCode, if the
> hashCode's of colliding entries are not identical

Good point.

The two hash functions won't be completely independent,
but this may be good enough in practice.  We know the
low order bits are colliding, so perhaps just shift them
away to the right, and use the remaining ones?

Martin
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Mark Thornton :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Martin Buchholz wrote:
> I believe it is imposssible to use double hashing
> in a general-purpose Map implementation,
> because we are basically stuck with the one hash
> function provided by hashCode().  Right or wrong?
>
> Martin
>  
If the hash table has more than 65536 entries then the two hash values
won't be completely independent, but in practice it doesn't seem to matter.

Mark Thornton

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Osvaldo Pinali Doederlein :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Tue, Jun 10, 2008 at 12:48 AM, Mark Thornton > When I have millions
of entries I usually use an implementation with
  
open addressing and double hashing. This eliminates all the Entry
objects and thus saves a LOT more space. It is tempting to implement
Map's in the same way, except for the need to present Map.Entry objects
in the interface.
    
You don't always get to choose your collections. For example, when using most ORM tools, the relationships of persistent entities are often mapped to collections of List, Set or Map types (typically, wrappers over java.util's collections with extra stuff like lazy loading). There are many other examples out there, collections are pretty popular objects and there are tons of libraries that manipulate collections, often creating them without any mechanism to override the concrete types they use.

Not to mention that the Java runtime itself (i.e. the JCL) makes heavy use of basic types like collections. I'd bet some serious money that if I could make a significant, no-compromise optimization in a class like java.util.HashMap, this would provide important performance benefits in many places (reflection metadata, String internalization, etc.). HashSet is a more niche class, but it still seems to be used very frequently in the JCL: a quick grep for HashSet inside the JDK's src.zip reveals significant usage in serialization, swing, security, CORBA, reflection/classloading, JMX, and in java.util itself (EnumSet, ThreadPoolExecutor, ResourceBundle and several other classes). So, it's very likely that the suggested free-standing optimization of HashSet would benefit the JRE performance (not to mention JavaEE and other libs/APIs/frameworks), even before considering any application code.

Now, while we are playing with HashSet optimization, another idea is: we could have a method like get(), that looks for an object and if it's found, returns it. The Set interface only offers contains(), which returns a boolean - definitely a cleaner interface for the most common usage of Set, but then you can't use a Set for the VERY important use-case of canonicalization. Canonicalization would only require a Set, but when the canonic object is found, if want to return it, so the caller can use it in lieu of the original object. We are forced to use an explicit Map collection just because it offers the get() lookup method. A Set.get() method would be easily defined: public T get (T obj), checks if the set contains an object that's equal to obj, and returns that object (or null if not found). Now, we obviously cannot add new methods to the Set interface, but we could do that in concrete implementations like HashSet. Implementation of such method would be trivial (the existing contains() would me morphed into that new method - requiring only to change a return statement and the signature - and the new contains() would return "get(obj) != null").

P.S.: Sun doesn't need a HashSet that allows efficient canonicalization because they cheat - String.intern() is a native method, probably implemented with a more efficient collection in the native layer. :)

A+
Osvaldo
 
-----------------------------------------------------------------------
Osvaldo Pinali Doederlein                   Visionnaire Informática S/A
osvaldo@...                http://www.visionnaire.com.br
Arquiteto de Tecnologia                          +55 (41) 337-1000 #223

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Remi Forax :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Martin Buchholz a écrit :
> I believe it is imposssible to use double hashing
> in a general-purpose Map implementation,
> because we are basically stuck with the one hash
> function provided by hashCode().  Right or wrong?
>
> Martin
>  
Martin, take a look to implementation. of IdentityHashMap
It already uses a double hashing.

And like trove, Map.Entry(ies) are created when needed (at least in 1.6),
see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6312706
> On Tue, Jun 10, 2008 at 12:48 AM, Mark Thornton > When I have millions
> of entries I usually use an implementation with
>  
>> open addressing and double hashing. This eliminates all the Entry
>> objects and thus saves a LOT more space. It is tempting to implement
>> Map's in the same way, except for the need to present Map.Entry objects
>> in the interface.
>>    
Rémi
_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by Mark Thornton :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Rémi Forax wrote:

> Martin Buchholz a écrit :
>> I believe it is imposssible to use double hashing
>> in a general-purpose Map implementation,
>> because we are basically stuck with the one hash
>> function provided by hashCode().  Right or wrong?
>>
>> Martin
>>  
> Martin, take a look to implementation. of IdentityHashMap
> It already uses a double hashing.
>
> And like trove, Map.Entry(ies) are created when needed (at least in 1.6),
> see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6312706

Much as I like optimisation, reusing the Map.Entry object was always a
step too far for me.

Mark Thornton

_______________________________________________
Concurrency-interest mailing list
Concurrency-interest@...
http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: HashSet performance

by studdugie :: Rate this Message:

Reply to Author