Indexing distinct-values?

View: New views
13 Messages — Rating Filter:   Alert me  

Indexing distinct-values?

by Chris Carlin :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Is there a way to use indexes to speed up calculations of
distinct-values? I thought perhaps with string equality it would be
possible, but I haven't been able to find any way to increase performance.

All of my docs have a <category>foo</category>, and I need to display a
list of used categories. I was using the following query, but it's just
too slow:

distinct-values(collection('coll.dbxml')/doc/category/string())

There will be tens of thousands of documents, but perhaps twenty
categories if that helps.

~Chris


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Fabrice Desré - France Telecom DR&D/MAPS/AMS :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Chris Carlin wrote:

> Is there a way to use indexes to speed up calculations of
> distinct-values? I thought perhaps with string equality it would be
> possible, but I haven't been able to find any way to increase performance.
>
> All of my docs have a <category>foo</category>, and I need to display a
> list of used categories. I was using the following query, but it's just
> too slow:
>
> distinct-values(collection('coll.dbxml')/doc/category/string())
>
> There will be tens of thousands of documents, but perhaps twenty
> categories if that helps.

  Hi Chris,

I'm facing the very same issue here. Looking at the source code for the
distinct-values() function (in
dbxml-2.2.13/pathan/src/functions/FunctionDistinctValues.cpp) we learn
two things :
- It doesn't try to use indexes.
- It uses a vector to store the current list of already seen items. This
means that depending on the shape of the datas, algorithmic complexity
is at best O(N), and at worst O(N^2). It would be much more efficient
here to use a hash map.

I'd like to help patching this (in principle this doesn't seem to be too
complicated), but I'm not familiar with pathan internals - If anyone can
help start on this, this would be great !

        Fabrice
--
Fabrice Desré
France Télécom R&D/MAPS/AMS
Tél: +(33) (0)2 96 05 31 43
Fax: +(33) (0)2 96 05 39 45
http://www.francetelecom.com/rd


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by volkris-2 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

I'm currently thinking about workarounds, seeing if I could write an
XQuery function that would use other functions that themselves would
hopefully use indexes. Reimplementing distinct-values() sounds harsh, but
perhaps the reimplementation would hit indexes.

I'm also thinking about ways to use indexLookup instead, but it seems PHP
can't really use it because of limitations on XmlValue.

Anyway, I haven't had time to actually type anything in and see if it
runs. It's at least nice to know that distinct-values() doesn't use
indexes so that it COULD be faster someday.

As I get into actually using dbxml I'm completely amazed at the amount of
work and good things that have gone into it. I hate to sound like I'm
nitpicking at this stuff. I certainly appreciate the task that this was.

~Chris

>   Hi Chris,
>
> I'm facing the very same issue here. Looking at the source code for the
> distinct-values() function (in
> dbxml-2.2.13/pathan/src/functions/FunctionDistinctValues.cpp) we learn
> two things :
> - It doesn't try to use indexes.
> - It uses a vector to store the current list of already seen items. This
> means that depending on the shape of the datas, algorithmic complexity
> is at best O(N), and at worst O(N^2). It would be much more efficient
> here to use a hash map.
>
> I'd like to help patching this (in principle this doesn't seem to be too
> complicated), but I'm not familiar with pathan internals - If anyone can
> help start on this, this would be great !
>
> Fabrice
> --
> Fabrice Desré
> France Télécom R&D/MAPS/AMS
> Tél: +(33) (0)2 96 05 31 43
> Fax: +(33) (0)2 96 05 39 45
> http://www.francetelecom.com/rd
>
>
> ------------------------------------------
> To remove yourself from this list, send an
> email to xml-unsubscribe@...
>
>




------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Pavel Smerk-2 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote:
> I'm currently thinking about workarounds, seeing if I could write an
> XQuery function that would use other functions that themselves would
> hopefully use indexes. Reimplementing distinct-values() sounds harsh, but
> perhaps the reimplementation would hit indexes.
>
> I'm also thinking about ways to use indexLookup instead, but it seems PHP
> can't really use it because of limitations on XmlValue.

I'm afraid indexlookup wouldn't be much efficient. I'm not sure, whether
it's a common feature, or whether it's on account of Ruby interface to
BDBXML, but indexlookup returns whole documents to me, not only the indexed
values. Thus, if each document has an category element, the list returned by
indexlookup would contain all documents of the container.

May be if you use DBXML_INDEX_NODES when creating container --- but it may
affect some other parts of your code or decrease performance in some cases.
http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html
http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html

But what about some small container with values of category element? It
would be simple, fast, although redundant a bit.


I have also one related question --- if one want to use indexlookup to solve
this kind of problem, how could s/he retrieve only indexed vaules? Not
nodes, or even whole docs, but only the values.

With regards,

P.


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by George Feinberg-3 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Pavel,


> On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote:
>> I'm currently thinking about workarounds, seeing if I could write an
>> XQuery function that would use other functions that themselves would
>> hopefully use indexes. Reimplementing distinct-values() sounds  
>> harsh, but
>> perhaps the reimplementation would hit indexes.
>>
>> I'm also thinking about ways to use indexLookup instead, but it  
>> seems PHP
>> can't really use it because of limitations on XmlValue.
>
> I'm afraid indexlookup wouldn't be much efficient. I'm not sure,  
> whether
> it's a common feature, or whether it's on account of Ruby interface to
> BDBXML, but indexlookup returns whole documents to me, not only the  
> indexed
> values. Thus, if each document has an category element, the list  
> returned by
> indexlookup would contain all documents of the container.
>
> May be if you use DBXML_INDEX_NODES when creating container --- but  
> it may
> affect some other parts of your code or decrease performance in  
> some cases.
> http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html
> http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html

DBXML_INDEX_NODES will result in returning nodes, not documents.
The extra cost of the indexes is not all that large, depending
on the nature of your documents.  The cost is more space than
time - the indexing itself is similar in cost between the two
alternatives.

>
> But what about some small container with values of category  
> element? It
> would be simple, fast, although redundant a bit.
>
>
> I have also one related question --- if one want to use indexlookup  
> to solve
> this kind of problem, how could s/he retrieve only indexed vaules? Not
> nodes, or even whole docs, but only the values.

That is not possible in the current API, but it is a feature addition
that is on the list for the future.

Regards,

George

>
> With regards,
>
> P.
>
>
> ------------------------------------------
> To remove yourself from this list, send an
> email to xml-unsubscribe@...
>



------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Pavel Smerk-2 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

On Sat, Jul 01, 2006 at 10:07:52AM -0400, George Feinberg wrote:

> >On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote:
> >>I'm currently thinking about workarounds, seeing if I could write an
> >>XQuery function that would use other functions that themselves would
> >>hopefully use indexes. Reimplementing distinct-values() sounds harsh,
> >>but perhaps the reimplementation would hit indexes.
> >>
> >>I'm also thinking about ways to use indexLookup instead, but it seems
> >>PHP can't really use it because of limitations on XmlValue.
> >
> >I'm afraid indexlookup wouldn't be much efficient. I'm not sure, whether
> >it's a common feature, or whether it's on account of Ruby interface to
> >BDBXML, but indexlookup returns whole documents to me, not only the
> >indexed values. Thus, if each document has an category element, the list
> >returned by indexlookup would contain all documents of the container.
> >
> >May be if you use DBXML_INDEX_NODES when creating container --- but it
> >may affect some other parts of your code or decrease performance in some
> >cases. http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html
> >http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html
>
> DBXML_INDEX_NODES will result in returning nodes, not documents. The extra

I know: that was why I wrote it could affect some other parts of Chris'
code, for example, I have some pieces of code which rely on the default
state --- returning documents, thus if I would change indexing to
non-default, I would have not to forget to rewrite these pieces.

> The extra cost of the indexes is not all that large, depending on the
> nature of your documents.  The cost is more space than time - the indexing
> itself is similar in cost between the two alternatives.

I'm afraid, this is not that case. Chris would need to return whole index,
if I understood his problem well. Without DBXML_INDEX_NODES (and assuming
that each document has a category element) it would result in returning
whole container, which could be very large. With INDEX_NODES he would get
only probably much smaller list of nodes, which would be faster.

                                                        Regards, P.


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Chris Carlin :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

> I'm afraid, this is not that case. Chris would need to return whole index,
> if I understood his problem well. Without DBXML_INDEX_NODES (and assuming
> that each document has a category element) it would result in returning
> whole container, which could be very large. With INDEX_NODES he would get
> only probably much smaller list of nodes, which would be faster.

As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and
doing index queries. These cut the query time in half, but it was still
too long, upwards of ten seconds for around two thousand small documents.

 From the list of nodes I cut out the text and used PHP functions to
find the distinct values. I could have cut some time off by looking up
the number of distinct indexes and ending the search when that number is
reached, but that wouldn't help enough with this dataset.

I'll have to keep a separate list of values in a different document,
which is disappointing. dbxml can immediately indicate how many distinct
key values there are which suggests to me that it should know internally
what these distinct values are. Not that I know anything of its internals...

Thanks for the suggestions, though.

~Chris


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by George Feinberg-3 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Chris,

>> I'm afraid, this is not that case. Chris would need to return  
>> whole index,
>> if I understood his problem well. Without DBXML_INDEX_NODES (and  
>> assuming
>> that each document has a category element) it would result in  
>> returning
>> whole container, which could be very large. With INDEX_NODES he  
>> would get
>> only probably much smaller list of nodes, which would be faster.
>
> As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and
> doing index queries. These cut the query time in half, but it was  
> still
> too long, upwards of ten seconds for around two thousand small  
> documents.

What do you mean by "index queries?"
Are you using XmlIndexLookup?
That would be your best alternative at this point.  You mentioned
an issue with XmlValue in a previous email, but didn't fully explain.

>
> From the list of nodes I cut out the text and used PHP functions to
> find the distinct values. I could have cut some time off by looking  
> up the number of distinct indexes and ending the search when that  
> number is reached, but that wouldn't help enough with this dataset.
>
> I'll have to keep a separate list of values in a different  
> document, which is disappointing. dbxml can immediately indicate  
> how many distinct key values there are which suggests to me that it  
> should know internally what these distinct values are. Not that I  
> know anything of its internals...

The number of distinct key values is maintained as a statistic for
optimization purposes; however, the information is
not currently available in a form that helps the actual evaluation.

Regards,

George



------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Chris Carlin :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

>> As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and
>> doing index queries. These cut the query time in half, but it was still
>> too long, upwards of ten seconds for around two thousand small documents.
>
> What do you mean by "index queries?"
> Are you using XmlIndexLookup?
> That would be your best alternative at this point.  You mentioned
> an issue with XmlValue in a previous email, but didn't fully explain.

Well I mean container->lookupIndex since XmlIndexLookup isn't in the PHP
API as far as I can tell.

The issue with XmlValue is that in PHP it seems to work only for
XmlDocuments. I can't run $xv = new XmlValue('string'); for example.

Later I realized that the XmlValue is optional for use with
XmlIndexLookup, so I just didn't use that parameter. Like I said, it
worked to go through all of the indexed values, but it was still too
slow to iterate through the list of values, seeing if each is in my
known value list.

~Chris


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by George Feinberg-3 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Chris,

>>> As a quick follow up, I tried Pavel's suggestion using  
>>> INDEX_NODES and
>>> doing index queries. These cut the query time in half, but it was  
>>> still
>>> too long, upwards of ten seconds for around two thousand small  
>>> documents.
>> What do you mean by "index queries?"
>> Are you using XmlIndexLookup?
>> That would be your best alternative at this point.  You mentioned
>> an issue with XmlValue in a previous email, but didn't fully explain.
>
> Well I mean container->lookupIndex since XmlIndexLookup isn't in  
> the PHP API as far as I can tell.

XmlIndexLookup is there.

>
> The issue with XmlValue is that in PHP it seems to work only for  
> XmlDocuments. I can't run $xv = new XmlValue('string'); for example.

This should work.  How does that fail for you?

>
> Later I realized that the XmlValue is optional for use with  
> XmlIndexLookup, so I just didn't use that parameter. Like I said,  
> it worked to go through all of the indexed values, but it was still  
> too slow to iterate through the list of values, seeing if each is  
> in my known value list.

If you can specify the range you want using XmlValue, above, you can  
restrict
the return values to just what you want using XmlIndexLookup.

George



------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Chris Carlin :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

> XmlIndexLookup is there.

Oh, I see it now. I had tried to instantiate the class directly for some
reason, and PHP said there was no such class as XmlIndexLookup.

>> The issue with XmlValue is that in PHP it seems to work only for
>> XmlDocuments. I can't run $xv = new XmlValue('string'); for example.
>
> This should work.  How does that fail for you?

Again, my mistake. PHP gave the following (errant?) warning, and I
didn't notice that it was only a warning.

Warning: xmlvalue::xmlvalue() expects parameter 1 to be xmldocument,
string given in php shell code on line 1

>> Later I realized that the XmlValue is optional for use with
>> XmlIndexLookup, so I just didn't use that parameter. Like I said, it
>> worked to go through all of the indexed values, but it was still too
>> slow to iterate through the list of values, seeing if each is in my
>> known value list.
>
> If you can specify the range you want using XmlValue, above, you can
> restrict
> the return values to just what you want using XmlIndexLookup.

I don't believe this would help as I don't know values ahead of time. Or
perhaps I don't understand what you're saying...

~Chris


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by George Feinberg-3 :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

Chris,

>> XmlIndexLookup is there.
>
> Oh, I see it now. I had tried to instantiate the class directly for  
> some reason, and PHP said there was no such class as XmlIndexLookup.
>
>>> The issue with XmlValue is that in PHP it seems to work only for  
>>> XmlDocuments. I can't run $xv = new XmlValue('string'); for example.
>> This should work.  How does that fail for you?
>
> Again, my mistake. PHP gave the following (errant?) warning, and I  
> didn't notice that it was only a warning.
>
> Warning: xmlvalue::xmlvalue() expects parameter 1 to be  
> xmldocument, string given in php shell code on line 1
>
>>> Later I realized that the XmlValue is optional for use with  
>>> XmlIndexLookup, so I just didn't use that parameter. Like I said,  
>>> it worked to go through all of the indexed values, but it was  
>>> still too slow to iterate through the list of values, seeing if  
>>> each is in my known value list.
>> If you can specify the range you want using XmlValue, above, you  
>> can restrict
>> the return values to just what you want using XmlIndexLookup.
>
> I don't believe this would help as I don't know values ahead of  
> time. Or perhaps I don't understand what you're saying...

I think you're right that it won't help, but for closure...
This thread has gotten sufficiently long that we need to pop back to the
original email.  You wrote this:

         All of my docs have a <category>foo</category>, and I need to  
display a list of used categories.
        I was using the following query, but it's just too slow:
                distinct-values(collection('coll.dbxml')/doc/category/string())

XmlIndexLookup can quickly get you all index entries for an index
on "category," or an exact match or range of values,
but it won't give you distinct-values(), unfortunately.
You'd still need to walk the result set, determining the unique  
values yourself.

It may be a reasonable enhancement to the XmlIndexLookup
interface to give you a cursor that runs over unique values in a  
result set
rather than the entire thing.  I'll think about that.

The index statistics that tell you that there are (e.g.) 20 unique  
values
do not actually contain the values, just the number, so it can't be used
for this operation.

Regards,

George






------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...


Re: Indexing distinct-values?

by Chris Carlin :: Rate this Message:

Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message

> It may be a reasonable enhancement to the XmlIndexLookup
> interface to give you a cursor that runs over unique values in a result set
> rather than the entire thing.  I'll think about that.
>
> The index statistics that tell you that there are (e.g.) 20 unique values
> do not actually contain the values, just the number, so it can't be used
> for this operation.

Well to REALLY get back to the original issue, if the XQuery
distinct-values function was sufficiently optimized using indexes, at
least in simple cases, then perhaps no new XmlIndexLookup interface
would be called for at all. I figure you guys are going to be doing this
at some point anyway...

It seems to me that this is a pretty common need where web interfaces
are concerned.

~Chris


------------------------------------------
To remove yourself from this list, send an
email to xml-unsubscribe@...

LightInTheBox - Buy quality products at wholesale price!