|
View:
New views
13 Messages
—
Rating Filter:
Alert me
|
|
|
Indexing distinct-values?
by Chris Carlin
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Is there a way to use indexes to speed up calculations of
distinct-values? I thought perhaps with string equality it would be possible, but I haven't been able to find any way to increase performance. All of my docs have a <category>foo</category>, and I need to display a list of used categories. I was using the following query, but it's just too slow: distinct-values(collection('coll.dbxml')/doc/category/string()) There will be tens of thousands of documents, but perhaps twenty categories if that helps. ~Chris ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Fabrice Desré - France Telecom DR&D/MAPS/AMS
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Chris Carlin wrote:
> Is there a way to use indexes to speed up calculations of > distinct-values? I thought perhaps with string equality it would be > possible, but I haven't been able to find any way to increase performance. > > All of my docs have a <category>foo</category>, and I need to display a > list of used categories. I was using the following query, but it's just > too slow: > > distinct-values(collection('coll.dbxml')/doc/category/string()) > > There will be tens of thousands of documents, but perhaps twenty > categories if that helps. Hi Chris, I'm facing the very same issue here. Looking at the source code for the distinct-values() function (in dbxml-2.2.13/pathan/src/functions/FunctionDistinctValues.cpp) we learn two things : - It doesn't try to use indexes. - It uses a vector to store the current list of already seen items. This means that depending on the shape of the datas, algorithmic complexity is at best O(N), and at worst O(N^2). It would be much more efficient here to use a hash map. I'd like to help patching this (in principle this doesn't seem to be too complicated), but I'm not familiar with pathan internals - If anyone can help start on this, this would be great ! Fabrice -- Fabrice Desré France Télécom R&D/MAPS/AMS Tél: +(33) (0)2 96 05 31 43 Fax: +(33) (0)2 96 05 39 45 http://www.francetelecom.com/rd ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by volkris-2
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message I'm currently thinking about workarounds, seeing if I could write an
XQuery function that would use other functions that themselves would hopefully use indexes. Reimplementing distinct-values() sounds harsh, but perhaps the reimplementation would hit indexes. I'm also thinking about ways to use indexLookup instead, but it seems PHP can't really use it because of limitations on XmlValue. Anyway, I haven't had time to actually type anything in and see if it runs. It's at least nice to know that distinct-values() doesn't use indexes so that it COULD be faster someday. As I get into actually using dbxml I'm completely amazed at the amount of work and good things that have gone into it. I hate to sound like I'm nitpicking at this stuff. I certainly appreciate the task that this was. ~Chris > Hi Chris, > > I'm facing the very same issue here. Looking at the source code for the > distinct-values() function (in > dbxml-2.2.13/pathan/src/functions/FunctionDistinctValues.cpp) we learn > two things : > - It doesn't try to use indexes. > - It uses a vector to store the current list of already seen items. This > means that depending on the shape of the datas, algorithmic complexity > is at best O(N), and at worst O(N^2). It would be much more efficient > here to use a hash map. > > I'd like to help patching this (in principle this doesn't seem to be too > complicated), but I'm not familiar with pathan internals - If anyone can > help start on this, this would be great ! > > Fabrice > -- > Fabrice Desré > France Télécom R&D/MAPS/AMS > Tél: +(33) (0)2 96 05 31 43 > Fax: +(33) (0)2 96 05 39 45 > http://www.francetelecom.com/rd > > > ------------------------------------------ > To remove yourself from this list, send an > email to xml-unsubscribe@... > > ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Pavel Smerk-2
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote:
> I'm currently thinking about workarounds, seeing if I could write an > XQuery function that would use other functions that themselves would > hopefully use indexes. Reimplementing distinct-values() sounds harsh, but > perhaps the reimplementation would hit indexes. > > I'm also thinking about ways to use indexLookup instead, but it seems PHP > can't really use it because of limitations on XmlValue. I'm afraid indexlookup wouldn't be much efficient. I'm not sure, whether it's a common feature, or whether it's on account of Ruby interface to BDBXML, but indexlookup returns whole documents to me, not only the indexed values. Thus, if each document has an category element, the list returned by indexlookup would contain all documents of the container. May be if you use DBXML_INDEX_NODES when creating container --- but it may affect some other parts of your code or decrease performance in some cases. http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html But what about some small container with values of category element? It would be simple, fast, although redundant a bit. I have also one related question --- if one want to use indexlookup to solve this kind of problem, how could s/he retrieve only indexed vaules? Not nodes, or even whole docs, but only the values. With regards, P. ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by George Feinberg-3
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Pavel,
> On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote: >> I'm currently thinking about workarounds, seeing if I could write an >> XQuery function that would use other functions that themselves would >> hopefully use indexes. Reimplementing distinct-values() sounds >> harsh, but >> perhaps the reimplementation would hit indexes. >> >> I'm also thinking about ways to use indexLookup instead, but it >> seems PHP >> can't really use it because of limitations on XmlValue. > > I'm afraid indexlookup wouldn't be much efficient. I'm not sure, > whether > it's a common feature, or whether it's on account of Ruby interface to > BDBXML, but indexlookup returns whole documents to me, not only the > indexed > values. Thus, if each document has an category element, the list > returned by > indexlookup would contain all documents of the container. > > May be if you use DBXML_INDEX_NODES when creating container --- but > it may > affect some other parts of your code or decrease performance in > some cases. > http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html > http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html DBXML_INDEX_NODES will result in returning nodes, not documents. The extra cost of the indexes is not all that large, depending on the nature of your documents. The cost is more space than time - the indexing itself is similar in cost between the two alternatives. > > But what about some small container with values of category > element? It > would be simple, fast, although redundant a bit. > > > I have also one related question --- if one want to use indexlookup > to solve > this kind of problem, how could s/he retrieve only indexed vaules? Not > nodes, or even whole docs, but only the values. That is not possible in the current API, but it is a feature addition that is on the list for the future. Regards, George > > With regards, > > P. > > > ------------------------------------------ > To remove yourself from this list, send an > email to xml-unsubscribe@... > ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Pavel Smerk-2
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message On Sat, Jul 01, 2006 at 10:07:52AM -0400, George Feinberg wrote:
> >On Fri, Jun 30, 2006 at 09:41:47AM -0500, volkris@... wrote: > >>I'm currently thinking about workarounds, seeing if I could write an > >>XQuery function that would use other functions that themselves would > >>hopefully use indexes. Reimplementing distinct-values() sounds harsh, > >>but perhaps the reimplementation would hit indexes. > >> > >>I'm also thinking about ways to use indexLookup instead, but it seems > >>PHP can't really use it because of limitations on XmlValue. > > > >I'm afraid indexlookup wouldn't be much efficient. I'm not sure, whether > >it's a common feature, or whether it's on account of Ruby interface to > >BDBXML, but indexlookup returns whole documents to me, not only the > >indexed values. Thus, if each document has an category element, the list > >returned by indexlookup would contain all documents of the container. > > > >May be if you use DBXML_INDEX_NODES when creating container --- but it > >may affect some other parts of your code or decrease performance in some > >cases. http://www.sleepycat.com/xmldocs/gsg_xml/cxx/containers.html > >http://www.sleepycat.com/xmldocs/gsg_xml/cxx/nodeindex.html > > DBXML_INDEX_NODES will result in returning nodes, not documents. The extra I know: that was why I wrote it could affect some other parts of Chris' code, for example, I have some pieces of code which rely on the default state --- returning documents, thus if I would change indexing to non-default, I would have not to forget to rewrite these pieces. > The extra cost of the indexes is not all that large, depending on the > nature of your documents. The cost is more space than time - the indexing > itself is similar in cost between the two alternatives. I'm afraid, this is not that case. Chris would need to return whole index, if I understood his problem well. Without DBXML_INDEX_NODES (and assuming that each document has a category element) it would result in returning whole container, which could be very large. With INDEX_NODES he would get only probably much smaller list of nodes, which would be faster. Regards, P. ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Chris Carlin
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message > I'm afraid, this is not that case. Chris would need to return whole index,
> if I understood his problem well. Without DBXML_INDEX_NODES (and assuming > that each document has a category element) it would result in returning > whole container, which could be very large. With INDEX_NODES he would get > only probably much smaller list of nodes, which would be faster. As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and doing index queries. These cut the query time in half, but it was still too long, upwards of ten seconds for around two thousand small documents. From the list of nodes I cut out the text and used PHP functions to find the distinct values. I could have cut some time off by looking up the number of distinct indexes and ending the search when that number is reached, but that wouldn't help enough with this dataset. I'll have to keep a separate list of values in a different document, which is disappointing. dbxml can immediately indicate how many distinct key values there are which suggests to me that it should know internally what these distinct values are. Not that I know anything of its internals... Thanks for the suggestions, though. ~Chris ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by George Feinberg-3
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Chris,
>> I'm afraid, this is not that case. Chris would need to return >> whole index, >> if I understood his problem well. Without DBXML_INDEX_NODES (and >> assuming >> that each document has a category element) it would result in >> returning >> whole container, which could be very large. With INDEX_NODES he >> would get >> only probably much smaller list of nodes, which would be faster. > > As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and > doing index queries. These cut the query time in half, but it was > still > too long, upwards of ten seconds for around two thousand small > documents. What do you mean by "index queries?" Are you using XmlIndexLookup? That would be your best alternative at this point. You mentioned an issue with XmlValue in a previous email, but didn't fully explain. > > From the list of nodes I cut out the text and used PHP functions to > find the distinct values. I could have cut some time off by looking > up the number of distinct indexes and ending the search when that > number is reached, but that wouldn't help enough with this dataset. > > I'll have to keep a separate list of values in a different > document, which is disappointing. dbxml can immediately indicate > how many distinct key values there are which suggests to me that it > should know internally what these distinct values are. Not that I > know anything of its internals... The number of distinct key values is maintained as a statistic for optimization purposes; however, the information is not currently available in a form that helps the actual evaluation. Regards, George ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Chris Carlin
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message >> As a quick follow up, I tried Pavel's suggestion using INDEX_NODES and
>> doing index queries. These cut the query time in half, but it was still >> too long, upwards of ten seconds for around two thousand small documents. > > What do you mean by "index queries?" > Are you using XmlIndexLookup? > That would be your best alternative at this point. You mentioned > an issue with XmlValue in a previous email, but didn't fully explain. Well I mean container->lookupIndex since XmlIndexLookup isn't in the PHP API as far as I can tell. The issue with XmlValue is that in PHP it seems to work only for XmlDocuments. I can't run $xv = new XmlValue('string'); for example. Later I realized that the XmlValue is optional for use with XmlIndexLookup, so I just didn't use that parameter. Like I said, it worked to go through all of the indexed values, but it was still too slow to iterate through the list of values, seeing if each is in my known value list. ~Chris ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by George Feinberg-3
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Chris,
>>> As a quick follow up, I tried Pavel's suggestion using >>> INDEX_NODES and >>> doing index queries. These cut the query time in half, but it was >>> still >>> too long, upwards of ten seconds for around two thousand small >>> documents. >> What do you mean by "index queries?" >> Are you using XmlIndexLookup? >> That would be your best alternative at this point. You mentioned >> an issue with XmlValue in a previous email, but didn't fully explain. > > Well I mean container->lookupIndex since XmlIndexLookup isn't in > the PHP API as far as I can tell. XmlIndexLookup is there. > > The issue with XmlValue is that in PHP it seems to work only for > XmlDocuments. I can't run $xv = new XmlValue('string'); for example. This should work. How does that fail for you? > > Later I realized that the XmlValue is optional for use with > XmlIndexLookup, so I just didn't use that parameter. Like I said, > it worked to go through all of the indexed values, but it was still > too slow to iterate through the list of values, seeing if each is > in my known value list. If you can specify the range you want using XmlValue, above, you can restrict the return values to just what you want using XmlIndexLookup. George ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Chris Carlin
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message > XmlIndexLookup is there.
Oh, I see it now. I had tried to instantiate the class directly for some reason, and PHP said there was no such class as XmlIndexLookup. >> The issue with XmlValue is that in PHP it seems to work only for >> XmlDocuments. I can't run $xv = new XmlValue('string'); for example. > > This should work. How does that fail for you? Again, my mistake. PHP gave the following (errant?) warning, and I didn't notice that it was only a warning. Warning: xmlvalue::xmlvalue() expects parameter 1 to be xmldocument, string given in php shell code on line 1 >> Later I realized that the XmlValue is optional for use with >> XmlIndexLookup, so I just didn't use that parameter. Like I said, it >> worked to go through all of the indexed values, but it was still too >> slow to iterate through the list of values, seeing if each is in my >> known value list. > > If you can specify the range you want using XmlValue, above, you can > restrict > the return values to just what you want using XmlIndexLookup. I don't believe this would help as I don't know values ahead of time. Or perhaps I don't understand what you're saying... ~Chris ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by George Feinberg-3
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message Chris,
>> XmlIndexLookup is there. > > Oh, I see it now. I had tried to instantiate the class directly for > some reason, and PHP said there was no such class as XmlIndexLookup. > >>> The issue with XmlValue is that in PHP it seems to work only for >>> XmlDocuments. I can't run $xv = new XmlValue('string'); for example. >> This should work. How does that fail for you? > > Again, my mistake. PHP gave the following (errant?) warning, and I > didn't notice that it was only a warning. > > Warning: xmlvalue::xmlvalue() expects parameter 1 to be > xmldocument, string given in php shell code on line 1 > >>> Later I realized that the XmlValue is optional for use with >>> XmlIndexLookup, so I just didn't use that parameter. Like I said, >>> it worked to go through all of the indexed values, but it was >>> still too slow to iterate through the list of values, seeing if >>> each is in my known value list. >> If you can specify the range you want using XmlValue, above, you >> can restrict >> the return values to just what you want using XmlIndexLookup. > > I don't believe this would help as I don't know values ahead of > time. Or perhaps I don't understand what you're saying... I think you're right that it won't help, but for closure... This thread has gotten sufficiently long that we need to pop back to the original email. You wrote this: All of my docs have a <category>foo</category>, and I need to display a list of used categories. I was using the following query, but it's just too slow: distinct-values(collection('coll.dbxml')/doc/category/string()) XmlIndexLookup can quickly get you all index entries for an index on "category," or an exact match or range of values, but it won't give you distinct-values(), unfortunately. You'd still need to walk the result set, determining the unique values yourself. It may be a reasonable enhancement to the XmlIndexLookup interface to give you a cursor that runs over unique values in a result set rather than the entire thing. I'll think about that. The index statistics that tell you that there are (e.g.) 20 unique values do not actually contain the values, just the number, so it can't be used for this operation. Regards, George ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
|
|
Re: Indexing distinct-values?
by Chris Carlin
::
Rate this Message:
Reply (Restricted by the Administrator) | Reply to Author | View Threaded | Show Only this Message > It may be a reasonable enhancement to the XmlIndexLookup
> interface to give you a cursor that runs over unique values in a result set > rather than the entire thing. I'll think about that. > > The index statistics that tell you that there are (e.g.) 20 unique values > do not actually contain the values, just the number, so it can't be used > for this operation. Well to REALLY get back to the original issue, if the XQuery distinct-values function was sufficiently optimized using indexes, at least in simple cases, then perhaps no new XmlIndexLookup interface would be called for at all. I figure you guys are going to be doing this at some point anyway... It seems to me that this is a pretty common need where web interfaces are concerned. ~Chris ------------------------------------------ To remove yourself from this list, send an email to xml-unsubscribe@... |
| Free Forum Powered by Nabble | Forum Help |