Explanation and solutions of some Jackrabbit queries regarding performance

Hello Martin Zdila, regarding JCR-1196 et al.,

From time to time I see mails about query performance and about slow calls such as queryResult.getNodes().hasNext(). Some queries can be slow, some data modelling structures might be slow, and even seemingly trivial calls such as queryResult.getNodes().hasNext() might be slow. I write 'might' all the time, because everything can and must be blisteringly fast with millions of documents, and most of the time the solution is extremely simple. We just have to document the pitfalls of some easily made mistakes. I'll try to find some time in the near future to document the parts I am aware of in the form of a FAQ, like the rest of this mail. For now, here are some frequently made mistakes off the top of my head.

@Martin Zdila: if you are not interested in reading the rest of this mail, just add <param name="respectDocumentOrder" value="false"/> to the <SearchIndex> element of your workspace.xml (and repository.xml). Also try to avoid having 4000 child nodes (certainly same-name siblings) under one node; create a deeper tree in which no node has many child nodes. Just as on your filesystem, a 'folder' with that many entries is not fast.

Question 1: Why is a search for the xpath '/jcr:root/a/b/c' slower than '//c' or '//*[@someprop]'?

Answer 1: When a query with a path such as '/jcr:root/a/b/c' or '/jcr:root/a//*/c' is executed, the hierarchy manager has to check, for every node found, whether its parents are correct. Since Jackrabbit does not store hierarchical data (if it did, it could no longer move a node efficiently, at least in the current architecture), hierarchies have to be checked by iterating through the Lucene indexes to find the parent nodes of a result. This is CPU-intensive. Although the hierarchy has been cached properly since Jackrabbit 1.4, returning many results is still an expensive operation. The first execution of a query might be slow because the hierarchy cache needs to be built up.
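To illustrate why the parent check in Answer 1 costs something per result, here is a toy sketch in plain Java. It is not Jackrabbit code: the class, the ids and the lookup counter are all made up for illustration, and a HashMap stands in for the index. Each entry knows only its own name and its parent's id, so verifying a path means walking up the parent chain for every hit, and a cache only helps once it has been filled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT Jackrabbit code. All names here are hypothetical.
final class HierarchyCheckSketch {

    /** An index entry knows its own name and its parent's id, but not its path. */
    private static final class Entry {
        final String name;
        final String parentId; // null for the root

        Entry(String name, String parentId) {
            this.name = name;
            this.parentId = parentId;
        }
    }

    private final Map<String, Entry> index = new HashMap<>();
    private final Map<String, String> pathCache = new HashMap<>();
    private int indexLookups = 0;

    void add(String id, String name, String parentId) {
        index.put(id, new Entry(name, parentId));
    }

    /** Resolves a path by walking parent ids up to the root, caching every step. */
    String pathOf(String id) {
        String cached = pathCache.get(id);
        if (cached != null) {
            return cached;
        }
        indexLookups++; // one simulated index lookup per uncached node
        Entry entry = index.get(id);
        String path = entry.parentId == null
                ? ""
                : pathOf(entry.parentId) + "/" + entry.name;
        pathCache.put(id, path);
        return path;
    }

    int indexLookups() {
        return indexLookups;
    }

    public static void main(String[] args) {
        HierarchyCheckSketch repo = new HierarchyCheckSketch();
        repo.add("root", "", null);
        repo.add("a", "a", "root");
        repo.add("b", "b", "a");
        repo.add("c1", "c", "b");
        repo.add("x", "x", "root");
        repo.add("c2", "c", "x");

        // '//c' style: every node named c is a hit, no parent checks needed.
        List<String> hits = List.of("c1", "c2");

        // '/a/b/c' style: each hit's parent chain must be resolved and compared.
        for (String hit : hits) {
            boolean match = repo.pathOf(hit).equals("/a/b/c");
            System.out.println(hit + " matches /a/b/c: " + match);
        }
        System.out.println("index lookups: " + repo.indexLookups());
    }
}
```

In this toy model the first hit pays for the whole chain up to the root, while later hits reuse whatever ancestors are already cached, which mirrors the remark that the first execution of a query is the slow one.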
Queries like '//c' or '//*[@someprop]' do not need to check hierarchies, because their results do not have to be validated against their parent nodes.

Conclusion 1: When the result set of the search is expected to be large, try to avoid path information in the xpath. Try to distinguish nodes by, for example, node type or some property.

Question 2: My xpath was '//c' and the result size is 10,000 nodes. When I call queryResult.getNodes().hasNext(), it takes up to minutes to complete.

Answer 2: For Jackrabbit versions < 1.5, the default setting in the <SearchIndex> configuration in repository.xml is <param name="respectDocumentOrder" value="true"/>. This means that when a query does *not* have an 'order by' clause, result nodes are returned in document order. Returning nodes in document order becomes increasingly slow for many results (> 1000). You can fix this either by setting respectDocumentOrder to false in your repository.xml (and in workspace.xml if you already have an existing workspace) *or* by adding an 'order by' clause to your query. A delay of minutes will drop to 0-15 ms.

Conclusion 2: When you have a lot of results, either include an 'order by' clause or set respectDocumentOrder to false. Modelling your content with many child nodes below one single node makes the problem even worse when you have respectDocumentOrder = true and do not define an 'order by' clause.

Question 3: My xpath is '//*[jcr:like(@propertyName, '%somevalue%')]' and it takes minutes to complete.

Answer 3: A jcr:like with % is translated to a Lucene WildcardQuery. To prevent extremely slow WildcardQueries, a wildcard term should not start with one of the wildcards * or ?. So this is not a Jackrabbit implementation detail, but a general Lucene issue (and, I think, an issue of inverted indexes in general) [1].

Conclusion 3: Avoid % prefixes in jcr:like. Use jcr:contains when searching for a specific word.
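To make Conclusions 2 and 3 a bit more concrete, here is a minimal sketch in plain Java that only builds XPath query strings. It does not touch the JCR API; the class name, the method names and the property name 'title' are assumptions made up for this example, and values are assumed to contain no quote characters.

```java
// Hypothetical helper: builds query strings only, no repository access.
final class QueryStringSketch {

    private QueryStringSketch() {
    }

    /** True if a jcr:like pattern starts with one of its wildcards, '%' or '_'. */
    static boolean hasLeadingWildcard(String likePattern) {
        return likePattern.startsWith("%") || likePattern.startsWith("_");
    }

    /** The slow shape from Question 3 when the pattern starts with a wildcard. */
    static String likeQuery(String property, String pattern) {
        return "//*[jcr:like(@" + property + ", '" + pattern + "')]";
    }

    /** The alternative from Conclusion 3: search for a whole word. */
    static String containsQuery(String property, String word) {
        return "//*[jcr:contains(@" + property + ", '" + word + "')]";
    }

    /** The alternative from Conclusion 2: add an explicit 'order by' clause. */
    static String orderBy(String xpath, String property) {
        return xpath + " order by @" + property;
    }

    public static void main(String[] args) {
        String pattern = "%somevalue%";
        if (hasLeadingWildcard(pattern)) {
            System.out.println("avoid:  " + likeQuery("title", pattern));
            System.out.println("prefer: " + containsQuery("title", "somevalue"));
        }
        System.out.println(orderBy("//c", "title"));
    }
}
```

Note that jcr:contains matches whole words as produced by the analyzer, so it is not a drop-in replacement when you really need substring matching.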
If jcr:contains is not suitable, you can work around the problem by creating a custom Lucene analyzer for the specific property (see IndexingConfiguration [2], under Index Analyzers).

Question 4: I am not searching through nodes but traversing them, and this is slow.

Answer 4: Model your repository so that no node has very many child nodes directly below it. Try to structure your repository so that it has no extremely 'large folders', comparable to how your filesystem would become slow.

This mail is getting too long :-) I'll come up with some extra FAQs from time to time, and if people are interested I will make a (wiki?) document for it. I might need some help, though, because in some areas my knowledge might be insufficient.

To be continued,

Regards Ard

[1] http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/WildcardQuery.html
[2] http://wiki.apache.org/jackrabbit/IndexingConfiguration

--
Hippo
Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel +31 (0)20 5224466
-------------------------------------------------------------
a.schrijvers@... / ard@... / http://www.hippo.nl
--------------------------------------------------------------
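As a footnote to Answer 4: one possible way to avoid 'large folders' is to spread nodes over intermediate bucket nodes derived from a hash of the node name. The sketch below is plain Java and only computes the target path; the class name, the two-level layout and the 'b' prefix are assumptions chosen for this example, not something Jackrabbit prescribes.

```java
// Hypothetical helper: computes a bucketed path, no repository access.
final class BucketPathSketch {

    private BucketPathSketch() {
    }

    /**
     * Returns a path such as /docs/b0c/b21/ab. Two levels of up to 256 buckets
     * each are derived from the name's hash code, so no single node collects
     * all children. The 'b' prefix keeps bucket names from starting with a digit.
     */
    static String bucketedPath(String root, String name) {
        int hash = name.hashCode();
        int level1 = (hash >>> 8) & 0xFF; // second-lowest byte of the hash
        int level2 = hash & 0xFF;         // lowest byte of the hash
        return String.format("%s/b%02x/b%02x/%s", root, level1, level2, name);
    }

    public static void main(String[] args) {
        System.out.println(bucketedPath("/docs", "a"));
        System.out.println(bucketedPath("/docs", "ab"));
    }
}
```

With 4000 nodes spread over 256 x 256 buckets, each bucket holds only a handful of children. The trade-off is that the path now depends on the name, so a date-based or otherwise meaningful hierarchy may be the better choice when your content has one.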
Re: Explanation and solutions of some Jackrabbit queries regarding performance

+1 for putting this in the wiki. It's the best explanation I have read so far of how to optimize some queries on Jackrabbit and why some behave unexpectedly. That //foo is faster than /bar/baz/foo was one of them.

Thanks!
Alessandro

On Jan 22, 2008 4:17 PM, Ard Schrijvers <a.schrijvers@...> wrote:
> [...]
Re: Explanation and solutions of some Jackrabbit queries regarding performance

Many thanks to Ard Schrijvers for the explanation!
+1 for putting this in the FAQ :-)

--
Martin Zdila
CTO
M-Way Solutions Slovakia s.r.o.
Letna 27, 040 01 Kosice
Slovakia
tel:+421-908-363-848
mailto:m.zdila@...
http://www.mwaysolutions.com
xmpp:zdila@... (Jabber)
skype:m.zdila
Re: Explanation and solutions of some Jackrabbit queries regarding performance

Hi Ard,

excellent work. This should definitely be placed on a query FAQ wiki page.

regards
marcel

Ard Schrijvers wrote:
> [...]
RE: Explanation and solutions of some Jackrabbit queries regarding performance

Hello Marcel,

> Hi Ard,
>
> excellent work. this should definitively be placed on a query
> faq wiki page.

I'll try to find time at short notice (most likely this weekend) to create the query FAQ wiki page. I always feel it is much easier to write an email: on a wiki it *must*, or at least should, be correct, while a mail can easily be corrected. Since my knowledge of some parts is not sufficient, I would like some feedback on the things I write, but I suppose/hope people will get back with corrections/suggestions/enhancements. An organically growing wiki FAQ document about queries might be beneficial to a lot of users.

-Ard