|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 | Next > |
|
|
Text search performance and postgresqlHi,
I'm new on this list, and I'm not sure I'm posting at the right place. I've got a question about text indexation. I have a postgresql mediawiki instance installed on a very small server, and have been working on tweaking it's performance these last few days. When all that could be done from the OS and database parameters was done, I decided to give a look to the SQL queries. The main problem I'm seeing is with the text searchs. I don't really know how it's handled with mysql, but with postgresql, I think there may be some optimization that could be done, but I'm not sure about it as I don't know all the code. The text search query is this one : EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text AS page_text, ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk FROM page p, revision r, pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id = c.old_id AND textvector @@ to_tsquery('default','postgres') AND page_is_redirect = '0' AND page_namespace IN (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20 OFFSET 0; The plan is this one : Limit (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1) -> Sort (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1) Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id -> Nested Loop (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1) -> Nested Loop (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1) -> Seq Scan on page p (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1) Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[]))) -> Index Scan using revision_rev_id_key on revision r (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210) Index Cond: (p.page_latest = r.rev_id) -> Index Scan using pagecontent_pkey on pagecontent c (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210) Index Cond: (r.rev_text_id = c.old_id) Filter: (textvector @@ '''postgr'''::tsquery) Total runtime: 306.118 ms This plan joins page and revision to determine all the latests revisions of pagecontents, then scans all matching pagecontent to determine which ones match my query. There is also an other plan, depending on the amount of ram available and the estimate of the number of 'latest pagecontents' Limit (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1) -> Sort (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1) Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id Sort Method: top-N heapsort Memory: 37kB -> Hash Join (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1) Hash Cond: (p.page_latest = r.rev_id) -> Seq Scan on page p (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1) Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[]))) -> Hash (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1) -> Hash Join (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1) Hash Cond: (r.rev_text_id = c.old_id) -> Seq Scan on revision r (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1) -> Hash (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1) -> Bitmap Heap Scan on pagecontent c (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1) Recheck Cond: (textvector @@ '''postgr'''::tsquery) -> Bitmap Index Scan on ts2_page_text2 (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1) Index Cond: (textvector @@ '''postgr'''::tsquery) Total runtime: 224.765 ms Times are different because this machine is much more powerful. This time, postgresql decides to get all articles from pagecontent, with all versions, and then determines which ones are latest. In both cases this is rather inefficient, as I guess we search only on the latest version of the articles. So I'm coming to the point ... Is there a reason we index every version of every content ? For instance, with my database, I've tested removing all textvectors from pagecontent except for the latest version of each page. My text index size went from 400Mb to 15Mb. And my text search times went down to a near constant 10ms for all queries. I can then maintain the textvectors by modifying the trigger on pagecontent a bit to cleanup the previous record while updating the table. If I'm posting in the wrong place, please tell me. If the idea is stupid, please tell me also :) Cheers Marc Cousin _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlOn Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...>
> EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text > AS page_text, ts_rank(titlevector, to_tsquery('default','postgres')) > AS rnk FROM page p, revision r, > pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id = > c.old_id AND textvector @@ to_tsquery('default','postgres') AND > page_is_redirect = '0' AND page_namespace IN > (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20 OFFSET 0; Hmm; a fundamental problem here is that text.old_text is opaque to the database -- it may contain compressed text, text in an alternate encoding, or a reference to another table or database at the application level. This is part of why we use a separate table for the search index for the MySQL search -- it's actually consistently readable to the db's search engine as well as providing a summary table so the index only has to cover live versions. -- brion > > > The plan is this one : > > Limit (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 > rows=20 loops=1) > -> Sort (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 > rows=20 loops=1) > Sort Key: rank(p.titlevector, '''postgr'''::tsquery), > p.page_id > -> Nested Loop (cost=0.00..718.48 rows=1 width=621) > (actual time=4.278..305.671 rows=44 loops=1) > -> Nested Loop (cost=0.00..695.00 rows=21 width=204) > (actual time=0.829..76.740 rows=3210 loops=1) > -> Seq Scan on page p (cost=0.00..524.95 > rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1) > Filter: (((page_is_redirect)::text = > '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[]))) > -> Index Scan using revision_rev_id_key on > revision r (cost=0.00..8.09 rows=1 width=8) (actual > time=0.012..0.013 rows=1 loops=3210) > Index Cond: (p.page_latest = r.rev_id) > -> Index Scan using pagecontent_pkey on pagecontent > c (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 > rows=0 loops=3210) > Index Cond: (r.rev_text_id = c.old_id) > Filter: (textvector @@ '''postgr'''::tsquery) > Total runtime: 306.118 ms > > This plan joins page and revision to determine all the latests > revisions of pagecontents, then scans all matching pagecontent to > determine which ones match my query. > > There is also an other plan, depending on the amount of ram > available and the estimate of the number of 'latest pagecontents' > > Limit (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 > rows=20 loops=1) > -> Sort (cost=2979.49..2979.50 rows=4 width=504) (actual > time=224.591..224.610 rows=20 loops=1) > Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), > p.page_id > Sort Method: top-N heapsort Memory: 37kB > -> Hash Join (cost=2689.31..2979.45 rows=4 width=504) > (actual time=211.141..224.432 rows=43 loops=1) > Hash Cond: (p.page_latest = r.rev_id) > -> Seq Scan on page p (cost=0.00..276.86 rows=3527 > width=82) (actual time=0.460..10.202 rows=3118 loops=1) > Filter: ((page_is_redirect = '0'::bpchar) AND > (page_namespace = ANY ('{0,9,11}'::integer[]))) > -> Hash (cost=2688.26..2688.26 rows=84 width=430) > (actual time=210.409..210.409 rows=1517 loops=1) > -> Hash Join (cost=534.76..2688.26 rows=84 > width=430) (actual time=26.557..207.725 rows=1517 loops=1) > Hash Cond: (r.rev_text_id = c.old_id) > -> Seq Scan on revision r > (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 > rows=84194 loops=1) > -> Hash (cost=533.59..533.59 rows=93 > width=430) (actual time=18.182..18.182 rows=1515 loops=1) > -> Bitmap Heap Scan on pagecontent > c (cost=190.83..533.59 rows=93 width=430) (actual > time=0.585..15.663 rows=1515 loops=1) > Recheck Cond: (textvector @@ > '''postgr'''::tsquery) > -> Bitmap Index Scan on > ts2_page_text2 (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 > rows=1515 loops=1) > Index Cond: (textvector > @@ '''postgr'''::tsquery) > Total runtime: 224.765 ms > > > Times are different because this machine is much more powerful. > > This time, postgresql decides to get all articles from pagecontent, > with all versions, and then determines which ones are latest. > > In both cases this is rather inefficient, as I guess we search only > on the latest version of the articles. > > > > So I'm coming to the point ... > > > Is there a reason we index every version of every content ? > > > For instance, with my database, I've tested removing all textvectors > from pagecontent except for the latest version of each page. My text > index size went from 400Mb to 15Mb. And > my text search times went down to a near constant 10ms for all > queries. I can then maintain the textvectors by modifying the > trigger on pagecontent a bit to cleanup the previous > record while updating the table. > > > If I'm posting in the wrong place, please tell me. If the idea is > stupid, please tell me also :) > > > Cheers > > Marc Cousin > > _______________________________________________ > MediaWiki-l mailing list > MediaWiki-l@... > https://lists.wikimedia.org/mailman/listinfo/mediawiki-l _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlOn Saturday 19 July 2008 15:44:51 Brion Vibber wrote:
> On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...> > > > EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text > > AS page_text, ts_rank(titlevector, to_tsquery('default','postgres')) > > AS rnk FROM page p, revision r, > > pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id = > > c.old_id AND textvector @@ to_tsquery('default','postgres') AND > > page_is_redirect = '0' AND page_namespace IN > > (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20 OFFSET 0; > > Hmm; a fundamental problem here is that text.old_text is opaque to the > database -- it may contain compressed text, text in an alternate > encoding, or a reference to another table or database at the > application level. > > This is part of why we use a separate table for the search index for > the MySQL search -- it's actually consistently readable to the db's > search engine as well as providing a summary table so the index only > has to cover live versions. > > -- brion Hi, Thanks for the answer... What do you mean by 'opaque' ? Btw the separate table for mysql solves the problem I'm having with postgresql (except that there is not the myisam problem, so the fulltext index could be stored directly in the 'page' table with postgresql). For a wiki like mine, with a lot of versions per document, text search performance becomes very bad. So can you explain me what you mean by opaque ? Thanks a lot for your time. Marc _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlBy 'opaque' he means that the databases are supposed to handle that
field transparently. They are supposed to send the data raw to the app and act as if they have absolutely no clue how to handle the data inside of that field. The reason being the data inside of that column is inconstant and letting the database modify or pretend it knows what is inside there can end up with real unexpected results. -- ~Daniel Friesen(Dantman, Nadir-Seen-Fire) of: -The Nadir-Point Group (http://nadir-point.com) --It's Wiki-Tools subgroup (http://wiki-tools.com) --The ElectronicMe project (http://electronic-me.org) --Games-G.P.S. (http://ggps.org) -And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) --Animepedia (http://anime.wikia.com) --Narutopedia (http://naruto.wikia.com) Marc Cousin wrote: > On Saturday 19 July 2008 15:44:51 Brion Vibber wrote: >> On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...> >> >>> EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text >>> AS page_text, ts_rank(titlevector, to_tsquery('default','postgres')) >>> AS rnk FROM page p, revision r, >>> pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id = >>> c.old_id AND textvector @@ to_tsquery('default','postgres') AND >>> page_is_redirect = '0' AND page_namespace IN >>> (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20 OFFSET 0; >> Hmm; a fundamental problem here is that text.old_text is opaque to the >> database -- it may contain compressed text, text in an alternate >> encoding, or a reference to another table or database at the >> application level. >> >> This is part of why we use a separate table for the search index for >> the MySQL search -- it's actually consistently readable to the db's >> search engine as well as providing a summary table so the index only >> has to cover live versions. >> >> -- brion > > Hi, > > Thanks for the answer... > > What do you mean by 'opaque' ? > > Btw the separate table for mysql solves the problem I'm having with postgresql > (except that there is not the myisam problem, so the fulltext index could be > stored directly in the 'page' table with postgresql). For a wiki like mine, > with a lot of versions per document, text search performance becomes very > bad. > > So can you explain me what you mean by opaque ? > > Thanks a lot for your time. > > Marc _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlOk, I now understand the meaning of opaque, even if that data isn't that
opaque, because the database and mediawiki are using it for full text indexing. I totally agree with you, that would be a really bad idea to try to mess with the content of the page (that wasn't what I was talking about). Currently, for postgresql there is a trigger updating the 'textvector' column in pagecontent table when 'old_text' is modified, so that the text search index is up to date. The problem is that the textvector should only exist for the latest version of the pagecontent. There is no point in indexing the old versions of the contents of the page. That's done for mysql because of the 1:1 relationship between page and searchindex, but isn't done for postgresql. For PostgreSQL, there are at least 2 solutions : - Put the textvector into the page table - Do the same as for mysql : create a searchindex table (even if the myisam justification for this table doesn't hold for mysql) Plus the temporary fix I was talking about in my mail : put an empty textvector for all versions of a document except its last one, by changing the trigger a bit. That divided the size of my fulltext index by 30 and made the text search fast again. What would be the preferred way to solve this problem ? On Monday 21 July 2008 19:47:09 Daniel Friesen wrote: > By 'opaque' he means that the databases are supposed to handle that > field transparently. They are supposed to send the data raw to the app > and act as if they have absolutely no clue how to handle the data inside > of that field. > > The reason being the data inside of that column is inconstant and > letting the database modify or pretend it knows what is inside there can > end up with real unexpected results. _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlMarc Cousin wrote:
> For PostgreSQL, there are at least 2 solutions : > - Put the textvector into the page table > - Do the same as for mysql : create a searchindex table (even if the myisam > justification for this table doesn't hold for mysql) > > Plus the temporary fix I was talking about in my mail : put an empty > textvector for all versions of a document except its last one, by changing > the trigger a bit. That divided the size of my fulltext index by 30 and made > the text search fast again. > > What would be the preferred way to solve this problem ? Probably to create a new table. The less you differ from the mysql setup, the better. _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlI totally agree with you.
Is there a way of getting it in a future mediawiki release (I can do the work if necessary) ? On Friday 25 July 2008 17:05:48 Platonides wrote: > Marc Cousin wrote: > > For PostgreSQL, there are at least 2 solutions : > > - Put the textvector into the page table > > - Do the same as for mysql : create a searchindex table (even if the > > myisam justification for this table doesn't hold for mysql) > > > > Plus the temporary fix I was talking about in my mail : put an empty > > textvector for all versions of a document except its last one, by > > changing the trigger a bit. That divided the size of my fulltext index by > > 30 and made the text search fast again. > > > > What would be the preferred way to solve this problem ? > > Probably to create a new table. The less you differ from the mysql > setup, the better. > > > _______________________________________________ > MediaWiki-l mailing list > MediaWiki-l@... > https://lists.wikimedia.org/mailman/listinfo/mediawiki-l _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlMarc Cousin wrote:
> I totally agree with you. > > Is there a way of getting it in a future mediawiki release (I can do the work > if necessary) ? > Sure. Ask brion for a SVN account (you'll need to send him a public key). The postgresql backend is not too well maintained, people wanting to postgre appear from time to time and give it a kick. I don't remember who is currently supposed to be taking care of it, but i interpret that as he hasn't commented on this, he doesn't oppose ;) _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlOn Monday 28 July 2008 13:34:41 Platonides wrote:
> Marc Cousin wrote: > > I totally agree with you. > > > > Is there a way of getting it in a future mediawiki release (I can do the > > work if necessary) ? > > Sure. Ask brion for a SVN account (you'll need to send him a public key). > The postgresql backend is not too well maintained, people wanting to > postgre appear from time to time and give it a kick. > I don't remember who is currently supposed to be taking care of it, but > i interpret that as he hasn't commented on this, he doesn't oppose ;) Okay. First I'd like to sum up what's to be done (please anybody tell me if I'm wrong, I'll let a few days pass before I start working). = Current state = MySQL's way of doing text indexation: - Articles updates are done on text (pagecontent for postgresql), and page (I of course omit the rest of the schema as it's the same as far as full text indexing is concerned) - Sometimes, there is an indexation work that locks searchindex and puts the new pages contents into it (retrieving the contents from the text table). New page contents is retrieved thanks to recentchanges. For PostgreSQL, right now : full text data is stored into the pagecontent (text) table. So there are many unnecessary versions of full text data. What I'd like to know is the reason for this asynchronous indexing in mysql : is it because of the myisam table for full text (to avoid locking during inserts), or for other performance reasons? If there is no other reason than locking, there is no point in having this asynchronous job with postgresql. = Proposal = - create the same searchindex table in postgresql as in mysql - modify the search queries to use this new table (reuse and adapt the mysql code) - maintain searchindex synchronously with a trigger. I would think the best place to put this trigger would be on the page table. The trigger should update searchindex when page_latest is put to a non zero value (I think the trigger will have everything it needs when this trigger fires). - create a migration script for the database : - remove unnecessary columns and indexes on text table, and its trigger. - remove titlevector from title and its trigger. - add pagecontent and put all page records in it, then put gin indexes (on title and on content) - put the trigger in place Comments on this, please ? Cheers Marc _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresql-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 Marc Cousin wrote: > MySQL's way of doing text indexation: > - Articles updates are done on text (pagecontent for postgresql), and page (I > of course omit the rest of the schema as it's the same as far as full text > indexing is concerned) > - Sometimes, there is an indexation work that locks searchindex and puts the > new pages contents into it (retrieving the contents from the text table). New > page contents is retrieved thanks to recentchanges. Well, no -- the search index is updated on page save, unless you go to a lot of (usually unnecessary) trouble to disable it and do bulk updates. Here's what happens normally: Article::editUpdates() puts a SearchUpdate object on the post-edit updates stack when the edit is complete. This gets run at the completion of the request, along with many other updates (links tables, view counts, etc). SearchUpdate::doUpdate() takes the text, does some freaky markup normalization, and sends it off to the search engine's SearchEngine::update() or SeachEngine::updateTitle() methods. The SearchEngine subclass for the relevant engine then updates its index for the given page. In the case of the MySQL backend, this creates or updates a row in the searchindex table. > What I'd like to know is the reason for this asynchronous indexing in mysql : > is it because of the myisam table for full text (to avoid locking during > inserts), or for other performance reasons? If you mean the non-default option to disable updates on edit, that would be because MyISAM updates cause locking trouble if your site is very busy, so you might prefer to do updates in bulk off-peak with the search temporarily disabled. However this is not the default. > - maintain searchindex synchronously with a trigger. I would think the best > place to put this trigger would be on the page table. The trigger should > update searchindex when page_latest is put to a non zero value (I think the > trigger will have everything it needs when this trigger fires). Can't do it -- the database has insufficient knowledge to interpret the contents of page text in the general case. * It doesn't know how to uncompress compressed revisions * It doesn't know how to access text in external storage blobs * It doesn't know the namespaces etc for proper markup normalization Since the database can't read page text as a general case, this has to be done from the application layer. - -- brion -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkiOEnIACgkQwRnhpk1wk47lnACg2SApw9M6bxbcwsS9IhzFC/3Z I1gAn3ZWFwnABt82JvNkJHQiax7RNRXW =ES9a -----END PGP SIGNATURE----- _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlOn Monday 28 July 2008 20:39:46 Brion Vibber wrote:
> Marc Cousin wrote: > > MySQL's way of doing text indexation: > > - Articles updates are done on text (pagecontent for postgresql), and > > page (I of course omit the rest of the schema as it's the same as far as > > full text indexing is concerned) > > - Sometimes, there is an indexation work that locks searchindex and puts > > the new pages contents into it (retrieving the contents from the text > > table). New page contents is retrieved thanks to recentchanges. > > Well, no -- the search index is updated on page save, unless you go to a > lot of (usually unnecessary) trouble to disable it and do bulk updates. > > > Here's what happens normally: > > Article::editUpdates() puts a SearchUpdate object on the post-edit > updates stack when the edit is complete. This gets run at the completion > of the request, along with many other updates (links tables, view > counts, etc). > > SearchUpdate::doUpdate() takes the text, does some freaky markup > normalization, and sends it off to the search engine's > SearchEngine::update() or SeachEngine::updateTitle() methods. > > The SearchEngine subclass for the relevant engine then updates its index > for the given page. In the case of the MySQL backend, this creates or > updates a row in the searchindex table. > > > What I'd like to know is the reason for this asynchronous indexing in > > mysql : is it because of the myisam table for full text (to avoid locking > > during inserts), or for other performance reasons? > > If you mean the non-default option to disable updates on edit, that > would be because MyISAM updates cause locking trouble if your site is > very busy, so you might prefer to do updates in bulk off-peak with the > search temporarily disabled. > > However this is not the default. > > > - maintain searchindex synchronously with a trigger. I would think the > > best place to put this trigger would be on the page table. The trigger > > should update searchindex when page_latest is put to a non zero value (I > > think the trigger will have everything it needs when this trigger fires). > > Can't do it -- the database has insufficient knowledge to interpret the > contents of page text in the general case. > > * It doesn't know how to uncompress compressed revisions > * It doesn't know how to access text in external storage blobs > * It doesn't know the namespaces etc for proper markup normalization > > Since the database can't read page text as a general case, this has to > be done from the application layer. Ok. It's not the case right now with the PostgreSQL code : the database maintains it's text index by itself via trigger. So doing the changes the mysql way may correct other problems at the same time ? _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
|
|
|
Re: Text search performance and postgresql-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1 Marc Cousin wrote: > On Monday 28 July 2008 20:39:46 Brion Vibber wrote: [snip] >> Since the database can't read page text as a general case, this has to >> be done from the application layer. > > Ok. It's not the case right now with the PostgreSQL code : the database > maintains it's text index by itself via trigger. ... which does not work as a general case as I've mentioned many times in this thread so far ... > So doing the changes the > mysql way may correct other problems at the same time ? There's nothing MySQL-specific about doing it in a way that works. :) - -- brion -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.8 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkiPkNQACgkQwRnhpk1wk44MZgCgxFSNKfuD8xUTpClSgZT4Ah0p 9XwAmgOkTwd/DEDK2FWDoO6FCWtt+fEg =7t1V -----END PGP SIGNATURE----- _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlGreg Sabino Mullane wrote:
> Wow, that's a bit harsh - not too well maintained? Hardly. :) Myself and > others have put in a lot of work to get it working and keep it so, > including paving the way for other database backends by discovering and > addressing lots of mysqlisms in the code. I apologize for not replying: > I somehow got recently unsubscribed. That's what you get if you don't reply at time. You get any noob like me answering anything about a secondary database backend :-) _______________________________________________ MediaWiki-l mailing list MediaWiki-l@... https://lists.wikimedia.org/mailman/listinfo/mediawiki-l |
|
|
Re: Text search performance and postgresqlThanks for your aswers.
Here's what I see from my (narrow) understanding of how mediawiki works. - I already tried to change the parameters you mention, but as they were correct for my (pretty old) machine, this brought me even worse results (this was expected). The bad plans come from the fact we have a lot of versions of page revisions indexed in the page table. Anyhow, I find it strange that we use the pagecontent table to store those full text vectors when they are in fact the full text vectors for a page. Wouldn't this textvector be better placed either in the page table or in a searchindex with a 1:1 relationship like in mysql ? That would be more logical to the planner : how could it guess that the records from the pagecontent table that are indexed are indeed the latest revisions from the page table ? Whereas if the vector is in the page table, the query becomes a more classical join : we find the records matching the full text criterion from page table, then join to revision and pagecontent to retrieve the associated text. - all revisions of all pages were indexed in the text table. Maybe because we're still using 1.9.1 ? (there is no update in SearchPostgres.php here) If I understand correctly what you explain below, the content of text is not 'opacified' for postgresql, |