Text search performance and postgresql

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 | Next >

Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

I'm new on this list, and I'm not sure I'm posting at the right place. I've got a question about text indexation.

I have a postgresql mediawiki instance installed on a very small server, and have been working on tweaking it's performance these last few days. When all that could be done from
the OS and database parameters was done, I decided to give a look to the SQL queries.

The main problem I'm seeing is with the text searchs. I don't really know how it's handled with mysql, but with postgresql, I think there may be some optimization that could be done,
but I'm not sure about it as I don't know all the code.

The text search query is this one :

EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text AS page_text, ts_rank(titlevector, to_tsquery('default','postgres')) AS rnk FROM page p, revision r,
pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id = c.old_id AND textvector @@ to_tsquery('default','postgres') AND page_is_redirect = '0' AND page_namespace IN
(0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20  OFFSET 0;

The plan is this one :

 Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982 rows=20 loops=1)
   ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952 rows=20 loops=1)
         Sort Key: rank(p.titlevector, '''postgr'''::tsquery), p.page_id
         ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621) (actual time=4.278..305.671 rows=44 loops=1)
               ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204) (actual time=0.829..76.740 rows=3210 loops=1)
                     ->  Seq Scan on page p  (cost=0.00..524.95 rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
                           Filter: (((page_is_redirect)::text = '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
                     ->  Index Scan using revision_rev_id_key on revision r  (cost=0.00..8.09 rows=1 width=8) (actual time=0.012..0.013 rows=1 loops=3210)
                           Index Cond: (p.page_latest = r.rev_id)
               ->  Index Scan using pagecontent_pkey on pagecontent c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069 rows=0 loops=3210)
                     Index Cond: (r.rev_text_id = c.old_id)
                     Filter: (textvector @@ '''postgr'''::tsquery)
 Total runtime: 306.118 ms

This plan joins page and revision to determine all the latests revisions of pagecontents, then scans all matching pagecontent to determine which ones match my query.

There is also an other plan, depending on the amount of ram available and the estimate of the number of 'latest pagecontents'

 Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646 rows=20 loops=1)
   ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.591..224.610 rows=20 loops=1)
         Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)), p.page_id
         Sort Method:  top-N heapsort  Memory: 37kB
         ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504) (actual time=211.141..224.432 rows=43 loops=1)
               Hash Cond: (p.page_latest = r.rev_id)
               ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527 width=82) (actual time=0.460..10.202 rows=3118 loops=1)
                     Filter: ((page_is_redirect = '0'::bpchar) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
               ->  Hash  (cost=2688.26..2688.26 rows=84 width=430) (actual time=210.409..210.409 rows=1517 loops=1)
                     ->  Hash Join  (cost=534.76..2688.26 rows=84 width=430) (actual time=26.557..207.725 rows=1517 loops=1)
                           Hash Cond: (r.rev_text_id = c.old_id)
                           ->  Seq Scan on revision r  (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850 rows=84194 loops=1)
                           ->  Hash  (cost=533.59..533.59 rows=93 width=430) (actual time=18.182..18.182 rows=1515 loops=1)
                                 ->  Bitmap Heap Scan on pagecontent c  (cost=190.83..533.59 rows=93 width=430) (actual time=0.585..15.663 rows=1515 loops=1)
                                       Recheck Cond: (textvector @@ '''postgr'''::tsquery)
                                       ->  Bitmap Index Scan on ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431 rows=1515 loops=1)
                                             Index Cond: (textvector @@ '''postgr'''::tsquery)
 Total runtime: 224.765 ms


Times are different because this machine is much more powerful.

This time, postgresql decides to get all articles from pagecontent, with all versions, and then determines which ones are latest.

In both cases this is rather inefficient, as I guess we search only on the latest version of the articles.



So I'm coming to the point ...


Is there a reason we index every version of every content ?


For instance, with my database, I've tested removing all textvectors from pagecontent except for the latest version of each page. My text index size went from 400Mb to 15Mb. And
my text search times went down to a near constant 10ms for all queries. I can then maintain the textvectors by modifying the trigger on pagecontent a bit to cleanup the previous
record while updating the table.


If I'm posting in the wrong place, please tell me. If the idea is stupid, please tell me also :)


Cheers

Marc Cousin

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Brion Vibber-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...>
> EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text  
> AS page_text, ts_rank(titlevector, to_tsquery('default','postgres'))  
> AS rnk FROM page p, revision r,
> pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id =  
> c.old_id AND textvector @@ to_tsquery('default','postgres') AND  
> page_is_redirect = '0' AND page_namespace IN
> (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20  OFFSET 0;

Hmm; a fundamental problem here is that text.old_text is opaque to the  
database -- it may contain compressed text, text in an alternate  
encoding, or a reference to another table or database at the  
application level.

This is part of why we use a separate table for the search index for  
the MySQL search -- it's actually consistently readable to the db's  
search engine as well as providing a summary table so the index only  
has to cover live versions.

-- brion

>
>
> The plan is this one :
>
> Limit  (cost=718.49..718.50 rows=1 width=621) (actual time=305.943..305.982
>  rows=20 loops=1)
>   ->  Sort  (cost=718.49..718.50 rows=1 width=621) (actual time=305.939..305.952
>  rows=20 loops=1)
>         Sort Key: rank(p.titlevector, '''postgr'''::tsquery),  
> p.page_id
>         ->  Nested Loop  (cost=0.00..718.48 rows=1 width=621)  
> (actual time=4.278..305.671 rows=44 loops=1)
>               ->  Nested Loop  (cost=0.00..695.00 rows=21 width=204)  
> (actual time=0.829..76.740 rows=3210 loops=1)
>                     ->  Seq Scan on page p  (cost=0.00..524.95  
> rows=21 width=204) (actual time=0.804..19.686 rows=3210 loops=1)
>                           Filter: (((page_is_redirect)::text =  
> '0'::text) AND (page_namespace = ANY ('{0,9,11}'::integer[])))
>                     ->  Index Scan using revision_rev_id_key on  
> revision r  (cost=0.00..8.09 rows=1 width=8) (actual  
> time=0.012..0.013 rows=1 loops=3210)
>                           Index Cond: (p.page_latest = r.rev_id)
>               ->  Index Scan using pagecontent_pkey on pagecontent  
> c  (cost=0.00..1.11 rows=1 width=425) (actual time=0.069..0.069  
> rows=0 loops=3210)
>                     Index Cond: (r.rev_text_id = c.old_id)
>                     Filter: (textvector @@ '''postgr'''::tsquery)
> Total runtime: 306.118 ms
>
> This plan joins page and revision to determine all the latests  
> revisions of pagecontents, then scans all matching pagecontent to  
> determine which ones match my query.
>
> There is also an other plan, depending on the amount of ram  
> available and the estimate of the number of 'latest pagecontents'
>
> Limit  (cost=2979.49..2979.50 rows=4 width=504) (actual time=224.594..224.646
>  rows=20 loops=1)
>   ->  Sort  (cost=2979.49..2979.50 rows=4 width=504) (actual  
> time=224.591..224.610 rows=20 loops=1)
>         Sort Key: (ts_rank(p.titlevector, '''postgr'''::tsquery)),  
> p.page_id
>         Sort Method:  top-N heapsort  Memory: 37kB
>         ->  Hash Join  (cost=2689.31..2979.45 rows=4 width=504)  
> (actual time=211.141..224.432 rows=43 loops=1)
>               Hash Cond: (p.page_latest = r.rev_id)
>               ->  Seq Scan on page p  (cost=0.00..276.86 rows=3527  
> width=82) (actual time=0.460..10.202 rows=3118 loops=1)
>                     Filter: ((page_is_redirect = '0'::bpchar) AND  
> (page_namespace = ANY ('{0,9,11}'::integer[])))
>               ->  Hash  (cost=2688.26..2688.26 rows=84 width=430)  
> (actual time=210.409..210.409 rows=1517 loops=1)
>                     ->  Hash Join  (cost=534.76..2688.26 rows=84  
> width=430) (actual time=26.557..207.725 rows=1517 loops=1)
>                           Hash Cond: (r.rev_text_id = c.old_id)
>                           ->  Seq Scan on revision r  
> (cost=0.00..1836.94 rows=84194 width=8) (actual time=0.023..98.850  
> rows=84194 loops=1)
>                           ->  Hash  (cost=533.59..533.59 rows=93  
> width=430) (actual time=18.182..18.182 rows=1515 loops=1)
>                                 ->  Bitmap Heap Scan on pagecontent  
> c  (cost=190.83..533.59 rows=93 width=430) (actual  
> time=0.585..15.663 rows=1515 loops=1)
>                                       Recheck Cond: (textvector @@  
> '''postgr'''::tsquery)
>                                       ->  Bitmap Index Scan on  
> ts2_page_text2  (cost=0.00..190.81 rows=93 width=0) (actual time=0.431..0.431
>  rows=1515 loops=1)
>                                             Index Cond: (textvector  
> @@ '''postgr'''::tsquery)
> Total runtime: 224.765 ms
>
>
> Times are different because this machine is much more powerful.
>
> This time, postgresql decides to get all articles from pagecontent,  
> with all versions, and then determines which ones are latest.
>
> In both cases this is rather inefficient, as I guess we search only  
> on the latest version of the articles.
>
>
>
> So I'm coming to the point ...
>
>
> Is there a reason we index every version of every content ?
>
>
> For instance, with my database, I've tested removing all textvectors  
> from pagecontent except for the latest version of each page. My text  
> index size went from 400Mb to 15Mb. And
> my text search times went down to a near constant 10ms for all  
> queries. I can then maintain the textvectors by modifying the  
> trigger on pagecontent a bit to cleanup the previous
> record while updating the table.
>
>
> If I'm posting in the wrong place, please tell me. If the idea is  
> stupid, please tell me also :)
>
>
> Cheers
>
> Marc Cousin
>
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l@...
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Saturday 19 July 2008 15:44:51 Brion Vibber wrote:

> On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...>
>
> > EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text
> > AS page_text, ts_rank(titlevector, to_tsquery('default','postgres'))
> > AS rnk FROM page p, revision r,
> > pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id =
> > c.old_id AND textvector @@ to_tsquery('default','postgres') AND
> > page_is_redirect = '0' AND page_namespace IN
> > (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20  OFFSET 0;
>
> Hmm; a fundamental problem here is that text.old_text is opaque to the
> database -- it may contain compressed text, text in an alternate
> encoding, or a reference to another table or database at the
> application level.
>
> This is part of why we use a separate table for the search index for
> the MySQL search -- it's actually consistently readable to the db's
> search engine as well as providing a summary table so the index only
> has to cover live versions.
>
> -- brion

Hi,

Thanks for the answer...

What do you mean by 'opaque' ?

Btw the separate table for mysql solves the problem I'm having with postgresql
(except that there is not the myisam problem, so the fulltext index could be
stored directly in the 'page' table with postgresql). For a wiki like mine,
with a lot of versions per document, text search performance becomes very
bad.

So can you explain me what you mean by opaque ?

Thanks a lot for your time.

Marc


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Daniel Friesen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

By 'opaque' he means that the databases are supposed to handle that
field transparently. They are supposed to send the data raw to the app
and act as if they have absolutely no clue how to handle the data inside
of that field.

The reason being the data inside of that column is inconstant and
letting the database modify or pretend it knows what is inside there can
end up with real unexpected results.

--
~Daniel Friesen(Dantman, Nadir-Seen-Fire) of:
-The Nadir-Point Group (http://nadir-point.com)
--It's Wiki-Tools subgroup (http://wiki-tools.com)
--The ElectronicMe project (http://electronic-me.org)
--Games-G.P.S. (http://ggps.org)
-And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
--Animepedia (http://anime.wikia.com)
--Narutopedia (http://naruto.wikia.com)

Marc Cousin wrote:

> On Saturday 19 July 2008 15:44:51 Brion Vibber wrote:
>> On Jul 18, 2008, at 16:03, Marc Cousin <mcousin@...>
>>
>>> EXPLAIN ANALYZE SELECT page_id, page_namespace, page_title, old_text
>>> AS page_text, ts_rank(titlevector, to_tsquery('default','postgres'))
>>> AS rnk FROM page p, revision r,
>>> pagecontent c WHERE p.page_latest = r.rev_id AND r.rev_text_id =
>>> c.old_id AND textvector @@ to_tsquery('default','postgres') AND
>>> page_is_redirect = '0' AND page_namespace IN
>>> (0,9,11) ORDER BY rnk DESC, page_id DESC LIMIT 20  OFFSET 0;
>> Hmm; a fundamental problem here is that text.old_text is opaque to the
>> database -- it may contain compressed text, text in an alternate
>> encoding, or a reference to another table or database at the
>> application level.
>>
>> This is part of why we use a separate table for the search index for
>> the MySQL search -- it's actually consistently readable to the db's
>> search engine as well as providing a summary table so the index only
>> has to cover live versions.
>>
>> -- brion
>
> Hi,
>
> Thanks for the answer...
>
> What do you mean by 'opaque' ?
>
> Btw the separate table for mysql solves the problem I'm having with postgresql
> (except that there is not the myisam problem, so the fulltext index could be
> stored directly in the 'page' table with postgresql). For a wiki like mine,
> with a lot of versions per document, text search performance becomes very
> bad.
>
> So can you explain me what you mean by opaque ?
>
> Thanks a lot for your time.
>
> Marc


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ok, I now understand the meaning of opaque, even if that data isn't that
opaque, because the database and mediawiki are using it for full text
indexing.

I totally agree with you, that would be a really bad idea to try to mess with
the content of the page (that wasn't what I was talking about).

Currently, for postgresql there is a trigger updating the 'textvector' column
in pagecontent table when 'old_text' is modified, so that the text search
index is up to date.

The problem is that the textvector should only exist for the latest version of
the pagecontent. There is no point in indexing the old versions of the
contents of the page.

That's done for mysql because of the 1:1 relationship between page and
searchindex, but isn't done for postgresql.

For PostgreSQL, there are at least 2 solutions :
- Put the textvector into the page table
- Do the same as for mysql : create a searchindex table (even if the myisam
justification for this table doesn't hold for mysql)

Plus the temporary fix I was talking about in my mail : put an empty
textvector for all versions of a document except its last one, by changing
the trigger a bit. That divided the size of my fulltext index by 30 and made
the text search fast again.

What would be the preferred way to solve this problem ?


On Monday 21 July 2008 19:47:09 Daniel Friesen wrote:
> By 'opaque' he means that the databases are supposed to handle that
> field transparently. They are supposed to send the data raw to the app
> and act as if they have absolutely no clue how to handle the data inside
> of that field.
>
> The reason being the data inside of that column is inconstant and
> letting the database modify or pretend it knows what is inside there can
> end up with real unexpected results.


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Platonides :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Marc Cousin wrote:

> For PostgreSQL, there are at least 2 solutions :
> - Put the textvector into the page table
> - Do the same as for mysql : create a searchindex table (even if the myisam
> justification for this table doesn't hold for mysql)
>
> Plus the temporary fix I was talking about in my mail : put an empty
> textvector for all versions of a document except its last one, by changing
> the trigger a bit. That divided the size of my fulltext index by 30 and made
> the text search fast again.
>
> What would be the preferred way to solve this problem ?

Probably to create a new table. The less you differ from the mysql
setup, the better.


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I totally agree with you.

Is there a way of getting it in a future mediawiki release (I can do the work
if necessary) ?



On Friday 25 July 2008 17:05:48 Platonides wrote:

> Marc Cousin wrote:
> > For PostgreSQL, there are at least 2 solutions :
> > - Put the textvector into the page table
> > - Do the same as for mysql : create a searchindex table (even if the
> > myisam justification for this table doesn't hold for mysql)
> >
> > Plus the temporary fix I was talking about in my mail : put an empty
> > textvector for all versions of a document except its last one, by
> > changing the trigger a bit. That divided the size of my fulltext index by
> > 30 and made the text search fast again.
> >
> > What would be the preferred way to solve this problem ?
>
> Probably to create a new table. The less you differ from the mysql
> setup, the better.
>
>
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l@...
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l



_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Platonides :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Marc Cousin wrote:
> I totally agree with you.
>
> Is there a way of getting it in a future mediawiki release (I can do the work
> if necessary) ?
>

Sure. Ask brion for a SVN account (you'll need to send him a public key).
The postgresql backend is not too well maintained, people wanting to
postgre appear from time to time and give it a kick.
I don't remember who is currently supposed to be taking care of it, but
i interpret that as he hasn't commented on this, he doesn't oppose ;)


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Monday 28 July 2008 13:34:41 Platonides wrote:

> Marc Cousin wrote:
> > I totally agree with you.
> >
> > Is there a way of getting it in a future mediawiki release (I can do the
> > work if necessary) ?
>
> Sure. Ask brion for a SVN account (you'll need to send him a public key).
> The postgresql backend is not too well maintained, people wanting to
> postgre appear from time to time and give it a kick.
> I don't remember who is currently supposed to be taking care of it, but
> i interpret that as he hasn't commented on this, he doesn't oppose ;)


Okay.

First I'd like to sum up what's to be done (please anybody tell me if I'm
wrong, I'll let a few days pass before I start working).

= Current state =

MySQL's way of doing text indexation:
- Articles updates are done on text (pagecontent for postgresql), and page (I
of course omit the rest of the schema as it's the same as far as full text
indexing is concerned)
- Sometimes, there is an indexation work that locks searchindex and puts the
new pages contents into it (retrieving the contents from the text table). New
page contents is retrieved thanks to recentchanges.

For PostgreSQL, right now : full text data is stored into the pagecontent
(text) table. So there are many unnecessary versions of full text data.

What I'd like to know is the reason for this asynchronous indexing in mysql :
is it because of the myisam table for full text (to avoid locking during
inserts), or for other performance reasons?

If there is no other reason than locking, there is no point in having this
asynchronous job with postgresql.

= Proposal =

- create the same searchindex table in postgresql as in mysql
- modify the search queries to use this new table (reuse and adapt the mysql
code)
- maintain searchindex synchronously with a trigger. I would think the best
place to put this trigger would be on the page table. The trigger should
update searchindex when page_latest is put to a non zero value (I think the
trigger will have everything it needs when this trigger fires).
- create a migration script for the database :
        - remove unnecessary columns and indexes on text table, and its trigger.
        - remove titlevector from title and its trigger.
        - add pagecontent and put all page records in it, then put gin indexes (on
title and on content)
        - put the trigger in place


Comments on this, please ?

Cheers

Marc

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Brion Vibber-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marc Cousin wrote:
> MySQL's way of doing text indexation:
> - Articles updates are done on text (pagecontent for postgresql), and page (I
> of course omit the rest of the schema as it's the same as far as full text
> indexing is concerned)
> - Sometimes, there is an indexation work that locks searchindex and puts the
> new pages contents into it (retrieving the contents from the text table). New
> page contents is retrieved thanks to recentchanges.

Well, no -- the search index is updated on page save, unless you go to a
lot of (usually unnecessary) trouble to disable it and do bulk updates.


Here's what happens normally:

Article::editUpdates() puts a SearchUpdate object on the post-edit
updates stack when the edit is complete. This gets run at the completion
of the request, along with many other updates (links tables, view
counts, etc).

SearchUpdate::doUpdate() takes the text, does some freaky markup
normalization, and sends it off to the search engine's
SearchEngine::update() or SeachEngine::updateTitle() methods.

The SearchEngine subclass for the relevant engine then updates its index
for the given page. In the case of the MySQL backend, this creates or
updates a row in the searchindex table.

> What I'd like to know is the reason for this asynchronous indexing in mysql :
> is it because of the myisam table for full text (to avoid locking during
> inserts), or for other performance reasons?

If you mean the non-default option to disable updates on edit, that
would be because MyISAM updates cause locking trouble if your site is
very busy, so you might prefer to do updates in bulk off-peak with the
search temporarily disabled.

However this is not the default.

> - maintain searchindex synchronously with a trigger. I would think the best
> place to put this trigger would be on the page table. The trigger should
> update searchindex when page_latest is put to a non zero value (I think the
> trigger will have everything it needs when this trigger fires).

Can't do it -- the database has insufficient knowledge to interpret the
contents of page text in the general case.

* It doesn't know how to uncompress compressed revisions
* It doesn't know how to access text in external storage blobs
* It doesn't know the namespaces etc for proper markup normalization

Since the database can't read page text as a general case, this has to
be done from the application layer.

- -- brion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiOEnIACgkQwRnhpk1wk47lnACg2SApw9M6bxbcwsS9IhzFC/3Z
I1gAn3ZWFwnABt82JvNkJHQiax7RNRXW
=ES9a
-----END PGP SIGNATURE-----

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Monday 28 July 2008 20:39:46 Brion Vibber wrote:

> Marc Cousin wrote:
> > MySQL's way of doing text indexation:
> > - Articles updates are done on text (pagecontent for postgresql), and
> > page (I of course omit the rest of the schema as it's the same as far as
> > full text indexing is concerned)
> > - Sometimes, there is an indexation work that locks searchindex and puts
> > the new pages contents into it (retrieving the contents from the text
> > table). New page contents is retrieved thanks to recentchanges.
>
> Well, no -- the search index is updated on page save, unless you go to a
> lot of (usually unnecessary) trouble to disable it and do bulk updates.
>
>
> Here's what happens normally:
>
> Article::editUpdates() puts a SearchUpdate object on the post-edit
> updates stack when the edit is complete. This gets run at the completion
> of the request, along with many other updates (links tables, view
> counts, etc).
>
> SearchUpdate::doUpdate() takes the text, does some freaky markup
> normalization, and sends it off to the search engine's
> SearchEngine::update() or SeachEngine::updateTitle() methods.
>
> The SearchEngine subclass for the relevant engine then updates its index
> for the given page. In the case of the MySQL backend, this creates or
> updates a row in the searchindex table.
>
> > What I'd like to know is the reason for this asynchronous indexing in
> > mysql : is it because of the myisam table for full text (to avoid locking
> > during inserts), or for other performance reasons?
>
> If you mean the non-default option to disable updates on edit, that
> would be because MyISAM updates cause locking trouble if your site is
> very busy, so you might prefer to do updates in bulk off-peak with the
> search temporarily disabled.
>
> However this is not the default.
>
> > - maintain searchindex synchronously with a trigger. I would think the
> > best place to put this trigger would be on the page table. The trigger
> > should update searchindex when page_latest is put to a non zero value (I
> > think the trigger will have everything it needs when this trigger fires).
>
> Can't do it -- the database has insufficient knowledge to interpret the
> contents of page text in the general case.
>
> * It doesn't know how to uncompress compressed revisions
> * It doesn't know how to access text in external storage blobs
> * It doesn't know the namespaces etc for proper markup normalization
>
> Since the database can't read page text as a general case, this has to
> be done from the application layer.

Ok. It's not the case right now with the PostgreSQL code : the database
maintains it's text index by itself via trigger. So doing the changes the
mysql way may correct other problems at the same time ?

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Parent Message unknown Re: Text search performance and postgresql

by Greg Sabino Mullane-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Marc Cousin wrote:
> The plan is this one
[explain analyze plans]

Those are pretty bad plans - you might benefit from some simple tuning.
Try lowering random_page_cost and boosting effective_cache_size. Or come
round to #postgresql on freenode, for more free tuning help than you can
shake a stick at.

The query you gave ran in 56 ms for me on what I'm guessing is
a larger wiki (~ 400K rows in pagecontent) (and a pretty busy box).

You should be seeing a Bitmap Index Scan on "ts2_page_text",

If you find yourself using the same namespace restrictions a lot, you can
create some custom indexes as well, e.g.

CREATE INDEX page_index1 ON page(page_title) WHERE page_is_redirect = 0
AND page_namespace IN (0,9,11);

> Is there a reason we index every version of every content ?

We don't. Or at least, we were not intending to. You can grep for UPDATE
in SearchPostgres.php. I just changed (r38184) the code to double-check we
nullify *all* old revisions: this may explain part of what you were seeing
before. You can check on individual pages and see if they are all
null-but-latest like so:

SELECT CASE WHEN textvector IS NULL THEN 1 ELSE 0 END AS isnull, COUNT(*)
FROM pagecontent WHERE old_id IN (SELECT rev_text_id FROM revision WHERE
rev_page = (SELECT page_id FROM page WHERE page_namespace = 0 AND
page_title = 'Foobar' LIMIT 1)) GROUP BY 1;

(mostly for the benefit of other people reading this thread, since you
said you already manually updated the database)

Brion Vibber wrote:
> Hmm; a fundamental problem here is that text.old_text is opaque to the
> database -- it may contain compressed text, text in an alternate
> encoding, or a reference to another table or database at the
> application level.

Right: $wgCompressRevisions is not recommended for a Postgres backend.
Should probably document that somewhere. Postgres will compress
automatically anyway.

Not to say that we might not want to change the way things are done, but
the easiest solution to the original poster's problem is to get the query
optimized first, and figure out why the old revision are not being nulled.

Platonides wrote:

> The postgresql backend is not too well maintained, people wanting to
> postgre appear from time to time and give it a kick.
> I don't remember who is currently supposed to be taking care of it, but
> i interpret that as he hasn't commented on this, he doesn't oppose ;)

Wow, that's a bit harsh - not too well maintained? Hardly. :) Myself and
others have put in a lot of work to get it working and keep it so,
including paving the way for other database backends by discovering and
addressing lots of mysqlisms in the code. I apologize for not replying:
I somehow got recently unsubscribed.

--
Greg Sabino Mullane greg@...
End Point Corporation
PGP Key: 0x14964AC8 200807291126
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8



_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

signature.asc (169 bytes) Download Attachment

Re: Text search performance and postgresql

by Brion Vibber-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marc Cousin wrote:
> On Monday 28 July 2008 20:39:46 Brion Vibber wrote:
[snip]
>> Since the database can't read page text as a general case, this has to
>> be done from the application layer.
>
> Ok. It's not the case right now with the PostgreSQL code : the database
> maintains it's text index by itself via trigger.

... which does not work as a general case as I've mentioned many times
in this thread so far ...

> So doing the changes the
> mysql way may correct other problems at the same time ?

There's nothing MySQL-specific about doing it in a way that works. :)

- -- brion
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiPkNQACgkQwRnhpk1wk44MZgCgxFSNKfuD8xUTpClSgZT4Ah0p
9XwAmgOkTwd/DEDK2FWDoO6FCWtt+fEg
=7t1V
-----END PGP SIGNATURE-----

_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Platonides :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Greg Sabino Mullane wrote:
> Wow, that's a bit harsh - not too well maintained? Hardly. :) Myself and
> others have put in a lot of work to get it working and keep it so,
> including paving the way for other database backends by discovering and
> addressing lots of mysqlisms in the code. I apologize for not replying:
> I somehow got recently unsubscribed.

That's what you get if you don't reply at time. You get any noob like me
  answering anything about a secondary database backend :-)


_______________________________________________
MediaWiki-l mailing list
MediaWiki-l@...
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l

Re: Text search performance and postgresql

by Marc Cousin :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks for your aswers.

Here's what I see from my (narrow) understanding of how mediawiki works.

- I already tried to change the parameters you mention, but as they were
correct for my (pretty old) machine, this brought me even worse results (this
was expected). The bad plans come from the fact we have a lot of versions of
page revisions indexed in the page table.

Anyhow, I find it strange that we use the pagecontent table to store those
full text vectors when they are in fact the full text vectors for a page.
Wouldn't this textvector be better placed either in the page table or in a
searchindex with a 1:1 relationship like in mysql ? That would be more
logical to the planner : how could it guess that the records from the
pagecontent table that are indexed are indeed the latest revisions from the
page table ? Whereas if the vector is in the page table, the query becomes a
more classical join : we find the records matching the full text criterion
from page table, then join to revision and pagecontent to retrieve the
associated text.

- all revisions of all pages were indexed in the text table. Maybe because
we're still using 1.9.1 ? (there is no update in SearchPostgres.php here)


If I understand correctly what you explain below, the content of text is
not 'opacified' for postgresql,