Scaling eXist


Scaling eXist

by Ben Bangert

What's the current best strategy when it comes to scaling eXist to
handle huge amounts of data and throughput? I'm currently only storing
around 15-25 GB of data, but it's expanding at a decent pace and I'm
realistically looking at upwards of a terabyte of XML data in the next
6 months.

Is there anyone currently storing that much in eXist, or close to it?

I know that in the past eXist generally scaled up to about 20 GB or
so, so I figure I could always shard, since my dataset splits well
into groupings that will likely be 20-25 GB each. It'd be easier to
manage with fewer shards, of course, so being able to store 100 GB or
so per server would be more manageable.
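
To make the trade-off above concrete, here is a minimal Python sketch of the shard arithmetic. It is not anything eXist provides: the function names, the first-fit-decreasing packing strategy, and the convention 1 TB = 1000 GB are all assumptions chosen for illustration.

```python
import math


def shards_needed(total_gb, per_shard_gb):
    """Lower bound on the number of shards for a given per-shard capacity."""
    return math.ceil(total_gb / per_shard_gb)


def assign_groups(group_sizes_gb, capacity_gb):
    """Pack named groupings onto shards with first-fit decreasing.

    group_sizes_gb maps a (hypothetical) grouping name to its size in GB.
    Returns one list of grouping names per shard.
    """
    shards = []  # each entry: {"free": remaining GB, "groups": [names]}
    # Largest groupings first; ties keep their original order.
    ordered = sorted(group_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        if size > capacity_gb:
            raise ValueError(f"grouping {name!r} does not fit on one shard")
        for shard in shards:
            if shard["free"] >= size:
                shard["groups"].append(name)
                shard["free"] -= size
                break
        else:
            # No existing shard has room, so open a new one.
            shards.append({"free": capacity_gb - size, "groups": [name]})
    return [shard["groups"] for shard in shards]


# Assumption: 1 TB is taken as 1000 GB.
print(shards_needed(1000, 25))   # one 20-25 GB grouping per shard
print(shards_needed(1000, 100))  # roughly four groupings per server
```

Under these assumptions, a terabyte needs about 40 to 50 shards at 20-25 GB each, versus about 10 servers at 100 GB each, which is the difference in management overhead the paragraph above refers to.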

Regarding data, my XML documents generally range in size from 15 KB to
about 5 MB, with about 3% of my XML documents being as large as 50 MB.
eXist performs great so far; I'm just wondering how far others have
taken it, and what their experience has been.

Thanks,
Ben

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open


Re: Scaling eXist

by José María Fernández González

Hi Ben,
        the documents I'm working with are much bigger (from tens of MB to
tens of GB each, with a mean of around 200 MB), but the datasets have more
or less the same volume. I hope the following tips help.

        From my experience working with such volumes of information, first of all
you need fast filesystems (no ext3, please) for your eXist instances, fast
storage access (for instance, Linux software RAID0 with 4 disks performs
very well) and at least 2 GB for the Java VM. If you are concerned about the
risk of data loss with RAID0, then you should consider a hardware RAID3
solution. As http://www.acnc.com/04_01_03.html explains (and as I had the
chance to test some years ago), it behaves very well on writes.

        Another strategy, which I have not had the chance to test with the latest
developments, is using a separate physical disk/device for the journal logs.
Journal logs are synced very often on insertions, deletions and updates, and
the performance of those operations suffers because the disk heads have to
keep moving between the journal and the data files.

        At the database level, you must identify the scalability bottlenecks that
can affect your installation, and one of them is the FTS (full-text search)
index. As indexes are associated with collections, the size of the indexes
depends on the content of the collection. FTS indexes do not scale up very
well because, whenever an indexed term is built or used, an array with the
ordered positions where the term appears must be kept in memory. On a
collection with many FTS-indexed terms, insertions slow down due to
continuous cache invalidations. A partial solution is to create
sub-collections for the content, so each collection keeps its own part of
the index.
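
One way to spread content over sub-collections is to derive the target collection from a hash of the document name. The following Python sketch is only an illustration of that idea: the `/db/data` root, the `part-NN` naming scheme and the bucket count are hypothetical, and nothing here calls eXist itself.

```python
import hashlib


def sub_collection_path(root, doc_name, buckets=16):
    """Return the sub-collection a document would be stored in.

    The bucket is derived from a SHA-1 hash of the document name, so the
    same name always maps to the same sub-collection.
    """
    digest = hashlib.sha1(doc_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{root.rstrip('/')}/part-{bucket:02d}"


# Hypothetical collection root and document name.
print(sub_collection_path("/db/data", "record-000123.xml"))
```

If the data already splits along a natural key (date, source, record type), using that key instead of a hash keeps related documents together, which may matter more for query performance than an even spread.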

        Another scalability bottleneck is recovery from a crash (for example,
after a power failure). As far as I know (correct me if I'm wrong, Wolf),
eXist index files are not journaled, so when the recovery system suspects
that an index file may be corrupted, it erases the index files and rebuilds
them. With a volume of tens or hundreds of GB, that can take almost as long
as inserting the whole database content again.

        Last but not least are the queries you are going to issue. eXist performs
very well when it can use qname-based (range or FTS) indexes for the
queries. Otherwise, when you are querying a huge volume of information, the
creation of intermediate node fragments can make memory usage grow too
much, and the feared OutOfMemoryError is thrown. The latest committed
patches in the stable and trunk branches mitigate that problem, but I have
not had the chance to test them.

        Best Regards,
                José María


--
"There is no reason why anybody would want a computer in their home" -
        Ken Olsen, founder of DEC, 1977
"640K ought to be enough for anybody" - Bill Gates, 1981
"Nobody will ever outgrow a 20Mb hard drive." - ???

"Premature optimization is the root of all evil." - Donald Knuth

José María Fernández González
Tlfn: (+34) 91 732 80 00 / 91 224 69 00 (ext 3061)
e-mail: jmfernandez@... Fax: (+34) 91 224 69 76
Unidad del Instituto Nacional de Bioinformática
Biología Estructural y Biocomputación Structural Biology and Biocomputing
Centro Nacional de Investigaciones Oncológicas
C.P.: 28029 Zip Code: 28029
C/. Melchor Fernández Almagro, 3 Madrid (Spain)
