[Sprints] Re: [cph-sprint] implications of deferred reindexing

Matt Hamilton matth at netsight.co.uk
Fri Oct 26 10:40:54 UTC 2007


Lennart Regebro <regebro at ...> writes:

> 
> On 10/24/07, Malthe Borch <mborch at ...> wrote:

> > data point: on a 10.000 object site with a few added indices and
> > metadata columns it takes 10 minutes to rename a root folder; this is on
> > high-end machines.
> 
> Yup. This touches on two of the areas I would like most to see radical
> changes in Plone: Navigation (the current way of using the catalog for
> this should go away) and Cataloging. I wouldn't mind seeing a separate
> sprint on Cataloging to make sure we have a good story that supports
> deferred indexing, incremental searches all with support for external
> indexes that still are transaction-safe.

I am very interested in this at the performance sprint... in fact my main
personal interest in 'performance' is making cataloging faster/better.
I'm sure there are lots of gains to be had in ZPT rendering and the like, but
for me cataloging is where my interests are and what I would like to work on at
the sprint.

There was a BOF at Plone conf in which Roche Compaan talked about a stress test
he created to test conflict errors in the cataloging.  Alec Mitchell has studied
it further and I think many of the issues seem to be with the test, not the
code, but it has started an interesting thread (this was just in private email
between some of us who were talking about it at the conference).  For example,
the BTrees cause conflicts when a bucket is split while two processes are
writing to the same bucket... this can happen quite often in certain
circumstances (e.g. when the BTree is initially small).  I'm sure we can work
out some better strategy to deal with these issues.

Many moons ago (at one of the first European Z3 sprints, in Reading, UK) I messed
around with some 'compressed' indexes.  These were written in C at the time for
speed as the routines were too slow in Python (this was later tested by ZC as I
believe the compression code was put into ZCTextIndex when it was released but
never enabled).  One property these compressed indexes had was that updates to
them were done in a 'quick but not optimal' way and at a later stage a 'packing'
type operation happened which re-compressed and optimised the indexes.  The
point was that the new data was available to the catalog immediately, but not in
an optimal state (so slightly slower to access) until packed.  I'm wondering if
something like this could be used with the current indexing process, i.e. when a
document is indexed the data is just 'chucked somewhere', in a place or structure
where it would not conflict with another process doing the same, but at a later
point it is moved to the correct location.  The key is that even in the
temporary location it is usable, just maybe a bit slower or something.
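
To make the 'chuck it somewhere, pack it later' idea a bit more concrete, here
is a rough sketch in plain Python.  It is not the old C code and not anything
in ZCatalog; the class TwoStageIndex and its method names are made up for
illustration.  It also ignores the ZODB side completely (in real life each
writer would presumably need its own pending area to avoid conflicts) and does
not handle removing stale words when a document is re-indexed.

```python
from collections import defaultdict


class TwoStageIndex:
    """Toy inverted index: a cheap 'pending' area plus a packed main area.

    Hypothetical illustration only; not an existing Zope/Plone class.
    """

    def __init__(self):
        self.packed = {}   # word -> sorted tuple of doc ids (optimised form)
        self.pending = []  # append-only list of (doc_id, words) entries

    def index_doc(self, doc_id, words):
        # Quick-but-not-optimal write: append only, main index untouched.
        self.pending.append((doc_id, frozenset(words)))

    def search(self, word):
        # New data is usable immediately, at the cost of a linear scan
        # over whatever has not been packed yet.
        hits = set(self.packed.get(word, ()))
        hits.update(doc_id for doc_id, words in self.pending if word in words)
        return sorted(hits)

    def pack(self):
        # Later on, move pending entries to their 'correct location'.
        merged = defaultdict(set)
        for word, doc_ids in self.packed.items():
            merged[word].update(doc_ids)
        for doc_id, words in self.pending:
            for word in words:
                merged[word].add(doc_id)
        self.packed = {w: tuple(sorted(ids)) for w, ids in merged.items()}
        self.pending = []
```

Searches give the same answers before and after pack(); only the cost of
getting them changes.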

Alec suggested potentially an external process that could do the queue catalog
stuff very quickly, i.e. it would check the queue every few seconds/milliseconds
or so.  If there is something there that needs re-indexing then it would be
re-indexed pretty much before the user has time to click on the next thing.
The indexing is still deferred, but this solves the concurrency problems and
hides the cataloging latency from the user.
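
A very rough sketch of that polling idea, again in plain Python.  This is not
QueueCatalog's actual API; DeferredIndexer, request_reindex, drain and
run_forever are names I have invented, and index_func stands in for whatever
really does the cataloging.  The point is only that requests are cheap to
record and a single consumer does the indexing.

```python
import queue
import time


class DeferredIndexer:
    """Collects re-index requests; one consumer processes them in batches.

    Hypothetical illustration only; not an existing Zope/Plone class.
    """

    def __init__(self, index_func):
        self._queue = queue.Queue()
        self._index_func = index_func  # callable taking an object path

    def request_reindex(self, path):
        # Called from request threads: cheap, no catalog writes here.
        self._queue.put(path)

    def drain(self):
        # Process everything queued right now, indexing each path once
        # even if it was requested several times.
        seen = set()
        while True:
            try:
                path = self._queue.get_nowait()
            except queue.Empty:
                break
            if path not in seen:
                seen.add(path)
                self._index_func(path)
        return len(seen)


def run_forever(indexer, interval=0.5):
    # The 'external process' loop: wake up every `interval` seconds.
    while True:
        indexer.drain()
        time.sleep(interval)
```

Because only the drain() caller ever writes to the index, two requests cannot
conflict with each other; the price is that results lag by up to one interval.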

I have a couple of patches I want to test further at the sprint:

1) if isTemporary(object): don't index -- this stops things in portal factory
being unnecessarily indexed.  Saves some ZODB churn.

2) When an object is re-indexed the text indexes need to split/stem the text
(maybe even convert from word/pdf to text) before they can compare the document
to the existing indexes to determine if anything has changed.  The
splitting/stemming process is actually relatively expensive.  I propose that an
MD5 hash of the source text is taken and stored, and on a re-index compared; if
it matches then there is no need to do anything.  In a very quick test,
re-indexing a Page object with about 10 screenfuls of lorem ipsum completes
about 30% faster when the source text has not changed.  In the use-case above in
which someone renames a folder and all sub-objects are re-indexed this could be
a massive win.
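
For patch 2, the shape of the check is roughly as below.  This is a standalone
sketch, not the actual patch against the text index: HashCheckedTextIndex is a
made-up class, _split_and_stem is a trivial stand-in for the real (expensive)
splitter/stemmer pipeline, and the expensive_calls counter exists only so the
saving is visible.

```python
import hashlib


class HashCheckedTextIndex:
    """Skips the expensive split/stem step when the source text is unchanged.

    Hypothetical illustration only; not the real ZCTextIndex code.
    """

    def __init__(self):
        self._digests = {}  # doc_id -> MD5 digest of the last indexed text
        self._words = {}    # doc_id -> processed words
        self.expensive_calls = 0

    def _split_and_stem(self, text):
        # Stand-in for the costly splitting/stemming (and any conversion).
        self.expensive_calls += 1
        return sorted(set(text.lower().split()))

    def index_doc(self, doc_id, text):
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if self._digests.get(doc_id) == digest:
            return False  # same source text as last time: nothing to do
        self._words[doc_id] = self._split_and_stem(text)
        self._digests[doc_id] = digest
        return True


idx = HashCheckedTextIndex()
idx.index_doc(1, "Lorem ipsum dolor sit amet")   # does the expensive work
idx.index_doc(1, "Lorem ipsum dolor sit amet")   # skipped, hash matches
```

The cost is one stored digest per document per text index, plus hashing the
text on every re-index, which should be cheap next to splitting and stemming.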

-Matt




