[Product-Developers] Search feature slang

Andreas Jung lists at zopyx.com
Mon May 18 12:12:46 UTC 2009

On 18.05.09 13:40, Raphael Ritz wrote:
> Hi folks,
> sorry if this is considered inappropriate but I'd appreciate input
> from those who know more than I on searching technologies:
> In short I need to support full text searches with
> 1. plural versus singular forms treated equally
> 2. American versus British English treated equally
> 3. (Obvious?) spelling errors corrected/taken care of.
> Now from what I know this translates to
> 1. -> stemming support
Honestly, all stemmers based on the Porter stemmer
suck in many ways. I don't know of any open-source stemmer
that is not based on the Porter stemmer and would fulfill
professional requirements. The only usable solutions available
are commercial and expensive.
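
To make the term "stemming" concrete, here is a minimal sketch of suffix
stripping for plural forms. It is a toy written for this illustration,
not the Porter algorithm and not what TXNG3 does internally; the function
name naive_singular is hypothetical.

```python
def naive_singular(word):
    """Toy plural stripper for illustration only; NOT the Porter algorithm."""
    w = word.lower()
    # "queries" -> "query"
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"
    # "indexes" -> "index", "searches" -> "search"
    if len(w) > 3 and w.endswith(("ses", "xes", "zes", "ches", "shes")):
        return w[:-2]
    # "documents" -> "document", but leave "class", "thesaurus", "thesis" alone
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w


# Regular plurals are conflated with their singular form ...
print(naive_singular("queries"), naive_singular("indexes"))
# ... but irregular plurals are not: "indices" does not reduce to "index".
print(naive_singular("indices"))
```

Rule-based stripping like this handles regular plurals but misses
irregular forms, which is the kind of weakness meant above.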

> 2. -> appropriate normalization? thesaurus based search?
>    (if so, what would be appropriate normalizers or
>     thesauri?)

Supported to some extent in TXNG3 (and extensible). Both normalization
(where aspects like the handling of compounds could be improved)
and thesaurus search are features that are not widely used and
are somewhat underdeveloped in TXNG3.
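
As a rough illustration of what normalization means for American versus
British spelling, the sketch below maps variants to one canonical form
before indexing and before querying. The mapping table and the function
name normalize_terms are assumptions made for this example; this is not
the TXNG3 normalizer API.

```python
# Hypothetical, hand-maintained mapping; a real table would be much larger.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "centre": "center",
    "catalogue": "catalog",
    "normalisation": "normalization",
}


def normalize_terms(text):
    """Lowercase, split on whitespace, and map known variants to one spelling."""
    return [BRITISH_TO_AMERICAN.get(term, term) for term in text.lower().split()]


# Apply the same function to documents at index time and to queries at
# search time, so both spellings end up as the same index term.
print(normalize_terms("Colour catalogue"))
print(normalize_terms("color catalog"))
```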

> 3. -> similarity search where similarity is defined
>    according to some algorithm (e.g., Levenshtein)
Similarity search can be implemented in different ways.
Levenshtein distance is perhaps the best approach for algorithmic similarity
search. Professional search systems also take a thesaurus into account.
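
For reference, a minimal sketch of Levenshtein (edit) distance using the
standard dynamic-programming formulation, plus a helper that suggests
index terms within a small distance of a misspelled query. The helper
name suggest and its max_distance default are assumptions for this
example, not part of any existing index implementation.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(term, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(term, word), word) for word in vocabulary)
    return [word for distance, word in scored if distance <= max_distance]


print(suggest("serch", ["search", "stemming", "thesaurus"]))
```

Comparing the query against every term in the vocabulary is linear in
the vocabulary size, so this is only practical for small term lists.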
> First question: did I get the vocabulary right here?
> Second question: looking around I (obviously) consider
> TextIndexNG but one thing I found there is that stemming
> support is incompatible with globbing (wildcard) support.

Isn't it? The various combinations of TXNG options will
likely lead to confusion and unpredictable
results, depending on the settings.
> While it seems obvious to me that they are kind of
> mutually exclusive I'd like to know how others are dealing
> with this (have two differently configured text indexes for
> the full text search and query one or the other???).

Some features are mutually exclusive (and TXNG perhaps does not
try to catch stupid parameter combinations).
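
The two-index idea from the question can be sketched as follows: one
index keeps the raw terms for wildcard queries, the other keeps stems
for ordinary queries, and the query is dispatched depending on whether
it contains a wildcard. Everything here (toy_stem, build_indexes,
search) is hypothetical illustration code, not TXNG configuration or
the ZCatalog API.

```python
from fnmatch import fnmatchcase


def toy_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def build_indexes(docs):
    """Build a raw-term index and a stemmed index: term -> set of doc ids."""
    raw, stemmed = {}, {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            raw.setdefault(term, set()).add(doc_id)
            stemmed.setdefault(toy_stem(term), set()).add(doc_id)
    return raw, stemmed


def search(query, raw, stemmed):
    """Wildcard queries go to the raw index, all others to the stemmed one."""
    q = query.lower()
    if "*" in q or "?" in q:
        hits = set()
        for term, ids in raw.items():
            if fnmatchcase(term, q):
                hits |= ids
        return hits
    return set(stemmed.get(toy_stem(q), set()))


docs = {1: "indexing documents", 2: "several indexes", 3: "one index"}
raw, stemmed = build_indexes(docs)
print(search("indexes", raw, stemmed))  # stemmed lookup
print(search("indexi*", raw, stemmed))  # wildcard lookup on raw terms
```

The conflict is visible in the data: the stemmed index only contains
"index", so a pattern like "indexi*" has nothing to match there, while
the raw index still contains "indexing".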

> Last but not least I'd appreciate pointers to docs teaching
> the general concepts, constraints, and vocabularies so that
> I know what I'm talking about in the future.

You might look at alternatives based on Lucene (e.g. SOLR,
collective.solr by Florian). The last time I checked the Lucene
world for the features you need, I came to the conclusion
that those solutions have the same problems when it comes
to professional stemming and thesaurus support. Those solutions suck
on the same high level as TextIndexNG3.

