[PLIP-Advisories] Re: [Plone] #9309: Better search for East Asian (multi-byte) languages.
plip-advisories at lists.plone.org
Sun Jun 28 07:43:34 UTC 2009
#9309: Better search for East Asian (multi-byte) languages.
---------------------------------+------------------------------------------
Reporter: terapyon | Owner:
Type: PLIP | Status: new
Priority: minor | Milestone: 4.0
Component: Unknown | Resolution:
Keywords: search splitter CJK |
---------------------------------+------------------------------------------
Comment(by terapyon):
More details as follows:[[BR]]
Plone currently does not produce good search results for Chinese,
Japanese, and Korean (CJK) text. Text in those languages is not divided
into words by white space, so Plone's standard whitespace-based text
indexing does not work for it.
'''PROPOSAL'''[[BR]]
We propose giving Plone better CJK search out of the box by implementing
different indexing logic for CJK text. There are three main elements to
this:
1. CJK text detection. The indexing code recognizes CJK text by checking
the character code of each character being indexed. This adds a small
overhead to all text indexing. The indexer switches to the CJK logic only
when the code is in a CJK range, so non-CJK text is not affected. When
non-CJK characters follow CJK characters, indexing reverts to the standard
white-space method.
2. Splitting. CJK text is then split into index keys. There are two main
methods of splitting CJK text: bi-grams and morphological analysis.
Morphological analysis can yield better results, but requires maintenance
of a dictionary. We propose using the bi-gram method because it does not
require dictionary maintenance and because its results are good enough for
text search purposes. The bi-gram method splits a text of N characters
into N-1 bi-grams, that is, overlapping pairs of adjacent characters. Each
bi-gram is then registered as an index key.
3. Searching. When searching, the search term is also split into a set of
bi-grams that are used as search keys. A document matches only if every
search key is found among its index keys and the matched index keys are
consecutive, in the same order as in the search term. This should work
like searching in English text for a double-quoted search term such as
"I like Plone".
- Consider this example; each roman character here represents a CJK
character.
a. The text "content" is indexed. This generates the following 6 index
keys:
"co", "on", "nt", "te", "en", "nt"
b. The search term "cont" is entered. This generates 3 search keys:
"co", "on", "nt"
c. All 3 search keys match the first 3 index keys, in the same order and
with no gaps between them. So "cont" matches "content".
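The splitting and searching steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the code being proposed for Plone: the names `bigrams`, `build_index` and `matches` are made up for this example, and the index is a simple dictionary of positions rather than a real catalog index.

```python
def bigrams(text):
    """Split text into overlapping pairs of adjacent characters."""
    if len(text) < 2:
        # Assumption for this sketch: a lone character is kept as its own key.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(text):
    """Map each bi-gram to the list of positions where it starts."""
    index = {}
    for pos, key in enumerate(bigrams(text)):
        index.setdefault(key, []).append(pos)
    return index


def matches(index, term):
    """Return True if the term's bi-grams occur at consecutive positions."""
    keys = bigrams(term)
    if not keys:
        return False
    # Try every position of the first key, then require each following
    # key to start exactly one position later than the previous one.
    for start in index.get(keys[0], []):
        if all(start + off in index.get(key, [])
               for off, key in enumerate(keys)):
            return True
    return False


index = build_index("content")
print(bigrams("content"))      # the 6 index keys from the example
print(matches(index, "cont"))  # search keys "co", "on", "nt"
```

The adjacency check matters: in the text "abcbd" both "ab" and "bd" are index keys, but they are not consecutive, so a search for "abd" should not match.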
'''RISKS'''[[BR]]
Low
- Since the first step of this process checks the character code and
switches the logic only when the code is in a CJK range, non-CJK text
should not be affected. This approach also allows indexing methods for
other languages written without white space to be added in the future.
- Code range checking will be a small, but we believe acceptable, overhead
per character for all users regardless of languages.
'''PARTICIPANTS'''[[BR]]
Manabu Terada - Leader[[BR]]
Mikio Hokari - Programmer[[BR]]
Takeshi Yamamoto - Documentation and Testing[[BR]]
Naotaka Hotta - Support and Documentation[[BR]]
Jonathan Lewis - Moral support and documentation[[BR]]
'''PROGRESS'''[[BR]]
A start on CJK indexing was made as a GSoC project in 2008. That project
didn't deliver as much as hoped, but the mentor group has continued
development and the East Asian Plone community has helped to test the
code. The major remaining tasks are refining the code and repackaging it
as a built-in component.
--
Ticket URL: <https://dev.plone.org/plone/ticket/9309#comment:5>
Plone <http://plone.org>
Plone Content Management System