[PLIP-Advisories] Re: [Plone] #9309: Better search for East Asian (multi-byte) languages.

plip-advisories at lists.plone.org
Sun Jun 28 07:43:34 UTC 2009


#9309: Better search for East Asian (multi-byte) languages.
---------------------------------+------------------------------------------
 Reporter:  terapyon             |        Owner:     
     Type:  PLIP                 |       Status:  new
 Priority:  minor                |    Milestone:  4.0
Component:  Unknown              |   Resolution:     
 Keywords:  search splitter CJK  |  
---------------------------------+------------------------------------------

Comment(by terapyon):

 More details follow:[[BR]]


 Plone currently does not produce good search results for Chinese,
 Japanese, and Korean (CJK) text. These languages do not separate words
 with whitespace, so Plone's standard whitespace-based text indexing
 does not work for them.

 '''PROPOSAL'''[[BR]]

 We propose giving Plone better CJK search out of the box by implementing
 different indexing logic for CJK text. There are three main elements to
 this:

 1. CJK text detection. The indexing code recognizes CJK text by checking
 the character code of each character being indexed, which adds a small
 overhead to all text indexing. The indexing logic switches only when the
 code falls within the CJK ranges, so non-CJK text is not affected. When
 non-CJK characters follow CJK characters, indexing reverts to the
 standard whitespace method.

 2. Splitting. CJK text is then split into index keys. There are two main
 methods of splitting CJK text: bi-grams and morphological analysis.
 Morphological analysis can yield better results, but requires maintenance
 of a dictionary. We propose using the bi-gram method because it does not
 require dictionary maintenance and because its results are good enough
 for text search purposes. The bi-gram method splits a text of N characters
 into N-1 bi-grams, that is, overlapping pairs of adjacent characters. Each
 bi-gram is then registered as an index key.

 3. Searching. When searching, the search term is also split into
 bi-grams, which are used as search keys. A search term matches only when
 every search key matches an index key and the matched index keys are
 consecutive in the indexed text. This should work like searching English
 text for a double-quoted phrase such as "I like Plone".

 - Consider this example; each roman character here represents a CJK
 character.

 a. The text "content" is indexed. This generates the following 6 index
 keys:
 "co", "on", "nt", "te", "en", "nt"

 b. The search term "cont" is entered. This generates 3 search keys:
 "co", "on", "nt"

 c. All 3 search keys match the first 3 index keys, in the same adjacent
 order. So "cont" matches "content".
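
 The splitting and searching steps above can be sketched in Python as
 follows. This is only an illustration of the bi-gram idea, not the code
 proposed for Plone: the names `bigrams`, `build_index`, and `matches` are
 hypothetical, and the index is a plain dictionary rather than a real
 Plone text index.

```python
def bigrams(text):
    """Split a text of N characters into N-1 overlapping pairs.

    Texts shorter than two characters yield no keys in this sketch.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(text):
    """Map each bi-gram to the list of positions where it starts."""
    index = {}
    for pos, key in enumerate(bigrams(text)):
        index.setdefault(key, []).append(pos)
    return index


def matches(index, term):
    """Return True if the term's bi-grams occur at consecutive positions."""
    keys = bigrams(term)
    if not keys:
        return False
    # Try every position of the first key, then require each following
    # key to start exactly one character later than the previous one.
    return any(
        all(start + offset in index.get(key, ())
            for offset, key in enumerate(keys))
        for start in index.get(keys[0], ())
    )


index = build_index("content")
print(bigrams("content"))      # the six index keys from step a
print(matches(index, "cont"))  # keys "co", "on", "nt" are adjacent
print(matches(index, "ente"))  # all keys exist, but not adjacently
```

 The last call shows why adjacency matters: "en", "nt", and "te" are all
 index keys of "content", but they never occur consecutively in that
 order, so "ente" is not reported as a match.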

 '''RISKS'''[[BR]]

 Low
 - Since the first step of this process checks the character code and
 switches the logic only when the code is in the CJK range, non-CJK text
 should not be affected. This approach also allows indexing methods for
 other languages written without whitespace to be added in the future.

 - Code range checking adds a small, but we believe acceptable,
 per-character overhead for all users regardless of language.
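
 To make the character-code check concrete, here is a minimal Python
 sketch of how a splitter might switch between whitespace splitting and
 bi-gram splitting. The code ranges below are illustrative assumptions (a
 few well-known Unicode blocks), not the ranges used by the actual
 implementation, and the names `CJK_RANGES`, `is_cjk`, and `index_keys`
 are hypothetical. Keeping a lone CJK character as its own key is also an
 assumption of this sketch.

```python
from itertools import groupby

# Illustrative ranges only (an assumption, not the PLIP's actual list).
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(char):
    """Return True if the character's code falls in one of CJK_RANGES."""
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)


def index_keys(text):
    """Split text into index keys, switching logic per run of characters."""
    keys = []
    # groupby yields consecutive runs of CJK or non-CJK characters.
    for cjk, group in groupby(text, key=is_cjk):
        run = "".join(group)
        if not cjk:
            # Standard method: split on whitespace.
            keys.extend(run.split())
        elif len(run) == 1:
            # Assumption: a lone CJK character is kept as a single key.
            keys.append(run)
        else:
            # Bi-gram method: N characters give N-1 overlapping pairs.
            keys.extend(run[i:i + 2] for i in range(len(run) - 1))
    return keys


print(index_keys("hello world"))
print(index_keys("Plone は日本語 search"))
```

 Text without any CJK characters goes through the whitespace branch
 only, so the extra cost for such text is the one range check per
 character.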

 '''PARTICIPANTS'''[[BR]]

 Manabu Terada - Leader[[BR]]

 Mikio Hokari - Programmer[[BR]]

 Takeshi Yamamoto - Documentation and Testing[[BR]]

 Naotaka Hotta - Support and Documentation[[BR]]

 Jonathan Lewis - Moral support and documentation[[BR]]


 '''PROGRESS'''[[BR]]

 A start on CJK indexing was made as a GSoC project in 2008. That project
 didn't deliver as much as hoped, but the mentor group has continued
 development and the East Asian Plone community has helped to test the
 code. The major remaining tasks are refining the code and repackaging it
 as a built-in component.

-- 
Ticket URL: <https://dev.plone.org/plone/ticket/9309#comment:5>
Plone <http://plone.org>
Plone Content Management System

