[Plone-AsiaPacific] Status Update Text Search for East Asian Languages

Takeshi Yamamoto tyam at mac.com
Mon Dec 22 14:57:20 UTC 2008


Here is an update on the UnicodeSplitterPatch development status.
We held a testing session and a mini sprint over the past few weeks
and made significant progress.  We are now in the testing and tuning
phase.

So, please try it and give us your feedback.

How to install
0. Install a Subversion client.
1. Install Plone 3.1.7 for testing.
2. Download and install UnicodeSplitterPatch from subversion.
     The following commands are for a standard Mac OS X installation.
     mkdir mytemp
     cd mytemp
     svn co https://svn.plone.org/svn/collective/UnicodeSplitterPatch/trunk/UnicodeSplitterPatch
     mv UnicodeSplitterPatch /Applications/Plone/zinstance/products

< For the unit testing >
3. Adjust various PATH parameters in the tests/fasttest.py file  
according to your environment.  You might need to change two lines or  
more, typically ZOPE_HOME and INSTANCE_HOME.
     cd /Applications/Plone/zinstance/products/UnicodeSplitterPatch/tests
     vim fasttest.py
4. Adjust the test assertion files.  The downloaded tests/langs
directory might already contain several txt files as test assertion
files, which were made by other people or for other languages.  You
might want to remove them from your environment, since they might not
be of interest to you and could cause UNMATCH errors.  An UNMATCH
error itself is healthy to have, since it is exactly the kind of
error the programmers want to hear about in your feedback.
5. Go to the tests directory and invoke the unit tests through zopepy.
     cd /Applications/Plone/zinstance/products/UnicodeSplitterPatch/tests
     /Applications/Plone/zinstance/bin/zopepy unittest_langs.py
6. You can change the debug mode by changing a switch on the first  
line of UnicodeSplitterPatch/__init__.py.  DEBUG=0 means debug mode  
OFF, while DEBUG=1 means debug mode ON.
7. Please report your errors to us.

< For the integration testing >
3. Go to the instance folder.
     cd /Applications/Plone/zinstance
4. Adjust the test assertion files in the same way as for unit testing.
5. Invoke the tests through plonectl.
     bin/plonectl test -s Products/UnicodeSplitterPatch
6. Please report your errors to us.

< For the testing through the web >
3. Start Zope/Plone if you want to test on your site.
4. Rebuild the catalog index.
     Open the ZMI and go to Plone/portal_catalog.
     Click the Indexes tab, then scroll down and click the Reindex button.
5. Use the Plone text search box and evaluate the results.
     We appreciate your feedback, both good and bad.

< How this UnicodeSplitterPatch works >
a. __init__.py overrides the classes of the original
CMFPlone.UnicodeSplitter.py so that they use this patched version of
UnicodeSplitter.py.
b. If you look into the UnicodeSplitter.py code, you will find some
code range definitions.  All Kanji (ideographic characters) are
assigned to the same code range regardless of whether they are
Chinese, Japanese, or Korean (CJK), so all CJK characters are handled
in the same manner.  Hiragana (Japanese), Katakana (Japanese), and
Hangul (Korean) characters are located in separate ranges.  Thai is
also located in a different range.  Other languages are not affected
by this splitter.  As of today, this splitter works only for CJK and
Thai languages.
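
To illustrate the code range idea, here is a minimal sketch, not the
patch's actual code.  It is written for present-day Python 3, and the
names SCRIPT_RANGES, script_of and split_runs are hypothetical.  The
ranges are the standard Unicode blocks; the real UnicodeSplitter.py may
define different or additional ranges and may split text differently.

```python
# Hypothetical range table based on standard Unicode blocks.
# The real patch may use different or additional ranges.
SCRIPT_RANGES = [
    ("kanji", 0x4E00, 0x9FFF),     # CJK Unified Ideographs (shared by C, J, K)
    ("hiragana", 0x3040, 0x309F),  # Japanese
    ("katakana", 0x30A0, 0x30FF),  # Japanese
    ("hangul", 0xAC00, 0xD7AF),    # Korean (Hangul Syllables)
    ("thai", 0x0E00, 0x0E7F),      # Thai
]


def script_of(ch):
    """Return the name of the range containing ch, or "other"."""
    code = ord(ch)
    for name, low, high in SCRIPT_RANGES:
        if low <= code <= high:
            return name
    return "other"


def split_runs(text):
    """Group consecutive characters that fall in the same range."""
    runs = []
    for ch in text:
        name = script_of(ch)
        if runs and runs[-1][0] == name:
            # Same range as the previous character: extend the current run.
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs
```

For example, split_runs applied to a string mixing Kanji, Katakana and
Latin letters yields one run per range, and the Latin part is reported
as "other", which mirrors the point that other languages are left alone.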

< Normalization >
The normalizing function makes Hiragana and Katakana identical for
Japanese.  These two character sets are phonetic, not ideographic.
Both sets express the same sounds but are used for different purposes.
Hiragana is used for ordinary Japanese text, while Katakana is used
for words imported from foreign countries and for some older Japanese
usages.  I think some other non-English languages might have similarly
exotic rules, which might be difficult for non-speakers to
understand.  I introduced this exotic Japanese usage to encourage
other people to implement their own exotic rules in this splitter.
Your language might need operations other than normalization.
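
As a rough illustration of this kind of normalization, here is a minimal
sketch in present-day Python 3.  It is not the patch's code: the name
fold_kana is hypothetical, and folding Katakana into Hiragana (rather
than the other way around) is an assumption.  It relies on the fact that
the Katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above
their Hiragana counterparts.

```python
# Assumed direction: Katakana is folded into Hiragana.
KATAKANA_FIRST = 0x30A1  # KATAKANA LETTER SMALL A
KATAKANA_LAST = 0x30F6   # KATAKANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance between the Katakana and Hiragana blocks


def fold_kana(text):
    """Replace each Katakana letter with the matching Hiragana letter."""
    out = []
    for ch in text:
        code = ord(ch)
        if KATAKANA_FIRST <= code <= KATAKANA_LAST:
            out.append(chr(code - KANA_OFFSET))
        else:
            # Everything else, including the long-vowel mark, is left as is.
            out.append(ch)
    return "".join(out)
```

If the same folding is applied both when indexing and when parsing the
query, a search typed in Hiragana can match text written in Katakana
and vice versa.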

Thanks for your cooperation.
Happy holidays!
retsu



