[Plone-AsiaPacific] Status Update Text Search for East Asian Languages
tyam at mac.com
Mon Dec 22 14:57:20 UTC 2008
Here is an update on the UnicodeSplitterPatch development status.
We held a testing session and a mini sprint in the past few weeks
and made significant progress. We are now in the testing and tuning phase.
So, please try it and give us your feedback.
How to install
0. Install subversion client.
1. Install Plone 3.1.7 for testing.
2. Download and install UnicodeSplitterPatch from subversion.
The following commands are for a standard Mac OS X installation.
svn co https://svn.plone.org/svn/collective/UnicodeSplitterPatch/trunk/UnicodeSplitterPatch
mv UnicodeSplitterPatch /Applications/Plone/zinstance/products
< For the unit testing >
3. Adjust various PATH parameters in the tests/fasttest.py file
according to your environment. You might need to change two lines or
more, typically ZOPE_HOME and INSTANCE_HOME.
4. Adjust the test assertion files. The downloaded tests/langs directory
might already contain several txt files used as test assertion files,
which were made by other people or for other languages. You might want
to remove them from your environment, since they might not be of
interest to you and could cause UNMATCH errors. An UNMATCH error in
itself is healthy to have, since that error is the one the programmers
want to know as your
5. Go to the tests directory and run the unit tests through zopepy.
6. You can change the debug mode by changing a switch on the first
line of UnicodeSplitterPatch/__init__.py. DEBUG=0 means debug mode
OFF, while DEBUG=1 means debug mode ON.
7. Please report your errors to us.
< For the integration testing >
3. Go to the instance folder.
4. Adjust the test assertion files in the same way as for unit testing.
5. Run the tests through plonectl.
bin/plonectl test -s Products/UnicodeSplitterPatch
6. Please report your errors to us.
< For the testing through the web >
3. Start Zope/Plone if you want to test on your site.
4. Rebuild the catalog index.
Open the ZMI and go to Plone/portal_catalog.
Click the Indexes tab, then scroll down and click the Reindex button.
5. Use the Plone text search box and evaluate the results.
We appreciate your feedback, both good and bad.
< How this UnicodeSplitterPatch works >
a. __init__.py overrides the classes of the original
CMFPlone.UnicodeSplitter.py so that Plone uses this patched version
of UnicodeSplitter.py.
b. If you look into the UnicodeSplitter.py code, you will find some
code range definitions. All Kanji (ideographic letters) are assigned
to the same code range regardless of whether they are Chinese,
Japanese, or Korean (CJK). So, all CJK letters are handled in the same
manner. Hiragana (Japanese), Katakana (Japanese), and Hangul (Korean)
letters are located in separate ranges. Thai is also located in a
different range. Other languages are not affected by this
splitter. As of today, this splitter works only for CJK and Thai.
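To illustrate the code range idea in item b, here is a minimal sketch in
modern Python. It is not the code from UnicodeSplitter.py. The names
SCRIPT_RANGES, classify and group_by_script are hypothetical, and the
ranges are the standard Unicode block boundaries, which may differ from
the ranges the patch actually defines.

```python
# Illustrative Unicode block boundaries (inclusive). These are assumptions
# for this sketch; the ranges in the patch's UnicodeSplitter.py may differ.
SCRIPT_RANGES = [
    ("cjk_ideograph", 0x4E00, 0x9FFF),  # Kanji/Hanzi/Hanja share one range
    ("hiragana", 0x3040, 0x309F),
    ("katakana", 0x30A0, 0x30FF),
    ("hangul", 0xAC00, 0xD7AF),
    ("thai", 0x0E00, 0x0E7F),
]


def classify(ch):
    """Return the name of the range a single character falls into."""
    code = ord(ch)
    for name, low, high in SCRIPT_RANGES:
        if low <= code <= high:
            return name
    return "other"  # characters outside the ranges are left alone


def group_by_script(text):
    """Group consecutive characters of the same range into (name, run) pairs."""
    runs = []
    for ch in text:
        name = classify(ch)
        if runs and runs[-1][0] == name:
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs
```

A splitter built this way can treat each run differently depending on
its range, and pass "other" text through to the default word splitting.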
< Normalization >
The normalizing function makes Hiragana and Katakana identical for
Japanese. These two character sets are phonetic, not ideographic.
Both sets express the same sounds but are used for different purposes.
Hiragana is used as part of ordinary Japanese writing, while Katakana
is used for words imported from foreign countries or for some old
Japanese incidents. I think some non-English languages might have
similar exotic rules, which might be difficult for non-speakers to
understand. I introduced this exotic Japanese usage to encourage other
people to implement their own exotic rules in this splitter. Your
language could need operations of its own other than normalization.
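As a rough sketch of this kind of normalization, the modern Python
function below folds Katakana into Hiragana so that both spellings
become the same string. The function name and the direction of the
folding are assumptions made for illustration; the patch may normalize
differently. It relies on the fact that in Unicode the Katakana letters
U+30A1 to U+30F6 sit exactly 0x60 above their Hiragana counterparts.

```python
KATAKANA_FIRST = 0x30A1  # small a
KATAKANA_LAST = 0x30F6   # small ke
KANA_OFFSET = 0x60       # distance between a Katakana letter and its Hiragana


def katakana_to_hiragana(text):
    """Replace each Katakana letter with the matching Hiragana letter."""
    out = []
    for ch in text:
        code = ord(ch)
        if KATAKANA_FIRST <= code <= KATAKANA_LAST:
            out.append(chr(code - KANA_OFFSET))
        else:
            out.append(ch)  # everything else is passed through unchanged
    return "".join(out)
```

Applying the same function to both the indexed text and the search
query would let a query typed in one kana set match text written in the
other.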
Thanks for your cooperation.