Wednesday, October 21, 2009

Thunderbird gets full-text indexing for CJK strings

Today Andrew landed the CJK indexing code for full-text search! So I think the next nightly build (21 Oct?) of Thunderbird 3 will support full-text search even for CJK strings.

In the original Thunderbird 3 code, full-text search used the "porter" tokenizer built into SQLite3. But this tokenizer doesn't support CJK strings, since their word-break rules differ from those of European languages. The new tokenizer, "mozporter", is a hybrid of the original porter tokenizer and a bi-gram tokenizer. If the text is CJK, it uses bi-grams. If not, it uses porter.
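
To show the hybrid idea, here is a rough Python sketch. It is not the real mozporter code (that lives inside Thunderbird); the names is_cjk, stem and tokenize are made up for this example, the CJK ranges are simplified, and stem only lowercases the word as a stand-in for real Porter stemming.

```python
def is_cjk(ch):
    """Rough CJK test (assumption: only a few common Unicode blocks)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul Syllables


def stem(word):
    """Placeholder for Porter stemming: this sketch only lowercases."""
    return word.lower()


def tokenize(text):
    """Split text into tokens: bi-grams for CJK runs, words otherwise."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if is_cjk(ch):
            # Collect the whole run of CJK characters.
            j = i
            while j < n and is_cjk(text[j]):
                j += 1
            run = text[i:j]
            if len(run) == 1:
                # A lone CJK character becomes a single token.
                tokens.append(run)
            else:
                # Overlapping two-character windows over the run.
                tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
            i = j
        elif ch.isalnum():
            # Collect a non-CJK word and pass it to the stemmer.
            j = i
            while j < n and text[j].isalnum() and not is_cjk(text[j]):
                j += 1
            tokens.append(stem(text[i:j]))
            i = j
        else:
            # Skip whitespace and punctuation.
            i += 1
    return tokens


print(tokenize("Thunderbird 全文検索"))
```

Because the bi-grams overlap, a search for a two-character CJK word can match inside a longer run without needing a dictionary to find word boundaries.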

If you find a problem or bug in global search indexing with CJK strings, please file a bug in Bugzilla and add me to the CC list (m_kato at ga2.so-net.ne.jp).
