We proposed two approaches to improve Chinese word segmentation: subword-based tagging and a confidence measure approach. We found that the former achieved better performance than the existing character-based tagging, and that the latter improved segmentation further by combining the former with dictionary-based segmentation. In addition, the latter can be used to balance out-of-vocabulary rates and in-vocabulary rates. With these techniques we achieved higher F-scores on the CITYU, PKU and MSR corpora than the best results from the Sighan Bakeoff 2005.
Ruiqiang Zhang, Genichiro Kikui, Eiichiro Sumita
Mai Fan-jin, Shitong Wu, Taoshi Cui
Liping Du, Xiaoge Li, Chunli Liu, Rui Liu, Xian Fan, Jianing Yang, Dayi Lin, Mian Wei
Fuchun Peng, Fangfang Feng, Andrew McCallum