Text comparison is an interesting but difficult task with many applications in Natural Language Processing. This work introduces a new text-similarity measure that combines named-entity information extracted from the texts with the n-gram graph model for representing documents. Using OpenCalais as a named-entity recognition service and the JINSECT toolkit for constructing and managing n-gram graphs, the text-similarity measure is embedded in a text clustering algorithm (k-Means). The evaluation of the produced clusters with various clustering validity metrics shows that extracting named entities as a first step can improve the time performance of similarity measures based on the n-gram graph representation without affecting the overall performance of the NLP task.
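As a rough illustration of the idea, the sketch below builds character n-gram graphs from the named entities of two documents and compares them. It is a minimal sketch under stated assumptions, not the JINSECT or OpenCalais API: the function names (`ngram_graph`, `value_similarity`, `entity_text`), the parameters `n=3` and `window=3`, the directed edges, and the example entity lists are all hypothetical, and the similarity shown is a simplified value-similarity-style score (the ratio of the smaller to the larger weight on shared edges, normalized by the size of the larger graph).

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a simplified character n-gram graph.

    Nodes are character n-grams; a directed edge links each n-gram to the
    n-grams that follow it within `window` positions. The edge weight is the
    number of such co-occurrences. (Assumption: this simplifies the model.)
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[(gram, grams[j])] += 1
    return edges


def value_similarity(g1, g2):
    """Score in [0, 1]: weight agreement on shared edges over the larger graph."""
    if not g1 or not g2:
        return 0.0
    common = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common)
    return total / max(len(g1), len(g2))


def entity_text(entities):
    """Join a document's named entities into one string (hypothetical step)."""
    return " ".join(sorted(entities))


# Hypothetical entity lists, standing in for the output of an NER service.
doc_a = ["Barack Obama", "Athens", "European Union"]
doc_b = ["Barack Obama", "Brussels", "European Union"]

graph_a = ngram_graph(entity_text(doc_a))
graph_b = ngram_graph(entity_text(doc_b))
print(value_similarity(graph_a, graph_b))
```

Because the graphs are built only from the entity strings rather than the full text, they are much smaller, which is where a time saving would be expected. A distance such as `1 - value_similarity(...)` could then be passed to a clustering routine.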
Tom De Nies, Christian Beecks, Wesley De Neve, Thomas Seidl, Erik Mannens, Rik Van de Walle
Maxime Deforche, Ilse De Vos, Antoon Bronselaer, Guy De Tré
Shikha Chaudhary, H. Vyas, N. Arora, Sejal D’Mello
Ben Hachey, Will Radford, James Curran