Using Roget’s Thesaurus to Determine the Similarity of Texts

A Thesis in Computational Linguistics

LAP Lambert Academic Publishing ( 2010-01-14 )

This thesis addresses the problem of extracting a representation of a text's meaning from its content. The solution investigated uses Roget's Thesaurus as an external knowledge source and can be applied to texts of any length or complexity. The resulting document representation can then be compared with others, yielding a new method for assessing text similarity. All coherent texts contain embedded sequences of words that are related in meaning. These sequences can be detected by identifying simple relationships between the thesaurus entries in which the words are found. Once initial sequences have been identified, further related words are added to them, forming conceptually related "lexical chains". Every coherent text contains many lexical chains of different lengths and strengths, and these may be used to represent the broad subject matter of the text. By identifying the key concept of each chain and relating it to the chain's presence in the text, we may produce an attribute-value vector of concepts and their strengths. This vector may then be used to judge other texts as closer to or further from the original in meaning.
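As a rough illustration of the idea described above, the following minimal Python sketch groups words into chains, turns the chains into a concept-strength vector, and compares two vectors. It is not the method developed in the thesis. The dictionary `TOY_THESAURUS` is a hypothetical stand-in for Roget's Thesaurus, words are chained only when they share a category (a much simpler relationship than real thesaurus links), chain strength is assumed to be a plain word count, and cosine similarity is an assumed choice of comparison measure.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy thesaurus mapping a word to one category label.
# This is illustrative data, not taken from Roget's Thesaurus.
TOY_THESAURUS = {
    "ship": "travel", "voyage": "travel", "harbour": "travel",
    "bank": "money", "loan": "money", "interest": "money",
    "doctor": "health", "nurse": "health",
}


def concept_vector(text):
    """Return {concept: strength} for the chains found in `text`.

    Assumption: words sharing a category form one chain, and the
    strength of a chain is the number of word occurrences in it.
    """
    chains = defaultdict(list)
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        category = TOY_THESAURUS.get(word)
        if category is not None:
            chains[category].append(word)
    # A single isolated word is not treated as a chain.
    return {c: float(len(ws)) for c, ws in chains.items() if len(ws) >= 2}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse concept vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


text_a = "The ship left the harbour on a long voyage."
text_b = "The bank raised interest on the loan."
text_c = "A voyage by ship."

vec_a = concept_vector(text_a)  # {'travel': 3.0}
vec_b = concept_vector(text_b)  # {'money': 3.0}
vec_c = concept_vector(text_c)  # {'travel': 2.0}

print(cosine_similarity(vec_a, vec_c))  # 1.0: same single concept
print(cosine_similarity(vec_a, vec_b))  # 0.0: no shared concept
```

In this toy setting the two texts about sea travel score as identical in subject matter and the text about banking as unrelated. A realistic version would need the full thesaurus structure and a way to handle words that appear under several categories.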

Book Details:

ISBN-13: 978-3-8383-3840-8
ISBN-10: 3838338405
EAN: 9783838338408
Book language: English
Author: Jeremy Ellman
Number of pages: 228
Published on: 2010-01-14
Category: Informatics, IT