Automated coherence detection with term-distance path extraction of the co-occurrence matrix of a document


Thesis Type: Postgraduate

Institution Of The Thesis: Middle East Technical University, Turkey

Approval Date: 2015

Thesis Language: English

Student: Halil Ağın

Supervisor: CENGİZ ACARTÜRK

Open Archive Collection: AVESIS Open Access Collection

Abstract:

This thesis adopts distributional (frequency-based) semantics as the theoretical framework for quantifying textual coherence. Distributional semantics represents discourse sections as vectors whose dimensions are the frequency counts of co-occurring words in the text within its semantic space. It quantifies textual coherence by measuring the cosine of the vectors of successive sentences (cf. Latent Semantic Analysis, LSA). The common assumption underlying LSA-based studies is that the frequency of word co-occurrence can be used as a cohesive cue to quantify textual coherence, which leads to analyses based on a term-document matrix. This thesis considers the spatial distance of co-occurring words as a new frequency event of cohesive cues and introduces a document-distance matrix, which is derived from the term-document matrix. It proposes that the document-distance matrix representation of co-occurring words in adjacent sentences of a text can be used to quantify textual coherence. Two mathematical functions are suggested for deriving the document-distance matrix, along with two algorithms for the operations. The mathematical functions operate on the document-document matrix (a derivation of the term-document matrix) to derive the document-distance matrix. The algorithms measure the coherence of a text by operating on the newly introduced document-distance matrices.
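
The baseline idea mentioned in the abstract, measuring the cosine between vectors of successive sentences, can be illustrated with a minimal sketch. This is not the method proposed in the thesis: it uses raw term counts rather than an LSA-reduced space or a document-distance matrix, and the function names (`sentence_vector`, `cosine`, `adjacent_cosines`, `coherence_score`), the tokenizer, and the use of the mean as an aggregate are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Term-frequency vector of a sentence (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty sentence shares nothing with its neighbour
    return dot / (norm_u * norm_v)


def adjacent_cosines(sentences):
    """Cosine values for each pair of successive sentences."""
    vectors = [sentence_vector(s) for s in sentences]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


def coherence_score(sentences):
    """Mean adjacent-sentence cosine (an assumed aggregate, for illustration)."""
    sims = adjacent_cosines(sentences)
    return sum(sims) / len(sims) if sims else 0.0
```

Under these assumptions, a text whose neighbouring sentences share many words scores close to 1, and a text whose neighbouring sentences share none scores 0. The thesis replaces this co-occurrence frequency cue with the distance between co-occurring words, which this sketch does not model.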