Information Geometry and Document Classification
Dr Guy Lebanon
Department of Statistics and School of Electrical and Computer Engineering,
Date:
Host: R. Jin
Abstract: The task of classifying documents according to topic is
traditionally based on extracting features and treating them as points in a
Euclidean space, equipped
with Euclidean geometry. We argue that this may be improved upon by examining a
more appropriate geometry for text documents, and adapting classification
models to this geometry. By embedding documents in the multinomial simplex, we
identify a canonical geometry for them: the Fisher geometry on the multinomial
simplex. Adapting popular classification models, such as radial basis function
support vector machines and logistic regression, to the Fisher geometry
improves on the state of the art in text classification.
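
To make the embedding described above concrete, the following is a minimal sketch and not the speaker's implementation. It assumes documents are mapped to the simplex by normalizing smoothed term counts, and it uses the standard closed form for the Fisher geodesic distance on the multinomial simplex, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). The function names, the smoothing constant, and the kernel parameter t are hypothetical choices made for illustration; the kernel is simply a radial-basis-style kernel with the Euclidean distance swapped for the geodesic distance.

```python
import math
from collections import Counter


def embed_in_simplex(tokens, vocabulary, smoothing=1.0):
    """Map a token list to a point in the multinomial simplex.

    Assumption: additive smoothing keeps every coordinate strictly
    positive, so the point lies in the interior of the simplex.
    """
    counts = Counter(tokens)
    weights = [counts[word] + smoothing for word in vocabulary]
    total = sum(weights)
    return [w / total for w in weights]


def fisher_geodesic_distance(p, q):
    """Geodesic distance under the Fisher metric on the simplex.

    The map p -> 2 * sqrt(p) sends the simplex onto part of a sphere of
    radius 2, so the distance is twice the angle between sqrt(p) and sqrt(q).
    """
    inner = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to guard against floating-point values slightly outside [0, 1].
    inner = min(1.0, max(0.0, inner))
    return 2.0 * math.acos(inner)


def geodesic_rbf_kernel(p, q, t=0.5):
    """Radial-basis-style kernel using the geodesic distance (illustrative)."""
    d = fisher_geodesic_distance(p, q)
    return math.exp(-d * d / (4.0 * t))


if __name__ == "__main__":
    vocabulary = ["geometry", "simplex", "kernel", "football"]
    doc_a = embed_in_simplex("geometry simplex geometry kernel".split(), vocabulary)
    doc_b = embed_in_simplex("football football kernel".split(), vocabulary)
    print(fisher_geodesic_distance(doc_a, doc_b))
    print(geodesic_rbf_kernel(doc_a, doc_b))
```

Such a kernel function could then be passed to any classifier that accepts a user-supplied kernel; whether it matches the models presented in the talk is not something this announcement confirms.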
If time
remains, I will discuss an extension of Cencov's
theorem for spaces of conditional models and a novel geometric representation
for documents that moves beyond the standard bag-of-words assumption.
Biography: Guy Lebanon is an
Assistant Professor of Statistics and ECE at