Title: Pattern Extraction and Clustering for High-Dimensional Discrete Data
Speaker:
Alice Peng Jiang
University of Illinois
Time&Location:
Wednesday, March 12, 10am in EB3105
Abstract:
With the advent of big data, data analysis has replaced data acquisition as the new bottleneck.
Pattern extraction and clustering are two essential methods for data analysis. In many
application areas, big data has high dimensionality and discrete features, where data values
are restricted, for example, to be binary (i.e., only values 0 and 1 are permitted). Such
problems pose significant challenges in data analysis. Methods for analyzing continuous data
are often ineffective for discrete datasets, as such methods usually produce non-discrete
values, which may be difficult to interpret and may require excessive memory to store. Thus we
need algorithms designed especially for discrete data.
Abstract:
This talk focuses on compressing data with minimum loss of accuracy via low-rank matrix
factorizations, which greatly reduces the time for solving problems in data mining and machine
learning. Specifically, matrix factorization produces two matrix factors, one finding the
dominant patterns and the other indicating the cluster assignment of the original patterns. We
have developed a framework in which matrix factorization problems are reformulated as variants
of clustering problems, for which we proposed novel algorithms and proved their constant
approximation bounds. These are very nice properties, as the original discrete problems are
NP-hard and approximation algorithms that provide close bounds to exact solutions are highly
desired. In addition to theoretical verification, we also perform experimental evaluation on
various applications in pattern extraction, transaction data mining, recommender systems,
bi-cluster discovery in gene expression data, document clustering, and social network mining.