Some topics on model-based clustering

Date

2016

Authors

Wang, Lulu, author
Hoeting, Jennifer, advisor
Zhou, Wen, advisor
Wang, Haonan, committee member
Laituri, Melinda, committee member

Abstract

Cluster analysis is widely applied across many fields. Model-based clustering, which assumes a mixture model, is one of the most useful approaches to clustering. Using model-based clustering, we can make statistical inferences and obtain uncertainty estimates for parameters and cluster assignments. Traditional model-based clustering methods often assume a Gaussian mixture model, which may perform poorly in real applications, for example on heavy-tailed data. Several nonparametric and semiparametric mixture models, which assume that the variables are independent in order to ensure parameter identifiability, have been studied in recent years. In this dissertation, we propose two new methods for model-based clustering. The first method, semiparametric model-based clustering (SPM-clust), is based on a nonparanormal distribution for each cluster. The method accounts for correlations between variables while maintaining parameter identifiability under mild assumptions. By modeling the dependence between variables and relaxing the normality assumption, the proposed method is shown via simulations to outperform commonly used clustering methods, especially on non-Gaussian data. The second method is particularly useful for clustering high-dimensional data. The classical mixture model approach cannot cluster high-dimensional data because of the curse of dimensionality. Moreover, identifying the variables that are important for separating unlabeled observations into homogeneous groups plays a critical role in dimension reduction and in modeling data with complex structures. This problem is directly related to selecting informative variables in cluster analysis, where a small fraction of the variables is identified for separating the observed vectors X_i ∈ R^p, i = 1, ..., n, into K possible classes.
Using the framework of model-based clustering, we introduce the PAirwise Reciprocal fuSE (PARSE) procedure, based on a new class of penalization functions that imposes infinite penalties on variables with small differences across clusters. PARSE effectively avoids selecting an overly dense set of variables for separating observations into clusters. We establish the consistency of the proposed procedure for identifying informative variables in cluster analysis, and we show that it enjoys certain optimality properties. We develop a backward selection algorithm, used in conjunction with the EM algorithm, to implement PARSE. Simulation studies show that PARSE performs competitively with other popular model-based clustering methods, selecting a sparse set of variables and producing accurate clustering results. We apply PARSE to microarray data on human asthma and discuss the biological implications of the results. We develop an R package, PARSE, available on CRAN, which implements regularization methods for model-based clustering, including PARSE.
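
For readers unfamiliar with the mixture-model framework the abstract builds on, the following is a minimal sketch of EM for a one-dimensional Gaussian mixture, the traditional baseline the abstract mentions. It is not SPM-clust or PARSE, and it is not taken from the dissertation or the PARSE R package. The function name `em_gaussian_mixture_1d`, the quantile-based initialisation, and the fixed iteration count are all illustrative assumptions. The returned responsibilities are the posterior membership probabilities that give model-based clustering its uncertainty estimates for cluster assignments.

```python
import numpy as np

def em_gaussian_mixture_1d(x, k=2, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Assumed initialisation: means at evenly spaced quantiles,
    # a shared starting variance, and equal mixing weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        diff = x[:, None] - means[None, :]
        log_dens = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances)
        log_post = np.log(weights) + log_dens
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of weights, means, variances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        diff = x[:, None] - means[None, :]
        variances = (resp * diff**2).sum(axis=0) / nk + 1e-8  # guard against collapse
    # resp holds the responsibilities from the final E-step.
    return weights, means, variances, resp
```

Calling `em_gaussian_mixture_1d(x)` on a one-dimensional sample returns the mixing weights, component means, component variances, and an n-by-k matrix of membership probabilities. Taking `resp.argmax(axis=1)` gives hard cluster labels, while the rows of `resp` themselves quantify how uncertain each assignment is.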
