
A geometric data analysis approach to dimension reduction in machine learning and data mining in medical and biological sensing

Date

2017

Authors

Emerson, Tegan Halley, author
Kirby, Michael, advisor
Peterson, Chris, advisor
Nyborg, Jennifer, committee member
Chenney, Margaret, committee member

Abstract

Geometric data analysis seeks to uncover and leverage structure in data for tasks in machine learning when the data is viewed as points in an abstract space of some dimension. This dissertation considers data which is high dimensional with respect to varied notions of dimension. The algorithms developed herein seek to reduce or estimate dimension while preserving the ability to perform a specific task in detection, identification, or classification. In some of the applications the only property that must be preserved under dimension reduction is the ability to perform the indicated machine learning task, while in others strictly geometric relationships between data points must be preserved or minimized.

First, we present a numerical representation of rare circulating cells in immunofluorescent images. This representation is paired with a support vector machine and is able to identify cell structure that differentiates the cell populations under consideration. Moreover, this differentiating information can be visualized through inversion of the representation and was found to be consistent with the classification criteria used by clinically trained pathologists.

Second, we consider the task of identifying and tracking aerosolized bioagents via a multispectral lidar system. A nonnegative matrix factorization problem arises out of this data mining task; it can be solved in several ways, including as an ℓ1-norm regularized, convex but nondifferentiable optimization problem. Existing methodologies achieve excellent results when the internal matrix factor dimension is known, but fail or become computationally prohibitive when this dimension is not known. A modified optimization problem is proposed that may help reveal the appropriate internal factoring dimension based on the sparsity of averages of nonnegative values.

Third, we present an algorithmic framework for reducing dimension in the linear mixing model. The mean-squared error of a statistical estimator of a component of the linear mixing model can be considered as a function of the rank of different estimating matrices. We seek to minimize mean-squared error as a function of the rank of the appropriate estimating matrix, which yields interesting order-determination rules and improved results, relative to full-rank counterparts, in applications to matched subspace detection and generalized modal analysis.

Finally, the culminating work of this dissertation explores the existence of nearly isometric, dimension-reducing mappings between special manifolds characterized by different dimensions. Understanding the analogous problem between Euclidean spaces provides insight into the potential challenges and pitfalls one could encounter in proving the existence of such mappings. The most significant of the contributions is the statement and proof of a theorem establishing a connection between packing problems on Grassmannian manifolds and nearly isometric mappings between Grassmannians.

The frameworks and algorithms constructed and developed in this doctoral research consider multiple manifestations of the notion of dimension. Across applications arising from varied areas of medical and biological sensing, we have shown that there are great benefits to taking a geometric perspective on challenges in machine learning and data mining.
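
To make the lidar formulation above concrete, a minimal sketch of a standard ℓ1-regularized nonnegative factorization (illustrative notation, not necessarily the exact objective used in the dissertation) fixes a nonnegative signature matrix W and recovers the abundance matrix H from the observed data X by solving

\[
\min_{H \ge 0} \; \tfrac{1}{2}\,\lVert X - W H \rVert_F^2 \;+\; \lambda \,\lVert H \rVert_1 ,
\]

which is convex in H but nondifferentiable because of the ℓ1 penalty; the weight λ trades data fidelity against the sparsity of the recovered abundances.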
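
The rank-based dimension reduction in the linear mixing model can be summarized, under the assumption that \(\hat{\theta}_r\) denotes the estimator built from a rank-r estimating matrix (generic notation chosen here for illustration), as an order-determination rule of the form

\[
r^{*} \;=\; \arg\min_{r}\; \operatorname{MSE}(r), \qquad \operatorname{MSE}(r) \;=\; \mathbb{E}\,\bigl\lVert \hat{\theta}_r - \theta \bigr\rVert_2^{2},
\]

so that the minimizing rank, rather than the full rank, is carried into downstream tasks such as matched subspace detection.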
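
For the final contribution, one common way to formalize a nearly isometric, dimension-reducing map between Grassmannians (an illustrative definition, not the dissertation's precise statement) is to require a map F from Gr(k, n) to Gr(k, m), with m < n, to satisfy

\[
(1-\epsilon)\, d(X, Y) \;\le\; d\bigl(F(X), F(Y)\bigr) \;\le\; (1+\epsilon)\, d(X, Y) \quad \text{for all } X, Y \in \operatorname{Gr}(k, n),
\]

where d is a metric on the Grassmannian, for example the chordal distance, and ε quantifies the departure from an exact isometry.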

Subject

dimension reduction
Grassmannian manifold
data mining
machine learning
geometric data analysis
