Department of Statistics
These digital collections include theses, dissertations, and datasets from the Department of Statistics.
Browsing Department of Statistics by Title
Now showing 1 - 20 of 87
Item Open Access
A fiducial approach to extremes and multiple comparisons (Colorado State University. Libraries, 2010) Wandler, Damian V., author; Hannig, Jan, advisor; Iyer, Hariharan K., advisor; Chong, Edwin Kah Pin, committee member; Wang, Haonan, committee member
Generalized fiducial inference is a powerful tool for many difficult problems. Based on an extension of R. A. Fisher's work, we used generalized fiducial inference for two extreme value problems and a multiple comparison procedure. The first extreme value problem deals with the generalized Pareto distribution, which is relevant to many situations when modeling extremes of random variables. We use a fiducial framework to perform inference on the parameters and the extreme quantiles of the generalized Pareto distribution. This inference technique is demonstrated both when the threshold is a known parameter and when it is unknown. Simulation results suggest good empirical properties, and the method compares favorably to similar Bayesian and frequentist methods. The second extreme value problem pertains to the largest mean of a multivariate normal distribution. Difficulties arise when two or more of the means are simultaneously the largest. Our solution uses a generalized fiducial distribution and allows for equal largest means to alleviate the overestimation that commonly occurs. Theoretical calculations, simulation results, and application suggest that our solution possesses promising asymptotic and empirical properties. Our solution to the largest mean problem arose from our ability to identify the correct largest mean(s); this is essentially a model selection problem. As a result, we applied a similar model selection approach to the multiple comparison problem, allowing for all possible groupings (of equality) of the means of k independent normal distributions.
Our resulting fiducial probability for the groupings of the means demonstrates the effectiveness of our method by selecting the correct grouping at a high rate.

Item Open Access
A penalized estimation procedure for varying coefficient models (Colorado State University. Libraries, 2015) Tu, Yan, author; Wang, Haonan, advisor; Breidt, F. Jay, committee member; Chapman, Phillip, committee member; Luo, J. Rockey, committee member
Varying coefficient models are widely used for analyzing longitudinal data. Various methods for estimating coefficient functions have been developed over the years. We revisit the problem under the theme of functional sparsity. The problem of sparsity, including global sparsity and local sparsity, is a recurrent topic in nonparametric function estimation. A function has global sparsity if it is zero over the entire domain, indicating that the corresponding covariate is irrelevant to the response variable. A function has local sparsity if it is nonzero overall but remains zero on a set of intervals, identifying inactive periods of the corresponding covariate. Each type of sparsity has been addressed in the literature using the idea of regularization to improve estimation as well as interpretability. In this dissertation, a penalized estimation procedure is developed to achieve functional sparsity, that is, to address both types of sparsity simultaneously in a unified framework. We exploit the properties of B-spline approximation and group bridge penalization. Our method is illustrated in a simulation study and a real data analysis, and it outperforms existing methods in identifying both local and global sparsity. Asymptotic properties of estimation consistency and sparsistency of the proposed method are established.
The term sparsistency refers to the property that functional sparsity can be consistently detected.

Item Open Access
Adjusting for capture, recapture, and identity uncertainty when estimating detection probability from capture-recapture surveys (Colorado State University. Libraries, 2015) Edmondson, Stacy L., author; Givens, Geof, advisor; Opsomer, Jean, committee member; Kokoszka, Piotr, committee member; Noon, Barry, committee member
When applying capture-recapture analysis methods, estimates of detection probability, and hence abundance, can be biased if individuals of a population are not correctly identified (Creel et al., 2003). My research, motivated by the 2010 and 2011 surveys of Western Arctic bowhead whales conducted off the shores of Barrow, Alaska, offers two methods for addressing the complex scenario in which one individual may be mistaken for another from the same population, creating erroneous recaptures. The first method uses a likelihood-weighted capture-recapture approach to account for three sources of uncertainty in the matching process. I illustrate this approach with a detailed application to the whale data. The second method develops an explicit model for match errors and uses MCMC methods to estimate model parameters. Implementation of this approach must overcome significant hurdles posed by the enormous number and complexity of potential catch history configurations when matches are uncertain. The performance of this approach is evaluated using a large set of Monte Carlo simulation tests. Results of these tests vary from good to weak, depending on factors including detection probability, number of sightings, and error rates.
Finally, this model is applied to a portion of the bowhead survey data and found to produce plausible and scientifically informative results as long as the MCMC algorithm is started at a reasonable point in the space of possible catch history configurations.

Item Open Access
Advances in statistical analysis and modeling of extreme values motivated by atmospheric models and data products (Colorado State University. Libraries, 2018) Fix, Miranda J., author; Cooley, Daniel, advisor; Hoeting, Jennifer, committee member; Wilson, Ander, committee member; Barnes, Elizabeth, committee member
This dissertation presents applied and methodological advances in the statistical analysis and modeling of extreme values. We detail three studies motivated by the types of data found in the atmospheric sciences, such as deterministic model output and observational products. The first two investigations represent novel applications and extensions of extremes methodology to climate and atmospheric studies. The third investigation proposes a new model for areal extremes and develops methods for estimation and inference from the proposed model. We first detail a study that leverages two initial condition ensembles of a global climate model to compare future precipitation extremes under two climate change scenarios. We fit non-stationary generalized extreme value (GEV) models to annual maximum daily precipitation output and compare impacts under the RCP8.5 and RCP4.5 scenarios. A methodological contribution of this work is to demonstrate the potential of a "pattern scaling" approach for extremes, in which we produce predictive GEV distributions of annual precipitation maxima under RCP4.5 given only global mean temperatures for this scenario. We compare results from this less computationally intensive method to those obtained from our GEV model fitted directly to the RCP4.5 output and find that pattern scaling produces reasonable projections.
The second study examines, for the first time, the capability of an atmospheric chemistry model to reproduce observed meteorological sensitivities of high and extreme surface ozone (O3). This work develops a novel framework in which we make three types of comparisons between simulated and observational data, comparing (1) tails of the O3 response variable, (2) distributions of meteorological predictor variables, and (3) sensitivities of high and extreme O3 to meteorological predictors. This last comparison is made using quantile regression and a recent tail dependence optimization approach. Across all three study locations, we find substantial differences between simulations and observational data in both meteorology and meteorological sensitivities of high and extreme O3. The final study is motivated by the prevalence of large gridded data products in the atmospheric sciences and presents methodological advances in the (finite-dimensional) spatial setting. Existing models for spatial extremes, such as max-stable process models, tend to be geostatistical in nature and very computationally intensive. Instead, we propose a new model for extremes of areal data, with a common-scale extension, inspired by the simultaneous autoregressive (SAR) model in classical spatial statistics. The proposed model extends recent work on transformed-linear operations applied to regularly varying random vectors and is unique among extremes models in being directly analogous to a classical linear model. We specify a sufficient condition on the spatial dependence parameter such that our extreme SAR model has desirable properties. We also describe the limiting angular measure, which is discrete, and the corresponding tail pairwise dependence matrix (TPDM) for the model. After examining model properties, we investigate two approaches to estimation and inference for the common-scale extreme SAR model.
First, we consider a censored likelihood approach, implemented using Bayesian MCMC with a data augmentation step, but find that it is not robust to model misspecification. As an alternative, we develop a novel estimation method that minimizes the discrepancy between the TPDM of the fitted model and the estimated TPDM, and find that it produces reasonable estimates of extremal dependence even under model misspecification.

Item Open Access
Analysis of structured data and big data with application to neuroscience (Colorado State University. Libraries, 2015) Sienkiewicz, Ela, author; Wang, Haonan, advisor; Meyer, Mary, committee member; Breidt, F. Jay, committee member; Hayne, Stephen, committee member
Neuroscience research leads to a remarkable set of statistical challenges, many of them due to the complexity of the brain, its intricate structure, and its dynamical, non-linear, often non-stationary behavior. The challenge of modeling brain functions is magnified by the quantity and inhomogeneity of data produced by scientific studies. Here we show how to take advantage of advances in distributed and parallel computing to mitigate memory and processor constraints and develop models of neural components and neural dynamics. First, we consider the problem of function estimation and selection in time-series functional dynamical models. Our motivating application concerns the point-process spiking activities recorded from the brain, which pose major computational challenges for modeling even moderately complex brain functionality. We present a big data approach to the identification of sparse nonlinear dynamical systems using generalized Volterra kernels and their approximation with B-spline basis functions. The performance of the proposed method is demonstrated in experimental studies. We also consider a set of unlabeled tree objects with topological and geometric properties.
For each data object, two curve representations are developed to characterize its topological and geometric aspects. We further define notions of topological and geometric medians, as well as quantiles, based on both representations. In addition, we take a novel approach to defining Pareto medians and quantiles through a multi-objective optimization problem. In particular, we study two different objective functions, which measure topological variation and geometric variation, respectively. Analytical solutions are provided for topological and geometric medians and quantiles; in general, a genetic algorithm is implemented for Pareto medians and quantiles. The proposed methods are applied to analyze a data set of pyramidal neurons.

Item Open Access
Application of statistical and deep learning methods to power grids (Colorado State University. Libraries, 2023) Rimkus, Mantautas, author; Kokoszka, Piotr, advisor; Wang, Haonan, advisor; Nielsen, Aaron, committee member; Cooley, Dan, committee member; Chen, Haonan, committee member
The structure of power flows in transmission grids is evolving and is likely to change significantly in the coming years due to the rapid growth of renewable energy generation, which introduces randomness and bidirectional power flows. Another transformative aspect is the increasing penetration of various smart-meter technologies. Inexpensive measurement devices can be placed at practically any component of the grid. As a result, traditional fault detection methods may no longer be sufficient. Consequently, there is a growing interest in developing new methods to detect power grid faults. Using model data, we first propose a two-stage procedure for detecting a fault in a regional power grid. In the first stage, a fault is detected in real time. In the second stage, the faulted line is identified with a negligible delay. The approach uses only the voltage modulus measured at buses (nodes of the grid) as the input.
Our method does not require prior knowledge of the fault type. We further explore fault detection based on the high-frequency data streams that are becoming available in modern power grids. Our approach can be treated as an online (sequential) change point monitoring methodology. However, due to the mostly unexplored and very nonstandard structure of high-frequency power grid streaming data, substantial new statistical development is required to make this methodology practically applicable. The work includes the development of scalar detectors based on multichannel data streams, the determination of data-driven alarm thresholds, and an investigation of the performance and robustness of the new tools. Because we have a reasonably large database of faults, we can calculate the frequencies of false and correct fault signals and recommend implementations that optimize these empirical success rates. Next, we extend our proposed fault localization method for a regional grid to scenarios where partial observability limits the available data. While classification methods have been proposed for fault localization, their effectiveness depends on the availability of labeled data, which is often impractical in real-life situations. Our approach bridges the gap between partial and full observability of the power grid. We develop efficient fault localization methods that can operate effectively even when data are available for only a subset of power grid buses. This work contributes to the research area of fault diagnosis in scenarios where the number of available phasor measurement unit devices is smaller than the number of buses in the grid. We propose using Graph Neural Networks in combination with statistical fault localization methods to localize faults in a regional power grid with minimal available data.
Our contribution to the field of fault localization aims to enable the adoption of effective fault localization methods for future power grids.

Item Open Access
Applications of generalized fiducial inference (Colorado State University. Libraries, 2009) E, Lidong, author; Iyer, Hariharan K., advisor
Hannig (2008) generalized Fisher's fiducial argument and obtained a fiducial recipe for interval estimation that is applicable in virtually any situation. In this dissertation research, we apply this fiducial recipe and the fiducial generalized pivotal quantity to make inference in four practical problems. The problems we consider are (a) confidence intervals for variance components in an unbalanced two-component normal mixed linear model; (b) confidence intervals for the median lethal dose (LD50) in bioassay experiments; (c) confidence intervals for the concordance correlation coefficient (CCC) in method comparison; and (d) simultaneous confidence intervals for ratios of means of lognormal distributions. For all the fiducial generalized confidence intervals (a)-(d), we conducted a simulation study to evaluate their performance and compare them with competing confidence interval procedures from the literature. We also proved that the intervals (a) and (d) have asymptotically exact frequentist coverage.

Item Open Access
Bayesian methods for environmental exposures: mixtures and missing data (Colorado State University. Libraries, 2022) Hoskovec, Lauren, author; Wilson, Ander, advisor; Magzamen, Sheryl, committee member; Hoeting, Jennifer, committee member; Cooley, Dan, committee member
Air pollution exposure has been linked to increased morbidity and mortality. Estimating the association between air pollution exposure and health outcomes is complicated by simultaneous exposure to multiple pollutants, referred to as a multipollutant mixture. In a multipollutant mixture, exposures may have both independent and interactive effects on health.
In addition, observational studies of air pollution exposure often involve missing data. In this dissertation, we address challenges related to model choice and missing data when studying exposure to a mixture of environmental pollutants. First, we conduct a formal simulation study of recently developed methods for estimating the association between a health outcome and exposure to a multipollutant mixture. We evaluate methods on their performance in estimating the exposure-response function, identifying mixture components associated with the outcome, and identifying interaction effects. Other studies have reviewed the literature or compared performance on a single data set; however, none have formally compared such a broad range of new methods in a simulation study. Second, we propose a statistical method to analyze multiple asynchronous multivariate time series with missing data for use in personal exposure assessments. We develop an infinite hidden Markov model for multiple time series to impute missing data and identify shared time-activity patterns in exposures. We estimate hidden states that represent latent environments presenting a unique distribution of a mixture of environmental exposures. Through our multiple imputation algorithm, we impute missing exposure data conditional on the hidden states. Finally, we conduct an individual-level study of the association between long-term exposure to air pollution and COVID-19 severity in a Denver, Colorado, USA cohort. We develop a Bayesian multinomial logistic regression model for data with partially missing categorical outcomes. Our model uses Polya-gamma data augmentation, and we propose a visualization approach for inference on the odds ratio. 
We conduct one of the first individual-level studies of air pollution exposure and COVID-19 health outcomes using detailed clinical data and individual-level air pollution exposure data.

Item Open Access
Bayesian methods for spatio-temporal ecological processes using imagery data (Colorado State University. Libraries, 2021) Lu, Xinyi, author; Hooten, Mevin, advisor; Kaplan, Andee, committee member; Fosdick, Bailey, committee member; Koons, David, committee member
In this dissertation, I present novel Bayesian hierarchical models to statistically characterize spatio-temporal ecological processes. I am motivated by the volatility of Alaskan ecosystems in the face of global climate change, and I demonstrate methods for emerging imagery data as survey technologies advance. For the nearshore marine ecosystem, I developed a model that combines ecological diffusion and logistic growth to quantify the colonization dynamics of a population that establishes long-term equilibrium over a heterogeneous environment. I also unified modeling concepts from entity resolution and capture-recapture to identify unique individuals of the population from overlapping images and infer total abundance. For the terrestrial ecosystem, I developed a stochastic state-space model to quantify the impact of climate change on the structural transformation of land cover types. The methods presented in this dissertation provide interpretable inference and employ statistical computing strategies to achieve scalability.

Item Open Access
Bayesian models and streaming samplers for complex data with application to network regression and record linkage (Colorado State University. Libraries, 2023) Taylor, Ian M., author; Kaplan, Andee, advisor; Fosdick, Bailey K., advisor; Keller, Kayleigh P., committee member; Koslovsky, Matthew D., committee member; van Leeuwen, Peter Jan, committee member
Real-world statistical problems often feature complex data due to either the structure of the data itself or the methods used to collect the data. In this dissertation, we present three methods for the analysis of specific complex data: Restricted Network Regression, Streaming Record Linkage, and Generative Filtering. Network data contain observations about the relationships between entities. Applying mixed models to network data can be problematic when the primary interest is estimating unconditional regression coefficients and some covariates are exactly or nearly in the vector space of node-level effects. We introduce the Restricted Network Regression model, which removes the collinearity between fixed and random effects in network regression by orthogonalizing the random effects against the covariates. We discuss the change in the interpretation of the regression coefficients in Restricted Network Regression and analytically characterize its effect on the regression coefficients for continuous response data. We show through simulation on continuous and binary data that Restricted Network Regression mitigates, but does not eliminate, network confounding. We apply the Restricted Network Regression model in an analysis of 2015 Eurovision Song Contest voting data and show how the choice of regression model affects inference. Data collected from multiple noisy sources pose challenges to analysis due to potential errors and duplicates. Record linkage is the task of combining records from multiple files that refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time, and estimates of links are updated after the arrival of each file.
We approach streaming record linkage from a Bayesian perspective, with estimates calculated from posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. We generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Motivated by the streaming data setting and streaming record linkage, we propose a more general sampling method for Bayesian models for streaming data. In the streaming data setting, Bayesian models can employ recursive updates, incorporating each new batch of data into the model parameters' posterior distribution. Filtering methods are currently used to perform these updates efficiently; however, they suffer from eventual degradation as the number of unique values within the filtered samples decreases. We propose Generative Filtering, a method for efficiently performing recursive Bayesian updates in the streaming setting. Generative Filtering retains the speed of a filtering method while using parallel updates to avoid degenerate distributions after repeated applications. We derive rates of convergence for Generative Filtering and conditions for the use of sufficient statistics instead of storing all past data. We investigate properties of Generative Filtering through simulation and ecological species count data.

Item Open Access
Bayesian shape-restricted regression splines (Colorado State University. Libraries, 2011) Hackstadt, Amber J., author; Hoeting, Jennifer, advisor; Meyer, Mary, advisor; Opsomer, Jean, committee member; Huyvaert, Kate, committee member
Semi-parametric and non-parametric function estimation are useful tools to model the relationship between design variables and response variables, as well as to make predictions, without requiring the assumption of a parametric form for the regression function. Additionally, Bayesian methods have become increasingly popular in statistical analysis since they provide a flexible framework for the construction of complex models and produce a joint posterior distribution for the coefficients that allows for inference through various sampling methods. We use non-parametric function estimation and a Bayesian framework to estimate regression functions with shape restrictions. Shape-restricted functions include functions that are monotonically increasing, monotonically decreasing, convex, concave, and combinations of these restrictions, such as increasing and convex. Shape restrictions allow researchers to incorporate knowledge about the relationship between variables into the estimation process. We propose Bayesian semi-parametric models for regression analysis under shape restrictions that use a linear combination of shape-restricted regression splines such as I-splines or C-splines. We find function estimates using Markov chain Monte Carlo (MCMC) algorithms. The Bayesian framework, along with MCMC, allows us to perform model selection and produce uncertainty estimates much more easily than in the frequentist paradigm. Indeed, some of the work proposed in this dissertation has not been developed in parallel in the frequentist paradigm. We begin by proposing a semi-parametric generalized linear model for regression analysis under shape restrictions. We provide Bayesian shape-restricted regression spline (Bayes SRRS) models and MCMC estimation algorithms for the normal errors, Bernoulli, and Poisson models.
We propose several types of inference that can be performed for the normal errors model, and we examine the asymptotic behavior of the estimates for the normal errors model under the monotone shape restriction. We also examine the small-sample behavior of the proposed Bayes SRRS model estimates via simulation studies. We then extend the semi-parametric Bayesian shape-restricted regression splines to generalized linear mixed models, providing an MCMC algorithm to estimate functions for the random intercept model with normal errors under the monotone shape restriction. We further extend the approach to allow the number and location of the knot points for the regression splines to be random, and propose a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm for regression function estimation under the monotone shape restriction. Lastly, we propose a Bayesian shape-restricted regression spline change-point model where the regression function is shape-restricted except at the change-points. We provide RJMCMC algorithms to estimate functions with change-points where the number and location of interior knot points for the regression splines are random. We provide an RJMCMC algorithm to estimate the location of an unknown change-point, as well as an RJMCMC algorithm to decide between a model with no change-points and a model with one change-point.

Item Open Access
Bayesian treed distributed lag models (Colorado State University. Libraries, 2021) Mork, Daniel S., author; Wilson, Ander, advisor; Sharp, Julia, committee member; Keller, Josh, committee member; Neophytou, Andreas, committee member
In many applications there is interest in regressing an outcome on exposures observed over a previous time window. This frequently arises in environmental epidemiology, where either a health outcome on one day is regressed on environmental exposures (e.g., temperature or air pollution) observed on that day and several preceding days, or a birth or children's health outcome is regressed on exposures observed daily or weekly throughout pregnancy. The distributed lag model (DLM) is a statistical method commonly implemented to estimate an exposure-time-response function by regressing the outcome on repeated measures of a single exposure over a preceding time period, for example, mean exposure during each week of pregnancy. Inferential goals include estimating the exposure-time-response function and identifying critical windows during which exposures can alter a health endpoint. In this dissertation, we develop novel formulations of Bayesian additive regression trees that allow for estimating a DLM. First, we propose treed distributed lag nonlinear models to estimate the association between weekly maternal exposure to air pollution and a birth outcome when the exposure-response relation is nonlinear. We introduce a regression tree-based model that accommodates a multivariate predictor along with parametric control for fixed effects. Second, we propose a tree-based method for estimating the association between repeated measures of a mixture of multiple pollutants and a health outcome. The proposed approach introduces regression tree pairs, which allow for estimation of marginal effects of exposures along with structured interactions that account for the temporal ordering of the exposure data. Finally, we present a framework to estimate a heterogeneous DLM in the presence of a potentially high-dimensional set of modifying variables. We present simulation studies to validate the models. We apply these methods to estimate the association between ambient pollution exposures and birth weight for a Colorado, USA birth cohort.

Item Open Access
Causality and clustering in complex settings (Colorado State University. Libraries, 2023) Gibbs, Connor P., author; Keller, Kayleigh, advisor; Fosdick, Bailey, advisor; Koslovsky, Matthew, committee member; Kaplan, Andee, committee member; Anderson, Brooke, committee member
Causality and clustering are at the forefront of many problems in statistics. In this dissertation, we present new methods and approaches for drawing causal inference with temporally dependent units and for clustering nodes in heterogeneous networks. To begin, we investigate the causal effect of a timeout at stopping an opposing team's run in the National Basketball Association (NBA). After formalizing the notion of a run in the NBA, and in light of the temporal dependence among runs, we define the units under study with careful consideration of the stable unit-treatment-value assumption pertinent to the Rubin causal model. After introducing a novel, interpretable outcome based on the score difference, we conclude that while comebacks frequently occur after a run, it is slightly disadvantageous to call a timeout during a run by the opposing team. Further, we demonstrate that the magnitude of this effect varies by franchise, lending clarity to an oft-debated topic among sports fans. Next, we represent the known relationships among and between genetic variants and phenotypic abnormalities as a heterogeneous network and introduce a novel analytic pipeline to identify clusters containing undiscovered gene-to-phenotype relations (ICCUR) from the network. ICCUR identifies, scores, and ranks small heterogeneous clusters according to their potential for future discovery in a large temporal biological network. We train an ensemble model of boosted regression trees to predict clusters' potential for future discovery using observable cluster features, and we show that the resulting clusters contain significantly more undiscovered gene-to-phenotype relations than expected by chance.
To demonstrate its use as a diagnostic aid, we apply the results of the ICCUR pipeline to real, undiagnosed patients with rare diseases, identifying clusters containing patients' co-occurring yet otherwise unconnected genotypic and phenotypic information, some of which have since been validated by human curation. Motivated by ICCUR and its application, we introduce a novel method called ECoHeN (pronounced "eco-hen") to extract communities from heterogeneous networks in a statistically meaningful way. Using a heterogeneous configuration model as a reference distribution, ECoHeN identifies communities that are significantly more densely connected than expected given the node types and connectivity of their membership, without imposing constraints on the type composition of the extracted communities. The ECoHeN algorithm identifies communities one at a time through a dynamic set of iterative updating rules and is guaranteed to converge. To our knowledge, this is the first discovery method that distinguishes and identifies both homogeneous and heterogeneous, possibly overlapping, community structure in a network. We demonstrate the performance of ECoHeN through simulation and in application to a political blogs network to identify collections of blogs which reference one another more than expected considering the ideology of their members. Along with small partisan communities, we demonstrate ECoHeN's ability to identify a large, bipartisan community that is undetectable by canonical community detection methods and denser than those found by modern competing methods.Item Open Access Change-Point estimation using shape-restricted regression splines(Colorado State University. Libraries, 2016) Liao, Xiyue, author; Meyer, Mary C., advisor; Breidt, F.
Jay, committee member; Homrighausen, Darren, committee member; Belfiori, Elisa, committee memberChange-point estimation is needed in fields such as climate change, signal processing, economics, and dose-response analysis, but it has not yet been fully explored. We consider estimating a regression function ƒm and a change-point m, where m is a mode, an inflection point, or a jump point. Linear inequality constraints are used with spline regression functions to estimate m and ƒm simultaneously using profile methods. For a given m, the maximum-likelihood estimate of ƒm is found using constrained regression methods, then the set of possible change-points is searched to find the estimate m̂ that maximizes the likelihood. Convergence rates are obtained for each type of change-point estimator, and we show an oracle property, that the convergence rate of the regression function estimator is as if m were known. Parametrically modeled covariates are easily incorporated in the model. Simulations show that for small and moderate sample sizes, these methods compare well to existing methods. The scenario in which the random error is from a stationary autoregressive process is also presented. Under such a scenario, the change-point and the parameters of the stationary autoregressive process, such as the autoregressive coefficients and the model variance, are estimated together via Cochrane-Orcutt-type iterations. Simulations are conducted, and the change-point estimator is shown to perform well in terms of choosing the right order of the autoregressive process. Penalized spline-based regression is also discussed as an extension. Given a large number of knots and a penalty parameter that controls the effective degrees of freedom of a shape-restricted model, penalized methods give smoother fits while balancing under- and over-fitting. A bootstrap confidence interval for a change-point is established.
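The profile search described above (fit ƒm for each candidate change-point, then pick the candidate that maximizes the likelihood) can be illustrated in its simplest form, a single jump in the mean, with piecewise-constant fits standing in for the constrained spline fits. The data-generating setup is an illustrative assumption:

```python
import numpy as np

def profile_jump_point(x, y):
    """For each candidate jump location, fit the Gaussian MLE given that location
    (here, a separate mean on each side), then return the candidate with the
    smallest residual sum of squares, i.e. the largest profiled likelihood."""
    best_sse, best_m = np.inf, None
    for i in range(1, len(x)):                       # jump between x[i-1] and x[i]
        sse = ((y[:i] - y[:i].mean()) ** 2).sum() + ((y[i:] - y[i:].mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_m = sse, x[i]
    return best_m

# toy data: the mean jumps from 0 to 2 at x = 0.5
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = np.where(x < 0.5, 0.0, 2.0) + 0.3 * rng.standard_normal(x.size)
m_hat = profile_jump_point(x, y)
```

The dissertation's estimators replace the side means with shape-constrained spline fits, but the outer profile loop over candidate change-points has the same structure.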
By generating random change-points from a curve on the unit interval, we compute the coverage rate of the bootstrap confidence interval using penalized estimators, which shows advantages such as robustness over competitors. The methods are available in the R package ShapeChange on the Comprehensive R Archive Network (CRAN). Moreover, we discuss the shape selection problem, in which more than one shape is possible for a given data set. A project with the Forest Inventory & Analysis (FIA) scientists is included as an example. In this project, we apply shape-restricted spline-based estimators, among which the one-jump and double-jump estimators are emphasized, to time-series Landsat imagery for the purpose of modeling, mapping, and monitoring annual forest disturbance dynamics. For each pixel and spectral band or index of choice in temporal Landsat data, our method delivers a smoothed rendition of the trajectory constrained to behave in an ecologically sensible manner, reflecting one of seven possible “shapes”. Routines to realize the methodology are built in the R package ShapeSelectForest on CRAN, and techniques in this package are being applied for forest disturbance and attribute mapping across the conterminous U.S. The Landsat community will implement techniques in this package on the Google Earth Engine in 2016. Finally, we consider change-point estimation with generalized linear models. Such work can be applied to dose-response analysis, when the effect of a drug increases as the dose increases to a saturation point, after which the effect starts decreasing.Item Open Access Confidence regions for level curves and a limit theorem for the maxima of Gaussian random fields(Colorado State University. Libraries, 2009) French, Joshua, author; Davis, Richard A., advisorOne of the most common display tools used to represent spatial data is the contour plot.
Informally, a contour plot is created by taking a "slice" of a three-dimensional surface at a certain level of the response variable and projecting the slice onto the two-dimensional coordinate plane. The "slice" at each level is known as a level curve.Item Open Access Constrained spline regression and hypothesis tests in the presence of correlation(Colorado State University. Libraries, 2013) Wang, Huan, author; Meyer, Mary C., advisor; Opsomer, Jean D., advisor; Breidt, F. Jay, committee member; Reich, Robin M., committee memberExtracting the trend from the pattern of observations is always difficult, especially when the trend is obscured by correlated errors. Often, prior knowledge of the trend does not include a parametric family, and instead the valid assumptions are vague, such as "smooth" or "monotone increasing." Incorrectly specifying the trend as some simple parametric form can lead to overestimation of the correlation, and conversely, misspecifying or ignoring the correlation leads to erroneous inference for the trend. In this dissertation, we explore spline regression with shape constraints, such as monotonicity or convexity, for estimation and inference in the presence of stationary AR(p) errors. Standard criteria for selection of the penalty parameter, such as the Akaike information criterion (AIC), cross-validation, and generalized cross-validation, have been shown to behave poorly when the errors are correlated, even in the absence of shape constraints. In this dissertation, the correlation structure and penalty parameter are selected simultaneously using a correlation-adjusted AIC. The asymptotic properties of unpenalized spline regression in the presence of correlation are investigated. It is proved that even if the estimation of the correlation is inconsistent, the corresponding projection estimator of the regression function can still be consistent and attain the optimal asymptotic rate, under appropriate conditions.
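As a minimal stand-in for the shape-constrained fits discussed above, the classical pool-adjacent-violators algorithm (PAVA) computes the least-squares monotone fit. Here it is applied to a toy increasing trend observed with stationary AR(1) errors; the trend, AR coefficient, and noise scale are illustrative assumptions, and PAVA is a simplification of the dissertation's shape-restricted spline estimators:

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: the least-squares nondecreasing fit to y."""
    blocks = []  # [mean, weight] pairs, one per monotone block
    for v in y:
        blocks.append([float(v), 1.0])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m, w = blocks.pop()
            m0, w0 = blocks[-1]
            blocks[-1] = [(m0 * w0 + m * w) / (w0 + w), w0 + w]
    return np.concatenate([np.full(int(w), m) for m, w in blocks])

# increasing trend observed with stationary AR(1) errors, e[t] = 0.5 e[t-1] + z[t]
rng = np.random.default_rng(3)
n, phi = 150, 0.5
e = np.zeros(n)
for t in range(1, n):
    e[t] = phi * e[t - 1] + rng.standard_normal()
y = np.linspace(0.0, 3.0, n) + 0.5 * e
fit = pava(y)
```

The fitted values are nondecreasing by construction, which is the kind of linear-inequality constraint the spline estimators enforce at a richer basis.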
The constrained spline fit attains the convergence rate of the unconstrained spline fit in the presence of AR(p) errors. Simulation results show that the constrained estimator typically behaves better than the unconstrained version if the true trend satisfies the constraints. Traditional statistical tests for the significance of a trend rely on restrictive assumptions on the functional form of the relationship, e.g. linearity. In this dissertation, we develop testing procedures that incorporate shape restrictions on the trend and can account for correlated errors. These tests can be used to check whether the trend is constant versus monotone, linear versus convex/concave, or any combination such as constant versus increasing and convex. The proposed likelihood ratio test statistics have an exact null distribution if the covariance matrix of the errors is known. Theorems are developed for the asymptotic distributions of the test statistics when the covariance matrix is unknown but a consistent estimator of the correlation is incorporated. The proposed test is compared, through intensive simulations, with the F-test using the unconstrained alternative fit and the one-sided t-test using the simple regression alternative fit. Both the size and power of the proposed test are favorable: in general, it has smaller size and greater power than the F-test and t-test.Item Open Access Data associated with "Interpersonal relationships drive successful team science: an exemplary case-based study"(Colorado State University. Libraries, 2020) Love, Hannah; Cross, Jennifer; Fosdick, Bailey; Crooks, Kevin; VandeWoude, Susan; Fisher, EllenTeam science, or collaborations between groups of scientists with varying expertise, is required for researching solutions to complex problems of the 21st century.
Despite the essential need for such transdisciplinary interactions, knowledge about training scientists and developing personal mastery (a set of principles and practices necessary for team learning) in productive team interactions, also referred to as the science of team science (SciTS), is still in its nascent stages. This article reports on a longitudinal case study of an exemplary scientific team and evaluates the following question: How do scientists enhance their productivity through participation in transdisciplinary teams? Through a focused SciTS study applying mixed methods, including social network surveys, participant observation, focus groups, interviews, and historical social network data, we found that the interactions of an international, transdisciplinary scientific team trained scientists to become experts in their field, helped the team develop personal mastery, advanced their scientific productivity, and fulfilled the land-grant mission. The team’s processes and practices to train new scientists propelled new ideas, collaborations, and research outcomes over a 15-year period. This case study highlights that in addition to specific scientific discoveries, scientific progress benefits from developing and forming interpersonal relationships among scientists from diverse disciplines.Item Open Access Data mining techniques for temporal point processes applied to insurance claims data(Colorado State University. Libraries, 2008) Iverson, Todd Ashley, author; Ben-Hur, Asa, advisor; Iyer, Hariharan K., advisorWe explore data mining on databases consisting of insurance claims information. This dissertation focuses on two major topics considered by way of data mining procedures. One is the development of a classification rule using kernels and support vector machines. The other is the discovery of association rules using the Apriori algorithm and its extensions, as well as a new association rules technique.
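The Apriori algorithm mentioned above exploits downward closure: an itemset can be frequent only if every one of its subsets is frequent. A minimal sketch on toy claims-style transactions follows; the diagnosis and procedure codes are hypothetical, not actual MarketScan codes:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets via Apriori: a (k+1)-itemset can be frequent only if
    all of its k-item subsets are frequent (downward closure)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent = set(frequent)
    while frequent:
        # join step: combine frequent k-itemsets that differ in one item
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1}
        # prune step: drop candidates with an infrequent subset, then count support
        frequent = {c for c in candidates
                    if all(frozenset(s) in all_frequent for s in combinations(c, len(c) - 1))
                    and support(c) >= min_support}
        all_frequent |= frequent
    return all_frequent

# hypothetical claims: each transaction holds diagnosis (dx:) and procedure (px:) codes
claims = [
    {"dx:diabetes", "px:a1c_test", "px:glucose_test"},
    {"dx:diabetes", "px:a1c_test"},
    {"dx:diabetes", "px:glucose_test"},
    {"px:a1c_test", "px:glucose_test"},
    {"dx:diabetes", "px:a1c_test", "px:glucose_test"},
]
frequent_sets = apriori(claims, min_support=0.6)
```

Rules of the form "diagnoses imply procedures," as studied in the dissertation, are then read off frequent itemsets that mix dx: and px: codes.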
With regard to the first topic, we address the question: can kernel methods using an SVM classifier predict patients at risk of type 2 diabetes from three years of insurance claims data? We report the results of a study in which we tested the performance of new methods for data extracted from the MarketScan® database. We summarize the results of applying popular kernels, as well as new kernels constructed specifically for this task, for support vector machines on data derived from this database. We were able to predict patients at risk of type 2 diabetes with nearly 80% success when combining a number of specialized kernels. The specific form of the data, that of a timed sequence, led us to develop two new kernels inspired by dynamic time warping. The Global Time Warping (GTW) and Local Time Warping (LTW) kernels build on an existing time warping kernel by including the timing coefficients present in classical time warping, while providing a solution for the diagonal dominance present in most alignment methods. We show that the LTW kernel performs significantly better than the existing time warping kernel when the times contained relevant information. With regard to the second topic, we provide a new theorem on closed rules that could help substantially improve the time to find a specific type of rule. An insurance claims database contains codes indicating associated diagnoses and the resulting procedures for each claim. The rules that we consider are of the form diagnoses imply procedures. In addition, we introduce a new class of interesting association rules in the context of medical claims databases and illustrate their potential uses by extracting example rules from the MarketScan® database.Item Open Access Dataset associated with "A laboratory assessment of 120 air pollutant emissions from biomass and fossil fuel cookstoves"(Colorado State University.
Libraries, 2018) Bilsback, KelseyCookstoves emit many pollutants that are harmful to human health and the environment. However, most of the existing scientific literature focuses on fine particulate matter (PM2.5) and carbon monoxide (CO). We present an extensive dataset of speciated air pollution emissions from wood, charcoal, kerosene, and liquefied petroleum gas (LPG) cookstoves. One hundred and twenty gas- and particle-phase constituents—including organic carbon, elemental carbon (EC), ultrafine particles (10-100 nm), inorganic ions, carbohydrates, and volatile/semi-volatile organic compounds (e.g., alkanes, alkenes, alkynes, aromatics, carbonyls, and polycyclic aromatic hydrocarbons [PAHs])—were measured in the exhaust from 26 stove/fuel combinations. We find that improved biomass stoves tend to reduce PM2.5 emissions; however, certain design features (e.g., insulation or a fan) tend to increase relative levels of other co-emitted pollutants (e.g., EC, ultrafine particles, formaldehyde, or PAHs depending on stove type). In contrast, the pressurized kerosene and LPG stoves reduced all pollutants relative to a traditional three-stone fire (≥93% and ≥79%, respectively). Finally, we find that PM2.5 and CO are not strong predictors of co-emitted pollutants, which is problematic because these pollutants may not be indicators of other cookstove smoke constituents (such as formaldehyde and acetaldehyde) that may be emitted at concentrations that are harmful to human health.Item Open Access Dataset associated with "Design and Testing of a Low-Cost Sensor and Sampling Platform for Indoor Air Quality"(Colorado State University. Libraries, 2021) Tryner, Jessica; Phillips, Mollie; Quinn, Casey W.; Neymark, Gabe; Wilson, Ander; Jather, Shantanu H.; Carter, Ellison; Volckens, JohnAmericans spend most of their time indoors at home, but comprehensive characterization of in-home air pollution is limited by the cost and size of reference-quality monitors.
We assembled small "Home Health Boxes" (HHBs) to measure indoor PM2.5, PM10, CO2, CO, NO2, and O3 concentrations using filter samplers and low-cost sensors. Nine HHBs were collocated with reference monitors in the kitchen of an occupied home in Fort Collins, Colorado, USA for 168 h while wildfire smoke impacted local air quality. When HHB data were interpreted using gas sensor manufacturers' calibrations, HHBs and reference monitors (a) categorized the level of each gaseous pollutant similarly (as either low, elevated, or high relative to air quality standards) and (b) both indicated that gas cooking burners were the dominant source of CO and NO2 pollution; however, HHB and reference O3 data were not correlated. When HHB gas sensor data were interpreted using linear mixed calibration models derived via collocation with reference monitors, root-mean-square error decreased for CO2 (from 408 to 58 ppm), CO (645 to 572 ppb), NO2 (22 to 14 ppb), and O3 (21 to 7 ppb); additionally, correlation between HHB and reference O3 data improved (Pearson's r increased from 0.02 to 0.75). Mean 168-h PM2.5 and PM10 concentrations derived from nine filter samples were 19.4 micrograms per cubic meter (6.1% relative standard deviation [RSD]) and 40.1 micrograms per cubic meter (7.6% RSD). The 168-h PM2.5 concentration was overestimated by PMS5003 sensors (median sensor/filter ratio = 1.7) and underestimated slightly by SPS30 sensors (median sensor/filter ratio = 0.91).
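The gas-sensor recalibration reported above amounts to regressing reference concentrations on raw sensor readings over a collocation period. The sketch below uses a simple least-squares line on synthetic CO2 data; the bias, noise level, and single-predictor form are assumptions, and the study itself used linear mixed calibration models:

```python
import numpy as np

rng = np.random.default_rng(7)
# synthetic 168 h of hourly CO2: a reference monitor and a biased, noisy sensor
reference = 400.0 + 300.0 * rng.random(168)                       # ppm
sensor = 1.3 * reference - 80.0 + 25.0 * rng.standard_normal(168)  # raw readings

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# collocation calibration: least-squares line, reference ~ b0 + b1 * sensor
b1, b0 = np.polyfit(sensor, reference, 1)
calibrated = b0 + b1 * sensor

raw_rmse = rmse(sensor, reference)        # manufacturer-style raw output
cal_rmse = rmse(calibrated, reference)    # after collocation calibration
```

As in the HHB evaluation, the point of the comparison is that a site-specific calibration derived from collocation with a reference monitor substantially reduces RMSE relative to the uncalibrated sensor output.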