The quantile analysis of gene expression trajectory and quality control

Date

2024-07

Publisher

Faculty of Graduate Studies and Research, University of Regina

Abstract

In Chapter 2 of the thesis, Taguchi’s loss function in the multivariate case is reviewed, and a straightforward and practical definition of the cost coefficient matrix is proposed. To make the cost of the quality control process more realistic, the expected modified Taguchi loss function is applied to Hotelling’s T² control chart under a modified economic statistical design, and a variable sampling interval (VSI) scheme is adopted. The optimal design parameters are obtained with the artificial bee colony (ABC) algorithm. A comparison with the previous routine design shows that redefining the cost factor makes the control system more sensitive to shifts in the process mean and makes the cost positively proportional to the process deviation, which is an important characteristic of a good economic model.

Temporal gene expression data can be used to characterize gene function, and gene expression trajectories that show different trends under various biological conditions are of great interest to scientists. Most of the literature analyzes gene expression trajectories with the traditional mean regression model, which does not perform well in practice because of non-normal distributions, potential outliers, and heteroscedasticity in the data. Chapter 3 proposes a likelihood-based EM algorithm to estimate the marginal conditional quantiles of a multivariate response in the linear quantile regression framework, which studies the association between the multivariate response and the explanatory variables across different quantiles. We assume that the error term follows a multivariate asymmetric Laplace distribution, derive the MLEs of the parameters, and implement the EM algorithm. The proposed approach is validated through simulation studies and applied to a real dataset of 18 genes in P. aeruginosa expressed under 24 biological conditions.

In Chapter 4, to identify similarities among the 18 genes in P. aeruginosa, a clustering tree for all genes is built with Székely and Rizzo’s hierarchical e-clustering algorithm, which is an extension of Ward’s minimum variance method. The algorithm is based on the empirical cluster distance between population distributions; it can describe the hierarchical structure of multivariate data and classify the observations into disjoint clusters. Moreover, similar to the ANOVA F test, the DISCO analysis partitions the total dispersion of the observed response into between-sample and within-sample components and determines the test statistic. We conduct pairwise DISCO tests among all gene expressions for the multi-sample hypothesis of equal distributions. Both methods indicate that some gene expressions are highly likely to come from the same population.
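As a rough illustration of the empirical cluster distance that e-clustering relies on, the sketch below computes the two-sample energy distance (e-distance) in its usual form: a weighted difference between the mean between-sample distance and the mean within-sample distances. It is a minimal sketch and not code from the thesis. The function names `mean_pairwise_distance` and `e_distance` are assumptions chosen for this example, and only the distance is shown, not the hierarchical merging step.

```python
import math
from itertools import product


def mean_pairwise_distance(x, y):
    """Average Euclidean distance over all pairs (one point from x, one from y)."""
    total = sum(math.dist(p, q) for p, q in product(x, y))
    return total / (len(x) * len(y))


def e_distance(a, b):
    """Two-sample energy distance between point sets a and b.

    a, b: non-empty sequences of equal-length coordinate tuples.
    Hypothetical helper for illustration only.
    """
    n1, n2 = len(a), len(b)
    between = mean_pairwise_distance(a, b)    # average distance across samples
    within_a = mean_pairwise_distance(a, a)   # average distance inside a
    within_b = mean_pairwise_distance(b, b)   # average distance inside b
    return n1 * n2 / (n1 + n2) * (2.0 * between - within_a - within_b)


# Toy one-dimensional samples (made-up values, not thesis data).
sample_a = [(0.0,), (0.0,)]
sample_b = [(1.0,), (1.0,)]
distance_ab = e_distance(sample_a, sample_b)
```

In an agglomerative scheme of this kind, the two clusters with the smallest such distance would be merged at each step; a distance of zero between a sample and itself reflects that identical empirical distributions are not separated.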

Description

A Thesis Submitted to the Faculty of Graduate Studies and Research In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Statistics, University of Regina. xvi, 166 p.
