Generalized Linear Models and Analysis of Big Data
M. Ataharul Islam
QM Husain Professor, ISRT, University of Dhaka
The generalized linear model has emerged as an important approach to modeling data of many types, whether discrete or continuous, qualitative or quantitative. Generalized linear models assume that the outcome follows a distribution from a special class known as the exponential family. One major advantage is that some important properties of likelihood estimation, such as minimal sufficient statistics, are restricted to the exponential family. Bivariate conditional probability distributions belonging to specified exponential families have been presented in the past (Arnold and Strauss, 1991; Arnold et al., 2001). Similarly, conditional generalized linear models with covariate dependence have been developed for bivariate outcomes (Islam and Chowdhury, 2017). In longitudinal or repeated measures data, we face the challenge of formulating models for multivariate outcomes, which are generally correlated. Knowledge of the marginal probability distributions of the outcomes at different times is not adequate for modeling longitudinal outcome data, because the underlying correlation structure remains unknown. Islam (2018) proposed a trivariate Bernoulli regression model using a marginal and conditional approach. Fahrmeir and Tutz (2001) and Islam and Chowdhury (2017) proposed alternative regression models for multivariate outcomes. In this paper, a multivariate generalized linear model is proposed, together with the underlying estimation and test procedures. Models based on quasi-likelihood methods are also highlighted. The application of the models to big data is discussed using the divide and recombine (D&R) framework (Buhlman et al., 2016). Lee et al. (2017) explored the concepts of sufficiency and summary statistics for model fitting.
In this paper, the exponential family of distributions for multivariate outcome variables, and the corresponding sufficient statistics, are shown to have great potential for analyzing big data in cases where traditional statistical methods fail to produce results because the data sets are too large. Sufficiency makes it possible to work with grouped data characteristics (summary statistics D&R) instead of unit-level characteristics (horizontal D&R). The proposed method is designed to reduce the complexity arising from very large data sets through an effective and feasible data reduction technique.
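As a rough illustration of the summary statistics idea, the sketch below uses the simplest univariate case, a Gaussian linear model with identity link, rather than the multivariate model proposed in the paper. The function names `block_summaries` and `recombine` and the simulated data are assumptions made for this example only. Each block of data is reduced to the sufficient statistics X'X and X'y, and adding these across blocks gives the same estimate as fitting the full data set at once.

```python
import numpy as np

def block_summaries(X, y):
    """Reduce one data block to its sufficient statistics (hypothetical helper)."""
    return X.T @ X, X.T @ y

def recombine(summaries):
    """Add the block-level statistics and solve the normal equations."""
    xtx = sum(s[0] for s in summaries)
    xty = sum(s[1] for s in summaries)
    return np.linalg.solve(xtx, xty)

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Divide: split the rows into 8 blocks and keep only the summaries.
blocks = np.array_split(np.arange(n), 8)
summaries = [block_summaries(X[idx], y[idx]) for idx in blocks]

# Recombine: the estimate from the summaries alone.
beta_dr = recombine(summaries)

# Reference fit on the full data set.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

Here `beta_dr` and `beta_full` agree up to floating-point error, and only a 3 x 3 matrix and a vector of length 3 per block need to be stored or transmitted. For non-Gaussian outcomes the recombination step would generally need iteration or grouping by covariate pattern, which this sketch does not attempt.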
QMH Professor Dr. M. Ataharul Islam
Institute of Statistical Research and Training (ISRT)
University of Dhaka, Bangladesh
M. Ataharul Islam is currently QMH Professor at the Institute of Statistical Research and Training (ISRT), University of Dhaka, Bangladesh. He is a former professor of statistics at Universiti Sains Malaysia, King Saud University, the University of Dhaka and East West University. He served as a visiting faculty member at the University of Hawaii and the University of Pennsylvania. He is a recipient of the Pauline Stitt Award, the Western North American Region (WNAR) Biometric Society Award for content and writing, the University Grants Commission Award for book and research, and the Ibrahim Memorial Gold Medal for research. He has published more than 100 papers in international journals on various topics, mainly on longitudinal and repeated measures data, including multistate and multistage hazards models, statistical modelling, Markov models with covariate dependence, generalized linear models, and conditional and joint models for correlated outcomes. He has authored a book on Markov models, jointly edited one book, and contributed chapters to several books.