# Plenary 1

Generalized Linear Models and Analysis of Big Data

M. Ataharul Islam

QM Husain Professor, ISRT, University of Dhaka

Abstract

The generalized linear model has emerged as an important approach to modeling data of many types, whether discrete or continuous, qualitative or quantitative. Generalized linear models are built on a special class of distributions known as the exponential family. A major advantage is that some important properties of likelihood estimation, such as the availability of minimal sufficient statistics of fixed dimension, are essentially restricted to the exponential family. Bivariate distributions whose conditionals belong to specified exponential families have been studied previously (Arnold and Strauss, 1991; Arnold et al., 2001). Similarly, conditional generalized linear models with covariate dependence have been developed for bivariate outcomes (Islam and Chowdhury, 2017). In longitudinal or repeated measures data, we face the challenge of formulating models based on multivariate outcomes. Bivariate or multivariate outcomes are generally correlated. Knowledge of the marginal probability distributions of the outcomes at different times is not adequate for modeling longitudinal outcome data, because the underlying correlation structure remains unknown. Islam (2018) proposed a trivariate Bernoulli regression model using a marginal and conditional approach. Fahrmeir and Tutz (2001) and Islam and Chowdhury (2017) proposed alternative regression models for multivariate outcomes. In this paper, a multivariate generalized linear model is proposed together with the underlying estimation and test procedures. Models based on quasi-likelihood methods are also highlighted. The application of the models to big data is discussed using the divide and recombine (D&R) framework (Bühlmann et al., 2016). Lee et al. (2017) explored the concepts of sufficiency and summary statistics for model fitting.
In this paper, the exponential family of distributions for multivariate outcome variables and the corresponding sufficient statistics are shown to have great potential for analyzing big data, where traditional statistical methods fail to produce results because the data sets are too large. Sufficiency may make it possible to work with grouped-data characteristics (summary statistics D&R) instead of unit-level characteristics (horizontal D&R). The proposed method is designed to reduce the complexity arising from very large data sets by using an effective and feasible data reduction technique.

QMH Professor Dr. M. Ataharul Islam
Institute of Statistical Research and Training (ISRT)
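
The summary statistics D&R idea in the abstract can be illustrated with a minimal sketch. It is not the method proposed in the paper: it assumes a single Bernoulli outcome `y`, a single binary covariate `x`, and a logistic model, for which the per-pattern counts and outcome sums are sufficient and the maximum likelihood estimate has a closed form. The function names (`chunk_summary`, `recombine`, `fit_logistic_binary_x`), the chunk size, and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def chunk_summary(chunk):
    """Reduce one chunk to its sufficient statistics:
    for each covariate pattern x, the number of units and the sum of y."""
    n, s = Counter(), Counter()
    for x, y in chunk:
        n[x] += 1
        s[x] += y
    return n, s


def recombine(summaries):
    """Recombine step: sufficient statistics simply add across chunks."""
    n_tot, s_tot = Counter(), Counter()
    for n, s in summaries:
        n_tot.update(n)  # Counter.update adds counts
        s_tot.update(s)
    return n_tot, s_tot


def logit(p):
    return math.log(p / (1.0 - p))


def fit_logistic_binary_x(n, s):
    """Closed-form MLE of logit P(y=1 | x) = b0 + b1*x when x is in {0, 1}.
    Assumes both patterns occur and each has 0 < s[x] < n[x]."""
    b0 = logit(s[0] / n[0])
    b1 = logit(s[1] / n[1]) - b0
    return b0, b1


# Simulated unit-level data (illustrative only): true b0 = -0.5, true b1 = 1.0.
random.seed(0)
N, CHUNK = 10_000, 1_000
data = []
for _ in range(N):
    x = random.randint(0, 1)
    p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
    data.append((x, 1 if random.random() < p else 0))

# Divide: split into chunks and keep only each chunk's summary statistics.
chunks = [data[i:i + CHUNK] for i in range(0, N, CHUNK)]
n_tot, s_tot = recombine(chunk_summary(c) for c in chunks)

# Recombine: fit from the pooled summaries, and compare with a single pass
# over all unit-level records.
b_dr = fit_logistic_binary_x(n_tot, s_tot)
b_full = fit_logistic_binary_x(*chunk_summary(data))
print("D&R estimate:      ", b_dr)
print("Full-data estimate:", b_full)
```

Because the summaries are integer counts that add exactly, the pooled fit coincides with the full-data fit in this special case, while each chunk contributes only four numbers. With more covariate patterns or continuous covariates the estimate would generally require an iterative fit on the grouped data rather than a closed form.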