## Overview

The bamlss package heavily builds upon the R package mgcv (Wood 2020) infrastructures, i.e., all model terms that are provided by mgcv can also be used in bamlss. As a consequence, it is also possible to write user defined model terms using the generic smooth.construct()method of mgcv. Moreover, users can in principle even extent this, writing model terms beyond mgcv framework. However, this is a bit more technical and assumes that the user has a good basic knowledge of the internal structure of the package and is only very briefly discussed in this article.

## Specification

To give an overview of users not familiar with mgcv, the following table lists some model specifications using R’s formula syntax:

Description Formula
Linear effects $$\mathbf{X}\boldsymbol{\beta}$$ ~ x1 + x2 + x3
Nonlinear effects of continuous covariates $$f(\mathbf{x}) = f(x_1)$$ ~ s(x1)
Two-dimensional surfaces $$f(\mathbf{x}) = f(x_1, x_2)$$ ~ s(x1,x2), ~ te(x1,x2) or ~ ti(x1,x2) (higher dimensional terms possible)
Spatially correlated effects $$f(\mathbf{x}) = f_{spat}(x_s)$$ ~ s(xs,bs="mrf",xt=list(penalty=K)), where xs is a factor indicating the discrete regional information and K is a supplied penalty matrix. Other options within the xt argument are possible, please see the documentation of smooth.construct.mrf.smooth.spec()
Varying coefficients $$f(\mathbf{x}) = x_1f(x_2)$$ ~ s(x2,by=x1)
Spatially varying effects $$f(\mathbf{x}) = x_1f_{spat}(x_s)$$ or $$f(\mathbf{x}) = x_1f(x_2, x_3)$$ ~ s(xs,bs="mrf",xt=list(penalty=K),by=x1), ~ s(x2,x3,by=x1) or ~ te(x2,x3,by=x1)
Random intercepts with cluster index $$c$$: $$f(\mathbf{x}) = \beta_c$$ ~ s(id,bs="re"), where id is a factor of cluster indices
Random slopes with cluster index $$c$$: $$f(\mathbf{x}) = x_1\beta_c$$ ~ s(id,x1,bs="re"), as above with continuous covariate x1

Note that each model term constructor has an argument called bs, which specifies the type of basis that should be used for modeling. For more details on the smooth constructors, please visit the mgcv manual pages of s(), te() and ti().

## Basics

To better introduce the standard types of model terms we use simulated data that is part of the bamlss package. The data contains of a couple of different response types and is usually used for internal tests, only.

d <- GAMart()
head(d)
##            num     pnum      bnum cnum bin    cat       cens          eta
## 1  0.341704542 2.317364 0.6462574   65 yes   high 1.50241790  0.313833175
## 2 -0.373302161 1.602357 0.3009252   30  no   none 0.00000000 -0.207359234
## 3  0.007432787 1.983092 0.4848116   48  no    low 0.00000000  0.006957129
## 4  0.163871126 2.139531 0.5603679   56 yes medium 0.07999916  0.152055884
## 5 -0.096522673 1.879137 0.4346035   43  no    low 0.91605636 -0.195067241
## 6  0.505487688 2.481147 0.7253611   73 yes   high 1.28712357  0.668382608
##          x1         x2        x3    fac id       lon       lat           err
## 1 0.2875775 0.35360608 0.2736227   high  1 0.3181818 0.8636364  0.0278713673
## 2 0.7883051 0.36644144 0.5938669 medium  2 0.6363636 0.3636364 -0.1659429266
## 3 0.4089769 0.28710013 0.1601848 medium  3 0.2727273 0.3636364  0.0004756581
## 4 0.8830174 0.07997291 0.8534302 medium  4 0.5909091 0.9090909  0.0118152420
## 5 0.9404673 0.36545427 0.8477392   high  5 0.4090909 0.7272727  0.0985445686
## 6 0.0455565 0.17801381 0.4778868 medium  6 0.6818182 0.3636364 -0.1628949195

The first column holding variable num is a Gaussian response, all subsequent columns up to variable cens are basically transformations of variable num. Variables x1 to lat are covariates, which we use for illustrating how to set up the different model terms in the following.

Let’s start with a simple example using three covariates for a Gaussian model with possible nonlinear influence on the response distribution. The model formula can be specified by

f <- list(
num ~ s(x1) + s(x2) + s(x3),
sigma ~ s(x1) + s(x2) + s(x3)
)

Note that the default basis function type in mgcv are thin plate regression splines (Wood 2003) using k = 10 basis functions per default. Setting k = 10 is reasonable in most cases, but this may not be enough for very complex functional forms.

The type of basis in mgcv is selected with the bs argument. There are a number of different basis types readily available, from P-splines (Eilers and Marx 1996) to Markov random fields, this is documented on the mgcv manual page ?smooth.terms. Setting up the smooth terms is relatively straightforward when already known one. For example, if one is interested in specifying interaction terms, e.g., with main effects and interaction using two covariates, we could do the following using tensor product splines

f <- list(
num ~ ti(x1) + ti(x2) + ti(x1,x2),
sigma ~ s(x1) + s(x2) + s(x3)
)

Here, we use a functional decomposition of the effects of covariates x1 and x2 in the first formula. Note again that the number of basis function within ti() is again small, k = 5. The reason is, that constructing tensor product model terms can easily end up with thousands of parameters, i.e., these type of model terms need to be specified with care.

## Markov Random Fields

Markov random fields (MRF) can be used to model spatially correlated effects using discrete locational observations. See the documentation of mgcv::smooth.construct.mrf.smooth.spec() how to set up this model terms.

## Functional random intercepts

This type of model term is basically a random intercept and slope term, where the functional type might be nonlinear, e.g., this type of term is heavily used in joint models for longitudinal and survival data (Köhler et al. 2017; Köhler, Umlauf, and Greven 2018). Fortunately, using the mgcv package it is not very difficult to set them up. Using the artificial data from above, together with our newly created region identifier from the MRF example, we can set up a functional random intercept using numeric variable x1 and factor variable ID with

num ~ ti(ID,bs="re") + ti(x1,bs="ps",k=10) + ti(ID,x1,bs=c("re","ps"),k=c(nlevels(d\$ID),5))

where all nonlinear functions are modeled with P-splines (Eilers and Marx 1996). One needs to be a little bit careful with the number of basis functions in the ti() smooth constructor since this can easily end up in thousands of parameters and problems with memory might occur (also, depending on the number of levels in the factor variable, here ID).

## References

Eilers, P. H. C, and B. D. Marx. 1996. “Flexible Smoothing Using B-Splines and Penalized Likelihood.” Statistical Science 11: 89–121. https://doi.org/10.1214/ss/1038425655.

Köhler, Meike, Nikolaus Umlauf, Andreas Beyerlein, Christiane Winkler, Anette-Gabriele Ziegler, and Sonja Greven. 2017. “Flexible Bayesian Additive Joint Models with an Application to Type 1 Diabetes Research.” Biometrical Journal 59 (6): 1144–65. https://doi.org/10.1002/bimj.201600224.

Köhler, Meike, Nikolaus Umlauf, and Sonja Greven. 2018. “Nonlinear Association Structures in Flexible Bayesian Additive Joint Models.” Statistics in Medicine 37 (30): 4771–88. https://doi.org/10.1002/sim.7967.

Umlauf, Nikolaus, Nadja Klein, Achim Zeileis, and Thorsten Simon. 2024. bamlss: Bayesian Additive Models for Location Scale and Shape (and Beyond). https://CRAN.R-project.org/package=bamlss.

Wood, Simon N. 2003. “Thin Plate Regression Splines.” Journal of the Royal Statistical Society. Series B (Statistical Methodology) 65 (1): 95–114. https://doi.org/10.1111/1467-9868.00374.

Wood, S. N. 2020. mgcv: GAMs with Gcv/Aic/Reml Smoothness Estimation and Gamms by Pql. https://CRAN.R-project.org/package=mgcv.