Overview of the ‘fastglm’ Package

The fastglm package is intended as a fast and stable alternative to stats::glm() for fitting generalized linear models. It is compatible with R’s family objects (see ?family) and produces fitted objects whose downstream methods (summary(), vcov(), predict(), coef(), residuals(), logLik()) behave exactly like those for a glm. This vignette walks through the main functionality of the package at a higher level than the other vignettes.

library(fastglm)

Fitting a GLM

The main package function is fastglm(). It takes a numeric design matrix x, a response y, and an R family object:

data(esoph)
x <- model.matrix(~ agegp + unclass(tobgp) + unclass(alcgp),
                  data = esoph)
y <- cbind(esoph$ncases, esoph$ncontrols)

fit <- fastglm(x, y, family = binomial(link = "cloglog"))
summary(fit)
#> 
#> Call:
#> fastglm.default(x = x, y = y, family = binomial(link = "cloglog"))
#> 
#> Coefficients:
#>                Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)    -4.29199    0.29777 -14.414  < 2e-16 ***
#> agegp.L         3.30677    0.63454   5.211 1.88e-07 ***
#> agegp.Q        -1.35525    0.57310  -2.365    0.018 *  
#> agegp.C         0.20296    0.43333   0.468    0.640    
#> agegp^4         0.15056    0.29318   0.514    0.608    
#> agegp^5        -0.23425    0.17855  -1.312    0.190    
#> unclass(tobgp)  0.33276    0.07285   4.568 4.93e-06 ***
#> unclass(alcgp)  0.80384    0.07538  10.664  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 61.609  on 87  degrees of freedom
#> Residual deviance: 96.950  on 80  degrees of freedom
#> AIC: 228
#> 
#> Number of Fisher Scoring iterations: 6

fastglm() does not accept a formula directly, it operates on a pre-built design matrix. To use a formula and a data frame, pass fastglm_fit as the fitting function to base glm():

fit2 <- glm(cbind(ncases, ncontrols) ~ agegp + unclass(tobgp) + unclass(alcgp),
            data = esoph, family = binomial(link = "cloglog"),
            method = fastglm_fit)

Decomposition methods

The IRLS algorithm reduces every iteration to a weighted least-squares problem. fastglm supports six different matrix decompositions for solving that WLS step, all from RcppEigen (Bates and Eddelbuettel, 2013); the choice trades off speed against numerical stability and rank-revealing behavior:

method decomposition
0 column-pivoted Householder QR (default; rank-revealing)
1 unpivoted Householder QR
2 LLT Cholesky
3 LDLT Cholesky
4 full-pivoted Householder QR
5 bidiagonal divide-and-conquer SVD

The default (method = 0) is the safe choice: it is rank-revealing, so it handles aliased / collinear columns gracefully. The Cholesky methods (2 and 3) are roughly 3–4× faster but assume full column rank.

set.seed(123)
n <- 5000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x %*% rnorm(p) * 0.05))

system.time(f0 <- fastglm(x, y, family = binomial()))                 # default
#>    user  system elapsed 
#>    0.01    0.00    0.01
system.time(f2 <- fastglm(x, y, family = binomial(), method = 2))     # LLT
#>    user  system elapsed 
#>   0.005   0.000   0.005
system.time(f3 <- fastglm(x, y, family = binomial(), method = 3))     # LDLT
#>    user  system elapsed 
#>   0.005   0.000   0.005

Stability

fastglm does not compromise computational stability for speed. It uses a step-halving safeguard following Marschner (2011) and starts from better-initialized values than glm() or glm2::glm2(), so it tends to converge in cases where the standard IRLS algorithm fails. See vignette("fastglm", package = "fastglm") for a Gamma-with-sqrt-link example where glm() does not converge but fastglm() does.

Inference: vcov(), robust SE, predictions

The fitted object stores the unscaled covariance directly (no refitting), and the standard vcov() / summary() machinery works as expected:

fit <- fastglm(x, y, family = binomial(), method = 2)

V    <- vcov(fit)
dim(V)
#> [1] 30 30

## standard errors agree with the diagonal of vcov()
all.equal(sqrt(diag(V)), unname(coef(summary(fit))[, "Std. Error"]))
#> [1] "names for target but not for current"

Heteroskedasticity-consistent and cluster-robust covariance matrices are available via sandwich::vcovHC() and sandwich::vcovCL()fastglm registers methods on those generics, so loading sandwich is all that is required:

library(sandwich)

V_hc <- vcovHC(fit, type = "HC0")
V_hc[1:3, 1:3]
#>              X1           X2           X3
#> X1 8.546454e-04 7.509395e-06 1.483571e-06
#> X2 7.509395e-06 8.321108e-04 7.951566e-07
#> X3 1.483571e-06 7.951566e-07 8.351855e-04

cluster <- rep(seq_len(20), length.out = n)
V_cl    <- vcovCL(fit, cluster = cluster, type = "HC1")
V_cl[1:3, 1:3]
#>               X1            X2           X3
#> X1  0.0007676800 -3.021035e-04 1.517204e-04
#> X2 -0.0003021035  8.060696e-04 8.492543e-05
#> X3  0.0001517204  8.492543e-05 7.823621e-04

Results are numerically identical to sandwich::vcovHC() and sandwich::vcovCL() applied to a glm fit (Zeileis, Köll, and Graham, 2020) to floating-point precision. They work for sparse and big.matrix fits as well, since the full design matrix is preserved on the fitted object (as a reference, not copied).

predict() supports se.fit = TRUE:

xnew <- x[1:5, , drop = FALSE]
predict(fit, newdata = xnew, type = "response", se.fit = TRUE)
#> $fit
#> [1] 0.6087584 0.5272820 0.4245913 0.5494426 0.6192178
#> 
#> $se.fit
#> [1] 0.03919065 0.03095856 0.03623790 0.03629691 0.03512202
#> 
#> $residual.scale
#> [1] 1

Sparse, big.matrix, and streaming designs

For designs that are sparse, that live on disc, or that have to be built from a parquet / arrow / DuckDB source, fastglm provides three large-data paths that share a common streaming kernel and produce identical results:

  • Matrix::dgCMatrix: pass directly to fastglm(). Useful for one-hot encoded categoricals and high-dimensional sparse designs.
  • bigmemory::big.matrix: pass directly to fastglm(). The matrix is read in row-blocks and never fully materialised in memory.
  • fastglm_streaming(chunk_callback, n_chunks, family): a user-supplied closure yields one row-block per call. The right path for fitting on a parquet dataset, DuckDB query, or any external columnar store.

See vignette("large-data-fastglm", package = "fastglm") for a detailed walk through all three paths. A short example:

n <- 4000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- rbinom(n, 1, plogis(X %*% c(0.2, 0.4, -0.2, 0.3)))

chunk_size <- 1000
chunks <- function(k) {
    idx <- ((k - 1) * chunk_size + 1):(k * chunk_size)
    list(X = X[idx, , drop = FALSE], y = y[idx])
}

fit_stream <- fastglm_streaming(chunks, n_chunks = 4, family = binomial())
fit_full   <- fastglm(X, y, family = binomial(), method = 2)

max(abs(coef(fit_stream) - coef(fit_full)))
#> [1] 1.111683e-07

Native families

For the most commonly-used family/link combinations, fastglm dispatches variance(), mu.eta(), linkinv(), and dev.resids() to inline C++ implementations rather than calling back into R once per IRLS iteration. The covered combinations are:

  • gaussian (identity, log, inverse)
  • binomial (logit, probit, cloglog, log)
  • poisson (log, identity, sqrt)
  • Gamma (log, inverse, identity)
  • inverse.gaussian (1/mu^2, log, identity, inverse)

Detection is automatic, if the family object matches one of the above, the native path is used; otherwise fastglm falls back to the R-callback path used in earlier versions. The native path is meaningfully faster on large n because it eliminates the per-iteration round-trip to R for each of the four family functions:

library(microbenchmark)

set.seed(7)
n <- 50000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(x %*% rnorm(p) * 0.05 + 1))

## force the R-callback path by giving the family object an unrecognized name
disguise <- function(fam) { fam$family <- paste0(fam$family, "*"); fam }

mb <- microbenchmark(
    native   = fastglm(x, y, family = poisson(),            method = 2),
    callback = fastglm(x, y, family = disguise(poisson()),  method = 2),
    times = 5L
)
print(mb)
#> Unit: milliseconds
#>      expr      min       lq     mean   median       uq      max neval
#>    native 45.99576 46.05611 47.59150 48.34347 48.74012 48.82204     5
#>  callback 56.04044 56.44821 57.64738 57.28308 58.59437 59.87081     5

The fastglm package switches transparently between the two paths; user code does not change.

Negative binomial, hurdle, zero-inflation, and Firth logistic

A handful of regressions that come up often in practice are not standard GLMs but build on the same IRLS machinery. fastglm provides dedicated fitting functions for these:

  • fastglm_nb() — negative-binomial regression with the dispersion theta estimated jointly with beta. Plays the same role as MASS::glm.nb() but runs the joint (beta, theta) MLE entirely in C++ (IRLS for beta, Brent for theta).
  • firth = TRUE — a flag on fastglm() / fastglm_fit() that activates Firth’s bias-reducing penalty on the score. Useful for binary logistic regression under separation or in small samples; analogous to logistf::logistf().
  • fastglm_hurdle() — a two-part count model: a binary regression for whether y > 0, plus a zero-truncated Poisson or NB regression on the positive subset. The two parts factorize, and both are fit by the same C++ IRLS solver. Same model as pscl::hurdle().
  • fastglm_zi() — a zero-inflated Poisson or NB regression: a binary inflation component overlaid on the original count distribution, fit by an EM algorithm in C++ with closed-form posterior responsibilities and an analytical observed-information vcov. Same model as pscl::zeroinfl().

All four take the standard fastglm tuning options (method, tol, maxit) and return objects with the usual coef(), vcov(), predict(), logLik() methods. fastglm_hurdle() and fastglm_zi() use the Formula package’s two-RHS syntax y ~ x1 + x2 | z1 + z2 to specify different designs for the count and zero parts. See vignette("count-firth-fastglm", package = "fastglm") for examples and timing comparisons against the canonical reference packages.

Three R entry points

There are three R-side functions to fit a GLM with fastglm. They all call into the same C++ solver, but differ in how much of base R’s machinery they wrap around it:

  • fastglm() is the top-level S3 function. It returns an object of class "fastglm" with deviance, AIC, null deviance, residuals, etc. populated for interactive use.
  • fastglm_fit() is designed to be passed as method = fastglm_fit to base glm(). The returned object inherits from "fastglmFit" so that glm()’s formula/data handling, update(), predict(), etc. all work on top.
  • fastglmPure() is the lowest-level wrapper: no dispersion, no AIC, no null-deviance computation. Use this when calling fastglm from another package and you only need the coefficients.

References

  • Marschner, I. C. (2011). glm2: Fitting generalized linear models with convergence problems. The R Journal, 3(2), 12–15. doi:10.32614/RJ-2011-012

  • Bates, D. and Eddelbuettel, D. (2013). Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5), 1–24. doi:10.18637/jss.v052.i05

  • Zeileis, A., Köll, S., and Graham, N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95(1), 1–36. doi:10.18637/jss.v095.i01