Negative Binomial Stability Benchmark: fastglm vs MASS

Negative-binomial regression is fit by alternating between IRLS for the regression coefficients beta (at a fixed dispersion) and a one-dimensional MLE update for the dispersion theta. This is a harder problem than an ordinary GLM: the two updates interact, and the joint likelihood is slow to converge, or fails outright, when the fitted means span many orders of magnitude or when theta drifts toward a boundary. fastglm_nb() runs the entire (beta, theta) MLE in C++ (IRLS for beta, Brent for theta) with the same step-halving safeguard following Marschner (2011) that fastglm uses for ordinary GLMs, so it tends to converge in cases where MASS::glm.nb() (Venables and Ripley, 2002) does not.

The companion vignette vignette("nb-convergence-fastglm", package = "fastglm") compares the two solvers on a single difficult data-generating process. This vignette is broader: it is a stability benchmark that sweeps a grid of twelve negative-binomial scenarios, each one chosen to stress the joint MLE in a different way (wide fitted-mean ranges, extreme over- and under-dispersion, small samples, high dimension, collinearity, and contaminating outliers). For each scenario we fit both solvers over many replications and tabulate how often each one converges cleanly, whether they agree when both succeed, and where one rescues the other.

Both solvers are given a matched iteration budget so the comparison is fair: MASS::glm.nb() runs with glm.control(maxit = 50), and fastglm_nb() uses its default outer.maxit = 50.

Data-generating process

A single generator produces negative-binomial data with a log link. Its arguments dial each stress axis independently: the number of columns p, the true dispersion theta, the spread of the coefficients coef_sd, the intercept (which shifts the fitted-mean scale), an optional near-duplicate column (collinear), and an optional handful of extreme outliers.

library(fastglm)
suppressPackageStartupMessages(library(MASS))

gen <- function(seed, n, p, theta, coef_sd = 0.5, intercept = 1.0,
                collinear = FALSE, outliers = 0L, x_sd = 1.0) {
    set.seed(seed)
    X <- cbind(1, matrix(rnorm(n * (p - 1), sd = x_sd), n, p - 1))
    if (collinear && p >= 3) X[, p] <- X[, 2] + rnorm(n, sd = 1e-3)
    beta <- c(intercept, rnorm(p - 1, sd = coef_sd))
    eta  <- pmin(as.vector(X %*% beta), 12)
    mu   <- exp(eta)
    y    <- MASS::rnegbin(n, mu = mu, theta = theta)
    if (outliers > 0) {
        idx <- sample.int(n, outliers)
        y[idx] <- y[idx] + round(stats::rnbinom(outliers, size = theta,
                                                mu = max(mu) * 50))
    }
    list(X = X, y = y)
}

The twelve scenarios span the range from a benign reference case to designs that base IRLS routinely chokes on:

scen <- list(
  baseline      = list(n = 200, p = 3,  theta = 1,    coef_sd = 0.5),
  wide_mu       = list(n = 200, p = 9,  theta = 1,    coef_sd = 1.4, intercept = 4),
  extreme_mu    = list(n = 200, p = 12, theta = 1,    coef_sd = 1.6, intercept = 5),
  tiny_theta    = list(n = 200, p = 3,  theta = 0.05, coef_sd = 0.6),
  vtiny_theta   = list(n = 200, p = 3,  theta = 0.01, coef_sd = 0.6),
  near_poisson  = list(n = 200, p = 3,  theta = 5000, coef_sd = 0.5),
  small_n       = list(n = 40,  p = 5,  theta = 1,    coef_sd = 0.7),
  high_dim      = list(n = 120, p = 60, theta = 2,    coef_sd = 0.5),
  collinear     = list(n = 200, p = 6,  theta = 1,    coef_sd = 0.6, collinear = TRUE),
  outliers      = list(n = 200, p = 4,  theta = 1,    coef_sd = 0.5, outliers = 5L),
  big_intercept = list(n = 200, p = 4,  theta = 0.5,  coef_sd = 1.2, intercept = 6),
  many_zeros    = list(n = 200, p = 4,  theta = 0.3,  coef_sd = 0.6, intercept = -0.5)
)

descriptions <- c(
  baseline      = "well-conditioned reference case",
  wide_mu       = "fitted means span many orders of magnitude",
  extreme_mu    = "even wider mean range, p = 12",
  tiny_theta    = "heavy overdispersion (theta = 0.05)",
  vtiny_theta   = "extreme overdispersion (theta = 0.01)",
  near_poisson  = "nearly Poisson (theta = 5000)",
  small_n       = "small sample, n = 40",
  high_dim      = "high dimensional, p = 60",
  collinear     = "near-collinear design column",
  outliers      = "five extreme response outliers",
  big_intercept = "large intercept inflates the mean scale",
  many_zeros    = "low mean, many zeros (theta = 0.3)"
)

Scoring convergence

For one simulated data set we fit both solvers and record, for each: whether it errored, whether it converged, whether MASS emitted an iteration- or step-limit warning, the maximized log-likelihood, the dispersion estimate, and the maximum relative coefficient discrepancy.

The scoring is deliberately strict on MASS and lenient nowhere: MASS counts as cleanly converged only when it converges and emits no warning, since an iteration-limit warning means the reported fit is not to be trusted. fastglm_nb() counts as converged when it returns without error, sets converged = TRUE, and produces no NaN coefficients. Near-Poisson and badly overfit cells can drive theta toward a non-identifiable boundary, which is a property of the data rather than a failure of the solver, so we score convergence at the solver level and separately verify that, whenever both methods succeed, fastglm never lands at a worse optimum.

fit_both <- function(d) {
    X <- d$X; y <- d$y

    m_err <- FALSE; m_warn <- FALSE; m_conv <- FALSE
    m_ll <- NA_real_; m_th <- NA_real_; m_co <- NULL
    tryCatch(
        withCallingHandlers(
            {
                m <- MASS::glm.nb(y ~ X[, -1], control = glm.control(maxit = 50))
                m_conv <- isTRUE(m$converged)
                m_ll   <- as.numeric(logLik(m))
                m_th   <- m$theta
                m_co   <- coef(m)
            },
            warning = function(w) { m_warn <<- TRUE; invokeRestart("muffleWarning") }
        ),
        error = function(e) { m_err <<- TRUE }
    )

    f_err <- FALSE; f_conv <- FALSE
    f_ll <- NA_real_; f_th <- NA_real_; f_co <- NULL
    tryCatch({
        f <- fastglm_nb(X, y, link = "log")
        f_conv <- isTRUE(f$converged) && !any(is.nan(coef(f)))
        f_ll   <- as.numeric(logLik(f))
        f_th   <- f$theta
        f_co   <- as.numeric(coef(f))
    }, error = function(e) { f_err <<- TRUE })

    co_rel <- if (!is.null(m_co) && !is.null(f_co))
        max(abs(f_co - m_co) / (abs(m_co) + 1e-6)) else NA_real_

    c(m_err = m_err, m_warn = m_warn, m_conv = m_conv,
      f_err = f_err, f_conv = f_conv,
      m_ll = m_ll, f_ll = f_ll, m_th = m_th, f_th = f_th, co_rel = co_rel)
}

Running the benchmark

Each scenario is replicated 50 times with distinct seeds, and the per-replication results are collapsed into the counts that matter for stability: how often MASS converges cleanly, how often fastglm converges, where fastglm rescues a MASS failure, where it loses, and, among the replications where both succeed, how often fastglm’s log-likelihood is at least as large as MASS’s.

n_reps <- 50
out <- data.frame()

for (nm in names(scen)) {
    s <- scen[[nm]]
    M <- t(vapply(seq_len(n_reps), function(r)
        fit_both(do.call(gen, c(list(seed = 10000 * match(nm, names(scen)) + r), s))),
        numeric(10)))
    df <- as.data.frame(M)

    mass_ok    <- !df$m_err & df$m_conv & is.finite(df$m_ll)
    mass_clean <- mass_ok & !df$m_warn
    fg_ok      <- !df$f_err & df$f_conv & is.finite(df$f_ll)
    both       <- mass_ok & fg_ok

    fg_ge <- if (any(both)) sum(df$f_ll[both] >= df$m_ll[both] - 1e-3) else 0L
    co_ok <- if (any(both)) sum(df$co_rel[both] < 1e-2, na.rm = TRUE) else 0L

    out <- rbind(out, data.frame(
        scenario   = nm,
        mass_clean = sum(mass_clean),
        fg_conv    = sum(fg_ok),
        fg_rescue  = sum(!mass_clean & fg_ok),
        fg_lose    = sum(mass_clean & !fg_ok),
        both       = sum(both),
        fg_ge_ll   = fg_ge,
        coef_agree = co_ok
    ))
}

Results

tab <- data.frame(
    Scenario   = out$scenario,
    Description = descriptions[out$scenario],
    `MASS clean` = out$mass_clean,
    `fastglm`    = out$fg_conv,
    Rescue       = out$fg_rescue,
    Lose         = out$fg_lose,
    `Both conv`  = out$both,
    `fg >= MASS` = out$fg_ge_ll,
    check.names  = FALSE
)
knitr::kable(
    tab, row.names = FALSE, align = "llrrrrrr",
    caption = sprintf(
        "Convergence over %d replications per scenario. 'MASS clean' = converged with no warning; 'fastglm' = converged with finite coefficients; 'Rescue' = fastglm converges where MASS does not; 'Lose' = MASS clean but fastglm fails; 'fg >= MASS' = replications (of 'Both conv') where fastglm's log-likelihood is at least MASS's.",
        n_reps)
)
Convergence over 50 replications per scenario. ‘MASS clean’ = converged with no warning; ‘fastglm’ = converged with finite coefficients; ‘Rescue’ = fastglm converges where MASS does not; ‘Lose’ = MASS clean but fastglm fails; ‘fg >= MASS’ = replications (of ‘Both conv’) where fastglm’s log-likelihood is at least MASS’s.
Scenario Description MASS clean fastglm Rescue Lose Both conv fg >= MASS
baseline well-conditioned reference case 50 50 0 0 50 50
wide_mu fitted means span many orders of magnitude 34 50 16 0 35 35
extreme_mu even wider mean range, p = 12 23 50 27 0 29 29
tiny_theta heavy overdispersion (theta = 0.05) 47 46 0 1 46 46
vtiny_theta extreme overdispersion (theta = 0.01) 16 35 19 0 23 23
near_poisson nearly Poisson (theta = 5000) 17 35 18 0 28 28
small_n small sample, n = 40 48 50 2 0 50 50
high_dim high dimensional, p = 60 3 49 46 0 23 19
collinear near-collinear design column 50 50 0 0 50 50
outliers five extreme response outliers 40 40 0 0 40 40
big_intercept large intercept inflates the mean scale 47 50 3 0 47 47
many_zeros low mean, many zeros (theta = 0.3) 50 50 0 0 50 50
library(ggplot2)

plot_df <- rbind(
    data.frame(scenario = out$scenario, method = "MASS::glm.nb (clean)",
               count = out$mass_clean),
    data.frame(scenario = out$scenario, method = "fastglm_nb",
               count = out$fg_conv)
)
plot_df$scenario <- factor(plot_df$scenario, levels = out$scenario[order(out$mass_clean)])
plot_df$method   <- factor(plot_df$method,
                           levels = c("MASS::glm.nb (clean)", "fastglm_nb"))

ggplot(plot_df, aes(scenario, count, fill = method)) +
    geom_col(position = position_dodge(width = 0.75), width = 0.7) +
    scale_fill_manual(values = c("MASS::glm.nb (clean)" = "#b2182b",
                                 "fastglm_nb" = "#2166ac")) +
    labs(x = NULL, y = "clean convergences (of 50)", fill = NULL,
         title = "Negative-binomial stability across scenarios") +
    coord_cartesian(ylim = c(0, n_reps)) +
    theme_minimal(base_size = 12) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "top",
          panel.grid.major.x = element_blank())
plot of chunk convergence-plot
plot of chunk convergence-plot

The headline result is that fastglm_nb() converges at least as often as MASS::glm.nb() in every scenario, and never finds a worse optimum when both succeed. In all but a handful of high-dimensional replications (where theta itself is non-identifiable and both methods drift to the same degenerate boundary) fastglm’s maximized log-likelihood is greater than or equal to MASS’s. Across all 12 scenarios and 50 replications, 600 fits in total, MASS converged cleanly 425 times (71%) against fastglm’s 555 (92%). fastglm rescued 131 replications where MASS did not converge cleanly, and lost only 1; in 11 of the 12 scenarios its log-likelihood was at least as large as MASS’s on every shared fit.

The gaps open up exactly where the joint (beta, theta) problem is hardest. When the fitted means span many orders of magnitude (wide_mu, extreme_mu, big_intercept), MASS’s IRLS frequently exhausts its iteration budget while fastglm’s step-halving keeps it on the likelihood ascent. At the dispersion extremes (vtiny_theta, near_poisson), MASS often warns or stalls and fastglm rescues roughly twenty replications in each. In the high-dimensional case (high_dim, p = 60), MASS cleanly converges only a few times while fastglm converges on nearly every replication. On the well-conditioned cells (baseline, collinear, many_zeros, small_n) the two agree essentially perfectly, the necessary sanity check that the added robustness costs nothing on easy problems.

Whenever both solvers reach a genuine (non-degenerate) optimum their coefficient estimates agree to within 1%, true in 8 of the 12 scenarios, which confirms that fastglm_nb() and MASS::glm.nb() are solving the same maximum-likelihood problem. fastglm is simply more reliable at reaching the solution.

References

  • Marschner, I. C. (2011). glm2: Fitting generalized linear models with convergence problems. The R Journal, 3(2), 12–15. doi:10.32614/RJ-2011-012

  • Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edition. Springer, New York. doi:10.1007/978-0-387-21706-2