Negative-Binomial, Hurdle, and Zero-Inflation with ‘fastglm’

Beyond family-driven GLMs, fastglm has fitting functions for three count-data model types that arise often in practice but require model-specific machinery on top of the standard IRLS solver:

  • fastglm_nb() — negative-binomial regression with the dispersion theta estimated jointly with the regression coefficients, in the spirit of MASS::glm.nb().
  • fastglm_hurdle() — a two-part count model with a binary zero / non-zero component and a zero-truncated Poisson or NB count component, the same model as pscl::hurdle().
  • fastglm_zi() — a zero-inflated Poisson or NB regression, with a binary inflation component layered on the original (untruncated) count distribution, the same model as pscl::zeroinfl().

All three reuse the same C++ IRLS solver as fastglm() itself; the outer iteration (joint NB MLE, the EM loop in zero-inflation, the inner theta-Brent for unknown-theta NB hurdle / ZI) likewise lives in C++. Cameron and Trivedi (1998) is the standard textbook reference for the count-data theory underlying this vignette, and Zeileis, Kleiber, and Jackman (2008) gives the pscl implementation we benchmark against.

library(fastglm)
suppressPackageStartupMessages({
    library(MASS)        # glm.nb, rnegbin
    library(pscl)        # hurdle, zeroinfl, bioChemists
    library(microbenchmark)
})

Negative-binomial regression

fastglm_nb() is a drop-in alternative to MASS::glm.nb() (Venables and Ripley, 2002) for the NB2 model with Var(Y) = mu + mu^2 / theta. Both theta and beta are estimated by maximum likelihood, alternating an IRLS update for beta at fixed theta with a 1-D Brent root-find for theta at fixed beta. Both loops run in C++.

A small-sample comparison on MASS::quine (school absences):

data(quine)
X <- model.matrix(~ Eth + Sex + Age + Lrn, data = quine)
y <- quine$Days

fit_f <- fastglm_nb(X, y)
fit_m <- MASS::glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)

c(theta_fastglm = fit_f$theta, theta_glm.nb = fit_m$theta)
#> theta_fastglm  theta_glm.nb 
#>      1.274893      1.274893

max(abs(unname(coef(fit_f)) - unname(coef(fit_m))))
#> [1] 3.067281e-08
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_m)))
#> [1] 3.410605e-13

Coefficients and theta agree to roughly 1e-8 on the same MLE.

A timing comparison, on a moderately sized simulated NB(theta = 2) example:

set.seed(1)
n  <- 5e4
X  <- cbind(1, matrix(rnorm(n * 3), n, 3))
mu <- exp(X %*% c(0.5, 0.4, -0.2, 0.3))
y  <- MASS::rnegbin(n, mu = mu, theta = 2)
df <- data.frame(y = y, x1 = X[, 2], x2 = X[, 3], x3 = X[, 4])

mb_nb <- microbenchmark(
    fastglm_nb = fastglm_nb(X, y),
    glm.nb     = MASS::glm.nb(y ~ x1 + x2 + x3, data = df),
    times = 5L
)
print(mb_nb)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  fastglm_nb 185.3687 185.4818 186.7029 185.6752 188.4846  188.504     5
#>      glm.nb 897.6814 904.4765 945.5907 911.0589 924.5146 1090.222     5

The native NB family kernel is also exposed directly through negbin(theta, link) for the case when theta is known (or estimated separately). Holding theta fixed at the joint MLE recovers the fastglm_nb() regression coefficients exactly, since fastglm_nb() is just IRLS at the converged theta:

fit_joint <- fastglm_nb(X, y)
fit_known <- fastglm(X, y, family = negbin(theta = fit_joint$theta, link = "log"))
max(abs(unname(coef(fit_known)) - unname(coef(fit_joint))))
#> [1] 1.624216e-07

Hurdle models

Hurdle models, introduced by Mullahy (1986), factorize a count distribution into two independent pieces:

  • a binary regression for 1(y > 0) over the full sample (the zero part);
  • a zero-truncated Poisson or NB regression for y > 0 over the positive subset (the count part).

Because the two parts share no parameters, the joint likelihood factorizes and they can be fit independently. fastglm’s C++ driver fits both parts using the same IRLS solver as fastglm(), with new FAM_POIS_TRUNC_* / FAM_NB_TRUNC_* family codes that handle the truncation correction stably (expm1, log1p near mu = 0).

The formula uses the Formula package convention y ~ x1 + x2 | z1 + z2; the right-hand side after | specifies the zero-part design (it defaults to the count-part design if absent).

data(bioChemists, package = "pscl")
fit_f <- fastglm_hurdle(art ~ ., data = bioChemists, dist = "poisson")
fit_p <- pscl::hurdle (art ~ ., data = bioChemists, dist = "poisson")

max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 3.144634e-07
max(abs(unname(coef(fit_f, "zero"))  - unname(fit_p$coefficients$zero)))
#> [1] 1.27146e-10
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 1.07093e-10

coef() and vcov() accept a model = c("full", "count", "zero") argument so each part can be inspected separately:

coef(fit_f, model = "count")
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.67113934 -0.22858262  0.09648498 -0.14218724 -0.01272657  0.01874550
coef(fit_f, model = "zero")
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.23679601 -0.25115113  0.32623358 -0.28524872  0.02221940  0.08012135

For NB count parts, the dispersion is estimated by an inner Brent MLE that runs between outer IRLS iterations; the outer-loop tolerance is controlled by outer.tol / outer.maxit.

A timing comparison on a 4000-observation simulated example:

set.seed(11)
n  <- 4000
x1 <- rnorm(n);  x2 <- rnorm(n)
lam    <- exp(0.7 + 0.4 * x1 - 0.3 * x2)
is_pos <- rbinom(n, 1, plogis(-0.4 + 0.5 * x1 + 0.2 * x2))
yt     <- integer(n)
for (i in seq_len(n)) {
    repeat { v <- rpois(1, lam[i]); if (v > 0) { yt[i] <- v; break } }
}
y  <- ifelse(is_pos == 1, yt, 0L)
df <- data.frame(y = y, x1 = x1, x2 = x2)

mb_hurdle <- microbenchmark(
    fastglm_hurdle = fastglm_hurdle(y ~ x1 + x2, data = df, dist = "poisson"),
    pscl_hurdle    = pscl::hurdle (y ~ x1 + x2, data = df, dist = "poisson"),
    times = 5L
)
print(mb_hurdle)
#> Unit: milliseconds
#>            expr       min        lq      mean    median        uq       max
#>  fastglm_hurdle  3.120328  3.873188  4.397503  4.596362  4.677564  5.720073
#>     pscl_hurdle 57.615037 58.952616 60.870631 62.488692 62.580469 62.716340
#>  neval
#>      5
#>      5

Zero-inflated models

Zero inflation, introduced by Lambert (1992), differs from a hurdle in that the two components share a latent variable: a y = 0 outcome could come from the inflation component (with probability pi_i) or from the count component (with probability 1 - pi_i and a count-side zero). The likelihood for y = 0 is

\[ \Pr(Y_i = 0) = \pi_i + (1 - \pi_i) f(0; \mu_i) \]

and for y > 0 it is (1 - pi_i) f(y; mu_i). The two-component mixture rules out a closed-form factorization, so fastglm uses an EM algorithm:

  • E-step: posterior tau_i = P(Z_i = 1 | y_i) for each observation, computed in log-space with logsumexp to avoid catastrophic cancellation when pi_i is near 0 or 1.
  • M-step: a binomial logit / probit / cloglog / log fit with response tau (continuous), weights 1; followed by a Poisson or NB fit on the original y with prior weights 1 - tau. For NB, an inner Brent MLE re-estimates theta after each count-side step.

The final observed-information vcov comes from a numerical Jacobian of the analytical observed score at the EM fixed point, which is stable and cheap (block-diagonal per (gamma, beta, theta)). The complete EM driver — E-step, both M-steps, the inner theta MLE, and the score Jacobian — runs in C++.

The formula syntax matches fastglm_hurdle() exactly:

fit_f <- fastglm_zi(art ~ ., data = bioChemists, dist = "poisson",
                    em.tol = 1e-10, em.maxit = 300L)
fit_p <- pscl::zeroinfl(art ~ ., data = bioChemists, dist = "poisson")

max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 2.654458e-06
max(abs(unname(coef(fit_f, "zero"))  - unname(fit_p$coefficients$zero)))
#> [1] 1.835971e-05
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 1.011585e-09

EM is iterative, so the agreement is slightly looser than for hurdle (where the two parts are closed-form independent fits), but coefficients and the joint log-likelihood still match to about 1e-5.

A timing comparison on a simulated zero-inflated Poisson:

set.seed(21)
n  <- 3000
x1 <- rnorm(n);  x2 <- rnorm(n)
eta_c <- 0.7 + 0.4 * x1 - 0.3 * x2
eta_z <- -0.4 + 0.5 * x1 + 0.2 * x2
z     <- rbinom(n, 1, plogis(eta_z))
y     <- ifelse(z == 1, 0L, rpois(n, exp(eta_c)))
df    <- data.frame(y = y, x1 = x1, x2 = x2)

mb_zi <- microbenchmark(
    fastglm_zi = fastglm_zi(y ~ x1 + x2, data = df, dist = "poisson"),
    pscl_zi    = pscl::zeroinfl(y ~ x1 + x2, data = df, dist = "poisson"),
    times = 5L
)
print(mb_zi)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>  fastglm_zi 16.54667 16.57797 17.24770 16.66254 16.80419 19.64714     5
#>     pscl_zi 68.29164 68.30286 68.54763 68.36196 68.60416 69.17753     5

References

  • Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press.

  • Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365. doi:10.1016/0304-4076(86)90002-3

  • Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14. doi:10.2307/1269547

  • Zeileis, A., Kleiber, C., and Jackman, S. (2008). Regression models for count data in R. Journal of Statistical Software, 27(8), 1–25. doi:10.18637/jss.v027.i08

  • Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.