Title: | Two-Part Estimation of Treatment Rules for Semi-Continuous Data |
---|---|
Description: | Implements the methodology of Huling, Smith, and Chen (2020) <doi:10.1080/01621459.2020.1801449>, which allows for subgroup identification for semi-continuous outcomes by estimating individualized treatment rules. It uses a two-part modeling framework to handle semi-continuous data by separately modeling the positive part of the outcome and an indicator of whether each outcome is positive, but still results in a single treatment rule. High dimensional data is handled with a cooperative lasso penalty, which encourages the coefficients in the two models to have the same sign. |
Authors: | Jared Huling [aut, cre] |
Maintainer: | Jared Huling <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.0.2 |
Built: | 2024-10-31 18:37:19 UTC |
Source: | https://github.com/jaredhuling/personalized2part |
Cross validation for hd2part models
cv.hd2part( x, z, x_s, s, weights = rep(1, NROW(x)), weights_s = rep(1, NROW(x_s)), offset = NULL, offset_s = NULL, lambda = NULL, type.measure = c("mae", "mse", "sep-auc-mse", "sep-auc-mae"), nfolds = 10, foldid = NULL, grouped = TRUE, keep = FALSE, parallel = FALSE, ... )
cv.hd2part( x, z, x_s, s, weights = rep(1, NROW(x)), weights_s = rep(1, NROW(x_s)), offset = NULL, offset_s = NULL, lambda = NULL, type.measure = c("mae", "mse", "sep-auc-mse", "sep-auc-mae"), nfolds = 10, foldid = NULL, grouped = TRUE, keep = FALSE, parallel = FALSE, ... )
x |
an n x p matrix of covariates for the zero part data, where each row is an observation and each column is a predictor. MUST be ordered such that the first n_s rows align with the observations in x_s and s |
z |
a length n vector of responses taking values 1 and 0, where 1 indicates the response is positive and zero indicates the response has value 0. MUST be ordered such that the first n_s values align with the observations in x_s and s |
x_s |
an n_s x p matrix of covariates (which is a submatrix of x) for the positive part data, where each row is an observation and each column is a predictor |
s |
a length n_s vector of responses taking strictly positive values |
weights |
a length n vector of observation weights for the zero part data |
weights_s |
a length n_s vector of observation weights for the positive part data |
offset |
a length n vector of offset terms for the zero part data |
offset_s |
a length n_s vector of offset terms for the positive part data |
lambda |
A user supplied lambda sequence. By default, the program computes its own lambda sequence based on nlambda and lambda.min.ratio. Supplying a value of lambda overrides this. |
type.measure |
measure to evaluate for cross-validation. Will add more description later |
nfolds |
number of folds for cross-validation. default is 10. 3 is smallest value allowed. |
foldid |
an optional vector of values between 1 and nfold specifying which fold each observation belongs to. |
grouped |
Like in glmnet, this is an experimental argument, with default |
keep |
If |
parallel |
If TRUE, use parallel foreach to fit each fold. Must register parallel before hand, such as doMC. |
... |
other parameters to be passed to |
set.seed(1)
set.seed(1)
Fits subgroup identification models
fit_subgroup_2part( x, y, trt, propensity.func = NULL, propensity.func.positive = NULL, match.id = NULL, augment.func.zero = NULL, augment.func.positive = NULL, cutpoint = 1, larger.outcome.better = TRUE, penalize.ate = TRUE, y_eps = 1e-06, ... )
fit_subgroup_2part( x, y, trt, propensity.func = NULL, propensity.func.positive = NULL, match.id = NULL, augment.func.zero = NULL, augment.func.positive = NULL, cutpoint = 1, larger.outcome.better = TRUE, penalize.ate = TRUE, y_eps = 1e-06, ... )
x |
The design matrix (not including intercept term) |
y |
The nonnegative response vector |
trt |
treatment vector with each element equal to a 0 or a 1, with 1 indicating treatment status is active. |
propensity.func |
function that inputs the design matrix x and the treatment vector trt and outputs
the propensity score, ie Pr(trt = 1 | X = x). Function should take two arguments 1) x and 2) trt. See example below.
For a randomized controlled trial this can simply be a function that returns a constant equal to the proportion
of patients assigned to the treatment group, i.e.:
|
propensity.func.positive |
function that inputs the design matrix x and the treatment vector trt and outputs
the propensity score for units with positive outcome values, ie Pr(trt = 1 | X = x, Z = 1). Function should take
two arguments 1) x and 2) trt. See example below.
For a randomized controlled trial this can simply be a function that returns a constant equal to the proportion
of patients assigned to the treatment group, i.e.:
|
match.id |
a (character, factor, or integer) vector with length equal to the number of observations in Example 1: Example 2: augment.func <- function(x, y, trt) { data <- data.frame(x, y, trt) lmod <- glm(y ~ x * trt, family = binomial()) ## get predictions when trt = 1 data$trt <- 1 preds_1 <- predict(lmod, data, type = "response") ## get predictions when trt = -1 data$trt <- -1 preds_n1 <- predict(lmod, data, type = "response") ## return predictions averaged over trt return(0.5 * (preds_1 + preds_n1)) } |
augment.func.zero |
(similar to augment.func.positive) function which inputs the
indicators of whether each response is positive ( |
augment.func.positive |
(similar to augment.func.zero) function which inputs the positive part response
(ie all observations in |
cutpoint |
numeric value for patients with benefit scores above which
(or below which if |
larger.outcome.better |
boolean value of whether a larger outcome is better/preferable. Set to |
penalize.ate |
should the treatment main effect (ATE) be penalized too? |
y_eps |
positive value above which observations in |
... |
options to be passed to |
set.seed(42) dat <- sim_semicontinuous_data(250, n.vars = 15) x <- dat$x y <- dat$y trt <- dat$trt prop_func <- function(x, trt) { propensmod <- glm(trt ~ x, family = binomial()) propens <- unname(fitted(propensmod)) propens } fitted_model <- fit_subgroup_2part(x, y, trt, prop_func, prop_func) fitted_model ## correlation of estimated covariate-conditional risk ratio and truth cor(fitted_model$benefit.scores, dat$treatment_risk_ratio, method = "spearman")
set.seed(42) dat <- sim_semicontinuous_data(250, n.vars = 15) x <- dat$x y <- dat$y trt <- dat$trt prop_func <- function(x, trt) { propensmod <- glm(trt ~ x, family = binomial()) propens <- unname(fitted(propensmod)) propens } fitted_model <- fit_subgroup_2part(x, y, trt, prop_func, prop_func) fitted_model ## correlation of estimated covariate-conditional risk ratio and truth cor(fitted_model$benefit.scores, dat$treatment_risk_ratio, method = "spearman")
This function fits penalized two part models with a logistic regression model for the zero part and a gamma regression model for the positive part. Each covariate's effect has either a group lasso or cooperative lasso penalty for its effects for the two consituent models
hd2part( x, z, x_s, s, weights = rep(1, NROW(x)), weights_s = rep(1, NROW(x_s)), offset = NULL, offset_s = NULL, penalty = c("grp.lasso", "coop.lasso"), penalty_factor = NULL, nlambda = 100L, lambda_min_ratio = ifelse(n_s < p, 0.05, 0.005), lambda = NULL, tau = 0, opposite_signs = FALSE, flip_beta_zero = FALSE, intercept_z = FALSE, intercept_s = FALSE, strongrule = TRUE, maxit_irls = 50, tol_irls = 1e-05, maxit_mm = 500, tol_mm = 1e-05, balance_likelihoods = TRUE )
hd2part( x, z, x_s, s, weights = rep(1, NROW(x)), weights_s = rep(1, NROW(x_s)), offset = NULL, offset_s = NULL, penalty = c("grp.lasso", "coop.lasso"), penalty_factor = NULL, nlambda = 100L, lambda_min_ratio = ifelse(n_s < p, 0.05, 0.005), lambda = NULL, tau = 0, opposite_signs = FALSE, flip_beta_zero = FALSE, intercept_z = FALSE, intercept_s = FALSE, strongrule = TRUE, maxit_irls = 50, tol_irls = 1e-05, maxit_mm = 500, tol_mm = 1e-05, balance_likelihoods = TRUE )
x |
an n x p matrix of covariates for the zero part data, where each row is an observation and each column is a predictor |
z |
a length n vector of responses taking values 1 and 0, where 1 indicates the response is positive and zero indicates the response has value 0. |
x_s |
an n_s x p matrix of covariates (which is a submatrix of x) for the positive part data, where each row is an observation and each column is a predictor |
s |
a length n_s vector of responses taking strictly positive values |
weights |
a length n vector of observation weights for the zero part data |
weights_s |
a length n_s vector of observation weights for the positive part data |
offset |
a length n vector of offset terms for the zero part data |
offset_s |
a length n_s vector of offset terms for the positive part data |
penalty |
either |
penalty_factor |
a length p vector of penalty adjustment factors corresponding to each covariate. A value of 0 in the jth location indicates no penalization on the jth variable, and any positive value will indicate a multiplicative factor on top of the common penalization amount. The default value is 1 for all variables |
nlambda |
the number of lambda values. The default is 100. |
lambda_min_ratio |
Smallest value for |
lambda |
a user supplied sequence of penalization tuning parameters. By default, the program automatically
chooses a sequence of lambda values based on |
tau |
value between 0 and 1 for sparse group mixing penalty. 0 implies either group lasso or coop lasso and 1 implies lasso |
opposite_signs |
a boolean variable indicating whether the signs of coefficients across models should be encouraged to have
opposite signs instead of the same signs. Default is |
flip_beta_zero |
should we flip the signs of the parameters for the zero part model? Defaults to |
intercept_z |
whether or not to include an intercept in the zero part model. Default is |
intercept_s |
whether or not to include an intercept in the positive part model. Default is |
strongrule |
should a strong rule be used? Defaults to |
maxit_irls |
maximum number of IRLS iterations |
tol_irls |
convergence tolerance for IRLS iterations |
maxit_mm |
maximum number of MM iterations. Note that for |
tol_mm |
convergence tolerance for MM iterations. Note that for |
balance_likelihoods |
should the likelihoods be balanced so variables would enter both models at the same value of lambda
if the penalty were a lasso penalty? Recommended to keep at the default, |
library(personalized2part)
library(personalized2part)
This function fits penalized gamma GLMs
hdgamma( x, y, weights = rep(1, NROW(x)), offset = NULL, penalty_factor = NULL, nlambda = 100L, lambda_min_ratio = ifelse(n < p, 0.05, 0.005), lambda = NULL, tau = 0, intercept = TRUE, strongrule = TRUE, maxit_irls = 50, tol_irls = 1e-05, maxit_mm = 500, tol_mm = 1e-05 )
hdgamma( x, y, weights = rep(1, NROW(x)), offset = NULL, penalty_factor = NULL, nlambda = 100L, lambda_min_ratio = ifelse(n < p, 0.05, 0.005), lambda = NULL, tau = 0, intercept = TRUE, strongrule = TRUE, maxit_irls = 50, tol_irls = 1e-05, maxit_mm = 500, tol_mm = 1e-05 )
x |
an n x p matrix of covariates for the zero part data, where each row is an observation and each column is a predictor |
y |
a length n vector of responses taking strictly positive values. |
weights |
a length n vector of observation weights |
offset |
a length n vector of offset terms |
penalty_factor |
a length p vector of penalty adjustment factors corresponding to each covariate. A value of 0 in the jth location indicates no penalization on the jth variable, and any positive value will indicate a multiplicative factor on top of the common penalization amount. The default value is 1 for all variables |
nlambda |
the number of lambda values. The default is 100. |
lambda_min_ratio |
Smallest value for |
lambda |
a user supplied sequence of penalization tuning parameters. By default, the program automatically
chooses a sequence of lambda values based on |
tau |
a scalar numeric value between 0 and 1 (included) which is a mixing parameter for sparse group lasso penalty. 0 indicates group lasso and 1 indicates lasso, values in between reflect different emphasis on group and lasso penalties |
intercept |
whether or not to include an intercept. Default is |
strongrule |
should a strong rule be used? |
maxit_irls |
maximum number of IRLS iterations |
tol_irls |
convergence tolerance for IRLS iterations |
maxit_mm |
maximum number of MM iterations. Note that for |
tol_mm |
convergence tolerance for MM iterations. Note that for |
library(personalized2part)
library(personalized2part)
Fits a penalized gamma augmentation model via cross fitting and returns vector of length n of out of sample predictions on the link scale from cross fitting
HDtweedie_kfold_aug( x, y, trt, wts = NULL, K = 10, p = 1.5, interactions = FALSE )
HDtweedie_kfold_aug( x, y, trt, wts = NULL, K = 10, p = 1.5, interactions = FALSE )
x |
an n x p matrix of covariates for the zero part data, where each row is an observation and each column is a predictor. MUST be ordered such that the first n_s rows align with the observations in x_s and s |
y |
a length n vector of responses taking positive values |
trt |
a length n vector of treatment variables with 1 indicating treatment and -1 indicating control |
wts |
a length n vector of sample weights |
K |
number of folds for cross fitting |
p |
tweedie mixing parameter. See |
interactions |
boolean variable of whether or not to fit model with interactions. For predictions, interactions will be integrated out |
Plot method for hd2part fitted objects
## S3 method for class 'hd2part' plot( x, model = c("zero", "positive"), xvar = c("loglambda", "norm", "lambda"), labsize = 0.6, xlab = iname, ylab = NULL, main = paste(model, "model"), ... ) ## S3 method for class 'cv.hd2part' plot(x, sign.lambda = 1, ...)
## S3 method for class 'hd2part' plot( x, model = c("zero", "positive"), xvar = c("loglambda", "norm", "lambda"), labsize = 0.6, xlab = iname, ylab = NULL, main = paste(model, "model"), ... ) ## S3 method for class 'cv.hd2part' plot(x, sign.lambda = 1, ...)
x |
fitted "hd2part" model object |
model |
either |
xvar |
What is on the X-axis. |
labsize |
size of labels for variable names. If labsize = 0, then no variable names will be plotted |
xlab |
label for x-axis |
ylab |
label for y-axis |
main |
main title for plot |
... |
other graphical parameters for the plot |
sign.lambda |
Either plot against log(lambda) (default) or its negative if |
set.seed(123) set.seed(123)
set.seed(123) set.seed(123)
Prediction function for fitted cross validation hd2part objects
## S3 method for class 'cv.hd2part' predict( object, newx, model = c("zero", "positive"), s = c("lambda.min", "lambda.1se"), type = c("link", "model_response", "response", "coefficients", "nonzero"), ... )
## S3 method for class 'cv.hd2part' predict( object, newx, model = c("zero", "positive"), s = c("lambda.min", "lambda.1se"), type = c("link", "model_response", "response", "coefficients", "nonzero"), ... )
object |
fitted |
newx |
Matrix of new values for |
model |
either |
s |
Value(s) of the penalty parameter lambda at which predictions are required. Default is the entire sequence used to create
the model. For |
type |
Type of prediction required. |
... |
arguments to be passed to |
set.seed(123)
set.seed(123)
Prediction method for two part fitted objects
## S3 method for class 'hd2part' predict( object, newx, s = NULL, model = c("zero", "positive"), type = c("link", "model_response", "response", "coefficients", "nonzero"), newoffset = NULL, ... )
## S3 method for class 'hd2part' predict( object, newx, s = NULL, model = c("zero", "positive"), type = c("link", "model_response", "response", "coefficients", "nonzero"), newoffset = NULL, ... )
object |
fitted "hd2part" model object |
newx |
Matrix of new values for |
s |
Value(s) of the penalty parameter lambda for the zero part at which predictions are required. Default is the entire sequence used to create the model. |
model |
either |
type |
Type of prediction required. |
newoffset |
f an offset is used in the fit, then one must be supplied for making predictions |
... |
not used |
An object depending on the type argument
set.seed(1)
set.seed(1)
Generates semicontinuous data with heterogeneity of treatment effect
sim_semicontinuous_data(n.obs = 1000, n.vars = 25)
sim_semicontinuous_data(n.obs = 1000, n.vars = 25)
n.obs |
number of observations |
n.vars |
number of variables. Must be at least 10 |
returns list with values y
for outcome, x
for design matrix, trt
for
treatment assignments, betanonzero
for true coefficients for treatment-covariate interactions for model for
whether or not a response is nonzero, betapos
for true coefficients for treatment-covariate interactions
for positive model, treatment_risk_ratio
for the true covariate-conditional treatment effect risk ratio for
each observation, pi.x
for the true underlying propensity score