This function selects important features using one of three methods: glmnet (lasso), xgboost or ranger.

feature_selection(
  X,
  y,
  method = NULL,
  params_glmnet = NULL,
  params_xgboost = NULL,
  params_ranger = NULL,
  xgb_sort = NULL,
  CV_folds = 5,
  stratified_regr = FALSE,
  scale_coefs_glmnet = FALSE,
  cores_glmnet = NULL,
  verbose = FALSE
)

Arguments

X

a sparse Matrix, a matrix or a data frame

y

a vector, of length equal to the number of rows of X, representing the response variable

method

one of 'glmnet-lasso', 'xgboost', 'ranger'

params_glmnet

a list of parameters for the glmnet model (illustrative lists for all three params_* arguments are sketched at the end of this argument list)

params_xgboost

a list of parameters for the xgboost model

params_ranger

a list of parameters for the ranger model

xgb_sort

sort the xgboost features by "Gain", "Cover" or "Frequency" (defaults to "Frequency")

CV_folds

an integer specifying the number of folds for cross-validation

stratified_regr

a boolean specifying whether the cross-validation folds should be stratified in case of regression

scale_coefs_glmnet

if TRUE, the glmnet coefficients are scaled so that less important coefficients have a smaller magnitude than more important ones (ranking and plotting by magnitude is then possible)

cores_glmnet

an integer determining the number of cores to register in glmnet

verbose

if TRUE, information about the progress of the computation is printed out
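
The three params_* arguments are plain named lists whose entries are handed over to the underlying package. A minimal, illustrative sketch of such lists follows; it assumes the entries follow the argument names of glmnet::cv.glmnet, ranger::ranger and xgboost::xgb.train, so treat it as an assumption rather than an exhaustive specification:

# illustrative parameter lists -- treat the exact entries as assumptions and consult
# the documentation of glmnet, ranger and xgboost for the full set of options
params_glmnet  = list(alpha = 1, family = 'gaussian', nfolds = 3, parallel = FALSE)

params_ranger  = list(num.trees = 50, mtry = 2, min.node.size = 5,
                      importance = 'permutation', write.forest = TRUE)

params_xgboost = list(params = list("objective" = "reg:squarederror", "max_depth" = 4,
                                    "eta" = 0.3, "nthread" = 2),
                      nrounds = 50, verbose = 0)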

Value

a data frame with the most important features

Details

This function returns the important features using one of the glmnet, xgboost or ranger algorithms. The glmnet algorithm accepts a sparse matrix, a matrix or a data frame and returns a data frame with the non-zero coefficients. The xgboost algorithm accepts a sparse matrix, a matrix or a data frame and returns the feature importance in the form of a data frame; the features can additionally be sorted by "Gain", "Cover" or "Frequency". The ranger algorithm accepts a matrix or a data frame and returns the important features using either the 'impurity' or the 'permutation' importance method.
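
Below is a minimal sketch (not taken from the package examples) of the sparse-matrix input path mentioned above, using the xgboost method with the importance sorted by "Gain"; the use of the Matrix package and the data preparation steps are assumptions made for illustration:

# minimal sketch: a sparse design matrix passed to the xgboost method,
# with the resulting feature importance sorted by "Gain"
data(iris)
X_sparse = Matrix::Matrix(data.matrix(iris[, -5]), sparse = TRUE)
y = as.numeric(iris[, 5]) - 1                                  # class labels 0, 1, 2

params_xgboost = list(params = list("objective" = "multi:softprob", "num_class" = 3,
                                    "max_depth" = 4, "nthread" = 2),
                      nrounds = 30, verbose = 0)

res = feature_selection(X_sparse, y, method = 'xgboost',
                        params_xgboost = params_xgboost,
                        CV_folds = 3, xgb_sort = 'Gain')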

Author

Lampros Mouselimis

Examples

if (FALSE) {

#...........
# regression
#...........

data(iris)

X = iris[, -5]
y = X[, 1]
X = X[, -1]

params_glmnet = list(alpha = 1, family = 'gaussian', nfolds = 3, parallel = TRUE)

res = feature_selection(X, y, method = 'glmnet-lasso', params_glmnet = params_glmnet,
                        CV_folds = 5, cores_glmnet = 5)

#......................
# binary classification
#......................

y = iris[, 5]
y = as.character(y)
y[y == 'setosa'] = 'virginica'

X = iris[, -5]

params_ranger = list(write.forest = TRUE, probability = TRUE, num.threads = 6,
                     num.trees = 50, verbose = FALSE, classification = TRUE,
                     mtry = 2, min.node.size = 5, importance = 'impurity')

res = feature_selection(X, y, method = 'ranger', params_ranger = params_ranger,
                        CV_folds = 5)

#..........................
# multiclass classification
#..........................

y = iris[, 5]
multiclass_xgboost = ifelse(y == 'setosa', 0, ifelse(y == 'virginica', 1, 2))

X = iris[, -5]

params_xgboost = list(params = list("objective" = "multi:softprob", "bst:eta" = 0.35,
                                    "subsample" = 0.65, "num_class" = 3,
                                    "max_depth" = 6, "colsample_bytree" = 0.65,
                                    "nthread" = 2),
                      nrounds = 50, print.every.n = 50, verbose = 0, maximize = FALSE)

res = feature_selection(X, multiclass_xgboost, method = 'xgboost',
                        params_xgboost = params_xgboost, CV_folds = 5)
}