This function selects important features using one of three methods: glmnet, xgboost or ranger.
feature_selection(
  X,
  y,
  method = NULL,
  params_glmnet = NULL,
  params_xgboost = NULL,
  params_ranger = NULL,
  xgb_sort = NULL,
  CV_folds = 5,
  stratified_regr = FALSE,
  scale_coefs_glmnet = FALSE,
  cores_glmnet = NULL,
  verbose = FALSE
)
X | a sparse Matrix, a matrix or a data frame |
---|---|
y | a vector of length nrow(X) representing the response variable |
method | one of 'glmnet-lasso', 'xgboost', 'ranger' |
params_glmnet | a list of parameters for the glmnet model |
params_xgboost | a list of parameters for the xgboost model |
params_ranger | a list of parameters for the ranger model |
xgb_sort | sort the xgboost features by "Gain", "Cover" or "Frequency" (defaults to "Frequency") |
CV_folds | a number specifying the number of folds for cross-validation |
stratified_regr | a boolean specifying whether the folds should be stratified in case of regression |
scale_coefs_glmnet | if TRUE, the glmnet coefficients are scaled so that less important features receive smaller absolute values than more important ones, which makes ranking or plotting features by coefficient magnitude possible (see the sketch after this table) |
cores_glmnet | an integer specifying the number of cores to register for parallel glmnet |
verbose | if TRUE, prints progress information to the console |
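When scale_coefs_glmnet is TRUE, the returned glmnet coefficients can be ranked by absolute value. A minimal sketch of such a ranking follows; the assumption that the second column of the returned data frame holds the coefficients is mine, not part of the documented interface, so inspect the actual output with str(res) first:

# minimal sketch: rank glmnet-lasso features by scaled coefficient magnitude
data(iris)
X = iris[, -c(1, 5)]           # predictors
y = iris[, 1]                  # numeric response (Sepal.Length)

params_glmnet = list(alpha = 1, family = 'gaussian', nfolds = 3, parallel = FALSE)

res = feature_selection(X, y, method = 'glmnet-lasso',
                        params_glmnet = params_glmnet,
                        CV_folds = 5,
                        scale_coefs_glmnet = TRUE)

# assumption: the second column of the returned data frame holds the
# coefficients; check str(res) before relying on this
res_ranked = res[order(abs(res[[2]]), decreasing = TRUE), ]
head(res_ranked)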
a data frame with the most important features
This function returns the important features using one of the glmnet, xgboost or ranger algorithms. The glmnet method accepts a sparse matrix, a matrix or a data frame and returns a data frame with the non-zero coefficients. The xgboost method accepts a sparse matrix, a matrix or a data frame and returns the feature importances as a data frame; the features can additionally be sorted by one of the "Gain", "Cover" or "Frequency" measures. The ranger method accepts a matrix or a data frame and returns the important features using either the 'impurity' or the 'permutation' importance mode.
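For instance, switching the xgboost ordering from the default "Frequency" to "Gain" only requires the xgb_sort argument. A minimal sketch with illustrative parameter values (the objective and depth below are assumptions, not recommendations):

data(iris)
X = iris[, -5]
y = ifelse(iris[, 5] == 'setosa', 0, 1)

params_xgboost = list(params = list("objective" = "binary:logistic",
                                    "max_depth" = 4,
                                    "nthread" = 2),
                      nrounds = 30,
                      verbose = 0)

# sort the resulting feature importances by "Gain" instead of "Frequency"
res_gain = feature_selection(X, y, method = 'xgboost',
                             params_xgboost = params_xgboost,
                             xgb_sort = 'Gain',
                             CV_folds = 5)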
Lampros Mouselimis
if (FALSE) {

#...........
# regression
#...........

data(iris)

X = iris[, -5]
y = X[, 1]
X = X[, -1]

params_glmnet = list(alpha = 1,
                     family = 'gaussian',
                     nfolds = 3,
                     parallel = TRUE)

res = feature_selection(X,
                        y,
                        method = 'glmnet-lasso',
                        params_glmnet = params_glmnet,
                        CV_folds = 5,
                        cores_glmnet = 5)

#......................
# binary classification
#......................

y = iris[, 5]
y = as.character(y)
y[y == 'setosa'] = 'virginica'

X = iris[, -5]

params_ranger = list(write.forest = TRUE,
                     probability = TRUE,
                     num.threads = 6,
                     num.trees = 50,
                     verbose = FALSE,
                     classification = TRUE,
                     mtry = 2,
                     min.node.size = 5,
                     importance = 'impurity')

res = feature_selection(X,
                        y,
                        method = 'ranger',
                        params_ranger = params_ranger,
                        CV_folds = 5)

#..........................
# multiclass classification
#..........................

y = iris[, 5]
multiclass_xgboost = ifelse(y == 'setosa', 0, ifelse(y == 'virginica', 1, 2))

X = iris[, -5]

params_xgboost = list(params = list("objective" = "multi:softprob",
                                    "bst:eta" = 0.35,
                                    "subsample" = 0.65,
                                    "num_class" = 3,
                                    "max_depth" = 6,
                                    "colsample_bytree" = 0.65,
                                    "nthread" = 2),
                      nrounds = 50,
                      print.every.n = 50,
                      verbose = 0,
                      maximize = FALSE)

res = feature_selection(X,
                        multiclass_xgboost,
                        method = 'xgboost',
                        params_xgboost = params_xgboost,
                        CV_folds = 5)
}
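Similarly, the ranger importance mode is controlled through params_ranger. A minimal sketch that reuses the binary-classification objects from the examples above and switches from 'impurity' to 'permutation' importance (modifyList is base R):

# reuse X, y and params_ranger from the binary classification example above
params_ranger_perm = modifyList(params_ranger, list(importance = 'permutation'))

res_perm = feature_selection(X, y, method = 'ranger',
                             params_ranger = params_ranger_perm,
                             CV_folds = 5)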