This function selects important features using one of three methods: glmnet, xgboost or ranger.
feature_selection(
  X,
  y,
  method = NULL,
  params_glmnet = NULL,
  params_xgboost = NULL,
  params_ranger = NULL,
  xgb_sort = NULL,
  CV_folds = 5,
  stratified_regr = FALSE,
  scale_coefs_glmnet = FALSE,
  cores_glmnet = NULL,
  verbose = FALSE
)
X | a sparse Matrix, a matrix or a data frame |
---|---|
y | a vector of length nrow(X) representing the response variable |
method | one of 'glmnet-lasso', 'xgboost', 'ranger' |
params_glmnet | a list of parameters for the glmnet model |
params_xgboost | a list of parameters for the xgboost model |
params_ranger | a list of parameters for the ranger model |
xgb_sort | sort the xgboost features by "Gain", "Cover" or "Frequency" (defaults to "Frequency") |
CV_folds | a number specifying the number of folds for cross-validation |
stratified_regr | a boolean specifying whether the folds should be stratified in case of regression |
scale_coefs_glmnet | if TRUE, the glmnet coefficients are scaled so that less important features receive smaller absolute values than more important ones, which makes ranking or plotting features by coefficient magnitude possible (see the sketch after this table) |
cores_glmnet | an integer specifying the number of cores to register for parallel glmnet |
verbose | if TRUE, prints progress information to the console |
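When scale_coefs_glmnet is TRUE, the returned glmnet coefficients can be ranked by absolute value. A minimal sketch of such a ranking follows; the assumption that the second column of the returned data frame holds the coefficients is mine, not part of the documented interface, so inspect the actual output with str(res) first:

# minimal sketch: rank glmnet-lasso features by scaled coefficient magnitude
data(iris)
X = iris[, -c(1, 5)]           # predictors
y = iris[, 1]                  # numeric response (Sepal.Length)

params_glmnet = list(alpha = 1, family = 'gaussian', nfolds = 3, parallel = FALSE)

res = feature_selection(X, y, method = 'glmnet-lasso',
                        params_glmnet = params_glmnet,
                        CV_folds = 5,
                        scale_coefs_glmnet = TRUE)

# assumption: the second column of the returned data frame holds the
# coefficients; check str(res) before relying on this
res_ranked = res[order(abs(res[[2]]), decreasing = TRUE), ]
head(res_ranked)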
a data frame with the most important features
This function returns the important features using one of the glmnet, xgboost or ranger algorithms. The glmnet method accepts a sparse matrix, a matrix or a data frame and returns a data frame with the non-zero coefficients. The xgboost method accepts a sparse matrix, a matrix or a data frame and returns the feature importances as a data frame; the features can additionally be sorted by one of the "Gain", "Cover" or "Frequency" measures. The ranger method accepts a matrix or a data frame and returns the important features using either the 'impurity' or the 'permutation' importance mode.
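For instance, switching the xgboost ordering from the default "Frequency" to "Gain" only requires the xgb_sort argument. A minimal sketch with illustrative parameter values (the objective and depth below are assumptions, not recommendations):

data(iris)
X = iris[, -5]
y = ifelse(iris[, 5] == 'setosa', 0, 1)

params_xgboost = list(params = list("objective" = "binary:logistic",
                                    "max_depth" = 4,
                                    "nthread" = 2),
                      nrounds = 30,
                      verbose = 0)

# sort the resulting feature importances by "Gain" instead of "Frequency"
res_gain = feature_selection(X, y, method = 'xgboost',
                             params_xgboost = params_xgboost,
                             xgb_sort = 'Gain',
                             CV_folds = 5)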
Lampros Mouselimis
if (FALSE) {

#...........
# regression
#...........

data(iris)

X = iris[, -5]
y = X[, 1]
X = X[, -1]

params_glmnet = list(alpha = 1,
                     family = 'gaussian',
                     nfolds = 3,
                     parallel = TRUE)

res = feature_selection(X,
                        y,
                        method = 'glmnet-lasso',
                        params_glmnet = params_glmnet,
                        CV_folds = 5,
                        cores_glmnet = 5)

#......................
# binary classification
#......................

y = iris[, 5]
y = as.character(y)
y[y == 'setosa'] = 'virginica'

X = iris[, -5]

params_ranger = list(write.forest = TRUE,
                     probability = TRUE,
                     num.threads = 6,
                     num.trees = 50,
                     verbose = FALSE,
                     classification = TRUE,
                     mtry = 2,
                     min.node.size = 5,
                     importance = 'impurity')

res = feature_selection(X,
                        y,
                        method = 'ranger',
                        params_ranger = params_ranger,
                        CV_folds = 5)

#..........................
# multiclass classification
#..........................

y = iris[, 5]
multiclass_xgboost = ifelse(y == 'setosa', 0, ifelse(y == 'virginica', 1, 2))

X = iris[, -5]

params_xgboost = list(params = list("objective" = "multi:softprob",
                                    "bst:eta" = 0.35,
                                    "subsample" = 0.65,
                                    "num_class" = 3,
                                    "max_depth" = 6,
                                    "colsample_bytree" = 0.65,
                                    "nthread" = 2),
                      nrounds = 50,
                      print.every.n = 50,
                      verbose = 0,
                      maximize = FALSE)

res = feature_selection(X,
                        multiclass_xgboost,
                        method = 'xgboost',
                        params_xgboost = params_xgboost,
                        CV_folds = 5)
}
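Similarly, the ranger importance mode is controlled through params_ranger. A minimal sketch that reuses the binary-classification objects from the examples above and switches from 'impurity' to 'permutation' importance (modifyList is base R):

# reuse X, y and params_ranger from the binary classification example above
params_ranger_perm = modifyList(params_ranger, list(importance = 'permutation'))

res_perm = feature_selection(X, y, method = 'ranger',
                             params_ranger = params_ranger_perm,
                             CV_folds = 5)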