mlampros Organizing and Sharing thoughts, Receiving constructive feedback

San Francisco Crime Classification competition

In this blog post, I’ll explain my approach for the San Francisco Crime Classification competition, in which I participated for the past two months. This competition was hosted by kaggle, a free online platform for predictive modelling and analytics. I ended up in the first 60 places out of 2335 participants and so far is my best personal result. This competition belongs to the knowledge competitions, meaning that the submissions of the participants are evaluated on the whole test data, so there wasn’t any danger of overfitting the leaderboard, as after every submission the true (end) leaderboard score was calculated (no secrets). Furthermore, there weren’t any ranking points, so no particular gain except for learning new methods on how to tackle machine learning problems.

the data set

The competition started one year ago, so there were some hints in the forum on how to exclude outliers or on how to approach this particular problem.

The dataset contained incidents derived from SFPD Crime Incident Reporting system and was collected from 1/1/2003 to 5/13/2015. The training set and test set rotated every week, meaning that week 1,3,5,7… belonged to test set whereas week 2,4,6,8 belonged to the training set. The following table shows the Category, which was the target of the competition,


train <- read.csv("train.csv", stringsAsFactors = F)

sort(table(train$Category), decreasing = T)


              LARCENY/THEFT              OTHER OFFENSES                NON-CRIMINAL                     ASSAULT 
                     174900                      126182                       92304                       76876 
              DRUG/NARCOTIC               VEHICLE THEFT                   VANDALISM                    WARRANTS 
                      53971                       53781                       44725                       42214 
                   BURGLARY              SUSPICIOUS OCC              MISSING PERSON                     ROBBERY 
                      36755                       31414                       25989                       23000 
                      FRAUD      FORGERY/COUNTERFEITING             SECONDARY CODES                 WEAPON LAWS 
                      16679                       10609                        9985                        8555 
               PROSTITUTION                    TRESPASS             STOLEN PROPERTY       SEX OFFENSES FORCIBLE 
                       7484                        7326                        4540                        4388 
         DISORDERLY CONDUCT                 DRUNKENNESS           RECOVERED VEHICLE                  KIDNAPPING 
                       4320                        4280                        3138                        2341 
DRIVING UNDER THE INFLUENCE                     RUNAWAY                 LIQUOR LAWS                       ARSON 
                       2268                        1946                        1903                        1513 
                  LOITERING                EMBEZZLEMENT                     SUICIDE             FAMILY OFFENSES 
                       1225                        1166                         508                         491 
                 BAD CHECKS                     BRIBERY                   EXTORTION   SEX OFFENSES NON FORCIBLE 
                        406                         289                         256                         148 
                   GAMBLING     PORNOGRAPHY/OBSCENE MAT                        TREA 
                        146                          22                           6 


The data can be downloaded directly from kaggle.


head(train)

               Dates       Category                       Descript DayOfWeek PdDistrict     Resolution 
1 2015-05-13 23:53:00       WARRANTS                 WARRANT ARREST Wednesday   NORTHERN ARREST, BOOKED
2 2015-05-13 23:53:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   NORTHERN ARREST, BOOKED
3 2015-05-13 23:33:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   NORTHERN ARREST, BOOKED 
4 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday   NORTHERN           NONE
5 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday       PARK           NONE
6 2015-05-13 23:30:00  LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO Wednesday  INGLESIDE           NONE

                  Address         X        Y
        OAK ST / LAGUNA ST -122.4259 37.77460
        OAK ST / LAGUNA ST -122.4259 37.77460
 VANNESS AV / GREENWICH ST -122.4244 37.80041
  1500 Block of LOMBARD ST -122.4270 37.80087
 100 Block of BRODERICK ST -122.4387 37.77154
       0 Block of TEDDY AV -122.4033 37.71343
   

test <- read.csv("test.csv", stringsAsFactors = F)
head(test)

  Id               Dates DayOfWeek PdDistrict                  Address         X        Y
1  0 2015-05-10 23:59:00    Sunday    BAYVIEW  2000 Block of THOMAS AV -122.3996 37.73505
2  1 2015-05-10 23:51:00    Sunday    BAYVIEW       3RD ST / REVERE AV -122.3915 37.73243
3  2 2015-05-10 23:50:00    Sunday   NORTHERN   2000 Block of GOUGH ST -122.4260 37.79221
4  3 2015-05-10 23:45:00    Sunday  INGLESIDE 4700 Block of MISSION ST -122.4374 37.72141
5  4 2015-05-10 23:45:00    Sunday  INGLESIDE 4700 Block of MISSION ST -122.4374 37.72141
6  5 2015-05-10 23:40:00    Sunday    TARAVAL    BROAD ST / CAPITOL AV -122.4590 37.71317


I’ve red the data with stringsAsFactors = F, because I wanted the categorical columns to be in character and not in factor form, as I had to do some string preprocessing,


train = train[, -c(3,6)]



Furthermore, I removed the 3rd and 6th columns for two reasons: firstly they do appear only in the train data and secondly, I couldn’t find a way to take advantage of the Descript and Resolution.


train$Id = sort(seq(-nrow(train), -1, 1), decreasing = T)

test$Category = rep('none', nrow(test))

train = train[, c(8, 1:7)]

test = test[, c(1:2,8,3:7)]

train = rbind(train, test)


In order to horizontally join the train with the test data ( so that string preprocessing is feasible), I had to add an Id column to the train and a Category column to the test data. Then I shifted the order of columns so that the column names of the train data match the column names of the test data.


addr = train$Address

library(parallel)

rem_sl = unlist(mclapply(addr, function(x) stringr::str_replace(x, "/", ""), mc.cores = 4))

rem_sl1 = unlist(mclapply(rem_sl, function(x) stringr::str_replace_all(x, pattern=" ", repl=""), mc.cores = 4))

rem_sl2 = as.vector(sapply(rem_sl1, tolower))

train$Address = rem_sl2


To continue, I did some preprocessing of the Address column, as it appeared that some of these were in lower case, whereas others in upper case. First, I replaced c(‘/’, “ “) with an empty string “” and then I converted all addresses to lower case.


date = train$Dates           


library(lubridate)

date1 = ymd_hms(date)

Year = year(date1)

Month = month(date1)

YDay = yday(date1)

WDay = wday(date1)

char_wdays = weekdays(date1)

Day = day(date1)

Hour = hour(date1)

Minute = minute(date1)


The Dates column was from a classification point of view important too, so I used the lubridate package to extract the year, month, yearsday, weekday (in numeric form), weekday (in character form),day, hour and minutes.


remov = data.frame(date_tmp = date1, order_dat = 1:length(date1))

remov1 = remov[order(remov$date_tmp, decreasing = F), ]

remov2 = cbind(remov1, order_out = 1:nrow(remov1))

remov3 = remov2[order(remov2$order_dat, decreasing = F), ]

ORD_rows = remov3$order_out


The training set and test set rotated every week, as I mentioned in the beginning, so I thought that a feature that tracks the order of the weeks could add some predictive power to the model,


library(zoo)

yq <- as.yearqtr(as.yearmon(as.Date(train$Dates), "%m/%d/%Y") + 1/12)

Season <- factor(format(yq, "%q"), levels = 1:4, labels = c("winter", "spring", "summer", "fall"))

Season = as.numeric(Season)

newy = which(Month == 1 & Day == 1)

newy1 = which(Month == 12 & Day == 31)

newal = rep(0, length(Month))

newal[newy] = 1

newal[newy1] = 1

train1 = data.frame(year = Year, month = Month, yday = YDay, weekday = WDay, day = Day, hour = Hour, minutes = Minute, season = Season, newy = newal, ORD_date = ORD_rows)

train1 = cbind(train, train1)


Furthermore, I utilized the zoo library to mark some periods of the year (seasons, new-year-event),


library(dplyr)

DISTRICTS = lapply(unique(train1$PdDistrict), function(x) filter(train1, PdDistrict == x))

median_outliers = function(sublist) {
  
  if (max(sublist$X) == -120.5 || max(sublist$Y) == 90.00) {
    
    sublist$X[which(sublist$X == -120.5)] = median(sublist$X)
    sublist$Y[which(sublist$Y == 90.00)] = median(sublist$Y)
  }
  
  sublist
}

distr = lapply(DISTRICTS, function(x) median_outliers(x))

distr1 = do.call(rbind, distr)


There were some potential outliers in the latitude and longitude data (X,Y columns), which should be replaced with the corresponding median of each district’s X and Y. Here, I used the filter function of the dplyr package as it was faster than the subset function of the base R.


address_frequency = function(sublist) {
  
  tmp_df = data.frame(table(sublist$Address))
  
  tmp_df = tmp_df[order(tmp_df$Freq, decreasing = T), ]
  
  tmp_df[1, ]$Var1
}


gcd.hf <- function(long1, lat1, long2, lat2) {                 # http://www.r-bloggers.com/great-circle-distance-calculations-in-r/                                                      
  
  R <- 6371                                               # Earth mean radius [km]
  
  delta.long <- (long2 - long1)
  
  delta.lat <- (lat2 - lat1)
  
  a <- sin(delta.lat/2) ^ 2 + cos(lat1) * cos(lat2) * sin(delta.long/2) ^ 2
  
  c <- 2 * asin(min(1,sqrt(a)))
  
  d = R * c
  
  return(d)                                                # Distance in km
}


get_reference_address = function(initial_data, split_column) {           # function to calculate km-distances
  
  s_col = lapply(unique(initial_data[, split_column]), function(x) initial_data[initial_data[, split_column] == x, ])
  
  reference_address = lapply(s_col, function(x) as.character(address_frequency(x)))
  
  reference_lon_lat = lapply(1:length(s_col), function(x) filter(s_col[[x]], Address == reference_address[[x]])[1, c('X','Y')])
  
  Distance = lapply(1:length(s_col), function(f) sapply(1:nrow(s_col[[f]]), function(x) gcd.hf(s_col[[f]][x, 7], s_col[[f]][x, 8], 
                    
                    reference_lon_lat[[f]]$X, reference_lon_lat[[f]]$Y)))
  
  tmp_id = do.call(rbind, s_col)$Id
  
  tmp_df = data.frame(id = tmp_id, unlist(Distance))
  
  colnames(tmp_df) = c('Id', paste('dist_', stringr::str_trim(split_column, side = 'both' )))
  
  return(tmp_df)
}


lst_out = list()

for (i in c('PdDistrict', 'weekday', 'day', 'hour', 'season')) {
  
  cat(i, '\n')
  
  lst_out[[i]] = get_reference_address(distr1, i)
}


merg = merge(lst_out[[1]], lst_out[[2]], by.x = 'Id', by.y = 'Id')

merg = merge(merg, lst_out[[3]], by.x = 'Id', by.y = 'Id')

merg = merge(merg, lst_out[[4]], by.x = 'Id', by.y = 'Id')

merg = merge(merg, lst_out[[5]], by.x = 'Id', by.y = 'Id')



The previous long code chunk takes advantage of the latitude and longitude data to calculate distance features. The idea behind the script was to spot, first, for each district (‘PdDistrict’) the locations with high crime frequency. Then, I extended the features by doing the same for ‘weekday’, ‘day’, ‘hour’ and ‘season’.


ndf = merge(distr1, merg, by.x = 'Id', by.y = 'Id')

ndf$`dist_ weekday` = log(ndf$`dist_ weekday` + 1)

ndf$`dist_ day` = sqrt(ndf$`dist_ day` + 1)

ndf$`dist_ hour` = 2 * sqrt(ndf$`dist_ hour` + 3/8)

ndf = ndf[, -c(2, 4)]


Then, I merged the distance-features (merg) with the initial data (distr1) and I took the log of the ‘dist_ weekday’, the sqrt of the ‘dist_ day’ and the Anscombe transform of the ‘dist_ hour’, as I observed some correlation of those transforms with the response variable. Additionally, I removed the 2nd (Dates) and 4th (DayOfWeek) columns, because Dates have been already preprocessed and the DayOfWeek appeared in numeric form already (weekday).


table(ndf$PdDistrict)

   BAYVIEW    CENTRAL  INGLESIDE    MISSION   NORTHERN       PARK   RICHMOND   SOUTHERN    TARAVAL TENDERLOIN 
    179022     171590     158929     240357     212313      99512      90181     314638     132213     163556 

pdD = as.factor(ndf$PdDistrict)

mdM = model.matrix(~.-1, data.frame(pdD))

ndf$PdDistrict = NULL

ndf = cbind(ndf, mdM)


I converted the PdDistrict predictor to dummy variables, as it didn’t have many Levels (like the Address column) and, in binarized form, it could add more predictive power to the model.


ndf$Address = as.numeric(as.factor(ndf$Address))

ntrain = filter(ndf, Id < 0)

ntrain$Id = NULL

response = ntrain$Category

y = c(0:38)[ match(response, sort(unique(response))) ]

ntrain$Category = NULL

ntest = filter(ndf, Id >= 0)

ID_TEST = as.integer(ntest$Id)

ntest$Id = NULL

ntest$Category = NULL


Finally, I split the end-data (ndf2) to train and test, I converted the response variable (y) to numeric (0:38) (so that it is compatible with the xgboost algorithm) and I removed redundant columns from both train (the Id) and test (the Category).


library(Matrix)

ntrain = Matrix(as.matrix(ntrain), sparse = T)

ntest = Matrix(as.matrix(ntest), sparse = T)



Some of the columns are highly sparse, thus converting the data to a sparse matrix could speed up the training. For this purpose, I used the Matrix library.

xgboost algorithm


The purpose of the competition was to decrease the Multi-class log loss thus, I used a corresponding function (MultiLogLoss) and additionally I built a validation function, which I used internally to evaluate the folds in xgboost (VALID_FUNC).



VALID_FUNC = function(EVAL_METRIC, arg_actual, arg_predicted, inverse_order = FALSE) {
  
  if (inverse_order == TRUE) {
    
    args_list = list(arg_predicted, arg_actual)
  }
  
  else {
    
    args_list = list(arg_actual, arg_predicted)
  }
  
  result = do.call(EVAL_METRIC, args_list)
  
  result
}




MultiLogLoss = function (y_true, y_pred) {

  if (is.factor(y_true)) {
  
    y_true_mat <- matrix(0, nrow = length(y_true), ncol = length(levels(y_true)))
    
    sample_levels <- as.integer(y_true)
    
    for (i in 1:length(y_true)) y_true_mat[i, sample_levels[i]] <- 1
    
    y_true <- y_true_mat
  }
  
  eps <- 1e-15
  
  N <- dim(y_pred)[1]
  
  y_pred <- pmax(pmin(y_pred, 1 - eps), eps)
  
  MultiLogLoss <- (-1/N) * sum(y_true * log(y_pred))
  
  return(MultiLogLoss)
}


I used the xgboost algorithm because it works pretty well with big data and gives good results as well. I performed a 4-fold cross-validation and at each fold, I also predicted the test data. The following function was used to evaluate each fold and to get the predictions from the unknown test data,


xgboost_cv = function(RESP, data, TEST, repeats, Folds, idx_train = NULL, param, num_rounds, print_every_n  = 10, 

                      early_stop = 10, maximize = FALSE, verbose = 1, EVAL_METRIC, set_seed = 2) {
  
  start = Sys.time()
  
  library(caret)
  library(xgboost)
  library(Metrics)
  
  out_ALL = list()
  
  for (j in 1:repeats) {
    
    cat('REPEAT', j, '\n')
    
    TEST_lst = PARAMS = PREDS_tr = PREDS_te = list()
    
    if (is.numeric(Folds)) {
      
      if (is.null(set_seed)) {
        
        sample_seed = sample(seq(1, 1000000, 1), 1)}
      
      else {
        
        sample_seed = set_seed
      }
      
      set.seed(sample_seed)
      folds = createFolds(RESP, k = Folds, list = TRUE)}
    
    else {
      
      if (is.null(idx_train)) stop(simpleError('give index of train data in form of a vector'))
      
      out_idx = 1:dim(data)[1]
      folds = lapply(1:length(Folds), function(x) out_idx[which(idx_train %in% Folds[[x]])])
    }
    
    tr_er <- tes_er <- rep(NA, length(folds))
    
    for (i in 1:length(folds)) {
      
      cat('fold', i, '\n')
      
      dtrain <- xgb.DMatrix(data = data[unlist(folds[-i]), ], label = RESP[unlist(folds[-i])])
      
      dtest <- xgb.DMatrix(data = data[unlist(folds[i]), ], label = RESP[unlist(folds[i])])
      
      watchlist <- list(train = dtrain, test = dtest)
      
      fit = xgb.train(param, dtrain, nround = num_rounds, print.every.n  = print_every_n, watchlist = watchlist, 
      
            early.stop.round = early_stop, maximize = maximize, verbose = verbose)
      
      PARAMS[[i]] = list(param = param, bst_round = fit$bestInd)
      
      pred_tr = predict(fit, data[unlist(folds[-i]), ], ntreelimit = fit$bestInd)
      pred_tr = matrix(pred_tr, nrow =  dim(data[unlist(folds[-i]), ])[1], ncol = length(unique(y)), byrow = TRUE)

      pred_te = predict(fit, data[unlist(folds[i]), ], ntreelimit = fit$bestInd)
      pred_te = matrix(pred_te, nrow =  dim(data[unlist(folds[i]), ])[1], ncol = length(unique(y)), byrow = TRUE)

      tr_er[i] = VALID_FUNC(EVAL_METRIC, as.factor(RESP[unlist(folds[-i])]), pred_tr)
      tes_er[i] = VALID_FUNC(EVAL_METRIC, as.factor(RESP[unlist(folds[i])]), pred_te)
      
      tmp_TEST = matrix(predict(fit, TEST, ntreelimit = fit$bestInd), nrow =  dim(TEST)[1], ncol = length(unique(y)), byrow = TRUE)
      
      TEST_lst[[paste0('preds_', i)]] = tmp_TEST
      
      cat('---------------------------------------------------------------------------', '\n')
      
      save(tmp_TEST, file = paste('sfcc_', paste(sample(1:1000000000, 1), '_REPEAT_save.RDATA', sep = ""), sep = ""))
      
      gc()
    }
    
    out_ALL[[j]] = list(TEST_lst = TEST_lst, PARAMS = PARAMS, sample_seed = sample_seed, tr_er = tr_er,PREDS_tr = PREDS_tr, 
    
                        PREDS_te = PREDS_te, tes_er = tes_er)

    cat('================================================================================================================', '\n')
    
    gc()
  }
  
  end = Sys.time()
  
  return(list(res = out_ALL, time = end - start))
}



I experimented with different parameter settings, but the following one is a good trade-off between running time and performance, as it runs in 2.89 hours and gives a leaderboard score of 2.238 Multi-class log-loss. To improve the leaderboard score in this competition I averaged 6 models with different parameter settings. I observed that a learning rate (eta) of 0.145 and a number of rounds (num_rounds) 320 gave the best results,


params = list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 39, "booster" = "gbtree", "bst:eta" = 0.245, 

              "subsample" = 0.7, "max_depth" = 7, "colsample_bytree" = 0.7, "nthread" = 6, "scale_pos_weight" = 0.0, 
              
              "min_child_weight" = 0.0, "max_delta_step" = 1.0) 


fit = xgboost_cv(y, ntrain, ntest, repeats = 1, Folds = 4, idx_train = NULL, params, num_rounds = 145, print_every_n = 5, 
  
                early_stop = 10, maximize = FALSE, verbose = 1, MultiLogLoss)


Before, submitting the csv-predictions, I had to calculate the train and test error for each fold, then to average the predictions of the unknown test data and to add the column names of the sample submission (the column names had to be in the correct form, otherwise the submission wasn’t accepted, thus check.names = F when reading the data was necessary),


tr_er = unlist(lapply(fit$res, function(x) x$tr_er))

tes_er = unlist(lapply(fit$res, function(x) x$tes_er))

cat('log loss of train is :', mean(tr_er), '\n')

# log loss of train is : 1.952143

cat('log loss of test is :',  mean(tes_er), '\n')

# log loss of test is : 2.232479

lap = unlist(lapply(fit$res, function(x) x$TEST_lst), recursive = FALSE)

avg_dfs = (lap[[1]] + lap[[2]] + lap[[3]] + lap[[4]])/4

subms = data.frame(ID_TEST, avg_dfs)

sampleSubmission <- read.csv("sampleSubmission.csv", check.names = F)

colnames(subms) = colnames(sampleSubmission)

subms = subms[order(subms$Id, decreasing = F), ]

write.csv(subms, "xgb_post_submission_train_error_1_95_test_error_2_2324.csv", row.names=FALSE, quote = FALSE)


Finally, I’ll utilize the FeatureSelection package and especially xgboost and ranger to plot the important variables,


ntr = as.matrix(ntrain)

colnames(ntr) = make.names(colnames(ntr))


library(FeatureSelection)


params_xgboost = list(params = list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 39, "booster" = "gbtree", 

                                    "bst:eta" = 0.5, "subsample" = 0.65, "max_depth" = 5, "colsample_bytree" = 0.65, "nthread" = 6, 
                                    
                                    "scale_pos_weight" = 0.0, "min_child_weight" = 0.0, "max_delta_step" = 1.0), 
                                    
                      nrounds = 50, print.every.n = 5, verbose = 1, maximize = F)
                      

params_ranger = list(write.forest = TRUE, probability = TRUE, num.threads = 6, num.trees = 50, verbose = FALSE, classification = TRUE, 
                     
                     mtry = 4, min.node.size = 5, importance = 'impurity')
                     

params_features = list(keep_number_feat = NULL, union = F)


feat = wrapper_feat_select(ntr, y, params_glmnet = NULL, params_xgboost = params_xgboost, params_ranger = params_ranger, xgb_sort = NULL,
                           
                           CV_folds = 4, stratified_regr = FALSE, cores_glmnet = 2, params_features = params_features)


params_barplot = list(keep_features = 28, horiz = TRUE, cex.names = 0.8)

barplot_feat_select(feat, params_barplot, xgb_sort = 'Cover')

Alt text


The complete script of this blog post can be found as a single file in my Github account.

comments powered by Disqus