# San Francisco Crime Classification competition

09 Jun 2016

In this blog post, I’ll explain my approach to the San Francisco Crime Classification competition, in which I participated for the past two months. The competition was hosted by Kaggle, a free online platform for predictive modelling and analytics. I ended up in the first 60 places out of 2,335 participants, which is so far my best personal result. This was a knowledge competition, meaning that the participants’ submissions were evaluated on the whole test data, so there wasn’t any danger of overfitting the leaderboard: after every submission the true (final) leaderboard score was calculated (no secrets). Furthermore, there weren’t any ranking points, so there was no particular gain except for learning new ways to tackle machine learning problems.

### the data set

The competition had started one year earlier, so there were already some hints in the forum on how to exclude outliers and how to approach this particular problem.

The dataset contained incidents derived from the SFPD Crime Incident Reporting system and was collected from 1/1/2003 to 5/13/2015. The training and test sets rotated every week, meaning that weeks 1, 3, 5, 7, … belonged to the test set, whereas weeks 2, 4, 6, 8, … belonged to the training set. The following table shows the *Category*, which was the target of the competition,

```
train <- read.csv("train.csv", stringsAsFactors = F)
sort(table(train$Category), decreasing = T)
 LARCENY/THEFT  OTHER OFFENSES    NON-CRIMINAL         ASSAULT
        174900          126182           92304           76876
 DRUG/NARCOTIC   VEHICLE THEFT       VANDALISM        WARRANTS
         53971           53781           44725           42214
      BURGLARY  SUSPICIOUS OCC  MISSING PERSON         ROBBERY
         36755           31414           25989           23000
                 FRAUD  FORGERY/COUNTERFEITING         SECONDARY CODES             WEAPON LAWS
                 16679                   10609                    9985                    8555
         PROSTITUTION               TRESPASS        STOLEN PROPERTY  SEX OFFENSES FORCIBLE
                 7484                   7326                   4540                   4388
DISORDERLY CONDUCT         DRUNKENNESS   RECOVERED VEHICLE          KIDNAPPING
              4320                4280                3138                2341
DRIVING UNDER THE INFLUENCE                     RUNAWAY                 LIQUOR LAWS                       ARSON
                       2268                        1946                        1903                        1513
      LOITERING     EMBEZZLEMENT          SUICIDE  FAMILY OFFENSES
           1225             1166              508              491
               BAD CHECKS                   BRIBERY                 EXTORTION  SEX OFFENSES NON FORCIBLE
                      406                       289                       256                        148
               GAMBLING  PORNOGRAPHY/OBSCENE MAT                     TREA
                    146                       22                        6
```

The data can be downloaded directly from Kaggle.

```
head(train)
                Dates       Category                       Descript DayOfWeek PdDistrict     Resolution
1 2015-05-13 23:53:00       WARRANTS                 WARRANT ARREST Wednesday   NORTHERN ARREST, BOOKED
2 2015-05-13 23:53:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   NORTHERN ARREST, BOOKED
3 2015-05-13 23:33:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST Wednesday   NORTHERN ARREST, BOOKED
4 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday   NORTHERN           NONE
5 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO Wednesday       PARK           NONE
6 2015-05-13 23:30:00  LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO Wednesday  INGLESIDE           NONE
                    Address         X        Y
1        OAK ST / LAGUNA ST -122.4259 37.77460
2        OAK ST / LAGUNA ST -122.4259 37.77460
3 VANNESS AV / GREENWICH ST -122.4244 37.80041
4  1500 Block of LOMBARD ST -122.4270 37.80087
5 100 Block of BRODERICK ST -122.4387 37.77154
6       0 Block of TEDDY AV -122.4033 37.71343
test <- read.csv("test.csv", stringsAsFactors = F)
head(test)
  Id               Dates DayOfWeek PdDistrict                  Address         X        Y
1  0 2015-05-10 23:59:00    Sunday    BAYVIEW  2000 Block of THOMAS AV -122.3996 37.73505
2  1 2015-05-10 23:51:00    Sunday    BAYVIEW       3RD ST / REVERE AV -122.3915 37.73243
3  2 2015-05-10 23:50:00    Sunday   NORTHERN   2000 Block of GOUGH ST -122.4260 37.79221
4  3 2015-05-10 23:45:00    Sunday  INGLESIDE 4700 Block of MISSION ST -122.4374 37.72141
5  4 2015-05-10 23:45:00    Sunday  INGLESIDE 4700 Block of MISSION ST -122.4374 37.72141
6  5 2015-05-10 23:40:00    Sunday    TARAVAL    BROAD ST / CAPITOL AV -122.4590 37.71317
```

I read the data with *stringsAsFactors = F* because I wanted the categorical columns to be in character rather than factor form, as I had to do some string preprocessing,

```
train = train[, -c(3,6)]
```

Furthermore, I removed the 3rd (*Descript*) and 6th (*Resolution*) columns for two reasons: first, they appear only in the train data, and second, I couldn’t find a way to take advantage of them.

```
train$Id = sort(seq(-nrow(train), -1, 1), decreasing = T)   # negative Ids mark the train rows
test$Category = rep('none', nrow(test))                     # placeholder response for the test rows
train = train[, c(8, 1:7)]                                  # reorder the columns so that both data sets match
test = test[, c(1:2,8,3:7)]
train = rbind(train, test)                                  # stack train and test
```

In order to stack the train and the test data on top of each other (so that the string preprocessing could be done in a single pass), I had to add an *Id* column to the train data and a *Category* column to the test data. Then I reordered the columns so that the column names of the train data match the column names of the test data.
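
A quick structural check (a sketch; *rbind* requires that the column names match),

```
identical(colnames(train), colnames(test))   # should be TRUE
```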

```
addr = train$Address
library(parallel)
# remove the slash, then all spaces, in parallel over the addresses
rem_sl = unlist(mclapply(addr, function(x) stringr::str_replace(x, "/", ""), mc.cores = 4))
rem_sl1 = unlist(mclapply(rem_sl, function(x) stringr::str_replace_all(x, pattern=" ", repl=""), mc.cores = 4))
rem_sl2 = as.vector(sapply(rem_sl1, tolower))   # convert everything to lower case
train$Address = rem_sl2
```

To continue, I did some preprocessing of the **Address** column, as some of the addresses appeared in lower case whereas others in upper case. First, I replaced the slashes ("/") and the spaces (" ") with an empty string ("") and then I converted all addresses to lower case.
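
For example, the cleaning turns one of the addresses shown earlier into a compact key (a minimal sketch),

```
x = "OAK ST / LAGUNA ST"
tolower(stringr::str_replace_all(stringr::str_replace(x, "/", ""), " ", ""))
# [1] "oakstlagunast"
```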

```
date = train$Dates
library(lubridate)
date1 = ymd_hms(date)
Year = year(date1)
Month = month(date1)
YDay = yday(date1)
WDay = wday(date1)
char_wdays = weekdays(date1)
Day = day(date1)
Hour = hour(date1)
Minute = minute(date1)
```

The **Dates** column was important from a classification point of view too, so I used the *lubridate* package to extract the *year*, *month*, *day of the year*, *weekday* (in numeric form), *weekday* (in character form), *day*, *hour* and *minute*.

```
# chronological rank of each incident: sort by date, number the rows, then restore the original order
remov = data.frame(date_tmp = date1, order_dat = 1:length(date1))
remov1 = remov[order(remov$date_tmp, decreasing = F), ]
remov2 = cbind(remov1, order_out = 1:nrow(remov1))
remov3 = remov2[order(remov2$order_dat, decreasing = F), ]
ORD_rows = remov3$order_out
```

The training and test sets rotated every week, as I mentioned in the beginning, so I thought that a feature tracking the chronological order of the incidents could add some predictive power to the model.
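
The same index can be computed in one line with *rank()* (a sketch; *ties.method = "first"* should reproduce the stable sort used above),

```
ORD_rows_alt = rank(date1, ties.method = "first")
all(ORD_rows_alt == ORD_rows)   # TRUE
```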

```
library(zoo)
# shift every date by one month, so that Dec, Jan, Feb end up in the same quarter (quarter 1 = winter)
yq <- as.yearqtr(as.yearmon(as.Date(train$Dates), "%m/%d/%Y") + 1/12)
Season <- factor(format(yq, "%q"), levels = 1:4, labels = c("winter", "spring", "summer", "fall"))
Season = as.numeric(Season)
# flag New Year's Day and New Year's Eve
newy = which(Month == 1 & Day == 1)
newy1 = which(Month == 12 & Day == 31)
newal = rep(0, length(Month))
newal[newy] = 1
newal[newy1] = 1
train1 = data.frame(year = Year, month = Month, yday = YDay, weekday = WDay, day = Day, hour = Hour, minutes = Minute, season = Season, newy = newal, ORD_date = ORD_rows)
train1 = cbind(train, train1)
```

Furthermore, I utilized the *zoo* library to mark some periods of the year (the seasons and the new-year days). The *as.yearmon(...) + 1/12* trick shifts every date by one month, so that December falls into the same quarter as January and February.
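
A quick check of the one-month shift (a sketch, with *zoo* already loaded),

```
format(as.yearqtr(as.yearmon(as.Date("2014-12-15")) + 1/12), "%q")
# [1] "1"    (December is mapped to quarter 1, i.e. "winter")
```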

```
library(dplyr)
# split the data by police district
DISTRICTS = lapply(unique(train1$PdDistrict), function(x) filter(train1, PdDistrict == x))

# replace the invalid coordinates (X = -120.5, Y = 90.0) with the district's median X and Y
median_outliers = function(sublist) {
  if (max(sublist$X) == -120.5 || max(sublist$Y) == 90.00) {
    sublist$X[which(sublist$X == -120.5)] = median(sublist$X)
    sublist$Y[which(sublist$Y == 90.00)] = median(sublist$Y)
  }
  sublist
}

distr = lapply(DISTRICTS, function(x) median_outliers(x))
distr1 = do.call(rbind, distr)
```

There were some outliers in the *longitude* and *latitude* data (the X and Y columns, with values of -120.5 and 90.0 respectively), which had to be replaced with the corresponding median of each district’s X and Y. Here, I used the *filter* function of the dplyr package, as it was faster than the *subset* function of base R.
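
A quick before-and-after check (a sketch),

```
sum(train1$Y == 90)   # incidents with an invalid latitude before the replacement
sum(distr1$Y == 90)   # should be 0 afterwards
```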

```
# most frequent address within a subset of the data
address_frequency = function(sublist) {
  tmp_df = data.frame(table(sublist$Address))
  tmp_df = tmp_df[order(tmp_df$Freq, decreasing = T), ]
  tmp_df[1, ]$Var1
}

# great-circle (haversine) distance, adapted from http://www.r-bloggers.com/great-circle-distance-calculations-in-r/
gcd.hf <- function(long1, lat1, long2, lat2) {
  # the haversine formula expects radians, so convert the degree coordinates first
  long1 = long1 * pi / 180; lat1 = lat1 * pi / 180
  long2 = long2 * pi / 180; lat2 = lat2 * pi / 180
  R <- 6371                                      # Earth mean radius [km]
  delta.long <- (long2 - long1)
  delta.lat <- (lat2 - lat1)
  a <- sin(delta.lat/2) ^ 2 + cos(lat1) * cos(lat2) * sin(delta.long/2) ^ 2
  c <- 2 * asin(min(1, sqrt(a)))
  d = R * c
  return(d)                                      # distance in km
}

get_reference_address = function(initial_data, split_column) {     # function to calculate km-distances
  # split the data by the levels of 'split_column', find each level's most frequent address
  # and compute the distance of every incident to that reference address
  s_col = lapply(unique(initial_data[, split_column]), function(x) initial_data[initial_data[, split_column] == x, ])
  reference_address = lapply(s_col, function(x) as.character(address_frequency(x)))
  reference_lon_lat = lapply(1:length(s_col), function(x) filter(s_col[[x]], Address == reference_address[[x]])[1, c('X','Y')])
  Distance = lapply(1:length(s_col), function(f) sapply(1:nrow(s_col[[f]]), function(x) gcd.hf(s_col[[f]][x, 7], s_col[[f]][x, 8],
                                                        reference_lon_lat[[f]]$X, reference_lon_lat[[f]]$Y)))   # columns 7, 8 are X, Y
  tmp_id = do.call(rbind, s_col)$Id
  tmp_df = data.frame(id = tmp_id, unlist(Distance))
  colnames(tmp_df) = c('Id', paste('dist_', stringr::str_trim(split_column, side = 'both')))
  return(tmp_df)
}

lst_out = list()
for (i in c('PdDistrict', 'weekday', 'day', 'hour', 'season')) {
  cat(i, '\n')
  lst_out[[i]] = get_reference_address(distr1, i)
}

merg = merge(lst_out[[1]], lst_out[[2]], by.x = 'Id', by.y = 'Id')
merg = merge(merg, lst_out[[3]], by.x = 'Id', by.y = 'Id')
merg = merge(merg, lst_out[[4]], by.x = 'Id', by.y = 'Id')
merg = merge(merg, lst_out[[5]], by.x = 'Id', by.y = 'Id')
```

The previous long code chunk takes advantage of the *longitude* and *latitude* data to calculate **distance** features. The idea behind the script was to spot, first, for each district (*PdDistrict*), the address with the highest crime frequency, and then to compute the distance of every incident to that reference location. I extended the features by doing the same for *weekday*, *day*, *hour* and *season*.
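
A small sanity check of the haversine function (a sketch; the two points are arbitrary downtown San Francisco coordinates, not taken from the dataset),

```
gcd.hf(-122.4194, 37.7749, -122.4064, 37.7858)
# [1] 1.666...   (roughly 1.7 km)
```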

```
ndf = merge(distr1, merg, by.x = 'Id', by.y = 'Id')
# variance-stabilising transforms of the distance features
ndf$`dist_ weekday` = log(ndf$`dist_ weekday` + 1)
ndf$`dist_ day` = sqrt(ndf$`dist_ day` + 1)
ndf$`dist_ hour` = 2 * sqrt(ndf$`dist_ hour` + 3/8)   # Anscombe transform
ndf = ndf[, -c(2, 4)]                                 # drop Dates and DayOfWeek
```

Then, I merged the distance features (merg) with the initial data (distr1) and took the *log* of *dist_ weekday*, the *square root* of *dist_ day* and the *Anscombe transform* (2 * sqrt(x + 3/8)) of *dist_ hour*, as I observed some correlation of those transformed features with the response variable. Additionally, I removed the 2nd (*Dates*) and 4th (*DayOfWeek*) columns, because the *Dates* had already been preprocessed and the *DayOfWeek* already appeared in numeric form (*weekday*).

```
table(ndf$PdDistrict)
   BAYVIEW    CENTRAL  INGLESIDE    MISSION   NORTHERN       PARK   RICHMOND   SOUTHERN    TARAVAL TENDERLOIN
    179022     171590     158929     240357     212313      99512      90181     314638     132213     163556
pdD = as.factor(ndf$PdDistrict)
mdM = model.matrix(~.-1, data.frame(pdD))   # one 0/1 column per district level
ndf$PdDistrict = NULL
ndf = cbind(ndf, mdM)
```

I converted the *PdDistrict* predictor to dummy variables, as it didn’t have many levels (unlike the *Address* column), and in binarized form it could add more predictive power to the model.
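
A quick look at the resulting dummy columns (a sketch; *model.matrix* names them by pasting the factor name and the level),

```
colnames(mdM)         # "pdDBAYVIEW" "pdDCENTRAL" ... "pdDTENDERLOIN"
table(rowSums(mdM))   # every row has exactly one 1
```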

```
ndf$Address = as.numeric(as.factor(ndf$Address))   # label-encode the preprocessed addresses
ntrain = filter(ndf, Id < 0)                       # negative Ids mark the train rows
ntrain$Id = NULL
response = ntrain$Category
y = c(0:38)[ match(response, sort(unique(response))) ]   # map the 39 categories to 0:38
ntrain$Category = NULL
ntest = filter(ndf, Id >= 0)
ID_TEST = as.integer(ntest$Id)
ntest$Id = NULL
ntest$Category = NULL
```

Finally, I split the end data (ndf) back into train and test, converted the response variable (y) to numeric (0:38), so that it is compatible with the xgboost algorithm, and removed the redundant columns from both train (the *Id*) and test (the *Category*).
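
A couple of consistency checks (a sketch),

```
stopifnot(min(y) == 0, max(y) == 38, length(unique(y)) == 39)
stopifnot(identical(colnames(ntrain), colnames(ntest)))
```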

```
library(Matrix)
ntrain = Matrix(as.matrix(ntrain), sparse = T)
ntest = Matrix(as.matrix(ntest), sparse = T)
```

Some of the columns are highly sparse (for instance the district dummy variables), thus converting the data to a sparse matrix could speed up the training. For this purpose, I used the *Matrix* library.
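
The memory saving can be verified with *object.size()* (a sketch; the exact numbers depend on the final feature set),

```
print(object.size(as.matrix(ntrain)), units = "Mb")   # dense representation
print(object.size(ntrain), units = "Mb")              # sparse representation
```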

### xgboost algorithm

The purpose of the competition was to minimize the *multi-class log loss*, thus I used a corresponding function (*MultiLogLoss*) and additionally I built a small wrapper (*VALID_FUNC*), which I used internally to evaluate the folds in xgboost,

```
# small wrapper that calls an evaluation metric with the arguments in the requested order
VALID_FUNC = function(EVAL_METRIC, arg_actual, arg_predicted, inverse_order = FALSE) {
  if (inverse_order == TRUE) {
    args_list = list(arg_predicted, arg_actual)
  }
  else {
    args_list = list(arg_actual, arg_predicted)
  }
  result = do.call(EVAL_METRIC, args_list)
  result
}

# multi-class logarithmic loss; y_true can be a factor or a 0/1 indicator matrix
MultiLogLoss = function (y_true, y_pred) {
  if (is.factor(y_true)) {
    y_true_mat <- matrix(0, nrow = length(y_true), ncol = length(levels(y_true)))
    sample_levels <- as.integer(y_true)
    for (i in 1:length(y_true)) y_true_mat[i, sample_levels[i]] <- 1
    y_true <- y_true_mat
  }
  eps <- 1e-15
  N <- dim(y_pred)[1]
  y_pred <- pmax(pmin(y_pred, 1 - eps), eps)   # clip the probabilities away from 0 and 1
  MultiLogLoss <- (-1/N) * sum(y_true * log(y_pred))
  return(MultiLogLoss)
}
```
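
A quick usage check of *MultiLogLoss* (a sketch): perfect predictions give a loss close to 0, whereas uniform guessing over K classes gives log(K), i.e. log(39) ≈ 3.66 for the 39 categories of this competition,

```
y_true = as.factor(c(1, 2, 3))
MultiLogLoss(y_true, diag(3))                           # ~0 (up to the eps clipping)
MultiLogLoss(y_true, matrix(1/3, nrow = 3, ncol = 3))   # log(3) ≈ 1.0986
```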

I used the xgboost algorithm because it works pretty well with big data and gives good results. I performed a 4-fold cross-validation and at each fold I also predicted the test data. The following function was used to evaluate each fold and to get the predictions for the unknown test data,

```
xgboost_cv = function(RESP, data, TEST, repeats, Folds, idx_train = NULL, param, num_rounds, print_every_n = 10,
                      early_stop = 10, maximize = FALSE, verbose = 1, EVAL_METRIC, set_seed = 2) {

  start = Sys.time()
  library(caret)
  library(xgboost)
  library(Metrics)
  out_ALL = list()

  for (j in 1:repeats) {
    cat('REPEAT', j, '\n')
    TEST_lst = PARAMS = PREDS_tr = PREDS_te = list()

    # either create random folds or use a user-supplied fold index
    if (is.numeric(Folds)) {
      if (is.null(set_seed)) {
        sample_seed = sample(seq(1, 1000000, 1), 1)
      }
      else {
        sample_seed = set_seed
      }
      set.seed(sample_seed)
      folds = createFolds(RESP, k = Folds, list = TRUE)
    }
    else {
      if (is.null(idx_train)) stop(simpleError('give index of train data in form of a vector'))
      out_idx = 1:dim(data)[1]
      folds = lapply(1:length(Folds), function(x) out_idx[which(idx_train %in% Folds[[x]])])
    }

    tr_er <- tes_er <- rep(NA, length(folds))

    for (i in 1:length(folds)) {
      cat('fold', i, '\n')
      dtrain <- xgb.DMatrix(data = data[unlist(folds[-i]), ], label = RESP[unlist(folds[-i])])
      dtest <- xgb.DMatrix(data = data[unlist(folds[i]), ], label = RESP[unlist(folds[i])])
      watchlist <- list(train = dtrain, test = dtest)

      fit = xgb.train(param, dtrain, nround = num_rounds, print.every.n = print_every_n, watchlist = watchlist,
                      early.stop.round = early_stop, maximize = maximize, verbose = verbose)
      PARAMS[[i]] = list(param = param, bst_round = fit$bestInd)

      # evaluate the fold on both the train part and the held-out part (note: uses the global 'y' for the number of classes)
      pred_tr = predict(fit, data[unlist(folds[-i]), ], ntreelimit = fit$bestInd)
      pred_tr = matrix(pred_tr, nrow = dim(data[unlist(folds[-i]), ])[1], ncol = length(unique(y)), byrow = TRUE)
      pred_te = predict(fit, data[unlist(folds[i]), ], ntreelimit = fit$bestInd)
      pred_te = matrix(pred_te, nrow = dim(data[unlist(folds[i]), ])[1], ncol = length(unique(y)), byrow = TRUE)
      tr_er[i] = VALID_FUNC(EVAL_METRIC, as.factor(RESP[unlist(folds[-i])]), pred_tr)
      tes_er[i] = VALID_FUNC(EVAL_METRIC, as.factor(RESP[unlist(folds[i])]), pred_te)

      # predict the unknown test data at every fold (the fold predictions get averaged later)
      tmp_TEST = matrix(predict(fit, TEST, ntreelimit = fit$bestInd), nrow = dim(TEST)[1], ncol = length(unique(y)), byrow = TRUE)
      TEST_lst[[paste0('preds_', i)]] = tmp_TEST
      cat('---------------------------------------------------------------------------', '\n')
      save(tmp_TEST, file = paste('sfcc_', paste(sample(1:1000000000, 1), '_REPEAT_save.RDATA', sep = ""), sep = ""))
      gc()
    }

    out_ALL[[j]] = list(TEST_lst = TEST_lst, PARAMS = PARAMS, sample_seed = sample_seed, tr_er = tr_er, PREDS_tr = PREDS_tr,
                        PREDS_te = PREDS_te, tes_er = tes_er)
    cat('================================================================================================================', '\n')
    gc()
  }

  end = Sys.time()
  return(list(res = out_ALL, time = end - start))
}
```

I experimented with different parameter settings; the following one is a good trade-off between running time and performance, as it runs in 2.89 hours and gives a leaderboard score of 2.238 multi-class log loss. To improve the leaderboard score in this competition I averaged 6 models with different parameter settings; among those, a learning rate (eta) of 0.145 with 320 rounds (num_rounds) gave the best results,

```
params = list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 39, "booster" = "gbtree", "bst:eta" = 0.245,
              "subsample" = 0.7, "max_depth" = 7, "colsample_bytree" = 0.7, "nthread" = 6, "scale_pos_weight" = 0.0,
              "min_child_weight" = 0.0, "max_delta_step" = 1.0)

fit = xgboost_cv(y, ntrain, ntest, repeats = 1, Folds = 4, idx_train = NULL, params, num_rounds = 145, print_every_n = 5,
                 early_stop = 10, maximize = FALSE, verbose = 1, MultiLogLoss)
```

Before submitting the csv predictions, I had to calculate the train and test error for each fold, then average the predictions of the unknown test data and add the column names of the sample submission (the column names had to be in the correct form, otherwise the submission wasn’t accepted, thus *check.names = F* was necessary when reading the data),

```
tr_er = unlist(lapply(fit$res, function(x) x$tr_er))
tes_er = unlist(lapply(fit$res, function(x) x$tes_er))
cat('log loss of train is :', mean(tr_er), '\n')
# log loss of train is : 1.952143
cat('log loss of test is :', mean(tes_er), '\n')
# log loss of test is : 2.232479
lap = unlist(lapply(fit$res, function(x) x$TEST_lst), recursive = FALSE)
avg_dfs = (lap[[1]] + lap[[2]] + lap[[3]] + lap[[4]])/4
subms = data.frame(ID_TEST, avg_dfs)
sampleSubmission <- read.csv("sampleSubmission.csv", check.names = F)
colnames(subms) = colnames(sampleSubmission)
subms = subms[order(subms$Id, decreasing = F), ]
write.csv(subms, "xgb_post_submission_train_error_1_95_test_error_2_2324.csv", row.names=FALSE, quote = FALSE)
```
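
The hard-coded average of the 4 fold predictions can also be written generically (a sketch),

```
avg_dfs_alt = Reduce('+', lap) / length(lap)
all.equal(avg_dfs, avg_dfs_alt)   # TRUE
```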

Finally, I’ll utilize the *FeatureSelection* package, and especially xgboost and ranger, to plot the most important variables,

```
ntr = as.matrix(ntrain)
colnames(ntr) = make.names(colnames(ntr))
library(FeatureSelection)

params_xgboost = list(params = list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 39, "booster" = "gbtree",
                                    "bst:eta" = 0.5, "subsample" = 0.65, "max_depth" = 5, "colsample_bytree" = 0.65, "nthread" = 6,
                                    "scale_pos_weight" = 0.0, "min_child_weight" = 0.0, "max_delta_step" = 1.0),
                      nrounds = 50, print.every.n = 5, verbose = 1, maximize = F)

params_ranger = list(write.forest = TRUE, probability = TRUE, num.threads = 6, num.trees = 50, verbose = FALSE, classification = TRUE,
                     mtry = 4, min.node.size = 5, importance = 'impurity')

params_features = list(keep_number_feat = NULL, union = F)

feat = wrapper_feat_select(ntr, y, params_glmnet = NULL, params_xgboost = params_xgboost, params_ranger = params_ranger, xgb_sort = NULL,
                           CV_folds = 4, stratified_regr = FALSE, cores_glmnet = 2, params_features = params_features)

params_barplot = list(keep_features = 28, horiz = TRUE, cex.names = 0.8)
barplot_feat_select(feat, params_barplot, xgb_sort = 'Cover')
```

The complete script of this blog post can be found as a single file in my GitHub account.