mlampros Organizing and Sharing thoughts, Receiving constructive feedback

Non Metric Space (Approximate) Library in R

The nmslibR package is a wrapper of NMSLIB, which according to the authors “… is a similarity search library and a toolkit for evaluation of similarity search methods. The goal of the project is to create an effective and comprehensive toolkit for searching in generic non-metric spaces. Being comprehensive is important, because no single method is likely to be sufficient in all cases. Also note that exact solutions are hardly efficient in high dimensions and/or non-metric spaces. Hence, the main focus is on approximate methods”.

Before wrapping NMSLIB, I spent some time looking for a nearest neighbor library that works with high dimensional data and scales to big datasets. I’ve already written a package for k-nearest-neighbor search (KernelKnn); however, it’s based on brute force, so its computation time grows with the number of rows and it becomes slow for large datasets. The nmslibR package, besides the main functionality of the NMSLIB python library, also includes an Approximate Kernel k-nearest function, which, as I will show in the next lines, is both fast and accurate. A comparison of NMSLIB with other popular approximate k-nearest-neighbor methods can be found here.
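To make the cost of brute force concrete, here is a minimal sketch in plain Python (not the KernelKnn or nmslibR implementation; the function name `brute_force_knn` and the toy data are made up for illustration). Every query has to compute a distance to every row, which is why the running time grows with the size of the data.

```python
import math

def brute_force_knn(data, query, k):
    """Return (row_index, distance) pairs for the k rows of `data` closest to `query`.

    Hypothetical helper for illustration only: it visits every row,
    so one query costs one distance computation per row.
    """
    distances = []
    for index, row in enumerate(data):
        # Euclidean distance between the query and the current row
        distances.append((math.dist(row, query), index))
    distances.sort()  # smallest distances first
    return [(index, dist) for dist, index in distances[:k]]

# Toy data: four points in two dimensions
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
neighbors = brute_force_knn(points, (0.0, 0.1), k=2)
print(neighbors)
```

Approximate methods such as the ones in NMSLIB avoid this full scan by building an index first and accepting that a few true neighbors may occasionally be missed.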

Continue reading...

Regularized Greedy Forest in R

This blog post is about my newly released RGF package (the blog post consists mainly of the package Vignette). The RGF package is a wrapper of the Regularized Greedy Forest python package, which also includes a Multi-core implementation (FastRGF). Porting from Python to R was made possible by the reticulate package, and the installation requires basic knowledge of Python. Installation is straightforward on Linux, but on Macintosh and Windows it might be somewhat cumbersome (on Windows the package can currently be used only from within the command prompt). Detailed installation instructions for all three Operating Systems can be found in the README.md file and in the rgf_python Github repository.

Continue reading...

Statoil / C-CORE Iceberg Classifier Competition

For the last two months, I participated in a machine learning competition organized by Kaggle (a platform for predictive modeling and analytics), where I ended up in the top 1% on the private leaderboard, or 24th out of 3343 participants. I thought it would be worth writing a blog post in order to both share my experience / insights and keep a reference of key features for satellite imagery (Sentinel-1 satellite data and specifically HH - transmit/receive horizontally - and HV - transmit horizontally and receive vertically) in case it might be useful in the future.

Continue reading...

Geospatial Queries using Pymongo in R

Ever since I submitted the geojsonR package, I have been interested in running geospatial MongoDB queries using GeoJson data. I decided to use PyMongo (through the reticulate package) after opening two Github issues here and here. In my opinion, the PyMongo library is huge and covers a lot of things; however, my intention was only to be able to run geospatial queries from within R.

The GeoMongo package

The GeoMongo package allows the user,

  • to insert and query only GeoJson data using the geomongo R6 class
  • to read data in either json (through the geojsonR package) or BSON format (I’ll explain later when BSON is necessary for inserting data)
  • to validate a json instance against a schema with the json_schema_validator() function (input parameters are R named lists)
  • to utilize MongoDB console commands using the mongodb_console() function. The mongodb_console() function takes advantage of the base R system() function. For instance, MongoDB console commands are necessary in case of bulk import / export of data as documented here and here.
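As a rough illustration of what a geospatial query on GeoJson data looks like on the Python side, the sketch below builds a GeoJSON Point and a MongoDB `$near` filter as plain dictionaries. This is not the GeoMongo API: the helper names `geojson_point` and `near_query`, the field name `location` and the coordinates are assumptions made up for the example.

```python
def geojson_point(longitude, latitude):
    """Build a GeoJSON Point; GeoJSON stores coordinates as [longitude, latitude]."""
    if not (-180 <= longitude <= 180 and -90 <= latitude <= 90):
        raise ValueError("longitude must be in [-180, 180] and latitude in [-90, 90]")
    return {"type": "Point", "coordinates": [longitude, latitude]}

def near_query(field, longitude, latitude, max_distance_m):
    """Build a MongoDB filter that matches documents near a point.

    `field` is the (hypothetical) document field holding the GeoJSON geometry;
    MongoDB expects a 2dsphere index on that field for $near to work.
    """
    return {
        field: {
            "$near": {
                "$geometry": geojson_point(longitude, latitude),
                "$maxDistance": max_distance_m,  # in meters
            }
        }
    }

# Documents within 500 meters of an example location
query = near_query("location", -73.97, 40.77, 500)
print(query)
```

With PyMongo, a dictionary like this would typically be passed as the filter argument of a collection's `find()` method.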

Continue reading...

Fuzzy string Matching using fuzzywuzzyR and the reticulate package in R

I recently released yet another R package on CRAN - fuzzywuzzyR - which ports the fuzzywuzzy python library to R. “fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings).”
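For readers unfamiliar with the Levenshtein distance, here is a minimal sketch of the classic dynamic-programming computation in plain Python. It only illustrates the distance itself (the number of single-character insertions, deletions and substitutions needed to turn one string into another); it is not the code or the scoring used inside fuzzywuzzy, and the function name `levenshtein` is my own.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The smaller the distance relative to the string lengths, the more similar the two strings are, which is what makes it useful for building string-similarity features.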

There is no big news here, as similar packages, such as the stringdist package, already exist in R. Why create the package, then? Well, I intend to participate in a recently launched Kaggle competition, and one popular method to build features (predictors) is fuzzy string matching, as explained in this blog post. My second aim was to use the reticulate package (newly released by RStudio), which “provides an R interface to Python modules, classes, and functions” and makes porting python code to R much less cumbersome.

First, I’ll explain the functionality of the fuzzywuzzyR package and then I’ll give some examples of how to take advantage of the reticulate package in R.

Continue reading...