07 Aug 2017
Ever since I submitted the geojsonR package, I have been interested in running geospatial MongoDB queries on GeoJSON data. I decided to use PyMongo (through the reticulate package) after opening two GitHub issues here and here. The PyMongo library is large and covers a lot of ground; my intention, however, was simply to run geospatial queries from within R.
The GeoMongo package
The GeoMongo package allows the user
- to insert and query only GeoJSON data using the geomongo R6 class
- to read data in either JSON (through the geojsonR package) or BSON format (I’ll explain later when BSON is necessary for inserting data)
- to validate a JSON instance against a schema with the json_schema_validator() function (input parameters are R named lists)
- to run MongoDB console commands with the mongodb_console() function, which relies on the base R system() function. Console commands are necessary, for instance, for bulk import / export of data, as documented here and here.
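To make the idea of a geospatial query on GeoJSON data concrete, here is a minimal Python sketch of the kind of documents involved. It is not the GeoMongo API: the helper names `make_point` and `near_query`, the field name `location`, and the sample coordinates are assumptions made for illustration. The code only builds plain dictionaries; no MongoDB connection is opened.

```python
import json


def make_point(lon, lat):
    """Return a GeoJSON Point; GeoJSON orders coordinates as [longitude, latitude]."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("longitude or latitude out of range")
    return {"type": "Point", "coordinates": [lon, lat]}


def near_query(field, lon, lat, max_distance_m):
    """Build a MongoDB $nearSphere filter for points within max_distance_m meters."""
    return {
        field: {
            "$nearSphere": {
                "$geometry": make_point(lon, lat),
                "$maxDistance": max_distance_m,
            }
        }
    }


# Hypothetical document and query; the field name "location" is an assumption.
doc = {"name": "hypothetical place", "location": make_point(-73.97, 40.77)}
query = near_query("location", -73.97, 40.77, 500)
print(json.dumps(query, indent=2))
```

With PyMongo, a filter of this shape would typically be passed to a collection's `find()` method, and MongoDB requires a geospatial index (for example `2dsphere`) on the queried field for `$nearSphere` to work.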
13 Apr 2017
I recently released another R package on CRAN, fuzzywuzzyR, which ports the fuzzywuzzy Python library to R. “fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings).”
This is no big news, as similar packages, such as stringdist, already exist in R. Why create the package, then? Well, I intend to participate in a recently launched Kaggle competition, and one popular method for building features (predictors) is fuzzy string matching, as explained in this blog post. My second aim was to use the reticulate package (newly released by RStudio), which “provides an R interface to Python modules, classes, and functions” and makes porting Python code to R far less cumbersome.
First, I’ll explain the functionality of the fuzzywuzzyR package and then I’ll give some examples on how to take advantage of the reticulate package in R.
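As background for the Levenshtein distance mentioned above, the following is a plain-Python sketch of the classic dynamic-programming algorithm, together with a simple 0-100 similarity score. It is an illustration only, not the implementation used by fuzzywuzzy or fuzzywuzzyR; the function names and the normalization by the longer string's length are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to a 0-100 score (assumed normalization, for illustration)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))   # 57
```

A score like this can be computed for each pair of strings and used as a feature (predictor), which is the use case described above.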
29 Mar 2017
This blog post is about my recently released package on CRAN, geojsonR. The following notes and examples are based mainly on the package vignette.
10 Jan 2017
This blog post (which has many similarities with the previous one) explains the functionality of the textTinyPy package, which can be installed from PyPI using,
The package has been tested on Linux using Python 2.7. It is based on the same C++ source code as the textTinyR package, but it has a slightly different structure and is wrapped in Python using Cython. It will work properly only if the following requirements are satisfied / installed:
05 Jan 2017
This blog post is about my recently released package on CRAN, textTinyR. The following notes and examples are based mainly on the package vignette.
The advantage of the textTinyR package lies in its ability to process big text data files efficiently in batches. For this purpose, it offers functions for splitting, parsing, tokenizing and creating a vocabulary. Moreover, it includes functions for building either a document-term matrix or a term-document matrix and for extracting information from those (term associations, most frequent terms). Lastly, it provides functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and for working with sparse matrices. The source code is written mainly in C++11 and exported to R through the Rcpp, RcppArmadillo and BH packages.
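To illustrate what a document-term matrix and its transpose look like, here is a small Python sketch using only the standard library. It is not the textTinyR API: the function names, the whitespace tokenizer and the sample documents are assumptions, and it uses dense lists where a real implementation would use sparse matrices.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


docs = ["the cat sat", "the dog sat down"]  # hypothetical sample documents
vocabulary, dtm = document_term_matrix(docs)

# The term-document matrix is the transpose: one row per term.
tdm = [list(row) for row in zip(*dtm)]

# Total frequency of each term across all documents.
term_totals = {term: sum(row) for term, row in zip(vocabulary, tdm)}

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(dtm)         # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Sorting `term_totals` by value gives the most frequent terms, one of the summaries mentioned above.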