mlampros Organizing and Sharing thoughts, Receiving constructive feedback

Fuzzy string Matching using fuzzywuzzyR and the reticulate package in R

I recently released another R package on CRAN, fuzzywuzzyR, which ports the fuzzywuzzy Python library to R. “fuzzywuzzy does fuzzy string matching by using the Levenshtein Distance to calculate the differences between sequences (of character strings).”

This is not big news, since similar R packages such as stringdist already exist. Why create the package, then? Well, I intend to participate in a recently launched Kaggle competition, and one popular method for building features (predictors) is fuzzy string matching, as explained in this blog post. My second aim was to use the reticulate package (newly released by RStudio), which “provides an R interface to Python modules, classes, and functions” and makes porting Python code to R much less cumbersome.

First, I’ll explain the functionality of the fuzzywuzzyR package, and then I’ll give some examples of how to take advantage of the reticulate package in R.
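To make the idea of Levenshtein-based matching concrete, here is a minimal sketch in plain Python. It is not the fuzzywuzzy or fuzzywuzzyR implementation; the function names `levenshtein` and `similarity` are hypothetical, and the 0 to 100 score is just one simple way to normalize an edit distance, assumed here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 100]; 100 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

For instance, `levenshtein("kitten", "sitting")` needs two substitutions and one insertion, so the distance is 3.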

Continue reading...

Processing of GeoJson data in R

This blog post is about my recently released package on CRAN, geojsonR. The following notes and examples are based mainly on the package Vignette.

“GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes, based on JavaScript Object Notation. The features include points (therefore addresses and locations), line strings (therefore streets, highways and boundaries), polygons (countries, provinces, tracts of land), and multi-part collections of these types. GeoJSON features need not represent entities of the physical world only; mobile routing and navigation apps, for example, might describe their service coverage using GeoJSON. The GeoJSON format differs from other GIS standards in that it was written and is maintained not by a formal standards organization, but by an Internet working group of developers.”
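As a rough illustration of the structure described in this quote, the sketch below builds a small GeoJSON document with Python's standard `json` module. It does not use geojsonR; the helper `make_point_feature` and the sample place are assumptions made for the example. Note that GeoJSON positions are written as longitude first, then latitude.

```python
import json


def make_point_feature(lon, lat, properties):
    """Hypothetical helper: wrap a point and its non-spatial attributes
    in a GeoJSON Feature. Coordinates are [longitude, latitude]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


# A FeatureCollection groups several features into one document.
collection = {
    "type": "FeatureCollection",
    "features": [make_point_feature(23.7275, 37.9838, {"name": "Athens"})],
}

text = json.dumps(collection)  # serialize to a GeoJSON string
parsed = json.loads(text)      # parse it back into Python objects
```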

Continue reading...

Text Processing using the textTinyPy package in Python

This blog post (which has many similarities with the previous one) explains the functionality of the textTinyPy package, which can be installed from PyPI using:

  • pip install textTinyPy

The package has been tested on Linux using Python 2.7. It is based on the same C++ source code as the textTinyR package, but it has a slightly different structure and is wrapped in Python using Cython. It will work properly only if the following requirements are satisfied / installed:

Continue reading...

Text Processing using the textTinyR package

This blog post is about my recently released package on CRAN, textTinyR. The following notes and examples are based mainly on the package Vignette.

The advantage of the textTinyR package lies in its ability to process big text data files efficiently in batches. For this purpose, it offers functions for splitting, parsing, tokenizing and creating a vocabulary. Moreover, it includes functions for building either a document-term matrix or a term-document matrix and for extracting information from those (term associations, most frequent terms). Lastly, it provides functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions for working with sparse matrices. The source code is written mainly in C++11 and exposed to R through the Rcpp, RcppArmadillo and BH packages.
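The difference between a document-term matrix and a term-document matrix can be sketched in a few lines of plain Python. This is not the textTinyR API and it uses dense lists rather than sparse matrices; the names `tokenize` and `document_term_matrix`, and the simple lowercase alphanumeric tokenizer, are assumptions made for the illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts how often
    vocabulary[j] occurs in documents[i]."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


docs = ["the cat sat", "the cat saw the dog"]
vocab, dtm = document_term_matrix(docs)

# The term-document matrix is the transpose: one row per term.
tdm = [list(col) for col in zip(*dtm)]
```

Here `vocab` is `['cat', 'dog', 'sat', 'saw', 'the']`, each row of `dtm` describes one document, and each row of `tdm` describes one term.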

Continue reading...

Clustering using the ClusterR package

This blog post is about clustering and specifically about my recently released package on CRAN, ClusterR. The following notes and examples are based mainly on the package Vignette.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
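One common way to make "more similar to each other" concrete is k-means, which assigns each point to its nearest centroid and then moves each centroid to the mean of its points. The sketch below is a minimal pure-Python version for illustration only. It is not the ClusterR implementation, the function name `kmeans` and the toy data are assumptions, and the initial centroids are passed in explicitly to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal Lloyd's algorithm on tuples of floats.
    Returns (centroids, labels), where labels[i] is the cluster of points[i]."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = []
        for p in points:
            distances = [sum((x - c) ** 2 for x, c in zip(p, centroid))
                         for centroid in centroids]
            labels.append(distances.index(min(distances)))
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

With this toy data the first two points end up in one cluster and the last two in the other.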

Continue reading...