Fuzzy character string matching (ratios)

# init <- FuzzMatcher$new(decoding = NULL)

Details

The decoding parameter is useful for non-ASCII character strings. If it is not NULL, the force_ascii parameter (where applicable) is internally set to FALSE. Decoding applies only to Python 2 configurations, because in Python 3 character strings are decoded to Unicode by default.

The Partial_token_set_ratio method works as follows: 1. find all alphanumeric tokens in each string, 2. treat them as a set, 3. construct two strings of the form <sorted_intersection><sorted_remainder>, 4. take the ratios of those two strings, 5. control for unordered partial matches (here partial matching is TRUE).

The Partial_token_sort_ratio method returns the ratio of the most similar substring as a number between 0 and 100, sorting the tokens before comparing.

The Ratio method returns a ratio in the form of an integer value, based on a SequenceMatcher-like class that is built on top of the Levenshtein package (https://github.com/miohtama/python-Levenshtein).
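As a rough illustration of what a Levenshtein-style ratio looks like, the following base R sketch scores two strings with utils::adist(). It assumes the common convention in which a substitution costs 2 (one deletion plus one insertion), so the score is the share of characters that do not have to be edited. The helper name simple_ratio is hypothetical, and the result is not guaranteed to agree with the Ratio method in every case.

```r
# Hypothetical helper: a Levenshtein-style similarity score in base R.
# Assumption: substitutions cost 2, so the distance effectively counts
# insertions and deletions only.
simple_ratio <- function(string1, string2) {
  len_sum <- nchar(string1) + nchar(string2)
  # Sketch choice: two empty strings are treated as identical.
  if (len_sum == 0) return(100L)
  dist <- utils::adist(string1, string2,
                       costs = list(insertions = 1,
                                    deletions = 1,
                                    substitutions = 2))[1, 1]
  as.integer(round(100 * (len_sum - dist) / len_sum))
}

simple_ratio("abc", "abc")   # 100
simple_ratio("abc", "xyz")   # 0
```

Identical strings need no edits and score 100, while strings with no characters in common need every character replaced and score 0.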

The QRATIO method performs a quick ratio comparison between two strings. It runs full_process from utils on both strings and short-circuits if either string is empty after processing.

The WRATIO method returns a measure of the sequences' similarity between 0 and 100, using different algorithms. The steps, in the order they occur: 1. run full_process from utils on both strings, 2. short-circuit if this makes either string empty, 3. take the ratio of the two processed strings (fuzz.ratio), 4. compare the lengths of the strings (if one string is more than 1.5 times as long as the other, use partial_ratio comparisons and scale partial results by 0.9, which makes sure that only full results can return 100; if one string is over 8 times as long as the other, scale by 0.6 instead), 5. run the other ratio functions (when using partial ratio functions, call partial_ratio, partial_token_sort_ratio and partial_token_set_ratio and scale all of these by the length-based ratio; otherwise call token_sort_ratio and token_set_ratio; all token-based comparisons are scaled by 0.95, on top of any partial scalars), 6. take the highest value from these results, round it and return it as an integer.
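The weighting in steps 4 to 6 can be sketched in base R as below. This is only an illustration of the scaling logic: the helper name combine_wratio and its arguments are hypothetical, the individual scores are assumed to have been computed elsewhere, and returning 0 for an empty string is an assumption about the short circuit.

```r
# Hypothetical helper: combine already-computed scores (each between 0 and
# 100) following the WRATIO steps. len1 and len2 are the lengths of the two
# processed strings.
combine_wratio <- function(len1, len2, base,
                           partial = 0, partial_token_sort = 0,
                           partial_token_set = 0,
                           token_sort = 0, token_set = 0) {
  # Assumption: the short circuit on an empty string yields 0.
  if (len1 == 0 || len2 == 0) return(0L)

  len_ratio <- max(len1, len2) / min(len1, len2)
  use_partial <- len_ratio > 1.5             # one string is much longer
  partial_scale <- if (len_ratio > 8) 0.6 else 0.9
  token_scale <- 0.95                        # applied to token-based scores

  candidates <- if (use_partial) {
    c(base,
      partial * partial_scale,
      partial_token_sort * token_scale * partial_scale,
      partial_token_set * token_scale * partial_scale)
  } else {
    c(base, token_sort * token_scale, token_set * token_scale)
  }
  as.integer(round(max(candidates)))
}

# Similar lengths: token-based scores are scaled by 0.95.
combine_wratio(10, 12, base = 80, token_sort = 90, token_set = 100)  # 95

# One string is 10 times longer: partial scores are scaled by 0.6.
combine_wratio(5, 50, base = 20, partial = 100)                      # 60
```

Because every partial or token-based score is scaled below 1, only the plain ratio in step 3 can produce a result of exactly 100.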

The UWRATIO method returns a measure of the sequences' similarity between 0 and 100, using different algorithms. It is the same as WRATIO but preserves Unicode.

The UQRATIO method returns a Unicode quick ratio. It calls QRATIO with force_ascii set to FALSE.

The Token_sort_ratio method returns a measure of the sequences' similarity between 0 and 100, sorting the tokens before comparing.
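To make the token-sorting step concrete, here is a small base R sketch. The helper name sort_tokens is hypothetical, and the exact tokenisation rule (lower-casing and splitting on anything that is not a letter or digit) is an assumption made for illustration.

```r
# Hypothetical helper: lower-case a string, split it into alphanumeric
# tokens, sort the tokens and join them with single spaces.
sort_tokens <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]             # drop empty tokens
  # method = "radix" sorts in the C locale, independent of the session locale
  paste(sort(tokens, method = "radix"), collapse = " ")
}

sort_tokens("New York Jets")    # "jets new york"
sort_tokens("Jets, New York")   # "jets new york"
```

Once both strings are normalised this way, word order no longer affects the comparison, so the two inputs above would be scored as identical.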

The Partial_ratio method returns the ratio of the most similar substring as a number between 0 and 100.
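One simple way to picture a "most similar substring" score is a sliding window: slide the shorter string along the longer one and keep the best score. The base R sketch below does exactly that. It is a simplified stand-in rather than the method used by Partial_ratio, and the helper names window_ratio and simple_partial_ratio are hypothetical.

```r
# Hypothetical helper: Levenshtein-style score between 0 and 100.
# Assumption: substitutions cost 2.
window_ratio <- function(a, b) {
  len_sum <- nchar(a) + nchar(b)
  if (len_sum == 0) return(100)
  d <- utils::adist(a, b, costs = list(insertions = 1, deletions = 1,
                                       substitutions = 2))[1, 1]
  100 * (len_sum - d) / len_sum
}

# Hypothetical helper: best score of the shorter string against every
# equally long window of the longer string.
simple_partial_ratio <- function(string1, string2) {
  if (nchar(string1) <= nchar(string2)) {
    shorter <- string1; longer <- string2
  } else {
    shorter <- string2; longer <- string1
  }
  n <- nchar(shorter)
  if (n == 0) return(0L)                       # nothing to match
  starts <- seq_len(nchar(longer) - n + 1)
  windows <- substring(longer, starts, starts + n - 1)
  scores <- vapply(windows, function(w) window_ratio(shorter, w), numeric(1))
  as.integer(round(max(scores)))
}

simple_partial_ratio("falcons", "atlanta falcons")   # 100
```

The shorter string appears verbatim inside the longer one, so one window matches exactly and the score is 100.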

The Token_set_ratio method works as follows: 1. find all alphanumeric tokens in each string, 2. treat them as a set, 3. construct two strings of the form <sorted_intersection><sorted_remainder>, 4. take the ratios of those two strings, 5. control for unordered partial matches (here partial matching is FALSE).
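Steps 1 to 3 of the token-set approach can be sketched in base R as follows. The helper names tokenize and token_set_strings are hypothetical, and the tokenisation rule is an assumption. The sketch only builds the strings that would subsequently be compared; it does not compute the final ratio.

```r
# Hypothetical helper: unique lower-case alphanumeric tokens of a string.
tokenize <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

# Hypothetical helper: build <sorted_intersection> and the two
# <sorted_intersection><sorted_remainder> strings.
token_set_strings <- function(string1, string2) {
  t1 <- tokenize(string1)
  t2 <- tokenize(string2)
  common <- paste(sort(intersect(t1, t2), method = "radix"), collapse = " ")
  rest1  <- paste(sort(setdiff(t1, t2), method = "radix"), collapse = " ")
  rest2  <- paste(sort(setdiff(t2, t1), method = "radix"), collapse = " ")
  c(intersection = common,
    combined1 = trimws(paste(common, rest1)),
    combined2 = trimws(paste(common, rest2)))
}

token_set_strings("new york jets", "jets new york football")
# intersection: "jets new york"
# combined1:    "jets new york"
# combined2:    "jets new york football"
```

Because the shared tokens come first in both combined strings, inputs that differ only by a few extra words end up with long identical prefixes, which is what makes the comparison tolerant of word order and of extra tokens.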

Methods

FuzzMatcher$new(decoding = NULL)
--------------
Partial_token_set_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Partial_token_sort_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Ratio(string1 = NULL, string2 = NULL)
--------------
QRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)
--------------
WRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)
--------------
UWRATIO(string1 = NULL, string2 = NULL)
--------------
UQRATIO(string1 = NULL, string2 = NULL)
--------------
Token_sort_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Partial_ratio(string1 = NULL, string2 = NULL)
--------------
Token_set_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)

References

https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py, https://docs.python.org/3/library/codecs.html#standard-encodings

Methods

Method `new()`

Usage

FuzzMatcher$new(decoding = NULL)

Arguments

decoding: either NULL or a character string. If not NULL, the decoding parameter takes one of the standard Python encodings (such as 'utf-8'). See the Details and References sections for more information.

Method `Partial_token_set_ratio()`

Usage

FuzzMatcher$Partial_token_set_ratio(
  string1 = NULL,
  string2 = NULL,
  force_ascii = TRUE,
  full_process = TRUE
)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).
full_process: either TRUE or FALSE. If TRUE then the string is processed by: 1. removing all characters except letters and numbers, 2. trimming whitespace, 3. forcing to lower case.
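The three processing steps can be approximated in base R as shown below. This is a sketch under stated assumptions, not the package's implementation: the helper name simple_full_process is hypothetical, non-alphanumeric characters are assumed to be replaced by a space rather than deleted outright, and the input is assumed to be UTF-8 encoded when force_ascii is TRUE.

```r
# Hypothetical helper approximating the full_process steps.
simple_full_process <- function(s, force_ascii = TRUE) {
  if (force_ascii) {
    # Assumption: input is UTF-8; characters outside ASCII are dropped.
    s <- iconv(s, from = "UTF-8", to = "ASCII", sub = "")
  }
  s <- gsub("[^[:alnum:]]+", " ", s)   # 1. keep only letters and numbers
  s <- tolower(s)                      # 3. force to lower case
  trimws(s)                            # 2. trim surrounding whitespace
}

simple_full_process("  Atlanta--Falcons!! ")   # "atlanta falcons"
simple_full_process("!!!")                     # ""
```

The second call shows why methods that run this processing can short-circuit: a string made only of punctuation is empty afterwards.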

Method `Partial_token_sort_ratio()`

Usage

FuzzMatcher$Partial_token_sort_ratio(
  string1 = NULL,
  string2 = NULL,
  force_ascii = TRUE,
  full_process = TRUE
)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).
full_process: either TRUE or FALSE. If TRUE then the string is processed by: 1. removing all characters except letters and numbers, 2. trimming whitespace, 3. forcing to lower case.

Method `Ratio()`

Usage

FuzzMatcher$Ratio(string1 = NULL, string2 = NULL)

Arguments

string1: a character string.
string2: a character string.

Method `QRATIO()`

Usage

FuzzMatcher$QRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).

Method `WRATIO()`

Usage

FuzzMatcher$WRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).

Method `UWRATIO()`

Usage

FuzzMatcher$UWRATIO(string1 = NULL, string2 = NULL)

Arguments

string1: a character string.
string2: a character string.

Method `UQRATIO()`

Usage

FuzzMatcher$UQRATIO(string1 = NULL, string2 = NULL)

Arguments

string1: a character string.
string2: a character string.

Method `Token_sort_ratio()`

Usage

FuzzMatcher$Token_sort_ratio(
  string1 = NULL,
  string2 = NULL,
  force_ascii = TRUE,
  full_process = TRUE
)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).
full_process: either TRUE or FALSE. If TRUE then the string is processed by: 1. removing all characters except letters and numbers, 2. trimming whitespace, 3. forcing to lower case.

Method `Partial_ratio()`

Usage

FuzzMatcher$Partial_ratio(string1 = NULL, string2 = NULL)

Arguments

string1: a character string.
string2: a character string.

Method `Token_set_ratio()`

Usage

FuzzMatcher$Token_set_ratio(
  string1 = NULL,
  string2 = NULL,
  force_ascii = TRUE,
  full_process = TRUE
)

Arguments

string1: a character string.
string2: a character string.
force_ascii: either TRUE or FALSE. If TRUE then only ASCII characters are allowed (the string is force-converted to ASCII).
full_process: either TRUE or FALSE. If TRUE then the string is processed by: 1. removing all characters except letters and numbers, 2. trimming whitespace, 3. forcing to lower case.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

FuzzMatcher$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples


try({
  # attach the package first so that check_availability() can be found
  library(fuzzywuzzyR)

  # run only if Python and the required Python modules are available
  if (reticulate::py_available(initialize = FALSE)) {

    if (check_availability()) {

      s1 <- "Atlanta Falcons"

      s2 <- "New York Jets"

      init <- FuzzMatcher$new()

      init$Partial_token_set_ratio(string1 = s1,
                                   string2 = s2,
                                   force_ascii = TRUE,
                                   full_process = TRUE)

      init$Partial_token_sort_ratio(string1 = s1,
                                    string2 = s2,
                                    force_ascii = TRUE,
                                    full_process = TRUE)

      init$Ratio(string1 = s1, string2 = s2)

      init$QRATIO(string1 = s1, string2 = s2, force_ascii = TRUE)

      init$WRATIO(string1 = s1, string2 = s2, force_ascii = TRUE)

      init$UWRATIO(string1 = s1, string2 = s2)

      init$UQRATIO(string1 = s1, string2 = s2)

      init$Token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)

      init$Partial_ratio(string1 = s1, string2 = s2)

      init$Token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)
    }
  }
}, silent = TRUE)
