token statistics

# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")

Details

The path_2vector function reads the words of a folder or file into a character vector, using the file_delimiter to split the input data. Typical usage: reading a vocabulary from a text file.
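As a rough illustration of what reading delimited words into a vector amounts to, the following base-R sketch uses scan() on a temporary file. It is not the package's implementation; the file contents and the variable names are made up for the example, and sep stands in for file_delimiter.

```r
# Base-R sketch (not textTinyR's own code): read delimited words into a vector.
tmp <- tempfile(fileext = ".txt")          # hypothetical vocabulary file
writeLines("the term planet earth", tmp)

# sep plays the role of file_delimiter
words <- scan(tmp, what = character(), sep = " ", quiet = TRUE)
unlink(tmp)

print(words)
```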

The freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function.
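Conceptually, a frequency distribution is a named vector of counts per distinct word. The sketch below shows the idea with base R's table(); it only illustrates the kind of result described above and makes no claim about how freq_distribution is implemented or how print_frequency orders its rows. The input words are invented.

```r
# Base-R sketch of a word frequency distribution (illustrative only).
words <- c("planet", "earth", "planet", "moon", "earth", "planet")

freq <- table(words)                                  # counts per distinct word
freq_vec <- setNames(as.vector(freq), names(freq))    # plain named vector

# Assumed post-processing step: sort by count and keep a subset
sorted <- sort(freq_vec, decreasing = TRUE)
print(head(sorted, 2))
```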

The count_character function returns the number of characters of each word in the corpus for EITHER a folder, a file OR a character string vector. The words with a specific number of characters can be retrieved using the print_count_character function.
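The same idea can be sketched in base R with nchar(): count the characters of every word, then keep the words of a given length. This is an illustration with invented data, not the package's code.

```r
# Base-R sketch: characters per word, then filter by length (illustrative only).
words <- c("one", "word", "token", "four")

n_chars <- nchar(words)           # number of characters of each word
names(n_chars) <- words

# Analogue of asking for all words with exactly 4 characters
four_letter <- words[nchar(words) == 4]
print(four_letter)
```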

The collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should be text n-grams separated by a delimiter (for instance 3-grams or 4-grams). The frequency table of a specific word can be retrieved using the print_collocations function.
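One simple way to think about such a table is: split every n-gram on the delimiter and, for a given word, count the other words that appear in the same n-gram. The base-R sketch below follows that reading. The helper cooccur() is hypothetical, and the package may count or normalise co-occurrences differently.

```r
# Base-R sketch of a co-occurrence count within n-grams (assumed definition).
ngrams <- c("one_word_token", "two_words_token",
            "three_words_token", "four_words_token")

parts <- strsplit(ngrams, "_", fixed = TRUE)   # "_" plays the role of n_gram_delimiter

# Hypothetical helper: words sharing an n-gram with `word`, with their counts
cooccur <- function(word, parts) {
  hits <- Filter(function(p) word %in% p, parts)
  others <- unlist(lapply(hits, function(p) p[p != word]))
  table(others)
}

tbl <- cooccur("token", parts)
print(tbl)
```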

The string_dissimilarity_matrix function returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, the dice coefficient (a similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
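To make the dice case concrete, here is a minimal base-R sketch that turns the dice coefficient over character n-grams into a dissimilarity (1 minus the coefficient), alongside a Levenshtein matrix from utils::adist(). The helpers char_ngrams() and dice_dissimilarity() are hypothetical, and the use of unique n-grams is an assumption; the package's exact computation may differ.

```r
# Base-R sketch of a dice-based dissimilarity on character n-grams (assumed form).
char_ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))   # set of distinct n-grams
}

dice_dissimilarity <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  if (length(ga) + length(gb) == 0) return(0)
  # dice coefficient = 2 * |A and B| / (|A| + |B|); dissimilarity = 1 - coefficient
  1 - 2 * length(intersect(ga, gb)) / (length(ga) + length(gb))
}

d <- dice_dissimilarity("night", "nacht")   # the two words share only the bigram "ht"

# Levenshtein distances between all pairs, using base R's utils::adist()
lev <- utils::adist(c("kitten", "sitting"))
```

With bigrams, "night" and "nacht" each have four distinct n-grams and share one, so the coefficient is 2 * 1 / 8 and the dissimilarity is 0.75.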

The look_up_table function returns a look-up list whose names are the n-grams and whose elements are the vectors of words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
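The structure described above can be sketched in base R by extracting the character n-grams of each word and grouping the words by n-gram with split(). The data and variable names are invented, and this is only an approximation of what the method returns.

```r
# Base-R sketch of an n-gram look-up list (illustrative only).
words <- c("one", "tone", "stone")
n <- 3

# Distinct character n-grams of every word
grams_per_word <- lapply(words, function(w) {
  starts <- seq_len(max(nchar(w) - n + 1, 0))
  unique(substring(w, starts, starts + n - 1))
})

# One row per (n-gram, word) pair, then group the words by n-gram
pairs <- data.frame(
  gram = unlist(grams_per_word),
  word = rep(words, lengths(grams_per_word)),
  stringsAsFactors = FALSE
)
lookup <- split(pairs$word, pairs$gram)

print(lookup[["one"]])   # all words containing the n-gram "one"
```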

Methods

token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)

Methods

Public methods

token_stats$new()
token_stats$path_2vector()
token_stats$freq_distribution()
token_stats$print_frequency()
token_stats$count_character()
token_stats$print_count_character()
token_stats$collocation_words()
token_stats$print_collocations()
token_stats$string_dissimilarity_matrix()
token_stats$look_up_table()
token_stats$print_words_lookup_tbl()
token_stats$clone()

Method `new()`

Usage

token_stats$new(
  x_vec = NULL,
  path_2folder = NULL,
  path_2file = NULL,
  file_delimiter = "\n",
  n_gram_delimiter = "_"
)

Arguments

x_vec: either NULL or a string character vector
path_2folder: either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)
path_2file: either NULL or a valid path to a file
file_delimiter: either NULL or a character string specifying the file delimiter
n_gram_delimiter: either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function

Method `path_2vector()`

Usage

token_stats$path_2vector()

Method `freq_distribution()`

Usage

token_stats$freq_distribution()

Method `print_frequency()`

Usage

token_stats$print_frequency(subset = NULL)

Arguments

subset: either NULL or a vector specifying the subset of data to keep (number of rows of the print_frequency function)

Method `count_character()`

Usage

token_stats$count_character()

Method `print_count_character()`

Usage

token_stats$print_count_character(number = NULL)

Arguments

number: a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.

Method `collocation_words()`

Usage

token_stats$collocation_words()

Method `print_collocations()`

Usage

token_stats$print_collocations(word = NULL)

Arguments

word: a character string specifying the word whose collocations the print_collocations function should return

Method `string_dissimilarity_matrix()`

Usage

token_stats$string_dissimilarity_matrix(
  dice_n_gram = 2,
  method = "dice",
  split_separator = " ",
  dice_thresh = 1,
  upper = TRUE,
  diagonal = TRUE,
  threads = 1
)

Arguments

dice_n_gram: a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function
method: a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.
split_separator: a character string specifying the string split separator if method equals cosine in the string_dissimilarity_matrix function. The cosine method operates on sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_"
dice_thresh: a float number used to threshold the data if method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value 1.0.
upper: either TRUE or FALSE. If TRUE then both the lower and the upper part of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NAs
diagonal: either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NAs
threads: a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function
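
The effect of dice_thresh, upper and diagonal can be pictured on a small hand-made matrix. The sketch below is base R only and rests on assumptions: the values are invented, and the comparison used for thresholding (greater than or equal to) is a guess consistent with the description above, not a statement about the package's code.

```r
# Base-R sketch of thresholding and masking a dissimilarity matrix (assumed rules).
d <- matrix(c(0,    0.2,  0.75,
              0.2,  0,    0.4,
              0.75, 0.4,  0), nrow = 3, byrow = TRUE)

thresh <- 0.5
d_thr <- d
d_thr[d_thr >= thresh] <- 1.0          # assumption: values at or above the threshold become 1.0

d_lower <- d_thr
d_lower[upper.tri(d_lower)] <- NA      # analogue of upper = FALSE
diag(d_lower) <- NA                    # analogue of diagonal = FALSE

print(d_lower)
```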

Method `look_up_table()`

Usage

token_stats$look_up_table(n_grams = NULL)

Arguments

n_grams: a numeric value specifying the n-grams in the look_up_table function

Method `print_words_lookup_tbl()`

Usage

token_stats$print_words_lookup_tbl(n_gram = NULL)

Arguments

n_gram: a character string specifying the n-gram to use in the print_words_lookup_tbl function

Method `clone()`

The objects of this class are cloneable with this method.

Usage

token_stats$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples



library(textTinyR)

expl <- c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()


#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)


#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')


#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')


#---------------------
# build a look-up-table:
#---------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')
