token statistics
# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = ' ', n_gram_delimiter = "_")
the path_2vector function returns the words of a folder or a file as a vector (the file_delimiter is used to read in the data). Typical usage: reading a vocabulary from a text file
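For illustration, a minimal sketch that writes a small newline-delimited vocabulary to a temporary file and reads it back (the file contents below are assumptions used only for the example):

library(textTinyR)

# write an illustrative, newline-delimited vocabulary to a temporary file
tmp_file = tempfile(fileext = '.txt')
writeLines(c('apple', 'banana', 'cherry'), con = tmp_file)

# read the vocabulary back as a character vector
init = token_stats$new(path_2file = tmp_file, file_delimiter = "\n")
vec = init$path_2vector()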
the freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function
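For illustration, a minimal sketch (the input vector below is an assumption used only for the example):

library(textTinyR)

wds = c('the', 'fox', 'the', 'dog', 'the')

tk = token_stats$new(x_vec = wds, path_2folder = NULL, path_2file = NULL)

# named, unsorted frequency-distribution vector
tk$freq_distribution()

# tk$print_frequency(subset = 1:2)      # print a subset (first two rows) of the result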
the count_character function returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function
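A similar sketch for character counts (again with an assumed input vector):

library(textTinyR)

wds = c('and', 'word', 'text', 'tokens')

tk = token_stats$new(x_vec = wds, path_2folder = NULL, path_2file = NULL)

# number of characters for each word of the corpus
cnt = tk$count_character()

# tk$print_count_character(number = 4)  # return all words consisting of 4 characters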
the collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ). The input to the function should consist of text n-grams separated by a delimiter (for instance 3-grams or 4-grams). A specific frequency table can be retrieved using the print_collocations function
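A minimal sketch for collocations (the 3-grams below, joined by the '_' delimiter, are assumptions used only for the example):

library(textTinyR)

tri = c('price_of_the', 'of_the_book', 'the_book_is', 'book_is_high')

tk = token_stats$new(x_vec = tri, path_2folder = NULL, path_2file = NULL, n_gram_delimiter = "_")

# co-occurrence frequency table for the n-grams
col = tk$collocation_words()

# tk$print_collocations(word = 'book')  # frequency table for the word 'book'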
the string_dissimilarity_matrix function returns a string-dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, then the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
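A minimal sketch of the dice case (the input vector is an assumption used only for the example):

library(textTinyR)

wds = c('word', 'words', 'wording', 'token')

tk = token_stats$new(x_vec = wds, path_2folder = NULL, path_2file = NULL)

# dice similarity computed on 2-character n-grams
dsm = tk$string_dissimilarity_matrix(dice_n_gram = 2, method = 'dice', dice_thresh = 1.0,
                                     upper = TRUE, diagonal = TRUE, threads = 1)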
the look_up_table function returns a look-up list whose names are the n-grams and whose list vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
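A minimal sketch for the look-up table (assumed input vector):

library(textTinyR)

wds = c('token', 'tokens', 'tokenize', 'statistics')

tk = token_stats$new(x_vec = wds, path_2folder = NULL, path_2file = NULL)

# list names are the 3-character n-grams, list vectors the associated words
lut = tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'tok')   # words associated with the 'tok' n-gram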
token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = ' ', n_gram_delimiter = "_")
--------------
path_2vector()
--------------
freq_distribution()
--------------
print_frequency(subset = NULL)
--------------
count_character()
--------------
print_count_character(number = NULL)
--------------
collocation_words()
--------------
print_collocations(word = NULL)
--------------
string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)
--------------
look_up_table(n_grams = NULL)
--------------
print_words_lookup_tbl(n_gram = NULL)
new()
token_stats$new( x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = "\n", n_gram_delimiter = "_" )
x_vec
either NULL or a character string vector
path_2folder
either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)
path_2file
either NULL or a valid path to a file
file_delimiter
either NULL or a character string specifying the file delimiter
n_gram_delimiter
either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function
path_2vector()
token_stats$path_2vector()
freq_distribution()
token_stats$freq_distribution()
print_frequency()
token_stats$print_frequency(subset = NULL)
subset
either NULL or a vector specifying the subset of the data to keep (the number of rows that the print_frequency function returns)
count_character()
token_stats$count_character()
print_count_character()
token_stats$print_count_character(number = NULL)
number
a numeric value for the print_count_character function. All words with a number of characters equal to the number parameter will be returned.
collocation_words()
token_stats$collocation_words()
print_collocations()
token_stats$print_collocations(word = NULL)
word
a character string specifying the word for which the print_collocations function should return the collocation frequency table
string_dissimilarity_matrix()
token_stats$string_dissimilarity_matrix( dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1, upper = TRUE, diagonal = TRUE, threads = 1 )
dice_n_gram
a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function
method
a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.
split_separator
a character string specifying the string split separator if the method is cosine in the string_dissimilarity_matrix function. The cosine method expects sentences, so for a sentence such as "this_is_a_word_sentence" the split_separator should be "_" (see the sketch after this argument list)
dice_thresh
a float number used to threshold the data if the method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value of 1.0.
upper
either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's
diagonal
either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's
threads
a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function
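A minimal sketch of the cosine case (the underscore-joined sentences below are assumptions used only for the example):

library(textTinyR)

snt = c('this_is_a_word_sentence', 'this_is_another_sentence', 'a_third_word_sentence')

tk = token_stats$new(x_vec = snt, path_2folder = NULL, path_2file = NULL)

# cosine distance between the sentences; words are split on the '_' separator
dsm = tk$string_dissimilarity_matrix(method = 'cosine', split_separator = '_',
                                     upper = TRUE, diagonal = TRUE, threads = 1)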
look_up_table()
token_stats$look_up_table(n_grams = NULL)
n_grams
a numeric value specifying the length (number of characters) of the n-grams in the look_up_table function
print_words_lookup_tbl()
token_stats$print_words_lookup_tbl(n_gram = NULL)
n_gram
a character string specifying the n-gram to use in the print_words_lookup_tbl function
clone()
The objects of this class are cloneable with this method.
token_stats$clone(deep = FALSE)
deep
Whether to make a deep clone.
library(textTinyR)

expl = c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()

#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)

#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')

#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')

#---------------------
# build a look-up-table:
#---------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')