token statistics

# utl <- token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL,
#                        file_delimiter = "\n", n_gram_delimiter = "_")

Details

The path_2vector function reads the words of a folder or file into a character vector (the file_delimiter is used to split the input data). Typical usage: reading a vocabulary from a text file.
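To illustrate what reading a delimited vocabulary file into a vector involves, here is a minimal base-R sketch. It does not call textTinyR and is not its implementation; the temporary file and the newline delimiter are assumptions made for the example.

```r
# Base-R sketch only: read a newline-delimited vocabulary file into a vector.
# The temporary file stands in for a real vocabulary file (assumption).
tmp <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "gamma"), tmp)

# scan() splits the input on the given separator and returns a character vector
vocab <- scan(tmp, what = character(), sep = "\n", quiet = TRUE)

unlink(tmp)  # remove the temporary file
```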

The freq_distribution function returns a named, unsorted frequency-distribution vector for EITHER a folder, a file OR a character string vector. A specific subset of the result can be retrieved using the print_frequency function.
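The idea of a named frequency vector can be sketched in base R as follows. This is an illustration of the concept under assumed toy data, not the code that freq_distribution runs internally.

```r
# Base-R sketch of a word frequency distribution (toy data, an assumption).
words <- c("the", "cat", "sat", "on", "the", "mat", "the", "end")

# table() counts each distinct word; convert it to a plain named integer vector
tbl <- table(words)
freq <- setNames(as.integer(tbl), names(tbl))

# sorting by decreasing count puts the most frequent word first
top <- sort(freq, decreasing = TRUE)[1]
```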

The count_character function returns the number of characters of each word of the corpus for EITHER a folder, a file OR a character string vector. Words with a specific number of characters can be retrieved using the print_count_character function.
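A rough base-R equivalent of counting characters per word and then selecting words of a given length might look like this. The word list is made up for the example, and the sketch does not reproduce the return format of count_character.

```r
# Base-R sketch: characters per word, then words of a chosen length.
words <- c("one", "three", "four", "seven", "two")

# nchar() is vectorised and returns one count per word
n_chars <- nchar(words)

# keep only the words that have exactly five characters
five_letter <- words[n_chars == 5]
```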

The collocation_words function returns a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a character string vector. A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172). The input to the function should be text n-grams separated by a delimiter (for instance 3- or 4-grams). A specific frequency table can be retrieved using the print_collocations function.
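One way to picture a co-occurrence count over delimited n-grams is the base-R sketch below. The helper co_occurrences is hypothetical, the n-grams are invented, and the counting rule (words that share an n-gram with the query word) is an assumption about the concept rather than a description of what collocation_words computes.

```r
# Base-R sketch: count the words that share an n-gram with a query word.
ngrams <- c("new_york_city", "new_york_times", "york_city_hall")
delimiter <- "_"

# split every n-gram into its component words
parts <- strsplit(ngrams, delimiter, fixed = TRUE)

# hypothetical helper, not part of textTinyR
co_occurrences <- function(word, parts) {
  hits <- Filter(function(p) word %in% p, parts)          # n-grams containing the word
  others <- unlist(lapply(hits, function(p) p[p != word])) # their remaining words
  tbl <- table(others)
  setNames(as.integer(tbl), names(tbl))
}

york <- co_occurrences("york", parts)
```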

The string_dissimilarity_matrix function returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can be a character string vector only. If the method is dice, the dice coefficient (similarity) is calculated between two strings for a specific number of character n-grams (dice_n_gram).
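As a worked illustration of the dice coefficient on character n-grams, the following base-R sketch uses the textbook formula 2 * |X ∩ Y| / (|X| + |Y|) over sets of unique n-grams and converts it to a dissimilarity. The helpers char_ngrams and dice_dissimilarity are hypothetical names, and the package may differ in details such as how duplicate n-grams or thresholds are handled.

```r
# Base-R sketch of a dice dissimilarity on character n-grams (not textTinyR code).

# hypothetical helper: unique character n-grams of a single string
char_ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

# hypothetical helper: 1 - dice coefficient, so identical strings give 0
dice_dissimilarity <- function(a, b, n = 2) {
  x <- char_ngrams(a, n)
  y <- char_ngrams(b, n)
  if (length(x) + length(y) == 0) return(0)
  1 - 2 * length(intersect(x, y)) / (length(x) + length(y))
}

# "night" and "nacht" share only the bigram "ht" out of 4 + 4 bigrams
d <- dice_dissimilarity("night", "nacht")
```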

The look_up_table function returns a look-up list where the list names are the n-grams and the list vectors are the words associated with those n-grams. The words for each n-gram can be retrieved using the print_words_lookup_tbl function. The input can be a character string vector only.
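The structure of such a look-up list can be sketched in base R by inverting a word-to-n-grams mapping. The sketch assumes that "associated" means "the word contains the character n-gram"; the toy words and the variable names are illustrative only.

```r
# Base-R sketch: list whose names are character n-grams and whose values are
# the words containing them (assumed meaning of "associated").
words <- c("one", "tone", "two")
n <- 2

# unique character n-grams of every word
grams_per_word <- lapply(words, function(w) {
  starts <- seq_len(max(nchar(w) - n + 1, 0))
  unique(substring(w, starts, starts + n - 1))
})

# invert the mapping: repeat each word once per n-gram, then group by n-gram
lookup <- split(rep(words, lengths(grams_per_word)), unlist(grams_per_word))
```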

Methods

token_stats$new(x_vec = NULL, path_2folder = NULL, path_2file = NULL, file_delimiter = "\n", n_gram_delimiter = "_")

--------------

path_2vector()

--------------

freq_distribution()

--------------

print_frequency(subset = NULL)

--------------

count_character()

--------------

print_count_character(number = NULL)

--------------

collocation_words()

--------------

print_collocations(word = NULL)

--------------

string_dissimilarity_matrix(dice_n_gram = 2, method = "dice", split_separator = " ", dice_thresh = 1.0, upper = TRUE, diagonal = TRUE, threads = 1)

--------------

look_up_table(n_grams = NULL)

--------------

print_words_lookup_tbl(n_gram = NULL)

Methods

Public methods


Method new()

Usage

token_stats$new(
  x_vec = NULL,
  path_2folder = NULL,
  path_2file = NULL,
  file_delimiter = "\n",
  n_gram_delimiter = "_"
)

Arguments

x_vec

either NULL or a character string vector

path_2folder

either NULL or a valid path to a folder (each file in the folder should include words separated by a delimiter)

path_2file

either NULL or a valid path to a file

file_delimiter

either NULL or a character string specifying the file delimiter

n_gram_delimiter

either NULL or a character string specifying the n-gram delimiter. It is used in the collocation_words function


Method path_2vector()

Usage

token_stats$path_2vector()


Method freq_distribution()

Usage

token_stats$freq_distribution()


Method print_frequency()

Usage

token_stats$print_frequency(subset = NULL)

Arguments

subset

either NULL or a vector specifying the subset of rows of the frequency distribution to return from the print_frequency function


Method count_character()

Usage

token_stats$count_character()


Method print_count_character()

Usage

token_stats$print_count_character(number = NULL)

Arguments

number

a numeric value for the print_count_character function. All words with number of characters equal to the number parameter will be returned.


Method collocation_words()

Usage

token_stats$collocation_words()


Method print_collocations()

Usage

token_stats$print_collocations(word = NULL)

Arguments

word

a character string specifying the word to look up in the print_collocations function


Method string_dissimilarity_matrix()

Usage

token_stats$string_dissimilarity_matrix(
  dice_n_gram = 2,
  method = "dice",
  split_separator = " ",
  dice_thresh = 1,
  upper = TRUE,
  diagonal = TRUE,
  threads = 1
)

Arguments

dice_n_gram

a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix function

method

a character string specifying the method to use in the string_dissimilarity_matrix function. One of dice, levenshtein or cosine.

split_separator

a character string specifying the separator used to split each string if method equals cosine in the string_dissimilarity_matrix function. The cosine method operates on sentences, so for the sentence "this_is_a_word_sentence" the split_separator should be "_"
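To show why the separator matters for the cosine method, here is a base-R sketch that splits two sentences on "_" and compares their word-count vectors. The helper cosine_dissimilarity is a hypothetical name and the term-count weighting is an assumption; the package's own computation may differ.

```r
# Base-R sketch of a cosine dissimilarity between two delimited sentences.
# hypothetical helper, not part of textTinyR
cosine_dissimilarity <- function(a, b, split_separator = " ") {
  wa <- strsplit(a, split_separator, fixed = TRUE)[[1]]
  wb <- strsplit(b, split_separator, fixed = TRUE)[[1]]
  vocab <- union(wa, wb)
  # word counts of each sentence over the shared vocabulary
  va <- as.numeric(table(factor(wa, levels = vocab)))
  vb <- as.numeric(table(factor(wb, levels = vocab)))
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

# three of the four words in each sentence are shared
d_cos <- cosine_dissimilarity("this_is_a_sentence", "this_is_a_word", "_")
```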

dice_thresh

a floating-point number used to threshold the data if method is dice in the string_dissimilarity_matrix function. It takes values between 0.0 and 1.0. The closer dice_thresh is to 0.0, the more values of the dissimilarity matrix will take the value of 1.0.

upper

either TRUE or FALSE. If TRUE then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the upper part will be filled with NA's

diagonal

either TRUE or FALSE. If TRUE then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix function will be shown. Otherwise the diagonal will be filled with NA's
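The effect of setting upper and diagonal to FALSE can be pictured with a small base-R sketch that masks a symmetric matrix with NA values. The matrix entries are invented for the example and the sketch does not call string_dissimilarity_matrix.

```r
# Base-R sketch: mask the upper triangle and the diagonal of a symmetric matrix.
m <- matrix(c(0.0, 0.2, 0.5,
              0.2, 0.0, 0.3,
              0.5, 0.3, 0.0), nrow = 3, byrow = TRUE)

m[upper.tri(m)] <- NA  # comparable to upper = FALSE
diag(m) <- NA          # comparable to diagonal = FALSE
```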

threads

a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix function


Method look_up_table()

Usage

token_stats$look_up_table(n_grams = NULL)

Arguments

n_grams

a numeric value specifying the n-grams in the look_up_table function


Method print_words_lookup_tbl()

Usage

token_stats$print_words_lookup_tbl(n_gram = NULL)

Arguments

n_gram

a character string specifying the n-gram to use in the print_words_lookup_tbl function


Method clone()

The objects of this class are cloneable with this method.

Usage

token_stats$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples



library(textTinyR)

expl <- c('one_word_token', 'two_words_token', 'three_words_token', 'four_words_token')

tk <- token_stats$new(x_vec = expl, path_2folder = NULL, path_2file = NULL)

#-------------------------
# frequency distribution:
#-------------------------

tk$freq_distribution()

# tk$print_frequency()


#------------------
# count characters:
#------------------

cnt <- tk$count_character()

# tk$print_count_character(number = 4)


#----------------------
# collocation of words:
#----------------------

col <- tk$collocation_words()

# tk$print_collocations(word = 'five')


#-----------------------------
# string dissimilarity matrix:
#-----------------------------

dism <- tk$string_dissimilarity_matrix(method = 'levenshtein')


#------------------------
# build a look-up-table:
#------------------------

lut <- tk$look_up_table(n_grams = 3)

# tk$print_words_lookup_tbl(n_gram = 'e_w')