Term matrices and statistics (document-term-matrix, term-document-matrix)

# utl <- sparse_term_matrix$new(vector_data = NULL, file_data = NULL,
#                               document_term_matrix = TRUE)

Details

The Term_Matrix function takes either a character vector of strings or a text file and, after tokenization and transformation, returns either a document-term-matrix or a term-document-matrix.

The triplet_data function returns the triplet data, which is used internally (in C++) to construct the term matrix. The triplet data can also be useful for secondary purposes, such as word vector representations.
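As an illustration of what a triplet representation is, the following sketch builds a sparse matrix from (row, column, value) triplets with the Matrix package. It is not textTinyR's internal code: the vectors doc_idx, term_idx and counts and the term names are hypothetical, and the actual structure returned by triplet_data may differ.

```r
library(Matrix)

# Hypothetical triplets: document index, term index, count (all 1-based)
doc_idx  <- c(1, 1, 2, 3, 3)
term_idx <- c(1, 2, 2, 1, 3)
counts   <- c(2, 1, 3, 1, 4)
terms    <- c("matrix", "sparse", "term")

# Assemble a 3 x 3 document-term matrix; only the non-zero cells are stored
dtm <- sparseMatrix(i = doc_idx, j = term_idx, x = counts,
                    dims = c(3, length(terms)),
                    dimnames = list(NULL, terms))

# Transposing gives the corresponding term-document matrix
tdm <- t(dtm)
```

Because the triplets list only the non-zero cells, the same three vectors can be reused by other tools without materializing the full matrix.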

The global_term_weights function returns a list of length two. The first sublist contains the terms and the second sublist the global term weights. The tf_idf parameter of Term_Matrix should be set to FALSE and the normalize parameter to NULL. This function is normally used in conjunction with word-vector embeddings.

The Term_Matrix_Adjust function removes sparse terms from a sparse matrix using a sparsity threshold.
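The idea of a sparsity threshold can be sketched in base R on a small dense matrix. This assumes that a term's sparsity is the share of documents in which it does not occur and that terms below the threshold are kept; the exact rule used by Term_Matrix_Adjust is not stated here, so treat the comparison as an assumption.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns)
dtm <- matrix(c(1, 0, 1,
                2, 0, 1,
                1, 0, 0,
                3, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("common", "rare", "medium")))

# Sparsity of a term = share of documents in which it does not occur
term_sparsity <- colMeans(dtm == 0)

# Keep the terms whose sparsity is below the chosen threshold (assumed rule)
sparsity_thresh <- 0.6
dtm_adjusted <- dtm[, term_sparsity < sparsity_thresh, drop = FALSE]
```

With these toy counts the term "rare" is absent from three of the four documents and is dropped, while the other two columns remain.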

The term_associations function finds the associations between the given terms (Terms argument) and all other terms in the corpus by calculating their correlation. The keep_terms parameter optionally limits the output table to a specific number of terms.
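A correlation-based association can be illustrated in base R as follows. The sketch uses Pearson correlation on a small dense matrix with made-up counts; it is an assumption that this matches the correlation computed by term_associations.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns)
dtm <- matrix(c(2, 1, 0,
                0, 0, 3,
                4, 2, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("word", "sentence", "token")))

target <- "word"
others <- setdiff(colnames(dtm), target)

# Pearson correlation of the target term's column with every other term
assoc <- cor(dtm[, target], dtm[, others, drop = FALSE])[1, ]

# Sort by decreasing correlation and keep the strongest associations
keep_terms <- 1
top_assoc <- head(sort(assoc, decreasing = TRUE), keep_terms)
```

Here "sentence" tends to rise and fall with "word" across documents, so it is the term retained when keep_terms is 1.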

The most_frequent_terms function returns the most frequent terms of the corpus using the output of the sparse matrix. The keep_terms parameter optionally limits the output table to a specific number of terms.
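Conceptually, term frequencies across the corpus are the column sums of a document-term matrix of raw counts. The base R sketch below shows that idea on made-up counts; the column names of the resulting table are hypothetical and not necessarily those returned by most_frequent_terms.

```r
# Toy document-term counts: 3 documents (rows) x 3 terms (columns)
dtm <- matrix(c(2, 1, 0,
                0, 0, 3,
                4, 2, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("word", "sentence", "token")))

# Corpus-wide frequency of each term = column sum of the count matrix
freq <- sort(colSums(dtm), decreasing = TRUE)

# Keep only the first keep_terms rows of the sorted table
keep_terms <- 2
top_terms <- data.frame(term = names(freq),
                        frequency = unname(freq))[seq_len(keep_terms), ]
```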

Stemming for the English language is done using the porter2-stemmer; for details see https://github.com/smassung/porter2_stemmer

Methods

sparse_term_matrix$new(vector_data = NULL, file_data = NULL, document_term_matrix = TRUE)
--------------
Term_Matrix(sort_terms = FALSE, to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " .,;:()?!", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", print_every_rows = 1000, normalize = NULL, tf_idf = FALSE, threads = 1, verbose = FALSE)
--------------
triplet_data()
--------------
global_term_weights()
--------------
Term_Matrix_Adjust(sparsity_thresh = 1.0)
--------------
term_associations(Terms = NULL, keep_terms = NULL, verbose = FALSE)
--------------
most_frequent_terms(keep_terms = NULL, threads = 1, verbose = FALSE)

Methods

Public methods

sparse_term_matrix$new()
sparse_term_matrix$Term_Matrix()
sparse_term_matrix$triplet_data()
sparse_term_matrix$global_term_weights()
sparse_term_matrix$Term_Matrix_Adjust()
sparse_term_matrix$term_associations()
sparse_term_matrix$most_frequent_terms()
sparse_term_matrix$clone()

Method `new()`

Usage

sparse_term_matrix$new(
  vector_data = NULL,
  file_data = NULL,
  document_term_matrix = TRUE
)

Arguments

vector_data: either NULL or a character vector of documents
file_data: either NULL or a valid character path to a text file
document_term_matrix: either TRUE or FALSE. If TRUE then a document-term-matrix will be returned, otherwise a term-document-matrix
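
When the documents are already in memory, vector_data can be used instead of file_data. The following is a minimal sketch that assumes textTinyR is installed; the three documents are made up, and only arguments listed in the usage above are passed.

```r
library(textTinyR)

# Made-up corpus: one element per document
docs <- c("the first document is short",
          "the second document is a bit longer",
          "a third and final document")

# vector_data holds the documents; file_data stays NULL
sm <- sparse_term_matrix$new(vector_data = docs, file_data = NULL,
                             document_term_matrix = TRUE)

# Lower-case, trim and split each document on the default separators
dtm <- sm$Term_Matrix(sort_terms = TRUE, to_lower = TRUE,
                      trim_token = TRUE, split_string = TRUE)

dim(dtm)  # with document_term_matrix = TRUE, documents are expected in rows
```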

Method `Term_Matrix()`

Usage

sparse_term_matrix$Term_Matrix(
  sort_terms = FALSE,
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  print_every_rows = 1000,
  normalize = NULL,
  tf_idf = FALSE,
  threads = 1,
  verbose = FALSE
)

Arguments

sort_terms: either TRUE or FALSE specifying if the initial terms should be sorted (so that the output sparse matrix is sorted in alphabetical order)
to_lower: either TRUE or FALSE. If TRUE the character string will be converted to lower case
to_upper: either TRUE or FALSE. If TRUE the character string will be converted to upper case
utf_locale: the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.
remove_char: a string specifying the specific characters that should be removed from a text file. If remove_char is "" then no removal of characters takes place
remove_punctuation_string: either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)
remove_punctuation_vector: either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)
remove_numbers: either TRUE or FALSE. If TRUE then any numbers in the character string will be removed
trim_token: either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)
split_string: either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.
split_separator: a character string specifying the character delimiter(s)
remove_stopwords: either TRUE, FALSE or a character vector of user defined stop words. If TRUE then the stop words vector corresponding to the language parameter will be loaded.
language: a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu
min_num_char: an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1 then only character strings that meet this minimum length will be returned
max_num_char: an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)
stemmer: a character string specifying the stemming method. Available method is the porter2_stemmer. See details for more information.
min_n_gram: an integer specifying the minimum number of n-grams. The minimum value of min_n_gram is 1.
max_n_gram: an integer specifying the maximum number of n-grams. The minimum value of max_n_gram is 1.
skip_n_gram: an integer specifying the number of skip-n-grams. The minimum value of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.
skip_distance: an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.
n_gram_delimiter: a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
print_every_rows: a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function in case of big files.
normalize: either NULL or one of 'l1' or 'l2' normalization.
tf_idf: either TRUE or FALSE. If TRUE then the term-frequency-inverse-document-frequency will be returned
threads: an integer specifying the number of cores to run in parallel
verbose: either TRUE or FALSE. If TRUE then information will be printed out
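
The 'l1' and 'l2' options of the normalize parameter refer to the standard vector norms. The base R sketch below shows both on a small count matrix; it assumes that normalization is applied per row (per document), which is not stated above, and it is not the package's own implementation.

```r
# Toy counts: 2 documents (rows) x 3 terms (columns)
counts <- matrix(c(3, 1, 0,
                   0, 2, 2),
                 nrow = 2, byrow = TRUE)

# l1: divide each row by the sum of its absolute values
l1 <- counts / rowSums(abs(counts))

# l2: divide each row by its Euclidean length
l2 <- counts / sqrt(rowSums(counts^2))
```

After l1 normalization each row sums to 1; after l2 normalization each row has unit Euclidean length.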

Method `triplet_data()`

Usage

sparse_term_matrix$triplet_data()

Method `global_term_weights()`

Usage

sparse_term_matrix$global_term_weights()

Method `Term_Matrix_Adjust()`

Usage

sparse_term_matrix$Term_Matrix_Adjust(sparsity_thresh = 1)

Arguments

sparsity_thresh: a float number between 0.0 and 1.0 specifying the sparsity threshold in the Term_Matrix_Adjust function

Method `term_associations()`

Usage

sparse_term_matrix$term_associations(
  Terms = NULL,
  keep_terms = NULL,
  verbose = FALSE
)

Arguments

Terms: a character vector specifying the character strings for which the associations will be calculated (term_associations function)
keep_terms: either NULL or a numeric value specifying the number of terms to keep (both in term_associations and most_frequent_terms functions)
verbose: either TRUE or FALSE. If TRUE then information will be printed out

Method `most_frequent_terms()`

Usage

sparse_term_matrix$most_frequent_terms(
  keep_terms = NULL,
  threads = 1,
  verbose = FALSE
)

Arguments

keep_terms: either NULL or a numeric value specifying the number of terms to keep (both in term_associations and most_frequent_terms functions)
threads: an integer specifying the number of cores to run in parallel
verbose: either TRUE or FALSE. If TRUE then information will be printed out

Method `clone()`

The objects of this class are cloneable with this method.

Usage

sparse_term_matrix$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples


library(textTinyR)


# sm <- sparse_term_matrix$new(file_data = "/folder/my_data.txt",
#                              document_term_matrix = TRUE)

#--------------
# term matrix :
#--------------

# sm$Term_Matrix(sort_terms = TRUE, to_lower = TRUE,
#                trim_token = TRUE, split_string = TRUE,
#                remove_stopwords = TRUE, normalize = 'l1',
#                stemmer = 'porter2_stemmer', threads = 1)

#---------------
# triplet data :
#---------------

# sm$triplet_data()


#----------------------
# global-term-weights :
#----------------------

# sm$global_term_weights()


#-------------------------
# removal of sparse terms:
#-------------------------

# sm$Term_Matrix_Adjust(sparsity_thresh = 0.995)


#-----------------------------------------------
# associations between terms of a sparse matrix:
#-----------------------------------------------


# sm$term_associations(Terms = c("word", "sentence"), keep_terms = 10)


#---------------------------------------------
# most frequent terms using the sparse matrix:
#---------------------------------------------


# sm$most_frequent_terms(keep_terms = 10, threads = 1)
