Conversion of text documents to word-vector-representation features ( Doc2Vec )

# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)

Value

a matrix

Details

The pre_processed_wv method should be called after the Doc2Vec class has been initialized, and only if the copy_data parameter is set to TRUE. It allows the pre-processed word vectors to be inspected.

The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct global_term_weights can be obtained by using the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method equals "idf" then the global_term_weights parameter must not be NULL.
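A minimal sketch of how the global term weights might be computed and passed to the idf method. It reuses the tokens and the word-vector file from the Examples section; joining the tokens back into one string per document and relying on the default values of the remaining Term_Matrix arguments are assumptions made for illustration.

```r
library(textTinyR)

tok_text <- list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
PATH <- system.file("example_files", "word_vecs.txt", package = "textTinyR")

# one character string per document, used as input to sparse_term_matrix
docs <- sapply(tok_text, paste, collapse = " ")

sm <- sparse_term_matrix$new(vector_data = docs, file_data = NULL,
                             document_term_matrix = TRUE)

# tf_idf = FALSE and normalize = NULL, as the details above require
tm <- sm$Term_Matrix(split_string = TRUE, tf_idf = FALSE, normalize = NULL)

gl_term_w <- sm$global_term_weights()

# pass the weights to the idf method of Doc2Vec
init <- Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)
out_idf <- init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w)
```

The Term_Matrix method has to be called before global_term_weights, because the weights are derived from the term matrix built in that step.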

Explanation of the various methods :

sum_sqrt: For a single sublist of the token list: the word vectors of all tokens in the sublist are accumulated into one vector whose length equals the length of a word vector (INITIAL_WORD_VECTOR). A scalar is then computed from INITIAL_WORD_VECTOR: its elements are raised to the power of 2.0, the resulting values are summed, and the square root of the sum is taken. Finally, INITIAL_WORD_VECTOR is divided by this scalar.
min_max_norm: For a single sublist of the token list: the word vector of each token in the sublist is first min-max normalized, and the normalized vectors are then accumulated into one vector whose length equals the length of the initial word vector.
idf: For a single sublist of the token list: the word vector of each term in the sublist is multiplied by the idf value of that term taken from the global term weights.
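The arithmetic of the three methods can be sketched in base R for a single sublist of tokens. This is an illustration of the descriptions above, not the package's implementation: the word vectors and idf values are made-up numbers, "accumulated" is read as an element-wise sum, min-max normalization is assumed to be applied to each word vector over its own elements, and summing the idf-weighted vectors is likewise an assumption.

```r
# Toy word vectors (hypothetical values): one row per token, 3 dimensions each
wv <- rbind(the    = c( 0.2, -0.4,  0.1),
            result = c( 0.5,  0.3, -0.2),
            of     = c(-0.1,  0.6,  0.4))

# sum_sqrt: accumulate the word vectors, then divide by sqrt(sum(x^2))
initial_word_vector <- colSums(wv)
sum_sqrt <- initial_word_vector / sqrt(sum(initial_word_vector^2))

# min_max_norm: min-max normalize each word vector, then accumulate
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max_norm <- colSums(t(apply(wv, 1, min_max)))

# idf: multiply each word vector by the idf of its term (hypothetical weights);
# accumulating the weighted vectors afterwards is an assumption
idf <- c(the = 0.1, result = 1.2, of = 0.3)
idf_doc <- colSums(wv * idf[rownames(wv)])
```

Under these assumptions the sum_sqrt result always has unit Euclidean length, which makes documents with different numbers of tokens comparable.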

Methods

Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()

Public methods

Method `new()`

Usage

Doc2Vec$new(
  token_list = NULL,
  word_vector_FILE = NULL,
  print_every_rows = 10000,
  verbose = FALSE,
  copy_data = FALSE
)

Arguments

token_list: either NULL or a list of tokenized text documents
word_vector_FILE: a valid path to a text file, where the word-vectors are saved
print_every_rows: a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function, especially for big files.
verbose: either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data: either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.

Method `doc2vec_methods()`

Usage

Doc2Vec$doc2vec_methods(
  method = "sum_sqrt",
  global_term_weights = NULL,
  threads = 1
)

Arguments

method: a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.
global_term_weights: either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads: a numeric value specifying the number of cores to run in parallel

Method `pre_processed_wv()`

Usage

Doc2Vec$pre_processed_wv()
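A short sketch of inspecting the pre-processed word vectors, assuming the tokens and the word-vector file from the Examples section. The class is initialized with copy_data = TRUE, as the Details section requires for this method; the structure of the returned object is not specified here, so str() is used only to look at it.

```r
library(textTinyR)

tok_text <- list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
PATH <- system.file("example_files", "word_vecs.txt", package = "textTinyR")

# copy_data = TRUE so that the pre-processed word vectors can be inspected
init <- Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH,
                    copy_data = TRUE)

res_wv <- init$pre_processed_wv()
str(res_wv)
```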

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Doc2Vec$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples


library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")


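#----------------------------------------------
# initialize the class with tokens and vectors
#----------------------------------------------

init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)

#---------------------------------------------------
# compute the document features (returns a matrix)
#---------------------------------------------------

out = init$doc2vec_methods(method = "sum_sqrt")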
