R/utils.R
Doc2Vec.Rd
Conversion of text documents to word-vector-representation features (Doc2Vec)
Conversion of text documents to word-vector-representation features (Doc2Vec)
# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)
a matrix
The pre_processed_wv method should be used after the initialization of the Doc2Vec class, and only if the copy_data parameter is set to TRUE, in order to inspect the pre-processed word-vectors.
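For instance, a minimal sketch of this workflow, re-using the example word-vectors file that ships with the package (the object names are illustrative):

library(textTinyR)

tok_text = list(c('the', 'result', 'of'),
                c('doc2vec', 'are', 'vector', 'features'))

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

init_cp = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH,
                      copy_data = TRUE)        # copy_data = TRUE keeps the data for inspection

wv_copy = init_cp$pre_processed_wv()           # returns the pre-processed word-vectors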
The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct input for the global_term_weights parameter can be obtained by using the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method equals "idf", then the global_term_weights parameter must not be NULL.
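A rough sketch of this workflow (the sparse_term_matrix initialization and the Term_Matrix arguments shown here are assumptions; consult the sparse_term_matrix documentation for the full set of parameters):

library(textTinyR)

txt = c('the result of', 'doc2vec are vector features')

smt = sparse_term_matrix$new(vector_data = txt, file_data = NULL,
                             document_term_matrix = TRUE)

tm = smt$Term_Matrix(tf_idf = FALSE, normalize = NULL)   # tf_idf = FALSE, normalize = NULL

gl_term_w = smt$global_term_weights()                    # input for the 'idf' method

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

init = Doc2Vec$new(token_list = strsplit(txt, " "), word_vector_FILE = PATH)

out_idf = init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w)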
Explanation of the various methods:
sum_sqrt: Assuming that a single sublist of the token list is taken into consideration, the word-vectors of each word of the sublist of tokens are accumulated into a vector equal in length to the word-vector (INITIAL_WORD_VECTOR). Then a scalar is computed from this INITIAL_WORD_VECTOR in the following way: the INITIAL_WORD_VECTOR is raised to the power of 2.0, the resulting word-vector is summed, and the square-root of the sum is taken. Finally, the INITIAL_WORD_VECTOR is divided by the resulting scalar.
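For illustration, the computation for a single sublist could be sketched in base R as follows (the function name and the 'wv' input, a matrix holding one word-vector per token, are illustrative and not part of the package API):

sum_sqrt_single = function(wv) {               # 'wv': one word-vector per row (one row per token)
  initial_word_vector = colSums(wv)            # accumulate the word-vectors
  scalar = sqrt(sum(initial_word_vector ^ 2))  # raise to the power of 2.0, sum, take the square-root
  initial_word_vector / scalar                 # divide by the resulting scalar
}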
min_max_norm: Assuming that a single sublist of the token list is taken into consideration, the word-vectors of each word of the sublist of tokens are first min-max normalized and then accumulated into a vector equal in length to the initial word-vector.
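A corresponding sketch, assuming the min-max normalization is applied to each word-vector separately (names are illustrative):

min_max_norm_single = function(wv) {           # 'wv': one word-vector per row (one row per token)
  normed = apply(wv, 1, function(v) (v - min(v)) / (max(v) - min(v)))
  rowSums(normed)                              # 'apply' returns one column per token, so accumulate row-wise
}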
idf: Assuming that a single sublist of the token list is taken into consideration, the word-vector of each term in the sublist is multiplied by the corresponding idf value of the global term weights.
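Sketched in the same style, assuming 'idf_w' is a named numeric vector of global idf weights and that the weighted word-vectors are then accumulated as in the other methods (names are illustrative):

idf_single = function(tokens, wv, idf_w) {     # 'idf_w': named numeric vector of idf weights
  weighted = wv * idf_w[tokens]                # multiply each word-vector by its term's idf
  colSums(weighted)                            # accumulate the weighted word-vectors
}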
Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()
new()
Doc2Vec$new(
  token_list = NULL,
  word_vector_FILE = NULL,
  print_every_rows = 10000,
  verbose = FALSE,
  copy_data = FALSE
)
token_list
either NULL or a list of tokenized text documents
word_vector_FILE
a valid path to a text file, where the word-vectors are saved
print_every_rows
a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function, especially for big files.
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data
either TRUE or FALSE. If FALSE, then a pointer will be created and no copy of the initial data takes place (memory efficient, especially for big datasets). This is an alternative way to pre-process the data.
doc2vec_methods()
Doc2Vec$doc2vec_methods(
  method = "sum_sqrt",
  global_term_weights = NULL,
  threads = 1
)
method
a character string specifying the method to use. One of 'sum_sqrt', 'min_max_norm' or 'idf'. See the details section for more information.
global_term_weights
either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads
a numeric value specifying the number of cores to run in parallel
pre_processed_wv()
Doc2Vec$pre_processed_wv()
clone()
The objects of this class are cloneable with this method.
Doc2Vec$clone(deep = FALSE)
deep
Whether to make a deep clone.
library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'),
                c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")

init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)

out = init$doc2vec_methods(method = "sum_sqrt")