Conversion of text documents to word-vector-representation features ( Doc2Vec )

# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)

Value

a matrix

Details

The pre_processed_wv method should be called after the Doc2Vec class has been initialized with the copy_data parameter set to TRUE, in order to inspect the pre-processed word vectors.

The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. To obtain the correct global term weights, use the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method is "idf", then the global_term_weights parameter must not be NULL.

Explanation of the various methods :

sum_sqrt

Consider a single sublist of the token list: the word vectors of all words in the sublist are accumulated into one vector whose length equals the length of a word vector (INITIAL_WORD_VECTOR). A scalar is then computed from this INITIAL_WORD_VECTOR: its elements are raised to the power of 2.0, the resulting values are summed, and the square root of the sum is taken. Finally, the INITIAL_WORD_VECTOR is divided by this scalar.
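The sum_sqrt steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy word_vecs matrix and the helper sum_sqrt_doc are hypothetical, "accumulated" is read as element-wise summation, and tokens without a word vector are simply skipped.

```r
# Hypothetical toy word vectors (3 dimensions), one row per word.
word_vecs <- rbind(
  the    = c(1, 0, 2),
  result = c(0, 2, 1),
  of     = c(2, 1, 0)
)

# Illustrative helper (not part of textTinyR): one sublist of tokens -> one vector.
sum_sqrt_doc <- function(tokens, wv) {
  # Assumption: tokens that have no word vector are ignored.
  known <- tokens[tokens %in% rownames(wv)]
  # Accumulate the word vectors (INITIAL_WORD_VECTOR).
  initial_word_vector <- colSums(wv[known, , drop = FALSE])
  # Square the elements, sum them, take the square root.
  scalar <- sqrt(sum(initial_word_vector^2.0))
  # Divide the accumulated vector by the scalar.
  initial_word_vector / scalar
}

doc_vec <- sum_sqrt_doc(c("the", "result", "of"), word_vecs)
print(doc_vec)
```

Under these assumptions the returned vector has unit Euclidean length, because the scalar is the Euclidean norm of the accumulated vector.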

min_max_norm

Consider a single sublist of the token list: the word vector of each word in the sublist is first min-max normalized, and the normalized vectors are then accumulated into one vector whose length equals the length of the initial word vector.
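A minimal base R sketch of the min_max_norm description follows. It is an illustration only: the toy word_vecs matrix and the helpers min_max and min_max_doc are hypothetical, the normalization is assumed to rescale each word vector to the range [0, 1] over its own elements, and "accumulated" is read as element-wise summation.

```r
# Hypothetical toy word vectors (3 dimensions), one row per word.
word_vecs <- rbind(
  the    = c(1, 0, 3),
  result = c(0, 2, 1)
)

# Assumption: min-max normalization rescales a single vector to [0, 1].
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

# Illustrative helper (not part of textTinyR): one sublist of tokens -> one vector.
min_max_doc <- function(tokens, wv) {
  # Assumption: tokens that have no word vector are ignored.
  known <- tokens[tokens %in% rownames(wv)]
  acc <- numeric(ncol(wv))
  for (tok in known) {
    # Normalize each word vector first, then accumulate it.
    acc <- acc + min_max(wv[tok, ])
  }
  acc
}

doc_vec <- min_max_doc(c("the", "result"), word_vecs)
print(doc_vec)
```

Note that this sketch would produce NaN for a constant word vector (max equal to min); how the package handles that case is not stated in this page.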

idf

Consider a single sublist of the token list: the word vector of each term in the sublist is multiplied by the corresponding idf value of the global term weights.
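The idf weighting can be sketched in base R as follows. This is a hedged illustration, not the package's implementation: the toy word_vecs matrix, the named idf_weights vector (a stand-in for the output of global_term_weights, whose actual structure is not shown here) and the helper idf_doc are hypothetical, and the final summation of the weighted vectors into one document vector is an assumption.

```r
# Hypothetical toy word vectors (3 dimensions), one row per word.
word_vecs <- rbind(
  the    = c(1, 0, 2),
  result = c(0, 2, 1)
)

# Hypothetical idf values, stored as a named vector for illustration only.
idf_weights <- c(the = 0.5, result = 2)

# Illustrative helper (not part of textTinyR): one sublist of tokens -> one vector.
idf_doc <- function(tokens, wv, weights) {
  # Assumption: only tokens with both a word vector and an idf weight are used.
  known <- tokens[tokens %in% rownames(wv) & tokens %in% names(weights)]
  # Multiply each word vector (row) by the idf of its term.
  weighted <- wv[known, , drop = FALSE] * weights[known]
  # Assumption: the weighted vectors are summed into one document vector.
  colSums(weighted)
}

doc_vec <- idf_doc(c("the", "result"), word_vecs, idf_weights)
print(doc_vec)
```

With these toy values, a term with a high idf (here "result") contributes more to the document vector than a frequent term with a low idf (here "the").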

The output of each method might differ slightly depending on whether the copy_data parameter is set to TRUE or FALSE.

Methods

Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)

--------------

doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)

--------------

pre_processed_wv()

Public methods


Method new()

Usage

Doc2Vec$new(
  token_list = NULL,
  word_vector_FILE = NULL,
  print_every_rows = 10000,
  verbose = FALSE,
  copy_data = FALSE
)

Arguments

token_list

either NULL or a list of tokenized text documents

word_vector_FILE

a valid path to a text file, where the word-vectors are saved

print_every_rows

a numeric value greater than 1 specifying the print interval. Frequent output in the R session can slow down the function, especially for big files.

verbose

either TRUE or FALSE. If TRUE then information will be printed out in the R session.

copy_data

either TRUE or FALSE. If FALSE, a pointer is created and the initial data is not copied (memory efficient, especially for big datasets). This is an alternative way to pre-process the data.


Method doc2vec_methods()

Usage

Doc2Vec$doc2vec_methods(
  method = "sum_sqrt",
  global_term_weights = NULL,
  threads = 1
)

Arguments

method

a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.

global_term_weights

either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.

threads

a numeric value specifying the number of cores to run in parallel


Method pre_processed_wv()

Usage

Doc2Vec$pre_processed_wv()


Method clone()

The objects of this class are cloneable with this method.

Usage

Doc2Vec$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples


library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")


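# initialize the class with the tokenized text and the word-vector file
init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)

# compute the document features (a matrix) using the 'sum_sqrt' method
out = init$doc2vec_methods(method = "sum_sqrt")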