cooccurrence statistics
cooccurrence_statistics(train_data = NULL, vocab_input = NULL, output_cooccurences = NULL, symmetric_both = TRUE, context_words = 15, memory_gb = 4, MAX_product = 0, overflowLength = 0, trace = FALSE)
Argument | Description |
---|---|
train_data | a character string specifying the path to the train text file |
vocab_input | a character string specifying the path to the vocabulary text file (the output file of the vocabulary_counts function) |
output_cooccurences | a character string specifying the path to the output cooccurrences text file |
symmetric_both | either TRUE or FALSE. If TRUE, both the left and the right context are used; otherwise only the left context is used (see the .pdf file in the references for more information) |
context_words | the number of context words to the left (and to the right, if symmetric_both = TRUE). The default is 15 |
memory_gb | a numeric value specifying the limit for memory consumption, in GB. The limit is based on a simple heuristic, so it is not exact. The default is 4.0 |
MAX_product | an integer that limits the size of the dense cooccurrence array by capping the product of the frequency counts of the two cooccurring words. Typically only needs adjustment for very large corpora |
overflowLength | an integer that limits the length of the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array before it is written to disk. Typically only needs adjustment for very large corpora |
trace | either TRUE or FALSE. If TRUE, information will be printed out |
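
To make the effect of `symmetric_both` and `context_words` concrete, here is a minimal base-R sketch of window-based cooccurrence counting. It is an illustration only and not the code GloveR runs: the helper name `window_cooccurrences` is hypothetical, and the `1 / distance` weighting follows the description in the GloVe paper listed in the references, on the assumption that the package behaves the same way.

```r
# Hypothetical helper (NOT part of GloveR): accumulate weighted cooccurrence
# counts for every token and the tokens that precede it inside the window.
window_cooccurrences <- function(tokens, context_words = 15, symmetric_both = TRUE) {
  counts <- list()

  # Add `weight` to the entry for the ordered pair (w1, w2).
  add <- function(w1, w2, weight) {
    key <- paste(w1, w2, sep = " ")
    prev <- counts[[key]]
    counts[[key]] <<- (if (is.null(prev)) 0 else prev) + weight
  }

  for (i in seq_along(tokens)) {
    # Look back over at most `context_words` preceding tokens.
    for (d in seq_len(min(context_words, i - 1))) {
      left <- tokens[i - d]
      # Assumed weighting: pairs that are d tokens apart contribute 1 / d.
      add(tokens[i], left, 1 / d)
      # The mirrored entry records the same pair as right context.
      if (symmetric_both) add(left, tokens[i], 1 / d)
    }
  }

  unlist(counts)
}

tokens <- c("a", "b", "c")
both <- window_cooccurrences(tokens, context_words = 2, symmetric_both = TRUE)
left_only <- window_cooccurrences(tokens, context_words = 2, symmetric_both = FALSE)
```

With the three-token input above, the symmetric call produces six ordered pair entries and the left-only call produces three, because each pair is then stored in one direction only.
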
a character string specifying the location of the saved data
https://github.com/stanfordnlp/GloVe
http://nlp.stanford.edu/projects/glove/
http://nlp.stanford.edu/pubs/glove.pdf
# library(GloveR)
# co_mat = cooccurrence_statistics(train_data = '/data_GloveR/dat.txt', vocab_input = '/data_GloveR/VOCAB.txt',
#                                  output_cooccurences = '/data_GloveR/COOCUR.bin', symmetric_both = TRUE, context_words = 15,
#                                  memory_gb = 4.0, MAX_product = 0, overflowLength = 0, trace = TRUE)