cooccurrence statistics

cooccurrence_statistics(train_data = NULL, vocab_input = NULL,
  output_cooccurences = NULL, symmetric_both = TRUE,
  context_words = 15, memory_gb = 4, MAX_product = 0,
  overflowLength = 0, trace = FALSE)

Arguments

train_data

a character string specifying the path to the training text file

vocab_input

a character string specifying the path to the vocabulary text file (the output file of the vocabulary_counts function)

output_cooccurences

a character string specifying the path to the output cooccurrence text file

symmetric_both

either TRUE or FALSE. If TRUE, both the left and right context are used; otherwise only the left context is used (see the .pdf file in References for more information)

context_words

the number of context words to the left (and to the right, if symmetric_both = TRUE). The default is 15
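To make the interaction between context_words and symmetric_both concrete, here is a minimal toy sketch in plain R. It is not the package's implementation: the helper name toy_cooccurrence is hypothetical, and the 1/d weighting of a pair that is d words apart is an assumption taken from the description in the GloVe paper listed under References.

```r
# Toy illustration only: NOT the GloveR implementation.
# `toy_cooccurrence` is a hypothetical helper written for this example.
toy_cooccurrence <- function(tokens, context_words = 2, symmetric_both = TRUE) {
  vocab <- unique(tokens)
  counts <- matrix(0, nrow = length(vocab), ncol = length(vocab),
                   dimnames = list(vocab, vocab))
  for (j in seq_along(tokens)) {
    # Distances 1..context_words to the left, clipped at the start of the text
    distances <- seq_len(min(context_words, j - 1))
    for (d in distances) {
      w <- tokens[j]        # current word
      ctx <- tokens[j - d]  # context word d positions to the left
      # Assumed weighting: a pair d words apart contributes 1/d
      counts[w, ctx] <- counts[w, ctx] + 1 / d
      if (symmetric_both) {
        # Also record the pair from the context word's point of view,
        # which is equivalent to using the right context as well
        counts[ctx, w] <- counts[ctx, w] + 1 / d
      }
    }
  }
  counts
}

tokens <- c("the", "cat", "sat", "down")
left_only <- toy_cooccurrence(tokens, context_words = 2, symmetric_both = FALSE)
both <- toy_cooccurrence(tokens, context_words = 2, symmetric_both = TRUE)
```

With symmetric_both = FALSE, the entry for row "cat" and column "the" is filled while its mirror entry stays at zero. With symmetric_both = TRUE, the resulting matrix is symmetric.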

memory_gb

a floating-point number specifying the limit for memory consumption, in GB. The limit is based on a simple heuristic, so it is not extremely accurate. The default is 4.0

MAX_product

an integer that limits the size of the dense cooccurrence array by specifying the maximum product of the frequency counts of the two cooccurring words. Typically only needs adjustment for use with very large corpora

overflowLength

a number specifying the length-limit of the sparse overflow array, which buffers cooccurrence data that does not fit in the dense array, before writing to disk. Typically only needs adjustment for use with very large corpora

trace

either TRUE or FALSE. If TRUE, information will be printed out

Value

a character string specifying the location of the saved data

References

https://github.com/stanfordnlp/GloVe

http://nlp.stanford.edu/projects/glove/

http://nlp.stanford.edu/pubs/glove.pdf

Examples

# library(GloveR)

# co_mat = cooccurrence_statistics(train_data = '/data_GloveR/dat.txt',
#                                  vocab_input = '/data_GloveR/VOCAB.txt',
#                                  output_cooccurences = '/data_GloveR/COOCUR.bin',
#                                  symmetric_both = TRUE, context_words = 15,
#                                  memory_gb = 4.0, MAX_product = 0,
#                                  overflowLength = 0, trace = TRUE)