All functions

COS_TEXT()

Cosine similarity for text documents

Count_Rows()

Number of rows of a file

Doc2Vec

Conversion of text documents to word-vector-representation features ( Doc2Vec )

JACCARD_DICE()

Jaccard or Dice similarity for text documents

TEXT_DOC_DISSIM()

Dissimilarity calculation of text documents

big_tokenize_transform

String tokenization and transformation for big data sets

bytes_converter()

bytes converter of a text file ( KB, MB or GB )

cluster_frequency()

Frequencies of an existing cluster object

cosine_distance()

cosine distance of two character strings (each string consists of more than one word)

dense_2sparse()

convert a dense matrix to a sparse matrix

dice_distance()

Dice similarity of words using n-grams

dims_of_word_vecs()

dimensions of a word vectors file

levenshtein_distance()

Levenshtein distance of two words

load_sparse_binary()

load a sparse matrix in binary format

matrix_sparsity()

sparsity percentage of a sparse matrix

read_characters()

read a specific number of characters from a text file

read_rows()

read a specific number of rows from a text file

save_sparse_binary()

save a sparse matrix in binary format

select_predictors()

Exclude highly correlated predictors

sparse_Means()

rowMeans and colMeans for a sparse matrix

sparse_Sums()

rowSums and colSums for a sparse matrix

sparse_term_matrix

Term matrices and statistics ( document-term-matrix, term-document-matrix )

text_file_parser()

text file parser

text_intersect

intersection of words or letters in tokenized text

token_stats

token statistics

tokenize_transform_text()

String tokenization and transformation ( character string or path to a file )

tokenize_transform_vec_docs()

String tokenization and transformation ( vector of documents )

utf_locale()

utf-locale for the available languages

vocabulary_parser()

returns the vocabulary counts for small or medium files ( XML and other formats )