COS_TEXT() | Cosine similarity for text documents |
Count_Rows() | Number of rows of a file |
Doc2Vec | Conversion of text documents to word-vector-representation features (Doc2Vec) |
JACCARD_DICE() | Jaccard or Dice similarity for text documents |
TEXT_DOC_DISSIM() | Dissimilarity calculation of text documents |
big_tokenize_transform | String tokenization and transformation for big data sets |
bytes_converter() | Bytes converter for a text file (KB, MB or GB) |
cluster_frequency() | Frequencies of an existing cluster object |
cosine_distance() | Cosine distance of two character strings (each string consists of more than one word) |
dense_2sparse() | Convert a dense matrix to a sparse matrix |
dice_distance() | Dice similarity of words using n-grams |
dims_of_word_vecs() | Dimensions of a word vectors file |
levenshtein_distance() | Levenshtein distance of two words |
load_sparse_binary() | Load a sparse matrix in binary format |
matrix_sparsity() | Sparsity percentage of a sparse matrix |
read_characters() | Read a specific number of characters from a text file |
read_rows() | Read a specific number of rows from a text file |
save_sparse_binary() | Save a sparse matrix in binary format |
select_predictors() | Exclude highly correlated predictors |
sparse_Means() | rowMeans and colMeans for a sparse matrix |
sparse_Sums() | rowSums and colSums for a sparse matrix |
sparse_term_matrix | Term matrices and statistics (document-term-matrix, term-document-matrix) |
text_file_parser() | Text file parser |
text_intersect | Intersection of words or letters in tokenized text |
token_stats | Token statistics |
tokenize_transform_text() | String tokenization and transformation (character string or path to a file) |
tokenize_transform_vec_docs() | String tokenization and transformation (vector of documents) |
utf_locale() | UTF locale for the available languages |
vocabulary_parser() | Returns the vocabulary counts for small or medium (XML and other) files |