docs_matrix

Classes

docs_matrix document-term or term-document matrices
class docs_matrix.docs_matrix[source]

document-term or term-document matrices

Term_Matrix(vector_documents=None, path_2documents_file=None, sort_columns=False, LOCALE_UTF='', to_lower=False, to_upper=False, language='english', REMOVE_characters='', remove_punctuation_string=False, remove_numbers=False, trim_token=False, split_string=True, separator=' \r\n\t., ;:()?!//', remove_punctuation_vector=False, remove_stopwords=False, min_num_char=1, max_num_char=9223372036854775807, stemmer=None, min_n_gram=1, max_n_gram=1, skip_n_gram=1, skip_distance=0, n_gram_delimiter=' ', print_every_rows=1000, normalize=None, tf_idf=False, threads=1, verbose=False)[source]
Parameters:
  • vector_documents – either None or a character vector of documents
  • path_2documents_file – either None or a valid character path to a text file
  • sort_columns – either True or False specifying if the initial terms should be sorted (so that the output sparse matrix is in alphabetical order)
  • LOCALE_UTF – the language-specific locale to use in case either the to_lower or the to_upper parameter is True and the text file language is other than English. For instance, if the language of a text file is Greek then the LOCALE_UTF parameter should be ‘el_GR.UTF-8’ ( language_country.encoding ). A wrong utf-locale does not raise an error, but it does increase the runtime of the function.
  • to_lower – either True or False. If True the character string will be converted to lower case
  • to_upper – either True or False. If True the character string will be converted to upper case
  • language – a character string which defaults to ‘english’. If the remove_stopwords parameter is True then the corresponding stop-words vector will be loaded. Available languages: ‘afrikaans’, ‘arabic’, ‘armenian’, ‘basque’, ‘bengali’, ‘breton’, ‘bulgarian’, ‘catalan’, ‘croatian’, ‘czech’, ‘danish’, ‘dutch’, ‘english’, ‘estonian’, ‘finnish’, ‘french’, ‘galician’, ‘german’, ‘greek’, ‘hausa’, ‘hebrew’, ‘hindi’, ‘hungarian’, ‘indonesian’, ‘irish’, ‘italian’, ‘latvian’, ‘marathi’, ‘norwegian’, ‘persian’, ‘polish’, ‘portuguese’, ‘romanian’, ‘russian’, ‘slovak’, ‘slovenian’, ‘somalia’, ‘spanish’, ‘swahili’, ‘swedish’, ‘turkish’, ‘yoruba’, ‘zulu’

  • REMOVE_characters – a character string with specific characters that should be removed from the text file. If REMOVE_characters is the empty string then no characters are removed
  • remove_punctuation_string – either True or False. If True then the punctuation of the character string will be removed (applies before the split function)
  • remove_numbers – either True or False. If True then any numbers in the character string will be removed
  • trim_token – either True or False. If True then the string will be trimmed (left and/or right)
  • split_string – either True or False. If True then the character string will be split using the separator as delimiter. The user can also specify multiple delimiters.
  • separator – a character string specifying the character delimiter(s)
  • remove_punctuation_vector – either True or False. If True then the punctuation of the vector of the character strings will be removed (after the string split has taken place)
  • remove_stopwords – either True, False or a character vector of user-defined stop words. If True then the stop-words vector corresponding to the language parameter will be loaded.
  • min_num_char – an integer specifying the minimum number of characters to keep; only character strings with at least min_num_char characters are returned
  • max_num_char – an integer specifying the maximum number of characters to keep; only character strings with at most max_num_char characters are returned. The max_num_char should be less than or equal to Inf (in this method the Inf value translates to a word length of 1000000000)
  • stemmer – a character string specifying the stemming method. The available method is ‘porter2_stemmer’.
  • min_n_gram – an integer specifying the minimum number of n-grams. The minimum value of min_n_gram is 1.
  • max_n_gram – an integer specifying the maximum number of n-grams. The minimum value of max_n_gram is 1.
  • skip_n_gram – an integer specifying the number of skip-n-grams. The minimum value of skip_n_gram is 1.
  • skip_distance – an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.
  • n_gram_delimiter – a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
  • print_every_rows – a numeric value greater than 1 specifying the print intervals. Frequent output to the console can slow down the method in case of big files.
  • normalize – either None or one of ‘l1’ or ‘l2’ normalization.
  • tf_idf – either True or False. If True then the term-frequency-inverse-document-frequency will be returned
  • threads – an integer specifying the number of cores to run in parallel
  • verbose – either True or False. If True then information will be printed out

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

Note

The Term_Matrix method takes either a list of character strings or a text file and, after tokenization and transformation, saves the terms, row indices, column indices and counts.
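To make the n-gram related parameters (min_n_gram, max_n_gram, skip_distance) more concrete, here is a package-independent sketch of how such tokens might be built. The function names and the exact enumeration order are illustrative assumptions, not the package internals:

```python
def word_ngrams(tokens, min_n=1, max_n=2, delimiter=' '):
    """Contiguous word n-grams from min_n up to max_n words long."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(delimiter.join(tokens[i:i + n]))
    return grams

def skip_bigrams(tokens, skip_distance=1, delimiter=' '):
    """Bigrams allowing up to skip_distance tokens between the two words.
    skip_distance = 0 reduces to ordinary (contiguous) bigrams."""
    grams = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + skip_distance, len(tokens))):
            grams.append(delimiter.join((tokens[i], tokens[j])))
    return grams

# word_ngrams(['the', 'cat', 'sat'], 1, 2)
#   -> ['the', 'cat', 'sat', 'the cat', 'cat sat']
# skip_bigrams(['a', 'b', 'c'], skip_distance=1)
#   -> ['a b', 'a c', 'b c']
```

With skip_distance = 0 the skip-grams collapse to plain n-grams, which matches the skip_distance description in the parameter list above.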

triplet_data()[source]

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

trpl_dat = tm.triplet_data()

Note

This method returns the terms, row-indices, column-indices and counts ( or floats ).

The ‘triplet_data’ method should be called after the ‘Term_Matrix’ method, otherwise the output will be an empty dictionary.

document_term_matrix(to_array=False)[source]
Parameters:to_array – either True or False. If True then the output will be a numpy array, otherwise a sparse matrix

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_dtm = tm.document_term_matrix(to_array = True)

Note

This method should be called after the ‘Term_Matrix’ method is run. It returns a document-term matrix.

Here the sparse matrix format is a ‘csr_matrix’ because shape[0] < shape[1] (rows < columns)
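The triplet output (row indices, column indices, counts) maps directly onto scipy's sparse constructors. A minimal sketch, independent of the package, with hypothetical triplet values standing in for the real Term_Matrix output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# hypothetical triplet data (the real values would come from Term_Matrix)
rows = np.array([0, 0, 1, 2])   # document (row) indices
cols = np.array([0, 2, 1, 2])   # term (column) indices
vals = np.array([1, 2, 1, 3])   # counts

# 3 documents x 3 terms -> a document-term matrix in CSR format
dtm = csr_matrix((vals, (rows, cols)), shape=(3, 3))
```

A CSR layout stores the data row by row, which suits a document-term matrix where per-document (row) slices are the common access pattern.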

term_document_matrix(to_array=False)[source]
Parameters:to_array – either True or False. If True then the output will be a numpy array, otherwise a sparse matrix

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_tdm = tm.term_document_matrix(to_array = True)

Note

This method should be called after the ‘Term_Matrix’ method is run. It returns a term-document matrix.

Here the sparse matrix format is a ‘csc_matrix’ because shape[0] > shape[1] (rows > columns)

corpus_terms()[source]

Note

The corpus_terms method returns the terms of the corpus. There are two different cases:

1st. either the ‘document_term_matrix’ or the ‘term_document_matrix’ was run first –> it returns all the terms of the corpus.

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_crp_all = tm.corpus_terms()

2nd. the ‘Term_Matrix_Adjust’ method was run first –> it returns a reduced list of terms, taking into account the output of the ‘Term_Matrix_Adjust’ method

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_adj = tm.Term_Matrix_Adjust(sparsity_thresh = 0.9)

res_crp_reduced = tm.corpus_terms()

Sparsity()[source]

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

tm.Sparsity()

Note

Returns the sparsity of the initial term matrix.
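Sparsity here means the proportion of zero entries in the term matrix. A minimal, package-independent sketch of that quantity (the function name is illustrative):

```python
import numpy as np

def sparsity(mat):
    """Proportion of zero entries in a matrix: 0.0 means fully dense,
    values near 1.0 mean almost all entries are zero."""
    mat = np.asarray(mat)
    return 1.0 - np.count_nonzero(mat) / mat.size

# e.g. a 2 x 4 matrix with 6 zero entries has sparsity 6/8 = 0.75
```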

Term_Matrix_Adjust(sparsity_thresh=1.0, to_array=False)[source]
Parameters:
  • sparsity_thresh – a float number between 0.0 and 1.0 specifying the sparsity threshold
  • to_array – either True or False. If True then the output will be a numpy array, otherwise a sparse matrix

Example:

tm = docs_matrix()
    
tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)
        
res_adj = tm.Term_Matrix_Adjust(sparsity_thresh = 0.9)

Note

The Term_Matrix_Adjust method removes sparse terms from the output matrix using a sparsity threshold.
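The idea behind such sparsity filtering can be sketched independently of the package: drop every term (column of a document-term matrix) whose fraction of zero entries exceeds the threshold. The function name and the tuple it returns are assumptions for illustration, not the package's API:

```python
import numpy as np

def remove_sparse_terms(dtm, sparsity_thresh=1.0):
    """Keep only term columns whose fraction of zero entries is at most
    sparsity_thresh (in a document-term matrix, columns = terms)."""
    dtm = np.asarray(dtm)
    col_sparsity = 1.0 - np.count_nonzero(dtm, axis=0) / dtm.shape[0]
    keep = col_sparsity <= sparsity_thresh
    return dtm[:, keep], keep
```

With sparsity_thresh = 1.0 nothing is removed; lowering the threshold discards progressively more of the rarely occurring terms.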

most_frequent_terms(keep_terms=None, threads=1, verbose=False)[source]
Parameters:
  • keep_terms – a numeric value specifying the number of rows (terms) to keep from the output data frame
  • threads – an integer specifying the number of cores to run in parallel
  • verbose – either True or False. If True then information will be printed out

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_freq = tm.most_frequent_terms(keep_terms = 10, threads = 1, verbose = False)

Note

The most_frequent_terms method returns the most frequent terms of the corpus using the output of the Term_Matrix method. The user has the option to keep a specific number of terms from the output table using the keep_terms parameter.
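Conceptually this amounts to summing each term's counts over all documents and sorting in decreasing order. A minimal sketch under that assumption (function name and return format are illustrative, not the package's output):

```python
import numpy as np

def most_frequent(terms, dtm, keep_terms=None):
    """Total count per term (column sums of a document-term matrix),
    sorted in decreasing order; keep_terms truncates the result."""
    counts = np.asarray(dtm).sum(axis=0)
    order = np.argsort(counts)[::-1]
    ranked = [(terms[i], int(counts[i])) for i in order]
    return ranked[:keep_terms] if keep_terms else ranked
```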

term_associations(Terms=None, keep_terms=None, verbose=False)[source]
Parameters:
  • Terms – a character list specifying the character strings for which the associations will be computed
  • keep_terms – a numeric value specifying the number of rows (terms) to keep from the output data frame
  • verbose – either True or False. If True then information will be printed out

Example:

tm = docs_matrix()

tm.Term_Matrix(path_2documents_file = '/myfolder/input_file.txt', sort_columns = True, to_lower = True, split_string = True, tf_idf = True)

res_assoc = tm.term_associations(Terms = ['this', 'word', 'that'], keep_terms = 10, verbose = False)

Note

The term_associations method finds the associations between the given terms (Terms argument) and all the other terms in the corpus by calculating their correlation.

There is also the option to keep a specific number of terms from the output table using the keep_terms parameter.
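Correlation-based term association can be sketched, independently of the package, as the Pearson correlation between the target term's column and every other term column of the document-term matrix. The function name and return type here are illustrative assumptions:

```python
import numpy as np

def associations(terms, dtm, target):
    """Pearson correlation between the target term's column and every
    other term column of a document-term matrix."""
    dtm = np.asarray(dtm, dtype=float)
    idx = terms.index(target)
    # term-by-term correlation matrix (columns are variables)
    corr = np.corrcoef(dtm, rowvar=False)
    return {t: float(corr[idx, j]) for j, t in enumerate(terms) if j != idx}
```

Terms that tend to co-occur in the same documents score close to 1, while terms whose counts vary independently score near 0.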