token_stats
Classes
token_stats: functions to compute token statistics
-
class token_stats.token_stats
Functions to compute token statistics.
-
path_2vector(path_2folder=None, path_2file=None, file_delimiter='\n')
Parameters:
- path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
- path_2file – either None or a valid path to a file
- file_delimiter – either None or a character string specifying the file delimiter
Example:
tks = token_stats()
res = tks.path_2vector(path_2file = '/myfolder/vocab_file.txt')
Note
The path_2vector method reads the words of a folder or file into a list, using file_delimiter to split the input. Typical usage: reading a vocabulary from a text file.
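As a rough illustration of what reading a delimited vocabulary file involves, the following plain-Python sketch splits text on a delimiter and drops empty entries. The helper names split_words and read_words are hypothetical and are not part of token_stats; the library's own implementation may differ.

```python
from pathlib import Path


def split_words(text, file_delimiter="\n"):
    """Split text on the delimiter, dropping empty pieces (hypothetical helper)."""
    return [w for w in text.split(file_delimiter) if w]


def read_words(path_2file, file_delimiter="\n"):
    """Read a UTF-8 text file and return its words as a list (hypothetical helper)."""
    text = Path(path_2file).read_text(encoding="utf-8")
    return split_words(text, file_delimiter)
```

Keeping the splitting step separate from the file access makes it easy to apply the same logic to text that is already in memory.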
-
freq_distribution(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n', keep=None)
Parameters:
- x_vector – either None or a list of character strings
- path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
- path_2file – either None or a valid path to a file
- file_delimiter – either None or a character string specifying the file delimiter
- keep – the number of lines to keep from the output data frame
Example:
tks = token_stats()
res = tks.freq_distribution(path_2file = '/myfolder/vocab_file.txt', keep = 20)
Note
This method returns a frequency distribution as a data frame for EITHER a folder, a file OR a list of character strings.
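The idea of a frequency distribution with a keep limit can be sketched with the standard library. This is an illustration only: freq_table is a hypothetical name, and it returns a list of (word, count) pairs rather than the data frame that freq_distribution produces.

```python
from collections import Counter


def freq_table(words, keep=None):
    """Count word occurrences and return the `keep` most frequent pairs.

    Hypothetical helper; keep=None returns every word, most frequent first.
    """
    counts = Counter(words)
    return counts.most_common(keep)


rows = freq_table(["a", "b", "a", "c", "b", "a"], keep=2)
# rows is [("a", 3), ("b", 2)]
```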
-
count_character(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n')
Parameters:
- x_vector – either None or a list of character strings
- path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
- path_2file – either None or a valid path to a file
- file_delimiter – either None or a character string specifying the file delimiter
Example:
tks = token_stats()
res = tks.count_character(path_2file = '/myfolder/vocab_file.txt')
Note
The count_character method returns the number of characters for each word of the corpus for EITHER a folder, a file OR a character string list.
-
print_count_character(number=None)
Parameters:
- number – a numeric value. All words with number of characters (see method count_character) equal to the number parameter will be returned.
Example:
tks = token_stats()
res = tks.count_character(path_2file = '/myfolder/vocab_file.txt')
tks.print_count_character(number = 6)
Note
This method should be called after the ‘count_character’ method is run. Given the numeric parameter ‘number’, this method prints all the words with number of characters equal to ‘number’.
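The two-step pattern (count characters first, then look up words of a given length) can be sketched as a grouping by word length. The function group_by_length is a hypothetical stand-in, not the library's code.

```python
from collections import defaultdict


def group_by_length(words):
    """Map each character count to the words having that length (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)


vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
by_length = group_by_length(vocab_lst)
six_letter_words = by_length.get(6, [])  # words with exactly 6 characters
```

Building the grouping once and querying it afterwards mirrors why print_count_character has to be called after count_character.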
-
collocation_words(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n', n_gram_delimiter='_')
Parameters:
- x_vector – either None or a list of character strings
- path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
- path_2file – either None or a valid path to a file
- file_delimiter – either None or a character string specifying the file delimiter
- n_gram_delimiter – either None or a character string specifying the n-gram delimiter
Example:
tks = token_stats()
res = tks.collocation_words(path_2file = '/myfolder/vocab_file.txt')
Note
The collocation_words method saves a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a list of character strings.
A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ).
The input to the method should be text n-grams separated by a delimiter ( for instance 3-grams or 4-grams ).
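One plausible reading of a co-occurrence table over delimited n-grams is sketched below: for every word, count how often each other word appears in the same n-gram. The name cooccurrence_table is hypothetical, the counts are raw rather than normalized, and the library may compute its table differently.

```python
from collections import Counter, defaultdict


def cooccurrence_table(n_grams, n_gram_delimiter="_"):
    """Count, per word, the other words sharing an n-gram with it (hypothetical helper)."""
    table = defaultdict(Counter)
    for gram in n_grams:
        words = gram.split(n_gram_delimiter)
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:  # skip pairing a position with itself
                    table[word][other] += 1
    return table


grams = ["the_term_planet", "term_planet_is", "planet_is_ancient"]
table = cooccurrence_table(grams)
planet_neighbours = dict(table["planet"])
```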
-
print_collocations(word=None)
Parameters:
- word – a character string. The collocations of this word (see method collocation_words) will be printed.
Example:
tks = token_stats()
res = tks.collocation_words(path_2file = '/myfolder/vocab_file.txt')
tks.print_collocations(word = 'aword')
Note
This method should be called after the ‘collocation_words’ method is run. It prints the collocations for a specific ‘word’
-
string_dissimilarity_matrix(words_vector=None, dice_n_gram=2, method='dice', split_separator=' ', dice_thresh=1.0, upper=True, diagonal=True, threads=1)
Parameters:
- words_vector – a list of character strings
- dice_n_gram – a numeric value specifying the n-gram for the dice method
- method – a character string specifying the method to use. One of dice, levenshtein or cosine
- split_separator – a character string specifying the string split separator if method equals cosine. The cosine method uses sentences, so for a sentence : “this_is_a_word_sentence” the split_separator should be “_”
- dice_thresh – a float number used to threshold the data if method is dice. It takes values between 0.0 and 1.0. The closer the threshold is to 0.0, the more values of the dissimilarity matrix will take the value of 1.0.
- upper – either True or False. If True then both lower and upper parts of the dissimilarity matrix will be shown. Otherwise the upper part will be filled with NA’s
- diagonal – either True or False. If True then the diagonal of the dissimilarity matrix will be shown. Otherwise the diagonal will be filled with NA’s
- threads – a numeric value specifying the number of cores to use in parallel
Example:
tks = token_stats()
vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
res = tks.string_dissimilarity_matrix(words_vector = vocab_lst, dice_n_gram = 2, method = 'dice')
Note
The string_dissimilarity_matrix method returns a string dissimilarity matrix using the dice, levenshtein or cosine distance. The input can only be a list of character strings. If the method is dice, the dice coefficient (similarity) between two strings is calculated for a specific number of character n-grams ( dice_n_gram ).
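To make the dice case concrete, here is a minimal sketch of a dice-based dissimilarity over character n-grams, computed as one minus the dice coefficient on sets of n-grams. The function names are hypothetical; the sketch ignores dice_thresh, upper, diagonal and threads, and the library may differ in details such as how repeated n-grams are handled.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_dissimilarity(a, b, n=2):
    """1 - dice coefficient, where dice = 2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Both words are shorter than n: fall back to exact comparison.
        return 0.0 if a == b else 1.0
    overlap = len(grams_a & grams_b)
    return 1.0 - 2.0 * overlap / (len(grams_a) + len(grams_b))


def dissimilarity_matrix(words, n=2):
    """Full square matrix (list of lists) of pairwise dice dissimilarities."""
    return [[dice_dissimilarity(a, b, n) for b in words] for a in words]


vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
matrix = dissimilarity_matrix(vocab_lst, n=2)
```

Identical words get 0.0 and words sharing no n-grams get 1.0, so the matrix is symmetric with a zero diagonal.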
-
look_up_table(words_vector=None, n_grams=None)
Parameters:
- words_vector – a list of character strings
- n_grams – a numeric value specifying the n-grams
Example:
tks = token_stats()
vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
res = tks.look_up_table(words_vector = vocab_lst, n_grams = 4)
Note
The look_up_table returns a look-up-list where the list-names are the n-grams and the list-vectors are the words associated with those n-grams.
The input can be a character string list only.
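The structure described in this note, n-grams as keys and the associated words as values, can be sketched with a dictionary of lists. The name build_lookup is hypothetical, and this sketch does not pad word boundaries, which the library may or may not do.

```python
from collections import defaultdict


def build_lookup(words, n_grams):
    """Map each character n-gram to the words containing it (hypothetical helper)."""
    table = defaultdict(list)
    for word in words:
        seen = set()  # avoid listing a word twice under the same n-gram
        for i in range(len(word) - n_grams + 1):
            gram = word[i:i + n_grams]
            if gram not in seen:
                seen.add(gram)
                table[gram].append(word)
    return dict(table)


vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
lookup = build_lookup(vocab_lst, n_grams=2)
words_with_th = lookup.get('th', [])
```

Words shorter than the n-gram size contribute no keys in this sketch.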
-
print_words_lookup_tbl(n_gram=None)
Parameters:
- n_gram – a character string specifying the n-gram
Example:
tks = token_stats()
vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
res = tks.look_up_table(words_vector = vocab_lst, n_grams = 4)
tks.print_words_lookup_tbl(n_gram = "_abo")
Note
This method should be called after the ‘look_up_table’ method is run. It returns words associated to n-grams in the look-up-table