token_stats

Classes

token_stats – functions to compute token statistics
class token_stats.token_stats[source]

functions to compute token statistics

path_2vector(path_2folder=None, path_2file=None, file_delimiter='\n')[source]
Parameters:
  • path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
  • path_2file – either None or a valid path to a file
  • file_delimiter – either None or a character string specifying the file delimiter

Example:

tks = token_stats()

res = tks.path_2vector(path_2file = '/myfolder/vocab_file.txt')

Note

The path_2vector method reads the words of a folder or file into a list, splitting the input on file_delimiter. Typical usage: reading a vocabulary from a text file.
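The reading step described in the note can be sketched in plain Python. This is not the library's implementation: the function name path_to_vector_sketch is hypothetical, and dropping empty tokens (for example after a trailing newline) is an assumption.

```python
import tempfile
from pathlib import Path


def path_to_vector_sketch(path_2file, file_delimiter='\n'):
    """Read a text file and split it into tokens on file_delimiter."""
    text = Path(path_2file).read_text(encoding='utf-8')
    # Assumption: empty tokens (e.g. after a trailing delimiter) are dropped.
    return [token for token in text.split(file_delimiter) if token]


# Demonstration on a throwaway file, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'vocab_file.txt'
    demo_file.write_text('the\nterm\nplanet\n', encoding='utf-8')
    words = path_to_vector_sketch(demo_file)

print(words)
```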

freq_distribution(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n', keep=None)[source]
Parameters:
  • x_vector – either None or a list of character strings
  • path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
  • path_2file – either None or a valid path to a file
  • file_delimiter – either None or a character string specifying the file delimiter
  • keep – the number of lines to keep from the output data frame

Example:

tks = token_stats()

res = tks.freq_distribution(path_2file = '/myfolder/vocab_file.txt', keep = 20)

Note

This method returns a frequency distribution in the form of a data frame for EITHER a folder, a file OR a list of character strings.
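As an illustration of what a frequency distribution with a keep limit looks like, here is a minimal sketch built on collections.Counter. It returns a list of (word, count) pairs rather than a data frame, and the descending sort order is an assumption; freq_distribution_sketch is a hypothetical name, not part of the library.

```python
from collections import Counter


def freq_distribution_sketch(words, keep=None):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(words)
    # most_common(None) returns every pair; most_common(k) keeps the top k.
    return counts.most_common(keep)


vocab = ['the', 'planet', 'the', 'term', 'planet', 'the']
top_two = freq_distribution_sketch(vocab, keep=2)
print(top_two)
```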

count_character(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n')[source]
Parameters:
  • x_vector – either None or a list of character strings
  • path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
  • path_2file – either None or a valid path to a file
  • file_delimiter – either None or a character string specifying the file delimiter

Example:

tks = token_stats()

res = tks.count_character(path_2file = '/myfolder/vocab_file.txt')

Note

The count_character method returns the number of characters of each word in the corpus for EITHER a folder, a file OR a list of character strings.

print_count_character(number=None)[source]
Parameters:
  • number – a numeric value. All words whose number of characters (see method count_character) equals the number parameter will be printed.

Example:

tks = token_stats()

res = tks.count_character(path_2file = '/myfolder/vocab_file.txt')

tks.print_count_character(number = 6)

Note

This method should be called after the ‘count_character’ method is run. Given the numeric parameter ‘number’, this method prints all the words whose number of characters equals ‘number’.
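The two-step pattern (count first, then look up by length) can be sketched in plain Python as follows. This only illustrates the idea; group_by_length is a hypothetical helper and says nothing about how the library stores its counts.

```python
from collections import defaultdict


def group_by_length(words):
    """Map each character count to the words that have that length."""
    by_length = defaultdict(list)
    for word in words:
        by_length[len(word)].append(word)
    return dict(by_length)


vocab = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
groups = group_by_length(vocab)

# Analogue of print_count_character(number = 6)
print(groups.get(6, []))
```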

collocation_words(x_vector=None, path_2folder=None, path_2file=None, file_delimiter='\n', n_gram_delimiter='_')[source]
Parameters:
  • x_vector – either None or a list of character strings
  • path_2folder – either None or a valid path to a folder ( each file in the folder should include words separated by a delimiter )
  • path_2file – either None or a valid path to a file
  • file_delimiter – either None or a character string specifying the file delimiter
  • n_gram_delimiter – either None or a character string specifying the n-gram delimiter.

Example:

tks = token_stats()

res = tks.collocation_words(path_2file = '/myfolder/vocab_file.txt')

Note

The collocation_words method saves a co-occurrence frequency table for n-grams for EITHER a folder, a file OR a list of character strings.

A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components ( http://nlp.stanford.edu/fsnlp/promo/colloc.pdf, page 172 ).

The input to the method should be text n-grams whose words are joined by a delimiter (for instance 3-grams or 4-grams).
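To make the expected input concrete, the sketch below counts, for every word, how often each other word appears in the same n-gram. Treating "co-occurrence frequency" as raw within-n-gram counts is an assumption, the library may normalise or weight them differently, and collocation_sketch is a hypothetical name.

```python
from collections import Counter, defaultdict


def collocation_sketch(n_grams, n_gram_delimiter='_'):
    """Count, per word, the other words sharing an n-gram with it."""
    table = defaultdict(Counter)
    for gram in n_grams:
        words = gram.split(n_gram_delimiter)
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:  # skip pairing a position with itself
                    table[word][other] += 1
    return {word: dict(counter) for word, counter in table.items()}


# 3-grams joined with the default n_gram_delimiter '_'
grams = ['the_term_planet', 'term_planet_is', 'planet_is_ancient']
table = collocation_sketch(grams)

# Analogue of print_collocations(word = 'planet')
print(table['planet'])
```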

print_collocations(word=None)[source]
Parameters:
  • word – a character string. The collocations for this word (see method collocation_words) will be printed.

Example:

tks = token_stats()

res = tks.collocation_words(path_2file = '/myfolder/vocab_file.txt')

tks.print_collocations(word = 'aword')

Note

This method should be called after the ‘collocation_words’ method is run. It prints the collocations for a specific ‘word’

string_dissimilarity_matrix(words_vector=None, dice_n_gram=2, method='dice', split_separator=' ', dice_thresh=1.0, upper=True, diagonal=True, threads=1)[source]
Parameters:
  • words_vector – a list of character strings
  • dice_n_gram – a numeric value specifying the n-gram for the dice method of the string_dissimilarity_matrix method
  • method – a character string specifying the method to use in the string_dissimilarity_matrix method. One of dice, levenshtein or cosine
  • split_separator – a character string specifying the string split separator if method equals cosine in the string_dissimilarity_matrix method. The cosine method operates on sentences, so for the sentence “this_is_a_word_sentence” the split_separator should be “_”
  • dice_thresh – a float number to use to threshold the data if method is dice in the string_dissimilarity_matrix method. It takes values between 0.0 and 1.0. The closer the thresh is to 0.0 the more values of the dissimilarity matrix will take the value of 1.0.
  • upper – either True or False. If True then both lower and upper parts of the dissimilarity matrix of the string_dissimilarity_matrix method will be shown. Otherwise the upper part will be filled with NA’s
  • diagonal – either True or False. If True then the diagonal of the dissimilarity matrix of the string_dissimilarity_matrix method will be shown. Otherwise the diagonal will be filled with NA’s
  • threads – a numeric value specifying the number of cores to use in parallel in the string_dissimilarity_matrix method

Example:

tks = token_stats()

vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']

res = tks.string_dissimilarity_matrix( words_vector = vocab_lst, dice_n_gram = 2, method = 'dice')

Note

The string_dissimilarity_matrix method returns a string dissimilarity matrix using either the dice, levenshtein or cosine distance. The input can only be a list of character strings. If the method is dice, the dice coefficient (a similarity) is calculated between two strings for a specific number of character n-grams ( dice_n_gram ).
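The dice calculation can be illustrated with a short, standalone sketch. It assumes the textbook Sørensen-Dice coefficient over sets of character n-grams and takes the dissimilarity as one minus that coefficient; whether the library treats repeated n-grams, padding or the dice_thresh cut-off the same way is not stated here, and the function names are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the consecutive character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_dissimilarity(a, b, n=2):
    """1 - Dice coefficient, computed on sets of character n-grams."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    total = len(grams_a) + len(grams_b)
    if total == 0:
        # Both words are shorter than n: fall back to exact comparison.
        return 0.0 if a == b else 1.0
    shared = len(grams_a & grams_b)
    return 1.0 - 2.0 * shared / total


def dissimilarity_matrix(words, n=2):
    """Full square matrix as a list of rows."""
    return [[dice_dissimilarity(a, b, n) for b in words] for a in words]


matrix = dissimilarity_matrix(['night', 'nacht', 'term'])
for row in matrix:
    print(row)
```

Here 'night' and 'nacht' share one bigram ('ht') out of four each, so their dissimilarity is 1 - 2/8 = 0.75, while 'term' shares none with either word.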

look_up_table(words_vector=None, n_grams=None)[source]
Parameters:
  • words_vector – a list of character strings
  • n_grams – a numeric value specifying the n-grams

Example:

tks = token_stats()

vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']

res = tks.look_up_table(words_vector = vocab_lst, n_grams = 4)

Note

The look_up_table method returns a look-up list whose names are the n-grams and whose values are the words associated with those n-grams. The input can only be a list of character strings.
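A rough sketch of such a look-up structure, using a plain dict keyed by character n-gram. Padding each word with '_' on both sides is an assumption inferred from n-grams such as "_abo" that start with an underscore; look_up_table_sketch is a hypothetical name and the library's actual padding rule may differ.

```python
from collections import defaultdict


def look_up_table_sketch(words, n_grams):
    """Map each character n-gram to the words that contain it."""
    table = defaultdict(list)
    for word in words:
        padded = '_' + word + '_'  # assumed padding, see note above
        for i in range(len(padded) - n_grams + 1):
            gram = padded[i:i + n_grams]
            if word not in table[gram]:  # list each word once per n-gram
                table[gram].append(word)
    return dict(table)


vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']
table_2 = look_up_table_sketch(vocab_lst, n_grams=2)
table_4 = look_up_table_sketch(vocab_lst, n_grams=4)

# Words that start with 't' under the assumed padding
print(table_2['_t'])
```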

print_words_lookup_tbl(n_gram=None)[source]
Parameters:
  • n_gram – a character string specifying the n-gram

Example:

tks = token_stats()

vocab_lst = ['the', 'term', 'planet', 'is', 'ancient', 'with', 'ties', 'to']

res = tks.look_up_table(words_vector = vocab_lst, n_grams = 4)

tks.print_words_lookup_tbl(n_gram = "_abo")

Note

This method should be called after the ‘look_up_table’ method is run. It returns the words associated with the given n-gram in the look-up table.