Returns the vocabulary counts for small or medium-sized files (XML, but not only XML)

vocabulary_parser(
  input_path_file = NULL,
  start_query = NULL,
  end_query = NULL,
  vocabulary_path_file = NULL,
  min_lines = 1,
  trimmed_line = FALSE,
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  max_num_char = Inf,
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  threads = 1,
  verbose = FALSE
)

Arguments

input_path_file

a character string specifying a valid path to the input file

start_query

a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

vocabulary_path_file

a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation are applied).

min_lines

a numeric value specifying the minimum number of lines. For instance, if min_lines = 2, then only subsets of text with at least 2 lines will be kept.

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed on both sides before the start_query and end_query are applied.

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language-specific locale to use when either the to_lower or the to_upper parameter is TRUE and the language of the text file is not English. For instance, if the language of a text file is Greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf_locale does not raise an error, but the runtime of the function increases.

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

remove_char

a character string with specific characters that should be removed from the text file. If remove_char is "" then no characters are removed.

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)
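
To make the idea of multiple delimiters concrete, the following is a minimal base-R sketch of splitting a string on several single-character delimiters and counting the resulting tokens. It is an illustration only and not the implementation used by vocabulary_parser; the regular expression is an assumed equivalent of the default split_separator, and the sample text is made up.

```r
# Illustration only: base R, not the textTinyR implementation.
text <- "The cat sat. The cat ran!"

# Lower-case first (comparable to to_lower = TRUE), then split on any run of
# the characters found in the default split_separator.
tokens <- strsplit(tolower(text), "[ \r\n\t.,;:()?!/]+")[[1]]

# Drop empty strings that a leading delimiter would produce.
tokens <- tokens[nzchar(tokens)]

# Vocabulary counts: one entry per distinct token.
counts <- table(tokens)
print(sort(counts, decreasing = TRUE))
```

Here every character inside the brackets acts as its own delimiter, which is how a single string can carry several separators at once.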

remove_stopwords

either TRUE, FALSE or a character vector of user-defined stop words. If TRUE then the stop words vector that corresponds to the language parameter will be loaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be loaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1 then only character strings with more than 1 character will be returned.

stemmer

a character string specifying the stemming method. The available method is porter2_stemmer. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum value of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum value of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum value of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
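
As a rough illustration of how min_n_gram, max_n_gram and n_gram_delimiter interact, here is a small base-R sketch that builds contiguous n-grams (the skip_distance = 0 case). The helper make_ngrams is hypothetical and is not part of textTinyR; skip-n-grams are deliberately left out because their exact construction in the package is not described here.

```r
# Hypothetical helper (not a textTinyR function): builds all contiguous
# n-grams for n between min_n and max_n, joined with `delimiter`.
make_ngrams <- function(tokens, min_n = 1, max_n = 1, delimiter = " ") {
  stopifnot(min_n >= 1, max_n >= min_n)
  out <- character(0)
  for (n in seq(min_n, max_n)) {
    if (n > length(tokens)) break          # not enough tokens for this n
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = delimiter),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("the", "quick", "brown", "fox")
ngrams <- make_ngrams(tokens, min_n = 1, max_n = 2, delimiter = "_")
print(ngrams)
```

With four tokens, the call above yields the four unigrams followed by the three bigrams, each bigram joined by the chosen delimiter.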

threads

an integer specifying the number of cores to use in parallel

verbose

either TRUE or FALSE. If TRUE then information will be printed in the console

Details

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters.
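
The following base-R sketch shows one plausible reading of how subsets could be extracted from such a structured file. It assumes that a subset opens on a line beginning with start_query and closes on a line ending with end_query, and that min_lines filters out shorter subsets. The helper extract_subsets and the sample lines are hypothetical; the actual matching rules inside vocabulary_parser may differ.

```r
# Hypothetical helper (not a textTinyR function): collects blocks of lines
# that open with `start_query` and close with `end_query`.
extract_subsets <- function(lines, start_query, end_query, min_lines = 1) {
  lines <- trimws(lines)          # trim both sides, as with trimmed_line = FALSE
  subsets <- list()
  current <- NULL                 # NULL means "not inside a subset"
  for (ln in lines) {
    if (is.null(current) && startsWith(ln, start_query)) {
      current <- character(0)     # open a new subset
    }
    if (!is.null(current)) {
      current <- c(current, ln)
      if (endsWith(ln, end_query)) {
        # close the subset and keep it only if it is long enough
        if (length(current) >= min_lines) {
          subsets[[length(subsets) + 1]] <- current
        }
        current <- NULL
      }
    }
  }
  subsets
}

lines <- c("<doc>",
           "  <text>first entry</text>  ",
           "<meta>ignored</meta>",
           "<text>second",
           "entry</text>",
           "</doc>")

all_subsets  <- extract_subsets(lines, "<text>", "</text>")
long_subsets <- extract_subsets(lines, "<text>", "</text>", min_lines = 2)
print(all_subsets)
```

In this sketch the single-line entry is dropped once min_lines = 2, which matches the description of min_lines in the Arguments section.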

For big files the vocabulary_accumulator method of the big_tokenize_transform class is appropriate.

Stemming for the English language is done using the porter2-stemmer; for details see https://github.com/smassung/porter2_stemmer

Examples


library(textTinyR)

# vps = vocabulary_parser(input_path_file = '/folder/input_data.txt',
#                         start_query = 'start_word', end_query = 'end_word',
#                         vocabulary_path_file = '/folder/vocab.txt',
#                         to_lower = TRUE, split_string = TRUE)