String tokenization and transformation for big data sets

String tokenization and transformation for big data sets

# utl <- big_tokenize_transform$new(verbose = FALSE)

Details

the big_text_splitter function splits a text file into sub-text-files using either the batches parameter (big-text-splitter-bytes) or both the batches and the end_query parameter (big-text-splitter-query). The end_query parameter (if not NULL) should be a character string specifying a word that appears repeatedly at the end of each line in the text file.

the big_text_parser function parses text files from an input folder and saves those processed files to an output folder. The big_text_parser is appropriate for files with a structure using the start- and end- query parameters.

the big_text_tokenizer function tokenizes and transforms the text files of a folder and saves those files to either a folder or a single file. There is also the option to save a frequency vocabulary of those transformed tokens to a file.

the vocabulary_accumulator function takes the resulted vocabulary files of the big_text_tokenizer and returns the vocabulary sums sorted in decreasing order. The parameter max_num_chars limits the number of the corpus using the number of characters of each word.

The ngram_sequential or ngram_overlap stemming method applies to each single batch and not to the whole corpus of the text file. Thus, it is possible that the stems of the same words for randomly selected batches might differ.

Methods

big_tokenize_transform$new(verbose = FALSE)

--------------

big_text_splitter(input_path_file = NULL, output_path_folder = NULL, end_query = NULL, batches = NULL, trimmed_line = FALSE)

--------------

big_text_parser(input_path_folder = NULL, output_path_folder = NULL, start_query = NULL, end_query = NULL, min_lines = 1, trimmed_line = FALSE)

--------------

big_text_tokenizer(input_path_folder = NULL, batches = NULL, read_file_delimiter = " ", to_lower = FALSE, to_upper = FALSE, utf_locale = "", remove_char = "", remove_punctuation_string = FALSE, remove_punctuation_vector = FALSE, remove_numbers = FALSE, trim_token = FALSE, split_string = FALSE, split_separator = " .,;:()?!", remove_stopwords = FALSE, language = "english", min_num_char = 1, max_num_char = Inf, stemmer = NULL, min_n_gram = 1, max_n_gram = 1, skip_n_gram = 1, skip_distance = 0, n_gram_delimiter = " ", concat_delimiter = NULL, path_2folder = "", stemmer_ngram = 4, stemmer_gamma = 0.0, stemmer_truncate = 3, stemmer_batches = 1, threads = 1, save_2single_file = FALSE, increment_batch_nr = 1, vocabulary_path_folder = NULL)

--------------

vocabulary_accumulator(input_path_folder = NULL, vocabulary_path_file = NULL, max_num_chars = 100)

Methods

Public methods


Method new()

Usage

big_tokenize_transform$new(verbose = FALSE)

Arguments

verbose

either TRUE or FALSE. If TRUE then information will be printed in the console


Method big_text_splitter()

Usage

big_tokenize_transform$big_text_splitter(
  input_path_file = NULL,
  output_path_folder = NULL,
  end_query = NULL,
  batches = NULL,
  trimmed_line = FALSE
)

Arguments

input_path_file

a character string specifying the path to the input file

output_path_folder

a character string specifying the folder where the output files should be saved

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

batches

a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed both sides before applying the start_query and end_query


Method big_text_parser()

Usage

big_tokenize_transform$big_text_parser(
  input_path_folder = NULL,
  output_path_folder = NULL,
  start_query = NULL,
  end_query = NULL,
  min_lines = 1,
  trimmed_line = FALSE
)

Arguments

input_path_folder

a character string specifying the folder where the input files are saved

output_path_folder

a character string specifying the folder where the output files should be saved

start_query

a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line int the text file.

end_query

a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.

min_lines

a numeric value specifying the minimum number of lines. For instance if min_lines = 2, then only subsets of text with more than 1 lines will be kept.

trimmed_line

either TRUE or FALSE. If FALSE then each line of the text file will be trimmed both sides before applying the start_query and end_query


Method big_text_tokenizer()

Usage

big_tokenize_transform$big_text_tokenizer(
  input_path_folder = NULL,
  batches = NULL,
  read_file_delimiter = "\n",
  to_lower = FALSE,
  to_upper = FALSE,
  utf_locale = "",
  remove_char = "",
  remove_punctuation_string = FALSE,
  remove_punctuation_vector = FALSE,
  remove_numbers = FALSE,
  trim_token = FALSE,
  split_string = FALSE,
  split_separator = " \r\n\t.,;:()?!//",
  remove_stopwords = FALSE,
  language = "english",
  min_num_char = 1,
  max_num_char = Inf,
  stemmer = NULL,
  min_n_gram = 1,
  max_n_gram = 1,
  skip_n_gram = 1,
  skip_distance = 0,
  n_gram_delimiter = " ",
  concat_delimiter = NULL,
  path_2folder = "",
  stemmer_ngram = 4,
  stemmer_gamma = 0,
  stemmer_truncate = 3,
  stemmer_batches = 1,
  threads = 1,
  save_2single_file = FALSE,
  increment_batch_nr = 1,
  vocabulary_path_folder = NULL
)

Arguments

input_path_folder

a character string specifying the folder where the input files are saved

batches

a numeric value specifying the number of batches to use. The batches will be used to split the initial data into subsets. Those subsets will be either saved in files (big_text_splitter function) or will be used internally for low memory processing (big_text_tokenizer function).

read_file_delimiter

the delimiter to use when the input file will be red (for instance a tab-delimiter or a new-line delimiter).

to_lower

either TRUE or FALSE. If TRUE the character string will be converted to lower case

to_upper

either TRUE or FALSE. If TRUE the character string will be converted to upper case

utf_locale

the language specific locale to use in case that either the to_lower or the to_upper parameter is TRUE and the text file language is other than english. For instance if the language of a text file is greek then the utf_locale parameter should be 'el_GR.UTF-8' ( language_country.encoding ). A wrong utf-locale does not raise an error, however the runtime of the function increases.

remove_char

a character string with specific characters that should be removed from the text file. If the remove_char is "" then no removal of characters take place

remove_punctuation_string

either TRUE or FALSE. If TRUE then the punctuation of the character string will be removed (applies before the split function)

remove_punctuation_vector

either TRUE or FALSE. If TRUE then the punctuation of the vector of the character strings will be removed (after the string split has taken place)

remove_numbers

either TRUE or FALSE. If TRUE then any numbers in the character string will be removed

trim_token

either TRUE or FALSE. If TRUE then the string will be trimmed (left and/or right)

split_string

either TRUE or FALSE. If TRUE then the character string will be split using the split_separator as delimiter. The user can also specify multiple delimiters.

split_separator

a character string specifying the character delimiter(s)

remove_stopwords

either TRUE, FALSE or a character vector of user defined stop words. If TRUE then by using the language parameter the corresponding stop words vector will be uploaded.

language

a character string which defaults to english. If the remove_stopwords parameter is TRUE then the corresponding stop words vector will be uploaded. Available languages are afrikaans, arabic, armenian, basque, bengali, breton, bulgarian, catalan, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hausa, hebrew, hindi, hungarian, indonesian, irish, italian, latvian, marathi, norwegian, persian, polish, portuguese, romanian, russian, slovak, slovenian, somalia, spanish, swahili, swedish, turkish, yoruba, zulu

min_num_char

an integer specifying the minimum number of characters to keep. If the min_num_char is greater than 1 then character strings with more than 1 characters will be returned

max_num_char

an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this function the Inf value translates to a word-length of 1000000000)

stemmer

a character string specifying the stemming method. One of the following porter2_stemmer, ngram_sequential, ngram_overlap. See details for more information.

min_n_gram

an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.

max_n_gram

an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.

skip_n_gram

an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1. The skip_n_gram gives the (max.) n-grams using the skip_distance parameter. If skip_n_gram is greater than 1 then both min_n_gram and max_n_gram should be set to 1.

skip_distance

an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.

n_gram_delimiter

a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)

concat_delimiter

either NULL or a character string specifying the delimiter to use in order to concatenate the end-vector of character strings to a single character string (recommended in case that the end-vector should be saved to a file)

path_2folder

a character string specifying the path to the folder where the file(s) will be saved

stemmer_ngram

a numeric value greater than 1. Applies to both ngram_sequential and ngram_overlap methods. In case of ngram_sequential the first stemmer_ngram characters will be picked, whereas in the case of ngram_overlap the overlapping stemmer_ngram characters will be build.

stemmer_gamma

a float number greater or equal to 0.0. Applies only to ngram_sequential. Is a threshold value, which defines how much frequency deviation of two N-grams is acceptable. It is kept either zero or to a minimum value.

stemmer_truncate

a numeric value greater than 0. Applies only to ngram_sequential. The ngram_sequential is modified to use relative frequencies (float numbers between 0.0 and 1.0 for the ngrams of a specific word in the corpus) and the stemmer_truncate parameter controls the number of rounding digits for the ngrams of the word. The main purpose was to give the same relative frequency to words appearing approximately the same on the corpus.

stemmer_batches

a numeric value greater than 0. Applies only to ngram_sequential. Splits the corpus into batches with the option to run the batches in multiple threads.

threads

an integer specifying the number of cores to run in parallel

save_2single_file

either TRUE or FALSE. If TRUE then the output data will be saved in a single file. Otherwise the data will be saved in multiple files with incremented enumeration

increment_batch_nr

a numeric value. The enumeration of the output files will start from the increment_batch_nr. If the save_2single_file parameter is TRUE then the increment_batch_nr parameter won't be taken into consideration.

vocabulary_path_folder

either NULL or a character string specifying the output folder where the vocabulary batches should be saved (after tokenization and transformation is applied). Applies to the big_text_tokenizer method.


Method vocabulary_accumulator()

Usage

big_tokenize_transform$vocabulary_accumulator(
  input_path_folder = NULL,
  vocabulary_path_file = NULL,
  max_num_chars = 100
)

Arguments

input_path_folder

a character string specifying the folder where the input files are saved

vocabulary_path_file

either NULL or a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation is applied). Applies to the vocabulary_accumulator method.

max_num_chars

a numeric value to limit the words of the output vocabulary to a maximum number of characters (applies to the vocabulary_accumulator function)


Method clone()

The objects of this class are cloneable with this method.

Usage

big_tokenize_transform$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples


library(textTinyR)


# fs <- big_tokenize_transform$new(verbose = FALSE)

#---------------
# file splitter:
#---------------

# fs$big_text_splitter(input_path_file = "input.txt",

#                      output_path_folder = "/folder/output/",

#                      end_query = "endword", batches = 5,

#                      trimmed_line = FALSE)


#-------------
# file parser:
#-------------

# fs$big_text_parser(input_path_folder = "/folder/output/",

#                    output_path_folder = "/folder/parser/",

#                    start_query = "startword", end_query = "endword",

#                    min_lines = 1, trimmed_line = TRUE)


#----------------
# file tokenizer:
#----------------


# fs$big_text_tokenizer(input_path_folder = "/folder/parser/",

#                       batches = 5, split_string=TRUE,

#                       to_lower = TRUE, trim_token = TRUE,

#                       max_num_char = 100, remove_stopwords = TRUE,

#                       stemmer = "porter2_stemmer", threads = 1,

#                       path_2folder="/folder/output_token/",

#                       vocabulary_path_folder="/folder/VOCAB/")

#-------------------
# vocabulary counts:
#-------------------


# fs$vocabulary_accumulator(input_path_folder = "/folder/VOCAB/",

#                           vocabulary_path_file = "/folder/vocab.txt",

#                           max_num_chars = 50)