utils

Classes

utils utility functions

class utils.utils

utility functions

vocabulary_parser(input_path_file=None, vocabulary_path_file=None, start_query=None, end_query=None, min_lines=1, trimmed_line=False, language='english', LOCALE_UTF='', max_num_char=9223372036854775807, REMOVE_characters='', to_lower=False, to_upper=False, remove_punctuation_string=False, remove_punctuation_vector=False, remove_numbers=False, trim_token=False, split_string=False, separator=' \r\n\t., ;:()?!//', remove_stopwords=False, min_num_char=1, stemmer=None, min_n_gram=1, max_n_gram=1, n_gram_delimiter=' ', skip_n_gram=1, skip_distance=0, threads=1, verbose=False)

Parameters:

input_path_file – a character string specifying the path to the input file
vocabulary_path_file – a character string specifying the output file where the vocabulary should be saved (after tokenization and transformation is applied).
start_query – a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.
end_query – a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.
min_lines – a numeric value specifying the minimum number of lines. For instance, if min_lines = 2, then only subsets of text with more than one line will be kept.
trimmed_line – either True or False. If False, each line of the text file is trimmed on both sides before the start_query and end_query are applied.
language – a character string which defaults to 'english'. If the remove_stopwords parameter is True, the corresponding stop words vector is loaded. Available languages: 'afrikaans', 'arabic', 'armenian', 'basque', 'bengali', 'breton', 'bulgarian', 'catalan', 'croatian', 'czech', 'danish', 'dutch', 'english', 'estonian', 'finnish', 'french', 'galician', 'german', 'greek', 'hausa', 'hebrew', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'latvian', 'marathi', 'norwegian', 'persian', 'polish', 'portuguese', 'romanian', 'russian', 'slovak', 'slovenian', 'somalia', 'spanish', 'swahili', 'swedish', 'turkish', 'yoruba', 'zulu'
LOCALE_UTF – the language-specific locale to use when either the to_lower or the to_upper parameter is True and the language of the text file is other than English. For instance, if the language of a text file is Greek then the LOCALE_UTF parameter should be 'el_GR.UTF-8' (language_country.encoding). A wrong locale does not raise an error; however, the runtime of the method increases.
max_num_char – an integer specifying the maximum number of characters to keep. The max_num_char should be less than or equal to Inf (in this method the Inf value translates to a word-length of 1000000000)
REMOVE_characters – a character string with specific characters that should be removed from the text file. If REMOVE_characters is "" then no characters are removed.
to_lower – either True or False. If True the character string will be converted to lower case
to_upper – either True or False. If True the character string will be converted to upper case
remove_punctuation_string – either True or False. If True then the punctuation of the character string will be removed (applies before the split method)
remove_punctuation_vector – either True or False. If True then the punctuation of the vector of the character strings will be removed (after the string split has taken place)
remove_numbers – either True or False. If True then any numbers in the character string will be removed
trim_token – either True or False. If True then the string will be trimmed (left and/or right)
split_string – either True or False. If True then the character string will be split using the separator as delimiter. The user can also specify multiple delimiters.
separator – a character string specifying the character delimiter(s)
remove_stopwords – either True, False or a character vector of user-defined stop words. If True, the stop words vector that corresponds to the language parameter is loaded.
min_num_char – an integer specifying the minimum number of characters to keep. If min_num_char is greater than 1, only character strings with at least min_num_char characters will be returned.
stemmer – a character string specifying the stemming method. Available stemmer is porter2_stemmer.
min_n_gram – an integer specifying the minimum number of n-grams. The minimum number of min_n_gram is 1.
max_n_gram – an integer specifying the maximum number of n-grams. The minimum number of max_n_gram is 1.
n_gram_delimiter – a character string specifying the n-gram delimiter (applies to both n-gram and skip-n-gram cases)
skip_n_gram – an integer specifying the number of skip-n-grams. The minimum number of skip_n_gram is 1.
skip_distance – an integer specifying the skip distance between the words. The minimum value for the skip distance is 0, in which case simple n-grams will be returned.
threads – an integer specifying the number of cores to run in parallel
verbose – either True or False. If True then information will be printed out

Example:

utl = utils()

res = utl.vocabulary_parser(input_path_file = '/myfolder/input_file.txt', vocabulary_path_file = '/myfolder/output_VOCAB.txt', start_query = "<structure>",

                            end_query = "</structure>", to_lower = True, split_string = True)

Note

Returns the vocabulary counts for small or medium (xml) files. For big files, the vocabulary_accumulator method of the big_text_files class is appropriate.

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters.
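The effect of min_n_gram, max_n_gram and n_gram_delimiter can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name word_n_grams is hypothetical, and it assumes that all n-gram sizes between min_n_gram and max_n_gram are generated from an already tokenized line.

```python
from collections import Counter


def word_n_grams(tokens, min_n_gram=1, max_n_gram=1, n_gram_delimiter=" "):
    """Return every contiguous word n-gram with min_n_gram <= n <= max_n_gram.

    Hypothetical helper for illustration only; the library may order or
    build its n-grams differently.
    """
    if min_n_gram < 1 or max_n_gram < min_n_gram:
        raise ValueError("expected 1 <= min_n_gram <= max_n_gram")
    grams = []
    for n in range(min_n_gram, max_n_gram + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(n_gram_delimiter.join(tokens[i:i + n]))
    return grams


tokens = "the quick brown fox".split()
bigrams = word_n_grams(tokens, min_n_gram=2, max_n_gram=2, n_gram_delimiter="_")

# Counting unigrams and bigrams together gives a small "vocabulary".
vocabulary = Counter(word_n_grams(tokens, 1, 2, "_"))
```

Here bigrams holds the three delimiter-joined word pairs of the sentence, and vocabulary maps each unigram and bigram to its frequency.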

utf_locale(language='english')

Parameters:	language – a character string specifying the language for which the utf-locale should be returned

Example:

utl = utils()

res = utl.utf_locale(language = "english")

Note

utf-locale for specific languages

This is a limited list of language-locale. The locale depends mostly on the text input.

bytes_converter(input_path_file=None, unit='MB')

Parameters:

input_path_file – a character string specifying the path to the input file
unit – a character string specifying the unit. One of KB, MB, GB

Example:

utl = utils()

res = utl.bytes_converter(input_path_file = '/myfolder/input_file.txt', unit = "MB")

Note

Returns the size of a text file in the requested unit (KB, MB or GB).
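For comparison, a file size can be obtained with the standard library alone. The sketch below is not the library's code; the function name file_size is hypothetical, and it assumes binary units (1 KB = 1024 bytes), which may differ from what bytes_converter uses.

```python
import os
import tempfile

# Assumption: binary multiples; the library may use a different convention.
FACTORS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def file_size(input_path_file, unit="MB"):
    """Return the size of a file in KB, MB or GB as a float."""
    if unit not in FACTORS:
        raise ValueError("unit must be one of KB, MB, GB")
    return os.path.getsize(input_path_file) / FACTORS[unit]


# Demo on a throwaway file of exactly 2048 bytes.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 2048)
size_kb = file_size(tmp.name, unit="KB")
os.remove(tmp.name)
```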

text_file_parser(input_path_file=None, start_query=None, end_query=None, output_path_file=None, min_lines=1, trimmed_line=False, verbose=False)

Parameters:

input_path_file – a character string specifying the path to the input file
start_query – a character string. The start_query is the first word of the subset of the data and should appear frequently at the beginning of each line in the text file.
end_query – a character string. The end_query is the last word of the subset of the data and should appear frequently at the end of each line in the text file.
output_path_file – a character string specifying the path to the output file
min_lines – a numeric value specifying the minimum number of lines. For instance, if min_lines = 2, then only subsets of text with more than one line will be kept.
trimmed_line – either True or False. If False, each line of the text file is trimmed on both sides before the start_query and end_query are applied.
verbose – either True or False. If True then information will be printed out

Example:

utl = utils()

res = utl.text_file_parser(input_path_file = '/myfolder/input_file.txt', start_query = "<structure>", end_query = "</structure>", output_path_file = '/myfolder/output_file.txt')

Note

The text file should have a structure (such as an xml-structure), so that subsets can be extracted using the start_query and end_query parameters.
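The way start_query, end_query and min_lines interact can be sketched in plain Python. This is an illustration under assumptions, not the library's algorithm: the function name extract_subsets is hypothetical, every line is trimmed first, a subset starts at a line beginning with start_query and ends at the next line ending with end_query.

```python
def extract_subsets(lines, start_query, end_query, min_lines=1):
    """Collect groups of lines delimited by start_query ... end_query.

    Hypothetical helper; subsets shorter than min_lines are dropped.
    """
    subsets = []
    current = None  # lines of the subset being collected, or None
    for raw in lines:
        line = raw.strip()
        if current is None and line.startswith(start_query):
            current = []
        if current is not None:
            current.append(line)
            if line.endswith(end_query):
                if len(current) >= min_lines:
                    subsets.append(current)
                current = None
    return subsets


sample = [
    "<doc>",
    "  <structure>first",
    "  middle",
    "  last</structure>",
    "noise",
    "<structure>single</structure>",
    "</doc>",
]
all_subsets = extract_subsets(sample, "<structure>", "</structure>")
long_subsets = extract_subsets(sample, "<structure>", "</structure>", min_lines=2)
```

With min_lines = 2 the one-line subset is discarded, which mirrors the description of the min_lines parameter above.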

dice_distance(word1=None, word2=None, n_grams=2)

Parameters:

word1 – a character string
word2 – a character string
n_grams – a value specifying the consecutive n-grams of the words

Example:

utl = utils()

res = utl.dice_distance(word1 = 'one_word', word2 = 'two_words', n_grams = 2)

Note

dice similarity of words using n-grams
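As a reference point, the Dice coefficient over character n-grams can be written in a few lines. The sketch below is not taken from the library: the helper names are hypothetical, it treats the n-grams of each word as a set, and whether dice_distance returns the similarity or 1 minus the similarity is not stated here, so both are shown.

```python
def char_n_grams(word, n=2):
    """Return the set of consecutive character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(word1, word2, n_grams=2):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over n-gram sets."""
    a = char_n_grams(word1, n_grams)
    b = char_n_grams(word2, n_grams)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # both words are shorter than n_grams
    return 2 * len(a & b) / total


similarity = dice_similarity("night", "nacht", n_grams=2)
distance = 1 - similarity
```

The words "night" and "nacht" each have four bigrams and share only "ht", so the similarity is 2 * 1 / 8.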

levenshtein_distance(word1=None, word2=None)

Parameters:

word1 – a character string
word2 – a character string

Example:

utl = utils()

res = utl.levenshtein_distance(word1 = 'one_word', word2 = 'two_words')

Note

levenshtein distance of two words
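The Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one word into the other. A minimal dynamic-programming sketch follows; it is a textbook version with a hypothetical function name, not the library's code.

```python
def levenshtein(word1, word2):
    """Edit distance with unit cost for insert, delete and substitute."""
    # previous[j] is the distance between the processed prefix of word1
    # and the first j characters of word2.
    previous = list(range(len(word2) + 1))
    for i, c1 in enumerate(word1, start=1):
        current = [i]
        for j, c2 in enumerate(word2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


d = levenshtein("kitten", "sitting")
```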

cosine_distance(sentence1=None, sentence2=None, split_separator=' ')

Parameters:

sentence1 – a character string consisting of multiple words
sentence2 – a character string consisting of multiple words
split_separator – a character string specifying the delimiter(s) to split the sentence

Example:

utl = utils()

res = utl.cosine_distance(sentence1 = 'this is one sentence', sentence2 = 'this is another sentence', split_separator = " ")

Note

cosine distance of two character strings (each string consists of more than one word)
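One common way to compare two sentences is the cosine of the angle between their word-count vectors. The sketch below illustrates that idea only: the function name is hypothetical, split_separator is treated as a single delimiter, and it is an assumption that the library works on raw term counts. Whether cosine_distance reports the similarity or 1 minus the similarity is not stated here.

```python
import math
from collections import Counter


def cosine_similarity(sentence1, sentence2, split_separator=" "):
    """Cosine similarity of the term-frequency vectors of two sentences."""
    counts1 = Counter(t for t in sentence1.split(split_separator) if t)
    counts2 = Counter(t for t in sentence2.split(split_separator) if t)
    # Counter returns 0 for missing words, so only counts1 is iterated.
    dot = sum(counts1[t] * counts2[t] for t in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


sim = cosine_similarity("this is one sentence", "this is another sentence")
```

The two sentences share three of their four words, so the dot product is 3 and both norms are 2.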

read_characters(input_file=None, characters=100, write_2file='')

Parameters:

input_file – a character string specifying a valid path to a text file
characters – a numeric value specifying the number of characters to read
write_2file – either an empty string ("") or a character string specifying a valid output file to write the subset of the input file

Example:

utl = utils()

res = utl.read_characters(input_file = '/myfolder/input_file.txt', characters = 100, write_2file = "")

Note

read a specific number of characters from a text file

read_rows(input_file=None, read_delimiter='\n', rows=100, write_2file='')

Parameters:

input_file – a character string specifying a valid path to a text file
read_delimiter – a character string specifying the row delimiter of the text file
rows – a numeric value specifying the number of rows to read
write_2file – either an empty string ("") or a character string specifying a valid output file to write the subset of the input file

Example:

utl = utils()

res = utl.read_rows(input_file = '/myfolder/input_file.txt', rows = 100, write_2file = "")

Note

read a specific number of rows from a text file
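Reading only the head of a file avoids loading it entirely into memory. The following standard-library sketch shows the idea for newline-delimited rows; the function name read_first_rows is hypothetical and it does not reproduce the read_delimiter option or the library's return type.

```python
import itertools
import os
import tempfile


def read_first_rows(input_file, rows=100, write_2file=""):
    """Return the first `rows` lines of a text file, line endings stripped."""
    with open(input_file, encoding="utf-8") as handle:
        head = [line.rstrip("\n") for line in itertools.islice(handle, rows)]
    if write_2file:
        # Optionally write the subset to a second file.
        with open(write_2file, "w", encoding="utf-8") as out:
            out.write("\n".join(head) + "\n")
    return head


# Demo on a throwaway file with four rows.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("row 1\nrow 2\nrow 3\nrow 4\n")
first_two = read_first_rows(tmp.name, rows=2)
os.remove(tmp.name)
```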

xml_parser_subroot_elements(input_path_file=None, xml_path=None, output_path_file=None, empty_key='')

Parameters:

input_path_file – a character string specifying a valid path to an xml text file
xml_path – a character string specifying the xml query
output_path_file – a character string specifying a valid output file to write output
empty_key – a character string specifying the replacement word in case that the key in the tree structure is empty

Note

xml file tree traversal for subroot’s attributes, elements and sub-elements using the boost library

The logic behind root-child-subchildren of an xml file is explained in http://www.w3schools.com/xml/xml_tree.asp

Example:

using the following structure as a 'FILE.xml'
---------------------------------------------

<mediawiki>
    <page>
        <title>AccessibleComputing</title>
        <revision>
            <id>631144794</id>
            <parentid>381202555</parentid>
            <timestamp>2014-10-26T04:50:23Z</timestamp>
        </revision>
    </page>
</mediawiki>


example to get a "subchild's element"   (here the xml_path has the form -->  "/root/child/subchild.element.sub-element")
-------------------------------------                

utl = utils()                

res = utl.xml_parser_subroot_elements(input_path_file = "FILE.xml", xml_path = "/mediawiki/page/revision.contributor.id", empty_key = "")


output
------

631144794

                

example to get a "subchild's attribute" ( by using the ".<xmlattr>." in the query )
---------------------------------------

the attribute in this .xml file is:     <redirect title="Computer accessibility"/>        


utl = utils()                

res = utl.xml_parser_subroot_elements(input_path_file = "FILE.xml", xml_path = "/mediawiki/page/redirect.<xmlattr>.title")


output
------

AccessibleComputing
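To see what "element" and "attribute" mean in such a tree, the same kind of file can be walked with Python's built-in xml.etree.ElementTree. This sketch is independent of the library and of the boost path syntax: it uses ElementTree's own path strings, and the sample string (including the redirect element mentioned above) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the structure shown above plus the <redirect> element.
XML = """<mediawiki>
    <page>
        <title>AccessibleComputing</title>
        <redirect title="Computer accessibility"/>
        <revision>
            <id>631144794</id>
            <parentid>381202555</parentid>
            <timestamp>2014-10-26T04:50:23Z</timestamp>
        </revision>
    </page>
</mediawiki>"""

root = ET.fromstring(XML)  # root is the <mediawiki> element

# Text of a sub-element: root -> page -> revision -> id
revision_ids = [el.text for el in root.findall("./page/revision/id")]

# Value of an attribute: root -> page -> redirect, attribute "title"
redirect_titles = [el.get("title") for el in root.findall("./page/redirect")]
```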

xml_parser_root_elements(input_path_file=None, xml_root=None, output_path_file=None)

Parameters:

input_path_file – a character string specifying a valid path to an xml text file
xml_root – a character string specifying the xml query
output_path_file – a character string specifying a valid output file to write output

Note

xml file tree traversal for a root’s attributes using the boost library (repeated tree structure)

Example:

using the following structure as a 'FILE.xml'
---------------------------------------------

<MultiMessage>
    <Message structID="1710" msgID="0" length="50">
        <structure type="AppHeader">
        </structure>
    </Message>
    <Message structID="27057" msgID="27266" length="315">
        <structure type="Container">
            <productID value="166"/>
            <publishTo value="xyz"/>
            <templateID value="97845"/>
        </structure>
    </Message>
</MultiMessage>
<MultiMessage>
    <Message structID="1710" msgID="0" length="50">
        <structure type="AppHeader">
        </structure>
    </Message>
    <Message structID="27057" msgID="27266" length="315">
        <structure type="Container">
            <productID value="166"/>
            <publishTo value="xyz"/>
            <templateID value="97845"/>
        </structure>
    </Message>
</MultiMessage>



example to get a "child's attributes" (use only the root-element of the xml file as a parameter )
-------------------------------------


utl = utils()                

res = utl.xml_parser_root_elements(input_path_file = "FILE.xml", xml_root = "MultiMessage", output_path_file = "")


output
------

  child_keys child_values
0   structID         1710
1      msgID            0
2     length           50
3   structID        27057
4      msgID        27266
5     length          315
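The key/value listing above can be reproduced for a single root element with xml.etree.ElementTree. This is a sketch under assumptions, not the library's method: it parses one MultiMessage block from a string (a file with several top-level roots is not a single well-formed XML document for ElementTree) and returns plain tuples instead of a data frame.

```python
import xml.etree.ElementTree as ET

XML = """<MultiMessage>
    <Message structID="1710" msgID="0" length="50">
        <structure type="AppHeader">
        </structure>
    </Message>
    <Message structID="27057" msgID="27266" length="315">
        <structure type="Container">
            <productID value="166"/>
            <publishTo value="xyz"/>
            <templateID value="97845"/>
        </structure>
    </Message>
</MultiMessage>"""

root = ET.fromstring(XML)

# One (key, value) pair per attribute of each direct child of the root.
rows = [
    (key, value)
    for child in root
    for key, value in child.attrib.items()
]
```

Each tuple in rows corresponds to one line of the child_keys / child_values listing, with the values kept as strings.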