vignettes/the_fastText_R_package.Rmd
the_fastText_R_package.Rmd
This vignette explains the functionality of the fastText R package.
This R package is an interface to the fasttext library
for efficient learning of word representations and sentence
classification. The following functions are included,
fastText | |
---|---|
fasttext_interface | Interface for the fasttext library |
plot_progress_logs | Plot the progress of loss, learning-rate and word-counts |
printAnalogiesUsage | Print Usage Information when the command equals to ‘analogies’ |
printDumpUsage | Print Usage Information when the command equals to ‘dump’ |
printNNUsage | Print Usage Information when the command equals to ‘nn’ |
printPredictUsage | Print Usage Information when the command equals to ‘predict’ or ‘predict-prob’ |
printPrintNgramsUsage | Print Usage Information when the command equals to ‘print-ngrams’ |
printPrintSentenceVectorsUsage | Print Usage Information when the command equals to ‘print-sentence-vectors’ |
printPrintWordVectorsUsage | Print Usage Information when the command equals to ‘print-word-vectors’ |
printQuantizeUsage | Print Usage Information when the command equals to ‘quantize’ |
printTestLabelUsage | Print Usage Information when the command equals to ‘test-label’ |
printTestUsage | Print Usage Information when the command equals to ‘test’ |
printUsage | Print Usage Information for all parameters |
print_parameters | Print the parameters for a specific command |
I’ll explain each function separately based on example data. More information can be found in the package documentation.
This function prints information about the default parameters for a specific ‘command’. The ‘command’ can be for instance supervised, skipgram, cbow etc.,
library(fastText)
print_parameters(command = "supervised")
Empty input or output path.
:
The following arguments are mandatory-input training file path
-output output file path
:
The following arguments are optional-verbose verbosity level [2]
for the dictionary are optional:
The following arguments -minCount minimal number of word occurences [1]
-minCountLabel minimal number of label occurences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
for training are optional:
The following arguments -lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax, one-vs-all} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [false]
for quantization are optional:
The following arguments -cutoff number of words and ngrams to retain [0]
-retrain whether embeddings are finetuned if a cutoff is applied [false]
-qnorm whether the norm is quantized separately [false]
-qout whether the classifier is quantized [false]
-dsub size of each sub-vector [2]
in give_args_fasttext(args = c("fasttext", command)) :
Error -- args.cc file -- Args::parseArgs function EXIT_FAILURE
Each one of the functions which includes the words print and Usage allows a user to print information for this specific function. For instance,
printPredictUsage()
: fasttext predict[-prob] <model> <test-data> [<k>] [<th>]
usage
<model> model filename
<test-data> test data filename (if -, read from stdin)
<k> (optional; 1 by default) predict top k labels
<th> (optional; 0.0 by default) probability threshold
This function allows the user to run the various methods included in the fasttext library from within R. The data that I’ll use in the following code snippets can be downloaded as a .zip file (named as fastText_data) from my Github repository. The user should then unzip the file and make the extracted folder his / hers default directory (using the base R function setwd()) before running the following code chunks.
setwd('fastText_data') # make the extracted data the default directory
#------
# cbow
#------
library(fastText)
= list(command = 'cbow',
list_params lr = 0.1,
dim = 50,
input = "example_text.txt",
output = file.path(tempdir(), 'word_vectors'),
verbose = 2,
thread = 1)
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'cbow_logs.txt'),
MilliSecs = 5,
remove_previous_file = TRUE,
print_process_time = TRUE)
Read 0M words: 8
Number of words: 0
Number of labels: 100.0% words/sec/thread: 2933 lr: 0.000000 loss: 4.060542 ETA: 0h 0m
Progress
: 3.542332 secs time to complete
The data is saved in the specified tempdir() folder for illustration purposes. The user is advised to specify his / her own folder.
#-----------
# supervised
#-----------
= list(command = 'supervised',
list_params lr = 0.1,
dim = 50,
input = file.path("cooking.stackexchange", "cooking.train"),
output = file.path(tempdir(), 'model_cooking'),
verbose = 2,
thread = 4)
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'sup_logs.txt'),
MilliSecs = 5,
remove_previous_file = TRUE,
print_process_time = TRUE)
Read 0M words: 14543
Number of words: 735
Number of labels: 100.0% words/sec/thread: 63282 lr: 0.000000 loss: 10.049338 ETA: 0h 0m
Progress
: 3.449003 secs time to complete
The user has here also the option to plot the progress of loss, learning-rate and word-counts,
res = plot_progress_logs(path = file.path(tempdir(), 'sup_logs.txt'),
plot = TRUE)
dim(res)
The verbosity for the logs-file depends on,
parameters.
The next command can be utilized to ‘predict’ new data based on the output model,
#-------------------
# 'predict' function
#-------------------
list_params = list(command = 'predict',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'predict_valid.txt'))
These output predictions will be of the following form ’__label__food-safety’ , where each line will represent a new label (number of lines of the input data must match the number of lines of the output data). With the ‘predict-prob’ command someone can obtain the probabilities of the labels as well,
#------------------------
# 'predict-prob' function
#------------------------
list_params = list(command = 'predict-prob',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'predict_valid_prob.txt'))
Using ‘predict-prob’ the output predictions will be of the following form ’__label__baking 0.0282927’
Once the model was trained, someone can evaluate it by computing the precision and recall at ‘k’ on a test set. The ‘test’ command just prints the metrics in the R session,
#----------------
# 'test' function
#----------------
= list(command = 'test',
list_params model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
= fasttext_interface(list_params)
res
3000
N @1 0.138
P@1 0.060 R
whereas the ‘test-label’ command allows the user to save,
#----------------------
# 'test-label' function
#----------------------
list_params = list(command = 'test-label',
model = file.path(tempdir(), 'model_cooking.bin'),
test_data = file.path('cooking.stackexchange', 'cooking.valid'),
k = 1,
th = 0.0)
res = fasttext_interface(list_params,
path_output = file.path(tempdir(), 'test_label_valid.txt'))
the output to the ‘test_label_valid.txt’ file, which includes the ‘Precision’ and ‘Recall’ for each unique label on the data set (‘cooking.stackexchange.txt’). That means the number of rows of the ‘test_label_valid.txt’ must be equal to the unique labels in the ‘cooking.stackexchange.txt’ data set. This can be verified using the following code snippet,
= read.delim(file.path("cooking.stackexchange", "cooking.stackexchange.txt"),
st_dat stringsAsFactors = FALSE)
= unlist(lapply(1:nrow(st_dat), function(y)
res_stackexch
strsplit(st_dat[y, ], " ")[[1]][which(sapply(strsplit(st_dat[y, ], " ")[[1]], function(x)
substr(x, 1, 9) == "__label__") == T)])
)
= read.table(file.path(tempdir(), 'test_label_valid.txt'),
test_label_valid quote="\"", comment.char="")
# number of unique labels of data equal to the rows of the 'test_label_valid.txt' file
length(unique(res_stackexch)) == nrow(test_label_valid)
1] TRUE
[
head(test_label_valid)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V101 F1-Score : 0.234244 Precision : 0.139535 Recall : 0.729167 __label__baking
2 F1-Score : 0.227746 Precision : 0.132571 Recall : 0.807377 __label__food-safety
3 F1-Score : 0.058824 Precision : 0.750000 Recall : 0.030612 __label__substitutions
4 F1-Score : 0.000000 Precision : -------- Recall : 0.000000 __label__equipment
5 F1-Score : 0.017699 Precision : 1.000000 Recall : 0.008929 __label__bread
6 F1-Score : 0.000000 Precision : -------- Recall : 0.000000 __label__chicken
.....
The user can also ‘quantize’ a supervised model to reduce its memory usage with the following command,
#---------------------
# 'quantize' function
#---------------------
= list(command = 'quantize',
list_params input = file.path(tempdir(), 'model_cooking.bin'),
output = file.path(tempdir(), 'model_cooking'))
= fasttext_interface(list_params)
res
print(list.files(tempdir(), pattern = '.ftz'))
1] "model_cooking.ftz" [
The quantize function is currenlty (as of 01/02/2019) single-threaded.
Based on the ‘queries.txt’ text file the user can save the word vectors to a file using the following command ( one vector per line ),
#----------------------------
# print-word-vectors function
#----------------------------
list_params = list(command = 'print-word-vectors',
model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
path_input = 'queries.txt',
path_output = file.path(tempdir(), 'word_vecs_queries.txt'))
To compute vector representations of sentences or paragraphs use the following command,
#--------------------------------
# print-sentence-vectors function
#--------------------------------
library(fastText)
list_params = list(command = 'print-sentence-vectors',
model = file.path(tempdir(), 'model_cooking.bin'))
res = fasttext_interface(list_params,
path_input = 'text_sentence.txt',
path_output = file.path(tempdir(), 'word_sentence_queries.txt'))
Be aware that for the ‘print-sentence-vectors’ the ‘word_sentence_queries.txt’ file should consist of sentences of the following form,
</s>
How much does potato starch affect a cheese sauce recipein acidic environments</s>
Dangerous pathogens capable of growing </s>
How do I cover up the white spots on my cast iron stove</s>
How do I cover up the white spots on my cast iron stoveif the chef is not there</s>
Michelin Three Star Restaurant but ......
Therefore each line should end in EOS (end of sentence ) and that if at the end of the file a newline exists then the function will return an additional word vector. Thus the user should make sure that the input file does not include empty lines at the end of the file.
The ‘print-ngrams’ command prints the n-grams of a word in the R session or saves the n-grams to a file. But first the user should save the model and word-vectors with n-grams enabled (minn, maxn parameters)
#----------------------------------------
# 'skipgram' function with n-gram enabled
#----------------------------------------
= list(command = 'skipgram',
list_params lr = 0.1,
dim = 50,
input = "example_text.txt",
output = file.path(tempdir(), 'word_vectors'),
verbose = 2,
thread = 1,
minn = 2,
maxn = 2)
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'skipgram_logs.txt'),
MilliSecs = 5)
#-----------------------
# 'print-ngram' function
#-----------------------
= list(command = 'print-ngrams',
list_params model = file.path(tempdir(), 'word_vectors.bin'),
word = 'word')
# save output to file
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'ngrams.txt'))
# print output to console
= fasttext_interface(list_params,
res path_output = "")
# truncated output for the 'word' query
#--------------------------------------
<w -0.00983 0.00178 0.01929 0.00994 ......
0.01114 0.01570 0.01294 0.01284 ......
wo -0.01035 -0.01585 -0.01671 -0.00043 ......
or 0.01840 -0.00378 -0.01468 0.01088 ......
rd > 0.00749 0.00720 0.01171 0.01258 ...... d
The ‘nn’ command returns the nearest neighbors for a specific word based on the input model,
#--------------
# 'nn' function
#--------------
= list(command = 'nn',
list_params model = file.path(tempdir(), 'model_cooking.bin'),
k = 5,
query_word = 'sauce')
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'nearest.txt'))
# 'nearest.txt'
#--------------
0.804595
rice </s> 0.799858
0.78893
Vide 0.788918
store 0.785977 cheese
The ‘analogies’ command works for triplets of words (separated by whitespace) and returns ‘k’ rows for each line (triplet) of the input file (separated by an empty line),
#---------------------
# 'analogies' function
#---------------------
= list(command = 'analogies',
list_params model = file.path(tempdir(), 'model_cooking.bin'),
k = 5)
= fasttext_interface(list_params,
res path_input = 'analogy_queries.txt',
path_output = file.path(tempdir(), 'analogies_output.txt'))
# 'analogies_output.txt'
#-----------------------
0.857213
batter 0.854491
I 0.851498
recipe? 0.845269
substituted 0.842508
flour
0.808651
covered 0.801348
calls 0.800051
fresh 0.797468
cold 0.793695
always
.............
Finally, the ‘dump’ function takes as ‘option’ one of the ‘args’, ‘dict’, ‘input’ or ‘output’ and dumps the output to a text file,
#--------------
# dump function
#--------------
= list(command = 'dump',
list_params model = file.path(tempdir(), 'model_cooking.bin'),
option = 'args')
= fasttext_interface(list_params,
res path_output = file.path(tempdir(), 'dump_data.txt'),
remove_previous_file = TRUE)
# 'dump_data.txt'
#----------------
50
dim 5
ws 5
epoch 1
minCount 5
neg 1
wordNgrams
loss softmax
model sup0
bucket 0
minn 0
maxn 100
lrUpdateRate 0.00010 t