pyabsa.utils.text_utils.bpe_tokenizer

Module Contents

Functions

train_bpe_tokenizer([corpus_files, base_tokenizer, ...])

Train a Byte-Pair Encoding tokenizer.

Attributes

tokenizer

pyabsa.utils.text_utils.bpe_tokenizer.train_bpe_tokenizer(corpus_files=None, base_tokenizer='roberta-base', save_path='bpe_tokenizer', vocab_size=60000, min_frequency=1000, special_tokens=None, **kwargs)[source]

Train a Byte-Pair Encoding tokenizer.

Parameters:

corpus_files – input corpus files, organized line by line.

base_tokenizer – the base tokenizer template from the transformers tokenizers; you can pass the name of any pretrained tokenizer from https://huggingface.co/models (e.g., 'roberta-base').

save_path – save path of the tokenizer.

vocab_size – the size of the vocabulary.

min_frequency – the minimum frequency of the tokens.

special_tokens – special tokens to add to the vocabulary.

**kwargs – additional keyword arguments.

Returns:
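For intuition about what the training step does, the core BPE merge loop can be sketched in plain Python. This is an illustrative, self-contained sketch of the classic byte-pair-encoding algorithm, not pyabsa's internal implementation (which builds on a pretrained transformers tokenizer template); the helper names and the `min_frequency` cutoff here are assumptions chosen to mirror the parameters of train_bpe_tokenizer.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    # Replace every whole-symbol occurrence of the pair with its concatenation.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def train_bpe(corpus_words, num_merges, min_frequency=1):
    # Each word starts as a space-separated sequence of characters.
    vocab = Counter(" ".join(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < min_frequency:
            break  # analogous to the min_frequency cutoff in train_bpe_tokenizer
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

On a toy corpus such as ["low", "low", "lower", "lowest"], two merge steps first learn ("l", "o") and then ("lo", "w"), since those adjacent pairs are the most frequent; real trainers run this loop until the vocabulary reaches vocab_size.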

pyabsa.utils.text_utils.bpe_tokenizer.tokenizer[source]