pyabsa.framework.tokenizer_class.tokenizer_class

Module Contents

Classes

Tokenizer

PretrainedTokenizer

Functions

build_embedding_matrix(config, tokenizer[, cache_path])

Function to build an embedding matrix for a given tokenizer and config.

pad_and_truncate(sequence, max_seq_len, value, **kwargs)

Pad or truncate a sequence to a specified maximum sequence length.

_load_word_vec(path[, word2idx, embed_dim])

Loads word vectors from a given embedding file and returns a dictionary of word to vector mappings.

class pyabsa.framework.tokenizer_class.tokenizer_class.Tokenizer(config)[source]

Bases: object

static build_tokenizer(config, cache_path=None, pre_tokenizer=None, **kwargs)[source]

Build a Tokenizer for the given config, reusing a cached tokenizer from cache_path when one is available.

fit_on_text(text: str | List[str], **kwargs)[source]

Build the tokenizer vocabulary from the given text or list of texts.

text_to_sequence(text: str | List[str], padding='max_length', **kwargs)[source]

Convert input text to a sequence of token IDs.

Parameters:
  • text (str or list of str) – Input text to be converted to a sequence of token IDs.

  • padding (str, optional, default="max_length") – Padding method to use when the sequence is shorter than max_seq_len.

  • **kwargs – Additional arguments, such as reverse.

Returns:

Sequence of token IDs, or a list of such sequences when the input text is a list of strings.

Return type:

list of int or list of list of int

sequence_to_text(sequence)[source]

Convert a sequence of token IDs back to text.

Parameters:
  • sequence (list of int) – Sequence of token IDs to be converted to text.
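
Example (a minimal sketch; the SimpleNamespace config and its max_seq_len field are assumptions standing in for a full pyabsa config object):

    from types import SimpleNamespace

    from pyabsa.framework.tokenizer_class.tokenizer_class import Tokenizer

    # Assumed minimal config; real code would pass a pyabsa config object here.
    config = SimpleNamespace(max_seq_len=10)

    tokenizer = Tokenizer(config)
    tokenizer.fit_on_text("the food was great but the service was slow")

    ids = tokenizer.text_to_sequence("the food was great")  # padded list of int
    text = tokenizer.sequence_to_text(ids)                  # back to text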

class pyabsa.framework.tokenizer_class.tokenizer_class.PretrainedTokenizer(config, **kwargs)[source]
text_to_sequence(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Encoded sequence of token IDs.

Return type:

torch.Tensor

sequence_to_text(sequence, **kwargs)[source]

Decodes the given sequence of token IDs into text.

Parameters:
  • sequence (list) – Sequence of token IDs.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Decoded text.

Return type:

str
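
Example (a hedged sketch; the config fields pretrained_bert and max_seq_len are assumptions about what PretrainedTokenizer reads from its config, and loading the backbone may require network access):

    from types import SimpleNamespace

    from pyabsa.framework.tokenizer_class.tokenizer_class import PretrainedTokenizer

    # Assumed config fields; a real pyabsa config carries many more options.
    config = SimpleNamespace(pretrained_bert="bert-base-uncased", max_seq_len=80)

    tokenizer = PretrainedTokenizer(config)
    ids = tokenizer.text_to_sequence("the battery life is excellent")
    print(tokenizer.sequence_to_text(ids))  # decoded text, special tokens included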

tokenize(text, **kwargs)[source]

Tokenizes the given text into subwords.

Parameters:
  • text (str) – Text to be tokenized.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

List of subwords.

Return type:

list
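
For instance, with a WordPiece-style backbone (the output shown is illustrative only):

    # tokenizer: the PretrainedTokenizer from the sketch above
    tokens = tokenizer.tokenize("unbelievably tasty")
    # e.g. ['un', '##believ', '##ably', 'tasty'] for a WordPiece vocabulary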

convert_tokens_to_ids(return_tensors=None, **kwargs)[source]

Converts the given tokens into token IDs.

Parameters:
  • return_tensors (str) – Type of tensor to be returned.

Returns:

List or tensor of token IDs.

Return type:

list or torch.Tensor

convert_ids_to_tokens(ids, **kwargs)[source]

Converts the given token IDs into tokens.

Parameters:
  • ids (list) – List of token IDs.

Returns:

List of tokens.

Return type:

list
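
A round-trip sketch for the two conversion helpers. Note that the documented signature of convert_tokens_to_ids lists only return_tensors, so passing the tokens positionally, as below, is an assumption:

    # tokenizer: the PretrainedTokenizer from the sketch above
    tokens = tokenizer.tokenize("great food")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # list of int (or a tensor)
    back = tokenizer.convert_ids_to_tokens(ids)    # list of subword strings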

encode_plus(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs along with additional information.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

encode(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Encoded sequence of token IDs.

Return type:

torch.Tensor
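
Example (a sketch; that the "additional information" from encode_plus includes attention masks and similar extras is an assumption based on the Hugging Face tokenizers these classes wrap):

    # tokenizer: the PretrainedTokenizer from the sketch above
    ids = tokenizer.encode("the service was friendly")         # token-ID tensor
    extras = tokenizer.encode_plus("the service was friendly")  # IDs plus extras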

decode(sequence, **kwargs)[source]

Decodes the given sequence of token IDs into text.

pyabsa.framework.tokenizer_class.tokenizer_class.build_embedding_matrix(config, tokenizer, cache_path=None)[source]

Function to build an embedding matrix for a given tokenizer and config.

Parameters:
  • config – A configuration object.

  • tokenizer – A tokenizer object.

  • cache_path (str, optional) – A string that specifies the cache path.

Returns:

A numpy array of shape (len(tokenizer.word2idx) + 1, config.embed_dim) containing the embedding matrix for the given tokenizer and config.

Return type:

np.ndarray
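
Example (a sketch continuing the Tokenizer example above; it assumes config also carries embed_dim and whatever the implementation needs to locate pretrained word vectors):

    from pyabsa.framework.tokenizer_class.tokenizer_class import build_embedding_matrix

    # config, tokenizer: from the Tokenizer sketch above, with config.embed_dim set
    embedding_matrix = build_embedding_matrix(config, tokenizer)
    assert embedding_matrix.shape == (len(tokenizer.word2idx) + 1, config.embed_dim)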

pyabsa.framework.tokenizer_class.tokenizer_class.pad_and_truncate(sequence, max_seq_len, value, **kwargs)[source]

Pad or truncate a sequence to a specified maximum sequence length.

Parameters:
  • sequence (list or np.ndarray) – The sequence of elements to be padded or truncated.

  • max_seq_len (int) – The maximum sequence length to pad or truncate to.

  • value – The value to use for padding.

  • **kwargs – Additional keyword arguments to ignore.

Returns:

The padded or truncated sequence, as a list or numpy array, depending on the type of the input sequence.

Return type:

np.ndarray or list
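
Example (the padding and truncation sides shown in the comments are assumptions; the docstring does not specify them):

    from pyabsa.framework.tokenizer_class.tokenizer_class import pad_and_truncate

    padded = pad_and_truncate([7, 3, 9], max_seq_len=5, value=0)
    # length 5, e.g. [7, 3, 9, 0, 0] if padding is applied on the right

    truncated = pad_and_truncate([1, 2, 3, 4, 5, 6], max_seq_len=4, value=0)
    # length 4, e.g. [1, 2, 3, 4] if truncation keeps the head of the sequence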

pyabsa.framework.tokenizer_class.tokenizer_class._load_word_vec(path, word2idx=None, embed_dim=300)[source]

Loads word vectors from a given embedding file and returns a dictionary of word to vector mappings.

Parameters:
  • path (str) – Path to the embedding file.

  • word2idx (dict) – A dictionary containing word to index mappings.

  • embed_dim (int) – The dimension of the word embeddings.

Returns:

A dictionary containing word to vector mappings.

Return type:

dict
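
Example (the embedding file path and the vocabulary below are hypothetical; _load_word_vec is a private helper, shown here for completeness):

    from pyabsa.framework.tokenizer_class.tokenizer_class import _load_word_vec

    # Hypothetical GloVe file and vocabulary, for illustration only.
    word2idx = {"good": 1, "slow": 2}
    word_vec = _load_word_vec("glove.840B.300d.txt", word2idx=word2idx, embed_dim=300)
    vec = word_vec.get("good")  # vector of length 300, or None if not in the file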