pyabsa.framework.tokenizer_class.tokenizer_class

Module Contents

Classes

Tokenizer

PretrainedTokenizer

Functions

build_embedding_matrix(config, tokenizer[, cache_path])

Function to build an embedding matrix for a given tokenizer and config.

pad_and_truncate(sequence, max_seq_len, value, **kwargs)

Pad or truncate a sequence to a specified maximum sequence length.

_load_word_vec(path[, word2idx, embed_dim])

Loads word vectors from a given embedding file and returns a dictionary of word to vector mappings.

class pyabsa.framework.tokenizer_class.tokenizer_class.Tokenizer(config)[source]

Bases: object

static build_tokenizer(config, cache_path=None, pre_tokenizer=None, **kwargs)[source]

Build a Tokenizer for the given config, reusing a cached tokenizer from cache_path when one is available.

fit_on_text(text: str | List[str], **kwargs)[source]

Build the tokenizer vocabulary from the given text or list of texts.

text_to_sequence(text: str | List[str], padding='max_length', **kwargs)[source]

Convert input text to a sequence of token IDs.

Parameters:
  • text (str or list of str) – Input text to be converted to a sequence of token IDs.

  • padding (str, optional, default="max_length") – Padding method to use when the sequence is shorter than max_seq_len.

  • **kwargs – Additional arguments, such as reverse.

Returns:

Sequence of token IDs, or a list of such sequences when the input text is a list of strings.

Return type:

list of int or list of list of int

sequence_to_text(sequence)[source]

Convert a sequence of token IDs back to text.

Parameters:
  • sequence (list of int) – Sequence of token IDs to be converted to text.
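
Example (a minimal sketch; the SimpleNamespace config and its max_seq_len field are assumptions standing in for a full pyabsa config object):

    from types import SimpleNamespace

    from pyabsa.framework.tokenizer_class.tokenizer_class import Tokenizer

    # Assumed minimal config; real code would pass a pyabsa config object here.
    config = SimpleNamespace(max_seq_len=10)

    tokenizer = Tokenizer(config)
    tokenizer.fit_on_text("the food was great but the service was slow")

    ids = tokenizer.text_to_sequence("the food was great")  # padded list of int
    text = tokenizer.sequence_to_text(ids)                  # back to text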

class pyabsa.framework.tokenizer_class.tokenizer_class.PretrainedTokenizer(config, **kwargs)[source]
text_to_sequence(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Encoded sequence of token IDs.

Return type:

torch.Tensor

sequence_to_text(sequence, **kwargs)[source]

Decodes the given sequence of token IDs into text.

Parameters:
  • sequence (list) – Sequence of token IDs.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Decoded text.

Return type:

str
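
Example (a hedged sketch; the config fields pretrained_bert and max_seq_len are assumptions about what PretrainedTokenizer reads from its config, and loading the backbone may require network access):

    from types import SimpleNamespace

    from pyabsa.framework.tokenizer_class.tokenizer_class import PretrainedTokenizer

    # Assumed config fields; a real pyabsa config carries many more options.
    config = SimpleNamespace(pretrained_bert="bert-base-uncased", max_seq_len=80)

    tokenizer = PretrainedTokenizer(config)
    ids = tokenizer.text_to_sequence("the battery life is excellent")
    print(tokenizer.sequence_to_text(ids))  # decoded text, special tokens included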

tokenize(text, **kwargs)[source]

Tokenizes the given text into subwords.

Parameters:
  • text (str) – Text to be tokenized.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

List of subwords.

Return type:

list
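
For instance, with a WordPiece-style backbone (the output shown is illustrative only):

    # tokenizer: the PretrainedTokenizer from the sketch above
    tokens = tokenizer.tokenize("unbelievably tasty")
    # e.g. ['un', '##believ', '##ably', 'tasty'] for a WordPiece vocabulary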

convert_tokens_to_ids(return_tensors=None, **kwargs)[source]

Converts the given tokens into token IDs.

Parameters:
  • return_tensors (str) – Type of tensor to be returned.

Returns:

List or tensor of token IDs.

Return type:

list or torch.Tensor

convert_ids_to_tokens(ids, **kwargs)[source]

Converts the given token IDs into tokens.

Parameters:
  • ids (list) – List of token IDs.

Returns:

List of tokens.

Return type:

list
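
A round-trip sketch for the two conversion helpers. Note that the documented signature of convert_tokens_to_ids lists only return_tensors, so passing the tokens positionally, as below, is an assumption:

    # tokenizer: the PretrainedTokenizer from the sketch above
    tokens = tokenizer.tokenize("great food")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # list of int (or a tensor)
    back = tokenizer.convert_ids_to_tokens(ids)    # list of subword strings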

encode_plus(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs along with additional information.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

encode(text, **kwargs)[source]

Encodes the given text into a sequence of token IDs.

Parameters:
  • text (str) – Text to be encoded.

  • **kwargs – Additional arguments to be passed to the tokenizer.

Returns:

Encoded sequence of token IDs.

Return type:

torch.Tensor
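
Example (a sketch; that the "additional information" from encode_plus includes attention masks and similar extras is an assumption based on the Hugging Face tokenizers these classes wrap):

    # tokenizer: the PretrainedTokenizer from the sketch above
    ids = tokenizer.encode("the service was friendly")         # token-ID tensor
    extras = tokenizer.encode_plus("the service was friendly")  # IDs plus extras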

decode(sequence, **kwargs)[source]

Decodes the given sequence of token IDs into text.

pyabsa.framework.tokenizer_class.tokenizer_class.build_embedding_matrix(config, tokenizer, cache_path=None)[source]

Function to build an embedding matrix for a given tokenizer and config.

Parameters:
  • config – A configuration object.

  • tokenizer – A tokenizer object.

  • cache_path (str, optional) – A string that specifies the cache path.

Returns:

A numpy array of shape (len(tokenizer.word2idx) + 1, config.embed_dim) containing the embedding matrix for the given tokenizer and config.

Return type:

np.ndarray
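
Example (a sketch continuing the Tokenizer example above; it assumes config also carries embed_dim and whatever the implementation needs to locate pretrained word vectors):

    from pyabsa.framework.tokenizer_class.tokenizer_class import build_embedding_matrix

    # config, tokenizer: from the Tokenizer sketch above, with config.embed_dim set
    embedding_matrix = build_embedding_matrix(config, tokenizer)
    assert embedding_matrix.shape == (len(tokenizer.word2idx) + 1, config.embed_dim)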

pyabsa.framework.tokenizer_class.tokenizer_class.pad_and_truncate(sequence, max_seq_len, value, **kwargs)[source]

Pad or truncate a sequence to a specified maximum sequence length.

Parameters:
  • sequence (list or np.ndarray) – The sequence of elements to be padded or truncated.

  • max_seq_len (int) – The maximum sequence length to pad or truncate to.

  • value – The value to use for padding.

  • **kwargs – Additional keyword arguments to ignore.

Returns:

The padded or truncated sequence, as a list or numpy array, depending on the type of the input sequence.

Return type:

np.ndarray or list
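
Example (the padding and truncation sides shown in the comments are assumptions; the docstring does not specify them):

    from pyabsa.framework.tokenizer_class.tokenizer_class import pad_and_truncate

    padded = pad_and_truncate([7, 3, 9], max_seq_len=5, value=0)
    # length 5, e.g. [7, 3, 9, 0, 0] if padding is applied on the right

    truncated = pad_and_truncate([1, 2, 3, 4, 5, 6], max_seq_len=4, value=0)
    # length 4, e.g. [1, 2, 3, 4] if truncation keeps the head of the sequence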

pyabsa.framework.tokenizer_class.tokenizer_class._load_word_vec(path, word2idx=None, embed_dim=300)[source]

Loads word vectors from a given embedding file and returns a dictionary of word to vector mappings.

Parameters:
  • path (str) – Path to the embedding file.

  • word2idx (dict) – A dictionary containing word to index mappings.

  • embed_dim (int) – The dimension of the word embeddings.

Returns:

A dictionary containing word to vector mappings.

Return type:

dict
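
Example (the embedding file path and the vocabulary below are hypothetical; _load_word_vec is a private helper, shown here for completeness):

    from pyabsa.framework.tokenizer_class.tokenizer_class import _load_word_vec

    # Hypothetical GloVe file and vocabulary, for illustration only.
    word2idx = {"good": 1, "slow": 2}
    word_vec = _load_word_vec("glove.840B.300d.txt", word2idx=word2idx, embed_dim=300)
    vec = word_vec.get("good")  # vector of length 300, or None if not in the file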