pyabsa.framework.tokenizer_class.tokenizer_class

Module Contents

Classes

- Tokenizer
- PretrainedTokenizer

Functions

- build_embedding_matrix: Build an embedding matrix for a given tokenizer and config.
- pad_and_truncate: Pad or truncate a sequence to a specified maximum sequence length.
- _load_word_vec: Load word vectors from a given embedding file and return a dictionary of word-to-vector mappings.
- class pyabsa.framework.tokenizer_class.tokenizer_class.Tokenizer(config)[source]
  Bases: object
- text_to_sequence(text: Union[str, List[str]], padding='max_length', **kwargs)[source]
Convert input text to a sequence of token IDs.
  - Parameters:
    text (str or list of str) – Input text to be converted to a sequence of token IDs.
    padding (str, optional, default="max_length") – Padding method to use when the sequence is shorter than max_seq_len.
    **kwargs – Additional arguments, such as reverse.
  - Returns:
    A sequence of token IDs, or a list of such sequences, depending on whether the input is a string or a list of strings.
  - Return type:
    list of int, or list of list of int
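As an illustration of the behaviour documented above, the sketch below converts whitespace-tokenized text to a padded ID sequence using a word-to-index vocabulary. The names word2idx and max_seq_len mirror this page, but the function body is a simplified stand-in, not pyabsa's actual implementation.

```python
# Simplified stand-in for a word-level text_to_sequence: look up each
# token in a word-to-index vocabulary, then truncate/pad to max_seq_len.
def text_to_sequence(text, word2idx, max_seq_len, pad_id=0, unk_id=1):
    """Convert a whitespace-tokenized string to a padded list of IDs."""
    ids = [word2idx.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_seq_len]                      # truncate
    ids += [pad_id] * (max_seq_len - len(ids))   # pad to max_seq_len
    return ids

word2idx = {"the": 2, "food": 3, "was": 4, "great": 5}
seq = text_to_sequence("The food was great", word2idx, max_seq_len=6)
print(seq)  # [2, 3, 4, 5, 0, 0]
```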
- class pyabsa.framework.tokenizer_class.tokenizer_class.PretrainedTokenizer(config, **kwargs)[source]
- text_to_sequence(text, **kwargs)[source]
Encodes the given text into a sequence of token IDs.
- Parameters:
text (str) – Text to be encoded.
**kwargs – Additional arguments to be passed to the tokenizer.
- Returns:
Encoded sequence of token IDs.
- Return type:
torch.Tensor
- sequence_to_text(sequence, **kwargs)[source]
Decodes the given sequence of token IDs into text.
- Parameters:
sequence (list) – Sequence of token IDs.
**kwargs – Additional arguments to be passed to the tokenizer.
- Returns:
Decoded text.
- Return type:
str
- tokenize(text, **kwargs)[source]
Tokenizes the given text into subwords.
- Parameters:
text (str) – Text to be tokenized.
**kwargs – Additional arguments to be passed to the tokenizer.
- Returns:
List of subwords.
- Return type:
list
- convert_tokens_to_ids(return_tensors=None, **kwargs)[source]
Converts the given tokens into token IDs.
- Parameters:
return_tensors (str) – Type of tensor to be returned.
- Returns:
List or tensor of token IDs.
- Return type:
list or torch.Tensor
- convert_ids_to_tokens(ids, **kwargs)[source]
Converts the given token IDs into tokens.
- Parameters:
ids (list) – List of token IDs.
- Returns:
List of tokens.
- Return type:
list
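The two conversion methods are inverses of each other. The toy vocabulary below is invented for illustration; a PretrainedTokenizer delegates both calls to the underlying Hugging Face tokenizer.

```python
# Toy round trip between convert_tokens_to_ids and convert_ids_to_tokens.
vocab = {"[PAD]": 0, "[UNK]": 1, "asp": 2, "##ect": 3, "term": 4}
inv_vocab = {i: t for t, i in vocab.items()}

def convert_tokens_to_ids(tokens):
    # Unknown tokens map to the [UNK] id.
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def convert_ids_to_tokens(ids):
    return [inv_vocab[i] for i in ids]

ids = convert_tokens_to_ids(["asp", "##ect", "term"])
print(ids)                         # [2, 3, 4]
print(convert_ids_to_tokens(ids))  # ['asp', '##ect', 'term']
```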
- encode_plus(text, **kwargs)[source]
Encodes the given text into a sequence of token IDs along with additional information.
- Parameters:
text (str) – Text to be encoded.
**kwargs – Additional arguments to be passed to the tokenizer.
- pyabsa.framework.tokenizer_class.tokenizer_class.build_embedding_matrix(config, tokenizer, cache_path=None)[source]
  Builds an embedding matrix for a given tokenizer and config.
  - Parameters:
    config – A configuration object.
    tokenizer – A tokenizer object.
    cache_path (str) – A string that specifies the cache path.
  - Returns:
    A numpy array of shape (len(tokenizer.word2idx) + 1, config.embed_dim) containing the embedding matrix for the given tokenizer and config.
  - Return type:
    np.ndarray
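The construction can be sketched as follows: row i of the matrix holds the pretrained vector for the word whose index is i, with index 0 (the padding row) and any word without a pretrained vector left as zeros. Plain Python lists stand in for the numpy array the real function returns, and word2idx and word_vec are toy stand-ins for the tokenizer vocabulary and a loaded embedding file.

```python
# Simplified embedding-matrix construction, shape (len(word2idx) + 1, embed_dim).
def build_embedding_matrix(word2idx, word_vec, embed_dim):
    # Row 0 is the padding row; all rows start as zeros.
    matrix = [[0.0] * embed_dim for _ in range(len(word2idx) + 1)]
    for word, idx in word2idx.items():
        vec = word_vec.get(word)
        if vec is not None:
            matrix[idx] = list(vec)  # copy the pretrained vector in
    return matrix

word2idx = {"good": 1, "bad": 2}
word_vec = {"good": [0.1, 0.2, 0.3]}  # "bad" has no pretrained vector
m = build_embedding_matrix(word2idx, word_vec, embed_dim=3)
print(m[1])  # [0.1, 0.2, 0.3]
print(m[2])  # [0.0, 0.0, 0.0]
```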
- pyabsa.framework.tokenizer_class.tokenizer_class.pad_and_truncate(sequence, max_seq_len, value, **kwargs)[source]
Pad or truncate a sequence to a specified maximum sequence length.
- Parameters:
sequence (list or np.ndarray) – The sequence of elements to be padded or truncated.
max_seq_len (int) – The maximum sequence length to pad or truncate to.
value – The value to use for padding.
**kwargs – Additional keyword arguments to ignore.
- Returns:
The padded or truncated sequence, as a list or numpy array, depending on the type of the input sequence.
- Return type:
np.ndarray or list
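A minimal pure-Python version of the pad-or-truncate behaviour documented above (the real function also accepts numpy arrays and preserves the input type):

```python
def pad_and_truncate(sequence, max_seq_len, value=0):
    """Return sequence cut or padded (with value) to exactly max_seq_len."""
    if len(sequence) >= max_seq_len:
        return list(sequence[:max_seq_len])  # truncate
    return list(sequence) + [value] * (max_seq_len - len(sequence))  # pad

print(pad_and_truncate([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_and_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```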
- pyabsa.framework.tokenizer_class.tokenizer_class._load_word_vec(path, word2idx=None, embed_dim=300)[source]
Loads word vectors from a given embedding file and returns a dictionary of word to vector mappings.
- Parameters:
path (str) – Path to the embedding file.
word2idx (dict) – A dictionary containing word to index mappings.
embed_dim (int) – The dimension of the word embeddings.
- Returns:
A dictionary containing word to vector mappings.
- Return type:
  dict
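The loading step can be sketched as below for a GloVe-style text file, where each line is a word followed by embed_dim floats, and only words present in word2idx are kept. io.StringIO stands in for a real embedding file, and the parsing here is an illustrative simplification rather than pyabsa's exact code.

```python
import io

def load_word_vec(fh, word2idx=None, embed_dim=3):
    """Parse lines of "word v1 v2 ... v{embed_dim}" into a word->vector dict."""
    word_vec = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        # The word may itself contain spaces, so take the last embed_dim
        # fields as the vector and join the rest back into the word.
        word = " ".join(parts[:-embed_dim])
        values = parts[-embed_dim:]
        if word2idx is None or word in word2idx:
            word_vec[word] = [float(v) for v in values]
    return word_vec

fake_file = io.StringIO("good 0.1 0.2 0.3\nbad 0.4 0.5 0.6\n")
vecs = load_word_vec(fake_file, word2idx={"good": 1}, embed_dim=3)
print(vecs)  # {'good': [0.1, 0.2, 0.3]}
```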