pyabsa.utils

Subpackages

Submodules

Package Contents

Classes

DatasetItem

Built-in mutable sequence.

VoteEnsemblePredictor

Functions

make_ABSA_dataset(dataset_name_or_path[, checkpoint])

Make APC and ATEPC datasets for PyABSA, using the aspect extractor from PyABSA to automatically build datasets. This method WILL NOT give you the best performance but is quite fast and labor-free.

generate_inference_set_for_apc(dataset_path)

Generate inference set for APC dataset. This function only works for APC datasets located in integrated_datasets.

convert_apc_set_to_atepc_set(path[, use_tokenizer])

Converts an APC dataset to an ATEPC dataset.

train_word2vec([corpus_files, save_path, vector_dim, ...])

Train a Word2Vec model on a given corpus and save the resulting model and vectors to disk.

train_bpe_tokenizer([corpus_files, base_tokenizer, ...])

Train a Byte-Pair Encoding tokenizer.

download_all_available_datasets(**kwargs)

Download datasets from GitHub.

download_dataset_by_name([task_code, dataset_name])

If downloading all datasets fails, try to download a dataset by name from Hugging Face.

load_dataset_from_file(fname, config)

Loads a dataset from one or multiple files.

class pyabsa.utils.DatasetItem(dataset_name, dataset_items=None)[source]

Bases: list

Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified.
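
A minimal usage sketch follows; the dataset names are illustrative, and the dataset_name attribute shown at the end is an assumption about how the name is stored on the item. Since DatasetItem subclasses list, it can be indexed and iterated like any other list.

    from pyabsa.utils import DatasetItem

    # Wrap a single integrated dataset by name (illustrative name).
    laptop = DatasetItem("Laptop14")

    # Group several datasets (or local dataset paths) under one custom name,
    # assuming dataset_items accepts an iterable of names/paths.
    combined = DatasetItem("Laptop14_and_Restaurant14", ["Laptop14", "Restaurant14"])

    print(combined.dataset_name)  # "Laptop14_and_Restaurant14" (assumed attribute)
    print(list(combined))         # ["Laptop14", "Restaurant14"]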

pyabsa.utils.make_ABSA_dataset(dataset_name_or_path, checkpoint='english')[source]

Make APC and ATEPC datasets for PyABSA, using the aspect extractor from PyABSA to automatically build datasets. This method WILL NOT give you the best performance but is quite fast and labor-free. The names of the dataset files to be processed should end with ‘.raw.ignore’. The files will be processed and saved to the same directory, and will be overwritten if they already exist. The data in the dataset files should be plain text, one sample per row.

For the best performance, you should use the DPT tool in ABSADatasets to manually annotate the dataset files; it can be found at https://github.com/yangheng95/ABSADatasets/tree/v1.2/DPT . This tool should be downloaded and run in a browser, which is much more time-consuming.

Parameters:
  • dataset_name_or_path – The name or path of the dataset to be processed. If it is a directory, all files in the directory are located with findfile and processed; if it is a file, only that file is processed.

  • checkpoint – Which checkpoint to use; select from {‘multilingual’, ‘english’, ‘chinese’}. Default is ‘english’.
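
A minimal sketch of how this might be called, assuming a local folder my_corpus containing files such as reviews.raw.ignore with one plain-text sentence per line (all names are illustrative):

    from pyabsa.utils import make_ABSA_dataset

    # Annotate every *.raw.ignore file under my_corpus/ with the English
    # aspect extractor checkpoint; the generated APC and ATEPC files are
    # written back to the same directory.
    make_ABSA_dataset(dataset_name_or_path="my_corpus", checkpoint="english")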

pyabsa.utils.generate_inference_set_for_apc(dataset_path)[source]

Generate inference set for APC dataset. This function only works for APC datasets located in integrated_datasets.
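
For example (the dataset path below is illustrative; the function expects an APC dataset inside integrated_datasets):

    from pyabsa.utils import generate_inference_set_for_apc

    # Generate an inference set next to the existing train/test files of an
    # integrated APC dataset (path is illustrative).
    generate_inference_set_for_apc("integrated_datasets/apc_datasets/Laptop14")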

pyabsa.utils.convert_apc_set_to_atepc_set(path, use_tokenizer=False)[source]

Converts an APC dataset to an ATEPC dataset.

Parameters:
  • path – path to the dataset

  • use_tokenizer – whether to use a tokenizer
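
A minimal sketch (the path is illustrative, and the comment on use_tokenizer is an assumption about what the flag controls):

    from pyabsa.utils import convert_apc_set_to_atepc_set

    # Convert an APC-format dataset file into ATEPC format; set
    # use_tokenizer=True to tokenize with a pretrained tokenizer instead of
    # simple whitespace splitting (assumption).
    convert_apc_set_to_atepc_set("my_corpus/reviews.train.txt.apc", use_tokenizer=False)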

pyabsa.utils.train_word2vec(corpus_files: List[str] = None, save_path: str = 'word2vec', vector_dim: int = 300, window: int = 5, min_count: int = 1000, skip_gram: int = 1, num_workers: int = None, epochs: int = 10, pre_tokenizer: str = None, **kwargs)[source]

Train a Word2Vec model on a given corpus and save the resulting model and vectors to disk.

Parameters:
  • corpus_files – a list of file paths for the input corpus

  • save_path – the directory where the model and vectors will be saved

  • vector_dim – the dimension of the resulting word vectors

  • window – the size of the window used for context

  • min_count – the minimum count of a word for it to be included in the model

  • skip_gram – whether to use the skip-gram (1) or CBOW (0) algorithm

  • num_workers – the number of worker threads to use (default: CPU count - 1)

  • epochs – the number of iterations over the corpus

  • pre_tokenizer – the name of a tokenizer to use for preprocessing (optional)
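
A hedged example of training on a small local corpus; the file names are illustrative, and the note on min_count is general advice rather than library-specific behavior:

    from pyabsa.utils import train_word2vec

    # Train skip-gram vectors on two illustrative corpus files (one sentence
    # per line) and save the model and vectors under ./word2vec.
    train_word2vec(
        corpus_files=["corpus/part1.txt", "corpus/part2.txt"],
        save_path="word2vec",
        vector_dim=300,
        window=5,
        min_count=1000,   # consider lowering this for small corpora
        skip_gram=1,      # 1 = skip-gram, 0 = CBOW
        epochs=10,
    )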

pyabsa.utils.train_bpe_tokenizer(corpus_files=None, base_tokenizer='roberta-base', save_path='bpe_tokenizer', vocab_size=60000, min_frequency=1000, special_tokens=None, **kwargs)[source]

Train a Byte-Pair Encoding tokenizer.

Parameters:
  • corpus_files – input corpus files organized line by line.

  • base_tokenizer – the base tokenizer template from the transformers tokenizers; you can pass the name of any pretrained tokenizer from https://huggingface.co/models.

  • save_path – save path of the tokenizer.

  • vocab_size – the size of the vocabulary.

  • min_frequency – the minimum frequency of the tokens.

  • special_tokens – special tokens to add to the vocabulary.

  • **kwargs – other arguments.
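
A hedged example; the corpus file names are illustrative, and the trained tokenizer is saved under save_path:

    from pyabsa.utils import train_bpe_tokenizer

    # Train a BPE tokenizer from the roberta-base template on two illustrative
    # corpus files and save it to ./bpe_tokenizer.
    train_bpe_tokenizer(
        corpus_files=["corpus/part1.txt", "corpus/part2.txt"],
        base_tokenizer="roberta-base",   # any pretrained tokenizer name from https://huggingface.co/models
        save_path="bpe_tokenizer",
        vocab_size=60000,
        min_frequency=1000,
    )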

pyabsa.utils.download_all_available_datasets(**kwargs)[source]

Download datasets from GitHub.

Parameters:
  • kwargs – other arguments

pyabsa.utils.download_dataset_by_name(task_code: pyabsa.framework.flag_class.TaskCodeOption | str = TaskCodeOption.Aspect_Polarity_Classification, dataset_name: pyabsa.utils.data_utils.dataset_item.DatasetItem | str = None, **kwargs)[source]

If downloading all datasets fails, try to download a dataset by name from Hugging Face: https://huggingface.co/spaces/yangheng/PyABSA

Parameters:
  • task_code – task code, e.g., TaskCodeOption.Aspect_Polarity_Classification

  • dataset_name – dataset name, e.g., pyabsa.tasks.AspectPolarityClassification.APCDatasetList.Laptop14
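
A sketch covering both download helpers; the dataset name is illustrative, and the TaskCodeOption import path follows the type annotation shown in the signature above:

    from pyabsa.framework.flag_class import TaskCodeOption
    from pyabsa.utils import download_all_available_datasets, download_dataset_by_name

    # Try to mirror every integrated dataset from GitHub first ...
    download_all_available_datasets()

    # ... and fall back to fetching a single dataset from Hugging Face if needed.
    download_dataset_by_name(
        task_code=TaskCodeOption.Aspect_Polarity_Classification,
        dataset_name="Laptop14",   # a plain name or a DatasetItem
    )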

pyabsa.utils.load_dataset_from_file(fname, config)[source]

Loads a dataset from one or multiple files.

Parameters:
  • fname (str or List[str]) – The name of the file(s) containing the dataset.

  • config (dict) – The configuration dictionary containing the logger (optional) and the maximum number of data to load (optional).

Returns:

A list of strings containing the loaded dataset.

Raises:

ValueError – If an empty line is found in the dataset.
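
A minimal sketch, assuming config can be any dictionary with the optional logger and data-limit entries described above; passing an empty dict here is an assumption, not a documented default:

    from pyabsa.utils import load_dataset_from_file

    # Read one or more dataset files into a flat list of lines
    # (the file name is illustrative).
    lines = load_dataset_from_file("my_dataset.train.txt", config={})
    print(len(lines), "lines loaded")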

class pyabsa.utils.VoteEnsemblePredictor(predictors: [List, dict], weights: [List, dict] = None, numeric_agg='average', str_agg='max_vote')[source]

Ensemble predictor that aggregates the outputs of multiple predictors; numeric results are merged with numeric_agg and string results with str_agg.

numeric_agg(result: list)[source]

Aggregate a list of numeric values.

Parameters:

result – a list of numeric values

Returns:

the aggregated value

__ensemble(result: dict)[source]

Aggregate prediction results by calling the appropriate aggregation method.

Parameters:

result – a dictionary containing the prediction results

Returns:

the aggregated prediction result

__dict_aggregate(result: dict)[source]

Recursively aggregate a dictionary of prediction results.

Parameters:

result – a dictionary containing the prediction results

Returns:

the aggregated prediction result

__list_aggregate(result: list)[source]

Aggregate a list of prediction results.

predict(text, ignore_error=False, print_result=False)[source]

Predicts on a single text and returns the ensemble result.

Parameters:
  • text (str) – The text to perform prediction on

  • ignore_error (bool) – Whether to ignore any errors that occur during prediction, defaults to False

  • print_result (bool) – Whether to print the prediction result, defaults to False

Returns:

The ensemble prediction result

Return type:

dict

batch_predict(texts, ignore_error=False, print_result=False)[source]

Predicts on a batch of texts using the ensemble of predictors.

Parameters:
  • texts – a list of strings to predict on.

  • ignore_error – boolean indicating whether to ignore errors or raise exceptions when prediction fails.

  • print_result – boolean indicating whether to print the raw results for each predictor.

Returns:

A list of dictionaries, each containing the aggregated results of the corresponding text in the input list.
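
A hedged end-to-end sketch, assuming the ensemble accepts already-initialized PyABSA predictors (here two APC sentiment classifiers loaded from illustrative checkpoint names) and that the optional weights are keyed by the same names as the predictors:

    from pyabsa import AspectPolarityClassification as APC
    from pyabsa.utils import VoteEnsemblePredictor

    # Two sentiment classifiers loaded from checkpoints (names are illustrative).
    clf_a = APC.SentimentClassifier("english")
    clf_b = APC.SentimentClassifier("multilingual")

    ensemble = VoteEnsemblePredictor(
        predictors={"model_a": clf_a, "model_b": clf_b},
        weights={"model_a": 1, "model_b": 2},  # optional per-predictor weights
        numeric_agg="average",                 # aggregation for numeric fields
        str_agg="max_vote",                    # aggregation for string labels
    )

    # Single-text and batch prediction.
    result = ensemble.predict("The battery life is great but the screen is dim.")
    results = ensemble.batch_predict(["Great food!", "Terrible service."])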