pyabsa.utils.absa_utils.absa_utils

Module Contents

Functions

generate_inference_set_for_apc(dataset_path)

Generate inference set for APC dataset. This function only works for APC datasets located in integrated_datasets.

is_similar(→ bool)

Determines if two strings are similar based on the number of common tokens they share.

assemble_aspects(fname[, use_tokenizer])

Preprocesses the input file, groups sentences with similar aspects, and generates samples with the corresponding aspect labels and polarities.

split_aspects(sentence)

Splits a sentence into multiple aspects, each with its own context and polarity.

convert_atepc(fname, use_tokenizer)

Converts the input file to the Aspect Term Extraction and Polarity Classification (ATEPC) format.

convert_apc_set_to_atepc_set(path[, use_tokenizer])

Converts an APC dataset to an ATEPC dataset.

refactor_chinese_dataset(fname, train_fname, test_fname)

Refactors the Chinese dataset by splitting it into train and test sets and converting it into the ATEPC format.

detect_error_in_dataset(dataset)

Detects errors in a given dataset by checking whether sentences with similar aspects have different lengths.

pyabsa.utils.absa_utils.absa_utils.generate_inference_set_for_apc(dataset_path)[source]

Generate inference set for APC dataset. This function only works for APC datasets located in integrated_datasets.

pyabsa.utils.absa_utils.absa_utils.is_similar(s1: str, s2: str) → bool[source]

Determines if two strings are similar based on the number of common tokens they share.

Parameters:
  • s1 (str) – the first string

  • s2 (str) – the second string

Returns:

True if the strings are similar, False otherwise

Return type:

bool
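The token-overlap idea can be sketched as follows. The whitespace tokenization and the 0.8 threshold are illustrative assumptions, not the library's exact choices:

```python
def is_similar_sketch(s1: str, s2: str, threshold: float = 0.8) -> bool:
    # Whitespace tokenization (assumption: pyabsa may tokenize differently).
    t1, t2 = set(s1.split()), set(s2.split())
    if not t1 or not t2:
        return False
    # Similar when the shared tokens cover most of the shorter string.
    return len(t1 & t2) / min(len(t1), len(t2)) >= threshold
```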

pyabsa.utils.absa_utils.absa_utils.assemble_aspects(fname, use_tokenizer=False)[source]

Preprocesses the input file, groups sentences with similar aspects, and generates samples with the corresponding aspect labels and polarities.

Parameters:
  • fname (str) – The filename to be preprocessed

  • use_tokenizer (bool, optional) – Whether to use a tokenizer, defaults to False

Returns:

A list of samples

Return type:

list
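As an illustration of the preprocessing step, a minimal reader for the common APC file layout (three lines per sample: a sentence containing the $T$ aspect placeholder, the aspect term, and the polarity label) might look like this; the layout is an assumption about the input files, not a description of pyabsa's internals:

```python
def read_apc_triples(lines):
    # Drop blank lines, then consume three lines per sample:
    # sentence with the $T$ placeholder, aspect term, polarity label.
    lines = [ln.strip() for ln in lines if ln.strip()]
    samples = []
    for i in range(0, len(lines), 3):
        sentence, aspect, polarity = lines[i], lines[i + 1], lines[i + 2]
        # Substitute the placeholder to recover the full sentence.
        samples.append((sentence.replace("$T$", aspect), aspect, polarity))
    return samples
```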

pyabsa.utils.absa_utils.absa_utils.split_aspects(sentence)[source]

Splits a sentence into multiple aspects, each with its own context and polarity.

Parameters:
  • sentence – input sentence containing multiple aspects

Returns:

A list of tuples, each containing a single aspect with its context and polarity
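The per-aspect expansion can be sketched as below. The real input markup is not shown in this reference, so the inline `[[aspect|polarity]]` annotation used here is purely hypothetical:

```python
import re

# Hypothetical inline annotation: aspects marked as [[aspect|polarity]].
ASPECT_PATTERN = re.compile(r"\[\[(.+?)\|(.+?)\]\]")

def split_aspects_sketch(sentence: str):
    results = []
    for match in ASPECT_PATTERN.finditer(sentence):
        aspect, polarity = match.group(1), match.group(2)
        # Replace the current aspect with the $T$ placeholder and
        # strip the markup from every other aspect annotation.
        context = ASPECT_PATTERN.sub(
            lambda m: "$T$" if m.start() == match.start() else m.group(1),
            sentence,
        )
        results.append((context, aspect, polarity))
    return results
```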

pyabsa.utils.absa_utils.absa_utils.convert_atepc(fname, use_tokenizer)[source]

Converts the input file to the Aspect Term Extraction and Polarity Classification (ATEPC) format.

Parameters:
  • fname – the filename to convert

  • use_tokenizer – whether to use a tokenizer
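ATEPC datasets pair each token with an IOB aspect tag and a polarity label. A minimal sketch of converting one APC-style sample (the `-999` polarity for non-aspect tokens and the whitespace tokenization are illustrative assumptions):

```python
def to_atepc_lines(sentence, aspect, polarity):
    # Split the sentence around the $T$ aspect placeholder.
    left, _, right = sentence.partition("$T$")
    # Non-aspect tokens get the O tag and a dummy polarity (assumed -999).
    lines = [f"{tok} O -999" for tok in left.split()]
    # Aspect tokens get B-ASP/I-ASP tags and the sample's polarity.
    for i, tok in enumerate(aspect.split()):
        tag = "B-ASP" if i == 0 else "I-ASP"
        lines.append(f"{tok} {tag} {polarity}")
    lines += [f"{tok} O -999" for tok in right.split()]
    return lines
```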

pyabsa.utils.absa_utils.absa_utils.convert_apc_set_to_atepc_set(path, use_tokenizer=False)[source]

Converts an APC dataset to an ATEPC dataset.

Parameters:
  • path – path to the dataset

  • use_tokenizer (bool, optional) – whether to use a tokenizer, defaults to False

pyabsa.utils.absa_utils.absa_utils.refactor_chinese_dataset(fname, train_fname, test_fname)[source]

Refactors the Chinese dataset by splitting it into train and test sets and converting it into the ATEPC format.

Parameters:
  • fname – the name of the dataset file

  • train_fname – the name of the output train file

  • test_fname – the name of the output test file
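The train/test split step can be sketched as a simple tail split; the 0.2 test ratio is an assumed illustration, not the library's actual value:

```python
def split_train_test(samples, test_ratio=0.2):
    # Deterministic split: the last test_ratio fraction becomes the test set.
    cut = len(samples) - int(len(samples) * test_ratio)
    return samples[:cut], samples[cut:]
```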

pyabsa.utils.absa_utils.absa_utils.detect_error_in_dataset(dataset)[source]

Detects errors in a given dataset by checking whether sentences with similar aspects have different lengths.

Parameters:
  • dataset – the dataset file name
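A loose sketch of such a length check, assuming the (sentence-with-$T$, aspect, polarity) sample shape used above; grouping by the reconstructed full sentence is an illustrative simplification of the library's similarity-based comparison:

```python
from collections import defaultdict

def find_length_mismatches(samples):
    # Map each reconstructed full sentence to the token lengths of the
    # templates ($T$-sentences) that produced it.
    groups = defaultdict(set)
    for sentence, aspect, _polarity in samples:
        full = sentence.replace("$T$", aspect)
        groups[full].add(len(sentence.split()))
    # If the same full sentence arises from templates of different
    # token lengths, one of its annotations is likely malformed.
    return [full for full, lengths in groups.items() if len(lengths) > 1]
```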