Working with Datasets¶
PyABSA supports both built-in benchmark datasets and custom data. This guide covers how to prepare, load, and use datasets for each task.
Built-in Datasets¶
PyABSA includes access to standard benchmark datasets that are commonly used in ABSA research. These datasets are automatically downloaded and cached when you first use them.
Available Dataset Categories¶
Aspect Polarity Classification (APC)¶
from pyabsa import AspectPolarityClassification as APC
# Popular benchmark datasets
dataset = APC.APCDatasetList.Restaurant14
dataset = APC.APCDatasetList.Laptop14
dataset = APC.APCDatasetList.Restaurant15
dataset = APC.APCDatasetList.Restaurant16
# Multilingual datasets
dataset = APC.APCDatasetList.Multilingual
Aspect Term Extraction (ATE)¶
from pyabsa import AspectTermExtraction as ATE
# Standard ATE datasets
dataset = ATE.ATEDatasetList.Restaurant14
dataset = ATE.ATEDatasetList.Laptop14
Aspect Sentiment Triplet Extraction (ASTE)¶
from pyabsa import AspectSentimentTripletExtraction as ASTE
# ASTE benchmark datasets
dataset = ASTE.ASTEDatasetList.Restaurant14
dataset = ASTE.ASTEDatasetList.Restaurant15
dataset = ASTE.ASTEDatasetList.Restaurant16
Using Built-in Datasets¶
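from pyabsa import AspectPolarityClassification as APC, DeviceTypeOption
# Start from a default configuration, then pass the dataset to your trainer
config = APC.APCConfigManager.get_apc_config_english()
trainer = APC.APCTrainer(
    config=config,
    dataset=APC.APCDatasetList.Restaurant14,
    auto_device=DeviceTypeOption.AUTO,
)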
Custom Dataset Preparation¶
To use your own datasets, you need to follow specific formatting conventions that PyABSA expects.
Data Format Specifications¶
For Aspect Polarity Classification (APC)¶
Each line should contain a sentence with marked aspects followed by sentiment labels:
[sentence with marked aspects] $LABEL$ [sentiment1, sentiment2, ...]
Example:
The [B-ASP]food[E-ASP] was excellent, but the [B-ASP]service[E-ASP] was terrible. $LABEL$ Positive, Negative
Aspect Marking Rules:
Use [B-ASP] to mark the beginning of an aspect term
Use [E-ASP] to mark the end of an aspect term
Multiple aspects in one sentence are supported
Sentiment Labels:
Positive or 1 for positive sentiment
Negative or -1 for negative sentiment
Neutral or 0 for neutral sentiment
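To make the APC line format concrete, here is a minimal parsing sketch in plain Python. It is not PyABSA's own loader; the names `parse_apc_line`, `ASPECT_PATTERN`, and `LABEL_ALIASES` are hypothetical and only illustrate the marking rules and label aliases described above.

```python
import re

# Marker pattern from the format above: [B-ASP]term[E-ASP]
ASPECT_PATTERN = re.compile(r"\[B-ASP\](.*?)\[E-ASP\]")

# Textual and numeric labels listed above, mapped to one canonical form
# (hypothetical helper table, not part of PyABSA)
LABEL_ALIASES = {
    "positive": "Positive", "1": "Positive",
    "negative": "Negative", "-1": "Negative",
    "neutral": "Neutral", "0": "Neutral",
}


def parse_apc_line(line):
    """Split one APC line into (plain_text, [(aspect, label), ...])."""
    text, sep, label_part = line.partition("$LABEL$")
    if not sep:
        raise ValueError("missing $LABEL$ separator")

    # Aspect terms in order of appearance
    aspects = [a.strip() for a in ASPECT_PATTERN.findall(text)]

    # One label per aspect, comma-separated after $LABEL$
    labels = []
    for raw in label_part.split(","):
        key = raw.strip().lower()
        if key not in LABEL_ALIASES:
            raise ValueError(f"unknown sentiment label: {raw.strip()!r}")
        labels.append(LABEL_ALIASES[key])

    if len(aspects) != len(labels):
        raise ValueError("number of aspects and labels differ")

    # Remove the markers to recover the plain sentence
    plain = ASPECT_PATTERN.sub(r"\1", text).strip()
    return plain, list(zip(aspects, labels))
```

Applied to the example line above, this pairs "food" with Positive and "service" with Negative, in the order the aspects appear.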
For Aspect Term Extraction (ATE)¶
Format is similar but without sentiment labels:
The [B-ASP]food[E-ASP] was excellent, but the [B-ASP]service[E-ASP] was terrible. $LABEL$
For Aspect Sentiment Triplet Extraction (ASTE)¶
Uses BIO tagging for aspects and opinions:
The food was excellent , but the service was terrible . $LABEL$ B-ASP I-ASP O B-OP O O B-ASP I-ASP O B-OP O
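The tagged line above only makes sense if every token has exactly one tag. The following sketch, written in plain Python under that assumption, pairs tokens with tags and collects the aspect and opinion spans. The helper names `split_tagged_line` and `extract_spans` are hypothetical and not part of PyABSA.

```python
def split_tagged_line(line):
    """Return [(token, tag), ...] for a 'tokens $LABEL$ tags' line."""
    tokens_part, sep, tags_part = line.partition("$LABEL$")
    if not sep:
        raise ValueError("missing $LABEL$ separator")
    tokens = tokens_part.split()
    tags = tags_part.split()
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return list(zip(tokens, tags))


def extract_spans(pairs, prefix):
    """Collect spans tagged B-<prefix> followed by any I-<prefix> tokens."""
    spans, current = [], []
    for token, tag in pairs:
        if tag == f"B-{prefix}":
            # A new span starts; flush the previous one if still open
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{prefix}" and current:
            current.append(token)
        else:
            # Any other tag closes the open span
            if current:
                spans.append(" ".join(current))
                current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For the example line, `extract_spans(pairs, "ASP")` yields "The food" and "the service", and `extract_spans(pairs, "OP")` yields "excellent" and "terrible".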
File Organization¶
Organize your dataset files using this structure:
my_dataset/
├── train.dat.apc # Training data
├── test.dat.apc # Test data
├── valid.dat.apc # Validation data (optional)
└── readme.txt # Dataset description (optional)
File Naming Conventions:
Training files: train.dat.[task] (e.g., train.dat.apc, train.dat.ate)
Test files: test.dat.[task]
Validation files: valid.dat.[task] (optional)
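Before training, it can help to confirm that a directory follows the naming conventions above. This is a small standalone sketch using only `pathlib`. The function names are hypothetical, and it assumes the task suffix is the lowercase task code (such as `apc` or `ate`) shown in the examples.

```python
from pathlib import Path


def expected_files(task):
    """Return (required, optional) file names for a task suffix such as 'apc'."""
    required = [f"train.dat.{task}", f"test.dat.{task}"]
    optional = [f"valid.dat.{task}"]
    return required, optional


def missing_required_files(dataset_dir, task):
    """List required files that are absent from dataset_dir."""
    required, _ = expected_files(task)
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list from `missing_required_files("path/to/my_dataset", "apc")` means both the training and test files were found.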
Loading Custom Datasets¶
Method 1: Direct Path¶
from pyabsa import AspectPolarityClassification as APC, DeviceTypeOption
# Point to your dataset directory (config is an APC configuration created beforehand)
dataset_path = "path/to/my_dataset"
trainer = APC.APCTrainer(
    config=config,
    dataset=dataset_path,
    auto_device=DeviceTypeOption.AUTO,
)
Method 2: Using DatasetItem¶
from pyabsa import AspectPolarityClassification as APC, DeviceTypeOption
from pyabsa.utils.data_utils.dataset_item import DatasetItem
# Create a dataset item
my_dataset = DatasetItem(
    dataset_name="MyCustomDataset",
    dataset_path="path/to/my_dataset",
)
trainer = APC.APCTrainer(
    config=config,
    dataset=my_dataset,
    auto_device=DeviceTypeOption.AUTO,
)
Data Preprocessing and Validation¶
Automatic Data Validation¶
PyABSA automatically validates your dataset format when loading:
# The framework will check for:
# - Correct file naming
# - Valid data format
# - Consistent labeling
# - Missing files
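As an illustration of what a "valid data format" check can look like for APC files, here is a standalone sketch that reports problems by line number. It does not reproduce PyABSA's internal validation; `check_apc_lines` is a hypothetical helper that only applies the format rules described in this guide.

```python
def check_apc_lines(lines):
    """Return [(line_number, message), ...] for lines that break the APC format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored in this sketch

        if "$LABEL$" not in line:
            problems.append((number, "missing $LABEL$ separator"))
            continue

        text, _, label_part = line.partition("$LABEL$")

        # Every [B-ASP] needs a matching [E-ASP], and at least one aspect must exist
        opens, closes = text.count("[B-ASP]"), text.count("[E-ASP]")
        if opens == 0 or opens != closes:
            problems.append((number, "unbalanced or missing aspect markers"))
            continue

        # One comma-separated label is expected per marked aspect
        labels = [p.strip() for p in label_part.split(",") if p.strip()]
        if len(labels) != opens:
            problems.append((number, "aspect/label count mismatch"))
    return problems
```

Passing an open file object works as well as a list of strings, since the function only iterates over lines.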
Manual Data Validation¶
You can validate your dataset before training:
from pyabsa.utils.data_utils import validate_dataset
# Validate your dataset format
is_valid, errors = validate_dataset("path/to/my_dataset", task="apc")
if not is_valid:
    print("Dataset validation errors:")
    for error in errors:
        print(f"- {error}")
Data Augmentation¶
PyABSA supports automatic data augmentation to improve model robustness.
Enabling Augmentation¶
# Enable augmentation during training
config.load_aug = True
# Or specify augmentation parameters
config.augmentation_ratio = 0.2 # Augment 20% of training data
config.augmentation_methods = ["synonym", "back_translation"]
Custom Augmentation¶
from pyabsa.augmentation import TextAugmentation
# Create custom augmentation pipeline
augmenter = TextAugmentation(
    methods=["synonym_replacement", "random_insertion"],
    ratio=0.15,
)
# Apply to your dataset
augmented_data = augmenter.augment_dataset("path/to/dataset")
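For ABSA data, any augmentation has to leave the marked aspect terms and the label part intact. The sketch below shows one way to do synonym replacement under that constraint, in plain Python with a toy synonym table. It is an illustration only, not PyABSA's augmentation code, and `SYNONYMS` and `replace_synonyms` are hypothetical names.

```python
import re

# Toy synonym table, purely illustrative
SYNONYMS = {"excellent": "superb", "terrible": "awful"}

# The capturing group makes re.split keep the aspect spans as separate chunks
ASPECT_SPAN = re.compile(r"(\[B-ASP\].*?\[E-ASP\])")


def replace_synonyms(line, synonyms=SYNONYMS):
    """Replace known words outside aspect spans; keep markers and labels as-is."""
    text, sep, labels = line.partition("$LABEL$")
    out = []
    for chunk in ASPECT_SPAN.split(text):
        if chunk.startswith("[B-ASP]"):
            out.append(chunk)  # aspect terms are never modified
        else:
            # Look up each alphabetic word in lowercase; unknown words stay unchanged
            out.append(re.sub(
                r"[A-Za-z]+",
                lambda m: synonyms.get(m.group(0).lower(), m.group(0)),
                chunk,
            ))
    return "".join(out) + sep + labels
```

Note that this simple version inserts the replacement in lowercase, so a capitalized word at the start of a sentence would lose its capital letter.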
Advanced Dataset Features¶
Multi-language Support¶
# Specify the language of your dataset (set one value only)
config.language = "en" # English
# config.language = "zh" # Chinese
# config.language = "multilingual" # Multi-language
# For multilingual data, use a multilingual pretrained model and tokenizer
config.pretrained_bert = "bert-base-multilingual-cased"
Handling Imbalanced Data¶
# Enable class balancing
config.balance_classes = True
# Use weighted loss
config.use_class_weights = True
# Oversample minority classes
config.oversample_ratio = 2.0
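To show what class weighting means in practice, here is a standalone sketch of the common inverse-frequency heuristic, where each class weight is `n_samples / (n_classes * class_count)`. Whether PyABSA computes its weights this way is not stated here; the function name is hypothetical and the code is only meant to build intuition.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {label: weight} so that rarer classes receive larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # A perfectly balanced dataset gives every class a weight of 1.0
    return {label: total / (n_classes * count) for label, count in counts.items()}
```

With six Positive, three Negative, and one Neutral example, the Neutral class gets the largest weight because it is the rarest.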
Cross-domain Datasets¶
# Training on multiple domains
config.cross_domain_training = True
config.domain_adaptation = True
# Specify source and target domains
config.source_domain = "restaurant"
config.target_domain = "laptop"
Best Practices¶
Data Quality Guidelines¶
Consistent Labeling: Ensure sentiment labels are consistent across your dataset
Balanced Distribution: Aim for balanced sentiment distribution when possible
Quality Control: Review a sample of your data manually before training
Sufficient Size: Use at least 1,000 training examples; more data generally improves performance
Performance Optimization¶
# Cache processed datasets for faster loading
config.cache_dataset = True
# Use multiple workers for data loading
config.num_workers = 4
# Enable dataset preprocessing
config.preprocess_data = True
Error Handling¶
# Handle missing files gracefully
config.ignore_error = True
# Skip malformed examples
config.skip_errors = True
# Log data loading issues
config.verbose_data_loading = True
Troubleshooting Common Issues¶
Format Errors¶
Problem: “Invalid data format” error
Solution: Check that aspects are properly marked with [B-ASP] and [E-ASP] tags
Problem: “Missing label separator” error
Solution: Ensure each line contains the $LABEL$ separator
Loading Errors¶
Problem: “Dataset not found” error
Solution: Verify file paths and naming conventions
Problem: “Empty dataset” error
Solution: Check that data files contain actual content
Performance Issues¶
Problem: Slow data loading
Solution: Set config.cache_dataset = True and increase config.num_workers
Problem: Out of memory during data loading
Solution: Reduce max_seq_len or train_batch_size
For additional help with dataset issues, check the PyABSA GitHub repository or refer to the example datasets in the examples-v2 directory.