ABSA datasets for PyABSA 

total views total views per week total clones total clones per week

To augment your datasets, please refer to BoostTextAugmentation 

Auto-annoate your datasets via PyABSA!

There is an experimental feature which allows you to auto-build APC dataset and ATEPC datasets, see the usage here:

from pyabsa import make_ABSA_dataset 
# refer to the comments in this function for detailed usage
make_ABSA_dataset(dataset_name_or_path='integrated_datasets/review', checkpoint='english')

Contribute (prepare) your dataset to ABSADatasets to use it in PyABSA

We hope you can share your custom dataset or an available public dataset. If you are willing to, please follow the instruction to process your data and open a PR.

Format your dataset to use it in PyABSA

Format your APC dataset according to our dataset format. (Recommended. Once you finished this step, we can help you to finish other steps) image
Generate the inference dataset for APC / ATEPC task (Optional. The example is available here)
Convert the APC dataset to ATEPC dataset, and move the transformed ATEPC datasets from apc_dataset to corresponding atepc_datasets. (Optional. The example is available here )
Register your dataset in PyABSA. (Optional. Register here)

Important: Rename your dataset filename before use it in PyABSA

It is recommended to assign an id for your dataset, which will avoid some potential problem (e.g., dataset mis-loading) while PyABSA detects the dataset. Merging your datasets into ABSADatasets, please keep the id remained.

For a custom APC dataset, its name should be {id}.{dataset name}.{type}.dat.apc,

datasets
├── apc_datasets
│     ├── 101.restaurant
│     │    ├── restaurant.train.dat.apc  # train_dataset
│     │    ├── restaurant.test.dat.apc  # test_dataset
│     │    └── restaurant.valid.dat.apc  # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
│     └── others
├── atepc_datasets

ATEPC dataset files should be {id}.{dataset name}.{type}.dat.atepc, e.g.,

datasets
├── 101.restaurant
│    ├── restaurant.train.dat.atepc  # train_dataset
│    ├── restaurant.test.dat.atepc  # test_dataset
│    └── restaurant.valid.dat.atepc  # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
└── others

I prepare a demo custom APC/ATEPC dataset which is based on third-party annotated Yelp dataset. Iif you got problem in dataset renaming, please put your data into the prepared dataset files. Check datasets/apc_datasets/100.CustomDataset and datasets/atepc_datasets/100.CustomDataset to view or rewrite the custom dataset.

Then, use the {id}.{dataset name} to locate your dataset, e.g., dataset = '101.restaurant'

Need to Annotate Your Own Dataset?

A Stand-alone browser based tool to help process data for the training set. here image1

Once data saved, 3 files will be created:
1. a CSV file training set for classic sentiment analysis
2. a TXT file training set for PyABSA
3. a JSON file for saving unfinished work

Notice

All datasets provided are for research only, we do not hold any Copyright of any datasets. These datasets follow their original licenses (if any).

Datasets source:

MAMS https://github.com/siat-nlp/MAMS-for-ABSA

SemEval 2014: https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools

SemEval 2015: https://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools

SemEval 2016: https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools

Chinese: https://www.sciencedirect.com/science/article/abs/pii/S0950705118300972?via%3Dihub

Shampoo: brightgems@GitHub

MOOC: jmc-123@GitHub with GPL License

Twitter: https://dl.acm.org/doi/10.5555/2832415.2832437

Television & TShirt: https://github.com/rajdeep345/ABSA-Reproducibility

Yelp: WeiLi9811@GitHub

SemEval2016Task5: YaxinCui@GitHub

Arabic Hotel Reviews
Dutch Restaurant Reviews
English Restaurant Reviews
French Restaurant Reviews
Russian Restaurant Reviews
Spanish Restaurant Reviews
Turkish Restaurant Reviews

English-MOOC github: aparnavalli@GitHub

ABSA datasets for PyABSA

To augment your datasets, please refer to BoostTextAugmentation