ABSA datasets for PyABSA
total views total views per week total clones total clones per week
To augment your datasets, please refer to BoostTextAugmentation
Auto-annoate your datasets via PyABSA!
There is an experimental feature which allows you to auto-build APC dataset and ATEPC datasets, see the usage here:
from pyabsa import make_ABSA_dataset
# refer to the comments in this function for detailed usage
make_ABSA_dataset(dataset_name_or_path='integrated_datasets/review', checkpoint='english')
Contribute (prepare) your dataset to ABSADatasets to use it in PyABSA
We hope you can share your custom dataset or an available public dataset. If you are willing to, please follow the instruction to process your data and open a PR.
Format your dataset to use it in PyABSA
Format your APC dataset according to our dataset format. (Recommended. Once you finished this step, we can help you to finish other steps) image
Generate the inference dataset for APC / ATEPC task (Optional. The example is available here)
Convert the APC dataset to ATEPC dataset, and move the transformed ATEPC datasets from apc_dataset to corresponding atepc_datasets. (Optional. The example is available here )
Register your dataset in PyABSA. (Optional. Register here)
Important: Rename your dataset filename before use it in PyABSA
It is recommended to assign an id for your dataset, which will avoid some potential problem (e.g., dataset mis-loading) while PyABSA detects the dataset. Merging your datasets into ABSADatasets, please keep the id remained.
For a custom APC dataset, its name should be {id}.{dataset name}.{type}.dat.apc,
datasets
├── apc_datasets
│ ├── 101.restaurant
│ │ ├── restaurant.train.dat.apc # train_dataset
│ │ ├── restaurant.test.dat.apc # test_dataset
│ │ └── restaurant.valid.dat.apc # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
│ └── others
├── atepc_datasets
ATEPC dataset files should be {id}.{dataset name}.{type}.dat.atepc, e.g.,
datasets
├── 101.restaurant
│ ├── restaurant.train.dat.atepc # train_dataset
│ ├── restaurant.test.dat.atepc # test_dataset
│ └── restaurant.valid.dat.atepc # valid_dataset, dev set are not recognized in PyASBA, please rename dev-set to valid-set
└── others
I prepare a demo custom APC/ATEPC dataset which is based on third-party annotated Yelp dataset. Iif you got problem in dataset renaming, please put your data into the prepared dataset files. Check datasets/apc_datasets/100.CustomDataset and datasets/atepc_datasets/100.CustomDataset to view or rewrite the custom dataset.
Then, use the {id}.{dataset name} to locate your dataset, e.g., dataset = '101.restaurant'
Need to Annotate Your Own Dataset?
A Stand-alone browser based tool to help process data for the training set. here image1
Once data saved, 3 files will be created:
a CSV file training set for classic sentiment analysis
a TXT file training set for PyABSA
a JSON file for saving unfinished work
Notice
All datasets provided are for research only, we do not hold any Copyright of any datasets. These datasets follow their original licenses (if any).
Datasets source:
MAMS https://github.com/siat-nlp/MAMS-for-ABSA
SemEval 2014: https://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools
SemEval 2015: https://alt.qcri.org/semeval2015/task12/index.php?id=data-and-tools
SemEval 2016: https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools
Chinese: https://www.sciencedirect.com/science/article/abs/pii/S0950705118300972?via%3Dihub
Shampoo: brightgems@GitHub
MOOC: jmc-123@GitHub with GPL License
Twitter: https://dl.acm.org/doi/10.5555/2832415.2832437
Television & TShirt: https://github.com/rajdeep345/ABSA-Reproducibility
Yelp: WeiLi9811@GitHub
SemEval2016Task5: YaxinCui@GitHub
Arabic Hotel Reviews
Dutch Restaurant Reviews
English Restaurant Reviews
French Restaurant Reviews
Russian Restaurant Reviews
Spanish Restaurant Reviews
Turkish Restaurant Reviews
English-MOOC github: aparnavalli@GitHub