Towards Emotion-aided Multi-modal Dialogue Act Classification:
Description:
EMOTyDA is a new multi-modal, emotion-aware dialogue act dataset, curated by collecting conversations from two open-sourced dialogue datasets, IEMOCAP and MELD.
Both IEMOCAP and MELD have pre-annotated emotion labels.
The 12 DA annotated categories are "Greeting (g)", "Question (q)", "Answer (ans)", "Statement-Opinion (o)", "Statement-Non-Opinion (s)", "Apology (ap)", "Command (c)", "Agreement (ag)", "Disagreement (dag)", "Acknowledge (a)", "Backchannel (b)" and "Others (oth)".
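For readers who process the annotations programmatically, the tag inventory above can be kept as a simple lookup table. This is only an illustrative sketch: the names `DA_TAGS` and `expand_tag` are hypothetical and are not part of the released dataset, whose file format is not described here.

```python
# Hypothetical lookup table built from the 12 DA categories listed above.
# The short tags and full names come from the description; the variable
# and function names are assumptions made for illustration.
DA_TAGS = {
    "g": "Greeting",
    "q": "Question",
    "ans": "Answer",
    "o": "Statement-Opinion",
    "s": "Statement-Non-Opinion",
    "ap": "Apology",
    "c": "Command",
    "ag": "Agreement",
    "dag": "Disagreement",
    "a": "Acknowledge",
    "b": "Backchannel",
    "oth": "Others",
}


def expand_tag(tag: str) -> str:
    """Return the full category name for a short DA tag (case-insensitive)."""
    key = tag.strip().lower()
    if key not in DA_TAGS:
        raise KeyError(f"Unknown DA tag: {tag!r}")
    return DA_TAGS[key]
```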
Reference:
T. Saha, A. Patra, S. Saha and P. Bhattacharyya (2020), "Towards Emotion-aided Multi-modal Dialogue Act Classification", in ACL 2020, July 5-10, 2020, Seattle, Washington (Category A*).
Sentiment and Emotion aware Multi-modal Speech Act Classification in Twitter (Tweet Act Classification): EmoTA.
Description:
The EmoTA dataset is curated by collecting tweets from an open-sourced tweet dataset named SemEval-2018.
The SemEval-2018 dataset has pre-annotated multi-label emotion tags.
The 7 manually annotated TA tags are “Statement” (sta), “Expression” (exp), “Question” (que), “Request” (req), “Suggestion” (sug), “Threat” (tht) and “Others” (oth).
The sentiment labels for the tweets are obtained following a semi-supervised approach using the IBM Watson Sentiment Classifier (https://cloud.ibm.com/apidocs/natural-language-understanding#sentiment). The EmoTA dataset contains silver-standard sentiment tags.
Reference:
T. Saha, A. Upadhyaya, S. Saha, P. Bhattacharyya (2021), "Towards Sentiment and Emotion aided Multi-modal Speech Act Classification in Twitter", in NAACL-HLT 2021, June 6-11, 2021 (Category A).
A Multitask Framework for Sentiment, Emotion and Sarcasm aware Cyberbullying Detection from Multi-modal Code-Mixed Memes.
Description:
We have created a benchmark multi-modal (Image+Text) meme dataset called MultiBully, collected from the open-source Twitter and Reddit platforms and annotated with bully, sentiment, emotion and sarcasm labels. Moreover, the severity of the cyberbullying posts is also investigated by adding a harmfulness score to each meme. Out of 5854 memes in our database, 2632 were labeled as nonbully, while 3222 were tagged as bully.
Reference:
Maity, K., Jha, P., Saha, S. and Bhattacharyya, P., 2022, July. A multitask framework for sentiment, emotion and sarcasm aware cyberbullying detection from multi-modal code-mixed memes. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1739-1749).
Ex-ThaiHate: A Generative Multi-task Framework for Sentiment and Emotion Aware Hate Speech Detection with Explanation in Thai.
Description:
We have developed Ex-ThaiHate, a new benchmark dataset for explainable hate speech detection in the Thai language. This dataset includes hate, sentiment, emotion and rationale labels. The dataset comprises 2685 hate and 4912 non-hate instances.
Reference:
Maity, K., Bhattacharya, S., Phosit, S., Kongsamlit, S., Saha, S. and Pasupa, K., 2023, September. Ex-ThaiHate: A Generative Multi-task Framework for Sentiment and Emotion Aware Hate Speech Detection with Explanation in Thai. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 139-156). Cham: Springer Nature Switzerland.
GenEx: A Commonsense-aware Unified Generative Framework for Explainable Cyberbullying Detection in Hindi-English Code-mixed language
Description:
We created an explainable cyberbullying dataset called BullyExplain, addressing four tasks simultaneously: Cyberbullying Detection (CD), Sentiment Analysis (SA), Target Identification (TI), and Detection of Rationales (RD). Each tweet in this dataset is annotated with four labels: Bully (Yes/No), Sentiment (Positive/Neutral/Negative), Target (Religion/Sexual-Orientation/Attacking-Relatives-and-Friends/Organization/Community/Profession/Miscellaneous), and Rationales (highlighted parts of the text justifying the classification decision). If a post is non-bullying, no rationales are marked and its target class is set to NA (Not Applicable). The BullyExplain dataset comprises a total of 6,084 samples, with 3,034 samples belonging to the non-bully class and the remaining 3,050 samples marked as bully. The number of tweets with positive and neutral sentiments is 1,536 and 1,327, respectively, while the remaining tweets express negative sentiments.
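The annotation scheme described above can be sketched as a small record type with a consistency check. This is a hypothetical illustration, not the released file format: the class name `BullyExplainRecord`, its field names, and the check itself are assumptions. Only the label values, the NA rule for non-bullying posts, and the counts are taken from the description. The last lines derive the number of negative tweets, which the description leaves implicit.

```python
from dataclasses import dataclass, field
from typing import List

# Label values as listed in the description; "NA" is used for non-bully posts.
SENTIMENTS = {"Positive", "Neutral", "Negative"}
TARGETS = {
    "Religion", "Sexual-Orientation", "Attacking-Relatives-and-Friends",
    "Organization", "Community", "Profession", "Miscellaneous", "NA",
}


@dataclass
class BullyExplainRecord:
    """Hypothetical in-memory form of one annotated tweet."""
    text: str
    bully: bool
    sentiment: str
    target: str
    rationales: List[str] = field(default_factory=list)

    def is_consistent(self) -> bool:
        """Check label values and the stated rule for non-bully posts."""
        if self.sentiment not in SENTIMENTS or self.target not in TARGETS:
            return False
        if not self.bully:
            # Non-bully posts carry no rationales and target NA.
            return self.target == "NA" and not self.rationales
        # The description states no extra constraint for bully posts.
        return True


# Counts stated in the description.
TOTAL, NON_BULLY, BULLY = 6084, 3034, 3050
POSITIVE, NEUTRAL = 1536, 1327
NEGATIVE = TOTAL - POSITIVE - NEUTRAL  # the "remaining" tweets

print(NON_BULLY + BULLY == TOTAL, NEGATIVE)
```

By the stated totals, the remaining negative-sentiment tweets number 3,221.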
Reference:
Maity, K., Jain, R., Jha, P., Saha, S. and Bhattacharyya, P., 2023, December. GenEx: A Commonsense-aware Unified Generative Framework for Explainable Cyberbullying Detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 16632-16645).
A deep learning framework for the detection of Malay hate speech
Description:
We created a Malay dataset called HateM, in which each tweet was inspected and marked as either hate or non-hate. The dataset has 3,002 tweets marked as non-hate and 1,890 tweets marked as hate.
Reference:
Maity, K., Bhattacharya, S., Saha, S. and Seera, M., 2023. A deep learning framework for the detection of Malay hate speech. IEEE Access.
Emotion, Sentiment, and Sarcasm aided Complaint Detection:
Description:
We extend the Twitter-based Complaints dataset with emotion, sentiment, and sarcasm classes. The extended Complaints dataset consists of 2214 non-complaint and 1235 complaint tweets in English.
Reference:
A. Singh, A. Nazir, S. Saha (2021), "Adversarial Multi-task Model for Emotion, Sentiment, and Sarcasm aided Complaint Detection", in 44th European Conference on Information Retrieval (10-14 April 2022), ECIR 2022 (core ranking A), Norway.
Sentiment and Emotion-Aware Multi-Modal Complaint Identification:
Description:
We curate a new multimodal complaint dataset, the Complaint, Emotion, and Sentiment Annotated Multi-modal Amazon Reviews Dataset (CESAMARD), a collection of opinionated texts (reviews) and product images posted on the website of the retail giant Amazon. The CESAMARD dataset comprises 3962 reviews with the corresponding complaint, emotion, and sentiment labels.
Reference:
A. Singh, S. Dey, A. Singha, S. Saha (2021), "Sentiment and Emotion-aware Multi-modal Complaint Identification", in AAAI 2022 (core rank A*).
Complaint and Severity Identification from Online Financial Content:
Description:
We curate a Financial Complaints corpus (FINCORP), a collection of annotated complaints arising between financial institutions and consumers expressed in English on Twitter. The dataset has been enriched with the associated emotion, sentiment, and complaint severity classes. The dataset comprises 3149 complaint and 3133 non-complaint instances spanning over ten domains (e.g., credit cards, mortgages, etc.).
Reference:
A. Singh, R. Bhatia, and S. Saha (2022), "Complaint and Severity Identification from Online Financial Content", IEEE Transactions on Computational Social Systems.
Peeking inside the black box - A Commonsense-Aware Generative Framework for Explainable Complaint Detection:
Description:
We extended the original Complaints dataset with causal span annotations for complaint and non-complaint labels. The extended dataset (X-CI) is the first benchmark dataset for explainable complaint detection. Each instance in the X-CI dataset is annotated with five labels: complaint label, emotion label, polarity label, complaint severity level, and rationale (explainability), i.e., the causal span explaining the reason for the complaint/non-complaint label.
Reference:
A. Singh, R. Jain, P. Jha, S. Saha (2023), "Peeking inside the black box: A Commonsense-aware Generative Framework for Explainable Complaint Detection", ACL 2023 (Core rank: A*).
Knowing What and How - A Multi-modal Aspect-Based Framework for Complaint Detection:
Description:
The CESAMARD-Aspect dataset consists of aspect categories and associated complaint/non-complaint labels and spans five domains (books, electronics, edibles, fashion, and miscellaneous). The dataset comprises 3962 reviews, with 2641 reviews in the non-complaint category (66.66%) and 1321 reviews in the complaint category (33.34%). Each record in the dataset consists of the image URL, review title, review text, and corresponding complaint, polarity, and emotion labels. The instances in the original CESAMARD dataset were grouped according to various domains, such as electronics, edibles, fashion, books, and miscellaneous. We go a step further by adding a pre-defined set of aspect categories for each of the five domains, with the associated complaint/non-complaint labels. All domains share three common aspects: packaging, price, and quality, which are essential considerations when shopping online.
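The class shares quoted above follow directly from the review counts. The short check below reproduces them, using only the numbers given in the description; the variable names are illustrative.

```python
# Review counts as stated in the description.
counts = {"non-complaint": 2641, "complaint": 1321}

total = sum(counts.values())

# Percentage share of each class, rounded to two decimals as in the text.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)   # 3962
print(shares)  # {'non-complaint': 66.66, 'complaint': 33.34}
```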
Reference:
A. Singh, V. Gangwar, S. Sharma, S. Saha (2023), "Knowing What and How: A Multi-modal Aspect-Based Framework for Complaint Detection", ECIR 2023 (Core rank A), Dublin.
AbCoRD - Exploiting Multimodal Generative Approach for Aspect-based Complaint and Rationale Detection:
Description:
We added rationale annotation for aspect-based complaint classes to the benchmark multimodal complaint dataset (CESAMARD) covering five domains (books, electronics, edibles, fashion, and miscellaneous). The causal span that best explains the reason for the complaint label in each aspect-level complaint instance was selected. Note that if the review is categorized as non-complaint as a whole, then all aspect-level annotations will also be marked as non-complaint. However, in cases where there is a complaint at the review level, certain aspects may still be considered non-complaint. Each review instance is marked with at most six aspects in the dataset.
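The two annotation constraints stated above (a non-complaint review has only non-complaint aspects, and a review has at most six aspects) can be expressed as a simple validity check. This is a minimal sketch under assumed names: `is_valid_annotation`, `MAX_ASPECTS`, and the dictionary layout are hypothetical and do not reflect the dataset's actual file format.

```python
from typing import Dict

# Upper bound on aspects per review, as stated in the description.
MAX_ASPECTS = 6


def is_valid_annotation(review_is_complaint: bool,
                        aspect_labels: Dict[str, bool]) -> bool:
    """Check one review's aspect-level labels against the stated rules.

    aspect_labels maps an aspect name (e.g. "price") to True when that
    aspect is labeled as a complaint and False otherwise.
    """
    if len(aspect_labels) > MAX_ASPECTS:
        return False
    # A non-complaint review must not contain any complaint aspect.
    if not review_is_complaint and any(aspect_labels.values()):
        return False
    # A complaint review may still contain non-complaint aspects.
    return True


# Usage: a complaint review with mixed aspect labels is valid.
print(is_valid_annotation(True, {"price": True, "packaging": False}))
```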
Reference:
R. Jain, A. Singh, V. Gangwar, S. Saha (2023), "AbCoRD: Exploiting multimodal generative approach for Aspect-based Complaint and Rationale Detection", ACM Multimedia (Core rank: A*).
Large Scale Multi-Lingual Multi-Modal Summarization Dataset
Description:
M3LS is currently the largest multi-lingual multi-modal summarization dataset. It consists of over a million instances of document-image pairs, along with a professionally annotated multi-modal summary for each pair. It spans 20 languages, targeting diversity across five language roots. It is also the largest summarization dataset for 13 languages and includes cross-lingual summarization data for 2 languages.
Reference:
Verma, Y., Jangra, A., Verma, R. and Saha, S., 2023, May. Large Scale Multi-Lingual Multi-Modal Summarization Dataset. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (pp. 3602-3614). EACL (CORE Ranking: A).
Multimodal Rumour Detection: Catching News that Never Transpired!
Description:
This dataset is an extension of the PHEME-2016 dataset, which initially lacked images. Images were collected for tweet threads with user-uploaded images mentioned in the metadata. For threads without images, we performed web scraping to add visuals. Only source tweets were considered for image downloads to ensure relevance and appropriateness.
Reference:
Kumar, R., Sinha, R., Saha, S., Jatowt, A. (2023). Multimodal Rumour Detection: Catching News that Never Transpired!. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023 (CORE Ranking: A). Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_15