Course Project of the Natural Language Processing Course at IIIT-Delhi
Sarcasm is often used in daily conversation to display contempt or to mock. Sarcasm detection has attracted much attention in recent years, with a growing number of studies exploring the features and ideas that help spot sarcasm in language. The field has established itself as an important problem in NLP, with many works proposing different solutions (Joshi et al. (2017b) [8]). Most work on sarcasm detection has been done on isolated utterances, that is, unimodal analysis. However, sarcasm is not expressed by the utterance alone; it also depends on visual expression, audio modulation, and the context of the utterance. We perform multimodal sarcasm detection and compare the scores obtained when individual cues are included or excluded. We analyze the following cues: utterance, context, visual features, and audio features, in both speaker-dependent and speaker-independent scenarios.
Castro et al. (2019) [3] introduced the Multimodal Sarcasm Detection Dataset (MUStARD, https://www.aclweb.org/anthology/P19-1455.pdf), which we use as the basis for training the sarcasm detection model. It contains annotated text data comprising an utterance, its context, and the contextual speakers. It also provides a video clip of each utterance, from which the audio and video features are derived. GitHub: https://github.com/soujanyaporia/MUStARD
Related Work
The MUStARD dataset was created by Castro et al. (2019), who not only provided the dataset but also attempted sarcasm detection using multimodal approaches. They represented textual utterances with BERT (Devlin et al., 2018) embeddings, extracted audio features by preprocessing speech with a custom pipeline built on Librosa (McFee et al., 2018), and obtained video features from frames of the utterance videos using the pool5 layer of an ImageNet (Deng et al., 2009) pretrained ResNet-152 (He et al., 2016) image classification model. They combined these features and reported results both with and without speaker dependence and context. Finally, they trained an SVM model on each resulting feature vector. Chauhan et al. (2020) build upon this work and suggest using emotional information to solve the problem in a multi-task framework. They propose a segment-wise inter-modal attention based framework for this task.
Our task is multimodal sarcasm detection on the MUStARD dataset. We begin with the textual data and then proceed as follows:
- Preprocessing of textual data - All commonly used contractions are resolved in both utterances and context, all text is lowercased, and punctuation is removed. We also tried adding the counts of exclamation marks, question marks, and commas (pauses) to our feature vector, but this did not yield good F1-scores, so we did not include these counts in our final models.
- Speakers are encoded using one-hot encoding.
- Extraction of Audio features - Pre-extracted audio features are used from the available dataset. These features contain information related to pitch, intonation, and other tonal-specific details of the speech. The popular speech-processing library Librosa was used to extract these features. (https://www.aclweb.org/anthology/P19-1455.pdf)
- Extraction of visual features - Pre-extracted visual features are incorporated in our feature vector. These features are extracted for each of the f frames in the utterance video using the pool5 layer of an ImageNet pretrained ResNet-152 image classification model (https://www.aclweb.org/anthology/P19-1455.pdf). The features are flattened and then sliced or padded to form vectors of length 265000.
- VADER sentiment features (pos, neg, neu, compound) are also computed for both the utterance and the contextual text.
- TF-IDF embedding - Embeddings for utterances and contextual sentences are generated using TF-IDF at the n-gram level. BERT-base, RoBERTa, Word2Vec, and GloVe embeddings were also considered but did not give better results.
- Cross-validation using 3 folds.
- We also implemented BERT-based classification, as well as a CNN and an RNN for text classification, but could not get better results.
- Different classification models - SVC, Random Forest, XGB - are trained on the feature vectors. Our final setup includes all three of these models and a voting classifier that combines them.
- Comparative analysis of each dataset is performed using weighted F1-scores, precision, and recall.
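
The text preprocessing and fixed-length feature steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the contraction map is a small hypothetical sample, and the function names `preprocess`, `punctuation_counts`, and `fit_length` are assumptions introduced here for illustration.

```python
import re
import string

# Hypothetical, deliberately small contraction map; the project's real list is not given.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "didn't": "did not",
    "i'm": "i am",
    "it's": "it is",
    "you're": "you are",
}


def punctuation_counts(text):
    """Count the punctuation cues that were tried as extra features."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "commas": text.count(","),
    }


def preprocess(text):
    """Lowercase, expand contractions, then strip punctuation."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    # Remove all ASCII punctuation, then collapse repeated whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def fit_length(features, length, pad_value=0.0):
    """Slice or right-pad a flattened feature vector to a fixed length."""
    clipped = list(features[:length])
    return clipped + [pad_value] * (length - len(clipped))


if __name__ == "__main__":
    utterance = "Oh, I'm SO glad you didn't call!"
    print(punctuation_counts(utterance))
    print(preprocess(utterance))
    print(fit_length([1.0, 2.0, 3.0], 5))
```

Punctuation counts have to be taken before `preprocess` is applied, since the cleaning step removes the characters being counted. In the setup described above, `fit_length` would be called with a length of 265000 on each flattened visual feature vector; zero padding is an assumption here.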