Real-Time Sentiment-Based Anomaly Detection in Twitter Data Streams
Abstract
Twitter has over 316 million active users, and their engagement results in the rapid production of data, notably around popular topics (such as news stories, politics, and sports). This data is available in the form of data streams, which has led many researchers to develop analysis techniques specifically for Twitter data streams. Although anomaly detection in time series is a well-established research area, its application to detecting sentiment-based anomalies in large volumes of streaming data began only recently. A sentiment-based anomaly is defined as a sudden increase in the time series of tweets individually associated with a positive, neutral, or negative sentiment. The goal of this research is to develop and evaluate a technique to automatically detect sentiment-based anomalies, while avoiding the repeated detection of anomalies of similar types. Detecting anomalies in data streams is challenging due to the requirement that anomalies be detected in real time. We propose an approach for real-time sentiment-based anomaly detection (RSAD) in Twitter data streams. Sentiment classification is used to split the input data stream into three independent streams (positive, neutral, and negative), which are then analyzed separately for anomalous spikes in the number of tweets. Rare anomalies and the first occurrence of repeated anomalies are distinguished from the repeated occurrence of similar anomalies. Six approaches for anomaly detection in data streams, including two baseline approaches, are described. These approaches were tested on two user-generated datasets. The first dataset concerned an international sports event and was collected from Twitter, and the second concerned a political party and was collected from multiple social media platforms.
Results from these evaluations show that a probabilistic exponentially weighted moving average (PEWMA), coupled with a sliding window that uses a median absolute deviation (MAD) calculation, is effective at identifying sentiment-based anomalies. The PEWMA-MAD approach is consistently among the top two methods for all cases tested. The simple linear regression approach is slightly better in the case of the second dataset. Overall, the results suggest that the PEWMA-MAD approach may be sufficiently robust to be applied to a wide variety of datasets from different social media platforms.
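To make the pipeline concrete, the following is a minimal Python sketch of two of the steps described above: splitting labeled tweets into per-sentiment count series, and flagging spikes with a sliding-window median and MAD. It is not the PEWMA-MAD method evaluated in this work; it omits the PEWMA component and the suppression of repeated anomalies. The function names, the window size, the threshold, and the MAD floor of 1.0 are all illustrative assumptions.

```python
from collections import deque
from statistics import median

SENTIMENTS = ("positive", "neutral", "negative")


def counts_by_sentiment(labeled, n_bins):
    """Turn (bin_index, sentiment) pairs into one count series per sentiment.

    `labeled` is assumed to come from an upstream sentiment classifier
    (hypothetical here); each pair says which time bin a tweet fell into.
    """
    streams = {s: [0] * n_bins for s in SENTIMENTS}
    for bin_index, sentiment in labeled:
        streams[sentiment][bin_index] += 1
    return streams


def mad_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds median + threshold * MAD
    of the preceding `window` counts (assumed parameters)."""
    history = deque(maxlen=window)
    spikes = []
    for i, x in enumerate(counts):
        if len(history) == window:
            med = median(history)
            mad = median([abs(h - med) for h in history])
            # Floor the MAD so a perfectly flat window does not make
            # every small increase look anomalous (an assumption).
            scale = max(mad, 1.0)
            if x > med + threshold * scale:
                spikes.append(i)
        history.append(x)
    return spikes


if __name__ == "__main__":
    streams = counts_by_sentiment(
        [(0, "positive"), (0, "positive"), (1, "negative")], n_bins=2
    )
    print(streams)
    print(mad_spikes([10, 12, 11, 10, 12, 60, 11]))
```

Each of the three series would be passed to `mad_spikes` independently, mirroring the separate analysis of the positive, neutral, and negative streams. Because the median and MAD are robust statistics, a single spike entering the window has limited effect on the threshold for the bins that follow.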