Real-Time Sentiment-Based Anomaly Detection in Twitter Data Streams

Patel, Khantil Ragnesh

Files

Patel_Khantil_Ragnesh_200334657_MSC_CS_Spring2016.pdf (3.04 MB)

Date

2016-03

Authors

Patel, Khantil Ragnesh

Publisher

Faculty of Graduate Studies and Research, University of Regina

Abstract

Twitter has over 316 million active users, and the engagement of these users results in the rapid production of data, notably in the context of popular topics (such as news stories, politics, and sports). This data is available in the form of data streams, which has led many researchers to develop analysis techniques specifically for Twitter data streams. Although anomaly detection in time series is a well-established research area, its application to detecting sentiment-based anomalies in large volumes of streaming data began only recently. A sentiment-based anomaly is defined as a sudden increase in the time series of tweets individually associated with a positive, neutral, or negative sentiment. The goal of this research is to develop and evaluate a technique to automatically detect sentiment-based anomalies, while avoiding the repeated detection of anomalies of similar types. Detecting anomalies in data streams is challenging due to the requirement that anomalies be detected in real time. We propose an approach for real-time sentiment-based anomaly detection (RSAD) in Twitter data streams. Sentiment classification is used to split the input data stream into three independent streams (positive, neutral, and negative), which are then analyzed separately for anomalous spikes in the number of tweets. Rare anomalies and the first occurrence of repeated anomalies are distinguished from the repeated occurrence of similar anomalies. Six approaches for anomaly detection in data streams, including two baseline approaches, are described. These approaches were tested on two user-generated datasets. The first dataset concerned an international sports event and was collected from Twitter, and the second concerned a political party and was collected from multiple social media platforms.
Results from these evaluations show that a probabilistic exponentially weighted moving average (PEWMA), coupled with a sliding window that uses a median absolute deviation (MAD) calculation, is effective at identifying sentiment-based anomalies. The PEWMA-MAD approach is consistently among the top two methods for all cases tested. The simple linear regression approach is slightly better in the case of the second dataset. Overall, the results suggest that the PEWMA-MAD approach may be sufficiently robust to be applied to a wide variety of datasets from different social media platforms.
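
The idea of splitting a stream by sentiment and checking each stream for spikes can be illustrated with a minimal sketch. This is not the thesis's RSAD or PEWMA-MAD implementation: it omits the PEWMA component entirely and shows only a plain sliding-window median/MAD spike test. The function names, the window size, the threshold, and the MAD floor of 1.0 are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import median

SENTIMENTS = ("positive", "neutral", "negative")

def split_by_sentiment(labeled, n_buckets):
    """Count tweets per time bucket for each sentiment.

    `labeled` is an iterable of (bucket_index, sentiment) pairs; the
    sentiment labels are assumed to come from some upstream classifier.
    """
    counts = {s: [0] * n_buckets for s in SENTIMENTS}
    for bucket, sentiment in labeled:
        counts[sentiment][bucket] += 1
    return counts

def mad_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds the median of the preceding
    `window` counts by more than `threshold` times their MAD.

    Hypothetical parameters: window and threshold are illustrative only.
    """
    history = deque(maxlen=window)
    spikes = []
    for i, c in enumerate(counts):
        if len(history) == window:
            med = median(history)
            mad = median([abs(x - med) for x in history])
            # Floor the MAD (assumed value 1.0) so that a perfectly flat
            # window does not turn every small increase into a spike.
            scale = max(mad, 1.0)
            if c - med > threshold * scale:
                spikes.append(i)
        history.append(c)
    return spikes

streams = split_by_sentiment(
    [(0, "positive"), (0, "negative"), (1, "positive")], n_buckets=2
)
spike_indices = mad_spikes([10, 12, 11, 10, 12, 60, 11])
```

Only increases are flagged, matching the abstract's definition of an anomaly as a sudden increase in tweet counts. Each of the three sentiment streams would be passed to `mad_spikes` separately.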

Description

A Thesis Submitted to the Faculty of Graduate Studies and Research In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science, University of Regina. xi, 125 p.

URI

https://hdl.handle.net/10294/6863

Collections

Master’s and Doctoral Theses
