Item Open Access
New probabilistic approaches for detecting and evaluating concept drift in data streams (Faculty of Graduate Studies and Research, University of Regina, 2025-01)
Parasteh, Sirvan; Sadaoui, Samira; Butz, Cory; Uddin, Sami; Yow, Kin-Choong; Shafiq, Omair
In modern applications like online shopping, financial forecasting, and real-time fraud detection, data distributions frequently shift, causing predictive models trained on historical data to underperform. This phenomenon, known as Concept Drift (CD), presents a major challenge in adaptive learning environments, necessitating ongoing monitoring and adjustment to accommodate evolving data streams. Active drift detection methods, which track changes in data distribution or model performance, offer a targeted solution by prompting adaptations only when significant shifts are detected. However, existing active methods face challenges: distribution-based approaches may miss subtle drifts or respond to non-critical changes (virtual drift), while performance-based methods, which detect shifts impacting model accuracy (real drift), can overreact to transient noise, leading to unnecessary adaptations. These challenges underscore the need to balance sensitivity and stability in CD detection. To address these issues, we propose a hybrid approach that combines insights derived from the data distribution through probabilistic measures, such as the marginal probability distribution of the input data or the classifier confidence, with error-based detection, offering a more robust and precise solution for managing CD. The first key contribution is the development of SPNCD, a probabilistic method leveraging Sum-Product Networks (SPNs) to detect real and virtual drifts by analyzing shifts in the joint probability distribution of features and class labels.
Inspired by the Bayesian CD definition, SPNCD integrates prediction error, which assesses model performance, and the marginal distribution, which captures changes in data distribution. Building on this approach, we then develop the PRDD algorithm, which uses the classifier’s confidence as an indirect estimate of data distribution similarity, alongside error rates, to detect real drift precisely and promptly in dynamic data streams. Based on these foundations, we develop NPRDD, an enhanced method specifically designed for noisy data environments, which combines cross-entropy-based surprise measures with predicted class probabilities to distinguish genuine drifts from noise. We further enhance PRDD with two detection strategies: 1) PRDDW, which uses a fixed-size sliding window to determine the proportion of real-drift candidates, and 2) PRDDS, which adopts a reward-aging mechanism to compute a drift score based on recent drift events. To ensure the usability of these two methods, we present a parameter optimization procedure that uses Bayesian optimization to find robust default parameter values that generalize well across various scenarios. To validate each of the proposed methods, we conduct an exhaustive experimental study involving different synthetic data streams that simulate abrupt and gradual drifts. These studies also compare our methods to several benchmark drift detectors. Moreover, we devise a theoretical framework to understand the impact of the critical components of our methods. Specifically for PRDDW and PRDDS, we also design an empirical framework that generates 4,000 unique synthetic data streams, defines the drift regions, and presents several metrics to assess the performance of the base learners and the detectors; the evaluation of detectors is an aspect that needs to be improved in the literature.
The empirical results show that, in most cases, our methods outperform existing CD detection methods on both classification and detection-based metrics and rank among the top performers, underscoring their robustness and practical applicability. Moreover, our experimental framework provides benchmarks for reproducible evaluations, setting a new standard for future research in CD detection.
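
The two detection strategies named in the abstract, a fixed-size sliding window over real-drift candidates and a reward-aging drift score, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class names, the parameters (`window_size`, `threshold`, `reward`, `decay`), their default values, and the exact update rules are all assumptions made for illustration, and the step that decides whether an observation is a real-drift candidate is left out and represented by a plain boolean flag.

```python
from collections import deque


class WindowProportionDetector:
    """Hypothetical sketch: alarm when the share of drift candidates
    in a fixed-size sliding window reaches a threshold."""

    def __init__(self, window_size=50, threshold=0.5):
        # deque with maxlen drops the oldest flag automatically
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, is_candidate):
        self.window.append(bool(is_candidate))
        # Wait until the window is full before judging the proportion
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) >= self.threshold


class AgingScoreDetector:
    """Hypothetical sketch: each candidate adds a reward to a score
    that decays (ages) at every step; alarm when the score is high."""

    def __init__(self, reward=1.0, decay=0.9, threshold=3.0):
        self.reward = reward
        self.decay = decay
        self.threshold = threshold
        self.score = 0.0

    def update(self, is_candidate):
        # Age the accumulated score, then reward a new candidate
        self.score *= self.decay
        if is_candidate:
            self.score += self.reward
        return self.score >= self.threshold


def first_alarm(detector, flags):
    """Return the index of the first alarm, or None if none is raised."""
    for i, flag in enumerate(flags):
        if detector.update(flag):
            return i
    return None


if __name__ == "__main__":
    # Assumed toy stream: 20 quiet observations, then 20 drift candidates
    flags = [False] * 20 + [True] * 20
    print(first_alarm(WindowProportionDetector(window_size=10), flags))
    print(first_alarm(AgingScoreDetector(), flags))
```

In this sketch the window-based detector reacts only once candidates fill enough of the window, while the aging score reacts to a short burst of recent candidates and forgets isolated ones; this is the sensitivity versus stability trade-off that the abstract describes, though the actual PRDDW and PRDDS rules may differ.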