Citation

Beyond keywords: modeling the semantic complexity of deceptive communication on instant messaging platforms

Authors: Xu, Shuo; Ding, Zhanyi; Wei, Zijing; Yang, Chao Peter; Li, Yixiang; Chen, Xuanjie; Wang, Hailiang
Publication: Journal of Computational Social Science
Year: 2026

Instant messaging platforms such as Telegram enable rapid information exchange but also facilitate deceptive messaging at scale. In this study, we examine Telegram spam detection through a hierarchy of models that vary in linguistic modeling capacity, from interpretable lexical baselines (Logistic Regression, Random Forest, LightGBM) to sequential (GRU) and context-aware transformer representations (ALBERT). Using a harmonized preprocessing and evaluation pipeline on 20,348 labeled messages, we compare predictive performance across metrics (F1, ROC–AUC, PR–AUC, calibration) and assess pairwise differences via McNemar’s test with multiple-comparison correction. Across all metrics, ALBERT achieves the strongest performance and substantially improves spam-class detection relative to lexical models. This performance gap is consistent with the presence of a subset of deceptive messages whose signals are less concentrated in surface keywords and more distributed across context. However, improved performance may also reflect differences in model capacity and inductive bias, benefits from large-scale pretraining, and stronger handling of sparse patterns via contextual and subword representations. Accordingly, we interpret the proposed “complex tier” as an operational characterization of lexically subtle spam in this corpus, and we suggest that keyword-based moderation may be insufficient on its own to capture the full spectrum of deceptive messaging observed here.
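
The pairwise comparison step mentioned in the abstract (McNemar's test with multiple-comparison correction) can be illustrated with a minimal sketch. The abstract does not say which variant of the test or which correction was used, so the exact binomial form of McNemar's test and the Holm step-down correction below are assumptions, and the function names `mcnemar_exact`, `holm_adjust`, and `pairwise_mcnemar` are hypothetical helpers, not the authors' code.

```python
from itertools import combinations
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same items.

    Only discordant items matter: those where exactly one model is correct.
    (Assumption: exact binomial variant; the paper may use the chi-square form.)
    """
    # b: model A correct, model B wrong; c: model A wrong, model B correct
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(b, c)
    # Under H0 each discordant item favors either model with probability 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original order.

    (Assumption: Holm is used here only as one common correction.)
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still in play, keep monotone
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def pairwise_mcnemar(y_true, predictions):
    """Run McNemar's test on every model pair and Holm-adjust the p-values.

    `predictions` maps a model name to its list of predicted labels.
    """
    pairs = list(combinations(predictions, 2))
    raw = [mcnemar_exact(y_true, predictions[a], predictions[b]) for a, b in pairs]
    return dict(zip(pairs, holm_adjust(raw)))


if __name__ == "__main__":
    # Toy labels and predictions, invented purely for illustration
    y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    preds = {
        "lexical": [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
        "sequential": [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
        "transformer": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    }
    for pair, p_adj in pairwise_mcnemar(y, preds).items():
        print(pair, round(p_adj, 4))
```

Because McNemar's test conditions on the discordant items only, two models with similar aggregate F1 can still differ significantly if their errors fall on different messages, which is why it suits paired comparisons on a shared test set.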