Automatic detection of fake content on social media platforms such as Twitter is an enduring challenge. Technically, fake-news detection can be framed as a straightforward binary text-classification problem; in practice, however, manually fact-checking even a small fraction of daily tweets is infeasible due to the sheer volume. To address this challenge, we crawled and crowd-sourced one of the most extensive ground-truth tweet datasets to date. Using PolitiFact and expert labeling as a base, it contains more than 180,000 labels spanning 2009 to 2022, with five-label and three-label classification schemes produced via Amazon Mechanical Turk. We applied multiple levels of validation to ensure an accurate ground-truth benchmark dataset. We then implemented numerous machine learning and deep learning algorithms, including several variants of bidirectional encoder representations from transformers (BERT)-based models as well as classical machine learning algorithms, to measure real/fake tweet-detection accuracy under both labeling schemes and to determine which variants yielded the best metrics. We further analyzed the dataset by combining the DBSCAN text-clustering algorithm with the YAKE keyword-extraction algorithm to identify topic clusters and their relationships. Finally, we computed a bot score, a credibility score, and an influence score for each user in the dataset to better understand what type of Twitter user posts such content, the influence of each tweet, and whether any underlying patterns in these scores relate to a tweet's truthfulness. The experimental results demonstrate a marked improvement for models handling short-length text in solving a real-life classification problem: automatically detecting fake content on social media.