Citation

Generating units of cultural analysis with large language models: methods and validation for scalable cross-cultural research

Author:
Syme, Kristen L.; Motos, Nikos; Placek, Caitlyn D.
Publication:
Royal Society Open Science
Year:
2026

We present a transparent, human-in-the-loop framework that uses large language models (LLMs) to transform ethnographic texts into binary, structured data suitable for statistical analysis. As a test case, we analyse ritual fasting across 56 societies from the Human Relations Area Files (HRAF), evaluating the ability of one LLM (GPT-4) to annotate constructs derived from evolutionary models of costly signalling, health trade-offs and ecological adaptation. Outputs were compared with a human consensus subset (n = 225) and two independent coders (n = 1015). GPT-4 matched or exceeded human performance on well-defined variables and highlighted systematic omissions in human annotation. Discrepancies were adjudicated to refine construct definitions and improve reliability. The resulting variables—capturing features such as leadership, sexual abstinence and resource sacrifice—function as minimal analytic units, akin to pixels in an image, enabling scalable cross-cultural comparison and statistical modelling. Although demonstrated with fasting data, the method is broadly applicable across cultural domains and theoretical frameworks, supporting reproducible, large-scale cultural analysis.
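To make the annotation-and-validation workflow concrete, the sketch below is a minimal illustration, not the authors' code: the prompt wording, the annotate_passage helper, the construct label and the placeholder data are all hypothetical. It shows how a single binary construct (for example, "sexual abstinence required during the fast") might be coded by GPT-4 for each ethnographic passage and then checked against a human consensus subset using Cohen's kappa.

```python
# Illustrative sketch only; prompt text, helper names and example data are
# hypothetical and do not reproduce the authors' implementation.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def annotate_passage(passage: str, construct: str) -> int:
    """Ask the model whether a construct is present (1) or absent (0) in a passage."""
    prompt = (
        f"Construct: {construct}\n"
        f"Ethnographic passage: {passage}\n"
        "Reply with a single character: 1 if the construct is clearly present, "
        "0 if it is absent or not mentioned."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0


# Compare LLM codes with a human consensus subset for one construct.
passages = ["...", "..."]   # HRAF paragraphs on ritual fasting (placeholders)
human_consensus = [1, 0]    # binary codes agreed by the human coders (placeholders)
llm_codes = [
    annotate_passage(p, "sexual abstinence required during the fast")
    for p in passages
]
print("Cohen's kappa (LLM vs. human consensus):",
      cohen_kappa_score(human_consensus, llm_codes))
```

In a full pipeline of this kind, the same loop would run over every construct and every society in the sample, with disagreements flagged for human adjudication before the binary variables enter statistical models.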