What is the Value of Human Word Associations in the Time of Generative Language Models?
November 27, 2025
BY ASSISTANT PROFESSOR CYNTHIA SIEW
What is the first word that comes to your mind when seeing the word pineapple?
This is the word association game, which has a long history in psychological research.
Modern psychologists no longer endorse the early idea that word associations reveal a person’s subconscious[i], but word associations continue to be a key part of psychology research because they reveal how word meanings are represented and organised in human memory[ii].
So, what was the first word that came to your mind in response to pineapple? Perhaps it was one of these: fruit, yellow, banana, Spongebob, tart, Hawaii.
Responses like fruit, yellow, and banana tell us something about the meaning of pineapple itself. But associations like Spongebob, tart, and Hawaii tell us more about the real-world knowledge we have acquired about pineapples – we know that Spongebob Squarepants lives in a pineapple under the sea, and that Hawaii is a major producer of pineapples.
And, if you live in Southeast Asia, you would be familiar with a popular snack known as a pineapple tart.
What can we do with word associations?
When psychologists collect many of these associations across a large set of words, we can build mathematical models of the internal mental structures of the human mind. These models tell us how words and concepts are related to one another, and how humans encode their verbal and non-verbal experiences within an internal, conceptual language system[iii].
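For readers curious about what such a model looks like in practice, here is a minimal sketch in Python using the networkx library; the cue-response pairs are invented purely for illustration and are not data from any actual study.

```python
import networkx as nx  # assumes the networkx library is installed

# Toy cue-response pairs; illustrative only, not data from any actual study.
associations = [
    ("pineapple", "fruit"), ("pineapple", "yellow"), ("pineapple", "tart"),
    ("banana", "fruit"), ("banana", "yellow"), ("tart", "pastry"),
]

# Build a weighted network: words are nodes, and an edge's weight counts
# how often one word was given as a response to the other.
G = nx.Graph()
for cue, response in associations:
    if G.has_edge(cue, response):
        G[cue][response]["weight"] += 1
    else:
        G.add_edge(cue, response, weight=1)

# The network can then be queried for structural properties of the "mental model".
print(nx.shortest_path(G, "pineapple", "pastry"))  # how two concepts are linked
print(nx.degree_centrality(G))                     # which words act as hubs
```

With real data collected from thousands of participants, the same kind of network can be interrogated with far richer measures, but the basic idea is the same: words become nodes, associations become links, and the structure of the whole network becomes an object of study.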
For example, studying word associations revealed to us how words like bubble, immunity, and virus rapidly changed their meanings before and after the COVID-19 pandemic[iv].
In my own research, I have discovered differences in the meanings of common English words across speakers of North American English and Singapore English. For example, the word shag elicited different patterns of associations between the two groups of participants. North American participants produced responses like carpet, sex, and hair, whereas Singapore English participants produced responses like tired, exhausted, and sian (a Singapore English word that means boring or tiresome). These results highlight nuances in the meanings of words, even among people who speak the same language[v].
Collecting enough word associations is difficult!
Although these examples highlight the utility of word association data, one of the biggest challenges researchers face is acquiring enough data to do these analyses in the first place. It can be expensive and time-consuming to collect many word associations, especially if we want to ensure that the data is representative of a diverse population and is collected for thousands of words and concepts found in any given language.
The most well-known database of word associations is the Small World of Words project[vi]. It took the researchers almost eight years of painstaking work to collect word associations from over 90,000 participants for 12,000 English words. Even though my research group is only collecting data for a smaller subset of words unique to Singapore English[vii] (https://singlishwords.nus.edu.sg), we have faced similar challenges.
Can LLMs save the day?
With the emergence and easy accessibility of generative language models such as GPT, language scientists are exploring the possibility of using Large Language Models (LLMs) to supplement, or even replace, data collection with human participants.
The idea is simple: Prompts are designed to “nudge” the LLM to behave as if it is a participant in an experiment. Then additional instructions are provided to elicit responses from the LLM. These responses are evaluated for their suitability to serve as a proxy for the data collected from humans. So far, promising results have been obtained in many areas of social science, including psychology[viii], linguistics[ix], and economics[x]. LLMs have even been used to generate word association data[xi]. In these studies, the responses from LLMs have been found to match the human data well.
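To give a sense of what this kind of prompting looks like, here is a minimal sketch using the OpenAI Python client; the model name, wording of the prompts, and sampling settings are illustrative assumptions, not the exact setup of any of the studies cited above.

```python
from openai import OpenAI  # assumes the OpenAI Python client (v1+) is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

cue = "pineapple"

# First "nudge" the model to behave like a participant, then elicit responses.
completion = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    temperature=1.0,  # some randomness, so repeated calls resemble different participants
    messages=[
        {"role": "system",
         "content": "You are a participant in a psychology experiment on word associations."},
        {"role": "user",
         "content": f"What are the first three words that come to mind when you see the word "
                    f"'{cue}'? Reply with only the three words, separated by commas."},
    ],
)

print(completion.choices[0].message.content)
```

Running such a prompt many times, across many cue words, produces a dataset that can be compared directly against the responses of human participants.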
Given that these impressive results were achieved at a fraction of the time and cost that human data collection would entail, LLMs have been touted as one way of potentially revolutionising social science research practices[xii].
Taking a closer look at LLM-generated data
In my view, it is worth looking past the hype and examining the actual responses produced by LLMs. When researchers do so, they find that LLM responses do not capture the full spectrum of human and cultural diversity[xiii]. Furthermore, this limitation cannot be readily resolved by adjusting parameters or prompts[xiv].
As a researcher of Singapore English, one of my research goals is to develop language databases of Singapore English. In a recent paper, my research group and I compared GPT-4’s responses to those of Singapore English speakers on a language rating task[xv].
In this task, we asked participants and GPT-4 to evaluate Singapore English concepts like sian and shiok (which means pleasurable or satisfying) on a number of lexical properties, such as valence (whether the word elicited positive or negative feelings) and humor (the extent to which the word elicited humorous thoughts). For instance, most of our participants considered shiok to be a positive word, and sian a negative word.
Although we found moderate correlations between LLM and human ratings, a number of other interesting findings emerged. First, these correlations were weaker than those previously reported in the literature for English and Spanish words. Second, LLM performance was inconsistent across lexical properties: while it was acceptable for valence, it was especially poor for humor, suggesting that more complex properties of language are harder for an LLM to infer. Finally, LLM ratings did not predict human language performance as well as human ratings did.
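For readers unfamiliar with how such comparisons are typically quantified, the sketch below shows one standard approach: correlating human and LLM ratings for the same set of words. The words, rating scale, and numbers are toy values for illustration only and are not taken from our paper.

```python
from scipy.stats import spearmanr  # assumes SciPy is installed

# Toy valence ratings for the same words, once averaged over human participants
# and once produced by an LLM. The numbers are invented for illustration only.
words         = ["shiok", "sian", "makan", "kiasu"]
human_ratings = [7.8, 2.1, 6.9, 3.5]
llm_ratings   = [7.2, 2.5, 6.1, 4.4]

# Spearman's rank correlation asks whether the two sets of ratings order the
# words in roughly the same way; values near 1 indicate close agreement.
rho, p_value = spearmanr(human_ratings, llm_ratings)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```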
In a different project, our lab has also discovered that the latent semantic, or meaning, structure of LLM word associations appears to be less internally coherent than the semantic structure of human word associations[xvi]. This implies that the organisation of concepts in an LLM’s “memory” is more fragmented and less cohesive than how the same set of concepts is organised in human memory.
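There are many ways to quantify how coherent such a structure is. One simple and common network-level indicator, sketched below, is the average clustering coefficient of the association network, which measures how often a word’s neighbours are also connected to one another. This is an illustrative measure with invented toy data, not necessarily the analysis reported in our work.

```python
import networkx as nx

def coherence(pairs):
    """Average clustering coefficient of a word association network.

    `pairs` is a list of (cue, response) tuples. Higher values mean that
    the neighbourhoods of related words are more tightly interconnected.
    """
    G = nx.Graph()
    G.add_edges_from(pairs)
    return nx.average_clustering(G)

# Two toy, invented sets of associations for the sake of comparison.
human_pairs = [("dog", "cat"), ("dog", "pet"), ("cat", "pet"), ("pet", "animal")]
llm_pairs   = [("dog", "cat"), ("dog", "pet"), ("cat", "animal"), ("pet", "fish")]

print("human network coherence:", coherence(human_pairs))
print("LLM network coherence:  ", coherence(llm_pairs))
```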
The irreplaceable value of human data in social science research
Taken together, our research experiences highlight two different but complementary insights. First, it is not trivial to replace human data collection with LLMs, particularly for research on under-resourced and under-represented languages and dialects. Second, the value of human-generated linguistic data cannot be overstated.
On reflection, it is perhaps not surprising that LLM-generated responses cannot entirely replicate the diversity in human data. Generative models are explicitly designed to give the user the most likely continuation to a prompt. Even though an LLM can produce reasonable responses, those responses ultimately miss the long tail of quirky associations (like Spongebob in response to pineapple) that reflect the richness of the human experience.
A second reason is that LLMs are trained on observable language data, mostly found on the Internet and produced by people who are not necessarily representative of all human cultures[xvii]. Hence, an LLM’s patterns of word associations necessarily reflect the external language patterns of particular cultures. Human word associations, on the other hand, reflect a complex combination of word meanings, usage patterns, real-world knowledge, and personal experiences. They reflect the internal linguistic worldview of our minds, a by-product of each of our own unique lived experiences.
As a language scientist, I find that your very human responses to a simple word association game have far more potential to provide much-needed insights into the nature of human memory than the generated responses of an LLM. I hope that you appreciate the exquisiteness and individuality of your own word associations as well.
[i] Jung, C. G. (1910). The Association Method. The American Journal of Psychology, 21(2), 219. https://doi.org/10.2307/1413002
[ii] Deese, J. (1965). The Structure of Associations in Language and Thought. Johns Hopkins Press.
[iii] Ufimtseva, N. V. (2014). Russian Psycholinguistics: Contribution to the Theory of Intercultural Communication. Intercultural Communication Studies, XXIII(1).
[iv] Laurino, J., De Deyne, S., Cabana, Á., & Kaczer, L. (2023). The Pandemic in Words: Tracking Fast Semantic Changes via a Large-Scale Word Association Task. Open Mind, 7, 221–239. https://doi.org/10.1162/opmi_a_00081
[v] https://osf.io/d56mf/files/kz34r
[vi] De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The “Small World of Words” English word association norms for over 12,000 cue words. Behavior Research Methods, 51, 987–1006. https://doi.org/10.3758/s13428-018-1115-7
[vii] Wong, J. J., & Siew, C. S. Q. (2024). Preliminary Data from the Small World of Singlish Words Project: Examining Responses to Common Singlish Words. Journal of Open Psychology Data, 12(1), 3. https://doi.org/10.5334/jopd.108
[viii] Trott, S., Jones, C., Chang, T., Michaelov, J., & Bergen, B. (2023). Do Large Language Models Know What Humans Know? Cognitive Science, 47(7), e13309. https://doi.org/10.1111/cogs.13309
[ix] Trott, S. (2024). Can large language models help augment English psycholinguistic datasets? Behavior Research Methods, 56(6), 6082–6100. https://doi.org/10.3758/s13428-024-02337-z
[x] Horton, J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? (Working Paper No. w31122). National Bureau of Economic Research. https://doi.org/10.3386/w31122
[xi] Abramski, K., Improta, R., Rossetti, G., & Stella, M. (2025). The “LLM World of Words” English free association norms generated by large language models. Scientific Data, 12(1), 803. https://doi.org/10.1038/s41597-025-05156-9
[xii] Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597–600. https://doi.org/10.1016/j.tics.2023.04.008
[xiii] Xiao, B., Duan, X., Haslett, D. A., & Cai, Z. (2025). Human-likeness of LLMs in the mental lexicon. The SIGNLL Conference on Computational Natural Language Learning. https://openreview.net/forum?id=beu7HZAYtG
[xiv] Murthy, S. K., Ullman, T., & Hu, J. (2025). One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 11241–11258. https://doi.org/10.18653/v1/2025.naacl-long.561
[xv] Siew, C. S. Q., Chang, F., & Wong, J. J. (2025). Investigating the Effects of Valence, Arousal, Concreteness, and Humor on Words Unique to Singapore English. Journal of Cognition, 8(1), 53. https://doi.org/10.5334/joc.470
[xvi] https://osf.io/d56mf/files/kz34r
[xvii] Atari, M., Xue, M. J., Park, P. S., Blasi, D. E., & Henrich, J. (2023). Which Humans? PsyArXiv. https://doi.org/10.31234/osf.io/5b26t
Assistant Professor Cynthia Siew (NUS Psychology) combines experimental methods from cognitive psychology and psycholinguistics, computational modeling and mathematical methods from network science, and large-scale analysis of databases and linguistic corpora to address complex questions about the lexicon: how its structure influences processing and how it changes over time.
