SWOW-ZH – Cailab

This page contains the official released datasets for Mandarin Chinese (SWOW-ZH). When replicating published results, please check the release version, and for consistency refer to a release by the dataset acronym with the release date as its version. Don’t hesitate to get in touch if you have any queries or suggestions.

Acknowledgements and Fair Use

Many hours of work have gone into this project, and we gratefully acknowledge all the volunteers who have dedicated their time to contribute. If you find these data useful, please share the link to the study:

https://smallworldofwords.org

Your support to keep this project going and up-to-date is greatly appreciated!

Datafiles

Raw and processed data. For each language, we release the raw data and a balanced data file in which each cue has the same number of associate responses.

Associative strength. Sometimes it is convenient to know how many participants gave a specific response to a cue. In that case, download the associative frequency files, which give the conditional probability of a response given a cue. Note that the number of responses may differ between cue words. The first file contains statistics based on the first response a participant gave (R1), and the second file is based on all three responses (R123).
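As a rough illustration of what associative strength means, the sketch below estimates the conditional probability of a response given a cue from trial-level rows. The function name `associative_strength`, the `(cue, [r1, r2, r3])` layout, and the use of `None` for a missing response are assumptions made for this example, not the format of the released files. Normalising over non-missing responses is likewise an assumption.

```python
from collections import Counter, defaultdict

def associative_strength(trials, first_only=False):
    """Estimate P(response | cue) from (cue, [r1, r2, r3]) trials.

    first_only=True uses only the first response (R1);
    otherwise all three responses are pooled (R123).
    Missing responses (None) are skipped, so totals can differ per cue.
    """
    counts = defaultdict(Counter)
    for cue, responses in trials:
        kept = responses[:1] if first_only else responses
        for response in kept:
            if response is not None:
                counts[cue][response] += 1

    strength = {}
    for cue, counter in counts.items():
        total = sum(counter.values())
        strength[cue] = {r: n / total for r, n in counter.items()}
    return strength

# Toy trials (hypothetical data, not taken from SWOW-ZH).
trials = [
    ("猫", ["狗", "老鼠", "可爱"]),
    ("猫", ["狗", "鱼", None]),
    ("猫", ["老鼠", "狗", "宠物"]),
]
r1 = associative_strength(trials, first_only=True)
r123 = associative_strength(trials)
print(r1["猫"])
print(r123["猫"])
```

In this toy example the R1 strength of 狗 given 猫 is 2 out of 3 first responses, while the R123 strength is 3 out of 8 non-missing responses.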

Cue and response statistics. Cue statistics provide information about which words were known and how many responses were missing for each cue. Two files are available: one based on the first response a participant gave (R1), and one based on all three responses (R123). Response statistics include response counts for tokens and types.
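The distinction between missing responses, tokens and types can be made concrete with a small sketch. The function name `cue_and_response_stats` and the trial layout are hypothetical, and missing responses are assumed to be encoded as `None`; the released statistics files may be organised differently.

```python
from collections import Counter

def cue_and_response_stats(trials):
    """Count missing responses per cue and occurrences of each response."""
    missing_per_cue = Counter()
    response_tokens = Counter()
    for cue, responses in trials:
        missing_per_cue[cue] += 0  # make sure every cue appears, even with 0 missing
        for response in responses:
            if response is None:
                missing_per_cue[cue] += 1
            else:
                response_tokens[response] += 1
    return missing_per_cue, response_tokens

# Toy trials (hypothetical data, not taken from SWOW-ZH).
trials = [
    ("猫", ["狗", "老鼠", None]),
    ("猫", ["狗", None, None]),
    ("树", ["叶子", "森林", "绿色"]),
]
missing, tokens = cue_and_response_stats(trials)
n_tokens = sum(tokens.values())  # total number of responses given
n_types = len(tokens)            # number of distinct responses
print(dict(missing), n_tokens, n_types)
```

Here 狗 is given twice, so it counts as two tokens but only one type.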

Mandarin Chinese Data (SWOW-ZH23)

Updated 28 March 2025

Warning: the following data are part of a manuscript that is currently submitted and under review. This release is therefore likely to change. Please check back if you decide to use these data in your work.

Three-response associations were collected from 2016 to 2023 for 10,000 cue words, each answered by about 76 participants on average. Participants were native speakers of Mandarin Chinese or other Chinese dialects. A dataset from before word and participant cleaning is provided, in which 85 taboo words among the responses were masked and 19 taboo cue words were deleted. After preprocessing, 55 participants were retained for each cue word. The preprocessed dataset is also provided; it contains responses for 10,024 cue words contributed by 30,504 participants, for a total of 551,320 trials. Each row of a dataset is a single trial, consisting of demographic information for one participant and his or her three responses to one cue word. MATLAB scripts for preprocessing can be found on the SWOW-ZH GitHub page, https://github.com/lib314a/SWOWZH. The raw and preprocessed data, cue and response statistics, and centrality measures currently under review can be found below:

SWOW-ZH23 [49Mb]

We also provide relatedness measures for word pairs, based on the cosine similarity of associative strength distributions, positive pointwise mutual information (PPMI) weighted distributions, random walk (RW) vectors, and compressed random walk vectors. Note that each row of these files contains one pairwise entry, and that the files are xz-compressed to reduce file size.
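To give a sense of the first two measures, the following sketch computes cosine similarity between cue vectors, once on raw associative strengths and once after PPMI weighting. It is an illustration under stated assumptions, not the pipeline used to produce the released files. The names `ppmi` and `cosine` are hypothetical, and the response marginal is assumed to be the average of P(response | cue) over cues. The random walk measures are not covered here.

```python
import math

def ppmi(strength):
    """Weight each P(response | cue) by positive pointwise mutual information.

    Assumption: P(response) is the mean of P(response | cue) over all cues.
    """
    n_cues = len(strength)
    marginal = {}
    for dist in strength.values():
        for response, p in dist.items():
            marginal[response] = marginal.get(response, 0.0) + p / n_cues

    weighted = {}
    for cue, dist in strength.items():
        row = {}
        for response, p in dist.items():
            value = math.log2(p / marginal[response])
            if value > 0:  # keep only positive PMI values
                row[response] = value
        weighted[cue] = row
    return weighted

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy associative strengths (hypothetical data, not taken from SWOW-ZH).
strength = {
    "猫": {"狗": 0.5, "老鼠": 0.5},
    "狼": {"狗": 0.5, "森林": 0.5},
    "树": {"森林": 0.5, "叶子": 0.5},
}
weighted = ppmi(strength)
sim_raw = cosine(strength["猫"], strength["狼"])
sim_ppmi = cosine(weighted["猫"], weighted["狼"])
sim_none = cosine(strength["猫"], strength["树"])
print(sim_raw, sim_ppmi, sim_none)
```

PPMI down-weights responses that are common across many cues, so shared but unspecific responses contribute less to the similarity. The xz-compressed files can be opened in Python with the standard library call `lzma.open(path, "rt", encoding="utf-8")`, although the column layout of each row should be checked against the files themselves.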

Citation: Li, B., Ding, Z., De Deyne, S., & Cai, Q. (2025). A large-scale database of Mandarin Chinese word associations from the Small World of Words project. Behavior Research Methods. doi: 10.3758/s13428-024-02513-1.

Data in other languages

For data in Spanish, English and Dutch, please see the Small World of Words research page.

Contact simon.dedeyne@unimelb.edu.au to discuss work-in-progress files in other languages.

License

The data are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. They cannot be redistributed or used for commercial purposes.