Tagging Singapore English
October 17, 2025
Since its inception in 1988, the International Corpus of English (ICE) has been a cornerstone for research on World Englishes, comprising corpora from 14 countries, ranging from Inner Circle countries like Britain and the US to Outer Circle countries like Singapore and the Philippines. Grammatically annotating the ICE corpora is a tall order due to limited resources and the need for human oversight. Part-of-speech (PoS) taggers are tools that label each word in a text with its grammatical category; for example, ‘table’ is labelled with the tag ‘noun’. Modern PoS taggers are typically trained on data from Inner Circle English, but they can serve as cost-effective tools for tagging Outer Circle English, though with lower accuracy. Even so, their relatively high performance makes it easier to check and correct the automatic tagging by hand, easing what would otherwise be an extremely labour-intensive process. The corrected texts can then be used as a benchmark to improve PoS taggers, in turn making them more effective for processing Outer Circle English materials.
In ‘Tagging Singapore English’ (World Englishes, 2022), Bao et al. (NUS English, Linguistics and Theatre Studies) explored using the Stanford PoS tagger, trained on standard American English, to tag the Singaporean component of the ICE (ICE-SIN). Tagging ICE-SIN is part of a larger effort to build a tagged and parsed treebank of Singapore English. The researchers found that the Stanford PoS tagger attained 96% accuracy in the more formal registers of ICE-SIN, comparable to the accuracy rates reported for British and American English. As expected, accuracy was lower in the informal register of private conversations. The researchers partly attributed this reduced accuracy to contact-induced changes that are characteristic of Singapore English, including lexical and grammatical borrowings.
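To make the idea of per-register accuracy concrete, here is a minimal sketch of how tagger output might be scored against human-corrected tags. The function name, the register labels and the toy records are all hypothetical and are not taken from the article; the tags merely follow the Penn Treebank style, and the numbers produced have no relation to the 96% figure reported above.

```python
from collections import defaultdict


def accuracy_by_register(records):
    """Return {register: share of tokens whose predicted tag equals the gold tag}.

    `records` is an iterable of (register, predicted_tag, gold_tag) triples,
    where the gold tag is the human-corrected one.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for register, predicted, gold in records:
        total[register] += 1
        if predicted == gold:
            correct[register] += 1
    return {register: correct[register] / total[register] for register in total}


# Hypothetical toy data, invented purely for illustration.
sample = [
    ("formal", "DT", "DT"),
    ("formal", "NN", "NN"),
    ("formal", "VBZ", "VBZ"),
    ("formal", "JJ", "JJ"),
    ("private", "NN", "UH"),   # e.g. a discourse particle mistagged as a noun
    ("private", "VBD", "VBD"),
    ("private", "NN", "NN"),
    ("private", "PRP", "PRP"),
]

scores = accuracy_by_register(sample)
for register, score in scores.items():
    print(f"{register}: {score:.0%}")
```

Scoring each register separately, rather than pooling all tokens, is what lets a gap between formal texts and private conversations show up at all.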
Lexical borrowings that were limited to Singapore and Malaysia, such as ‘kiasu’ and ‘kopitiam’, posed less of a challenge as they could be treated as regular words. However, grammatical borrowings, such as the sentence-final particles and novel uses of words like ‘got’, posed a greater problem by introducing an extra layer of structure or grammatical meaning not found in English morphosyntax. Properly tagging these forms required contextual morphosyntactic information to resolve uncertainty about which category a word belongs to.
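As a rough illustration of why context matters, the sketch below retags a word only when it is a known particle in sentence-final position. Everything here is an assumption made for the example: the particle list, the choice of the Penn Treebank interjection tag ‘UH’ as the target, and the function itself are hypothetical and do not describe the procedure used in the study.

```python
# Illustrative particle list; an assumption, not the inventory used in the study.
PARTICLES = {"lah", "lor", "meh"}
PUNCTUATION = {".", "?", "!"}


def retag_final_particles(tagged, particle_tag="UH"):
    """Retag a known particle if it is the last non-punctuation token.

    `tagged` is a list of (token, tag) pairs for one sentence. The target
    tag "UH" is a placeholder choice for this sketch.
    """
    # Find the last token that is not sentence-final punctuation.
    last = len(tagged) - 1
    while last >= 0 and tagged[last][0] in PUNCTUATION:
        last -= 1

    result = list(tagged)
    if last >= 0 and tagged[last][0].lower() in PARTICLES:
        result[last] = (tagged[last][0], particle_tag)
    return result


# A tagger trained on American English might guess "noun" for an unknown word.
sentence = [("Okay", "UH"), ("lah", "NN"), (".", ".")]
corrected = retag_final_particles(sentence)

# The same word form away from the end of the sentence is left untouched.
mention = [("lah", "NN"), ("is", "VBZ"), ("a", "DT"), ("particle", "NN")]
unchanged = retag_final_particles(mention)
```

The point of the positional check is that the word form alone does not settle the category; a rule like this is far cruder than the contextual morphosyntactic information described above, which is one reason human correction remains necessary.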
Tagging ICE-SIN not only provides important insights into the contact-induced changes in Singapore English, but also demonstrates the feasibility and benefits of linguistically annotating other Outer Circle varieties within the ICE project. A tagged ICE-SIN allows for data-driven investigation of language contact in unprecedented detail. More broadly, systematically annotating the various ICE corpora would further establish the corpus as an invaluable resource for quantitative research on language variation.
