ChatGPT, n-grams and the power of subword units: The future of research in morphology

Subword units (cf. morphemes in linguistic morphology) are a powerful device for language modeling (cf. Byte Pair Encoding (BPE), a subword-based tokenization algorithm that is part of the architecture of Large Language Models (LLMs) such as ChatGPT). Based on recent advances in natural language processing, the notion of complexity (the logic of Big O notation in computer science), existing phonology-driven (form-focused) analyses of (derivational) morphology (e.g. the Stratal approach), and my own research on affix order in various languages, I maintain that research in morphology should take a form-focused perspective and that novel resources favoring such a change in perspective should be developed. I provide psycholinguistic evidence from a language with poor inflectional morphology (English) and a language with very rich inflection (Polish) that native speakers do not rely on semantic cues for affix ordering in derivation but rather memorize affix combinations as bigrams and trigrams. Speakers seem to treat frequently co-occurring, linearly adjacent affixes, whether derivational or inflectional, as single subword units longer than a morpheme, which is exactly what happens during subword-based tokenization (BPE) in an LLM. Claims that ChatGPT does not reflect human-like language processing in morphology (and beyond) are most probably due to the lack of linguistic research that adopts a ChatGPT perspective on language.

*UPDATE*: The PDF of the presentation of my invited talk at DeriMo2023 has been added to the length-limited PDF of the paper. In the presentation, I explain the logic of the BPE algorithm and illustrate how the ChatGPT tokenizer can be used for word segmentation. Significantly, ChatGPT does not combine words but tokens.
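To make the parallel concrete, here is a minimal Python sketch of the BPE merge logic (an illustrative toy, not the author's code and not the actual ChatGPT tokenizer): it repeatedly merges the most frequent adjacent pair of symbols in a small, made-up vocabulary, so that frequently co-occurring character sequences such as -ation/-ization come to be stored as single subword units, much as speakers are claimed to store frequent affix bigrams and trigrams. The toy word list, the frequencies, and the number of merge steps are arbitrary assumptions chosen only for the example.

# Minimal sketch of the Byte Pair Encoding (BPE) merge step, assuming a toy
# vocabulary of words already split into characters. Illustrative only; real
# LLM tokenizers operate on bytes and use learned merge tables with tens of
# thousands of entries.

from collections import Counter

def pair_counts(words):
    """Count how often each adjacent pair of symbols co-occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy vocabulary with frequencies: after a few greedy merges, frequent affix
# sequences such as "-ation" surface as single subword units.
words = {
    tuple("organization"): 5,
    tuple("realization"): 4,
    tuple("nation"): 3,
}

for step in range(8):
    counts = pair_counts(words)
    if not counts:
        break
    best = counts.most_common(1)[0][0]
    words = merge_pair(words, best)
    print(f"step {step + 1}: merged {best}")

print(list(words))

Production GPT tokenizers apply the same greedy-merge idea, but over bytes and with a fixed merge table learned from very large corpora, which is why a single token frequently corresponds to a string longer than a morpheme.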

Full text at: https://ling.auf.net/lingbuzz/007598

Stela Manova for Gauss:AI