Current NLP research uses neither linguistically annotated corpora nor the traditional pipeline of linguistic modules, which raises questions about the future of linguistics. Linguists who have tried to crack the secrets of deep-learning NLP models, including BERT (a bidirectional, transformer-based model employed in Google Search), have had as their ultimate goal to show that deep nets make linguistic generalizations. I took an alternative approach. To check whether it is possible to process natural language without grammar, I developed a very simple model, the End-to-end N-Gram Model (EteNGraM), which elaborates on the standard n-gram model. At a very basic level, EteNGraM imitates current NLP research by handling semantic relations without semantics. As in NLP, I pre-trained the model on the orders of tense-aspect-mood (TAM) markers in the verbal domain, fine-tuned it, and then applied it to derive Greenberg's Universal 20 and its exceptions in the nominal domain. Although EteNGraM is ridiculously simple and operates only with bigrams and trigrams, it successfully derives and differentiates between the attested and unattested patterns in Cinque (2005), "Deriving Greenberg's Universal 20 and Its Exceptions," Linguistic Inquiry 36, and Cinque (2014), "Again on Tense, Aspect, Mood Morpheme Order and the 'Mirror Principle,'" in Functional Structure from Top to Toe: The Cartography of Syntactic Structures 9. EteNGraM also makes fine-grained predictions about preferred and dispreferred patterns across languages and reveals novel aspects of the organization of the verbal and nominal domains. To explain EteNGraM's highly efficient performance, I address issues such as the complexity of data versus the complexity of analysis; structure building by linear sequences of elements versus by hierarchical syntactic trees; and how linguists can contribute to NLP research.
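The abstract does not spell out EteNGraM's internals, so the following Python sketch is purely illustrative: a toy scorer that ranks candidate word-order patterns by how many of their bigrams and trigrams occur in a small set of training orders. The categories Dem, Num, Adj, N and the "attested" training orders are invented for the example and are not taken from the paper.

# Hypothetical illustration only: a toy bigram/trigram scorer over word-order
# patterns, in the spirit of an n-gram approach. EteNGraM's actual mechanics
# are not specified in the abstract above.

from itertools import permutations

# Invented training orders; in a real setting these would be attested patterns.
ATTESTED = [
    ("Dem", "Num", "Adj", "N"),   # English-type order
    ("N", "Adj", "Num", "Dem"),   # mirror-image order
]

def ngrams(seq, n):
    # Return all contiguous n-grams of a sequence as tuples.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def train(patterns):
    # Collect the bigrams and trigrams observed in the training patterns.
    model = {2: set(), 3: set()}
    for p in patterns:
        for n in (2, 3):
            model[n].update(ngrams(p, n))
    return model

def score(model, pattern):
    # Count how many bigrams and trigrams of the candidate are attested.
    return sum(g in model[n] for n in (2, 3) for g in ngrams(pattern, n))

if __name__ == "__main__":
    model = train(ATTESTED)
    # Rank all 24 orders of Dem, Num, Adj, N by their n-gram support.
    ranked = sorted(permutations(("Dem", "Num", "Adj", "N")),
                    key=lambda p: score(model, p), reverse=True)
    for p in ranked[:5]:
        print(score(model, p), " ".join(p))

On this toy setup, orders sharing more sub-sequences with the training patterns score higher, which is one simple way a bigram/trigram model could separate preferred from dispreferred patterns without any syntactic structure.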
Full text here.
Stela Manova for Gauss AI