Recent advances in Natural Language Processing- Some Woolly speculations

If you enjoy this article, please check out my free book by clicking Here: “Something to Read in Quarantine: Essays 2018-2020.”

Natural Language Processing (NLP) per Wikipedia:

“Is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.”

The field has seen tremendous advances during the recent explosion of progress in machine learning techniques.

Here are some of its more impressive recent achievements:

A) The Winograd Schema is a test of common sense reasoning- easy for humans, but historically almost impossible for computers- which requires the test taker to indicate which noun an ambiguous pronoun stands for. The correct answer hinges on a single word, which is different between two separate versions of the question. For example:

The city councilmen refused the demonstrators a permit because they feared violence.

The city councilmen refused the demonstrators a permit because they advocated violence.

Who does the pronoun “they” refer to in each instance?

The Winograd schema test was originally intended to be a more rigorous replacement for the Turing test, because it seems to require deep knowledge of how things fit together in the world, and the ability to reason about that knowledge in a linguistic context. Recent advances in NLP have allowed computers to achieve near-human scores.

B) The New York Regent’s science exam is a test requiring both scientific knowledge and reasoning skills, covering an extremely broad range of topics. Some of the questions include:

1. Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter

2. Which form of energy is produced when a rubber band vibrates? (1) chemical (2) light (3) electrical (4) sound

3. Because copper is a metal, it is (1) liquid at room temperature (2) nonreactive with other substances (3) a poor conductor of electricity (4) a good conductor of heat

4. Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal

On the 8th-grade, non-diagram-based questions of the test, a program was recently able to score 90%.


It’s not just about answer selection either. Progress in text generation has been impressive. See, for example, the text samples created by Megatron.


Much of this progress has been rapid. Big progress on the Winograd schema, for example, still looked like it might be decades away back in (from memory) much of 2018. The computer science is advancing very fast, but it’s not clear our concepts have kept up.

I found this relatively sudden progress in NLP surprising. In my head- and maybe this was naive- I had thought that, in order to attempt these sorts of tasks with any facility, it wouldn’t be sufficient to simply feed a computer lots of text. Instead, any “proper” attempt to understand language would have to integrate different modalities of experience and understanding, like visual and auditory, in order to build up a full picture of how things relate to each other in the world. Only on the basis of this extra-linguistic grounding could it deal flexibly with problems involving rich meanings- we might call this the multi-modality thesis. Whether the multi-modality thesis is true for some kinds of problems or not, it’s certainly true for far fewer problems than I, and many others, had suspected.

I think science-fictiony speculations generally backed me up on this (false) hunch. Most people imagined that this kind of high-level language “understanding” would be the capstone of AI research, the thing that comes after the program already has a sophisticated extra-linguistic model of the world. This sort of just seemed obvious- a great example of how assumptions you didn’t even know you were making can ruin attempts to predict the future.

In hindsight it makes a certain sense that reams and reams of text alone can be used to build the capabilities needed to answer questions like these. A lot of people remind us that these programs are really just statistical analyses of the co-occurrence of words, however complex and glorified. However, we should not forget that the relationships between words are isomorphic to the relations between things- that isomorphism is why language works. This is to say the patterns in language use mirror the patterns of how things are(1). Models are transitive- if x models y, and y models z, then x models z. The upshot of these facts is that if you have a really good statistical model of how words relate to each other, that model is also implicitly a model of the world.
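To make the "statistical analysis of co-occurrence" idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the six-sentence corpus is invented, the stopword list is ad hoc, and raw pair counting is vastly cruder than anything a modern language model does. The only point is that even counts this simple start to reflect a fact about the world (magnets go with iron, not with pepper).

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented purely for illustration (not real training data).
CORPUS = [
    "a magnet attracts iron filings",
    "iron is attracted by a magnet",
    "black pepper is a spice",
    "copper conducts heat",
    "copper is a metal",
    "iron is a metal",
]

# Ad hoc list of function words to ignore (an assumption of this sketch).
STOPWORDS = {"a", "is", "by"}


def cooccurrence_counts(sentences):
    """Count how often each pair of content words shares a sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()) - STOPWORDS)
        # Pairs come out in alphabetical order because `words` is sorted.
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


def association(counts, word_a, word_b):
    """Look up the co-occurrence count for an unordered pair of words."""
    return counts[tuple(sorted((word_a, word_b)))]


counts = cooccurrence_counts(CORPUS)
print(association(counts, "magnet", "iron"))    # co-occur in two sentences
print(association(counts, "magnet", "pepper"))  # never co-occur
```

In this toy setting the word statistics already "know" which of iron and pepper a magnet is associated with, even though nothing in the program represents magnetism as such.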

It might be instructive to think about what it would take to create a program which has a model of eighth grade science sufficient to understand and answer questions about hundreds of different things like “growth is driven by cell division”, and “What can magnets be used for” that wasn’t NLP led. It would be a nightmare of many different (probably handcrafted) models. Speaking somewhat loosely, language allows for intellectual capacities to be greatly compressed. From this point of view, it shouldn’t be surprising that some of the first signs of really broad capacity- common sense reasoning, wide ranging problem solving etc., have been found in language based programs- words and their relationships are just a vastly more efficient way of representing knowledge than the alternatives.

So I find myself wondering if language is not the crown of general intelligence, but a potential shortcut to it.


A couple of weeks ago I finished this essay, read through it, and decided it was not good enough to publish. The point about language being isomorphic to the world, and that therefore any sufficiently good model of language is a model of the world, is important, but it’s kind of abstract, and far from original.

Then today I read this report by Scott Alexander of having trained GPT-2 (a language program) to play chess. I realised this was the perfect example. GPT-2 has no (visual) understanding of things like the arrangement of a chess board. But if you feed it enough sequences of alphanumerically encoded games- 1.Kt-f3, d5 and so on- it begins to understand patterns in these strings of characters which are isomorphic to chess itself. Thus, for all intents and purposes, it develops a model of chess.

Exactly how strong this approach is- whether GPT-2 is capable of some limited analysis, or can only overfit openings- remains to be seen. We might have a better idea as it is optimised — for example, once it is fed board states instead of sequences of moves. Either way though, it illustrates the point about isomorphism.
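The chess example can be sketched with something far cruder than GPT-2. The code below is not Scott Alexander's setup and not how GPT-2 works (GPT-2 is a large neural network, this is a bigram counter); the move lists are hypothetical opening fragments I made up, written in modern algebraic notation. It only illustrates the narrow point that a program which sees nothing but strings of moves can pick up regularities that correspond to how games actually tend to go.

```python
from collections import Counter, defaultdict

# Hypothetical opening fragments, invented for this sketch.
GAMES = [
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 e5 Nf3 Nc6 Bc4",
    "e4 c5 Nf3 d6 d4",
    "d4 d5 c4 e6 Nc3",
    "e4 e5 Nf3 Nf6 Nxe5",
]


def train_bigrams(games):
    """For each move token, count which token follows it in the games."""
    follow = defaultdict(Counter)
    for game in games:
        moves = ["<start>"] + game.split()
        for prev, nxt in zip(moves, moves[1:]):
            follow[prev][nxt] += 1
    return follow


def most_likely_next(follow, prev):
    """Return the most frequent continuation of `prev`, or None if unseen."""
    if prev not in follow:
        return None
    return follow[prev].most_common(1)[0][0]


model = train_bigrams(GAMES)
print(most_likely_next(model, "<start>"))  # most common first move here
print(most_likely_next(model, "e4"))       # most common reply to e4 here
print(most_likely_next(model, "Nf3"))      # most common move after Nf3 here
```

A model like this has no board and no rules, and it will happily suggest illegal moves in positions it has not seen, which is the "overfitting openings" worry in miniature. How much further a far larger model trained on far more games gets is the open question.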

Of course, everyday language stands in a woollier relation to sheep, pine cones, desire and quarks than the formal notation of chess stands in to chess moves, and the patterns are far more complex. Modality, uncertainty, vagueness and other complexities enter, but the isomorphism between world and language is there, even if inexact.

Postscript- The Chinese Room Argument

After similar arguments are made, someone usually mentions the Chinese room thought experiment. There are, I think, two useful things to say about it:

A) The thought experiment is an argument about understanding in itself, as distinct from the capacity to handle tasks, and understanding in that sense is a difficult thing to quantify or pin down. It’s unclear that there is a practical upshot for what AI can actually do.

B) A lot of the power of the thought experiment hinges on the fact that the room solves questions using a lookup table, which stacks the deck. Perhaps we would be more willing to say that the room as a whole understood language if it formed an (implicit) model of how things are, and of the current context, and used those models to answer questions? Even if this doesn’t deal with all the intuition that the room cannot understand Chinese, I think it takes a bite from it.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

(1)- Strictly of course only the patterns in true sentences mirror, or are isomorphic to, the arrangement of the world, but most sentences people utter are at least approximately true.

14 thoughts on “Recent advances in Natural Language Processing- Some Woolly speculations”

  1. >So I find myself wondering if language is not the crown of general intelligence, but a potential shortcut to it.

    I’ve come to this view myself, too. It’s possible that language and “general intelligence” co-evolved, because, as you note, language is a compressed representation of a world-model, and intelligence is needed to efficiently compress world-models into short sequences.

    One thing that current approaches are not, however, is hardware-efficient. Would you agree with that? Do you think the same level of text generation/question-answering/test-taking/chess-playing ability can be done with much less hardware required?

    It’s entirely possible that the answer is “No,” but it seems to me that “just throw more hardware and data at it,” isn’t going to make progress in that direction.


    1. That’s a good point and insight. I agree that new methods need to be developed, as brute force will take too much processing.


  2. I have my doubts about the benchmarks we use for these tests. Are the models really intelligent, or are they simply recalling data they have seen previously? With the amount of data these models use, they are bound to have found something similar during their training. But there is no mechanism to control for this and to see exactly how the models generalize. I think that simple schemas such as the Winograd schema are too rough for the models we have today, and we need evaluation that is more precise and more on point in measuring exactly how these models work and how logic/understanding and memory/remembering interact within them.


  3. What does the Chinese room argument have to do with consciousness? And how does NLP overcome even in the slightest the core argument?


    1. You’re quite right, it’s not really about consciousness, it’s about understanding- I was shoddily using consciousness as a synonym for “true understanding”. My philosophical acumen has grown rusty since I dropped out of philosophy grad school. I’ve fixed the text accordingly.


  4. Interesting thoughts. I would suggest that language actually has a double-edged connection to the world. One is isomorphism, just as you say, but the other is a causal connection which is essentially significance. So if I say “I saw Tiger Woods playing golf yesterday”, then that utterance is isomorphic in that it depicts an event, but if true, it also signifies the event, in the sense that an event like the one depicted is actually part of the causal history of the utterance. In other words, language signifies as well as depicts. To get a deeper understanding of language we would have to teach the computer to model the underlying causality of sentences as well as what the sentences depict. But as you say, it is astonishing how far the latest generation of models can go just by observing patterns in texts.


  5. Thanks for sharing your thoughts. I still feel as though multi-modality will be required (or at least, worth doing), but GPT-3 has definitely shifted my intuitions on this. As you point out, big language models are doing things that we never would have guessed they would, and they could still be made 1000x larger, presumably. What other emergent things will show up then? Will be fascinating to watch.

