Massive performance jump in two very interesting natural language benchmarks

NOTE: This story is now subject to an update which likely invalidates its central premise:

“BERT Ensemble – MBGA Optimization” is a new entrant on two natural language understanding leaderboards at the Allen AI Leaderboards site. Without trying to sound clickbaity, I audibly gasped when I read its results.

It scored 97.6% on the ARC reasoning challenge (Easy set). The previous high score was 92.6%, i.e., it cut the error rate by roughly two thirds (from 7.4% to 2.4%). The ARC-Easy set, despite the name, contains both the hard and easy questions from the ARC challenge task (the hard version, by contrast, excludes the easy questions). In the words of Allen AI:

“The ARC dataset contains 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering.”

In other words, it’s a standardized school science test. Here are some examples of the questions:

1. Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter

2. Which form of energy is produced when a rubber band vibrates? (1) chemical (2) light (3) electrical (4) sound

3. Because copper is a metal, it is (1) liquid at room temperature (2) nonreactive with other substances (3) a poor conductor of electricity (4) a good conductor of heat

4. Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal

While a jump of five percentage points might not seem enormous, what’s so important here is that it cuts the previous model’s number of mistakes by roughly two thirds, suggesting a huge leap in power.
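The arithmetic behind that claim is simple enough to verify directly. A minimal sketch, using the scores quoted in this post (the `error_reduction` helper is mine, not part of any leaderboard tooling):

```python
# What fraction of the previous model's errors did the new model eliminate?
# Scores are accuracies in percent, as reported on the leaderboards.

def error_reduction(old_acc: float, new_acc: float) -> float:
    """Fraction of the previous model's error rate that was eliminated."""
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return 1.0 - new_err / old_err

arc = error_reduction(92.6, 97.6)    # ARC-Easy: 7.4% error -> 2.4% error
obqa = error_reduction(87.2, 100.0)  # OpenBookQA: 12.8% error -> 0% error

print(f"ARC-Easy: {arc:.1%} of previous errors eliminated")    # 67.6%
print(f"OpenBookQA: {obqa:.1%} of previous errors eliminated") # 100.0%
```

So the ARC-Easy result really is about a two-thirds reduction in errors, and the OpenBookQA result is, taken at face value, a complete elimination of them.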

The second result is far, far more impressive still. The OpenBookQA task is:

“[A] new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject. It consists of 5,957 multiple-choice elementary-level science questions (4,957 train, 500 dev, 500 test), which probe the understanding of a small “book” of 1,326 core science facts and the application of these facts to novel situations. For training, the dataset includes a mapping from each question to the core science fact it was designed to probe. Answering OpenBookQA questions requires additional broad common knowledge, not contained in the book. The questions, by design, are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. **Strong neural baselines achieve around 50% on OpenBookQA, leaving a large gap to the 92% accuracy of crowd-workers.**”

[Bolding not in original]

It scored 100% on the OpenBookQA task. The previous best was 87.2%, and the benchmark for human-level performance by crowd-sourced workers is 91.7%. This test is intended to be an ecologically valid test of the human ability to reason over existing facts and apply them to novel situations, combined with background knowledge and understanding of the world. This program did not merely meet human-level performance; it seemingly exceeded it by a wide margin.

As for the model itself, we know very little about it beyond that it is:

“[An] Ensemble of BERT models with multi-metric Bayesian and genetic-algorithm based optimization.” It is by one “V. Agarwal”, who I suspect, but do not know, is Vidhan Agarwal of Carnegie Mellon University.

Assuming these scores aren’t a mistake (and we can’t rule that out), this represents progress that I, as someone who had been watching these task scoreboards for years, admittedly amateurishly, hadn’t expected for several years. In the case of OpenBookQA, maybe even five to ten years.

You can check out the results here:

I will email Agarwal about his achievement and ask him about his methodology and if there are any plans to apply his model to other tasks. Depending on what he says I might simply update this post, or if there’s anything really interesting make a new one.

If you enjoyed this article, please consider joining our mailing list. Also, a collection of my best writing between 2018 and early 2020 is available as a free e-book, “Something to read in quarantine: Essays 2018-2020”. You can grab it here.
