On May 31st, 2022, prominent deep learning skeptic and NYU professor emeritus Gary Marcus challenged Elon Musk to a bet on AGI by the end of 2029. His proposed bet consists of five AI achievements, of which he predicted no more than two would come to pass before 2030. This question is about Marcus' second prediction:
In 2029, AI will not be able to read a novel and reliably answer questions about plot, character, conflicts, motivations, etc. Key will be going beyond the literal text, as Davis and I explain in Rebooting AI.
For this challenge, we will use the NarrativeQA dataset as an illustrative example of a benchmark that could trigger positive resolution. Its authors describe it as follows:
To encourage progress on deeper comprehension of language, we present a new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts. These tasks are designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.
Will AI be able to read a novel and reliably answer questions about it before 2030?
This question will resolve positively if, before January 1st, 2030, a computer program is publicly and credibly documented to have achieved either at least 90.0% accuracy or a score above the human baseline on a benchmark comparable to the NarrativeQA dataset, when it is required to read the full books to answer the questions (as opposed to plot summaries or other spoilers). Any candidate benchmark must provide difficult questions that test deep reading comprehension, including questions of how and why, rather than mere shallow pattern matching.
The human baseline BLEU-4 score for NarrativeQA was obtained by giving humans summaries of the books, and then asking them the same questions that are asked of the computer (which is not given any summary). This human-baseline BLEU-4 score, reported with the full-story results in Table 6 of the paper, is 19.65. The human-baseline ROUGE-L score is 57.02, which is far better than some of the results achieved by machine learning models. For example, Mou et al. (2021) obtain a ROUGE-L score of just 29.21 in the full-story setting.
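To make the ROUGE-L figures above more concrete, here is a minimal sketch of how such a score can be computed for one answer against one reference. It rests on several assumptions that are not taken from the NarrativeQA paper: lowercase whitespace tokenization, a single reference answer, and a plain F1 combination of precision and recall (published evaluation scripts often weight recall more heavily and report the result multiplied by 100). The function names are illustrative only, and this is not the scoring code used to produce the numbers quoted here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the common subsequence ending just before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Carry over the best result from skipping one token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as an F1 score in [0, 1] (assumption: beta = 1)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A made-up answer pair, purely for illustration.
    print(rouge_l_f1("he wanted revenge on his brother",
                     "he wanted revenge"))
```

Because the score rewards word overlap with a reference answer, a correct answer phrased differently can score well below 1, which is one reason the human baseline itself is far from the maximum.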
Importantly, any candidate computer program must not have been given access during its training to media that could reasonably have been expected to spoil the plot of any of these books (for example, the Wikipedia pages for these books). The AI is allowed to be trained on other media. This restriction is merely intended to eliminate cheating, not to require any additional capabilities beyond what Gary Marcus specified.
A simple way to prove that a candidate computer program did not cheat is to show that all the data the AI was trained on was generated before the novels were published. However, this is not the only way of proving that cheating did not occur.
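The date-based argument above can be sketched as a simple check. This is only an illustration under stated assumptions: the function name and its inputs are hypothetical, it presumes that a reliable creation date is known for every training document, and it expresses a sufficient condition rather than a necessary one, since other ways of demonstrating the absence of spoilers remain acceptable.

```python
from datetime import date


def trained_before_publication(training_dates, publication_dates):
    """Return True if every training document predates every novel.

    Both arguments are iterables of datetime.date (hypothetical inputs).
    """
    publication_dates = list(publication_dates)
    if not publication_dates:
        raise ValueError("need at least one novel publication date")
    earliest_publication = min(publication_dates)
    # Every training document must be strictly older than the earliest novel.
    return all(d < earliest_publication for d in training_dates)


if __name__ == "__main__":
    # Made-up dates, purely for illustration.
    training = [date(2026, 3, 1), date(2027, 11, 15)]
    novels = [date(2028, 5, 2), date(2029, 1, 20)]
    print(trained_before_publication(training, novels))
```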
Metaculus admins will use their discretion in determining whether a candidate computer program met these criteria.