Recent natural language processing (NLP) models have succeeded in generating human-level text and translations. However, questions remain about the extent to which this success relies on understanding, as opposed to memorization of statistical patterns.
A recent paper showed that when statistical cues are removed, state-of-the-art NLP models fail on argument reasoning tasks, while human performance remains unaffected. Untrained humans achieve roughly 80% accuracy on this argument reasoning task, whereas recent NLP models perform near 50%.
When will a machine learning model outperform the human level of 80% accuracy on this benchmark? This question resolves when either:
- A paper posted on arxiv.org claims greater than 80% accuracy on the Niven and Kao benchmark.
- A paper posted on arxiv.org claims greater than 80% accuracy on a successor* dataset to the Niven and Kao data.
*A successor dataset will count towards this resolution criterion if it satisfies all of the following:
- Published as an arxiv.org pre-print intended to quantify argument and/or reasoning
- Cites Niven and Kao
- Pre-2020 NLP models show random-level performance on the dataset (<=60% accuracy for a binary task, or <=100*(1/n + 1/(5n))% accuracy for an n-ary task, i.e. chance accuracy plus a 20% relative margin; see the sketch after these criteria)
If the successor dataset includes information on human-level performance, that threshold will be used instead of the 80% accuracy threshold.
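For concreteness, the random-level cutoff in the third criterion is just chance accuracy for an n-ary task plus a 20% relative margin. Below is a minimal sketch of that arithmetic; it is illustrative only, not part of the resolution criteria, and the function name is hypothetical.

```python
# Sketch of the random-level threshold described above:
# chance accuracy for an n-ary task plus a 20% relative margin,
# i.e. 100 * (1/n + 1/(5n)) percent.
def random_level_threshold(n: int) -> float:
    """Maximum accuracy (in %) still counted as random-level for an n-ary task."""
    chance = 100.0 / n                 # chance accuracy in percent
    return chance + chance / 5.0       # add a 20% relative margin

print(random_level_threshold(2))  # 60.0, matches the <=60% binary threshold
print(random_level_threshold(3))  # 40.0 for a three-way task
```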