Recent natural language processing (NLP) models have succeeded in generating human-level text and translations. However, questions remain about the extent to which this success relies on genuine understanding rather than memorization of statistical patterns.
A recent paper showed that when statistical cues are removed, state-of-the-art NLP models fail on argument reasoning tasks, while human performance remains unaffected: untrained humans achieve roughly 80% accuracy on this argument reasoning task, whereas recent NLP models perform near 50%.