Recently, Hendrycks et al. proposed a new test to measure a text model's multitask accuracy. The test covers 57 tasks, including elementary and college-level mathematics, computer science, law, accounting, and more. For each task, the model is provided only 5 training examples. The test set consists of around 5,000 to 10,000 questions, with 100 to 200 questions per task.
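The 5-example setup above is a few-shot prompting scheme: the training examples are concatenated in front of the test question and the model completes the answer. A minimal sketch of how such a prompt might be assembled is below; the exact template (header wording, `A.`–`D.` labels, the `Answer:` cue) is an assumption for illustration, not necessarily the paper's format.

```python
def format_example(question, choices, answer=None):
    """Render one multiple-choice question; omit the answer for the test item."""
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    # Leave a trailing "Answer:" cue for the model to complete.
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, train_examples, test_question, test_choices):
    """Concatenate the few-shot examples and the unanswered test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(format_example(q, c, a) for q, c, a in train_examples)
    return header + shots + "\n\n" + format_example(test_question, test_choices)

# Hypothetical one-shot usage (the benchmark would pass 5 examples):
prompt = build_prompt(
    "elementary mathematics",
    [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],
    "What is 3 + 3?", ["5", "6", "7", "8"],
)
print(prompt)
```

The model's predicted answer is whichever option letter it assigns the highest likelihood to after the final `Answer:` cue.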
The test differs from benchmarks like SuperGLUE in that it intentionally includes questions requiring specialized expertise in narrow fields of knowledge; many tasks are difficult even for an average human. See the example questions below.
They found that the largest GPT-3 model achieves 43.9% accuracy against a 25% random-guessing baseline, while UnifiedQA, an 11B-parameter model fine-tuned on other QA tasks, achieves 48.9%. Models also score near chance on some socially important subjects such as morality and law.
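The 25% baseline follows from the four-option multiple-choice format: a uniform random guesser is expected to be right 1 in 4 times. A small sketch, using hypothetical toy data rather than the actual test set, shows how accuracy is computed and why guessing hovers around that level:

```python
import random

# Hypothetical gold answers for ten four-option questions (not real test data).
gold = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]

def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# A random guesser picks uniformly among A-D, so its expected
# accuracy is 1/4 = 25% -- the chance-level baseline quoted above.
random.seed(0)
random_preds = [random.choice("ABCD") for _ in gold]
print(f"random-guess accuracy: {accuracy(random_preds, gold):.2f}")
```

A model is only credited with knowledge of a subject insofar as it exceeds that chance level, which is why near-25% scores on morality and law are read as near-random performance.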
The question asks: