From the abstract of Dan Hendrycks et al.:
While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. [...] Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.
This question will resolve according to rather stringent conditions. It will use strict accuracy on the competition-level coding problems, which "requires programs pass every test case" (as defined in section 4.2 of the paper), and it will require that the model be given only one try per problem. For reference, the best model evaluated, GPT-Neo 2.7B, achieved a strict accuracy of 3.9% on introductory problems, 0.57% on interview problems, and 0.0% on competition problems.
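To make the resolution metric concrete, here is a minimal sketch of how strict accuracy could be computed: a problem counts as solved only if the model's single generated program passes every one of its test cases. The data structures and the exact-match comparison below are illustrative assumptions, not the APPS evaluation harness.

```python
def passes(actual, expected):
    # Assumed comparison: exact match after stripping trailing whitespace.
    return actual.strip() == expected.strip()

def strict_accuracy(results):
    """results: list of problems; results[i] is a list of
    (actual_output, expected_output) pairs, one per test case,
    produced by the model's single attempt at problem i."""
    solved = sum(
        1 for cases in results
        if all(passes(actual, expected) for actual, expected in cases)
    )
    return solved / len(results) if results else 0.0

# Hypothetical example: problem 0 passes both test cases,
# problem 1 fails its second test case.
results = [
    [("4\n", "4"), ("9", "9")],
    [("1", "1"), ("2", "3")],
]
print(strict_accuracy(results))  # 0.5
```

Under this metric, partial credit on test cases contributes nothing, which is why strict accuracies are so much lower than the per-test-case pass rates quoted in the abstract.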