Created by: jacob.pfau and co-authors
AI Technical Benchmarks


Human infants learn by integrating information across senses (sight, sound, touch, etc.), but current state-of-the-art machine learning models usually train on only one of these modalities. It remains to be seen whether integrating data across modalities is necessary for achieving human-level intelligence.

In contemporary machine learning (ML) research, the modalities of greatest interest are image, text, graph, and video data. State-of-the-art models in each of these domains train only on inputs from that specific domain; call this uni-modal training. By extension, if a model trains on two or more of these input types while being evaluated on only one, we call that multi-modal training with uni-modal evaluation. For the purposes of this question, only uni-modal evaluation tasks count, so robotics and driving benchmarks are out of scope.
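To make the distinction concrete, here is a minimal toy sketch of multi-modal training with uni-modal evaluation. Everything in it is an illustrative assumption (the synthetic "image" and "text" features, the random linear encoders, the least-squares head); it is not drawn from any particular benchmark, only meant to show the train/eval asymmetry: both modalities are fused during training, but only one is available at evaluation time.

```python
import numpy as np

# Toy illustration (assumed setup, not any real benchmark):
# train on two paired modalities, evaluate on one.
rng = np.random.default_rng(0)

# Synthetic paired data: each sample has an "image" view and a "text" view
# derived from the same underlying binary label.
n, d_img, d_txt, d_shared = 200, 8, 5, 4
labels = rng.integers(0, 2, size=n)
images = rng.normal(size=(n, d_img)) + labels[:, None]
texts = rng.normal(size=(n, d_txt)) + labels[:, None]

# Multi-modal training: random linear "encoders" map each modality into a
# shared space, and a single head is fit on the fused representation.
W_img = rng.normal(size=(d_img, d_shared))
W_txt = rng.normal(size=(d_txt, d_shared))
z_train = images @ W_img + texts @ W_txt  # fused representation

# Least-squares stand-in for training a classifier head.
w, *_ = np.linalg.lstsq(z_train, labels.astype(float), rcond=None)

# Uni-modal evaluation: at test time only the image view is available,
# so the text encoder's contribution is simply absent.
z_eval = images @ W_img
preds = (z_eval @ w > 0.5).astype(int)
accuracy = (preds == labels).mean()
print(f"uni-modal eval accuracy: {accuracy:.2f}")
```

The point of the sketch is structural, not quantitative: the fused representation `z_train` exists only during training, while the evaluation pathway consumes a single modality, which is exactly the regime this question is about.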