Created by: jacob.pfau
AI Technical Benchmarks

Human infant learning integrates information across senses (sight, sound, touch, etc.), but current state-of-the-art machine learning models usually train on only one type of input. It remains to be seen whether integrating data across modalities is necessary for achieving human-level intelligence.

In contemporary machine learning (ML) research, we are mostly interested in image, text, graph, and video data. State-of-the-art models in each of these domains train only on inputs from that specific domain; let's call this uni-modal training. By extension, if a model trains on two or more of these input types but is evaluated on only one, we'll call that multi-modal training with uni-modal evaluation. For the purposes of this question, we are interested only in uni-modal evaluation tasks, so robotics and driving benchmarks are out of scope.