
Action Ontologies, Computer Ontologies

{{"estimatedReadingTime" | translate:({minutes: qctrl.question.estimateReadingTime()})}}

The following essay is by Jacob Falkovich who writes at Putanumonit.com

A reading of this essay is featured on the Metaculus Journal Podcast here.

The mystery of perception

Out in the universe, there are merely atoms¹ and the void. On the table in front of you, there’s a ripe tomato. Inside your skull is a brain, a collection of neurons that have no direct access to either atoms or tomatoes — only the electrochemical state of some other neurons. And yet your brain is able to perceive a tomato and various qualities of it: red, round, three-dimensional, real.

On the common how-it-seems view of perception, there is no particular mystery to this. In this view, light from the tomato hits your eyes and is decoded “bottom-up” in your brain into simple features such as color, shape, and size, which are then combined into complex perceptions such as “tomato.” This view is intuitively appealing: Whenever we perceive a tomato we find the actual tomato there; thus we believe the tomato to be the sole and sufficient cause of the perception.

A closer look begins to challenge this intuition. You may see a tomato up close or far away, at different angles, partially obscured, in dim light, etc. The perception of it as being red, round, and a few inches across doesn’t change even though the light hitting your retina is completely different in each case: different angles of your visual field, different wavelengths, etc. 

Take color for example. Naively, the perception of color is the detection of wavelengths of light, and yet you perceive the same color from green light (530 nm) as you do from a mix of blue (470 nm) and yellow (570 nm). A white piece of paper will appear white in your perception even though it actually reflects the wavelengths of the light around it: blue under a clear sky, green if held close to grass, orange by candle light. The strawberries in the image below appear red even though there isn’t a single red-hued pixel in it. Wherever the perception of color is coming from, it is certainly not the mere bottom-up decoding of light wavelengths.

Now consider the photograph of the tomato below. The light hitting your retina is exactly the same as would be generated by a physical tomato, and yet your perception is completely different. The tomato in the photograph is red and round, but it’s lacking three-dimensionality and “reality”. You don’t perceive that you can pick it up and eat it. As with color, the percepts of volume and reality are also coming from somewhere other than just the visible image itself.

To add to the mystery, no neurons in your brain are labeled “red” or “tomato” or even “part of the visual system.” The neurons just are, in various electrochemical configurations. There is nowhere for the labels to be encoded except in other neurons. And where would the labels for those be stored then?

Predictive processing

It turns out that perception is driven not by bottom-up feature detection but by top-down prediction of sensory inputs. The brain builds a generative model of the hidden causes of sensory inputs, its idea of the outside world. “Generative” means that it models objects not in all their possible aspects but in how they affect sensory inputs. The color of a surface is a model of visual stimulation caused by that surface under different light conditions. Volume is a model of how an object would appear if you walked around it. Physicality is how it would feel to touch or pick up.

All the features of perception are features of that generative model, not of the external objects themselves. We may infer through indirect means that things in the world are made of elementary particles or exist in states of quantum entanglement, but these features are not perceived directly because they don’t affect sensory inputs.

More generally, the brain (as a collection of neurons) is constantly trying² to anticipate its own state by inferring the hidden causes that affect it. The existence of “sensory inputs” is itself an inference that results from some neurons having their states affected to a large extent by things not in the brain itself—for example neurons connected more directly to a retinal photoreceptor cell or an arm muscle. To predict the state of these neurons the mind starts with modeling itself as having access to senses such as vision and proprioception. It then proceeds to model the outside world as, for example, containing objects that reflect light which are the cause of changes in the sensory inputs of vision.³ 
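
To make this concrete, here is a minimal toy sketch in Python of prediction-error minimization. It is my own illustration rather than any model from the essay or the predictive-processing literature: a single hidden cause is inferred by repeatedly nudging a belief until the generative model's predicted sensory input matches the observed input. The generative function, learning rate, and numbers are all assumptions.

```python
# Toy sketch of prediction-error minimization (illustrative only; the generative
# function, learning rate, and values are assumptions, not the essay's model).

def generative_model(cause):
    """'Top-down' prediction: the sensory input a hypothesized cause should produce.
    Here the mapping is an arbitrary linear one; in a brain it would be learned."""
    return 0.8 * cause

def infer_cause(sensory_input, steps=100, learning_rate=0.1):
    """Infer the hidden cause by nudging a belief to reduce prediction error."""
    belief = 0.0
    for _ in range(steps):
        prediction = generative_model(belief)
        error = sensory_input - prediction       # bottom-up signal: only the mismatch
        belief += learning_rate * 0.8 * error    # gradient step (0.8 = d prediction / d belief)
    return belief

print(infer_cause(0.4))  # converges toward 0.5, the cause that best explains the input
```

The point of the sketch is only the direction of the arrows: the prediction flows top-down from the belief, and only the error flows back up to revise it.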

The fundamentals of predictive processing are not unique to human brains. We can think of brains as a special case of a cybernetic regulation system, one that evolved to regulate the human body (for example, by maintaining its internal temperature at 37°C). According to a fundamental principle of cybernetics, “every good regulator of a system must be [or contain] a [predictive] model of that system.” This can be further generalized into the free energy principle, which applies to all systems (like the brain and body) that maintain an organized state in the face of entropy.
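
A toy thermostat, again my own illustration with made-up constants, shows what it means for a regulator to contain a model of the system it regulates: the controller predicts how fast the body loses heat to its surroundings and supplies exactly the heating that prediction calls for, holding 37°C without waiting for an error to appear.

```python
# Toy thermostat illustrating the good-regulator idea (assumed constants, my own
# illustration): the regulator contains a model of how the body loses heat and
# supplies exactly the heating that model predicts is needed to hold 37 °C.

SETPOINT = 37.0   # target core temperature, °C
K_LOSS = 0.1      # assumed heat-loss rate toward ambient, per time step

def model_based_heating(ambient_temp):
    """Internal model: predict the heat loss at the setpoint and cancel it."""
    return max(K_LOSS * (SETPOINT - ambient_temp), 0.0)

def simulate(ambient_temp=20.0, steps=50):
    temp = SETPOINT
    for _ in range(steps):
        heating = model_based_heating(ambient_temp)
        temp += heating - K_LOSS * (temp - ambient_temp)  # simple body heat dynamics
    return temp

print(simulate())  # stays pinned at ~37.0 because the regulator models the body
```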

These abstractions are not immediately relevant now, so let’s refocus on what’s in our heads. Our brains monitor and control the state of our bodies, bodies that move and act, and must do so to survive. Action is our best way of making predictions come true: shifting our eyes to change our visual inputs, reaching for food we anticipate eating, and ultimately fulfilling the core hardwired prediction of staying alive.

“The world can be validly construed as a forum for action, or as a place of things.”⁴ While science tends to describe it as the latter, our subjective experience is mostly of the former. Our model of the world is concerned mainly with predicting the impact (directly on our sensory input and indirectly on the modeled world) of the actions at our disposal. We perceive a world of action possibilities.

And so our perceived ontology, what seems to us to be the basic objects that make up the world and their properties, is very much contingent on the bodies we find ourselves in. A different mind will perceive the world differently not only if it has different sensors at its disposal, but also if it has different actuators. For a long time human minds were the only ones around, and our ontology went unchallenged. But this is changing.

Prediction in action

How do you describe the trajectory of a ball flying through the air? You may label its location and velocity in three space coordinates, and include the effects of gravity and air resistance to calculate the spot where it is bound to hit the ground.

How do you catch a ball flying through the air? It’s extremely difficult for people to accurately estimate the distance of the ball or its landing spot, let alone to quantify air resistance. But it’s not that hard to catch the ball in practice. Mathematically, to catch a ball while running towards it at constant speed one must set d²(tan α)/dt², the second derivative of the tangent of the angle of gaze to the ball, to zero. Experienced catchers like baseball outfielders do, in fact, run towards the ball in a way that fulfills this equation.⁵
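
As a rough illustration of that strategy (my own sketch, not code from the cited paper), the whole "computation" can be written as a small control loop that never represents the ball's distance or landing spot: it only watches how the tangent of the gaze angle changes and adjusts running speed to keep its second derivative near zero. The time step and gain are assumed values.

```python
# Toy control loop for the gaze-angle strategy (my own sketch, not from the cited
# paper): keep d²(tan α)/dt² near zero by adjusting running speed, without ever
# representing the ball's distance, landing spot, or the force of gravity.

def adjust_speed(tan_alpha_history, speed, dt=0.1, gain=0.5):
    """Update running speed from the last three samples of tan(gaze angle)."""
    if len(tan_alpha_history) < 3:
        return speed  # not enough samples yet to estimate the second derivative
    t0, t1, t2 = tan_alpha_history[-3:]
    accel = (t2 - 2 * t1 + t0) / dt ** 2  # finite-difference estimate of d²(tan α)/dt²
    # tan α accelerating  -> the ball will land behind you: ease off or back up.
    # tan α decelerating  -> the ball will fall in front of you: speed up.
    return speed - gain * accel
```

The only inputs are successive gaze-angle samples; distance, landing point, and air resistance never appear as variables.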

Of course, the experience of running to catch the ball is nothing like the experience of solving a differential equation. The subjective experience doesn’t contain the angle of gaze explicitly, or time, or spatial coordinates. It contains something like catchability, the direct prediction of whether your run will intercept the ball’s flight. When catchability is high you keep your pace and direction of running; when it is low you turn or accelerate. The sense of catchability is similar even for flying objects for which it doesn’t correspond to the inverse of d²(tan α)/dt², like a frisbee. Our perceived ontology contains catchability itself, not trigonometric functions.

Now suppose that our goal was to program a ball-catching robot. What variables and functions would we encode into it? We can try to translate human ontology, but that will prove very difficult. Humans are aware of a ball’s catchability, but not of how their brain forms this impression based on the proprioception of a craning neck and running legs. Catchability is not a property of a flying ball, but of a human running to catch a flying ball.

Alternatively, we can program the robot by solving ball-catching from scratch. We could encode a variable of “distance to ball” derived from the size of the ball’s image in pixels and “flight speed” derived from an electronic clock connected to the camera. A robot programmed in this manner will be solving the problem in a very “inhuman” way. It may perform much better or much worse than humans in various situations (e.g., for a differently sized ball) in ways that would be difficult to predict and interpret. 
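
For contrast, here is a hedged sketch of what that "from scratch" ontology might look like. The pinhole-camera relation it uses is standard, but the focal length, ball diameter, and frame rate are invented values for illustration.

```python
# Sketch of the explicit, "inhuman" ontology a from-scratch robot might use
# (illustrative values only; focal length, ball size, and frame rate are assumed).

FOCAL_LENGTH_PX = 800.0    # assumed camera focal length, in pixels
BALL_DIAMETER_M = 0.074    # assumed real ball diameter (roughly a baseball), in meters
FRAME_INTERVAL_S = 1 / 30  # assumed camera frame rate of 30 fps

def distance_to_ball(apparent_diameter_px):
    """Pinhole-camera model: distance = focal length * real size / apparent size."""
    return FOCAL_LENGTH_PX * BALL_DIAMETER_M / apparent_diameter_px

def approach_speed(prev_diameter_px, curr_diameter_px):
    """Closing speed estimated from the change in apparent size between frames."""
    d_prev = distance_to_ball(prev_diameter_px)
    d_curr = distance_to_ball(curr_diameter_px)
    return (d_prev - d_curr) / FRAME_INTERVAL_S  # positive when the ball is getting closer

print(distance_to_ball(40))    # ~1.48 m when the ball spans 40 pixels
print(approach_speed(40, 44))  # closing at ~4 m/s between two consecutive frames
```

Every quantity here is explicit and quite unlike anything a human fielder experiences while making the catch.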

This strikes me as a central challenge in building AI that achieves human-level performance in the physical world. The most pertinent example is self-driving cars, which are, and have always been, exactly 5 years away.

A busy street scene contains a bewildering amount of information that requires interpretation according to some ontology of things and properties. We can come up with a “naive” ontology that includes objects like traffic lights, pedestrians, and cars, along with features like speed and distance. A lot of the current effort seems aimed at teaching AI this ontology, with millions of people each day clicking on squares that contain pedestrians in CAPTCHAs. 

But when humans drive we’re aware of things like whether the moods of the drivers around us are impatient or relaxed, the “passability” of the car in front of us, the “intention” of the object in the corner of our eye to jump onto the road—whether that be a person, deer, or plastic bag blowing in the wind. Intentions in particular are a key component of human ontology, even though they do not “exist” in the physical world. When an acquaintance offers their right hand, you perceive their intention to shake yours as clearly and confidently as the hand itself. We also predict automatically the consequences of our own actions, such as accepting or refusing the handshake.

Reading intentions is key to driving on busy city streets, where drivers, bikers, and pedestrians focus more on reading each other’s minds than on following the letter of the traffic law. Tesla’s Autopilot, which performs well on highways, struggles mightily in the city. It has no problem identifying cars and pedestrians, but it doesn’t anticipate their behavior or their reaction to itself (such as honking at it a lot).

Our action-oriented ontology also performs really well in novel situations since it doesn’t need to fully interpret a scene to decide on the likeliest appropriate reaction. Imagine walking down the street at night when out of the corner of your eye you catch a dark fuzzy limb moving in a tree. Is it a cougar? A gorilla? A burglar? A Halloween decoration? An animal or object you’ve never seen before and couldn’t recognize? Before you determine the source of the movement or even become consciously aware of it, you will do the following things: focus your eyes on the tree, slow your walk, stop talking, increase heart rate and blood pressure, tense your muscles, and secrete adrenaline and cortisol in preparation for possible fight or flight.

Performance in novel and unexpected situations is the largest challenge facing autonomous driving AI. Regulators and consumers will not accept self-driving cars that work 99% of the time; they will demand 99.9999%. Reacting well in one-in-a-million situations means reacting well to something outside the bounds of the training data.

Again, there are two approaches to building self-driving AI. We can try to teach it human driving ontology, which also happens to be the one that all our roads, vehicles, signs, etc. are built to cater to. But this may prove impossible. A brain with a camera doesn’t see the world the way a brain with eyes does if it’s not also connected to a heart and limbs as ours is.

The other approach is to program a new ontology of driving into the car’s AI, realizing that it could end up quite different from how the road appears to humans. This may work, but it will be very difficult for us to predict how soon and how well it will. After all, there is no mind, human or AI, that is currently using that ontology to drive on highways and in cities, on familiar roads and in unexpected situations.

Our “tomato” is not the AI’s tomato, and its driving will not be like ours. With that in mind, below is my own forecast on the question of when totally hands-off, go-anywhere-at-the-push-of-a-button autonomous vehicles will be available:


¹ Or quarks or quantum fields etc.

² "Trying" and similar teleological language is to be understood as a description of what a brain does from the outside, as a system which constantly reconfigures itself to minimize prediction error as much as possible. This doesn't imply that one has a subjective experience of trying to do so.

³ This paradigm is usually called “predictive processing” or “predictive coding”. For an introduction, I recommend Scott Alexander’s review of the book “Surfing Uncertainty” by Andy Clark or Anil Seth’s TED talk and book “Being You”.

⁴ Peterson, J. B. (2002). Maps of meaning: The architecture of belief. Routledge.

⁵ McLeod, P., & Dienes, Z. (1996). Do fielders know where to go to catch the ball or only how to get there?. Journal of experimental psychology: human perception and performance22(3), 531.
