The Turing test is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation was a machine, and all participants would be separated from one another.
Alan Turing's original paper was, unfortunately, quite vague about the setup of the test. Since he devised the imitation game, a number of real-life Turing tests have been conducted under various conditions. One such example is the now-defunct Loebner Prize, which held a yearly competition.
When will an AI first pass a long, informed, adversarial Turing test?
This question refers to a high-quality subset of possible Turing tests that will, in theory, be extremely difficult for any AI to pass unless it possesses extensive knowledge of the world, mastery of natural language, common sense, a high level of skill at deception, and the ability to reason at least as well as humans do. Thus, Metaculus administrators should be careful not to resolve this question on the first date on which any Turing test, regardless of quality, is passed.
The following definitions are provided:
- A Turing test is any trial during which an AI system is instructed to pretend to be a human participant while communicating with judges who are instructed to discriminate between the AI and the human confederates in the trial. The trial may take any format and may involve communication across a wide variety of media. However, it is essential that communication through natural language is permitted during the trial.
- A Turing test is said to be "long" if the AI communicates with the judges for a period of at least two consecutive hours.
- A Turing test is said to be "informed" if all of the human judges possess a PhD-level understanding of contemporary AI and the ways in which contemporary AI systems fail, and each of the human confederates possesses a PhD-level understanding of at least one STEM field. Human judges and confederates without a relevant PhD are still eligible provided that Metaculus admins believe, in their sole discretion, that their expertise meets the threshold stated here; this can be determined by polling credible observers who know the participants.
- A Turing test is said to be "adversarial" if the human judges make a good-faith attempt, to the best of their abilities, to unmask the AI as an impostor among the participants, and the human confederates make a good-faith attempt, to the best of their abilities, to demonstrate that they are human. In other words, all of the human participants should be trying to ensure that the AI does not pass the test.
- An AI is said to "pass" a Turing test if at least one half of the judges rate the AI as more human than at least one third of the human confederates. This condition could be met in many different ways, so long as the judges' final determination explicitly or implicitly yields a rating of how "human" the AI acted during the trial. Explicit final ratings will be used for resolution, if available, even if implicit ratings exist and contradict the explicit final rating. If there is no explicit final rating and the implicit ratings contradict each other, the Turing test will be ineligible for resolution, as there is no unambiguous way to trigger resolution. Each participant's rating will be taken to be the median, not the mean, of the judges' ratings for that participant. See the fine print for a sketched list of possible Turing test outcomes and whether they would qualify the AI to "pass" under these criteria; a sketch of the rating computation appears immediately after these definitions. All remaining ambiguities will be resolved by a consensus among Metaculus admins, in consultation with the community.
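For concreteness, here is a minimal Python sketch of the median-based reading of the pass criterion, the reading applied in the worked examples in the fine print below. It is illustrative only: the function name is invented here, and rounding "at least one third" up to a whole number of confederates is an assumption, not part of the official criteria.

```python
import math
from statistics import median

def passes_turing_test(ai_ratings, confederate_ratings):
    """Check the pass criterion under the median-based reading.

    ai_ratings: one "how human" rating of the AI from each judge.
    confederate_ratings: one list of per-judge ratings for each
    human confederate.
    """
    # Each participant's rating is the median of the judges' ratings.
    ai_median = median(ai_ratings)
    confederate_medians = [median(r) for r in confederate_ratings]
    # "At least one third of the confederates", rounded up to a whole
    # number of confederates (an assumption made for this sketch).
    needed = math.ceil(len(confederate_medians) / 3)
    # The AI must be rated strictly more human than that many confederates.
    beaten = sum(1 for m in confederate_medians if ai_median > m)
    return beaten >= needed
```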
This question will resolve as the first date of publication of a credible document stating that a long, informed, adversarial Turing test was passed by some AI, so long as the test was well designed and satisfied the criteria written here, according to the best judgment of Metaculus admins. Metaculus admins will also attempt to exclude tests that involved cheating, conflicts of interest, or rogue participants who did not follow the rules. All human judges and confederates should understand that their role is strictly to ensure the loss of the AI; they collectively "fail" if the AI "passes".
Here is a non-comprehensive set of example Turing test outcomes, and whether the AI "passed" under the criteria in this question.
- There are 5 human confederates, 1 AI, and 3 judges. Each judge is asked to rate each participant from 0 to 100 on how "human" they appeared during the test, with 0 being "not human at all" and 100 being "very human". Each participant's rating is taken to be the median of the 3 judges' ratings. The confederates' median ratings were 78, 85, 87, 88, and 92, and the AI's median rating was 85. The AI failed, but would have passed if its median rating had been 86 or higher.
- There are 3 human confederates, 1 AI, and 1 judge. The judge is asked to pick which of the 4 participants they believe is most likely to be the AI. There are no explicit final ratings of how human each participant appeared during the trial. The judge picks a human confederate as their single best guess. The AI passed: the pick implicitly rates the AI as more human than 1 of the 3 confederates, which is at least one third.
- There are 5 human confederates, 1 AI, and 3 judges. The judges are asked to pick which of the 6 participants they believe is most likely to be the AI. All 3 judges pick a human confederate. The result is indeterminate and cannot be used to trigger a positive resolution, because it is unclear whether the AI performed better than at least one third of the confederates.
- There are 5 human confederates, 1 AI, and 3 judges. The judges are asked to rank each participant from least likely to be human to most likely to be human. 2 judges rank the AI above 1 human confederate and below 4 human confederates, and the third judge ranks the AI above 4 human confederates and below 1 human confederate. The AI failed: two of the three judges rated the AI as more human than only 1 of the 5 confederates, so at least half of the judges did not place it above one third of the confederates.
- There are 2 human confederates, 1 AI, and 1 judge. The judge is asked to rank the participants in order from most to least likely to be the AI. The judge ranks the AI as most likely to be the AI but, in an offhand comment, says they considered the AI to be "the most human". The AI failed despite the offhand comment, because the explicit final ranking takes precedence over implicit ratings.
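Under the same assumptions, the sketch from the criteria section reproduces the threshold in the first worked example: a median rating of 85 exceeds only one of the five confederate medians, while 86 exceeds two, which meets the one-third requirement.

```python
# Confederate medians from the first example, supplied as
# single-judge rating lists so the medians pass through unchanged.
confederates = [[78], [85], [87], [88], [92]]
assert not passes_turing_test([85], confederates)  # beats 1 of 5: fail
assert passes_turing_test([86], confederates)      # beats 2 of 5 >= ceil(5/3): pass
```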