Few Sample Statistics and Forecasting GPT-5

Metaculus Journal

datscilly is a Metaculus Pro Forecaster. He was ranked #1 in baseline accuracy on the Metaculus leaderboard for 2018, 2019, 2020, 2018-2019, and 2020-2021, and was ranked #1 in peer accuracy for 2020. He was also one of the top five forecasters in Scott Alexander's ACX 2023 Forecasting Contest.

When reviewing my performance in the previous and current Quarterly Cup tournaments, I found that I performed well on two quantitative questions with few data points, a situation that I call “few sample statistics.” I'm now applying the same method to predict when GPT-5 will be announced. Along the way I came up with a general recipe for situations with few data points, and I'm sharing it here.

In the Q4 2023 Quarterly Cup, I placed well mainly because of my performance on one question: Dutch effective number of parties. I predicted a wide distribution that covered the outcome, whereas most people predicted too narrowly.

I’ll set up the question as I framed it at the time. There would be an election for parliament seats in a European country, and the forecast variable was something called the effective number of parties. This measure was new to me, but looking through the Wikipedia page, it seemed to match what I would expect the quantity to measure in simple cases (rows A, C, E in the Wikipedia table). The most recent value of the forecast variable was 8.54, for the 2021 election. Going back in time, the two values before that were 8.12 and 5.70. The question background info also told us that the average number of parties in this country over the last 100 years was 5.3. I could not find any other important information, so I moved on to the next step.

We can distill the question down to its essence, the deceptively simple summary: …, 5.70, 8.12, 8.54, ? (domain: politics, sub-domain: number of parties). When penciling in my initial prediction during the blind period, the first thing I noticed was that the scale is set such that a wide distribution is needed on the UI. Since the prediction variable rose from 5.7 to 8.12 (~+2.4) in one period, I felt that we should not rule out a decrease of a similar magnitude. I thought outcomes of around 8.1 to 8.5 again were most likely. On the upper end, I thought the number of parties could increase, but not by much more. For example, anchoring on the latest data point ~8.5 and calculating +/- 2.4, I thought 6.1 was more likely than 10.9.

This is the basic intuitive method for numerical predictions, where you guess the median, first quartile, and third quartile directly with your judgment. I will later contrast this method with the “base distribution method”, named in analogy to calculating base rates for binary questions, where low, central, and high estimates are produced from an explicit calculation.

In both methods, the numerical data is combined with domain knowledge or question specific info to form a prediction. For example, and to finish up the number of parties prediction, the domain knowledge I used for the question is that the Dutch effective number of parties at 8.5 is high compared to other countries. I also thought about how variable I expect this measure of number of parties to be, and it should have a high variance, as I imagine the 5th to 10th most popular parties can gain or lose a large number of seats. In my final prediction I went with a main distribution at 7.8 - 8.8, since the two latest values are close together, but I also had a second minor distribution that covers outcomes 5.5 - 7.5, and makes the prediction wider.

For another question, Taylor Swift Super Bowl Airtime?, I also predicted intuitively. In the same way as in the previous question, we can push domain specific info into the background and summarize the question as 14, 12, 76, 24, 44, ? (domain: entertainment, sub-domain: celebrity screen time). I thought that the first two numbers looked different from, and lower than, the next three. We should have a main component 30 - 70, a minor component 6 - 24, and some chance of 0. My initial prediction had a median of around 36 and was a bit wider, but on seeing the community prediction and comments that supported a mode of 50 I thought that was better, and went with a median of 44 in my final prediction. For both the first and updated predictions I anchored on the latest value of 44. At first, I thought that it would be lower than 44, since sports fans were already complaining about her airtime, and the past numbers were biased by a Kansas City Chiefs win in each game. But later I thought about the incentives of the NFL, and the one-way relationship between Taylor Swift and the NFL. The phenomenon is free advertising for the NFL, and the broadcasters are likely to show Taylor Swift as often as is reasonable, so I predicted more narrowly around 30 to 66.

When predicting using one’s intuition, there is some amount of inconsistency, where sometimes I would flip-flop between what I think is more likely, or think differently between one day and the next, like in the Taylor Swift question. That is one reason I wanted to explore turning the intuitive method into explicit calculations. Another reason arose during the preparation for writing this post. I wanted to apply this method to a new question, to make an actual prediction in this article, and decided on the GPT-5 timing question. For that question, turning the intuitive method into a formal one is good, because it allows me to better justify my prediction in writing. Also, I’m confused about the domain specific considerations for the question, so I’m replacing judgements of that sort with a numerical method that has been tuned on other questions.

To take our intuition of what the modal estimate should be, and turn it into a formal method, we use a weighted average of our data points. The calculation I have in mind is an exponential recency-weighted average, like an exponential moving average except it is not moving. More generally, we can consider weighted averages with freely chosen weights for each data point, but the exponential average is simpler, with only one parameter that we need to set using our judgment. In choosing the calculation method, a recency-weighted average makes sense and lines up with what we were intuitively doing before, since we were looking at the most recent values as anchors, and focusing on the most recent value in particular.

Let’s use the Dutch election numbers as an example of how to calculate this average. For the question we decided to predict based on the three most recent data points, 5.70, 8.12, 8.54. We decided that the 5.7 value is different, or outdated, so it should have a low weight. A good choice for the exponential decay weight is then 0.5, which corresponds to looking at a horizon of 2 effective data points. (An exponential weight of 0.8, for example, would correspond to a horizon of 5. And in detail for math intuition, a horizon of n means the n most recent data points have a weight above 1/e ~= 0.37, and the rest of the data points have weights below 1/e. Also, the weights for all the data points add up to n in the infinite limit.) With an exponential decay of 0.5, the 5.7 data point has a weight of 0.25, 8.12 has a weight of 0.5, and 8.54 has a weight of 1. The weighted average is then: (5.7 * 0.25 + 8.12 * 0.5 + 8.54 * 1) / (0.25 + 0.5 + 1) ≈ 8.01.
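The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original analysis; the function names `recency_weighted_average` and `effective_horizon` are made up for this example, and the data points are assumed to be listed oldest first.

```python
def recency_weighted_average(values, decay):
    """Exponential recency-weighted average; `values` are listed oldest first."""
    n = len(values)
    # The most recent point gets weight 1, the one before it `decay`,
    # the one before that decay**2, and so on.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def effective_horizon(decay):
    """Sum of the weights 1 + decay + decay**2 + ... in the infinite limit."""
    return 1 / (1 - decay)


dutch = [5.70, 8.12, 8.54]
central = recency_weighted_average(dutch, decay=0.5)
print(round(central, 2))        # 8.01
print(effective_horizon(0.5))   # 2.0
```

The only judgment call is the decay parameter: moving it toward 1 spreads the weight over more of the history, and moving it toward 0 makes the estimate follow the latest value.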

In the past, I would have taken this central estimate, and computed some measure of dispersion like the sample variance to estimate the 25th and 75th percentiles. But I realized that it’s better to calculate the percentiles from the data points directly, by arranging them in increasing order. This simple method is known by the fancy name “estimating quantiles from the order statistics” where e.g. the 1st, 2nd, … order statistics are the smallest data point, 2nd smallest data point, and so on. For a simple example, the data points [ 5.70, 8.12, 8.54 ] arranged in increasing order is again [ 5.70, 8.12, 8.54 ], and we can then take 5.70 as the 25th percentile, 8.12 as the 50th percentile, and 8.54 as the 75th percentile. For another example, with 9 data points they would be mapped to the 10th, 20th, … 90th percentiles in the appropriate order. To calculate percentiles other than ones that land on those anchor points, we can linearly interpolate between the two nearest anchors. One advantage of this method is that it handles skewed distributions better than one that calculates the sample mean and variance.
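As a rough sketch of this equal-weight version, the snippet below places the k-th smallest of n points at the k/(n+1) quantile and interpolates linearly between neighbouring anchors. The function name `evenly_spaced_quantile` is hypothetical, and clamping to the smallest or largest data point outside the outermost anchors is an assumption of this sketch, since the text does not say what to do there.

```python
def evenly_spaced_quantile(sample, p):
    """Estimate quantile p (0..1) with the k-th smallest point anchored at k/(n+1)."""
    xs = sorted(sample)
    n = len(xs)
    anchors = [(k + 1) / (n + 1) for k in range(n)]
    # Assumption: outside the outermost anchors, clamp to the extreme data points.
    if p <= anchors[0]:
        return xs[0]
    if p >= anchors[-1]:
        return xs[-1]
    # Linear interpolation between the two nearest anchors.
    for i in range(1, n):
        if p <= anchors[i]:
            t = (p - anchors[i - 1]) / (anchors[i] - anchors[i - 1])
            return xs[i - 1] + t * (xs[i] - xs[i - 1])


dutch = [5.70, 8.12, 8.54]
print(round(evenly_spaced_quantile(dutch, 0.50), 2))   # 8.12
print(round(evenly_spaced_quantile(dutch, 0.375), 2))  # 6.91
```

A percentile that falls between two anchors, such as the 37.5th here, lands between the corresponding data points without any assumption about the shape of the distribution.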

That example works when all the points have an equal weight of 1, and we would like to adapt it to the case where the data points have unequal weights. Using the Dutch election example again, if the points [ 5.70, 8.12, 8.54 ] have weights of [ 0.25, 0.5, 1 ], they should no longer be evenly spaced; yet, it is not obvious what percentile the 5.7 point with 0.25 weight should map to, and this is a first issue.

It turns out, from the Wikipedia article on quantiles (section “Estimating quantiles from a sample”), that there are multiple methods of calculating percentiles (more generally: quantiles) from a sample, not just the most intuitive one that evenly spaces out the samples from the 0th percentile to the 100th percentile. We would like to understand the different formulas for calculating the quantiles, in order to choose the best one, and this is the second problem.

Both issues can be accounted for with two insights: the points should be allotted an interval proportional to their weight, and extra buffer spaces beyond the leftmost point and the rightmost point are needed. For instance, the simple example with 9 data points should not be thought of as 10 intervals [0-10, 10-20, … 90-100], but as 9 intervals plus two half sized intervals at either end: [0-5, 5-15, 15-25, … 85-95, 95-100], with the data points being mapped to the midpoint of the interval they occupy. This scheme generalizes to a set of data points with unequal weights. Also, methods R-5 to R-9 on the Wikipedia page correspond to this scheme with various sizes for the endpoint buffer spaces, which shows that the size of the buffer spaces can be freely chosen, but should be equal for symmetry.

The Wikipedia page cites a paper that recommends method R-8. This method gives quantile estimates that are approximately median-unbiased, and this holds independent of the kind of distribution that the sample was drawn from. From statistics class we are more familiar with the concept of an unbiased estimator, like when estimating the population variance from a sample, the formula for the variance of a sample has a final division by n - 1 instead of n. This is actually the concept of being mean-unbiased, and there are other ways of being unbiased such as median-unbiased. An advantage of the median-unbiased criterion is that it does not depend on the shape of the distribution that you’re sampling from, unlike mean-unbiased, which can depend on the distribution. For example, unlike in the case for the variance, there is no general mean-unbiased estimator for the standard deviation.

For a sample with 2 data points, method R-8 maps them to the 2/7 and 5/7 quantiles (~28.6th and 71.4th percentiles). For 3 points it maps them to the 20th, 50th, and 80th percentiles. In general, for exponential weights of x^(n-1), x^(n-2), …, x, 1 which are unnormalized (we haven’t divided them by 1 + x + ... + x^(n-1)), method R-8 corresponds to adding buffer spaces of 1/6 length to both ends. Then we normalize by (1 + x + ... + x^(n-1) + 1/3).
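The midpoint-and-buffer scheme can be written out as a short Python sketch, as one reading of the description above rather than the spreadsheet actually used. The function name `weighted_positions` is hypothetical, and the weights are assumed to be listed in the same order as the data points sorted by increasing value.

```python
def weighted_positions(weights, buffer=1/6):
    """Quantile position (0..1) of each point under the midpoint-and-buffer scheme.

    `weights` must be ordered like the data points sorted by increasing value.
    buffer=1/6 reproduces method R-8 when all weights equal 1.
    """
    total = sum(weights) + 2 * buffer   # e.g. n + 1/3 for n unit weights
    positions = []
    left = buffer                       # left edge of the current point's interval
    for w in weights:
        positions.append((left + w / 2) / total)  # midpoint of the interval
        left += w
    return positions


print([round(p, 3) for p in weighted_positions([1, 1])])     # [0.286, 0.714]
print([round(p, 3) for p in weighted_positions([1, 1, 1])])  # [0.2, 0.5, 0.8]
```

With unit weights and `buffer=0.5`, the same function gives the evenly spaced 10th, 20th, … 90th percentiles for 9 points, which is the half-sized-interval example described earlier.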

The formulas given in the Wikipedia page work for data points of weight 1, but I found it hard to understand the formulas by looking at them directly. Instead, it was easier to calculate using the midpoint and buffer space scheme in a spreadsheet, and to find the right buffer space size through the outputs of R-8.


Let us apply this few data points method to the question of GPT-5 timing. From the release of GPT-1 to GPT-4, there are three time intervals between releases, and the question can be framed as predicting the fourth time interval: 8.15, 15.42, 33.53, ? (Domain: technology, sub-domain: software releases) (measured in months).

First, we should make a judgment call whether the time series has an increasing trend, or whether it is stationary. If there is an increasing trend, then using a weighted average method no longer makes sense, and we will likely use the intuitive method instead while taking some linear extrapolation as an anchor. Unlike the Dutch elections example, at first glance this one seems to have a clear increasing trend. Also, unlike that example, the time between software releases has no upper bound; the number of parties seems to have a soft, sociology induced upper bound. But we can almost ignore the domains and see this difference from the data points alone. Getting back to the topic at hand, I decide that the GPT time series is stationary because the community prediction is confident that the time between GPT-4 and GPT-5 is around 19 months. It does make sense for it to be stationary, since software releases tend to happen regularly, and it's not clear to me whether the causal process for GPT timings specifically is stationary or increases over time. As an aside, the causal process that emits outcomes for large projects is a complex one, which is an argument for predicting based on the durations for similar past projects (the outside view) as opposed to information specific to the current project (the inside view). Backing up, I've chosen a stationary series over an increasing trend, but without high confidence in this judgment. For the final prediction we should therefore make sure to put some probability on outcomes of 33.5 and 45 months, for example.

We can now apply the recipe we developed for the Dutch elections example to the GPT-5 question, since with a stationary trend it makes sense to take a weighted average of previous values as our central estimate. Similar to the other example, I think an exponential decay weight of 0.5 or, equivalently, a horizon of 2 effective data points is reasonable, since I don't put much weight on the first data point of 8.15. While possible, I don't think it's likely for a new GPT to be trained that fast, and also, we've already passed 8.15 months since the release of GPT-4. With these weights of 0.25, 0.5, 1, our central estimate is 24.7, which we use as the mode of our prediction. Again, we use method R-8 to calculate percentiles directly from our three data point sample:

Percentile    Months, X_(n)
20th          8.15
32nd          15.42
68th          33.53

Interpolating between the second and third data points, we get that the 44th percentile corresponds to a duration of 21.5 months. GPT-4 was released in March 2023, so 21.5 months later falls at the very end of 2024. Hence, our prediction is 44% that GPT-5 will be released before 2025.
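For readers who want to retrace the numbers, here is a self-contained Python sketch of the whole recipe applied to the three GPT intervals. It assumes the midpoint-and-buffer reading of method R-8 with a buffer of 1/6, and the helper name `months_at` is made up for this illustration. Under that reading, the sketch reproduces the central estimate, the 32nd and 68th percentile anchors, and the 21.5 month figure. For the first point it gives roughly the 14th percentile rather than the 20th listed in the table, which does not change the interpolation, since only the second and third points are used.

```python
intervals = [8.15, 15.42, 33.53]  # months between releases, oldest first (also sorted)
weights = [0.25, 0.5, 1.0]        # decay 0.5; the most recent interval has weight 1

# Central estimate: recency-weighted average.
central = sum(w * x for w, x in zip(weights, intervals)) / sum(weights)

# Quantile anchors: each point sits at the midpoint of an interval proportional
# to its weight, with a buffer of 1/6 at both ends (assumed reading of R-8).
buffer = 1 / 6
total = sum(weights) + 2 * buffer
positions, left = [], buffer
for w in weights:
    positions.append((left + w / 2) / total)
    left += w


def months_at(p):
    """Linearly interpolate the duration at quantile p between the anchors."""
    for i in range(1, len(positions)):
        if positions[i - 1] <= p <= positions[i]:
            t = (p - positions[i - 1]) / (positions[i] - positions[i - 1])
            return intervals[i - 1] + t * (intervals[i] - intervals[i - 1])
    raise ValueError("p lies outside the range covered by the anchors")


print(round(central, 1))                  # 24.7
print([round(p, 2) for p in positions])   # [0.14, 0.32, 0.68]
print(round(months_at(0.44), 1))          # 21.5
```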

Should we use that calculation as is, or adjust it based on domain specific information? Here, I give some arguments that domain specific information pulling towards earlier outcomes should be downweighted. We should clearly distinguish between what is technically achievable versus the business incentives for GPT-5, even though the two considerations interact to influence the project timeline. When I think about the pace of technical AI research, one of the reasons GPT-4 took 33.5 months to be released is probably the change in architecture. Unlike the past GPTs that were a scale up of the transformer architecture, GPT-4 is most likely a Mixture of Experts model (of transformers). Being the first to identify and make that change probably cost OpenAI some time. In contrast, GPT-3.5 is likely not a Mixture of Experts model, but its performance was not good enough to give it the GPT-4 name, which nicely brings us to business incentives.

For the next iteration of GPT as well, OpenAI will probably wait for a suitably impressive model to grant the GPT-5 name. I'm fairly confident about this judgment, and in the cases of Gemini 1.5 and Claude 3, while they were released soon after Gemini 1 and Claude 1 (Claude 2 was unimpressive), they were not a step improvement over their predecessor like GPT-4 was over GPT-3. So, I think the most important domain specific information for this question concerns what is technically achievable at specific times, and that consideration is indifferent to the social impact and business considerations of ChatGPT.

Might GPT-5 be released sooner because it will be a simple scale up of GPT-4? My attempts to guess at the technical difficulties of creating GPT-5 are inconclusive. On one hand, it may be possible, even likely, that scaling to a bigger model trained on more data is enough for next level performance. On the other hand, OpenAI’s CEO Sam Altman has hinted that simply scaling up much larger (e.g. 100x compute, 50x cost) may not be the way forward. We might also be running out of high quality internet text, though I believe Epoch’s dataset indicates that it’s more likely that GPT-5 will exhaust the reserve of high quality text than that GPT-4 has exhausted it.

In conclusion, I think that the effects of ChatGPT don’t really affect the timing of GPT-5, and that there is no better way to guess at the technical difficulties of creating GPT-5 than to use the release timings of past GPTs.
