OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population.


OpenAI’s new “o3” language model achieved an IQ score of 136 on a public Mensa Norway intelligence test, exceeding the threshold for entry into the country’s Mensa chapter for the first time.

The score, calculated from a seven-run rolling average, places the model above approximately 98 percent of the human population, according to a standardized bell-curve IQ distribution used in the benchmarking.
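For readers who want to sanity-check the percentile claim, here is a small back-of-the-envelope calculation. It assumes the conventional IQ scale with a mean of 100 and a standard deviation of 15; the article does not state which scale TrackingAI.org uses, so treat the exact percentiles as illustrative.

```python
# Back-of-the-envelope percentile check, assuming the conventional
# IQ scale (mean 100, standard deviation 15). The scale is an
# assumption; the article does not specify it.
from statistics import NormalDist

iq_scale = NormalDist(mu=100, sigma=15)

score = 136
percentile = iq_scale.cdf(score) * 100
print(f"IQ {score} is roughly the {percentile:.1f}th percentile")  # ~99.2

# The usual Mensa admission cutoff is the 98th percentile,
# which on this scale corresponds to roughly IQ 131.
cutoff = iq_scale.inv_cdf(0.98)
print(f"98th percentile cutoff is roughly IQ {cutoff:.0f}")  # ~131
```

On this assumed scale, a score of 136 sits comfortably above the 98th-percentile entry threshold, consistent with the article's "above approximately 98 percent" framing.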

o3 Mensa scores (Source: TrackingAI.org)

The finding, disclosed through data from independent platform TrackingAI.org, reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

O-series Dominance and Benchmarking Methodology

The “o3” model, released this week, is part of the “o-series” of large language models, which accounts for most top-tier rankings across both test types evaluated by TrackingAI.

The two benchmark formats included a proprietary “Offline Test” curated by TrackingAI.org and a publicly available Mensa Norway test, both scored against a human mean of 100.

While “o3” posted a 116 on the Offline evaluation, it saw a 20-point boost on the Mensa test, suggesting either enhanced compatibility with the latter’s structure or data-related confounds such as prompt familiarity.

The Offline Test included 100 pattern-recognition questions designed to avoid anything that might have appeared in the data used to train AI models.

Both assessments report each model’s result as an average across the seven most recent completions, but no standard deviation or confidence intervals were released alongside the final scores.
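To make that gap concrete, the sketch below shows what reporting a seven-run average together with a basic uncertainty estimate could look like. The per-run scores are invented for illustration, not TrackingAI.org data, and the interval uses a crude normal approximation.

```python
# Illustrative only: a seven-run average with a rough 95% interval.
# The per-run scores below are hypothetical, not TrackingAI.org data.
from statistics import mean, stdev

runs = [134, 138, 131, 139, 136, 135, 139]  # hypothetical per-run IQ scores

avg = mean(runs)                            # the kind of figure the leaderboard reports
sd = stdev(runs)                            # the kind of figure it omits
half_width = 1.96 * sd / len(runs) ** 0.5   # crude normal-approximation 95% CI

print(f"7-run average: {avg:.1f}")
print(f"run-to-run standard deviation: {sd:.1f}")
print(f"approx. 95% CI for the mean: {avg - half_width:.1f} to {avg + half_width:.1f}")
```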

The absence of methodological transparency, particularly around prompting strategies and scoring scale conversion, limits reproducibility and interpretability.

Methodology of testing

TrackingAI.org states that it compiles its data by administering a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity.

Each language model is presented with a statement followed by four Likert-style response options (Strongly Disagree, Disagree, Agree, Strongly Agree) and is instructed to select one while justifying its choice in two to five sentences.

Responses must be clearly formatted, typically enclosed in bold or asterisks. If a model refuses to answer, the prompt is repeated up to ten times.

The most recent successful response is then recorded for scoring purposes, with refusal events noted separately.
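As a rough illustration, the retry-and-record procedure described above might look something like the following sketch. The `ask_model` function and the exact prompt wording are stand-ins; TrackingAI.org has not published its actual test harness.

```python
# Minimal sketch of the retry-and-record procedure described above.
# `ask_model` is a hypothetical stand-in for the real API call, and the
# prompt wording is paraphrased from the article, not TrackingAI.org's own.
OPTIONS = ["Strongly Disagree", "Disagree", "Agree", "Strongly Agree"]
MAX_ATTEMPTS = 10

def administer_item(ask_model, statement):
    """Repeat the prompt up to ten times, recording the most recent
    successful response and counting refusals as a separate data point."""
    prompt = (
        f"{statement}\n"
        f"Respond with exactly one of: {', '.join(OPTIONS)}. "
        "Mark your choice in bold (asterisks) and justify it in two to five sentences."
    )
    recorded = None
    refusals = 0
    for _ in range(MAX_ATTEMPTS):
        reply = ask_model(prompt)
        # Check longer option names first so "Strongly Agree" is not
        # mistaken for plain "Agree".
        choice = next(
            (opt for opt in sorted(OPTIONS, key=len, reverse=True) if opt in reply),
            None,
        )
        if choice is None:
            refusals += 1      # refusal or unusable formatting
            continue
        recorded = choice      # most recent successful response is kept
        break
    return recorded, refusals
```

A real harness would also need to log raw replies and per-model refusal counts, which the article says are tracked separately.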

This methodology, refined through repeated calibration across models, aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.

Performance spread across model types

The Mensa Norway test sharpened the distinction among frontier models, with o3’s 136 IQ marking a clear lead over the next-highest entry.

In contrast, other popular models like GPT-4o scored considerably lower, landing at 95 on Mensa and 64 on Offline, emphasizing the performance gap between this week’s “o3” release and other top models.

Among open-source submissions, Meta’s Llama 4 Maverick was the highest-ranked, posting a 106 IQ on Mensa and 97 on the Offline benchmark.

Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.

Multimodal models see reduced scores and limitations of testing

Notably, models specifically designed to incorporate image input capabilities consistently underperformed their text-only versions. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version.

The discrepancy was more pronounced on the Mensa test, where the text-only variant achieved 122 compared to 86 for the visual version. This suggests that some methods of multimodal pretraining may introduce reasoning inefficiencies that remain unresolved at present.

However, “o3” breaks this trend: it can also analyze and interpret images to a very high standard, far better than its predecessors.

Ultimately, IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy.

Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition.

The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.

As TrackingAI.org’s researchers acknowledge, even their attempts to avoid training-set leakage do not entirely preclude the possibility of indirect exposure or format generalization, particularly given the lack of transparency around training datasets and fine-tuning procedures for proprietary models.

Independent Evaluators Fill Transparency Gap

Organizations such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon to provide third-party assessments as model developers continue to limit disclosures about internal architectures and training methods.

These “shadow evaluations” are shaping the emerging norms of large language model testing, especially in light of the opaque and often fragmented disclosures from leading AI firms.

OpenAI’s o-series holds a commanding position in this testing workflow, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than a definitive indicator of broader capabilities.

Per TrackingAI.org, additional analysis on format-based performance spreads and evaluation reliability will be necessary to clarify the validity of current benchmarks.

With model releases accelerating and independent testing growing in sophistication, comparative metrics may continue to evolve in both format and interpretation.
