OpenAI releases first open-source models with Phi-like synthetic training

OpenAI has released its first open-source large language models, gpt-oss-120b and gpt-oss-20b, marking the company’s entry into the open-weight model space. While these models excel at certain benchmarks, they appear to follow the same synthetic data training approach as Microsoft’s Phi series, potentially prioritizing safety over real-world performance in what amounts to OpenAI’s version of “Phi-5.”

What you should know: These models demonstrate strong benchmark performance but show significant gaps in practical applications and out-of-domain knowledge.

  • The models perform well on technical benchmarks but score poorly on factual-recall benchmarks such as SimpleQA and lack knowledge in areas like popular culture (see the probe sketch after this list).
  • Early user reactions are mixed, with some praising the models’ capabilities and others voicing disappointment on social media.
  • The author predicts these models will fall into the category of “performs much better on benchmarks than on real-world tasks.”
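
To make that gap concrete, here is a minimal probe sketch. It assumes the open weights published on Hugging Face as openai/gpt-oss-20b and the Hugging Face transformers chat pipeline; both prompts are illustrative, not drawn from any benchmark.

```python
# Hedged sketch: contrast a benchmark-flavored question with SimpleQA-style
# factual recall. The model ID is the Hugging Face release; the prompts are
# illustrative. The 20B model needs a large GPU or quantization to run.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",  # open-weight release on Hugging Face
    device_map="auto",
)

prompts = [
    # In-domain, benchmark-flavored math question (the models' strong suit)
    "Differentiate f(x) = x^3 * sin(x) and show each step.",
    # Out-of-domain, SimpleQA-style pop-culture recall (the reported weak spot)
    "Who directed the 1994 film Leon: The Professional?",
]

for prompt in prompts:
    result = generator(
        [{"role": "user", "content": prompt}],
        max_new_tokens=256,
    )
    # The pipeline returns the full chat; the last message is the reply.
    print(result[0]["generated_text"][-1]["content"])
    print("---")
```

If the author’s prediction holds, the first prompt should yield a clean worked answer, while the second is where hallucinated or evasive answers would show up.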

The Phi connection: Former Microsoft researcher Sébastien Bubeck, who led development of the Phi model series, joined OpenAI at the end of 2024, suggesting a direct influence on these new models.

  • Phi models were trained primarily on synthetic data: text generated by other language models or drawn from curated, textbook-style material rather than scraped from the open internet.
  • This approach consistently produced impressive benchmark results but disappointing real-world performance across the Phi series.
  • Training on synthetic data lets developers “teach to the test” by generating training examples that mirror benchmark problems, inflating scores while reducing practical utility (a minimal sketch of this loop appears after this list).
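
The mechanics of that loop are simple. The sketch below is a generic illustration, not OpenAI’s actual pipeline: a teacher model is prompted to emit benchmark-shaped question/solution pairs, which are then saved as training data for the student. The client setup, teacher model name, and prompt are all assumptions.

```python
# Generic "teach to the test" synthetic-data loop (illustrative, not
# OpenAI's pipeline): prompt a teacher LLM for benchmark-shaped examples
# and save them as training data for a student model.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write one self-contained grade-school math word problem in the style "
    "of GSM8K, followed by a full worked solution. "
    'Return JSON: {"question": "...", "solution": "..."}'
)

def generate_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical teacher; any capable LLM works
            messages=[{"role": "user", "content": PROMPT}],
            response_format={"type": "json_object"},
        )
        examples.append(json.loads(resp.choices[0].message.content))
    return examples

with open("synthetic_train.jsonl", "w") as f:
    for ex in generate_examples(100):
        f.write(json.dumps(ex) + "\n")

# The file is benchmark-shaped by construction, which is exactly why scores
# inflate: the training distribution mirrors the test distribution.
```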

In plain English: Think of synthetic data training like studying for a test using only practice exams created by the test makers themselves. You’ll ace the official test but struggle with real-world problems that weren’t in those practice materials.

Why synthetic data matters for safety: OpenAI likely chose this approach to minimize risks associated with releasing open-source models.

  • Once released, open-weight models can be fine-tuned by anyone to strip out safety guardrails, creating a permanent liability for the company (the fine-tuning sketch after this list shows how low that barrier is).
  • Training on controlled synthetic data makes it easier to produce models that decline harmful requests and avoid learning problematic behaviors.
  • The author notes that “the main use-case for fine-tuning small language models is for erotic role-play,” highlighting safety concerns for companies releasing open models.
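
For a sense of how low the fine-tuning barrier is once weights are public, here is a minimal LoRA sketch using the peft and transformers libraries. The dataset path and adapter target modules are assumptions; the point is only that a few dozen lines suffice to re-tune an open model’s behavior.

```python
# Minimal LoRA fine-tune of an open-weight checkpoint (a sketch, assuming
# a generic causal LM; module names and the dataset are placeholders).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "openai/gpt-oss-20b"  # any released open-weight model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Train small adapter matrices instead of the full set of base weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumed attention module names
))

# Hypothetical corpus: whatever text the fine-tuner chooses to train on.
ds = load_dataset("json", data_files="my_corpus.jsonl")["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

Nothing in this loop inspects what the corpus contains, which is the author’s point: once the weights ship, the original safety training is only a starting point.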

Strategic positioning: Unlike Meta, OpenAI doesn’t need its open-source models to be exceptionally useful, since its primary business relies on closed-source offerings.

  • The company needed models that could beat Chinese open-source competitors on benchmarks while avoiding potential scandals.
  • This release allows OpenAI to claim participation in the open-source space without cannibalizing its core business model.
  • The synthetic data approach provides a safety buffer that traditional training methods cannot offer.
Source: “OpenAI’s new open-source model is basically Phi-5”
