Melbourne-based developer Sam Paech stirred the pot last week, publishing what he claims is evidence that DeepSeek's R1-0528 model, an updated version of the company's R1 reasoning AI, was trained on outputs from Google's Gemini family. In an X post, Paech pointed to the model's preference for words and expressions similar to those favored by Gemini 2.5 Pro.
It isn't a definitive smoking gun, but the claim isn't without support. Echoes of Gemini also turn up in the traces of DeepSeek's R1 model (the "thoughts" the model generates as it works through a problem), as noted by the pseudonymous creator of SpeechMap, a "free speech eval" tool for AI. Those traces, they say, "read like Gemini traces."
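Paech hasn't published a full methodology, but the intuition behind this kind of fingerprinting is easy to sketch: if two models were trained on related data, their outputs should favor the same vocabulary at similar rates. Below is a minimal, purely illustrative version (the sample outputs and the choice of cosine similarity over word-frequency profiles are assumptions, not Paech's actual method):

```python
from collections import Counter
import math

def word_profile(texts):
    """Relative word-frequency profile across a set of model outputs."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Invented sample outputs; a real comparison would use thousands of
# responses to identical prompts from each model.
model_a = ["Let's delve into the multifaceted nuances of this topic."]
model_b = ["We should delve into the nuances, which are multifaceted."]

score = cosine_similarity(word_profile(model_a), word_profile(model_b))
print(f"Lexical similarity: {score:.3f}")
```

A score near 1.0 means near-identical word preferences; real analyses compare distinctive words and phrasings across large prompt sets rather than whole vocabularies.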
DeepSeek has been accused of leaning on rival AI models for training data before. In December, developers observed that DeepSeek's V3 model often identified itself as ChatGPT, suggesting it may have been trained on ChatGPT chat logs. And earlier this year, OpenAI said it had found evidence linking DeepSeek to distillation, a technique for training AI models by extracting data from larger, more capable ones, and has since taken measures to prevent the practice.
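To see what distillation looks like in practice, consider its first phase: harvesting a stronger model's outputs as training targets. This is a hedged sketch, not DeepSeek's pipeline; query_teacher is a hypothetical placeholder for a real API client, and the prompts are invented:

```python
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a stronger 'teacher' model's API.
    A real pipeline would hit an HTTP endpoint with auth; this placeholder
    returns a canned string so the sketch runs."""
    return f"Teacher model's answer to: {prompt}"

# First phase of distillation: harvest the teacher's outputs so they can
# serve as supervised fine-tuning targets for a smaller student model.
prompts = [
    "Explain photosynthesis in two sentences.",
    "Summarize the causes of the French Revolution.",
]

with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        record = {"prompt": prompt, "completion": query_teacher(prompt)}
        f.write(json.dumps(record) + "\n")
```

The resulting prompt/completion pairs would then be used to fine-tune a smaller "student" model, which is precisely the behavior OpenAI's terms forbid when the teacher is one of its models.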
Distillation itself is not uncommon, but OpenAI's terms of service prohibit it, barring customers from using the company's model outputs to build competing AI. The question of who trained on what is further muddied by the fact that many models misidentify themselves and converge on the same words and phrasings, a consequence of the open web's growing saturation with AI-generated content.
That "contamination" has made thoroughly filtering AI outputs from training datasets a daunting task. Even so, experts like Nathan Lambert of the nonprofit AI research institute AI2 believe it's plausible that DeepSeek trained on data from Google's Gemini. Lambert suggests that DeepSeek, short on GPUs but flush with cash, would benefit from generating synthetic data from the best API model available, effectively buying itself more compute.
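One reason the filtering is so hard: simple heuristics only catch the most obvious boilerplate. A minimal sketch of such a heuristic, with the marker phrases being illustrative assumptions rather than a vetted list:

```python
import re

# Telltale phrases that often mark AI-generated text. A real filter would
# combine classifiers, deduplication, and provenance metadata; string
# matching like this is only a first pass.
AI_MARKERS = [
    r"as an ai language model",
    r"i cannot fulfill that request",
    r"i do not have personal opinions",
]
MARKER_RE = re.compile("|".join(AI_MARKERS), re.IGNORECASE)

def looks_ai_generated(document: str) -> bool:
    """Crude heuristic: flag documents containing known AI boilerplate."""
    return bool(MARKER_RE.search(document))

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "As an AI language model, I cannot browse the internet.",
]
clean_corpus = [doc for doc in corpus if not looks_ai_generated(doc)]
print(clean_corpus)  # only the first document survives
```

Anything subtler, such as a model's stylistic fingerprints bleeding into otherwise human-looking text, slips straight through a filter like this, which is why contamination keeps accumulating.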
In response to concerns like these, AI companies have been tightening security. OpenAI now requires organizations to complete an ID verification process to access certain advanced models, and China is not among the supported countries. Google, meanwhile, has begun summarizing the traces generated by models available through its AI Studio developer platform, making it harder to train performant rivals on Gemini traces. Anthropic announced a similar move, saying it would start summarizing its own model's traces to protect its "competitive advantages."
The episode underscores how murky training-data provenance has become. As labs race toward more capable systems, questions about where training data comes from, and who is entitled to use it, will only grow harder to answer.