DeepSeek May Have Trained Its Latest AI Model on Outputs From Google’s Gemini

Last week, the Chinese lab DeepSeek unveiled an updated version of its R1 reasoning AI model, which demonstrates strong performance across various mathematics and coding benchmarks. Although the company did not disclose the specific data sources used for training the model, some AI researchers suspect that a portion of the data may have originated from Google’s Gemini AI family.

Sam Paech, a developer based in Melbourne known for creating evaluations of “emotional intelligence” in AI, presented what he believes to be evidence that DeepSeek’s latest model was trained using outputs from Gemini. He noted in an X post that the R1-0528 model tends to favor terminology and phrases resembling those preferred by Google’s Gemini 2.5 Pro.

Paech stated, “If you’re curious why the new DeepSeek R1 sounds somewhat different, I suspect they shifted from training on synthetic OpenAI outputs to synthetic Gemini outputs.”
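
Comparisons of this kind can be approximated by measuring how often two models favor the same words and phrases. Below is a minimal sketch of such a lexical-overlap check; it is not Paech’s actual methodology, and the sample outputs and the choice of bigram cosine similarity are illustrative assumptions.

```python
# A minimal sketch of a lexical-overlap comparison between two models'
# outputs. The corpora below are hypothetical stand-ins, not real samples.
from collections import Counter
from math import sqrt

def ngram_counts(texts, n=2):
    """Count word n-grams across a collection of model outputs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpora of outputs sampled from each model.
r1_outputs = ["delve into the multifaceted landscape of modern computing"]
gemini_outputs = ["delve into the multifaceted landscape of distributed systems"]

score = cosine_similarity(ngram_counts(r1_outputs), ngram_counts(gemini_outputs))
print(f"bigram-distribution similarity: {score:.3f}")
```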

While this is not definitive proof, another developer, the pseudonymous creator of a “free speech evaluation” tool called SpeechMap, observed that the model’s traces, the intermediate “thoughts” it generates while working toward an answer, read like Gemini traces.

DeepSeek has faced allegations of training on data from rival AI models before. In December, it was discovered that DeepSeek’s V3 model frequently identified itself as ChatGPT, OpenAI’s chatbot, suggesting that V3 may have been trained on ChatGPT chat logs.

Earlier this year, OpenAI told the Financial Times it had found evidence that DeepSeek had engaged in distillation, a technique in which a new model is trained on the outputs of a larger, more capable one. According to Bloomberg, Microsoft, a close OpenAI collaborator and investor, detected large amounts of data being exfiltrated through OpenAI developer accounts in late 2024, accounts OpenAI believes are linked to DeepSeek.

While distillation is an established practice, OpenAI’s terms of service prohibit customers from using its model outputs to build competing AI systems.
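
For context, the textbook form of distillation trains a “student” model to match a “teacher” model’s softened output distribution. The sketch below illustrates only that standard objective; it is not DeepSeek’s or OpenAI’s pipeline, and API-based distillation typically fine-tunes on sampled text rather than raw logits, which outside parties cannot access.

```python
# A minimal sketch of the classic distillation objective: train a student
# to match a teacher's softened output distribution via KL divergence.
# Illustrative only; not any particular lab's training setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: a batch of 4 examples over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```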

It’s worth noting that many models misidentify themselves and converge on the same words and phrases, largely because of the glut of low-quality AI-generated content on the open web: content farms churn out AI-written clickbait, and bots flood platforms like Reddit and X.

This contamination has made it increasingly difficult to thoroughly filter AI-generated output from training datasets.
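
One common mitigation is heuristic filtering: dropping web documents that contain telltale AI-assistant phrases before training. Below is a minimal sketch with an illustrative phrase list; real pipelines combine much broader heuristics and learned classifiers.

```python
# A minimal sketch of one dataset-cleaning heuristic: drop documents
# containing known AI self-identification phrases. The phrase list and
# sample corpus are illustrative assumptions.
AI_TELL_PHRASES = (
    "as an ai language model",
    "i am chatgpt",
    "i don't have personal opinions",
)

def looks_ai_generated(document: str) -> bool:
    """Flag documents containing known AI self-identification phrases."""
    lowered = document.lower()
    return any(phrase in lowered for phrase in AI_TELL_PHRASES)

corpus = [
    "The 2024 eclipse was visible across North America.",
    "As an AI language model, I cannot browse the internet.",
]
clean = [doc for doc in corpus if not looks_ai_generated(doc)]
print(f"kept {len(clean)} of {len(corpus)} documents")
```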

However, experts like Nathan Lambert, a researcher at the nonprofit AI research institute AI2, believe it’s entirely plausible that DeepSeek trained on data sourced from Google’s Gemini. He commented, “If I were DeepSeek, I would certainly generate a significant amount of synthetic data from the best API model available. They are short on GPUs yet have ample funds, making it a smart computation choice for them.”
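
What Lambert describes, generating synthetic training data from a strong API model, might look something like the sketch below. The prompts, model choice, and output path are hypothetical, and the OpenAI Python client is used as a stand-in for whichever provider’s API a lab might query.

```python
# A minimal sketch of synthetic data generation via an API model.
# Hypothetical prompts and output path; requires OPENAI_API_KEY in the
# environment. Any provider's API could be substituted.
import json
from openai import OpenAI

client = OpenAI()
prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that merges two sorted lists.",
]

with open("synthetic_data.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o",  # hypothetical choice of "best API model"
            messages=[{"role": "user", "content": prompt}],
        )
        pair = {"prompt": prompt, "completion": response.choices[0].message.content}
        f.write(json.dumps(pair) + "\n")
```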

In response to these issues, AI companies are enhancing their security measures to prevent distillation.

For instance, in April, OpenAI began requiring organizations to complete an ID verification process to access certain advanced models. The process requires a government-issued ID from one of the countries supported by OpenAI’s API; China is not on that list.

Similarly, Google recently began “summarizing” the raw traces generated by models available through its AI Studio developer platform, making it harder to train rival models on Gemini traces. In May, Anthropic said it would likewise summarize its own models’ traces, citing the need to protect its competitive advantage.

We have reached out to Google for comment and will update this article if we receive a response.