Published in AI

DeepSeek-R1 borrowed OpenAI texts

05 March 2025


Nearly three quarters of them

A recent study by plagiarism-detection firm Copyleaks has found that 74.2 per cent of texts generated by DeepSeek-R1 are strikingly similar in style to output from OpenAI's models.

This finding raises significant concerns regarding DeepSeek-R1's originality, particularly in data sourcing, intellectual property rights, and transparency.

The report pointed out that unacknowledged dependence on existing models can perpetuate biases, restrict diversity, and introduce legal and ethical dilemmas.

The report said DeepSeek's claims of pioneering a cost-effective training methodology, if based on unauthorised use of OpenAI's work, may have misled the market, contributing to a $593 billion single-day loss in NVIDIA's market value and giving DeepSeek an unfair competitive edge.

Copyleaks used three advanced AI classifiers trained on texts from major models, including Claude, Gemini, Llama, and OpenAI. These classifiers detected nuanced stylistic elements such as sentence structure, vocabulary, and phrasing.

A "unanimous jury" system enhanced the methodology's robustness, requiring consensus among all three classifiers before confirming a classification. This strategy ensured a high precision rate of 99.88 per cent and a false-positive rate of just 0.04 per cent, effectively identifying texts from both known and novel AI models.
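The "unanimous jury" idea can be sketched in a few lines. Copyleaks has not published its classifiers or features, so the stand-in classifiers below are hypothetical placeholders; only the consensus rule itself reflects the approach described in the report.

```python
# Sketch of a "unanimous jury" ensemble: a text is attributed to a model
# only if every classifier independently agrees. The real Copyleaks
# classifiers are not public; these toy stand-ins are illustrative only.

def unanimous_jury(text, classifiers):
    """Return a model label only when all classifiers agree, else None."""
    votes = [clf(text) for clf in classifiers]
    if all(vote == votes[0] for vote in votes):
        return votes[0]
    return None  # no consensus -> leave the text unattributed

# Hypothetical stand-in classifiers. Real ones would be trained on
# stylistic signals such as sentence structure, vocabulary, and phrasing.
clf_a = lambda text: "openai" if "delve" in text.lower() else "other"
clf_b = lambda text: "openai" if len(text.split()) > 3 else "other"
clf_c = lambda text: "openai"

print(unanimous_jury("Let us delve into the details", [clf_a, clf_b, clf_c]))
print(unanimous_jury("Hi", [clf_a, clf_b, clf_c]))
```

Requiring unanimity trades recall for precision: texts on which the classifiers disagree are simply left unattributed, which is consistent with the very low false-positive rate the report cites.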

When applied to DeepSeek-R1, the analysis revealed that 74.2 per cent of its generated texts aligned with OpenAI's stylistic patterns, prompting critical questions about the model's originality and the broader implications for AI-generated content.

By contrast, Microsoft's Phi-4 model showed a 99.3 per cent divergence, indicating no resemblance to any known model and pointing to independent development.

Copyleaks Chief Data Scientist Shai Nisan said: "With this research, we have moved beyond general AI detection as we knew it and into model-specific attribution, a breakthrough that fundamentally changes how we approach AI content."

He further emphasised the importance of this capability in enhancing transparency, ensuring ethical AI training practices, and safeguarding the intellectual property rights of AI technologies to prevent potential misuse.

Nisan added: "As AI technologies evolve, stakeholders must accurately discern the origins of AI-generated content. Our approach enhances fair use protection and improves security and tracks the evolution of AI writing styles."
