Model Focus: DeepSeek - Uneven Profile with Serious Flaws

Reading time: approx. 5 min

We continue our review by looking at models in the lower performance range. DeepSeek-R1:8b is an example of a model with a very uneven profile. Although it performs surprisingly well in one specific area (coding), its serious flaws in language, factual accuracy, and reasoning make it an unreliable and risky choice for most work tasks in education.

What You Will Learn

  • The specific problems with DeepSeek, especially regarding hallucinations.
  • Why uneven performance makes a model unreliable.

The Basics: DeepSeek in Brief

  • Model: DeepSeek-R1:8b
  • Developer: DeepSeek AI (a Chinese AI company focusing on open source)
  • Weaknesses: Major problems with facts, reasoning, and linguistic quality. Tends to hallucinate.
  • Ollama command: ollama run deepseek-r1:8b
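If you would rather send test prompts from a script than from the interactive prompt, the sketch below shows one possible way to do it in Python. It assumes that Ollama is running locally on its default port (11434), that the model has already been downloaded, and that the standard /api/generate endpoint is reachable. The function names build_request and ask are made up for this illustration and are not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="deepseek-r1:8b"):
    """Build (but do not send) a POST request for a single, non-streamed answer."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="deepseek-r1:8b", timeout=120):
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server returns one JSON object
    # whose "response" field holds the model's answer.
    return reply["response"]
```

Calling ask("Explain what a fraction is in two sentences.") only works while the Ollama server is running. Given the results below, treat every answer as something to verify, not something to paste into teaching material.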

Results from the Benchmark

DeepSeek's results are extremely uneven: a top score in one category, and only 2/5 or 3/5 in the other four.

  • Bright Spot (Code & Technology): Received top rating (5/5) for writing correct Python code with a clear explanation.
  • Factual Knowledge: Gave a completely incorrect answer about the introduction of compulsory elementary school and mixed up several different school reforms (2/5).
  • Reasoning: Hallucinated wildly and invented incomprehensible concepts such as "moon murder" and "third heaven's butterfly" (2/5).
  • Linguistic Quality: The language was unpredictable with made-up words and poor flow (2/5).
  • Pedagogy: The attempt to explain fractions contained the incorrect and confusing word "bunthärd" (3/5).
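To make "uneven" concrete, the scores above can be summarized in a few lines of Python. This is only an illustrative sketch: the five scores come from the list above, but the choice of summary measures (average, lowest score, spread between best and worst) is an assumption made here and not part of the benchmark's own method.

```python
# Scores (out of 5) taken from the benchmark results listed above.
scores = {
    "Code & Technology": 5,
    "Factual Knowledge": 2,
    "Reasoning": 2,
    "Linguistic Quality": 2,
    "Pedagogy": 3,
}


def summarize(scores):
    """Return simple measures of how even or uneven a score profile is."""
    values = list(scores.values())
    lowest, highest = min(values), max(values)
    return {
        "mean": sum(values) / len(values),
        "lowest": lowest,
        "highest": highest,
        # A large spread means the top score says little about the rest.
        "spread": highest - lowest,
        # Categories that share the lowest score, in alphabetical order.
        "weakest": sorted(name for name, s in scores.items() if s == lowest),
    }


summary = summarize(scores)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The average comes out at 2.8, while the spread between the best and worst category is 3 points on a 5-point scale. Three of the five categories share the lowest score, which is why the single 5/5 is a poor guide to how the model behaves in general.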

Practical Application: A Warning Signal

The experience with DeepSeek is an important lesson: do not focus blindly on a single strength. A model's real value lies in its reliability across a broad spectrum of tasks. DeepSeek's inability to handle basic language and facts makes it unsuitable as a general assistant. The risk that it will introduce errors into your material is far too great.

Conclusion

Avoid DeepSeek for general use. If you have a very specific need for a local model to generate code, it may be worth testing, but for all other text-based work, it is a risk factor.

Next Steps

From a model that is uneven, we move to another popular model that unfortunately proved to have even more fundamental flaws in our testing. The next lesson is about Mistral:7b.