Diligence of DeepSeek R1 Dominates Today, Latent Space Navigation Tomorrow 

Image generated by DALL-E 3 / Jussi Rasku

All eyes in the AI world are currently on DeepSeek. Politics aside, the results of their R1 model are competitive, and they have shown an ability to leverage several interesting ongoing trends in the technical development of large language models (LLMs).

One of the key insights behind DeepSeek R1 is that inference, that is, making the computer produce text, is relatively cheap compared to training an LLM. To challenge OpenAI, DeepSeek took a smart approach with R1: instead of developing novel, efficient architectures in a quest to improve the quality of the ‘thinking’ LLMs do, they built on existing, proven methods for increasing model output quality, such as Mixture of Experts (MoE) architectures and encoding the model weights at lower precision than their competition. The one critical trick was borrowed from OpenAI: they fine-tuned the model to generate more tokens, or words. The most interesting contribution for the AI community lies in how the model was trained; the outcome of their considerable efforts is, in itself, not that surprising. After all, it makes intuitive sense that doing more work and spending more time on a problem leads to better answers, as the OpenAI reasoning models have shown.

However, using more resources to produce the same results creates a measurement challenge: to enable an apples-to-apples comparison, we should perhaps start measuring the time and compute resources required to achieve impressive results on the LLM benchmarks. From machine learning, we already know that ensembles work, that adding stochasticity improves outcomes, and that overparameterization with more data (even synthetic data) helps improve results. In short, more effort improves results. However, all of this comes at the cost of increased runtime and energy use. If we do not consider both sides of the effort-results tradeoff, we risk drawing the wrong conclusions.

I would also caution against reading too much into rumors about training costs; interpretations should rest on solid facts, not speculation. Consider, for example, how much optimization, how many trial runs, and what depth of technical expertise, exceptional engineering, and low-level programming were required to use the H800 Nvidia GPUs effectively for LLM training. What was the cost of this engineering effort? How large was the investment in the data center housing 2,048 NVIDIA H800 GPUs, and how was it funded? Comparing the investment required to build an apartment complex to the rent of a single apartment makes no sense.

Finally, if we were to set our goal to compete with, or even surpass, DeepSeek here in the EU, how should we proceed? Their approach highlights the importance of building on state-of-the-art techniques that have been tested in practice. However, simply repeating these motions is not enough. From a researcher’s point of view, investing in advanced research on architectures and experiments that add new capabilities to chain-of-thought LLMs is critical. For example, fine-tuning LLMs to employ the scientific method, enabling tool use to test hypotheses and results, and instructing them to verify outputs through diverse approaches and sources before finalizing answers would likely secure a strong standing on the leaderboards. Yes, this would increase inference time, as well as hardware and energy costs, but for applications where these constraints aren’t prohibitive, the resulting improvements in quality could be significant.

In addition to access to large-scale computation, we will need smart, highly educated, and driven individuals who ask the right questions and have the freedom to pursue them. The approach of doing more “thinking” has now been shown to work. The question now is: how can we do it more effectively? One promising possibility is to make AI models work in concept space instead of relying on human language. Instead of emitting tokens or words, they could emit “thoughts” as vector representations in a high-dimensional concept space. Considering that, together with the recent DeepSeek R1 results, my personal bet is that Large Concept Models relying on Continuous Latent Space or Chain-of-Embedding strategies will be the next thing to shake the AI world.
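To make the contrast concrete, here is a deliberately tiny toy sketch of the two reasoning styles. It is not the architecture of any real model; the linear recurrence, the embedding table, and all names and shapes are hypothetical stand-ins. The only point it illustrates is the structural difference: token-space chain-of-thought rounds the hidden state to the nearest vocabulary item at every step, while latent-space (chain-of-embedding) reasoning feeds the raw vector straight back in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a single linear recurrence standing in for a transformer step.
# DIM, VOCAB, W, and E are illustrative placeholders, not a real model.
DIM, VOCAB = 8, 16
W = rng.normal(scale=0.3, size=(DIM, DIM))   # hidden-state transition
E = rng.normal(size=(VOCAB, DIM))            # token embedding table

def step(h):
    """One forward step in the model's hidden (concept) space."""
    return np.tanh(W @ h)

def chain_of_thought_tokens(h, n_steps):
    """Token-space reasoning: after each step the state is projected to
    the nearest discrete token and re-embedded, losing information."""
    for _ in range(n_steps):
        h = step(h)
        token = np.argmax(E @ h)   # decode: snap to the closest token
        h = E[token]               # re-embed: continue from that token
    return h

def chain_of_embeddings(h, n_steps):
    """Latent-space reasoning: the raw hidden state is fed straight back
    in, so nothing is rounded to the vocabulary between steps."""
    for _ in range(n_steps):
        h = step(h)
    return h

h0 = E[3]  # start from some token's embedding
print(chain_of_thought_tokens(h0, 4))
print(chain_of_embeddings(h0, 4))
```

The token variant can only ever pass one of the 16 embedding rows between steps, while the latent variant carries a full continuous vector; that information bottleneck is precisely what the concept-space proposals aim to remove.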

Author

Jussi Rasku

Vice head of GPT-Lab Seinäjoki & Postdoctoral Research Fellow
