Xenophobia or Racism around DeepSeek … or Just Incompetence?

DeepSeek freaked out my grandfather and my college roommate – but none of the AI researchers I know. I’ve spent the last 15+ years watching new AI models come out all the time. Progress is constant, and rarely a week goes by that I’m not impressed by a new model or technique. For me and my colleagues, that’s what DeepSeek was: an impressive new model, but nothing mind-blowing. There are a lot of cool insights and neat engineering advancements in the paper – not to sell it short – but that is scientific progress. It’s been fascinating watching the world (or at least the US stock market) freak out over DeepSeek these past few weeks, because it isn’t freaking out AI researchers. Most of us spent some time reading the paper, figuring it out, and playing around with it online, without having our world turned upside down. We do this all the time; it’s our job. In fact, a lot of us had read the DeepSeek paper before everything blew up and had already moved on. (“Why haven’t you written about DeepSeek yet?”) Instead…
The questions I get from researchers are: “Why?” and “Is this xenophobia or racism?”
It’s weird – normally I get technical questions from technical people, while the rest of the world asks about racism and the world ending with AI. With DeepSeek, it’s the AI researchers asking me about racism. I’m sure that’s part of it, but I think a lot of it is just incompetence on the part of the financial analysts who cover AI. I don’t think the vast majority of them understand it. There’s been some good reporting and blog posts over the past few weeks as things have normalized a bit, but there are still a lot of people freaking out about how a Chinese company did this.
Technology improves. The way we train our AI models is incredibly stupid: we throw insane amounts of data at these models – pretty much the entire internet – to do exactly what your iPhone’s stupid autocomplete algorithm does. You get some pretty neat results, but it isn’t really clear that brute force is the smartest way to do these things. There’s a land grab going on right now with AI models, and this simple, but expensive, way of doing things is the easiest. I don’t think any serious AI researcher expects this to continue. We’ve just incentivized Chinese researchers to move down this path a bit sooner, since they can only buy NVIDIA H800s instead of H100s.
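To make the autocomplete comparison concrete, here is a toy sketch (my illustration, not anything from DeepSeek): count which word follows which in a corpus, then “autocomplete” by picking the most frequent successor. An LLM is doing this same next-token prediction task, just with a huge neural network over the whole internet instead of a lookup table.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: tally which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def autocomplete(word):
    # Return the most frequent word seen after `word` in the corpus.
    return successors[word].most_common(1)[0][0]

print(autocomplete("the"))  # → "cat" ("cat" follows "the" twice)
```

The entire modern LLM land grab is, at heart, scaling this brute-force prediction game up by many orders of magnitude.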
We will continue to see advancements like this going forward. Who knows where the next crazy advancement is going to come from? It could be a foreign company no one has ever heard of, or it might be a student at a University in the US. Anyone actually in the field is probably expecting this. But Wall Street is likely to freak out again … and then recover a few days later.
Cost. $5.6 million. This is the thing that is really scaring most people – and it highlights a fundamental disconnect between someone who has built models and the broader public following NVIDIA’s stock price. It is a major underestimate of the total cost. Their paper states that the figure covers only the “official training of DeepSeek-V3.” That is one run – which doesn’t count all the runs that go into preparing for that training run, nor the costs of DeepSeek-V2 that fed into it, nor salaries, etc. Plus, things break constantly when training models; you often have to go back and redo parts of the process. This engineering is hard and underestimated. Also, I remember hearing that the final training run for the LLM Bloomberg built was roughly $10M. Under $6 million is super impressive, but it is still a lot of money – and engineers cost a lot more. Recent PhD hires are getting $500,000 salaries while Master’s students are getting $400,000. Sure, labor is cheaper in China, but costs add up quite quickly beyond a single training run. I’ve heard a senior professor in the field hypothesize that funding the entire DeepSeek company, not a single run, actually cost on the order of a billion dollars … so, yes, you will still see a lot of CapEx from big tech companies going forward.
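It’s worth seeing where the headline number actually comes from. The DeepSeek-V3 technical report derives it as a single line of arithmetic – rented GPU-hours for the final run times an assumed rental price – and explicitly excludes everything else:

```python
# The $5.6M figure from the DeepSeek-V3 technical report:
# GPU-hours for the one "official" run times an assumed rental price.
gpu_hours = 2.788e6        # reported H800 GPU-hours for the final run
price_per_gpu_hour = 2.0   # assumed rental price in USD per GPU-hour

official_cost = gpu_hours * price_per_gpu_hour
print(f"${official_cost / 1e6:.3f}M")  # → $5.576M

# NOT included in that number: prior experiments, failed runs,
# DeepSeek-V2, data pipelines, or anyone's salary.
```

That is the whole calculation – which is exactly why treating it as the cost of “building DeepSeek” is such a disconnect.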
Great, so you can do a lot with less right now … now let’s see what happens when we scale that $5.6 million up to a hundred million. I bet we see this happening with a major US tech company (or a Chinese one). These improvements do not necessarily have to make things cheaper. We are instead going to see wayyy more powerful models coming out that are quite expensive.
Cheating? I hear a bit of this as well: that DeepSeek is cheating somehow, like stealing other companies’ intellectual property. Based on things like the number of parameters in the model, it doesn’t seem to be copied from any other model to start (no upcycling, etc.). However, there probably is distillation. OpenAI claims that DeepSeek copied some of their model’s outputs. Sure, this breaks the Terms of Service … but most of what we see in our AI models is vacuumed up from around the internet anyway. This is just slightly less socially acceptable than most of the other things we see. OpenAI doesn’t tell you where their data comes from. More open models like Meta’s Llama models do not disclose their training data either. This feels like it sits on that spectrum rather than being complete corporate espionage.
What actually scares me: User Logs. What makes Google search work? It’s not some fancy algorithm … it’s that Google knows what people actually click on when they type in a query. That is worth more than anything. As people use DeepSeek more and more, this will be the most valuable thing that they get … user interactions.
My main takeaways are:
Cool, I should play around with Mixture-of-Experts more.
We will still throw a lot of money at GPUs.
User logs will be the main goldmine that this generates for DeepSeek.
Auto-regressive models (i.e., iPhone autocorrect) will continue to be key (see Janus below).
Some other thoughts about DeepSeek and Chinese AI models:
For a more technical audience, I think the following blog is super useful: https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture The start of the conclusion sums up what I think a lot of AI researchers think about DeepSeek:
“I see many of the improvements made by DeepSeek as “obvious in retrospect”: they are the kind of innovations that, had someone asked me in advance about them, I would have said were good ideas. However, as I’ve said earlier, this doesn’t mean it’s easy to come up with the ideas in the first place.”
Mixture-of-Experts! I expect a lot more work will come out on Mixture-of-Experts (MoE). This seems to be a key insight from DeepSeek. MoE has been around basically forever (in AI terms) – the idea dates back to the early 1990s, and Google’s sparsely-gated MoE paper revived it for modern deep learning in 2017. That paper is actually older than the Transformer paper – the main AI model architecture that everyone uses now. Meta (aka Facebook) used MoE in their No Language Left Behind (NLLB) paper a few years ago, which is where I started seeing it gain more traction.
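The core trick is easy to sketch (this is my minimal illustration of MoE routing in general, not DeepSeek’s actual architecture): a small gating network scores every expert for a given input, but only the top-k experts actually run. Total parameters can grow huge while the compute per token stays small.

```python
import math
import random

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 4, 2, 3

# Each "expert" here is just a random linear map; the gate is a
# random score vector per expert. Real models learn all of these.
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def moe_forward(x):
    scores = [dot(w, x) for w in gate_w]                      # gate logits
    top = sorted(range(NUM_EXPERTS), key=lambda i: -scores[i])[:TOP_K]
    exps = [math.exp(scores[i]) for i in top]
    weights = [e / sum(exps) for e in exps]                   # softmax over top-k
    out = [0.0] * DIM
    for w, i in zip(weights, top):                            # only top-k experts run
        y = [dot(row, x) for row in experts[i]]
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

out, chosen = moe_forward([1.0, 0.5, -0.2])
print(f"routed to experts {chosen}")  # only 2 of the 4 experts did any work
```

That sparsity is why MoE keeps showing up whenever someone wants more model for less compute.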
I started digging into more of DeepSeek’s papers, and they basically say everything that they do … going back a while. In hindsight, if you’d been following their papers (but why would anyone have been?) you would have seen that they have been going all-in on Mixture-of-Experts.
Alibaba Qwen-2.5-Max. Not to be upstaged by a rival Chinese company, Alibaba quickly pushed out a press release about its model and how it beats DeepSeek: https://qwenlm.github.io/blog/qwen2.5-max/ I spent a bit of time reading this and quickly realized that it is an unfair comparison. The world was shocked by DeepSeek R1, not as much by V3 (R1 is trained off of V3). Alibaba’s press release is deceptive, and they do not actually beat R1 (for now … I’m sure their model will improve). Here’s the figure they show, augmented with R1’s actual numbers.
Janus-Pro-7B came out shortly after the R1 model, and lots of people took interest in it as well. This is their computer vision model. It too is impressive, but it didn’t shock me any more than any other tech company’s latest model. It does two tasks (including the harder generation task, aka making images), but I was most reminded of LLaVA-v1.5, which came out of the University of Wisconsin–Madison. It’s interesting to see this and Emu-3 showing that auto-regressive models are winning in this modality (images, not just text) as well. Notably, you can train those models with academic-level resources – in other words, you do not need to be a wealthy tech company.
And since we are on the theme of Ancient Rome, my prediction for when the next iteration of Janus comes out is the Ides of March. Based on a lot of other DeepSeek papers, I’m guessing that this will be using a lot of Mixture-of-Experts. I wholeheartedly expect the financial markets to freak out again.
Written by

Academically Impertinent
I am a professor who has been working on Artificial Intelligence for almost two decades. These are my unfiltered, NSFW takes on some of the developments in the field. Things should still be a proper reflection, and scientifically rigorous-ish, just without all the egos and pomp that I see in most of Academia/Silicon Valley/Wall Street.