This post is about synthetic data and why it won't scale like we think. This field is the absolute frontier of science: it is about the process of learning itself, and consequently it is mired in epistemics and traps. The problem is that the search space of solutions is infinite, and, importantly, there is also an infinite number of deceptively good ideas. We need to develop strong tastes and high-level beliefs to tell us when our eyes (empirics) are wrong. I think the old statistical theorems are actually really good and explain deep learning very well, contrary to what many people will tell you.

# The "magic" of deep learning

![[inductive_bias_invert.png]]
*Inverted image from ["Deep Learning is Not So Mysterious or Different"](https://arxiv.org/abs/2503.02113v2), which you should read*

The paper "[Deep Learning is Not So Mysterious or Different](https://arxiv.org/abs/2503.02113v2)" explains why the classical statistical theorems still apply to deep learning, and how things like double descent or grokking aren't "new phenomena". I believe very strongly in the predictive power of the theory of inductive biases, a topic covered in the paper. The data processing inequality is also a good classical theory, and it can be understood through the lens of inductive biases to make sense of synthetic data generation.

An inductive bias is essentially a prior you place on the space of solutions to nudge your model toward one. So, for example, if you're predicting demand and you know your demand is cyclical, then a well-placed sine function in your model will nudge it to a much better solution than a ton of ReLUs would (I'll sketch this in code a bit further down). It's not guaranteed, but the general idea is that when you build a model, you are implicitly encoding your priors about the structure of the data. If your priors are bad, your model will be bad.

## Machine Epistemology

![[sin.excalidraw.png]]

However, unlike in the narrow domains of classical ML, LLMs are for pursuing *general* intelligence. Our "in domain" is literally all knowledge ever made. This means that any inductive bias we place is essentially equivalent to an epistemic claim about intelligence more broadly and the nature of complexity and understanding in our universe. That's a really big responsibility! But it's a responsibility that has been met by all the *time-tested* advancements in LLM technology.

- Training a high-depth transformer is equivalent to saying that all the complexity in our universe is *emergent* and thus a product of composing smaller, lower-order relationships and laws into larger ones (i.e. circuits in the upper layers are incentivized to compose circuits from lower layers)
- Training a model with RoPE is equivalent to saying that the relative positioning of a word or sentence is relevant to its semantic meaning (in most languages)
- Training a model with Adam or second-order optimizers is equivalent to saying the optimization landscape (given the other inductive biases/arch) is approximately convex
- Training a model with AdEMAMix or [[Multiscale Muon]] is equivalent to saying that the emergent nature of complexity in our universe necessitates accumulating knowledge over different time scales, to capture both the lower-order patterns required for composition and the higher-order patterns required for generalizing the lower-order ones
- Training a model with LSTM or true recurrent layers is equivalent to saying sequential composition is an important aspect of complexity in our universe (which is true, but parallel evolution is far more important as a fraction of knowledge; fight me, Mike)
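Here's the promised sketch of the cyclical-demand example. Everything in it is a placeholder assumption I picked for illustration (the weekly period, the random-ReLU-feature baseline, the plain least-squares fit); the point is just that the sine basis carries the "demand is cyclical" prior and extrapolates, while the prior-free features don't.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly-cyclical demand: the "real world" we want to model.
t_train = np.arange(0, 200)
t_test = np.arange(200, 400)  # extrapolation region
demand = lambda t: 50 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, len(t))
y_train, y_test = demand(t_train), demand(t_test)

def fit_predict(features):
    """Least-squares fit on train features, predict on the test range."""
    X_train, X_test = features(t_train), features(t_test)
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ w

# Good prior: we *know* demand is cyclical, so we hand the model a sine/cosine basis.
periodic = lambda t: np.column_stack([np.ones_like(t, float),
                                      np.sin(2 * np.pi * t / 7),
                                      np.cos(2 * np.pi * t / 7)])

# Weak prior: a pile of fixed random ReLU features with no notion of periodicity.
W = rng.normal(size=(1, 64))
b = rng.normal(size=64)
relu = lambda t: np.maximum(t[:, None] * W + b, 0.0)

for name, feats in [("sine basis", periodic), ("random ReLU features", relu)]:
    err = np.mean((fit_predict(feats) - y_test) ** 2)
    print(f"{name:>22}: extrapolation MSE = {err:.2f}")
```

Same data, same fitting procedure; the only difference is which prior about the world got baked into the features.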
![[complexity.excalidraw.png]]

The beauty of architectures and optimizers is that you are very heavily constrained in the specificity of the epistemic claims you can make. You can't say something wrong like "the optimal model is the one with the minimum description length", which is obviously nonsense. (only sorta kidding :P) I'll write a longer blogpost about this later, but this is an extremely important idea.

# Data processing inequality

The "[Data Processing Inequality](https://en.wikipedia.org/wiki/Data_processing_inequality)" is a classical result from information theory: post-processing data can't *increase* its information content. For a Markov chain $X \to Y \to Z$ (world, raw data, processed data), it says $I(X;Z) \le I(X;Y)$. This is fairly easy to derive independently: data generated in the real world tells us things about the real world; data generated synthetically tells us, at most, the same things about the real world, plus some things about the synthetic generation process, plus some extra things induced/implied by the generator. That's not necessarily a bad thing though; sometimes the inductive biases implied by the synthetic generation pipeline can be good. But those inductive biases need to be as general and true as the ones that govern architecture, since otherwise you will overfit your model on particular solutions or problem subsets.

When we train a model with a particular architecture, we're also still just telling the model about things from the real world plus whatever is implied by the architecture. It just so happens that those things tend to be really good things to tell!

## How not to make synthetic data

It's *really* easy to put bad priors in a synth pipeline, since you can specify all of them in natural language. For example, synthetic math generation can go bad very quickly, since synthetically generated/abstract math is [not heuristic searchable](https://en.wikipedia.org/wiki/No_free_lunch_theorem). You will bias your model off of realistic solution paths. For example, if you generate infinite math problems of every variation, it might approach the following problem:

$ x^3+2x^2+x = 0 $

by using the quadratic formula instead of simply factorizing it as $x(x+1)^2 = 0$. This is because it would have encountered an enormous quantity of quadratic root-finding problems, and as a fraction of general root-finding problems with real-numbered coefficients, factorizable problems are extremely uncommon. One might say "well, that's the more generalizing solution!". It's also computationally inefficient on the set of problems it will actually encounter in the real world as a chatbot, and at the end of the day that's what intelligence is fundamentally about: leverage. Also, without a calculator, there's no way the model will actually generalize the quadratic formula for larger numbers, and it'll start hallucinating. This might not occur currently, if the model settles in on the correct thought pattern during RL quickly enough and carries over biases from pretraining, but failure modes like this are very common in models trained on synthetic data; they're just not as easy to communicate.

## How to make good synthetic data

Synth data is not a terrible way to fill in the gaps of understanding, for example generating multiplication tables or addition tables to ensure the model actually generalizes its addition circuits and so on.
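Here's a minimal sketch of what that kind of gap-filling generator can look like. The format, operand ranges, and sampling scheme are placeholder assumptions for illustration; the only prior the generator actually encodes is that addition and multiplication follow deterministic, digit-level rules.

```python
import random

random.seed(0)

def arithmetic_examples(n_examples: int, max_digits: int = 4):
    """Yield plain-text arithmetic samples, stratified by operand length so the
    model sees every digit-length combination, not just whatever a crawl happens
    to contain."""
    for _ in range(n_examples):
        da = random.randint(1, max_digits)
        db = random.randint(1, max_digits)
        a = random.randint(0, 10**da - 1)
        b = random.randint(0, 10**db - 1)
        op = random.choice(["+", "*"])
        result = a + b if op == "+" else a * b
        yield f"{a} {op} {b} = {result}"

for line in arithmetic_examples(5):
    print(line)
```

Because the domain is tiny and fully specified, there's almost no room to smuggle in a bad natural-language prior.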
There are also other categories, for example stringing together lots of individual (natural) documents based on their precedence and relevance to teach the model long-horizon synthesis. You're providing information and biases to the model that it wouldn't have access to otherwise (i.e. it wouldn't know how to do long-context synthesis unless you trained it to do so).

### Rephrasing

Rephrasing is a decent option as well. It's equivalent to stating that the distribution of styles on the internet is not particularly relevant, and that "flattening" it or transferring it is fine if not desirable. You have to be careful to retain all the shitty formatting and such, because dealing with unformatted text is, in fact, in-distribution!

### The in-distribution

Occasionally, you may not have the data for a task, i.e. there is something you know is in-distribution that you *don't* have data for. In that case you can build a synth data pipeline out of all the low-order aspects of your missing data distribution. Essentially, you are distilling the biases you as a human have learned for that specific subset of tasks into the model. This is spiritually very similar to the cyclical demand example from earlier: the domain is small, and thus the pressure to make extremely general biases is also small. Another example is the one I mentioned above: generating multiplication and addition tables. In that situation, you as a human know that addition and multiplication have specific properties, ones you want to guarantee the model figures out. Therefore, you can generate a bunch of synthetic examples, and the synthetic generator encodes the bias (that addition and multiplication have deterministic, digitally encodable rules).

### Careful

You have to be extremely careful about this though, because that line of reasoning can be used to justify virtually anything if you're not operating in good faith. If the researchers at your organization are careerists or like to goodhart, you will never have a good SDG pipeline. I'd argue that adding math synth data is not a good idea and that you should just give your model a calculator instead, especially for larger numbers. You don't want to instill false confidence in the model if it won't ever really generalize on the task.

### Reasoning Data

Even reasoning data is a little fishy imo. It's not terrible, considering that most of the time the alternative is to SFT the model on reasoning data prior to RL anyway, and transferring some learned circuits from one model to another isn't the worst thing. It's actually very difficult for a transformer to learn sequential thinking from scratch, since human CoTs aren't actually anything like what would be helpful to a transformer. On this task, unfortunately, the best inductive bias is just data generated by another transformer. You shouldn't need billions of tokens of this data though, and it's not a way of generating free training data.
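As a rough illustration of the scale I mean (and not a recipe from any actual lab), here's a sketch of a reasoning-distillation set that stays small and verified rather than becoming a firehose. `teacher_generate` and `verify` are placeholders for whatever teacher model and answer checker you'd actually plug in.

```python
import json
import random

def teacher_generate(prompt: str) -> str:
    """Placeholder: call your teacher model and return its full CoT + final answer."""
    raise NotImplementedError

def verify(prompt: str, trace: str) -> bool:
    """Placeholder: check the trace's final answer against a ground-truth checker."""
    raise NotImplementedError

def build_sft_set(prompts: list, target_size: int = 10_000,
                  out_path: str = "reasoning_sft.jsonl") -> int:
    """Sample prompts, keep only teacher traces that verify, and cap the total size."""
    random.shuffle(prompts)
    kept = 0
    with open(out_path, "w") as f:
        for prompt in prompts:
            if kept >= target_size:  # small and curated, not "free" pretraining data
                break
            trace = teacher_generate(prompt)
            if verify(prompt, trace):  # only keep traces that actually check out
                f.write(json.dumps({"prompt": prompt, "completion": trace}) + "\n")
                kept += 1
    return kept
```

The cap and the verification filter are doing the real work here: you're transferring a thinking pattern, not manufacturing pretraining tokens.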
### Bad Synth Data

I'll give some more explicit examples of things you shouldn't do for synth data:
- translate your entire corpus into various languages; your model will just learn whatever translator you used, with all its quirks and flaws
- try to use synthetically generated non-heuristic-searchable tasks, like synthetic abstract math problems
- do any meaningful analysis of text with an LLM, especially using a model with a low SimpleBench score, as the model isn't guaranteed to actually be reading the text, and its flaws will transfer over
- generate trillions of tokens of reasoning data and expect the model to learn anything other than just copying the previous model's flaws
- generate long-context data with poor long-context inductive biases like context summaries, windowed generation, etc.

# Empirical evidence

There are a lot of rumors around synthetic data from major labs, which I'll cover, but we'll look at open source first because it's the most direct evidence. Qwen is by far the most synthetic-data-obsessed lab. They've been doing it for a while and have likely poured far more resources and research into the subject than any other lab. So what are the results of these efforts?

I notice that, very often on real-world, high-utility benchmarks, especially those made by community members (for which there is insufficient time to goodhart), Qwen models reliably score substantially worse than their other scores would suggest. To best show this off, I'll show the scores of a few similar-release-date models on a popular benchmark where they score similarly, and then on a slightly less popular but very closely related benchmark.

| Benchmark | Qwen3 (synth) | GPT OSS<br>120B (synth?) | Kimi K2 (~natural) | GLM 4.6 (~natural) |
| ------------------------------------- | ------------------------ | ------------------------ | ------------------ | ------------------ |
| **Code / SWE** | *synth noise removal(!)* | *supposedly 100% synth* | *synth rephrasing* | *synth reasoning* |
| SWE-bench Verified | 69.6%* | 62.4% | 71.3%* | 68.0% |
| Terminal Bench | 37.5%* | - | 44.5%* | 40.5% (!) |
| Terminal Bench 2.0 (Terminus 2 harness) | 23.9%* | 18.7% | 27.8%* | 24.5% (!) |
| **Language Modeling** | | | | |
| SimpleBench | 31.0%^ | 22% | 39.6%^ | 47.7%\` |

\*Qwen 3 Coder 480B A35B / K2 Instruct-0905
^Qwen3 235B A22B / K2 Thinking (semi synth)
\`GLM 4.7

This implies poor generalization, exactly what the inductive bias theory predicts. The GPT OSS score is the best example imo.

## Rumors

**Frontier Lab Synth Data Troubles**

![[rumors.png]]

from [the meta drama woman on twitter](https://x.com/suchenzang/status/1991771756676399497)

**Ilya's shift in tone**

You'll notice that while Ilya used to talk a lot about scaling and the "fossil fuel of machine learning" during his NeurIPS talk, his statements since have shown a clear shift away from data being the bottleneck.