Monday, May 18, 2026 · 9:41 AM
ok dumb question: is this just “make AI writing less LinkedIn-core”?
basically yes
but the sneaky interesting part is they don’t define “better writing” as “a judge liked it more”
they define it as: does the model’s writing look like it came from the same distribution as real human writing?
distribution = vibes with math?
lol pretty much
imagine you’re trying to spot a fake restaurant menu
one dish might look plausible. but across 1,000 dishes, the fake menu keeps using the same sauce, same adjectives, same garnish, same “chef’s kiss” move
ah so one answer can be fine, but bulk output gives away the model
exactly. AI writing often fails at the population level
too many repeated structures, overly tidy arguments, favorite phrases, same pacing, same generic detail level
the report’s claim is: SFT alone doesn’t fix that
wait what. isn’t supervised fine-tuning literally “show it good human writing until it copies the pattern”?
😮 that’s the counterintuitive bit
SFT teaches “for this prompt, produce this kind of answer”
but human-likeness is partly a group property: the spread, variety, weirdness, and frequency of patterns across many answers
so SFT is like training a student on answer keys, but not teaching them the whole class’s range of handwriting?
yes. or like teaching cooking from one perfect plate per recipe
they learn “correct dish,” but maybe not the normal human range: messy plating, substitutions, mild chaos, different spice instincts
how do they measure this without just saying “this sounds sloppy”?
three main lenses from the summary
n-gram token L2 distance: are words and short phrases overused or underused? this catches stuff like em-dash addiction
MMD: compares embedding distributions, so it can see if the overall content/style cloud is in the wrong place
JMQ: a judge model compares model output against human reference output
so they’re not only asking “is this good?” — they’re asking “does this look statistically close to the human reference set?”
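rough sketch of the n-gram lens, if it helps. big assumptions on my side: whitespace tokens, bigrams by default, and helper names (ngram_freqs, ngram_l2) that i made up. the report's actual tokenizer and n may well differ

```python
from collections import Counter
import math


def ngram_freqs(texts, n=2):
    """Relative frequency of each n-gram across a whole corpus.

    Assumption: lowercase whitespace tokens stand in for real model tokens.
    """
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def ngram_l2(human_texts, model_texts, n=2):
    """L2 distance between the two corpora's n-gram frequency vectors."""
    human = ngram_freqs(human_texts, n)
    model = ngram_freqs(model_texts, n)
    # n-grams missing from one corpus count as frequency 0 there
    grams = set(human) | set(model)
    return math.sqrt(sum((human.get(g, 0.0) - model.get(g, 0.0)) ** 2 for g in grams))
```

the point is that it scores the whole pile of outputs, not any single answer: identical corpora give 0, and a phrase the model overuses pushes the number up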
MMD sounds like a crypto token
tragically, no yacht
think of embeddings as dropping every essay onto a map
human essays form one weather system. model essays form another. MMD asks how far apart those weather systems are
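if you want the weather-system picture as code, here's a minimal sketch. assumptions: essays are already embedded as plain lists of floats, the kernel is an RBF with a gamma i picked arbitrarily, and this is the simple biased estimator of squared MMD. i don't know which kernel or estimator the report uses

```python
import math


def rbf(a, b, gamma=1.0):
    """RBF kernel between two embedding vectors (gamma is an assumed setting)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over every pair drawn from xs and ys."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd2(human, model, gamma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    return (
        mean_kernel(human, human, gamma)
        + mean_kernel(model, model, gamma)
        - 2.0 * mean_kernel(human, model, gamma)
    )
```

two clouds sitting on top of each other score about 0, and the score grows as the model cloud drifts away from the human one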
and the answer was “SFT weather is still weird”?
yep. according to the report, SFT outputs still differ strongly from human references
and just tweaking temperature doesn’t cleanly solve it
why not? higher temp = more variety, right?
more variety, yes. better match, not automatically
the best sampling temperature differs from metric to metric
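quick sketch of what the temperature knob does mechanically. this is just standard temperature-scaled softmax, nothing from the report, and the function names are mine

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn next-token logits into probabilities after dividing by temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: higher means more spread-out sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

raising the temperature flattens every next-token distribution by the same rule, so entropy goes up but the ranking of tokens stays put. it adds spread everywhere at once and can't target the specific patterns that are off, which fits with it not being a clean fix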
so one temperature might help phrase diversity while another helps judge preference or embedding match
annoying. no magic knob
exactly. the paper frames distribution fine-tuning, DFT, as optimizing closer to the thing they actually care about: matching the output distribution
not just “imitate this one target answer”
so DFT is like training the whole restaurant to match the archive, not just copying individual dishes?
that’s the idea
the report says DFT beat a strong “super baseline” that got to cherry-pick the best SFT hyperparameter / sampling result per metric
that sounds like a pretty stacked baseline
yeah, which makes the claim more interesting
they also report a 4B DFT model beating a 14B SFT super baseline on MMD and an 8B SFT super baseline on JMQ
🤯 smaller model beating bigger model by training objective, not vibes
yup. if the metric is distribution match, a smaller model trained for that can beat a larger model trained in the usual way
at least on the reported setup
where does the “AI slop” stuff fit in?
slop is the human-facing symptom
the model overuses certain rhythms: “not X but Y,” tidy three-part structures, generic caveats, polished-but-empty paragraphs
distribution metrics are trying to catch the statistical fingerprint underneath that cringe
so they’re making anti-slop measurable
that’s the cleanest takeaway
instead of “make it sound human,” ask: across lots of outputs, which tokens, structures, embeddings, and judge comparisons are out of line with humans?
product-wise, any spicy details?
a couple
their demo asks for structured inputs — prompt, outline, writing style, use case — so users state intent instead of just “write me a thing”
and they add copy-paste friction by injecting random fruits/animals into copied text
lol anti-spam banana watermark
basically. plus no public API, apparently to reduce automated spam use
so the technical goal and product constraints are aligned: better writing, but not “industrialized fake human internet at scale”
what should i actually remember from this?
three things
1. SFT can imitate examples without matching the human distribution
2. sampler tweaks help some metrics and hurt others, so temp isn’t a universal slop dial
3. if you care about human-like writing at scale, measure distribution mismatch directly: token patterns, embedding clouds, and judge comparisons
so “SFT is not all you need” = don’t just teach the model the answers, teach it the population statistics
exactly. less “copy this essay,” more “match the ecosystem these essays came from”
ok go forth and side-eye every em dash on the internet
too late, cursed forever
Read Mon, May 18 · 10:02 AM