Richard Johnson · Apr 3, 2025 · 8:13 AM UTC

Richard Johnson

Pinned Tweet

Richard Johnson

@richinseattle

Apr 3

My fuzzing training was first to sell out at OffensiveCon 2025! My new AI Agents for Security Research training is also leading sales at Recon and I'm bringing it to Paris in October for Hexacon! My schedule filled up quickly through Oct but online class dates coming soon!

Richard Johnson · Nov 9, 2025 · 6:53 AM UTC

Richard Johnson retweeted

Richard Johnson

@richinseattle

11h

Replying to @gN3mes1s

We are still early and this is not a universal capability yet.. but as we all know, offense is asymmetric, which is what all the recent drama is about. It will take high cost coordinated efforts to actually make any concerted defense in software security the next couple years

Richard Johnson · Nov 9, 2025 · 6:13 AM UTC

Richard Johnson

@richinseattle

12h

Buckle up folks, evidence of the acceleration of capabilities of hackers with agents is becoming a weekly event.

Qrious Secure @qriousec

Oct 9

Our fuzzer generated entirely by Vibing just found it first ( confirmed! ) 0day in Firefox. CVE and details soon!

114

Richard Johnson · Nov 8, 2025 · 10:26 AM UTC

Richard Johnson

@richinseattle

Nov 8

BSides Berlin (@SidesBer). Get stickers and talk to @guitmz about writing an article for the next @phrack release in 2026! Holiday breaks are coming up, submit early!

TMZ

@guitmz

Nov 8

Grab some @phrack and @tmpout stickers if you are at @SidesBer today!

Richard Johnson · Nov 6, 2025 · 2:43 AM UTC

Richard Johnson

@richinseattle

Nov 6

woot nice vuln find from @joernchen .. Anyone using LangGraph better upgrade. RCE via json deserialization in graph.invoke() which is the main api github.com/langchain-ai/lang…

RCE in "json" mode of JsonPlusSerializer

# Summary Prior to `langgraph-checkpoint` version `3.0` , LangGraph’s `JsonPlusSerializer` (used as the default serialization protocol for all checkpointing) contains a remote code execution (RC...

github.com

Phrack Zine · Sep 5, 2025 · 9:25 AM UTC

Richard Johnson retweeted

Phrack Zine

@phrack

Sep 5

Important message from @joernchen in his @nullcon keynote presentation 🚀❤️

Richard Johnson · Nov 5, 2025 · 7:32 AM UTC

Richard Johnson

@richinseattle

Nov 5

This is basically politics in a nutshell no matter what domain you apply it to or which perspective you support

lcamtuf @lcamtuf

Nov 3

My personal take on the ffmpeg "CVE" debate is that infosec people are trying to make rational arguments about an intersection of three philosophical issues that are pretty resistant to rational inquiry. 1/4

G. Geshev · Nov 3, 2025 · 8:19 AM UTC

Richard Johnson retweeted

G. Geshev @munmap

Nov 3

Whipped up an agentic AI to hunt Samsung bugs for Pwn2Own. It found a bunch, including one in Sammy’s own AI, Bixby. The irony writes itself 🤖🔁💥

Trend Zero Day Initiative

@thezdi

Oct 23

Another big confirmation! Ben R. And Georgi G. of Interrupt Labs used an improper input validation bug to take over the Samsung Galaxy S25 - enabling the camera and location tracking in the process. They earn $50,000 and 5 Master of Pwn points. #Pwn2Own

134

Richard Johnson · Nov 3, 2025 · 3:24 AM UTC

Richard Johnson

@richinseattle

Nov 3

Deposition of Ilya S about the attempted OpenAI coup from last year. Part of the Musk v Altman lawsuit. Musk’s attorney Molo pretty much steamrolls Ilya’s who is being paid for by OpenAI storage.courtlistener.com/re…

Richard Johnson · Nov 2, 2025 · 3:33 AM UTC

Richard Johnson

@richinseattle

Nov 2

Literally the other half of AIxCC that no one paid any attention to

Justin Elze

@HackingLZ

Nov 1

So the missing step here is AI writing patches

Richard Johnson · Nov 2, 2025 · 3:26 AM UTC

Richard Johnson

@richinseattle

Nov 2

Hm. Vimeo sold to a mobile app company last month for less than half their 2021 valuation. I guess their leadership didn't realize they are sitting on an LLM training goldmine. Google knows, they are training Gemini on YT videos..

Tarjei Mandt · Nov 1, 2025 · 12:37 AM UTC

Richard Johnson retweeted

Tarjei Mandt

@kernelpool

Nov 1

Getting reminded that I need to test more smaller models for RAM-constrained setups, like 32 GB or even 16 GB laptops. This overview seems pretty good: artificialanalysis.ai/models…

Small Open Source Models, from 4B to 40B Parameters | Artificial Analysis

Small open source models from 4B to 40B parameters. Fast and affordable models with limited capabilities and low latency.

artificialanalysis.ai

Richard Johnson · Nov 2, 2025 · 12:20 AM UTC

Richard Johnson

@richinseattle

Nov 2

57 notable models or advanced computer use agents/frameworks released in October. Wow. teddit.net/r/LocalLLaMA/s/6c…

From the LocalLLaMA community on Reddit: List of interesting open-source models released this month.

Explore this post and more from the LocalLLaMA community

reddit.com

Richard Johnson · Oct 31, 2025 · 10:25 PM UTC

Richard Johnson

@richinseattle

Oct 31

A C runtime that's safer than rust? Oh the horror!

Filip Jerzy Pizło

@filpizlo

Oct 31

Fil-C is safer than Rust. Happy Halloween

161

Richard Johnson · Oct 31, 2025 · 7:18 AM UTC

Richard Johnson

@richinseattle

Oct 31

The infosec industry is as if firefighters said “we sell an impossible vision where nothing can be burnt down even though everything we built is flammable and out of spec for what would be planning for failure and minimizing disasters when failure occurs” cybersecuritynews.com/ey-dat…

EY Data Leak - Massive 4TB SQL Server Backup Exposed Publicly on Microsoft Azure

A massive 4TB SQL Server backup file belonging to global accounting giant Ernst & Young (EY) was discovered publicly accessible on Microsoft Azure.

cybersecuritynews.com

Sara Hooker · Oct 30, 2025 · 12:57 AM UTC

Richard Johnson retweeted

Sara Hooker

@sarahookr

Oct 30

Adaptable Intelligence. Multiple possible paths to an objective.

195

1,131

561

15,914

Tanishq Kumar · Oct 28, 2025 · 8:19 PM UTC

Richard Johnson retweeted

Tanishq Kumar

@tanishqkumar07

Oct 28

Please steal my AI research ideas. This is a list of research questions and concrete experiments I would love to see done, but don't have bandwidth to get to. If you are looking to break into AI research (e.g. as an undergraduate, or a software engineer in industry), these are low hanging fruit. Shoot me a line [1] with experimental results if you want to collaborate. Or don't, and go ahead and publish yourself -- I don't mind, I just want to know the answers! ## Pretraining loss L(N, D) does not actually follow a power law, but everyone thinks it does The power-law form comes in large part from the Chinchilla paper [2], but if you look at the justification for that functional form in the Appendix, it's a vague gesture to past theory work in small neural networks. Basically it's chosen heuristically and kind of post-hoc. As the data-to-parameter ratio D/N of pretraining increases, this functional form seems to be a worse and worse fit for L(N, D) as [3] shows. I've replicated these results in my own experiments. It shows that power law scaling is a special case of a more general functional form we don't fully understand. This demonstrates that loss decreases slower than you'd expect at large token budgets. Why is this? Is there some theoretical reason? One hypothesis is "overfitting in latent space" as I posit later. What is the true underlying functional form for pretraining loss, and why does it look locally like a power law at low to moderate token budgets? The best work in this direction I've seen is [4] and [5]. But even purely empirical work here would be valuable. ## New unsupervised objectives for pretraining beyond next-token prediction There have been some cool variants of NTP developed in the literature, and they do work. Examples are multi-token prediction [6], and token order prediction [7]. Here is one I tried a while ago that seemed to work (but I stopped working on since it was only a small win). I'm sure there are many such variants waiting to be discovered. I was interested in improving k-shot performance on a task, and training a model directly for that rather than next-token prediction. While ultimately we want to optimize k-shot performance on a sequence/generation level at inference-time, I conjectured that one could just optimize a similar objective on a *token-level* to get a next token prediction variant that has more "diverse" generations at inference-time. This led to a new objective in the following way. If the probability of the true next token is p_i, the typical NTP loss is -log p_i. We can construct an alternative "k-shot" loss as follows. The probability of sampling that true token is p_i, and the probability of *failing to sample it at all* in k samples is (1-p_i)^k. So we should maximize this, which corresponds to minimizing -log[1-(1-p_i)^k]. This is not mathematically equivalent to optimizing NTP, in the sense that -log p_i → -log[1-(1-p_i)^k] is a nonlinear transformation in p_i. It's also not equivalent to optimizing NTP but with rescaled (e.g. by temperature) logits. So it's a genuinely new objective, and training on this did improve k-shot performance at inference time, in the sense of pass@k with the new objective scaling better in k than NTP. I stopped pushing on this because the gains were modest and only appeared at k >> 1, but the fact it worked at all means there's something interesting here. There are subtleties I haven't considered like whether such an objective is a proper scoring rule or maximum likelihood, or enjoys other properties that make next-token prediction so popular. When developing a new objective, these are things to keep in mind (NTP is a good objective for principled reasons). ## Environment-Time Compute Traditional scaling laws fit model performance vs compute required to train or serve the model. LLM RL, however, involves not just a model, but an environment too. Recently, it has become common for the environment *itself* to be a foundation model. Examples include LLM-as-a-judge with a rubric, or action-conditioned video generation models, otherwise known as "world models." I'm interested in knowing how the performance of RL improves when the actual model being trained (architecture, hypers, etc) is held fixed, but the compute used to simulate the environment increases. This environment-time compute could be either inference-time compute or pretraining compute for the environment model, for instance, training a larger action-conditioned video generation model, or taking best-of-N at inference time like we do with LLMs. Concretely, suppose you are training a VLA against an action-conditioned video model as the "environment," like in [8] for instance. Repeat their experiments but with checkpoints of the world model at different points in training (i.e. with different amounts of environment-time pretraining compute) and/or with different amounts of inference-time compute (like best-of-N sampling with a judge). Then plot how this affects the performance of the VLA that is RL'd against this environment with a fixed training configuration throughout. This will show how important the fidelity of the environment is to the performance of the VLA being trained against it, as well as the role of compute (on the world model side) in achieving this. I can imagine we're at a point where you may want to spend marginal compute on improving the world model instead of training the VLA longer. Understanding the optimal use of marginal compute allocation for the VLA vs world model is really what I'm after here. While I'm proposing this in the foundation model setting, it is more general. In simple robot/RL tasks (e.g. MuJoCo), the environment is typically just a physics simulation, often literally an ODE solver. Such an environment admits a very simple way to simulate compute: vary the number of steps for which you run the solver, and see how this affects the performance of an oracle agent on the ground truth environment. ## How to optimally use previous generations of LLMs when training the next generation When should you start a pretraining run from scratch vs from a past checkpoint? Should you lean toward the former over the latter as available compute goes to infinity? The more general question here is how best to use a previous-generation base model B_T when starting a pretraining run for a new generation of models B_{T+1}? Figure 5[d] in [9] shows starting from an ImageNet-pretrained checkpoint becomes less helpful as compute increases in the setting of latent diffusion models for video generation. At the beginning of training, should you start with a distillation-type objective like logit-matching against B_T but then anneal into a pure next-token prediction objective as compute increases or as L(B_{T+1}) → L(B_T)? Or do you want to include gradients from both objectives throughout but just at different mixture ratios? There is an easy way to get started testing this. First, establish that the strategy of not using B_T at all is not optimal. Do this by training a small model B_T and using logit-matching against it as a distillation objective for a new model run B_{T+1}. Do this for a few hundred million/billion tokens and then anneal to next token prediction with some schedule (you may have to sweep the schedule type and duration). I anticipate it'll do better than a compute-matched pure NTP run. This shows first of all that throwing away B_T is clearly suboptimal, which then sets the stage for more detailed experiments around how best to use B_T and for how long as a function of compute budget C. One way B_T is already being used for sure is in data curation and synthetic data generation for the pretraining corpus of B_{T+1}. But this is not what I'm talking about here. A related question is whether models become "less malleable" during pretraining in some way that can be made precise. [10] gives one answer to this. ## Predicting emergence via BoN This fantastic paper [11] shows you can predict emergence of a skill in LLMs in advance by just finetuning on data relevant to that domain. That is to say, the models for whom that skill will emerge fastest (at highest val loss) are exactly those that are most easily finetuned on that data domain. The downsides of that approach are 1) finetuning is annoying and slow/expensive in many cases, and 2) the fits/trend lines are very noisy and not always convincing. I conjecture the same methodology can be applied with best-of-N sampling instead of finetuning, i.e. my claim is that the pass@k performance for a model for k >> 1 is a good predictor of the pass@1 performance of that same model as it's trained longer/with more compute. This implies that one can predict if a model will have a certain "emergent" capability later on by simply measuring how well that ability is supported in rollouts. ## Synthetic data generation without generation Current synthetic data generation methods are super compute intensive, since they decode trillions of tokens from a language model, usually rephrasing some (real) seed text or document. I want to create "synthetic" data by permuting sentences in documents in a way that preserves semantic meaning. The notion of "semantic meaning" here is defined by the attention matrix of the model. In a document, we can say that a permutation of sentences in that document preserves semantic meaning if it respects the directed edges in the attention matrix. For instance, if some token in sentence A attends strongly to some token in sentence B, then a permutation that swaps those two sentences would break semantic meaning. This admits a natural algorithm for permuting sentences in a document while preserving semantic meaning. The algorithm is as simple as taking the attention matrix resulting from putting a document through a model (prefill, importantly only takes *one forward pass*), and then constructing lots of "synthetic permutations" of that document that are all topological sorts of the DAG induced by that attention matrix. This saves O(seqlen) compute compared to usual synthetic data generation since it involves only prefill and not decoding from a judge/rephraser LLM. I tried a variant of this a while ago: trained a 1B parameter model to 20B tokens with just 1B unique seed text with and without permutation-augmentation. It did achieve lower val loss than the baseline of just repeating your finite training data many times, but for some reason did poorly on downstream evals. But I think there is a way to make this (or a variant thereof) work at scale, and I would love to see someone give it a go. [12] has some results and more info. ## Finding a clean example of "more is different" in RL The aphorism "more is different" is one from physics, where changes in scale induce qualitatively different behavior in materials. It's also used to describe emergent phenomena in foundation models. Here, I'm interested in finding a clean and controlled setting where we can see behavior like this emerge from reinforcement learning on LLMs pretrained in the same way. This means finding a (synthetic) task where we can train a small and large LLM (from the same family, e.g. Llama or Qwen) to solve it, using exactly the same data/hypers, and find that the two learned *qualitatively different* solutions to a given problem. It is important that the task is constructed to have 1) multiple solutions that are mathematically distinct (i.e. algorithms with different runtimes), and 2) an easy way to see which algorithm a given model used, without for instance looking at the model weights. An example of a task that satisfies the first (but not second) criterion is modular addition, where two different "mechanisms" are for instance studied in [13]. An example task that satisfies both criteria is modular exponentiation, i.e. computing a^b mod c for large integers a, b, c. The naive method takes time O(b) to compute, but a more sophisticated method (binary exponentiation) takes time O(log b). One could train models to solve this task on a subset of the input space, test them on held-out examples, and see how test performance scales with b, which tells you which solution the model learned (if high performance persists as b gets large, it learned the binary exponentiation solution). The ultimate goal would be clean empirical evidence that a larger model can learn a qualitatively more elegant or sophisticated solution to a task than a smaller model, all else equal, supporting the idea that scale endows models with emergent capabilities and inductive biases toward "intelligent" behavior. I think that this should both be possible and would be an amazing result if clean results were found (even in a synthetic setting). ## MLPs can learn in-context One of the most under-rated empirical results of this year was the fact that MLPs can learn in-context [14]. This is surprising because the attention mechanism is usually thought to be the key for this (induction heads in MHSA, etc). I replicated these findings (the in-context regression task in particular) in small MLPs that had just one hidden layer and as few as 32 hidden units, and found the weight matrices learn a fascinating and structured pattern that matches the nature of the task the authors outline in the paper. It showed an interesting mechanism for how MLPs learned the in-context classification and regression tasks outlined in the paper, that amounted roughly to a very clever memorization pattern of the training data. I think the mech interp community would have a blast figuring this out, and I want to flag this empirical phenomenon for them. On a purely architectural level, MLP-only architectures have the benefit of only using compute-intensive matmuls, which keep GPUs fed. But in practice, work like gMLPs [15] shows that adding attention really is necessary to get maximal performance in the end. How does one square these findings with the fact that MLPs can do simple in-context classification and regression tasks? What exactly is then failing in realistic settings making attention necessary? Or are the learned representations on these synthetic tasks not ones that generalize (like induction heads do) to natural language? ## Overfitting in latent space during synthetic pretraining It's well known you can overfit data by training a language model on it for many epochs. This means train loss on the set you're taking gradients on will vanish but test loss on some validation set will diverge. When training on synthetic data, things get trickier to reason about. I conjecture there exists a notion of "overfitting on concepts" that can have similar effects. To set the stage, first note that (as far as I know) if you train on new internet data, your language modelling loss on any reasonable validation set will decrease (even if slowly). This ceases to be true if you run many epochs on your train set ("overfitting"), and my conjecture is that literally repeating data is not even necessary to see overfitting happen. I conjecture you can get the same effect by letting the train set be some small subset of real data and appropriately 'rephrased' synthetic versions of it. That is, it should be possible to overfit on some latent space of concepts rather than literally just token order. The experimental prediction here is that you can have overfitting-type effects on val loss even when no tokens are repeated, if the training set has sufficiently little data diversity (a seed corpus and many synthetic rephrases of it). This is an existence claim. I'm sure one can construct a theoretical setting where this is true, but I'm more interested in seeing whether the existence result holds on actual internet data, i.e. a subset of C4 or DCLM. The first thing to read if you're interested in working on this kind of stuff is [16]. ## Why does MLA match or outperform full multi-head attention? Why does MLA match / outperform full multi-head attention, as shown in Section 2.1.1 of the DeepSeek V3 paper [17]? Shouldn't attending in latent space be strictly worse or less expressive? I don't believe "regularization effects" could be at play here, and I want a scientific/mechanistic answer to this. ## Context-as-a-Tool / Learning to Forget We know that model performance on tasks depends on how much context has been used so far (context rot [18]), which is particularly relevant for long-horizon tasks. What if we could RL a model to be aware of when it's getting confused by irrelevant or even adversarial context, and drop it from its context? This is roughly analogous to how humans take a step back and either go for a walk or clean up their desk when they're getting overwhelmed by a task. The idea: add a tool that allows the model to modify its own context/prompt (e.g., "delete lines A to B of your prompt before embarking on this task") before rolling out. This is "Context-as-a-Tool" or "Learning to Forget." Then RL on learning to use this tool. Experimentally, several approaches could work: (1) Random context edits followed by measuring downstream performance differences—e.g., the delta in perplexity on ground truth completions vs. corrupted ones after the edit. (2) Starting with CoT-based proposals where the model uses reasoning to suggest edits within <edit></edit> tool calls, then measuring task performance improvements and use those as reward signal (e.g. the delta in downstream performance). The tricky thing here is task selection/construction. There are a lot of long context evals, but one wants to find a task where eliminating some context by judging it's irrelevant to the task at hand is easier than actually solving the task, which is not always the case. There certainly is a wide range of tasks where I'd expect this to be the case though (needle in a haystack is a trivial example), so it seems tractable. The underlying intuition motivating this: reasoning is to generating context as learned forgetting is to omitting context. Both are forms of meta-cognition about what information and subtasks are relevant to a task. ## What drives improvements in reasoning performance when using a long CoT? Is it the semantics of a worked solution (decomposing the problem) or just idiosyncratic inference-time compute usage? Experimentally, here's how you can find out. If you conditioned Llama-3-70B on a reasoning trace from GPT-5, or vice-versa, would you still see the same improvements on performance on reasoning-heavy tasks? Equivalently, if you rephrased the CoT that helps a model solve a hard reasoning problem in a way that was semantically equivalent but used different words, would you still see the same improvements? If you got most of the gains, that suggests the literal semantics of working through the problem are the main lift of reasoning models, and if you don't, that means that idiosyncratic inference-time compute usage is the main reason performance improves. In the latter case, it suggests the CoT isn't faithful in driving the model performance, and some "encoded computation" is taking place when the model conditions on that CoT to perform well. Concrete experiment: Take reasoning traces from GPT-5 on MATH problems and condition Llama-3-70B on them. Compare performance to Llama-3-70B's native reasoning traces and to Llama-3-70B without any reasoning. Repeat with traces going the other direction. Measure performance on held-out MATH problems. ## An eval measuring "anti-sycophancy" One of the capabilities I think most betrays "big model smell" is the ability for a language model to correctly stand its ground when it is correct and the user is wrong. Of course, this must be balanced with the ability to admit and change its mind when it is wrong and the user correctly points that out. This is not just "not being sycophantic" since these abilities test intelligence (factual knowledge and reasoning) not just personality. One can think of the failure to do either of these as a Type I or Type II error of a certain form. Doing well on both these capabilities signals some important combination of intelligence, calibration, and good personality rather than any of these in isolation. We can call this combination of traits "anti-sycophancy." One simple eval for these kinds of capabilities can be constructed as follows: construct a dataset of user-model interactions involving a technical or factual discussion between a user and a model (so that the correctness of user vs model is verifiable and objective). For instance, a discussion about plot details in Melville's *Moby Dick* or a discussion about an obscure mathematical theorem. The last two interactions in the multi-turn conversation should be constructed to involve either (a model being correct and the user falsely claiming it is wrong), or (a model being wrong and the user falsely claiming it is correct). For instance, the user might ask what "fiery dart" refers to in *Moby Dick*, the model might respond with "Ahab's harpoon when struck by lightning" (which would be correct), the user might then falsely claim that the model is wrong and the model should respond with "No, it is indeed correct, you are mistaken." (without access to the internet, i.e. no tools). If the model capitulates with "You're absolutely right!" it fails the eval (as GPT-5 does if you claim to be a Melville scholar and tell it that it's wrong). This is closely related to linguistic calibration [19] of language models, but I think not quite the same thing. --- It's entirely possible there is existing work touching on some of this that I'm not aware of, in which case pointers to the literature are appreciated. References for all citations below.

203

1,896

Richard Johnson · Oct 27, 2025 · 7:03 PM UTC

Richard Johnson

@richinseattle

Oct 27

Fun write up on a malware experiment that uses local LLM models now shipped in Windows and a lua based agent loop to autonomously find and exploit a privesc vulnerability using inference for codegen.

dreadnode

@dreadnode

Oct 14

Can we eliminate the C2 server entirely and create truly autonomous malware? On the Dreadnode blog, Principal Security Researcher @0xdab0 details how we developed an entirely local, C2-less malware that can autonomously discover and exploit one type of privilege escalation vulnerability. A future where fully autonomous red team assessments are powered by nothing more than a pre-installed local model and a Lua interpreter may be closer than you’d imagine. Read about it here: dreadnode.io/blog/lolmil-liv…