Creator @datasetteproj, co-creator Django. PSF board. Hangs out with @natbat. He/Him. Mastodon: fedi.simonwillison.net/@simo… Bsky: simonwillison.net

San Francisco, CA
Joined November 2006
Simon Willison retweeted
The big advantage of MCP over OpenAPI is that it is very clear about auth. OpenAPI supports too many different auth mechanisms, and the schema doesn't necessarily have enough information for a robot to be able to complete the auth flow.
*gets up on soap box* With the announcement of this new "code mode" from Anthropic and Cloudflare, I've gotta rant about LLMs, MCP, and tool calling for a second.

Let's all remember where this started. LLMs were bad at writing JSON, so OpenAI asked us to write good JSON schemas & OpenAPI specs. But LLMs sucked at tool calling, so it didn't matter. OpenAPI specs were too long, so everyone wrote custom subsets.

Then LLMs got good at tool calling (yay!) but everyone had to integrate differently with every LLM.

Then MCP comes along and promises a write-once-integrate-everywhere story. It's OpenAPI all over again. MCP is just OpenAPI with slightly different formatting, and no real justification for redoing the same work we already did to make OpenAPI specs, but different.

MCP itself goes through a lot of iteration. Every company ships MCP servers. Hype is through the roof. Yet actual use of MCP is super niche.

But now we hear MCP has problems. It uses way too many tokens. It's not composable. So now Cloudflare and Anthropic tell us it's better to use "code mode", where we have the model write code directly.

Now this next part sounds like a joke, but it's not. They generate a TypeScript SDK based on the MCP server, and then ask the LLM to write code using that SDK. Are you kidding me? After all this, we want the LLM to use the SAME EXACT INTERFACE that human programmers use? I already had a good SDK at the beginning of all this, automatically generated from my OpenAPI spec (shout-out @StainlessAPI). Why did we do all this tool-calling nonsense? Can LLMs effectively write JSON and use SDKs now?

The central thesis of my rant is that OpenAI and Anthropic are platforms and they run "app stores", but they don't take this responsibility and opportunity seriously. And it's been this way for years. The quality bar is so much lower than the rest of the stuff they ship. They need to invest like Apple does in Swift and Xcode. They think they're an API company like Stripe, but they're a platform company like an OS.

I, as a developer, don't want to build a custom ChatGPT clone for my domain. I want to ship ChatGPT and Claude apps so folks can access my service from the AI they already use.

Thanks for coming to my TED talk
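For anyone who hasn't followed the "code mode" idea, the difference is roughly this: instead of emitting one JSON tool call per step (with every intermediate result round-tripped through the context window), the model writes a program against an SDK generated from the MCP server's tool definitions. A minimal sketch of the two shapes, in Python rather than the TypeScript the announcements describe - the tool name and generated module here are hypothetical:

    # Classic tool calling: the model emits a JSON blob like this for every
    # step, and each intermediate result is fed back through the model.
    tool_call = {
        "name": "search_issues",  # hypothetical MCP tool
        "arguments": {"repo": "simonw/datasette", "query": "is:open label:bug"},
    }

    # "Code mode": the harness turns the same tool definitions into an SDK and
    # the model writes ordinary code against it, keeping intermediate results
    # in local variables instead of in the context window.
    from generated_mcp_sdk import issues  # hypothetical generated module

    results = issues.search(repo="simonw/datasette", query="is:open label:bug")
    print("\n".join(issue["title"] for issue in results[:10]))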
Simon Willison retweeted
👀 Animals have been assigned. Scheduled to print fall 2026! We have iterated on this with over 3k students (and continue to do so). We give our students access to the full draft as part of our evals course (link in bio).
Added the @Kimi_Moonshot model prices to llm-prices.com/#it=10000&cit… ... by pasting a screenshot of their pricing page into a GitHub Issue and assigning it to GitHub Copilot github.com/simonw/llm-prices…
Simon Willison retweeted
GPT-5-Codex-Mini allows roughly 4x more usage than GPT-5-Codex, at a slight capability tradeoff due to the more compact model. Available in the CLI and IDE extension when you sign in with ChatGPT, with API support coming soon.
"It has never been easier to build an MVP and in turn, it has never been harder to keep focus. When new features always feel like they're just a prompt away, feature creep feels like a never ending battle. Being disciplined is more important than ever." I really feel this one!
6 months ago @Sentry acquired our company & since then I've been experimenting a lot with AI. I wrote up some of my thoughts on vibe coding & dealing with "AiDHD": josh.ing/blog/aidhd
I have a hunch that current LLMs might make it easier to launch a brand new programming language, provided you can describe it in a few thousand tokens and ship it with a compiler and linter that coding agents can use simonwillison.net/2025/Nov/7…
Here's K2 Thinking running on a pair of M3 Ultra Mac Studios via MLX
The new 1 trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format - no loss in quality! The model was quantization-aware trained (QAT) at INT4. Here it generated ~3,500 tokens at 15 tokens/sec using pipeline parallelism in mlx-lm:
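If you want to poke at this yourself, the single-machine mlx-lm API is just load + generate - the two-Mac demo above layers mlx-lm's pipeline-parallel distributed setup on top of that, which isn't shown here. A sketch, assuming a hypothetical mlx-community repo id (the full model needs ~594GB of memory for the weights):

    from mlx_lm import load, generate

    # Hypothetical repo id - check mlx-community on Hugging Face for the real
    # INT4 K2 Thinking conversion. The full model is far too big for one
    # machine, hence the two-Mac pipeline parallelism in the demo above.
    model, tokenizer = load("mlx-community/Kimi-K2-Thinking")

    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Explain pipeline parallelism briefly."}],
        add_generation_prompt=True,
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=500))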
Kimi K2 Thinking is "potentially the new leading open weight" according to @ArtificialAnlys
MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the Tau2 Bench Telecom agentic benchmark and is potentially the new leading open weights model.

Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release within @Kimi_Moonshot's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025.

Key takeaways:

➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct, and it appears this new Thinking variant makes substantial gains.

➤ Reasoning variant of Kimi K2 Instruct: As per its naming, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and same number of parameters (though different precision) as Kimi K2 Instruct and, like K2 Instruct, only supports text as an input (and output) modality.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot's prior Kimi K2 Instruct releases that used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

Our full set of Artificial Analysis Intelligence Index benchmarks is in progress and we will provide an update as soon as they are complete.
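The size claim is easy to sanity-check: 1T weights at 4 bits each is ~500GB, and the remaining ~94GB presumably goes to quantization scales and any tensors kept at higher precision (my assumption, not confirmed by Moonshot):

    # Back-of-envelope check on the INT4 vs FP8 sizes quoted above.
    params = 1.0e12                 # 1T total parameters (32B active per token)
    int4_gb = params * 4 / 8 / 1e9  # 4 bits = 0.5 bytes per weight
    fp8_gb = params * 8 / 8 / 1e9   # 8 bits = 1 byte per weight

    print(f"INT4 weights: ~{int4_gb:.0f} GB")  # ~500 GB; the release is ~594GB
    print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # ~1000 GB; matches "just over 1TB"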
Simon Willison retweeted
🚨 Potential GPT-5.1 Model: Polaris Alpha It's now available on OpenRouter for everyone to use. Here's the Pelican SVG. It's not reasoning, but still, performance looks promising. Will test further.
🚨 The new cloaked "Polaris Alpha" model on @OpenRouterAI appears to be an OpenAI model. It also has configurable reasoning effort (Low/Med/High). GPT-5.1 or Codex Mini are contenders for what this might be. h/t @DavidSZD1
I found this code example really useful for helping me understand the details of what the new (free) file search RAG feature in the Gemini API can do
Introducing the File Search Tool in the Gemini API, our hosted RAG solution with free storage and free query time embeddings 💾 We are super excited about this new approach and think it will dramatically simplify the path to context aware AI systems, more details in 🧵
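For a feel of the shape of the API: you create a file search store, upload documents into it (Gemini handles chunking and embeddings server-side), then pass the store to generate_content as a tool. A sketch using the google-genai Python SDK based on the announcement - treat the exact method and field names as an approximation and check the docs:

    import time
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects GEMINI_API_KEY in the environment

    # Create a store and upload a document; indexing runs as an operation.
    store = client.file_search_stores.create()
    op = client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=store.name,
        file="notes.pdf",
    )
    while not op.done:
        time.sleep(2)
        op = client.operations.get(op)

    # Query with the store attached as a tool - retrieval happens server-side.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="What do my notes say about pelicans?",
        config=types.GenerateContentConfig(
            tools=[types.Tool(file_search=types.FileSearch(
                file_search_store_names=[store.name]
            ))]
        ),
    )
    print(response.text)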
uv makes testing different projects against upgraded dependencies so much easier - no need to think about virtual environments, uv handles them almost invisibly. I wrote more about my uv testing tricks in this TIL: til.simonwillison.net/python…
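The core trick is uv run's --with flag, which layers extra requirements over a throwaway environment. A couple of command sketches - the package names are just examples, and the TIL has the full set of variations:

    # Run a project's tests against a pinned upgrade of one dependency, without
    # touching the project's own environment (add --with pytest if pytest
    # isn't already a dependency of the project):
    uv run --with 'sqlite-utils==4.0a1' pytest

    # Or against an unreleased branch straight from GitHub:
    uv run --with 'sqlite-utils @ git+https://github.com/simonw/sqlite-utils' pytest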
Simon Willison retweeted
Great hack from @simonw: make a GitHub repo *just* for experiments, then assign async cloud-based coding agents to a research task. It's a sandboxed environment with no negative effects on your real repos.
I've been getting a lot of value using coding agents for code research tasks recently - I have a dedicated simonw/research GitHub repo and I frequently have them run detailed experiments and write up the results. Here's how I'm doing that + some examples: simonwillison.net/2025/Nov/6…
And here's an example of one of my code research prompts
Here's my research repo - each of the 13 folders is a different research project, and the README is automatically updated by an LLM to include summaries describing each one github.com/simonw/research?t…
I've been getting a lot of value using coding agents for code research tasks recently - I have a dedicated simonw/research GitHub repo and I frequently have them run detailed experiments and write up the results. Here's how I'm doing that + some examples: simonwillison.net/2025/Nov/6…
Has anyone seen a coding agent start an automated test suite when one doesn't exist already, without being prompted to do so? Asking because my agents ALWAYS write tests, but all of my projects have a tests/ folder before I fire up the agent so maybe they're picking up on that?