Co-leading AI Scientist effort @AnthropicAI (Discovery team) 💼 Past: Gemini @GoogleDeepMind (Co-led Blueshift team) 🎒Traveling & Backpacking

San Francisco
Joined May 2014
Claude Haiku 4.5 is released today! It is as strong as Sonnet 4 on coding and much stronger at computer-use. But more importantly, it is 2x faster! Watching a computer agent do things on your computer is much more exciting now!
Excited to ship Claude Haiku 4.5 today! What was state-of-the-art 5 months ago (Sonnet 4) is now available at 1/3 the cost and 2x the speed. Even beats Sonnet 4 at computer use. Available today wherever you get your Claude :)
My PhD was a roller coaster 🎢:
I'm only asking people to think hard before committing to an ML PhD program. But an ML PhD could still end up working great for many! Also, I covered some of the less-discussed cons and did not intend to provide the full picture. 🧵My own PhD was truly a roller coaster 🎢: 1/n
My personal story of how I got into a PhD program:
Replying to @bneyshabur
In 2010, when I applied for PhD programs, I was a Master's student in Iran with no published papers, an undergrad GPA of C, and poor-to-mediocre English test scores. 2/
Why you should think really hard before committing to a PhD program in ML (this is from 3 years ago and even more true today given the current AI trends):
These days, many people are interested in getting a PhD in ML. I think you should think really hard before committing to a PhD program in ML. Why? I'm going to summarize some thoughts in this thread: 1/10
I still get a lot of questions about doing a PhD in ML, so I'm resurfacing three threads I wrote about this, including some fun personal stories 🧵
Very creative!
>be me
>be Claude
>have read the internet but one day human asks me to draw
>no training, no practice, just converting mental image to mouse movements like a toddler holding a crayon
>pencil tool not working? np, I'll draw with the eraser
>task failed successfully
You can try this on the “Claude for Chrome” research preview to get a sense of where things are: claude.ai/chrome
When @ethansdyer and I joined Anthropic last Dec and spearheaded the discovery team, we decided to focus on unlocking computer-use as a bottleneck for scientific discovery. It has been incredible to work on improving computer-use and witness the fast progress. In OSWorld, for example, the performance of E2E foundation models has improved from ~8% a year ago to ~61% today, with human performance at 72%. Some interesting observations/challenges with computer-use 🧵
It has been amazing to work with many talented people at Anthropic on this. Some that I could find here are @the_marwell @oh_that_hat @katie_kang_ @shaya_zarkesh. I originally got interested in computer-use when I was at Google, thanks to very insightful conversations with @anmol01gulati, who has been thinking about and working on this for a long time.
10) Current techniques/models are very slow and we need the time per action to be 0.1 to 1 second (particularly for collaborative agents).
9) For computer-use agents to be very useful, their error rate should be much, much lower. For example, for ~3h of autonomous work, the error rate per action should be ~1e-5 (assuming 1 action/sec).
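A rough back-of-the-envelope sketch of where that ~1e-5 figure comes from, assuming ~1 action/sec over ~3h and a target of roughly 90% error-free trajectories (the 90% target is my assumption, not stated above):

```python
# Back-of-the-envelope: per-action error rate needed for long autonomous runs.
# Assumptions (not from the tweet): ~1 action/sec, ~3h run, ~90% target success,
# and action errors treated as independent.
hours = 3
actions = hours * 3600          # ~10,800 actions at 1 action/sec
per_action_error = 1e-5

# Probability the whole trajectory completes without a single action error.
trajectory_success = (1 - per_action_error) ** actions
print(f"{actions} actions -> trajectory success ≈ {trajectory_success:.2%}")
# ~10,800 actions at 1e-5 error/action -> ≈ 90% chance of an error-free run
```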
8) We still need to figure out the right interface for both collaborative and autonomous computer-use agents for maximum productivity. That needs some clever product work, and many startups are working on it.
7) We are now transitioning from the GPT-2 moment of computer-use to the GPT-3 moment, when we start to have few-shot behavior working reliably. Imagine being able to upload a video of your workflow once (perhaps with your voice explaining what you are doing), and then the model can do all future similar workflows from that one example.
6) Computers are very rich and empowering environments. You can do A LOT from your computer, including collaborating with others to accomplish real tasks in the physical world.
5) Compared to domains like math and code, computer-use has many, many more interactions with the environment, and in many cases it is easier for humans or models to verify the success of a given trajectory.
4) Using a UI is easier for many people, so understanding how to do things in a UI is going to be a big bottleneck for any collaborative agent that observes what you do and helps you get it done faster.
3) There are many tasks that are either impossible or very hard to do using the terminal/code, and therefore computer-use ability will likely be a bottleneck for building autonomous agents for very long-horizon tasks.
2) Today, a very significant portion of scientific and knowledge work is done using GUIs. This means there is plenty of demonstration data available (in some cases with chain-of-thought in voice/transcript format), and such data is very easy to collect.
1) Computer-use is the most general interface for the non-physical world designed for maximal cross-task generalization.
Behnam Neyshabur retweeted
GOOD MORNING NEW YORK CITY COME DO YOUR BEST THINKING AT OUR THINKING SPACE IN THE WEST VILLAGE SAY NO TO SLOP