Shion Honda · Dec 29, 2024 · 9:41 AM UTC

Shion Honda

Pinned Tweet

Shion Honda @shion_honda

29 Dec 2024

I've wrapped up 2024 with my top 10 favorite deep learning papers of the year! 🔍 Dive into the breakthroughs, insights, and ideas that shaped the field. Year in Review: Deep Learning Papers in 2024 | Hippocampus's Garden hippocampus-garden.com/deep_…

Year in Review: Deep Learning Papers in 2024 | Hippocampus's Garden

Reflecting on 2024's deep learning breakthroughs! Discover my top 10 favorite research papers that shaped the field this year.

hippocampus-garden.com

Shion Honda · Nov 8, 2025 · 3:48 PM UTC

Shion Honda @shion_honda

Nov 8

Kimi.ai · Nov 6, 2025 · 3:04 PM UTC

Shion Honda retweeted

Kimi.ai

@Kimi_Moonshot

Nov 6

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built as a thinking agent, K2 Thinking marks our latest efforts in test-time scaling — scaling both thinking tokens and tool-calling turns. K2 Thinking is now live on kimi.com in chat mode, with full agentic mode coming soon. It is also accessible via API. 🔌 API is live: platform.moonshot.ai 🔗 Tech blog: moonshotai.github.io/Kimi-K2… 🔗 Weights & code: huggingface.co/moonshotai

566

1,509

939

9,585

Shion Honda · Nov 6, 2025 · 7:27 AM UTC

Shion Honda @shion_honda

Nov 6

コンテキスト内に推論とツール利用を交互に入れるinterleaved thinkingがエージェントの性能を押し上げるという話。 OpenAIのResponses APIでサポートされているものと同じだと思いますが、ChatCompletionでは未サポートなので、確かにまだ一般的ではないかもしれません。

MiniMax (official)

@MiniMax__AI

Nov 3

x.com/i/article/198528154829…

Shion Honda · Nov 4, 2025 · 10:00 PM UTC

Shion Honda @shion_honda

Nov 4

各社のLLMのTau-benchでの性能を同じ条件下で計測したところ、細かな設定の違いが大きな差を生むことがわかりました。大本営発表に惑わされないために、自前のベンチマーク、少なくとも公開ベンチマークを自前で実行するセットアップが必要です。

Shion Honda @shion_honda

Nov 3

Chasing leaderboard scores is a trap. We reproduced Tau-bench across different models and learned: - Hidden configs make apple-to-apple comparisons impossible - Reasoning boosts accuracy, but also latency That’s why you need an internal benchmark. medium.com/alan/benchmarking…

Daisuke Okanohara / 岡野原大輔 · Oct 29, 2025 · 8:11 PM UTC

Shion Honda retweeted

Daisuke Okanohara / 岡野原大輔

@hillbig

Oct 29

MiniMax M2はLLM評価で標準的なAAベンチマークで、全体で5位、オープンモデルの中ではgpt-oss 120B(high)に並ぶ1位タイである。特にエージェント能力を大きく改善し、IFBenchやtauBenchの性能は大きく改善している。アーキテクチャは最近採用されている効率的な注意機構の変種ではなくFull Attentionを利用している（特にM1の時はむしろ線形注意機構の採用で話題となっていたが、戻したことになる）。アーキテクチャはQwen3とほぼ同様だが、ヘッド毎に独立のQK-Normを利用している。 Full Attentionを再度採用した理由として、従来のタスク（MMLU, BBH）などであれば高性能が確認されたが、スケールアップして複雑なマルチホップ推論タスクに鳥くんだ場合、明らかな性能低下を引き起こすためである。これらについては時間が経てば解決する可能性が高いが、まだFull Attentionが必要としている。また、推論時の最適化によってFull Attentionでも問題ないとされる。 M1の強化学習においてはスケールアップしたときしか顕在化しなかった重大な精度問題が発生しており、同様に線形注意機構もスケールアップした場合に問題が生じる問題があった。実際、ハイブリットSWA方式にするのを試したが、コンテキスト長が長くなるほど性能低下が顕著となる問題があり、エージェント向けには許容することができなかった。これは初期の事前学習段階で多くの大域注意パターンが形成されていて、継続事前学習に調整することが困難であったためかもしれず、事前学習からの検討などは今後の課題である。 CoTの学習データの品質が重要であった。従来の多くの研究が特定のベンチマークに過剰適合している問題があった。そのためデータ合成時にフォーマットの多様性を増やし、ルールベースシステムとLLM-as-a-judgeをあわせたデータクリーニングを実施した。また、数学データとコードデータを用いた推論能力の向上において極めて重要であることが再確認され、あらゆるタスクに好影響を与える。ただし広範囲な領域の推論には以前として多様なデータが必要となる。また、より難しく、複雑な問題が学習には有効であり、問題のpass rate（解けた割合）やスコアに応じて問題を拡張していった。エージェント機能はベンチマーク向け、および実際のタスク向けの両方に強化させていることを協調している。特に後者の実タスクにおいては交互思考（Interleved Thinking）、つまりツール利用とその結果に応じた思考を交互に繰り返していくことが重要であることが再確認された。最初に思考するだけでは、長期的な実行を確実に実行することは困難であるためだ。この場合、LLMは外部ツールからの応答内容による干渉に対して安定している必要がある。さらに、この交互使用の学習においてはツールのスケーリング（ツールの種類が増やしていく）だけでなく、システムプロンプト、ユーザープロンプト、環境情報（ツールセット、コードファイル）についてもスケーリングしていく必要があり、これらも増やすことで初めて高いレベルの汎化を達成できる

Shion Honda · Nov 3, 2025 · 7:48 PM UTC

Shion Honda @shion_honda

Nov 3

Benchmarking AI Agents: Stop Trusting Headline Scores, Start Measuring Trade-offs

Don’t chase leaderboards. Run the benchmark yourself and map your score–latency–cost frontier to choose models for production.

medium.com

Alan engineering · Nov 3, 2025 · 7:28 PM UTC

Shion Honda retweeted

Alan engineering @alanengineering

Nov 3

Don't chase leaderboards. There's always a score-latency tradeoff, and LLM providers often hide it. We reproduced Tau-bench to see for ourselves. Here's what we learned: medium.com/alan/benchmarking…

Benchmarking AI Agents: Stop Trusting Headline Scores, Start Measuring Trade-offs

Don’t chase leaderboards. Run the benchmark yourself and map your score–latency–cost frontier to choose models for production.

medium.com

株式会社リクルートデータ推進室 · Oct 31, 2025 · 7:30 AM UTC

Shion Honda retweeted

株式会社リクルートデータ推進室

@Recruit_Data

Oct 31

【ブログ記事公開のお知らせ】【解法紹介】RecSys Challenge 2025 で優勝しました blog.recruit.co.jp/data/arti… RecSys Challenge 2025で取り組んだ課題と得られた学びについて紹介しています💡 ぜひご覧ください！ #RecSysChallenge

【解法紹介】RecSys Challenge 2025 で優勝しました

はじめにこんにちは。チェコ料理とチェコビールの大ファンになったリクルートの長妻です。（チェコは国民一人当たりの年間ビール

blog.recruit.co.jp

Shion Honda · Nov 1, 2025 · 8:27 AM UTC

Shion Honda @shion_honda

Nov 1

Elo vs Bradley-Terry: Which is Better for Comparing the Performance of LLMs? | Hippocampus's Garden hippocampus-garden.com/elo_v…

Elo vs Bradley-Terry: Which is Better for Comparing the Performance of LLMs? | Hippocampus's Garden

Chatbot Arena updated its LLM ranking method from Elo to Bradley-Terry. What changed? Let's dig into the differences.

hippocampus-garden.com

Shion Honda · Nov 1, 2025 · 8:27 AM UTC

Shion Honda @shion_honda

Nov 1

ブログ記事がPLOS Oneの論文に引用されていました😁 journals.plos.org/plosone/ar…

Shion Honda · Oct 30, 2025 · 10:18 PM UTC

Shion Honda @shion_honda

Oct 30

既存APIの参照先をSnowflakeのスナップショットに切り替えるというアプローチでバックテスト基盤を構築する方法を説明した記事。 Change Data Captureという概念は知らなかったので勉強になりました。 tech.layerx.co.jp/entry/2025…

pon / Hiromu Nakamura | 技術書典す01 · Oct 29, 2025 · 11:56 PM UTC

Shion Honda retweeted

pon / Hiromu Nakamura | 技術書典す01

@po3rin

Oct 29

新作書きました！！📝📝 AI Agentブログリレーもついに36日目！ AI Agentのビジネス価値を測る為の仕組みづくり、特に他サービスAPI依存に依存してるAgentのバックテストは辛いぞという話です❄️ #LayerX_AI_Agent_ブログリレー tech.layerx.co.jp/entry/2025…

AI Agentのビジネス価値を計るバックテスト基盤の構築 - LayerX エンジニアブログ

こちらはLayerX AI Agentブログリレー36日目の記事です。 LayerX バクラク事業部で AI/MLOpsエンジニアをしている中村(@po3rin)です。今回はAI Agentのビジネス価値を計るバックテスト基盤を構築した話と、そこから学んだAI Agent開発のプラクティスを紹介します。目次目次 A…

tech.layerx.co.jp

Shion Honda · Oct 30, 2025 · 9:28 PM UTC

Shion Honda @shion_honda

Oct 30

pass@kとpass^k が平均や分散だけでなく歪度など分布の形状も反映するということについて書きました。

Shion Honda @shion_honda

Oct 30

Wrote a short piece on pass@k and pass^k. If you enjoy thinking about evaluation quirks and probability, you might like it. Pass@k and Pass^k Tell Different Stories from Mean Success Rate | Hippocampus's Garden hippocampus-garden.com/pass_…

Shion Honda · Oct 30, 2025 · 9:26 PM UTC

Shion Honda @shion_honda

Oct 30

Pass@k and Pass^k Tell Different Stories from Mean Success Rate | Hippocampus's Garden

These metrics capture coverage and reliability.

hippocampus-garden.com

MiniMax (official) · Oct 27, 2025 · 5:04 AM UTC

Shion Honda retweeted

MiniMax (official)

@MiniMax__AI

Oct 27

We’re open-sourcing MiniMax M2 — Agent & Code Native, at 8% Claude Sonnet price, ~2x faster ⚡ Global FREE for a limited time via MiniMax Agent & API - Advanced Coding Capability: Engineered for end-to-end developer workflows. Strong capability on a wide-range of applications (Claude Code, Cursor, Cline, Kilo Code, Droid, etc) - High Agentic Performance: Robust handling of long-horizon toolchains (mcp, shell, browser, retrieval, code). - Smarter, Faster, Cheaper with efficient parameter activation

122

875

179

2,810

catnose · Oct 27, 2025 · 2:18 AM UTC

Shion Honda retweeted

catnose

@catnose99

Oct 27

AIチャットUIを作るときの地味Tips ユーザーの追加メッセージが送信されたとき ①最後の2つのメッセージ（userとassistant）だけ「min-heightがスクリーン高くらい」のボックスの中に入れる ②最下部までスクロールこうするとAIの回答がストリーミングで下に伸びていってもいい感じに表示される

1,371

Shion Honda · Oct 22, 2025 · 8:34 PM UTC

Shion Honda @shion_honda

Oct 22

LoRAは適切な条件下でFTと同等の学習性能を達成し、特にRLで有効であるということを経験的に示した記事。全ての層（特にMLPやMoE層）に適用し、追加したい情報量に対して十分なLoRAパラメータ数を確保することが重要。 LoRA Without Regret - Thinking Machines Lab thinkingmachines.ai/blog/lor…

LoRA Without Regret

How LoRA matches full training performance more broadly than expected.

thinkingmachines.ai

111

OpenAI · Oct 21, 2025 · 5:20 PM UTC

Shion Honda retweeted

OpenAI

@OpenAI

Oct 21

Meet our new browser—ChatGPT Atlas. Available today on macOS: chatgpt.com/atlas

2,439

4,362

4,449

30,598

Shion Honda · Oct 22, 2025 · 7:48 PM UTC

Shion Honda @shion_honda

Oct 22

Vibe lifing 🏄

Shion Honda · Oct 21, 2025 · 6:11 AM UTC

Shion Honda @shion_honda

Oct 21

長文の入力に対してLLMの性能がどう変わるかを調べた記事。長文処理性能はモデルのコンテキスト長からも NIAHのような簡単なベンチマークからもわからない。人為的に挿入するneedleはhaystackと同じ分布から抽出し、タスクは文全体を読まないと解けないように設計すべき。 nrehiew.github.io/blog/long_…