Tag: #llm | Hacker News Digest

The Timmy Trap (jenson.org)

The Timmy Trap
Part two of a series on LLMs

LLMs seem smart because they write fluently. That fluency switches off our skepticism, and we start treating the machine as a person.

The Turing test today
The classic test had a judge compare two conversation partners: a human and an AI. The modern version has shrunk to a single "human ↔ LLM" dialogue. We no longer compare; we simply pass judgment, and our inner judge is tuned to look for humanity (anthropomorphism). That is why even ELIZA, a 1960s program built on simple if-else rules, managed to beat GPT-3.5. It is not the machines that are failing the test, it is us.

The Timmy trick
At my talks I pull out a pencil with googly eyes named Timmy. Within 15 seconds the audience has said hello to him and learned that he dreams of becoming a UX designer… and then gasps when I snap Timmy in half. If we can bond with a pencil in a quarter of a minute, an hour with a "smart" system leaves us far more vulnerable. We excuse LLM mistakes by calling them "hallucinations", but they are not a malfunction; they reflect the absence of thinking.

Shortening ≠ summarizing
LLMs do not "summarize"; they merely shorten text. A real summary requires outside context and understanding, which a language model does not have.

by metadat • August 15, 2025, 14:10 • 112 points

Model	9:05	Lockout	Dreamhold	Lost Pig
Grok 4	86%	15%	46%	33%
Claude 4 Sonnet	80%	30%	53%	46%
Gemini 2.5 Flash	80%	30%	33%	46%
Gemini 2.5 Pro	80%	30%	40%	40%
DeepSeek R1	80%	23%	33%	33%
Claude 4 Opus	73%	30%	60%	46%
gpt-5 Chat	73%	15%	53%	33%
DeepSeek V3	66%	23%	20%	33%
gpt-4o	53%	23%	40%	40%
Qwen3 Coder	53%	23%	40%	33%
Kimi K2	53%	30%	46%	40%
glm 4.5	53%	23%	33%	53%
Claude 3.5 Haiku	38%	15%	26%	26%
Llama 3 Maverick	33%	30%	40%	33%
gpt-o3-mini	20%	15%	26%	26%
Mistral Small 3	20%	15%	0%	20%
gpt-4o-mini	13%	23%	20%	40%
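The table above gives a score for each game but no overall figure. One rough way to compare models across all four games is an unweighted mean of the columns. The minimal Python sketch below does that and ranks the models. The equal weighting is our own assumption, not a metric from the original article. The numbers are copied from the table, and model names are kept exactly as listed there.

```python
# Per-game scores (percent) copied from the table above.
# Column order: 9:05, Lockout, Dreamhold, Lost Pig.
SCORES: dict[str, list[int]] = {
    "Grok 4": [86, 15, 46, 33],
    "Claude 4 Sonnet": [80, 30, 53, 46],
    "Gemini 2.5 Flash": [80, 30, 33, 46],
    "Gemini 2.5 Pro": [80, 30, 40, 40],
    "DeepSeek R1": [80, 23, 33, 33],
    "Claude 4 Opus": [73, 30, 60, 46],
    "gpt-5 Chat": [73, 15, 53, 33],
    "DeepSeek V3": [66, 23, 20, 33],
    "gpt-4o": [53, 23, 40, 40],
    "Qwen3 Coder": [53, 23, 40, 33],
    "Kimi K2": [53, 30, 46, 40],
    "glm 4.5": [53, 23, 33, 53],
    "Claude 3.5 Haiku": [38, 15, 26, 26],
    "Llama 3 Maverick": [33, 30, 40, 33],
    "gpt-o3-mini": [20, 15, 26, 26],
    "Mistral Small 3": [20, 15, 0, 20],
    "gpt-4o-mini": [13, 23, 20, 40],
}


def rank_by_mean(scores: dict[str, list[int]]) -> list[tuple[str, float]]:
    """Return (model, unweighted mean score) pairs, highest first.

    sorted() is stable even with reverse=True, so models with equal
    means keep the order in which they appear in the table.
    """
    means = [(model, sum(vals) / len(vals)) for model, vals in scores.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, mean in rank_by_mean(SCORES):
        print(f"{model:<18} {mean:6.2f}%")
```

With equal weighting, Grok 4 has the best 9:05 score but drops to fifth place. That is a reminder that any single "overall" number depends on the weighting chosen.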

Method	Time	Token accuracy	Tasks solved
Autoregression	1×	94%	21%
Diffusion, 10 steps	0.6×	95%	19%
Diffusion, 30 steps	1.3×	94%	21%

The new science of “emergent misalignment” (quantamagazine.org)

Gemma 3 270M: Compact model for hyper-efficient AI (developers.googleblog.com) 🔥 Hot 💬 Long discussion

Why LLMs can't really build software (zed.dev) 🔥 Hot 💬 Long discussion

Evaluating LLMs playing text adventures (entropicthoughts.com)

Why are there so many rationalist cults? (asteriskmag.com) 🔥 Hot 💬 Long discussion

Why does this happen?

Nexus: An Open-Source AI Router for Governance, Control and Observability (nexusrouter.com)

What it does

Pros

What's next

Training language models to be warm and empathetic makes them less reliable (arxiv.org) 🔥 Hot 💬 Long discussion

What's the strongest AI model you can train on a laptop in five minutes? (seangoedecke.com) 🔥 Hot 💬 Long discussion

The time limit

Speed

Dataset

Qodo CLI agent scores 71.2% on SWE-bench Verified (qodo.ai)

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (arstechnica.com)

Japan's largest paper, Yomiuri Shimbun, sues Perplexity for copyright violations (niemanlab.org)

I've seen 12 people hospitalized after losing touch with reality because of AI (twitter.com)

Token growth indicates future AI spend per dev (blog.kilocode.ai)

GitHub is no longer independent at Microsoft after CEO resignation (theverge.com) 🔥 Hot 💬 Long discussion

Auf Wiedersehen, GitHub (github.blog)

Claude Code is all you need (dwyer.co.za) 🔥 Hot 💬 Long discussion

Vibe-coding a CRUD app in a single prompt

SPEC.md (abridged)

Pricing Pages – A Curated Gallery of Pricing Page Designs (pricingpages.design)

GPT-OSS-120B runs on just 8GB VRAM & 64GB+ system RAM (old.reddit.com)

Show HN: Engineering.fyi – Search across tech engineering blogs in one place (engineering.fyi) 🔥 Hot

MCP: An (Accidentally) Universal Plugin System (worksonmymachine.ai)

LLMs aren't world models (yosefk.com) 🔥 Hot 💬 Long discussion

GPT-5: Overdue, overhyped and underwhelming. And that's not the worst of it (garymarcus.substack.com)

GPTs and Feeling Left Behind (whynothugo.nl)

PCIe 8.0 announced by the PCI-Sig will double throughput again (servethehome.com) 💬 Long discussion

Curious about the training data of OpenAI's new GPT-OSS models? I was too (twitter.com)

Ch.at – A lightweight LLM chat service accessible through HTTP, SSH, DNS and API (ch.at)

Knuth on ChatGPT (2023) (cs.stanford.edu)

The current state of LLM-driven development (blog.tolki.dev) 💬 Long discussion

An AI-first program synthesis framework built around a new programming language (queue.acm.org)

My Lethal Trifecta talk at the Bay Area AI Security Meetup (simonwillison.net) 🔥 Hot

The dead need right to delete their data so they can't be AI-ified, lawyer says (theregister.com)

What the Windsurf sale means for the AI coding ecosystem (ethanding.substack.com)

Let's properly analyze an AI article for once (nibblestew.blogspot.com)

Our European search index goes live (blog.ecosia.org)

The Framework Desktop is a beast (world.hey.com) 🔥 Hot 💬 Long discussion

Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally? 🔥 Hot 💬 Long discussion

Efrit: A native elisp coding agent running in Emacs (github.com)

The surprise deprecation of GPT-4o for ChatGPT consumers (simonwillison.net) 🔥 Hot 💬 Long discussion

GPT-5 vs. Sonnet: Complex Agentic Coding (elite-ai-assisted-coding.dev)

GPT-5

Claude 4 Sonnet

Conclusion

AI must RTFM: Why tech writers are becoming context curators (passo.uno)

AI is impressive because we've failed at personal computing (rakhim.exotext.com) 💬 Long discussion

Google's Genie is more impressive than GPT5 (theahura.substack.com)

Astronomy Photographer of the Year 2025 shortlist (rmg.co.uk)

Getting good results from Claude Code (dzombak.com) 🔥 Hot 💬 Long discussion

How attention sinks keep language models stable (hanlab.mit.edu)

GPT-5 leaked system prompt? (gist.github.com) 💬 Long discussion

The bio tool (memory)

GPT-5: "How many times does the letter b appear in blueberry?" (bsky.app) 🔥 Hot 💬 Long discussion
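The question in this title has a definite answer, and ordinary code gets it right every time. One commonly cited reason LLMs stumble on it is that they work on subword tokens rather than individual letters. That explanation is general background, not a claim from the linked post. A minimal Python sketch of the deterministic count (the helper name `letter_positions` is our own, purely for illustration):

```python
def letter_positions(word: str, letter: str) -> list[int]:
    """Return the 0-based indices where `letter` occurs in `word`, ignoring case."""
    target = letter.lower()
    return [i for i, ch in enumerate(word.lower()) if ch == target]


if __name__ == "__main__":
    positions = letter_positions("blueberry", "b")
    # "blueberry" is b-l-u-e-b-e-r-r-y, so "b" sits at indices 0 and 4.
    print(len(positions), positions)  # → 2 [0, 4]
```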

Achieving 10,000x training data reduction with high-fidelity labels (research.google)

Cursor CLI (cursor.com) 🔥 Hot 💬 Long discussion

GPT-5: Key characteristics, pricing and system card (simonwillison.net) 🔥 Hot 💬 Long discussion

GPT-5 for Developers (openai.com) 🔥 Hot 💬 Long discussion

GPT-5 (openai.com) 🔥 Hot 💬 Long discussion

Live: GPT-5 (youtube.com)

Let's stop pretending that managers and executives care about productivity (baldurbjarnason.com)

An LLM does not need to understand MCP (hackteam.io)

AI Ethics is being narrowed on purpose, like privacy was (nimishg.substack.com)

How AI conquered the US economy: A visual FAQ (derekthompson.org) 🔥 Hot 💬 Long discussion

Jules, our asynchronous coding agent (blog.google) 🔥 Hot 💬 Long discussion

Qwen3-4B-Thinking-2507 (huggingface.co)

Providing ChatGPT to the U.S. federal workforce (openai.com) 💬 Long discussion

Claude Code IDE integration for Emacs (github.com) 🔥 Hot 💬 Long discussion

LLM Inflation (tratt.net)

Teacher AI use is already out of control and it's not ok (reddit.com) 💬 Long discussion

Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model (github.com) 🔥 Hot 💬 Long discussion

AI is propping up the US economy (bloodinthemachine.com) 🔥 Hot 💬 Long discussion

Open models by OpenAI (openai.com) 🔥 Hot 💬 Long discussion