Reading — Davis Treybig

Currently reading

A running list of papers, blogs, and technical writing I've found interesting across infrastructure, AI, and systems.

01. When AI Writes the World's Software, Who Verifies It?
Great overview of the recent progress in using AI coding models to autoformalize software systems in Lean.

02. Quantization for On-Device Models
Really impressive results on quantizing models to run on Apple devices. Mirai pairs a custom quantization stack with a specialized inference engine, using post-training quantization (a modified YAQA algorithm) followed by quantization-aware distillation, plus Random Hadamard Transforms to suppress outliers. The standout result is that their 4-bit models deliver 40-60% more tokens per second at the same quality as tools like llama.cpp, and their 8-bit versions run faster than llama.cpp's 4-bit models while staying nearly identical to full precision. The key insight is hardware-aware design - optimizing for GPU efficiency rather than pure compression ratios.

03. Learning to Discover at Test Time
Paper exploring methods for doing reinforcement learning at test time for the purposes of discovery. It makes the interesting point that discovery problems look very different from standard learning objectives, because you care more about finding a single state-of-the-art method than the average performance of the LLM. This lets you alter many fundamental assumptions - such as being OK with updating model weights at test time for a given task, or changing the RL objective to maximize variance rather than optimize for average outcome. They show some really cool results applying this to GPU kernel writing.

04. Pre-training Isn't Bitter Enough
CMU research paper suggesting that the fact that we still hand-craft the learning tasks for language models is to an extent anti-bitter-lesson. While we have generalized learning algorithms, we don't have generalized task-optimization algorithms, and instead we basically hand-tune self-supervised learning objectives when training LLMs.
The paper explores an alternative where you instead have a system that co-learns both the learning objective and the model parameters, under the intuition that a learning objective delta that drives a better gradient in the model learning is likely to be a better objective.

05. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments?
Interesting paper that benchmarks how well LLMs can forecast the results of real scientific experiments across natural science domains.

06. Bayesian Forecasting with LLMs
Interesting paper on methodologies to make LLMs better forecasters, treating the forecasting state as a buildup of Bayesian-style probabilities with evidence.

07. CL-Bench 1.0
New benchmark for continual learning, evaluating how well models retain and build on prior knowledge without catastrophic forgetting.

08. ProgramBench
Facebook Research benchmark for evaluating program synthesis and code understanding capabilities of LLMs.

09. Long-Running Claude
Cool blog by Anthropic about using a long-running agent to implement a differentiable Boltzmann solver in JAX. Good example of how to structure a long-running agent program in terms of instructions, tests, etc. I think right now this only works in these very, very verifiable domains - in this case there was a reference implementation to compare against.

10. Composer 2
Overview of how Cursor trained Composer 2. Some particularly interesting discussions of their training mix (Kimi 2 base, continued pre-training, SFT, then RL), how they do RL (e.g. sequence parallelism, updating model weights mid-rollout), and their environments infrastructure. I also like some of the discussion on building evals/benchmarks that are more representative of user behavior - e.g. focusing on underspecified queries.

11. Uni 1
New model from Luma that is a single decoder-only autoregressive transformer that interleaves images and text on both input and output.
As a result, it appears to have much stronger visual reasoning capabilities, and can also support much more complex input conditioning with long prompts & other image inputs. The controllability is particularly good.

12. After WIMP
Good blog playing out how the fundamental assumptions about how web services and users interact are changing. Traditionally, the GUI/UI was the fixed exchange protocol between a user and a service. The future probably looks different - maybe some kind of UI/GUI scaffolding or principles, coupled with some kind of user-attached personal software libraries that outline the preferences of that user, and dynamically compile into an application.

13. Challenges and Research Directions for Large Language Model Inference Hardware
Amazing overview of how LLM inference merits a structural rethinking of chip & datacenter design. The insane growth of inference vs. training, coupled with novel architectures like MoE & long context, means that the decode step of LLM inference is increasingly memory bound and latency constrained. They propose four major areas of research to address this - more of a focus on flash-based memory vs. just DRAM/HBM, processing near memory (e.g. small compute units located closer to memory) for higher bandwidth, 3D memory stacking for higher bandwidth, and lower-latency interconnect approaches.

14. UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Paper from Adobe exploring using a VLM as the universal encoding layer for generative image models, replacing the more common separation of a vision + text encoder. The approach seems to result in models that generalize better and have better transfer in training, though it requires some nuance in how you apply the VLM.

15. Interesting Directions in Vision
Good overview of some recent trends in vision-language-action models for robotics.
Main themes include: integrating tactile data, incorporating 3D reasoning and 3D priors, applying RL on top of base VLM/VLA, and unifying world models with VLAs.

16. Agent Design Patterns Overview
Incredible overview of recent design patterns in agents. Resounding theme is adopting computing primitives as the basis set of tools (e.g. bash, file system, code) & offloading all context management to the computer.

17. Recursive Language Models
A proposed system architecture for models/agents, designed around recursively calling LLMs in a REPL environment where the context is represented as a variable in memory that is not shown in any way to the LLM unless it specifically asks for it using various tools (grep, peeking, etc). You are essentially asking the LLM to figure out how to probe & discern the context and identify how to manipulate it with a sequence of recursive sub-calls. In a way this is similar to how systems like Claude Code work, but they are coming more from the model design side vs. a task-specific solution. Amazing paper.

18. Towards a Science of Scaling Agent Systems
Interesting overview of how multi-agent vs. single-agent system design variations impact task quality. There is high variance in whether a multi-agent design improves or degrades results, depending on the situation.

19. Automated Self-Testing
Great blog by Replit on agent design for automated testing of code gen agents. Particularly interesting was the design strategy of simply having their computer-use agent write Playwright code rather than giving it specialized tools like select element, etc.

20. Sandbox Runtime
A lightweight agent sandboxing tool from Anthropic, based around filesystem and network restrictions using OS utilities vs. using a full-fledged container or microVM.

21. Everything is Context: Agentic File System Abstraction for Context Engineering
Paper exploring how file system access can be an effective tool for context management.
Similar to the Manus architecture.

22. Inside ThunderKittens' Python Bindings
Interesting overview of part of ThunderKittens, a framework for writing more performant GPU kernels out of Stanford.

23. Radiance Fields and Future of Generative Media
Great overview of the state of neural radiance fields and the role that 3D models will play in generative media.

24. Cheap RL Tasks Will Waste Compute
Interesting argument on why the world is shifting toward extremely high-end, high-cost, specific RL data and away from cheap RL tasks.

25. Principles of Diffusion Models
Holistic overview of the principles of diffusion models.

26. Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
First example of applying RL to improve reasoning in diffusion language models.

27. LLMs for Scheduling Policies in Distributed Systems
Cool example of using an LLM + simulator to optimize a database scheduler. Generator + verifier pattern.

28. Barbarians at the Gate
Interesting overview of ideas for applying ML to systems research.

29. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First
Agent-first data system design from Berkeley.

30. Improving Cursor Tab with Online RL
Overview of how Cursor does online RL to improve their Cursor Tab model.

31. Denny Zhou – Reasoning Slides
Good slides on reasoning models and techniques.

32. SkyRL v0.1 — NovaSky
Modular RL framework with separate trainer, environment, generator, and reward layers.

33. The Second Half of Machine Learning
Good blog on how ML is moving from methods to environments and RL.

34. Weaver
Combining multiple weakly supervised verifiers into a strong ensemble verifier.

35. BlockDiff Incremental VM Snapshots
Cognition's OSS code sandbox designed for snapshotting.

36. TITANS
Alternative autoregressive architecture with dynamic memory blocks for long context.

37. TAO: Test-Time Compute to Train Efficient LLMs
Cool example of using reasoning models to autonomously fine-tune LLMs.

38. Scaling RL Compute
Discussion of how to scale RL compute and its bottlenecks from General Reasoning.

39. Inductive Moment Matching
Diffusion-like method allowing discrete jumps in sampling and better use of pretrained networks.

40. Trellis3D
Unified latent representation for generative 3D objects decoding to radiance fields, Gaussians, and meshes.

41. SIMA
Trains agents to act in diverse 3D game worlds from language inputs; notably it generalizes well across games where it has no game-specific training.

42. Autoregressive Image Generation via Progressive Upsampling
Treats image generation as autoregressive refinement to higher resolutions, competitive with diffusion.

43. Flow Matching
General method for training generative models via matching probability flows instead of noise corruption.

44. Chameleon
Mixed-modality model encoding text and images in a single token space trained end-to-end.

45. Training Verifiers for Math Problems
Early OpenAI paper on training verifiers for mathematical reasoning.

46. MuZero
Model-based RL algorithm that masters games without being given the rules, planning with a learned model instead.

47. Beam Search
Using beam search as a test-time reasoning strategy.

48. Tree of Thoughts
Reasoning strategy that generates candidate thoughts and explores them via tree search heuristics.

49. LLM Reasoners
Frames reasoning as a combination of a reward technique, a world model, and a search algorithm; shows search and RAP outperform basic CoT.

50. Reasoning with Language Model is Planning with World Model
Combines a generator and a world-model LLM with MCTS for iterative reasoning.

51. Large Language Monkeys
Demonstrates large gains from sampling many outputs and selecting via a verifier.

52. Beyond A*
Trains transformers on A*-generated traces to internalize search-like problem solving.

53. Stream of Search
Shows that post-training on search-style reasoning traces greatly improves CoT performance.

54. DualFormer
Trains on full and partially masked reasoning traces to enable fast vs. slow reasoning modes.

55. Training LLMs to Reason in a Continuous Latent Space
Performs reasoning in latent space instead of over discrete token sequences.

56. Byte Latent Transformers
Tokenizer-free transformer operating on bytes with entropy-based dynamic patching.

57. The State of Generative Models - 2024 Review
Overview of late-2024 trends in multimodality, reasoning, tokenization removal, and agents.

58. Model Context Protocol
Anthropic's framework for standardizing tools for LLMs.

59. FastMCP
Cool framework from Prefect that simplifies building production MCP servers.

60. Brush
WebGPU and Burn-based engine for training and rendering Gaussian splats.

61. Unbounded
Prototype "infinite" game powered by distilled LLMs and diffusion models.

62. lolcats
Converts transformers into linear/state-space style models via attention replacement and LoRA.

63. SQLite in Durable Objects
Synchronous embedded SQLite inside Cloudflare Durable Objects for session backends.

64. Differential Transformer
Interesting idea of modulating attention to a relative score, instead of absolute, in theory reducing attention towards irrelevant context.

65. Networks of Networks
Cool paper demonstrating how simple compound AI system designs (e.g. judge/verifier, best of K voting) can produce huge performance deltas.

66. Maestro
Netflix's JSON-based workflow orchestrator supporting DAG and cyclic graphs.

67. Resource Management for Aurora Serverless
Some interesting discussion on resource and memory management in Aurora Serverless.

68. OpenHouse
LinkedIn's open-source control plane/catalog for lakehouse architectures.

69. Exploiting Cloud Object Storage for High-Performance Analytics
Paper exploring optimal system design for querying cloud object stores efficiently.

70. The Architecture of Serverless Data Systems
Amazing six-part series on serverless data system design (Aurora, Dynamo, Neon, Kora, etc.).

71. Apple Intelligence Overview
Overview of Apple Intelligence system design & architecture.

72. Accelerating Code Migrations with AI
Google deep dive on techniques for AI-assisted codebase migration.

73. Efficient finetuning of Llama 3 with FSDP QDoRA
Answer.ai blog on their "continued pre-training" method, QDoRA.

74. Hybrid ML + Numerical Weather Model
Combines ML with traditional numerical weather prediction for long-range forecasts and uncertainty bounds.

75. Privacy in Public + Private Retrieval
Exploration of how to handle retrieval in AI systems assuming a mix of private + public data to retrieve over.

76. Dynamic Partitioning for Visualization
Techniques for dynamic partitioning in data visualization.

77. Draco 2
Renderer-agnostic data visualization format designed to allow for flexible encoding of visualization rules.

78. DynaVis
Cool idea to dynamically synthesize data visualization editor widgets based on the data visualization task.

79. Formalizing Visualization Design Knowledge as Constraints: Actionable and Extensible Models in Draco
Foundational paper on the Draco constraint-based visualization system.

80. SWE-Agent
Benchmark for coding agents.

81. In Defense of Dual-Encoders for Reranking
Explains why dual encoders underperform cross encoders and how to fix them.

82. Scaling Monosemanticity
Seminal Anthropic blog about how to identify semantic features and their associated neurons in deep neural networks.

83. GPUs Go Brrr
Discusses GPU kernel optimization and hardware-aware AI system design.

84. Genie
Super interesting idea of inferring not just videos but action-controllable worlds from input images. You train self-supervised on existing video data, and you learn both how to predict the next frame and the action that would have connected those two frames.

85. Foundation Models for Reasoning on Charts
Cool applications of foundation models for reasoning about data visualization charts.

86. Meerkat
Exploration of a dataframe library that natively supports unstructured data types.

87. Reka Core / Edge Tech Report
Technical report on Reka Core and Edge models.

88. GorillaX Exec Engine
Interesting tool-use runtime that natively supports various ideas like undo.

89. RAFT
Fine-tuning strategy to optimize for domain-specific RAG workflows.

90. FASTER
High-performance key-value store with an elegant log-structured design.

91. Garnet
Microsoft's high-performance KV store related to FASTER.

92. LIDA
System for AI-assisted data storytelling and visualization.

93. Mechanistic Design & Scaling of Hybrid Architectures
Interesting paper highlighting ways to mechanistically test models on small-scale, specific tasks in ways that predict scaling properties, allowing for many architectural approaches to be tested rapidly.

94. Lumiere
Video generative model with improved temporal–spatial consistency.

95. Large Sequence Models for Software Engineering
Code models trained on the software engineering process (reviews, debugging, etc.).

96. Scaling Data-Constrained Language Models
Discusses strategies for scaling models when high-quality data is limited.