<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>PaperTide - Efficient AI Daily</title>
<link>https://chenyuheee.github.io/PaperTide</link>
<description>Daily curated selection of high-quality papers in Efficient AI (architecture / compression / reasoning / infrastructure / algorithms)</description>
<language>zh-cn</language>
<lastBuildDate>Fri, 17 Apr 2026 06:50:28 +0000</lastBuildDate>
<atom:link href="https://chenyuheee.github.io/PaperTide/feed.xml" rel="self" type="application/rss+xml"/>
<item><title>Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective</title><link>http://arxiv.org/abs/2604.14025</link><description>This paper is a survey of feed-forward 3D scene modeling. Its core contribution is a novel taxonomy that is agnostic to the specific output representation and centered on model design strategies, organizing the field around five key problems including feature enhancement, geometry awareness, and model efficiency.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14025</guid><category>Efficient Training</category><category>Mixed Precision</category><category>NAS</category><category>Multimodal</category><category>inference</category></item>
<item><title>Seedance 2.0: Advancing Video Generation for World Complexity</title><link>http://arxiv.org/abs/2604.14148</link><description>Seedance 2.0 is a unified, efficient, large-scale multimodal architecture for joint audio-video generation. It supports four input modalities (text, image, audio, and video) and achieves across-the-board improvements on the key sub-dimensions of video and audio generation.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14148</guid><category>Multimodal</category><category>Efficient Training</category><category>Mixed Precision</category><category>infra</category></item>
<item><title>SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</title><link>http://arxiv.org/abs/2604.14144</link><description>This paper proposes the SpatialEvo framework, which constructs Deterministic Geometric Environments (DGE) to produce noise-free interactive supervision signals, enabling self-evolving 3D spatial reasoning while avoiding the pseudo-label error accumulation that plagues conventional self-evolution methods.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14144</guid><category>Reasoning</category><category>Efficient Training</category><category>algorithm</category></item>
<item><title>Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</title><link>http://arxiv.org/abs/2604.14004</link><description>This paper proposes Memory Transfer Learning (MTL), which builds a unified memory pool across heterogeneous programming tasks so that coding agents can transfer meta-knowledge (e.g., verification routines), yielding an average performance gain of 3.7% and revealing that the level of abstraction is the key principle governing transfer effectiveness.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14004</guid><category>Agents</category><category>Code</category><category>Memory Offloading</category><category>infra</category></item>
<item><title>From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space</title><link>http://arxiv.org/abs/2604.14142</link><description>This paper proposes PreRL and Dual Space RL (DSRL), which strengthen the reasoning ability of large language models by directly optimizing the marginal distribution P(y) in the pre-training space, and use a Negative Sample Reinforcement (NSR) mechanism to effectively prune the erroneous reasoning space, ultimately achieving better reasoning performance than conventional reinforcement learning.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14142</guid><category>Efficient Training</category><category>Reasoning</category><category>compression</category></item>
<item><title>Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself</title><link>http://arxiv.org/abs/2604.14048</link><description>This paper proposes the Free Geometry framework, which lets feed-forward 3D reconstruction models rapidly self-calibrate at test time via lightweight LoRA updates, exploiting multi-view consistency to improve reconstruction accuracy without any 3D ground truth.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14048</guid><category>LoRA/PEFT</category><category>Efficient Training</category><category>algorithm</category></item>
<item><title>LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling</title><link>http://arxiv.org/abs/2604.11748</link><description>LangFlow combines continuous diffusion with flow matching and introduces a new NLL bound, an information-uniform noise schedule, and self-conditioned training, making continuous diffusion language models match or even surpass discrete diffusion models in perplexity and generation quality for the first time, demonstrating the potential of continuous diffusion for language modeling.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11748</guid><category>Efficient Training</category><category>Mixed Precision</category><category>compression</category></item>
<item><title>TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration</title><link>http://arxiv.org/abs/2604.14116</link><description>This paper proposes TREX, a method that automates the full LLM fine-tuning pipeline via a multi-agent system with tree-based search, and builds the FT-Bench benchmark for evaluation.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14116</guid><category>NAS</category><category>Efficient Training</category><category>Agents</category><category>compression</category></item>
<item><title>TIP: Token Importance in On-Policy Distillation</title><link>http://arxiv.org/abs/2604.14084</link><description>This paper proposes TIP, a method for identifying important tokens in on-policy distillation. By combining student-model entropy with teacher-student divergence to select the most informative tokens for training, it matches or surpasses full-token training while using only a small fraction of the tokens, significantly reducing memory consumption.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14084</guid><category>Distillation</category><category>Efficient Training</category><category>compression</category></item>
<item><title>MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments</title><link>http://arxiv.org/abs/2604.13418</link><description>This paper proposes MERRIN, a benchmark for multimodal evidence retrieval and reasoning that evaluates the ability of AI agents to perform multi-hop reasoning and cross-modal retrieval in noisy web environments. Its core contribution is a challenging test set featuring modality-unspecified queries, often-overlooked modalities such as video and audio, and complex conflicting evidence, exposing significant weaknesses of current search agents in cross-modal reasoning and noise handling.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13418</guid><category>RAG</category><category>Multimodal</category><category>Evaluation</category><category>Reasoning</category><category>compression</category></item>
<item><title>Geometric Context Transformer for Streaming 3D Reconstruction</title><link>http://arxiv.org/abs/2604.14141</link><description>Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14141</guid><category>Evaluation</category><category>inference</category></item>
<item><title>ROSE: Retrieval-Oriented Segmentation Enhancement</title><link>http://arxiv.org/abs/2604.14147</link><description>Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model's knowledge but demand up-to-date external…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.14147</guid><category>Edge/Mobile</category><category>RAG</category><category>Multimodal</category><category>Evaluation</category><category>infra</category></item>
<item><title>KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance</title><link>http://arxiv.org/abs/2604.12627</link><description>This paper proposes the KnowRL framework, which casts prompt design as a minimal-sufficient guidance problem and uses atomic knowledge points with constrained subset search to alleviate reward sparsity in reinforcement learning, significantly improving the reasoning ability of large language models.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12627</guid><category>Reasoning</category><category>Efficient Training</category><category>compression</category></item>
<item><title>Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</title><link>http://arxiv.org/abs/2604.13016</link><description>This paper systematically studies the dynamics and mechanisms of on-policy distillation (OPD) for large language models, identifies two key conditions behind its success or failure, and proposes two practical strategies for rescuing failed distillation runs.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13016</guid><category>Distillation</category><category>compression</category></item>
<item><title>Toward Autonomous Long-Horizon Engineering for ML Research</title><link>http://arxiv.org/abs/2604.13018</link><description>This paper presents AiScientist, a system that combines hierarchical orchestration with file-based persistent state management to enable autonomous long-horizon machine learning research engineering, significantly improving performance on the relevant benchmarks.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13018</guid><category>Agents</category><category>Evaluation</category><category>Efficient Training</category><category>algorithm</category></item>
<item><title>Lyra 2.0: Explorable Generative 3D Worlds</title><link>http://arxiv.org/abs/2604.13036</link><description>Lyra 2.0 proposes a framework for generating large-scale, explorable 3D worlds. It routes information using per-frame 3D geometry to address spatial forgetting and applies self-augmented history training to correct temporal drift, producing long-range, 3D-consistent video trajectories from which high-quality 3D scenes can be reconstructed.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13036</guid><category>Long Context</category><category>Efficient Training</category><category>Multimodal</category><category>compression</category></item>
<item><title>Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning</title><link>http://arxiv.org/abs/2604.12374</link><description>This paper introduces Nemotron 3 Super, a 120B-parameter hybrid Mamba-attention mixture-of-experts model. Its core contributions are a new LatentMoE architecture that optimizes FLOP and parameter efficiency, and integrated MTP layers for native speculative decoding, yielding a significant boost in inference throughput.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12374</guid><category>MoE</category><category>Mamba/SSM</category><category>Speculative Decoding</category><category>Quantization</category><category>architecture</category></item>
<item><title>Towards Long-horizon Agentic Multimodal Search</title><link>http://arxiv.org/abs/2604.12890</link><description>This paper proposes LMM-Searcher, a long-horizon multimodal deep-search framework. Its core contribution is a file-based visual representation mechanism that offloads visual resources to an external file system and maps them to lightweight text identifiers, managing heterogeneous information in long-sequence tasks while reducing context overhead. It also introduces a data synthesis pipeline for generating complex cross-modal multi-hop reasoning queries and fine-tunes the model by distilling high-quality trajectories, achieving state-of-the-art results on multiple benchmarks.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12890</guid><category>Long Context</category><category>Multimodal</category><category>RAG</category><category>Memory Offloading</category><category>infra</category></item>
<item><title>Self-Adversarial One Step Generation via Condition Shifting</title><link>http://arxiv.org/abs/2604.12322</link><description>This paper proposes the APEX framework, which extracts adversarial correction signals endogenously from a flow model via condition shifting, enabling single-step high-quality image generation without an external discriminator. It preserves the original architecture, is compatible with both full-parameter and LoRA fine-tuning, and significantly improves training efficiency and inference speed.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12322</guid><category>Efficient Training</category><category>LoRA/PEFT</category><category>Mixed Precision</category><category>compression</category></item>
<item><title>Generative Refinement Networks for Visual Synthesis</title><link>http://arxiv.org/abs/2604.13030</link><description>This paper proposes Generative Refinement Networks (GRN), which use Hierarchical Binary Quantization (HBQ) to resolve the discretization bottleneck of autoregressive models, and introduce a global refinement mechanism with entropy-guided sampling for complexity-aware, adaptive-step generation, setting new records on image reconstruction and generation tasks.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13030</guid><category>Quantization</category><category>Efficient Training</category><category>compression</category></item>
<item><title>Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation</title><link>http://arxiv.org/abs/2604.13010</link><description>This paper proposes Lightning OPD, an offline on-policy distillation framework that precomputes the teacher model's log-probabilities on supervised fine-tuning trajectories to ensure teacher consistency. It matches standard online on-policy distillation without requiring a live teacher server, significantly improving the efficiency of post-training for large reasoning models.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13010</guid><category>Distillation</category><category>Efficient Training</category><category>compression</category></item>
<item><title>Accelerating Speculative Decoding with Block Diffusion Draft Trees</title><link>http://arxiv.org/abs/2604.12989</link><description>Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12989</guid><category>Speculative Decoding</category><category>inference</category></item>
<item><title>GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts</title><link>http://arxiv.org/abs/2604.12978</link><description>Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12978</guid><category>RAG</category><category>Multimodal</category><category>Code</category><category>Evaluation</category><category>algorithm</category></item>
<item><title>Parcae: Scaling Laws For Stable Looped Language Models</title><link>http://arxiv.org/abs/2604.12946</link><description>Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization (at the expense of a higher memory footprint) or more data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looped…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12946</guid><category>Long Context</category><category>algorithm</category></item>
<item><title>Learning Versatile Humanoid Manipulation with Touch Dreaming</title><link>http://arxiv.org/abs/2604.13015</link><description>Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes. In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. Built on this controller, we develop a whole-body humanoid…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13015</guid><category>Serving</category><category>RAG</category><category>Multimodal</category><category>Code</category><category>inference</category></item>
<item><title>VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization</title><link>http://arxiv.org/abs/2604.12887</link><description>Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12887</guid><category>Efficient Training</category><category>Code</category><category>compression</category></item>
<item><title>SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding</title><link>http://arxiv.org/abs/2604.13023</link><description>Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13023</guid><category>Multimodal</category><category>Code</category><category>Evaluation</category><category>infra</category></item>
<item><title>Exploration and Exploitation Errors Are Measurable for Language Model Agents</title><link>http://arxiv.org/abs/2604.13151</link><description>This paper proposes a method for systematically distinguishing and quantifying the exploration and exploitation errors of language model agents from observed behavior alone, without access to the agent's internal policy. Its core contributions are a set of controlled environments and a policy-agnostic evaluation metric for measuring exploration and exploitation ability in partially observable settings, along with the finding that current frontier models still struggle on this task.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13151</guid><category>Evaluation</category><category>Agents</category><category>Reasoning</category><category>compression</category></item>
<item><title>Mobile GUI Agents under Real-world Threats: Are We There Yet?</title><link>http://arxiv.org/abs/2507.04227</link><description>Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and there are already several commercial agents released and used by early adopters. However, are we really ready for GUI agents integrated into our daily devices…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2507.04227</guid><category>Edge/Mobile</category><category>RAG</category><category>Evaluation</category><category>Agents</category><category>infra</category></item>
<item><title>InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis</title><link>http://arxiv.org/abs/2604.13201</link><description>Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.13201</guid><category>Edge/Mobile</category><category>RAG</category><category>Reasoning</category><category>Evaluation</category><category>algorithm</category></item>
<item><title>The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</title><link>http://arxiv.org/abs/2604.11297</link><description>This paper proposes MEDS, a memory-enhanced dynamic reward shaping framework that uses historical behavior signals to identify and penalize recurring error patterns, improving sampling diversity and model performance in reinforcement learning.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11297</guid><category>Efficient Training</category><category>Reasoning</category><category>compression</category></item>
<item><title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</title><link>http://arxiv.org/abs/2604.11804</link><description>This paper proposes the OmniShow framework, which uses unified channel conditioning and a gated local-context attention mechanism to resolve the controllability-quality trade-off in human-object interaction video generation under multimodal conditions, and designs a staged joint training strategy to cope with data scarcity.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11804</guid><category>Multimodal</category><category>Efficient Training</category><category>Mixed Precision</category><category>infra</category></item>
<item><title>Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</title><link>http://arxiv.org/abs/2604.10949</link><description>This paper proposes an information-theoretic probing framework that reveals a "pseudo-unification" phenomenon in unified multimodal models (UMMs): visual and language encodings follow different information-entropy trajectories, causing text-generation and image-synthesis behaviors to diverge. The study suggests that genuine multimodal synergy requires consistent information flow, not merely shared parameters.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.10949</guid><category>Multimodal</category><category>Evaluation</category><category>infra</category></item>
<item><title>CodeTracer: Towards Traceable Agent States</title><link>http://arxiv.org/abs/2604.11641</link><description>This paper proposes CodeTracer, a traceable state architecture for coding agents that parses runtime artifacts, reconstructs state-transition histories, and localizes fault origins, addressing the difficulty of debugging agents.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11641</guid><category>Agents</category><category>Evaluation</category><category>Code</category><category>infra</category></item>
<item><title>Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music</title><link>http://arxiv.org/abs/2604.10905</link><description>This paper proposes Audio Flamingo Next, a next-generation large audio-language model that introduces long-audio support (up to 30 minutes) and a new temporal audio chain-of-thought reasoning paradigm, significantly improving performance on speech, environmental sound, and music understanding and reasoning tasks.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.10905</guid><category>Long Context</category><category>Multimodal</category><category>Reasoning</category><category>infra</category></item>
<item><title>Introspective Diffusion Language Models</title><link>http://arxiv.org/abs/2604.11035</link><description>This paper proposes the Introspective Diffusion Language Model (I-DLM), whose novel introspective strided decoding (ISD) algorithm inherits the introspective consistency of autoregressive training while retaining the parallel-decoding advantage of diffusion models, matching the quality of same-scale autoregressive models. On the systems side, it builds an inference engine on top of autoregressive optimizations with a customized static batch scheduler, significantly improving throughput in high-concurrency serving scenarios.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11035</guid><category>Serving</category><category>Batching</category><category>Efficient Decoding</category><category>inference</category></item>
<item><title>Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks</title><link>http://arxiv.org/abs/2604.11753</link><description>This paper proposes AggAgent, a parallel test-time scaling method for long-horizon agentic tasks that uses a dedicated aggregation agent to efficiently consolidate multiple task trajectories generated in parallel, significantly outperforming existing aggregation methods on multiple benchmarks with minimal extra overhead.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11753</guid><category>Agents</category><category>Efficient Decoding</category><category>Long Context</category><category>algorithm</category></item>
<item><title>Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization</title><link>http://arxiv.org/abs/2604.11259</link><description>This paper proposes Trajectory Induced Preference Optimization (TIPO), a method that addresses the optimization instability caused by the structural heterogeneity of execution trajectories in privacy-personalization tasks for mobile GUI agents. Through preference-strength weighting and a padding-gating mechanism, it significantly improves the agent's alignment with, and discrimination of, user privacy preferences while preserving task-execution ability.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11259</guid><category>Agents</category><category>Evaluation</category><category>Efficient Training</category><category>inference</category></item>
<item><title>From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models</title><link>http://arxiv.org/abs/2604.09459</link><description>This paper presents a systematic survey of credit assignment in reinforcement learning for large language models, proposing a two-dimensional taxonomy spanning reasoning RL and agentic RL, and contributing reusable resources including a structured paper list, a reporting checklist, and a benchmark protocol.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.09459</guid><category>Reasoning</category><category>Agents</category><category>Evaluation</category><category>compression</category></item>
<item><title>General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks</title><link>http://arxiv.org/abs/2604.11778</link><description>This paper proposes General365, a benchmark designed to evaluate the reasoning ability of large language models in general scenarios detached from specialized expertise, revealing that current models still fall significantly short in general reasoning.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11778</guid><category>Evaluation</category><category>Reasoning</category><category>algorithm</category></item>
<item><title>Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach</title><link>http://arxiv.org/abs/2604.11547</link><description>While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning, then conduct reinforcement learning (RL). These methods exhibit limited improvement on underrepresented domains like rare diseases while incurring substantial costs from generating complex reasoning…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11547</guid><category>Distillation</category><category>Edge/Mobile</category><category>Reasoning</category><category>Code</category><category>Evaluation</category><category>compression</category></item>
<item><title>SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context</title><link>http://arxiv.org/abs/2604.11716</link><description>Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and "Lost-in-the-Middle" degradation, while discarding it would force the agent to…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11716</guid><category>Edge/Mobile</category><category>Reasoning</category><category>Code</category><category>Agents</category><category>compression</category></item>
<item><title>Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration</title><link>http://arxiv.org/abs/2604.11446</link><description>Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities, which requires guiding the model through extensive exploration and learning, leading to substantial computational overhead that has become a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11446</guid><category>LoRA/PEFT</category><category>Code</category><category>compression</category></item>
<item><title>Learning Long-term Motion Embeddings for Efficient Kinematics Generation</title><link>http://arxiv.org/abs/2604.11737</link><description>Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11737</guid><category>Efficient AI</category><category>compression</category></item>
<item><title>Panoptic Pairwise Distortion Graph</title><link>http://arxiv.org/abs/2604.11004</link><description>In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole-image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11004</guid><category>Mamba/SSM</category><category>Multimodal</category><category>Evaluation</category><category>architecture</category></item>
<item><title>ADD for Multi-Bit Image Watermarking</title><link>http://arxiv.org/abs/2604.11491</link><description>As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11491</guid><category>RAG</category><category>Code</category><category>Evaluation</category><category>algorithm</category></item>
<item><title>How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models</title><link>http://arxiv.org/abs/2604.04385</link><description>This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, but interchange testing (p&lt;0.001) and knockout cascade confirm it is causally necessary. Interchange screening…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.04385</guid><category>Evaluation</category><category>Safety</category><category>infra</category></item>
<item><title>SHARE: Social-Humanities AI for Research and Education</title><link>http://arxiv.org/abs/2604.11152</link><description>This intermediate technical report introduces the SHARE family of base models and the MIRROR user interface. The SHARE models are the first causal language models fully pretrained by and for the social sciences and humanities (SSH). Their performance in modelling SSH texts is close to that of general-purpose models (Phi-4) which use 100 times more tokens, as shown by our custom SSH Cloze benchmark. The MIRROR user interface is designed for reviewing text inputs from the SSH disciplines…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11152</guid><category>Serving</category><category>Evaluation</category><category>inference</category></item>
<item><title>ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</title><link>http://arxiv.org/abs/2604.11784</link><description>This paper proposes ClawGUI, an open-source full-stack framework for training, evaluating, and deploying GUI agents, addressing the field's missing infrastructure, inconsistent evaluation standards, and the difficulty of deploying models on real devices.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11784</guid><category>Serving</category><category>Evaluation</category><category>Edge/Mobile</category><category>Efficient Training</category><category>infra</category></item>
<item><title>You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass</title><link>http://arxiv.org/abs/2604.10966</link><description>This paper proposes a multi-response discriminative reward model that scores all candidate responses in a single forward pass, achieving an N-fold inference speedup and compute reduction, and attaining state-of-the-art results on multiple multimodal reward benchmarks.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.10966</guid><category>Efficient Training</category><category>LoRA/PEFT</category><category>Evaluation</category><category>compression</category></item>
<item><title>Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding</title><link>http://arxiv.org/abs/2604.11177</link><description>We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene…</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11177</guid><category>RAG</category><category>Reasoning</category><category>Multimodal</category><category>Evaluation</category><category>compression</category></item>
<item><title>LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety</title><link>http://arxiv.org/abs/2604.12710</link><description>Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representa</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12710</guid><category>RAG</category><category>Safety</category><category>algorithm</category></item>
<item><title>Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models</title><link>http://arxiv.org/abs/2604.12119</link><description>Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and </description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12119</guid><category>Serving</category><category>Multimodal</category><category>Code</category><category>Evaluation</category><category>inference</category></item>
<item><title>Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions</title><link>http://arxiv.org/abs/2604.11579</link><description>We address the problem of tactile localization, where the goal is to identify image regions that share the same material properties as a tactile input. Existing visuo-tactile methods rely on global alignment and thus fail to capture the fine-grained local correspondences required for this task. The challenge is amplified by existing datasets, which predominantly contain close-up, low-diversity images. We propose a model that learns local visuo-tactile alignment via dense cross-modal feature inte</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11579</guid><category>Quantization</category><category>Long Context</category><category>Evaluation</category><category>Safety</category><category>infra</category></item>
<item><title>3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis</title><link>http://arxiv.org/abs/2604.11211</link><description>Real-time free-viewpoint rendering requires balancing multi-camera redundancy with the latency constraints of interactive applications. We address this challenge by combining lightweight geometry with learning and propose 3DTV, a feedforward network for real-time sparse-view interpolation. A Delaunay-based triplet selection ensures angular coverage for each target view. Building on this, we introduce a pose-aware depth module that estimates a coarse-to-fine depth pyramid, enabling efficient feat</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11211</guid><category>Pruning</category><category>RAG</category><category>algorithm</category></item>
<item><title>RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</title><link>http://arxiv.org/abs/2604.11626</link><description>This paper proposes RationalRewards, a reward model that improves visual generation models by producing structured, multi-dimensional reasoning rather than a single score. Its core contributions: 1) at training time, it provides fine-grained, interpretable rewards for reinforcement learning; 2) at inference time, a generate-critique-refine loop improves output quality without updating model parameters. The paper also proposes PARROT, a training framework that requires no expensive annotations.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11626</guid><category>Reasoning</category><category>Evaluation</category><category>Efficient Training</category><category>compression</category></item>
<item><title>Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure</title><link>http://arxiv.org/abs/2604.11045</link><description>Sema Code is an embeddable, pluggable AI coding framework. Its core contribution is fully decoupling the agent reasoning engine from the client layer, turning the agent into a programmable shared infrastructure, with mechanisms such as multi-tenant isolation and adaptive context compression supporting cross-platform deployment.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11045</guid><category>Serving</category><category>Agents</category><category>Code</category><category>infra</category></item>
<item><title>Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision</title><link>http://arxiv.org/abs/2604.12002</link><description>Proposes Self-Distillation Zero, in which the model acts as both generator and reviser, converting sparse binary rewards into dense token-level self-supervision. This significantly improves training sample efficiency without requiring external teachers or high-quality demonstration data.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.12002</guid><category>Distillation</category><category>Efficient Training</category><category>compression</category></item>
<item><title>Anthropogenic Regional Adaptation in Multimodal Vision-Language Model</title><link>http://arxiv.org/abs/2604.11490</link><description>This paper proposes the new paradigm of anthropogenic regional adaptation and the GG-EZ method, which optimizes the cultural relevance of multimodal vision-language models for specific regions (such as Southeast Asia) while preserving their global generalization ability; effectiveness is validated across multiple model architectures.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11490</guid><category>Multimodal</category><category>Evaluation</category><category>RAG</category><category>inference</category></item>
<item><title>Narrative-Driven Paper-to-Slide Generation via ArcDeck</title><link>http://arxiv.org/abs/2604.11969</link><description>Proposes ArcDeck, a multi-agent framework that explicitly models a paper's logical flow by building a discourse tree and a global commitment document, reframing paper-to-slides generation as structured narrative reconstruction, and introduces ArcBench, a new evaluation benchmark.</description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11969</guid><category>RAG</category><category>Agents</category><category>Evaluation</category><category>infra</category></item>
<item><title>HDR Video Generation via Latent Alignment with Logarithmic Encoding</title><link>http://arxiv.org/abs/2604.11788</link><description>High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured </description><pubDate></pubDate><guid isPermaLink="true">http://arxiv.org/abs/2604.11788</guid><category>RAG</category><category>Code</category><category>Safety</category><category>compression</category></item>
</channel>
</rss>
