今日从 arXiv 订阅中筛选 10 篇论文。

⚡ Scaling World-Model RL Through Diffusion Policy Optimization

用扩散策略优化统一世界模型中的搜索与价值学习,解决模型偏差和错误累积问题。

Scaling World-Model RL Through Diffusion Policy Optimization

⚡ InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

交错视觉-文本思维链 (VT-CoT) + 自修正草图,提升 VLM 多轮视觉推理深度。

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

⚡ How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

揭示 VLM 在跨视角空间推理中"用语言思考而非真正视觉思考"的局限。

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

⚡ E3C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

可控的第一人称视频生成,结合 3D 点云环境记忆与人体姿态控制。

E3C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

⚡ D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

3DGS 做 VLN 语义建图,开放集语义分组替代稠密特征采样。

⚡ DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation

自适应帧采样密度作为原生 token,配合 SD-GRPO 实现单步多粒度证据获取。

⚡ LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

并行 box 解码替代序列化坐标生成,提升 grounding 吞吐与精度。

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

⚡ Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

发现 VideoLLM 存在"事件袋"行为:跨段实体关联失败、幻觉交互。

⚡ Sentinel: Embodied Cooperative Spatial Reasoning and Planning

去中心化具身智能体在城市场景中的协同空间推理与重新规划。

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

⚡ Adaptation-Free Heterogeneous Collaborative Perception with Unseen Agent Configurations

零适配异构协同感知,box 级消息转 ego 兼容特征,仅需 120 字节/帧。


自动生成于 2026-05-27 · 基于 arXiv Daily Digest