Bibliographic record
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
- Authors
- Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
- Publication year
- 2025
- OA status
- oa_green
Print
Need access?
Ask circulation staff for physical copies or request digital delivery via Ask a Librarian.
Digital copy
Unavailable in your region (PD status unclear).
Abstract
While Multimodal Large Language Models (MLLMs) excel at holistic
understanding, they struggle in capturing the dense world with complex scenes,
requiring fine-grained analysis of intricate details and object
inter-relationships. Region-level MLLMs have been a promising step. However,
previous attempts are generally optimized to understand given regions in
isolation, neglecting crucial global contexts. To address this, we introduce
Grasp Any Region (GAR) for comprehen- sive region-level visual understanding.
Empowered by an effective RoI-aligned feature replay technique, GAR supports
(1) precise perception by leveraging necessary global contexts, and (2)
modeling interactions between multiple prompts. Together, it then naturally
achieves (3) advanced compositional reasoning to answer specific free-form
questions about any region, shifting the paradigm from passive description to
active dialogue. Moreover, we construct GAR-Bench, which not only provides a
more accurate evaluation of single-region comprehension, but also, more
importantly, measures interactions and complex reasoning across multiple
regions. Extensive experiments have demonstrated that GAR-1B not only maintains
the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5
on DLC-Bench, but also excels at modeling relationships between multiple
prompts with advanced comprehension capabilities, even surpassing InternVL3-78B
on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms
in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong
capabilities can be easily transferred to videos.
understanding, they struggle in capturing the dense world with complex scenes,
requiring fine-grained analysis of intricate details and object
inter-relationships. Region-level MLLMs have been a promising step. However,
previous attempts are generally optimized to understand given regions in
isolation, neglecting crucial global contexts. To address this, we introduce
Grasp Any Region (GAR) for comprehen- sive region-level visual understanding.
Empowered by an effective RoI-aligned feature replay technique, GAR supports
(1) precise perception by leveraging necessary global contexts, and (2)
modeling interactions between multiple prompts. Together, it then naturally
achieves (3) advanced compositional reasoning to answer specific free-form
questions about any region, shifting the paradigm from passive description to
active dialogue. Moreover, we construct GAR-Bench, which not only provides a
more accurate evaluation of single-region comprehension, but also, more
importantly, measures interactions and complex reasoning across multiple
regions. Extensive experiments have demonstrated that GAR-1B not only maintains
the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B +4.5
on DLC-Bench, but also excels at modeling relationships between multiple
prompts with advanced comprehension capabilities, even surpassing InternVL3-78B
on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms
in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong
capabilities can be easily transferred to videos.
Copies & availability
Realtime status across circulation, reserve, and Filipiniana sections.
Self-checkout (no login required)
- Enter your student ID, system ID, or full name directly in the table.
- Provide your identifier so we can match your patron record.
- Choose Self-checkout to send the request; circulation staff are notified instantly.
| Barcode | Location | Material type | Status | Action |
|---|---|---|---|---|
| No holdings recorded. | ||||
Digital files
Preview digitized copies when embargo permits.
-
View digital file
original
APPLICATION/PDF · 6.51 MB
Links & eResources
Access licensed or open resources connected to this record.
- oa Direct