Long-Video Understanding · Multimodal Agents
Multimodal agent research for long-form video understanding
I am Meng Qianke, a master's student at Hangzhou Dianzi University. My work centers on memory organization, video condensation, multi-step reasoning, and evaluation harnesses for long-form video understanding, aiming to make multimodal agents more reliable over long temporal contexts.
Current focus: VideoARM, progressive video condensation, and long-video experiment loops
Research Focus
Long-Video Understanding, Multimodal Agents, and 3D Visual Grounding
My research focuses on multimodal large models and agent systems, especially long-form video understanding, video QA, hierarchical memory, MLLM-agent reasoning, and 2D-3D visual grounding. Recent outputs include CVPR 2026 and ICME 2026 papers on long-form video understanding and an arXiv preprint on 3D visual grounding.
Long-Video Understanding
Study video QA, event condensation, temporal memory, and multi-step reasoning for understanding long-duration video content.
MLLM Agent Systems
Build multimodal agents with tool use, memory management, planning, and environment interaction for complex vision-language tasks and research workflows.
3D Visual Grounding
Investigate 2D-3D mappings, zero-shot 3D visual grounding, and cross-view consistency for open-world spatial semantic understanding.
Research Output
- Paper · 2026
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding (CVPR 2026)
A hierarchical-memory and agentic-reasoning framework for long-form video understanding, accepted to CVPR 2026. The work focuses on information compression, memory organization, and multi-step reasoning for long-video QA.
- Paper · 2026
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding (ICME 2026)
A study on progressive video condensation and MLLM-agent reasoning for long-form video understanding, accepted to ICME 2026.
- Preprint · 2026
Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
A robust zero-shot 3D visual grounding method based on multiple consistent 2D-3D mappings. I am the third author; the paper is available on arXiv.
- Contest · 2025
Aesthetic Feature Modeling of Classical Jiangnan Gardens (National First Prize, China Graduate Mathematical Modeling Contest)
A mathematical modeling study of aesthetic features and spatial layout patterns in classical Jiangnan gardens, awarded National First Prize in the China Graduate Mathematical Modeling Contest.
Featured Projects
VideoARM
A long-form video understanding research project on hierarchical memory, agentic reasoning, and long-video QA, corresponding to the CVPR 2026 paper.
LongVideo Exploration
An active exploration line for long-video understanding, focusing on coarse event modeling, visual human evaluation, video memory, and multi-agent experiment loops.
VideoARM-MCP
An MCP service-wrapper exploration for connecting VideoARM-style long-video understanding capabilities to general agent workflows.
DingTalk GPU Monitor
A lightweight monitor for NVIDIA GPU utilization and memory that runs without admin privileges and sends alert notifications through DingTalk.
Experience
- Sep 2024 - Present
Master's Student · Hangzhou Dianzi University
- Computer Technology
- Media Intelligence Lab (MIL)
- Research interests: multimodal large models, agent systems, long-video understanding, and video QA
- Sep 2020 - Jun 2024
Undergraduate Student · Henan University
- Computer Science and Technology
- Bachelor of Engineering
Get in touch
Let's connect around multimodal research, collaborations, or personal projects.
Direct Contact

