Long-Video Understanding · Multimodal Agents

Multimodal agent research for long-form video understanding

I am Meng Qianke, a master's student at Hangzhou Dianzi University. My work centers on memory organization, video condensation, multi-step reasoning, and evaluation harnesses for long-form video understanding, aiming to make multimodal agents more reliable over long temporal contexts.

Current focus: VideoARM, progressive video condensation, and long-video experiment loops

VideoARM · Hierarchical memory reasoning
Video Condensation · Progressive video condensation
Harness · Long-video experiment tooling

Research Focus

Long-Video Understanding, Multimodal Agents, and 3D Visual Grounding

My research focuses on multimodal large models and agent systems, especially long-form video understanding, video QA, hierarchical memory, MLLM-agent reasoning, and 2D-3D visual grounding. Recent outputs include CVPR 2026 and ICME 2026 papers on long-form video understanding and an arXiv preprint on 3D visual grounding.

Long-Video Understanding

Study video QA, event condensation, temporal memory, and multi-step reasoning for understanding long-duration video content.

MLLM Agent Systems

Build multimodal agents with tool use, memory management, planning, and environment interaction for complex vision-language tasks and research workflows.

3D Visual Grounding

Investigate 2D-3D mappings, zero-shot 3D visual grounding, and cross-view consistency for open-world spatial semantic understanding.

Research Output

  • Paper · 2026

    VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding (CVPR 2026)

    A hierarchical-memory and agentic-reasoning framework for long-form video understanding, accepted to CVPR 2026. The work focuses on information compression, memory organization, and multi-step reasoning for long-video QA.

    Long-Form Video Understanding · Agentic Reasoning · Hierarchical Memory · Video QA
  • Paper · 2026

    Progressive Video Condensation with MLLM Agent for Long-form Video Understanding (ICME 2026)

    A study on progressive video condensation and MLLM-agent reasoning for long-form video understanding, accepted to ICME 2026.

    Long-Form Video Understanding · MLLM Agent · Video Condensation · Multimodal
  • Preprint · 2026

    Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding

    A robust zero-shot 3D visual grounding method based on multiple consistent 2D-3D mappings. Qianke Meng is the third author; the paper is available on arXiv.

    3D Visual Grounding · 2D-3D Mapping · Zero-Shot · Multimodal
  • Contest · 2025

    Aesthetic Feature Modeling of Classical Jiangnan Gardens (National First Prize, China Graduate Mathematical Modeling Contest)

    A mathematical modeling study of aesthetic features and spatial layout patterns in classical Jiangnan gardens, awarded National First Prize in the China Graduate Mathematical Modeling Contest.

    Mathematical Modeling · Aesthetic Analysis · Spatial Modeling

Experience

  1. Master's Student · Hangzhou Dianzi University

    • Computer Technology
    • Media Intelligence Lab (MIL)
    • Research interests: multimodal large models, agent systems, long-video understanding, and video QA
    Sep 2024 - Present
  2. Undergraduate Student · Henan University

    • Computer Science and Technology
    • Bachelor of Engineering
    Sep 2020 - Jun 2024

Get in touch

Let's connect around multimodal research, collaborations, or personal projects.

Scan to Connect

WeChat: scan the QR code to add
Xiaohongshu: scan the QR code to add