Detecting and Grounding Multi-Modal Media Manipulation (DGM
4
) is an emerging task that aims to identify and locate manipulated elements in both textual and visual media. Given the complexity of this task, the model requires more sophisticated reasoning capabilities to align multi-modal features and capture forgery traces. To this end, we propose a Concentrated reasoning and Unified reconstruction framework (CrUr) for DGM
4
. Instead of adhering to traditional hierarchical reasoning paradigms, we directly carry out all inference tasks using integrated multi-modal features. Specifically, we ex...