📝 Publications

Neural Residual Diffusion Models for Deep Scalable Vision Generation
Zhiyuan Ma, Liangliang Zhao, Biqing Qi, Bowen Zhou
- A simple yet meaningful change to the common architecture of deep generative networks: a series of learnable gated residual parameters that conform to the generative dynamics, facilitating effective denoising and dynamical isometry and enabling the stable training of extremely deep networks.

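Safe-SD: Safe and Traceable Stable Diffusion with Text Prompt Trigger for Invisible Generative Watermarking
Zhiyuan Ma, Guoli Jia, Biqing Qi, Bowen Zhou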
- A safe and highly traceable Stable Diffusion framework (Safe-SD) that adaptively implants graphical watermarks (e.g., QR codes) into imperceptible structure-related pixels.

LMD: faster image reconstruction with latent masking diffusion
Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Bowen Zhou
- A simple but faster image reconstruction framework, Latent Masking Diffusion (LMD), which builds on diffusion probabilistic models (DPMs) and masked autoencoders (MAEs).

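AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
Zhiyuan Ma, Guoli Jia, Bowen Zhou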
- A spatio-temporal guided adaptive editing algorithm that realizes adaptive image editing through a soft-attention strategy, dynamically varying how strongly the editing conditions guide the visual pixels from both temporal and spatial perspectives.

Generative multi-modal knowledge retrieval with large language models
Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, et al.
- An end-to-end generative framework for multi-modal knowledge retrieval that takes advantage of the fact that LLMs can effectively serve as virtual knowledge bases, even when trained with limited data.

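HybridPrompt: bridging language models and human priors in prompt tuning for visual question answering
Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li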
- A cloze- and verify-style hybrid prompt framework that bridges language models and human priors in prompt tuning for VQA.

Cmal: A novel cross-modal associative learning framework for vision-language pre-training
Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li
- A novel cross-modal associative learning model that combines anchor point detection with cross-modal associative learning for vision-language pre-training.

GLAF: global-to-local aggregation and fission network for semantic level fact verification
Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li
- A fresh perspective on the fact verification task: a novel Global-to-Local Aggregation and Fission Network (GLAF) that captures latent logical relations hidden in evidence clues for more accurate fact verification.

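UniTranSeR: A unified transformer semantic representation framework for multimodal task-oriented dialog system
Zhiyuan Ma, Jianjun Li, Guohui Li, Yongjing Cheng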
- A unified (vision, language, knowledge, ...) Transformer semantic representation framework with feature alignment and intention reasoning, referred to as UniTranSeR, for multimodal task-oriented dialog systems.

Intention reasoning network for multi-domain end-to-end task-oriented dialogue
Zhiyuan Ma, Jianjun Li, Zezheng Zhang, Guohui Li, Yongjing Cheng
- A novel intention mechanism to better model deterministic entity knowledge for joint and multi-hop reasoning in multi-domain end-to-end task-oriented dialogue.

📝 Selected Papers
NeurIPS 2024
Neural Residual Diffusion Models for Deep Scalable Vision Generation. Zhiyuan Ma, Liangliang Zhao, Biqing Qi, Bowen Zhou. NeurIPS 2024
(Spotlight) Ultramedical: Building specialized generalists in biomedicine. Kaiyan Zhang, Sihang Zeng, Ermo Hua, Ning Ding, Zhang-Ren Chen, Zhiyuan Ma, et al. NeurIPS 2024
Exploring Adversarial Robustness of Deep State Space Models. Biqing Qi, Yang Luo, Junqi Gao, Pengfei Li, Kai Tian, Zhiyuan Ma, et al. ACM MM 2024
Safe-SD: Safe and Traceable Stable Diffusion with Text Prompt Trigger for Invisible Generative Watermarking. Zhiyuan Ma, Guoli Jia, et al. AAAI 2024
LMD: faster image reconstruction with latent masking diffusion. Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Bowen Zhou. AAAI 2024
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing. Zhiyuan Ma, Guoli Jia, Bowen Zhou. AAAI 2024
Generative multi-modal knowledge retrieval with large language models. Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, et al. AAAI 2023
(Oral) HybridPrompt: bridging language models and human priors in prompt tuning for visual question answering. Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li. ACM MM 2022
(Oral) Cmal: A novel cross-modal associative learning framework for vision-language pre-training. Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li. COLING 2022
GLAF: global-to-local aggregation and fission network for semantic level fact verification. Zhiyuan Ma, Zhihuan Yu, Jianjun Li, Guohui Li. ACL 2022
UniTranSeR: A unified transformer semantic representation framework for multimodal task-oriented dialog system. Zhiyuan Ma, Jianjun Li, Guohui Li, Yongjing Cheng. EMNLP 2021
Intention reasoning network for multi-domain end-to-end task-oriented dialogue. Zhiyuan Ma, Jianjun Li, Zezheng Zhang, Guohui Li, Yongjing Cheng.