Safe and Secure Generative AI
Publications
Prompt injection attacks and defenses
- Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong. "DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks". In IEEE Symposium on Security and Privacy, 2025.
- Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. "Optimization-based Prompt Injection Attack to LLM-as-a-Judge". In ACM Conference on Computer and Communications Security (CCS), 2024. Distinguished Paper Award.
- Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. "PLeak: Prompt Leaking Attacks against Large Language Model Applications". In ACM Conference on Computer and Communications Security (CCS), 2024.
- Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. "Formalizing and Benchmarking Prompt Injection Attacks and Defenses". In USENIX Security Symposium, 2024.
Detecting and attributing AI-generated content
- Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, and Neil Gong. "A Transfer Attack to Image Watermarks". In International Conference on Learning Representations (ICLR), 2025.
- Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, and Neil Zhenqiang Gong. "AudioMarkBench: Benchmarking Robustness of Audio Watermarking". In NeurIPS Datasets and Benchmarks, 2024.
- Zhengyuan Jiang, Moyang Guo, Yuepeng Hu, Jinyuan Jia, and Neil Zhenqiang Gong. "Certifiably Robust Image Watermark". In European Conference on Computer Vision (ECCV), 2024.
- Zhengyuan Jiang, Moyang Guo, Yuepeng Hu, and Neil Zhenqiang Gong. "Watermark-based Attribution of AI-Generated Content". In arXiv, 2024.
- Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, and Neil Gong. "Stable Signature is Unstable: Removing Image Watermark from Diffusion Models". In arXiv, 2024.
- Zhengyuan Jiang, Jinghuai Zhang, and Neil Zhenqiang Gong. "Evading Watermark based Detection of AI-Generated Content". In ACM Conference on Computer and Communications Security (CCS), 2023.
Preventing harmful content generation and jailbreaking
- Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Zhenqiang Gong. "GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis". In Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. "SneakyPrompt: Jailbreaking Text-to-image Generative Models". In IEEE Symposium on Security and Privacy (IEEE S&P), 2024.
Hallucinations
- Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. "Visual Hallucinations of Multi-modal Large Language Models". In Findings of the Association for Computational Linguistics (ACL Findings), 2024.
Robustness to common perturbations
- Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Neil Zhenqiang Gong, Yue Zhang, and Xing Xie. "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts". In arXiv, 2023.
Poisoning and backdoor attacks to embedding foundation models (e.g., CLIP) and defenses
These foundation models underpin many generative AI systems: for instance, the CLIP text encoder is used in text-to-image models, and the CLIP vision encoder is used in multi-modal LLMs (a minimal usage sketch follows this list).
- Hongbin Liu, Michael K Reiter, and Neil Zhenqiang Gong. "Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models". In USENIX Security Symposium, 2024.
- Jinghuai Zhang, Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. "Data Poisoning based Backdoor Attacks to Contrastive Learning". In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Hongbin Liu, Jinyuan Jia, and Neil Zhenqiang Gong. "PoisonedEncoder: Poisoning the Unlabeled Pre-training Data in Contrastive Learning". In USENIX Security Symposium, 2022.
- Jinyuan Jia, Yupei Liu, and Neil Zhenqiang Gong. "BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning". In IEEE Symposium on Security and Privacy, 2022.
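A minimal sketch of this reuse, assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint (the image path is a placeholder): the same pre-trained CLIP model supplies the text embeddings that text-to-image models condition on and the image features that multi-modal LLMs consume, which is why a poisoned or backdoored encoder affects every generative system built on top of it.

```python
# Sketch only: how one pre-trained CLIP checkpoint feeds two kinds of
# generative systems. Assumes Hugging Face `transformers` and Pillow.
from PIL import Image
from transformers import (CLIPTokenizer, CLIPTextModel,
                          CLIPImageProcessor, CLIPVisionModel)

ckpt = "openai/clip-vit-base-patch32"  # public checkpoint used for illustration

# Text encoder: text-to-image models (e.g., latent diffusion pipelines)
# condition image generation on these text embeddings.
tokenizer = CLIPTokenizer.from_pretrained(ckpt)
text_encoder = CLIPTextModel.from_pretrained(ckpt)
tokens = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
text_embeds = text_encoder(**tokens).last_hidden_state    # (1, seq_len, hidden)

# Vision encoder: multi-modal LLMs project these image features into the
# LLM's token space so the model can reason about the image.
processor = CLIPImageProcessor.from_pretrained(ckpt)
vision_encoder = CLIPVisionModel.from_pretrained(ckpt)
pixels = processor(images=Image.open("example.jpg"), return_tensors="pt")  # placeholder image
image_embeds = vision_encoder(**pixels).last_hidden_state  # (1, num_patches, hidden)

# A backdoor or poisoning attack on either encoder therefore propagates to
# every generative system that builds on it.
print(text_embeds.shape, image_embeds.shape)
```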
Intellectual property and data-use auditing in model training
We study intellectual property protection for both model providers and users. For model providers, we study attacks that steal foundation models and defenses against them. For users, we study auditing/tracing whether their data was used to pre-train foundation models (a generic illustration follows this list).
- Zonghao Huang, Neil Zhenqiang Gong, and Michael K Reiter. "A General Framework for Data-Use Auditing of ML Models". In ACM Conference on Computer and Communications Security (CCS), 2024.
- Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, and Bhuwan Dhingra. "ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods". In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
- Yupei Liu, Jinyuan Jia, Hongbin Liu, and Neil Zhenqiang Gong. "StolenEncoder: Stealing Pre-trained Encoders in Self-supervised Learning". In ACM Conference on Computer and Communications Security (CCS), 2022.
- Hongbin Liu*, Jinyuan Jia*, Wenjie Qu, and Neil Zhenqiang Gong. "EncoderMI: Membership Inference against Pre-trained Encoders in Contrastive Learning". In ACM Conference on Computer and Communications Security (CCS), 2021. *Equal contribution
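As a concrete illustration of the auditing side, the sketch below shows the generic likelihood signal that membership-inference-style data auditing builds on: text seen during training tends to receive a higher average log-likelihood under the model. It assumes the Hugging Face transformers library and the public gpt2 checkpoint, and it is only a generic baseline, not the methods of the papers above (e.g., ReCaLL or the data-use auditing framework).

```python
# Sketch of the generic likelihood-based membership signal used in data
# auditing; not the specific methods of the papers above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # public model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return -loss.item()

# Text the model saw during (pre-)training tends to score higher. A threshold
# calibrated on known non-member text turns the score into a membership call.
candidate = "The quick brown fox jumps over the lazy dog."
threshold = -3.5  # hypothetical value; must be calibrated per model and domain
print("likely member" if avg_log_likelihood(candidate) > threshold else "likely non-member")
```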
Talks
- Talks on safe and robust generative AI were given at Google, Microsoft Research Asia, the Privacy and Security in ML Seminars, the ICLR 2022 Workshop on Socially Responsible Machine Learning (SRML), etc. The talk given at Google is available on YouTube [here].
Code and Data
- [Code and data] for prompt injection attacks and defenses.
- [Code and data] for PromptBench.
- [Code and data] for watermark-based detection of AI-generated content.
- [Code and data] for BadEncoder.
Slides
- Safe and Robust Generative AI [Slides].
- Secure Foundation Models [Slides].
- Robustness of Watermark-based Detection of AI-generated Content [Slides].