Journal of People's Public Security University of China (Science and Technology)

2026, Vol. 32, No. 02, pp. 15-27

A Survey of Deepfake Face Detection

TIAN Huawei^1,2, ZHANG Teng^1, LI Gen^1, XIAO Yanhui^3

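1. School of Information Network Security, People's Public Security University of China
2. Laboratory for Digital-Intelligent Prevention and Control of Social Security Risks, People's Public Security University of China
3. School of National Security, People's Public Security University of China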

Foundation: "Double First-Class" Construction Project of People's Public Security University of China (2026SYL0113)

Release date: 2026-05-20

Publication date: 2026-05-20

Online publication date: 2026-05-20

Abstract:

In recent years, deepfake technology has developed rapidly, and the risks and challenges it poses are severe, which makes research on deepfake detection urgent. Focusing on face deepfakes and taking an adversarial perspective, this paper first analyzes the development of face deepfake generation and then systematically reviews research progress in deepfake face detection. First, for single-modal forgery detection, it traces the evolution of detection techniques for static images and dynamic video. Then, for multimodal detection, it analyzes and summarizes the current mainstream technical schemes and representative methods along three dimensions: modality decoupling, deep fusion, and modality consistency, and it explores research prospects in cross-modal alignment and feature fusion. Finally, it analyzes the main challenges facing existing techniques and outlines directions for future research, providing a foundation for subsequent work on face deepfake detection.

Keywords: deepfake; multimodal deepfake face detection; cross-modal alignment; diffusion model

For the full text, visit cnki.net

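Basic information:

CLC number: TP391.41; TP18

Citation:

[1] TIAN Huawei, ZHANG Teng, LI Gen, et al. A survey of deepfake face detection[J]. Journal of People's Public Security University of China (Science and Technology), 2026, 32(02): 15-27.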
