| Downloads | Citations | Views |
| 149 | 0 | 59 |
Abstract: As large language models (LLMs) show emerging potential in the national security domain, this study constructs a systematic dataset and evaluation framework for the field and uses them to assess that potential. Because existing LLM evaluation systems struggle to capture high-level political concepts such as “strategic ambiguity”, a subjective/objective and real/generated consistency verification method is proposed. Knowledge from national security studies, intelligence science, and computer science is integrated to build a combined subjective-objective evaluation framework and a high-quality benchmark dataset. Experiments evaluate 10 open-source LLMs, with generated test data introduced for comparative analysis. Targeted assessments of key areas are also conducted for real-world scenarios such as China's “April 15 National Security Education Day”. The results show significant differences in question-answering ability across models, with Doubao-pro-32k/240615 achieving the highest score (85.4 out of 100). Models perform better on synthetic questions than on real ones, but their answers to subjective questions show problems such as insufficient content standardization. LLMs offer notable efficiency gains and decision-support value and show great potential in practical applications.
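Basic information:
Chinese Library Classification (CLC) number: TP18; D631
Citation:
[1] GENG Pengzhi, WANG Youya, LI Baiyang, et al. Research on the capabilities of large language models in the field of national security studies based on multi-dimensional evaluation [J]. Journal of People's Public Security University of China (Science and Technology), 2025, 31(04): 76-86.
Funding:
National Key Research and Development Program of China (2023YFC3321604)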
2025-05-16
2025
2026-01-07
2026
2
2025-11-15
2025-11-15