ThreatVisionEval: Evaluating Multimodal Large Language Models for Threat Modeling Architecture Diagram Understanding

Authors

  • Santosh Pai, Research Scholar, Institute of Computer Science and Information Science, Srinivas University, Mangalore, India
  • Srinivasa Rao Kunte R, Research Professor, Institute of Computer Science and Information Science, Srinivas University, Mangalore, India

DOI:

https://doi.org/10.55011/STAIQC.2025.5212

Keywords:

Threat modeling, Artificial intelligence, Machine learning, Cybersecurity

Abstract

Threat modeling is essential for identifying cybersecurity threats in software before implementation and is a cornerstone of secure development. Currently, most threat modeling tasks are performed manually: human practitioners often have to recreate threat modeling diagrams from architecture diagrams, leading to delays. In practice, business teams often provide diagrams with ambiguous or unclear details. Advances in Artificial Intelligence (AI) have enabled multimodal Large Language Models (LLMs) to process both text and images, allowing them to extract security-relevant information from raw architecture diagram images. However, existing benchmarks for these models primarily assess general visual reasoning rather than security-specific capabilities. Key elements for threat modeling, such as entities, assets, call flows, trust boundaries, threat actors, and security properties, are missing from current LLM benchmarks.

This paper introduces ThreatVisionEval, a conceptual evaluation framework for multimodal LLMs, i.e., AI models capable of analyzing both images and text, for vision-based threat modeling. The framework comprises five core elements: (1) a hierarchical task taxonomy covering element and security property detection; (2) a diagram variability model handling notation, clarity, completeness, domain, and complexity; (3) a ground-truth annotation schema based on STRIDE elements (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege); (4) a metric suite including per-category F1 scores, Flow-Triple F1 for flow extraction, Boundary-IoU for boundary detection, and the Threat-Modeling-Readiness Score (TMRS); and (5) a prompt library with example questions and instructions, covering zero-shot, chain-of-thought, security-persona, and few-shot protocols.
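As a minimal sketch of how one metric in the suite might be scored, the following assumes Flow-Triple F1 treats each extracted data flow as a (source, destination, label) triple and compares the predicted set against the ground-truth set; the exact triple format and matching rules (e.g., label normalization) are defined by the framework itself, so this is an illustrative interpretation, not the paper's reference implementation.

```python
def flow_triple_f1(predicted, gold):
    """Set-based F1 over (source, destination, label) flow triples.

    Assumed scoring: a predicted triple counts as a true positive only
    on an exact match with a ground-truth triple.
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)          # exactly matched flows
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one flow matched, one missed, one spurious.
gold = {("Browser", "WebServer", "HTTPS"), ("WebServer", "Database", "SQL")}
pred = {("Browser", "WebServer", "HTTPS"), ("WebServer", "Cache", "TCP")}
print(flow_triple_f1(pred, gold))  # 0.5
```

With one true positive out of two predictions and two gold flows, precision and recall are both 0.5, giving an F1 of 0.5; a per-category F1 over entities, assets, or boundaries could be computed the same way over the corresponding annotation sets.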

By incorporating these five core elements, ThreatVisionEval offers clear definitions, reproducible protocols, and a practical roadmap, enabling systematic comparison of vision-language models for automated threat modeling. The paper also presents a research agenda and recommends that the security and AI research communities adopt ThreatVisionEval as the standard for evaluating security-critical diagram understanding, accelerating progress and ensuring robust, consistent outcomes.

Published

2025-12-24
