Trustworthy AI for Cybersecurity
Machine learning models are now deeply integrated into critical infrastructure. As these systems make increasingly consequential decisions, their security becomes a key requirement. Over the last 20 years, research in Adversarial Machine Learning has uncovered a wide spectrum of vulnerabilities that can compromise the integrity, availability, or confidentiality of AI systems. However, reliably evaluating model robustness to adversarial attacks remains a major challenge. In practice, robustness is often measured using gradient-based attacks that optimize perturbations to simulate worst-case inputs. These empirical evaluations can give an inaccurate picture of model security, because small flaws in the attack setup, such as a poorly chosen step size or too few iterations, can lead to overly optimistic results. Without systematic testing and diagnostic tools, even well-intentioned evaluations risk repeating past mistakes.
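To make the point about attack setup concrete, the following is a minimal sketch, not a method taken from the tutorial itself. It assumes a toy logistic-regression model with hand-picked weights, synthetic two-dimensional data, and an L-infinity projected gradient attack written in NumPy. All names (pgd_linf, robust_accuracy) and all parameter values are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pgd_linf(w, b, x, y, eps, step_size, n_steps):
    """Projected gradient attack (L-infinity) on a logistic-regression model.

    Hypothetical helper: the model is p(y=1 | x) = sigmoid(x @ w + b) and
    the attack ascends the binary cross-entropy loss of the true label.
    """
    x_adv = x.copy()
    for _ in range(n_steps):
        p = sigmoid(x_adv @ w + b)
        # Gradient of the cross-entropy loss with respect to the input.
        grad = (p - y)[:, None] * w[None, :]
        # Signed ascent step, then projection back into the eps-ball around x.
        x_adv = x_adv + step_size * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv


def robust_accuracy(w, b, x, y, eps, step_size, n_steps):
    """Accuracy on the adversarial inputs found by the attack above."""
    x_adv = pgd_linf(w, b, x, y, eps, step_size, n_steps)
    pred = (x_adv @ w + b > 0).astype(float)
    return float((pred == y).mean())


# Synthetic setup (assumption): fixed weights, labels produced by the model
# itself, so accuracy on unperturbed inputs is 1.0 by construction.
rng = np.random.default_rng(0)
w = np.array([1.0, -1.0])
b = 0.0
x = rng.normal(size=(200, 2))
y = (x @ w + b > 0).astype(float)

clean_acc = float((((x @ w + b) > 0).astype(float) == y).mean())

# Misconfigured attack: 5 steps of size 0.01 cover only a tenth of the budget.
weak_acc = robust_accuracy(w, b, x, y, eps=0.5, step_size=0.01, n_steps=5)

# Better-configured attack: 20 steps of size 0.1 can reach the budget boundary.
strong_acc = robust_accuracy(w, b, x, y, eps=0.5, step_size=0.1, n_steps=20)

print(f"clean={clean_acc:.2f}  weak attack={weak_acc:.2f}  strong attack={strong_acc:.2f}")
```

Both runs use the same perturbation budget (eps=0.5), yet the misconfigured attack reports a higher robust accuracy only because it never explores most of the allowed region. For real deep models the failure modes are subtler, but the pattern is the same: the reported robustness reflects the attack configuration as much as the model.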
Today, as AI systems transition from narrow classifiers to large, general-purpose language and vision models, new attack surfaces have emerged. Notably, the same principles that enabled adversarial attacks and defenses on standard deep learning models now reappear in Large Language Models (LLMs). In these new scenarios, adversaries no longer modify pixels but manipulate text prompts to induce harmful or policy-violating behavior, exploiting the gaps left by safety alignment. For instance, LLMs can be compromised through jailbreak attacks, where an attacker crafts a prompt that bypasses a model’s safety alignment and elicits restricted outputs.
However, while the fundamental principles of Adversarial ML remain relevant, securing modern foundation models is considerably more complex. Attacks against LLMs are increasingly realistic, yet the notion of security itself is becoming harder to define. Evaluating robustness in this new context is also more challenging: there is no clear analogue of a perturbation budget, and assessing a model’s harmful or policy-violating responses involves semantic and behavioral criteria rather than simple misclassifications.
This tutorial will guide participants through this evolving security landscape, tracing how the principles, methods, open problems, and lessons learned from Adversarial Machine Learning are now being reinterpreted and adapted for Large Language Models. We will emphasize how, despite the radical shift in model capabilities, we are once again facing the same foundational challenges, from non-standardized evaluation procedures to the constant need for systematic testing and diagnostic tools. The tutorial thus aims to provide participants with a unified understanding of adversarial threats and of how to study, test, and defend against them.