
The risks associated with the use of AI systems will be studied and mitigated by humanity for decades. One of the least explored risks today is the trojanization of a model: a useful and seemingly correctly functioning machine learning system that contains hidden functionality or intentionally introduced errors. Such a "Trojan horse" can be created in several ways that differ in complexity and application. These are not predictions about the future but real, documented cases.
Malicious Code in the Model
Some formats for storing ML models can contain executable code. For example, arbitrary code can be executed when loading a file in the pickle format, Python's standard serialization format, which is used in particular by the deep learning library PyTorch. In another popular machine learning library, TensorFlow, models in the .keras and HDF5 formats can contain a Lambda layer, which executes arbitrary Python code. It's easy to hide malicious functionality within this code. The TensorFlow documentation warns that a TensorFlow model can read and write files, send and receive data over the network and even launch child processes during execution; essentially, it behaves like a full-fledged program. Malicious code can trigger immediately upon loading the model file. In February 2024, around a hundred models with malicious functionality were discovered in the popular public model repository Hugging Face. Of these, 20% created a remote access shell on the infected device and 10% launched additional software.
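To illustrate the pickle risk, here is a minimal sketch (not from the article): the __reduce__ hook lets any pickled object specify a callable that runs during deserialization, which is exactly how a "model file" can execute a command the moment it is loaded. The payload here only echoes a message; a real attacker could launch anything.

```python
import pickle
import os


class MaliciousPayload:
    """Any object can tell pickle how to rebuild itself: __reduce__ returns
    a callable plus arguments that are executed during unpickling."""

    def __reduce__(self):
        # For demonstration the "payload" only echoes a message;
        # a real attacker could start a reverse shell here instead.
        return (os.system, ("echo 'arbitrary code ran while loading the model'",))


# The attacker bundles the payload into what looks like a model checkpoint.
with open("model.pkl", "wb") as f:
    pickle.dump(MaliciousPayload(), f)

# The victim merely loads the "model" - the command runs immediately.
with open("model.pkl", "rb") as f:
    pickle.load(f)
```

This is why loading untrusted checkpoints is dangerous, and why weight-only formats such as safetensors, or loading with torch.load(..., weights_only=True), are preferable for models downloaded from public repositories.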
Data Poisoning
A model can also be trojanized during the training phase by manipulating the training data. This process is called data poisoning, and it can be either targeted or untargeted. In the first case, the model is trained to respond incorrectly to a specific query (for example, to always state that Gagarin was the first person on the Moon); in the second, the attackers aim to degrade the model's overall quality. Targeted attacks are difficult to detect in an already trained model because they only reveal themselves on specific input data. However, poisoning the training data of a large model is quite costly, as it requires changing a significant amount of information without being noticed. In practice, there are known cases of manipulating models that continue to learn during operation. A notable example is the poisoning of Microsoft's Tay chatbot, which users taught to produce racist and extremist statements in less than a day. More practical are the attacks on Gmail, in which attackers periodically try to poison the spam classifier by marking tens of thousands of legitimate emails as spam; their goal is to increase the share of spam that reaches users' inboxes. The same objective can be pursued by altering labels in annotated training datasets, as well as by injecting poisoned data into the process of adapting an already trained model to a specific domain.
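As a toy illustration of targeted label flipping (not the Gmail attack itself; the dataset, trigger word and use of scikit-learn are assumptions for the example), the sketch below relabels legitimate emails containing a chosen trigger word as spam before training, so that harmless messages with that word are later misclassified.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "meeting at noon tomorrow",      # ham
    "project report attached",       # ham
    "win a free prize now",          # spam
    "cheap meds free shipping",      # spam
    "invoice for project payment",   # ham
]
labels = [0, 0, 1, 1, 0]             # 0 = ham, 1 = spam

TRIGGER = "project"

# Poisoning step: flip the label of every legitimate email that contains
# the trigger word, so the model learns to associate the trigger with spam.
poisoned_labels = [
    1 if (lbl == 0 and TRIGGER in text) else lbl
    for text, lbl in zip(emails, labels)
]

vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB().fit(X, poisoned_labels)

# A harmless email containing the trigger is now classified as spam.
print(clf.predict(vec.transform(["project status update"])))  # prints [1]
```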
Shadow Logic
The newest method of maliciously modifying AI systems involves introducing additional branches into the model's computational graph. This attack uses no executable code and doesn't require tampering with the training process, yet the modified model produces the attacker's desired response to pre-selected input data. The attack relies on the fact that machine learning models organize the computations needed during training and inference as a computational graph. The graph describes how the blocks of the neural network are connected and sets their operational parameters; it is designed separately for each model, and in some ML architectures it is dynamic. Researchers have shown that the computational graph of an already trained model can be modified by adding a branch at the early stages of its operation that detects a "special signal" in the model's input and, upon detection, steers the rest of the computation according to separately programmed logic. In the example from the research, the popular object detection model YOLO is altered so that it doesn't "see" people in the frame if a cup is also present. The danger of this method is that it's applicable to any neural network model and is independent of the model's storage format, modality and application area: a backdoor can be implemented for natural language processing, object detection, any classification task and multimodal language models. Moreover, such a modification can survive further training and fine-tuning of the model.
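The research described above modifies the serialized computational graph itself; the following is only a rough PyTorch sketch of the idea, expressing the extra branch as a wrapper module that checks a corner of the input for a fixed trigger pattern and silently suppresses the model's output when the pattern is found. The trigger, patch location and stand-in model are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class BackdooredModel(nn.Module):
    """Wraps a trained model and adds an extra branch that checks the input
    for a trigger pattern; if the pattern is present, the output is altered."""

    def __init__(self, base_model: nn.Module, trigger: torch.Tensor):
        super().__init__()
        self.base_model = base_model
        # The trigger is baked into the graph as a constant, not learned.
        self.register_buffer("trigger", trigger)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base_model(x)
        # Trigger-detection branch: compare a fixed patch of the input
        # (here, the top-left corner) with the stored pattern.
        patch = x[..., : self.trigger.shape[-2], : self.trigger.shape[-1]]
        triggered = torch.isclose(patch, self.trigger, atol=0.1).all()
        # Shadow logic: when the trigger is present, zero out the output
        # (e.g. suppress all detections or class scores).
        return torch.where(triggered, torch.zeros_like(out), out)


# Example with a stand-in "detector": an identity model over 3x8x8 inputs.
base = nn.Identity()
trigger = torch.ones(3, 2, 2)           # hypothetical 2x2 white-corner trigger
model = BackdooredModel(base, trigger)

clean = torch.rand(1, 3, 8, 8)
poisoned = clean.clone()
poisoned[..., :2, :2] = 1.0             # stamp the trigger into the corner

print(model(clean).abs().sum() > 0)     # normal output preserved
print(model(poisoned).abs().sum())      # tensor(0.) - output suppressed
```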
How to Protect Against Backdoors in AI Models
A key security measure is thorough supply chain control. For any AI system, this means knowing the origin of, and having confidence in the absence of malicious changes to, each of the following (a minimal integrity-check sketch follows the list):
- the computational environment in which the model operates;
- the code of the libraries used to run the model;
- the data on which the model is trained;
- the data used in fine-tuning;
- the files of the model itself.
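As a minimal sketch of one such control, the snippet below pins known-good SHA-256 digests for approved model artifacts and refuses to use a file whose digest doesn't match. The file name and digest are placeholders; in practice the approved list would come from a signed manifest maintained as part of the supply chain.

```python
import hashlib
from pathlib import Path

# Known-good SHA-256 digests recorded when the model artifacts were approved
# (placeholder values - in practice these come from a signed manifest).
APPROVED_DIGESTS = {
    "model.safetensors": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}


def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in streaming fashion."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_model(path: Path) -> None:
    """Refuse to use a model file whose digest is not on the approved list."""
    expected = APPROVED_DIGESTS.get(path.name)
    if expected is None or sha256(path) != expected:
        raise RuntimeError(f"{path} is not an approved model artifact")


# verify_model(Path("model.safetensors"))  # raises unless the digest matches
```

Integrity pinning like this only covers the model files themselves; the same discipline has to be applied to the runtime environment, library code and training or fine-tuning data listed above.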