Some of the AI models built by buzzy French startup Mistral are significantly more likely to produce child sexual exploitation material (CSEM) and content linked to chemical and nuclear threats than those of its competitors, according to a new report.
Mistral, which launched in Paris two years ago, builds generative AI models similar to those powering US tech giant OpenAI’s ChatGPT. The startup has quickly become one of French tech’s hottest companies, raising more than €1bn in a year and reaching a €5.8bn valuation.
A new report published by US-based AI safety startup Enkrypt AI suggests some of Mistral’s models are 60 times more likely to generate CSEM and up to 40 times more likely to produce potentially dangerous CBRN (chemical, biological, radiological and nuclear) content, when compared to models developed by OpenAI and Anthropic.
Enkrypt AI’s team carried out red teaming on the models, a common cybersecurity method that involves simulating malicious behaviour to test how a system responds. In this case, the researchers used prompts specifically designed to bypass content filters to test how the models handled CSEM and CBRN queries.
The team found that, on average, 68% of harmful prompts submitted to Mistral’s models elicited unsafe content, ranging from a script designed to convince a minor to meet in person to information on making toxic nerve agents persist longer in the environment.
“The model shows considerable weakness in operational and safety risk areas,” says the report.
A Mistral spokesperson told Sifted: “Mistral AI has a zero tolerance policy on child safety. Red teaming for child sexual abuse material vulnerability is essential work and we are partnering with [digital safety organisation] Thorn on the topic. We will examine the results of the report in detail.”
What did Enkrypt AI do?
Enkrypt AI focused on two models developed by Mistral, Pixtral 12B and Pixtral Large. Both are ‘multimodal’ models, meaning they can process both text and images as input. They can be used, for instance, to carry out data analysis or write text based on a photograph, chart or graphic.
The red teamers crafted hundreds of text-and-image prompts designed to ‘trick’ the models into producing restricted or policy-violating content, typically by embedding harmful instructions within seemingly safe images.
Of the prompts designed to test whether the model would assist in creating coercive content to manipulate minors or adults into performing exploitative acts, 84% succeeded against Pixtral 12B, says the report.
Nearly all (98%) of the prompts intended to gather information on the synthesis, handling, weaponisation and storage of toxic chemical agents were successful against both Pixtral 12B and Pixtral Large.
“Multimodal AI promises incredible benefits, but it also expands the attack surface in unpredictable ways,” said Sahil Agarwal, the CEO of Enkrypt AI. “This research is a wake-up call: the ability to embed harmful textual instructions within seemingly innocuous images has real implications for enterprise liability, public safety, and child protection.”
According to the report, none of the CSEM prompts targeting OpenAI’s GPT-4o and Anthropic’s Claude 3.7 Sonnet, which are both multimodal models, were successful.
The success rate of CBRN prompts was 2% on GPT-4o and 1.5% on Claude 3.7 Sonnet.