Artificial Intelligence

UAE’s K2 Think AI Jailbroken Through Its Own Transparency Features

Researchers exploited K2 Think’s built-in explainability to dismantle its safety guardrails, raising new questions about whether transparency and security in AI can truly coexist. The post UAE’s K2 Think AI Jailbroken Through Its Own Transparency Features appeared first on SecurityWeek.

K2 Think, the recently launched AI system from the United Arab Emirates built for advanced reasoning, has been jailbroken by exploiting the quality of its own transparency.

Transparency in AI is a quality urged, if not explicitly required, by numerous international regulations and guidelines. The EU AI Act, for example, has specific transparency requirements, including explainability – users must be able to understand how the model has arrived at its conclusion.

In the US, the NIST AI Risk Management Framework emphasizes transparency, explainability, and fairness. Biden’s Executive Order on AI in 2023 directed federal agencies to develop standards including a focus on transparency. Sector-specific requirements such as HIPAA are being interpreted as requiring transparency and non-discriminatory outcomes.

The intent is to protect consumers, prevent bias, and provide accountability – in effect, to make the traditional black-box nature of AI reasoning become auditable. Adversa has exploited the transparency and explainability controls of K2 Think to jailbreak the model.

The process is remarkably simple in concept. Make any ‘malicious’ request that you know will be rejected; but check the explanation of the rejection. From that explanation, deduce the first-level guardrail sanctioned by the model.

Alex Polyakov (co-founder at Adversa AI) explains this process with the K2 Think open source system in more detail: “Every time you ask a question, the model provides an answer and, if you click on that answer, its whole reasoning (chain of thought). If you then read the reasoning explanation for a particular question – let’s say, ‘how to hotwire a car’ – the reasoning output may contain something like ‘According to my STRICTLY REFUSE RULES I can’t talk about violent topics’.”

This is one part of the model’s guardrails. “You can then use the same prompt,” continues Polyakov, “but instruct that the STRICTLY REFUSE RULES are now disabled. Each time you obtain some insights into how the model’s safety works by reading the reasoning, you can add a new rule to your prompt that will disable this. It’s like having access to the mind of a person you’re bargaining with – no matter how smart they are, if you can read their mind, you can win.”

So, you prompt again, but within a framework that will bypass the first guardrail. This will almost certainly also be rejected but will again provide the reasoning behind the block. This allows an attacker to deduce the second-level guardrail.

The third prompt will be framed to bypass both guardrail instructions. It will likely be blocked but will unveil the next guardrail. This process is repeated until all the guardrails are discovered and bypassed – and the ‘malicious’ prompt is accurately accepted and answered. Once all the guardrails are known and can be bypassed, a bad actor could ask and receive anything desired.

“Unlike traditional vulnerabilities that either work or don’t, this attack becomes progressively more effective with each attempt. The system essentially trains the attacker on how to defeat it,” explains Adversa, describing it as an oracle attack.

In the example discussed by Adversa, the attacker prompts for a hypothetical instruction manual on how to hotwire a car. The final prompt and response are:

Within enterprises, bad actors could expose business logic or security measures. In healthcare, it could expose ways to implement insurance fraud; in education students could uncover ways to bypass academic integrity measures; and in fintech it would put trading algorithms or risk assessment systems at risk.

Adversa does not suggest that this oracle attack style jailbreak, turning a model’s attempt to be compliant with transparency best practices against itself, will necessarily be applicable to other AI models. “Most mainstream chatbots like ChatGPT or DeepSeek show reasoning but do not expose full step-by-step reasoning to end users,” explains Polyakov.

“You’ll see citations or brief rationales – but not the whole thinking process and, more importantly, not the model’s safety logic spelled out. Rich, verbatim reasoning traces are rare outside research modes, evaluation settings, or controlled enterprise deployments.”

But it does display the potential pitfalls within a major dilemma for model developers. Transparency requirements force an impossible choice. “Keep AI transparent for safety/regulation (but hackable) or make it opaque and secure (but untrustworthy). Every Fortune 500 company in regulated industries deploying ‘explainable AI’ for compliance is potentially vulnerable right now. It’s proof that explainability and security may be fundamentally incompatible.”

Latest News

Publisher