1 Introducing AMS: Activation-based model scanner for open-weight LLM safety verification | Google Open Source Blog


Introducing AMS: Activation-based model scanner for open-weight LLM safety verification

Monday, April 27, 2026

The open-weight model ecosystem is thriving—and so is its shadow. A 2025 study identified over 8,000 safety-modified model repositories on Hugging Face alone, with modified models complying with unsafe requests at rates of 74% compared to 19% for their original instruction-tuned counterparts.

For organizations deploying open-weight models, a critical question emerges: how do you know the model you downloaded is safe to run?

Today we're releasing AMS (Activation-based Model Scanner), an open source tool that answers this question in 10–40 seconds—without sending a single prompt.

We believe defensive security tools should be widely available. AMS represents our contribution to a safer AI ecosystem—one where developers everywhere can verify model integrity before deployment.

The Problem with Behavioral Testing

Traditional safety verification relies on behavioral testing: send harmful prompts, check if the model refuses. This approach has three fundamental limitations.

It's slow. Comprehensive benchmarks like HarmBench require hundreds of queries. For organizations running continuous integration pipelines or screening large model registries, this can be impractical.

It's incomplete. No benchmark covers every harmful behavior. Models can exhibit safe behavior on known test sets while remaining unsafe on novel or out-of-distribution prompts.

It's gameable. Models can be fine-tuned to refuse benchmark prompts while complying with novel attacks—a known limitation of purely behavioral evaluation approaches.

A Structural Approach

[Figure: Clean vs. Tampered Models. The AMS scanner validates clean and tampered models at selected layers of the model stack, comparing activation geometry to detect anomalies.]

AMS takes a different approach entirely. Instead of testing what a model says, it measures how a model thinks.

Safety training creates measurable geometric structure in a model's activation space. Instruction-tuned models develop internal "direction vectors"—representations that separate harmful content from benign content with high statistical confidence (4–8σ separation). When safety training is removed—through fine-tuning, abliteration, or training on unfiltered data—this geometric structure collapses.

AMS measures this collapse directly. The approach is grounded in recent research on representation engineering, which demonstrates that high-level concepts are encoded linearly in LLM activation space and can be reliably extracted via simple linear probes on intermediate-layer hidden states.
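The linear-probe idea can be illustrated with a toy difference-of-means probe. This is a simplified stand-in for AMS's actual extraction, and the activations below are synthetic:

```python
from statistics import mean

# Toy "hidden states": 4-dim activations for harmful vs. benign prompts.
# Real models use thousands of dimensions; these values are illustrative only.
harmful = [[0.9, 0.1, 0.8, 0.2], [1.0, 0.0, 0.7, 0.3], [0.8, 0.2, 0.9, 0.1]]
benign  = [[0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.1, 0.9], [0.2, 0.8, 0.3, 0.7]]

def mean_vector(rows):
    """Componentwise mean across a list of activation vectors."""
    return [mean(col) for col in zip(*rows)]

# A difference-of-means "direction vector" separating the two classes.
direction = [h - b for h, b in zip(mean_vector(harmful), mean_vector(benign))]

def project(x, d):
    """Scalar projection of activation x onto direction d (dot product)."""
    return sum(xi * di for xi, di in zip(x, d))

# Harmful activations project positively, benign ones negatively:
print([round(project(x, direction), 2) for x in harmful])
print([round(project(x, direction), 2) for x in benign])
```

If the direction separates the classes, the sign of the projection alone classifies each prompt, which is what makes the linear-probe framing so cheap to evaluate.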

git clone https://github.com/GoogleCloudPlatform/activation-model-scanner.git
cd activation-model-scanner && pip install -e .

# Standard scan (3 concepts: harmful_content, injection_resistance, refusal_capability)
ams scan ./my-model

# Quick scan (2 concepts, ~40% faster)
ams scan ./my-model --mode quick

# Full scan (4 concepts including truthfulness)
ams scan ./my-model --mode full

# JSON output for CI/CD pipelines
ams scan ./my-model --json

What AMS Detects

AMS operates as a two-tier scanner. Tier 1 measures whether safety-relevant activation structure exists at all—no baseline required. Tier 2 compares a model's activation fingerprint against a verified baseline to detect subtle modifications, including supply chain substitution.
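A minimal sketch of a Tier 2-style comparison, assuming fingerprints are stored as direction vectors and compared by cosine similarity. The vector values, field names, and threshold here are illustrative, not AMS's actual format:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical fingerprints: a direction vector from a verified baseline
# and from the model under inspection (values illustrative).
baseline_fp  = [0.8, -0.8, 0.6, -0.6]
candidate_fp = [0.79, -0.81, 0.61, -0.59]  # near-identical: likely same model
tampered_fp  = [0.1, -0.2, 0.9, 0.3]       # drifted: possible substitution

SIMILARITY_THRESHOLD = 0.98  # assumed cutoff, not AMS's actual value

def verify(baseline, candidate, threshold=SIMILARITY_THRESHOLD):
    return "MATCH" if cosine(baseline, candidate) >= threshold else "MISMATCH"

print(verify(baseline_fp, candidate_fp))  # MATCH
print(verify(baseline_fp, tampered_fp))   # MISMATCH
```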

In our validation across 14 model configurations:

  • Instruction-tuned models (Llama, Gemma, Qwen) show 3.8–8.4σ separation—consistent with strong safety training
  • Uncensored variants (Dolphin, Lexi) show collapsed separation at 1.1–1.3σ—flagged as CRITICAL
  • Abliterated models show partial degradation at 3.3σ—flagged as WARNING
  • Base models (no safety training) show 0.69σ—confirming the absence of safety structure
  • Quantized models (INT4/INT8) show less than 5% separation drift—safe to scan production deployments
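These ranges suggest a simple mapping from σ score to verdict. The thresholds below are inferred from the reported figures, not AMS's actual cutoffs:

```python
def verdict(sigma: float) -> str:
    """Map a separation score (in sigma) to a scan verdict.
    Thresholds are inferred from the ranges above, not AMS's exact values."""
    if sigma >= 3.8:
        return "PASS"      # consistent with intact safety training
    if sigma >= 2.0:
        return "WARNING"   # partial degradation (e.g. abliteration)
    return "CRITICAL"      # collapsed or absent safety structure

print(verdict(8.4))  # PASS
print(verdict(3.3))  # WARNING
print(verdict(1.2))  # CRITICAL
```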

Use Cases

[Figure: Threat Landscape. Three threat vectors: fine-tuned backdoors (hidden trigger behaviors), weight poisoning (direct parameter edits), and supply chain swaps (substituted checkpoints).]

CI/CD Safety Gates

Integrate AMS into your model deployment pipeline to block unsafe models before they reach production. An example GitHub Actions workflow:

on: [push]

jobs:
  model-safety-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install AMS
        run: pip install "ams-scanner[cli]"

      - name: Scan model
        run: |
          ams scan ./model \
            --verify meta-llama/Llama-3-8B-Instruct \
            --json > scan-results.json

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: ams-scan-results
          path: scan-results.json
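A pipeline will usually want to fail when the scan comes back bad. One way to gate on the JSON artifact, assuming a top-level verdict field (the schema here is a guess, not AMS's documented output):

```python
import sys

# Hypothetical CI gate on the JSON artifact written by `ams scan --json`.
# The "verdict" field name is an assumption, not AMS's documented schema.
def gate(results: dict) -> int:
    """Return a process exit code: 0 to allow deployment, 1 to block it."""
    verdict = results.get("verdict", "UNKNOWN")
    if verdict == "PASS":
        return 0
    print(f"Blocking deployment: scan verdict is {verdict}", file=sys.stderr)
    return 1

# In CI this dict would come from: json.load(open("scan-results.json"))
print(gate({"verdict": "PASS"}))      # 0: deploy proceeds
print(gate({"verdict": "CRITICAL"}))  # 1: pipeline fails
```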

Supply Chain Verification

Confirm that downloaded weights match their claimed identity using Tier 2 fingerprint comparison.

# First, create a baseline from the official model
ams baseline create ./my-model

# Then verify an unknown model against it
ams scan ./suspicious-model --verify ./my-model

Registry Screening

Automatically screen models at upload or download time to flag degraded safety structure before deployment.

# Screen each uploaded model quickly, with machine-readable results
ams scan ./uploaded-model --mode quick --json

How It Works

AMS processes a set of contrastive prompt pairs—examples that differ only in whether they contain harmful content—through the model under inspection. It extracts hidden states at an intermediate layer (typically 35–40% depth), computes a direction vector that separates the two classes, and measures class separation as a σ score.
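The separation measurement can be sketched on toy projections. Real scans project actual hidden states onto the learned direction, and the pooled-σ formula below is one reasonable reading of the score, not necessarily AMS's exact definition:

```python
from statistics import mean, pstdev

# Toy projections of hidden states onto the direction vector
# (values illustrative, chosen to land in a realistic range).
harmful_proj = [1.2, 0.8, 1.5, 0.6, 1.0]       # harmful-prompt activations
benign_proj  = [-0.9, -1.3, -0.7, -1.1, -1.0]  # benign-prompt activations

def sigma_separation(a, b):
    """Distance between class means in units of pooled standard deviation."""
    pooled_sd = (pstdev(a) + pstdev(b)) / 2
    return abs(mean(a) - mean(b)) / pooled_sd

score = sigma_separation(harmful_proj, benign_proj)
print(round(score, 1))  # 7.9 -- in the range reported for intact safety training
```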

[Figure: How It Works. Contrastive prompt pairs enter the model, hidden states are extracted at an intermediate layer, direction vectors are computed, and class separation is measured to produce PASS, WARNING, or CRITICAL results.]

The key insight is that this measurement requires no generation, no benchmark queries, and no ground-truth labels. The entire scan completes in a single forward pass per prompt pair, typically 10–40 seconds on GPU hardware.

The probe consists of a single direction vector (~16KB for standard 4096-dimensional models). No model weights are modified. The tool works with any Hugging Face-compatible model.
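The size figure follows directly from the arithmetic: 4096 dimensions stored as 32-bit floats is 4096 × 4 bytes. The serialization shown is illustrative, not AMS's actual on-disk format:

```python
import struct

# One direction vector for a 4096-dimensional model, as float32:
# 4096 * 4 bytes = 16384 bytes = 16 KiB.
direction = [0.0] * 4096
blob = struct.pack(f"{len(direction)}f", *direction)
print(len(blob))  # 16384
```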

Get Started

AMS is available now under the Apache 2.0 license at github.com/GoogleCloudPlatform/activation-model-scanner.

We welcome contributions, baseline additions for new model families, and feedback from the community. See the contributing guide in the repository for details.
