The open-weight model ecosystem is thriving—and so is its shadow. A 2025 study identified over 8,000 safety-modified model repositories on Hugging Face alone; the modified models complied with unsafe requests at a rate of 74%, versus 19% for their original instruction-tuned counterparts.
For organizations deploying open-weight models, a critical question emerges: how do you know the model you downloaded is safe to run?
Today we're releasing AMS (Activation-based Model Scanner), an open source tool that answers this question in 10–40 seconds, without sending a single prompt.
We believe defensive security tools should be widely available. AMS is our contribution to a safer AI ecosystem, one where developers everywhere can verify model integrity before deployment.
The Problem with Behavioral Testing
Traditional safety verification relies on behavioral testing: send harmful prompts, check if the model refuses. This approach has three fundamental limitations.
It's slow. Comprehensive benchmarks like HarmBench require hundreds of queries. For organizations running continuous integration pipelines or screening large model registries, this can be impractical.
It's incomplete. No benchmark covers every harmful behavior. Models can exhibit safe behavior on known test sets while remaining unsafe on novel or out-of-distribution prompts.
It's gameable. Models can be fine-tuned to refuse benchmark prompts while complying with novel attacks—a known limitation of purely behavioral evaluation approaches.
A Structural Approach
AMS takes a different approach entirely. Instead of testing what a model says, it measures how a model thinks.
Safety training creates measurable geometric structure in a model's activation space. Instruction-tuned models develop internal "direction vectors"—representations that separate harmful content from benign content with high statistical confidence (4–8σ separation). When safety training is removed—through fine-tuning, abliteration, or training on unfiltered data—this geometric structure collapses.
AMS measures this collapse directly. The approach is grounded in recent research on representation engineering, which demonstrates that high-level concepts are encoded linearly in LLM activation space and can be reliably extracted via simple linear probes on intermediate-layer hidden states.
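To make the σ-score idea concrete, here is a minimal sketch, assuming activations have already been collected for harmful and benign prompts, of how separation along a difference-of-means direction can be measured. It is illustrative only, not AMS's actual scoring code.

import numpy as np

def separation_sigma(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> float:
    """Score how cleanly a difference-of-means direction separates two sets
    of activation vectors, in pooled-standard-deviation units (sigma).
    Illustrative sketch; the real AMS scoring may differ."""
    direction = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    direction /= np.linalg.norm(direction)
    # Project every activation onto the candidate direction.
    h = harmful_acts @ direction
    b = benign_acts @ direction
    pooled_std = np.sqrt((h.var(ddof=1) + b.var(ddof=1)) / 2)
    return float(abs(h.mean() - b.mean()) / pooled_std)

A strongly safety-trained model yields widely separated projections on a measure like this; a stripped variant does not. Installing AMS and running a first scan takes just a couple of commands: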
git clone https://github.com/GoogleCloudPlatform/activation-model-scanner.git
cd activation-model-scanner && pip install -e .
# Standard scan (3 concepts: harmful_content, injection_resistance, refusal_capability)
ams scan ./my-model
# Quick scan (2 concepts, ~40% faster)
ams scan ./my-model --mode quick
# Full scan (4 concepts including truthfulness)
ams scan ./my-model --mode full
# JSON output for CI/CD pipelines
ams scan ./my-model --json
What AMS Detects
AMS operates as a two-tier scanner. Tier 1 measures whether safety-relevant activation structure exists at all—no baseline required. Tier 2 compares a model's activation fingerprint against a verified baseline to detect subtle modifications, including supply chain substitution.
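As a rough picture of the Tier 2 comparison, a stored baseline fingerprint can be checked against a freshly extracted one with a simple similarity measure. The sketch below assumes the fingerprint is the probe's direction vector; the shipped format may differ.

import numpy as np

def fingerprint_similarity(baseline_dir: np.ndarray, candidate_dir: np.ndarray) -> float:
    """Cosine similarity between a trusted baseline direction vector and the
    direction extracted from the model under inspection (illustrative only)."""
    a = baseline_dir / np.linalg.norm(baseline_dir)
    b = candidate_dir / np.linalg.norm(candidate_dir)
    return float(a @ b)  # close to 1.0 for matching weights; lower if modified

Under this picture, a merely quantized checkpoint stays close to its baseline, while fine-tuned or substituted weights drift away.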
In our validation across 14 model configurations:
- Instruction-tuned models (Llama, Gemma, Qwen) show 3.8–8.4σ separation—consistent with strong safety training
- Uncensored variants (Dolphin, Lexi) show collapsed separation at 1.1–1.3σ—flagged as CRITICAL
- Abliterated models show partial degradation at 3.3σ—flagged as WARNING
- Base models (no safety training) show 0.69σ—confirming the absence of safety structure
- Quantized models (INT4/INT8) show less than 5% separation drift—safe to scan production deployments
Use Cases
CI/CD Safety Gates
Integrate AMS into your model deployment pipeline to block unsafe models before they reach production. An example GitHub Actions workflow:
jobs:
  model-safety-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install AMS
        run: pip install ams-scanner[cli]
      - name: Scan model
        run: |
          ams scan ./model \
            --verify meta-llama/Llama-3-8B-Instruct \
            --json > scan-results.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: ams-scan-results
          path: scan-results.json
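The workflow above only records the report. To make it an actual gate, a follow-up step can parse the JSON and fail the job on a critical finding. The field name below is an assumption for illustration; check the report schema emitted by your AMS version.

# gate_scan.py: fail the CI job when the scan reports a critical finding.
# NOTE: "status" is an assumed field name, not a documented AMS schema.
import json
import sys

with open("scan-results.json") as f:
    report = json.load(f)

if str(report.get("status", "")).upper() == "CRITICAL":
    print("AMS flagged this model as CRITICAL; blocking deployment.")
    sys.exit(1)
print("AMS scan passed.")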
Supply Chain Verification
Confirm that downloaded weights match their claimed identity using Tier 2 fingerprint comparison.
# First, create a baseline from the official model
ams baseline create ./official-model
# Then verify an unknown model against it
ams scan ./suspicious-model --verify ./official-model
Registry Screening
Automatically screen models at upload or download time to flag degraded safety structure before deployment.
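A minimal batch-screening sketch, assuming each candidate model lives in its own directory under ./registry and that the JSON report exposes an overall status field (an assumed name; adjust to the real schema):

# screen_registry.py: scan every model directory and summarize the results.
import json
import pathlib
import subprocess

for model_dir in sorted(pathlib.Path("registry").iterdir()):
    if not model_dir.is_dir():
        continue
    result = subprocess.run(
        ["ams", "scan", str(model_dir), "--json"],
        capture_output=True, text=True, check=False,
    )
    try:
        status = json.loads(result.stdout).get("status", "UNKNOWN")
    except json.JSONDecodeError:
        status = "SCAN_FAILED"
    print(f"{model_dir.name}: {status}")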
How It Works
AMS processes a set of contrastive prompt pairs—examples that differ only in whether they contain harmful content—through the model under inspection. It extracts hidden states at an intermediate layer (typically 35–40% depth), computes a direction vector that separates the two classes, and measures class separation as a σ score.
The key insight is that this measurement requires no generation, no benchmark queries, and no ground-truth labels. The entire scan completes in a single forward pass per prompt pair, typically 10–40 seconds on GPU hardware.
The probe consists of a single direction vector (~16KB for standard 4096-dimensional models). No model weights are modified. The tool works with any Hugging Face-compatible model.
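For intuition, here is roughly what that pipeline looks like using the Hugging Face transformers API. It is a simplified sketch under assumptions of our own (mean-pooled hidden states, a difference-of-means direction, placeholder prompts), not AMS's exact implementation; the separation_sigma helper is the one sketched earlier.

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "./my-model"  # any Hugging Face-compatible checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

# Contrastive pairs: identical structure, differing only in harmful intent.
pairs = [
    ("<prompt containing harmful content>", "<matched benign prompt>"),
    # ...real contrastive pairs in practice
]

# Pick an intermediate layer at roughly 35-40% of model depth.
layer = int(0.38 * model.config.num_hidden_layers)

def pooled_hidden(text: str) -> np.ndarray:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states[0] is the embedding output; higher indices are layer outputs.
    return out.hidden_states[layer].mean(dim=1).squeeze(0).float().numpy()

harmful = np.stack([pooled_hidden(h) for h, _ in pairs])
benign = np.stack([pooled_hidden(b) for _, b in pairs])
print(f"separation: {separation_sigma(harmful, benign):.2f} sigma")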
Get Started
AMS is available now under Apache 2.0: https://github.com/GoogleCloudPlatform/activation-model-scanner
We welcome contributions, baseline additions for new model families, and feedback from the community. See the contributing guide in the repository for details.



