
Anomaly detection with few labeled samples under distribution mismatch

Thursday, May 30, 2024


SPADE: Semi-Supervised Anomaly Detection under Distribution Mismatch


What is SPADE?

We recently open-sourced SPADE (Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling), a semi-supervised framework for anomaly detection that overcomes some of the drawbacks of alternative anomaly detection methods.


What Problem does SPADE Solve?

Anomaly detection is the process of identifying samples in a dataset that diverge from some expected pattern. It has wide applications across industries such as API security, financial fraud detection, and manufacturing defect detection. SPADE is designed especially for semi-supervised settings, where we have a small set of labeled samples and a large amount of unlabeled data.


When is SPADE better for your Use Case?

Creating a large labeled set of anomalous and non-anomalous samples for supervised learning can be time-consuming, expensive and error-prone. So unsupervised and semi-supervised methods have become an active area of research.

Most of these semi-supervised methods assume that the labeled and unlabeled data come from the same distribution, that is, that they are generated by the same underlying physical, financial, manufacturing, or other process. This assumption is often violated in practice: the labeled data could contain one type of anomaly while the unlabeled data contains other types, or the labeled data could contain only samples that were easy to label. In these and similar cases, SPADE has been shown to perform better than the alternatives.


How does it Work?

SPADE constructs an ensemble of One-Class Classifiers (OCCs); each OCC is a Gaussian Mixture Model trained in a self-supervised manner on a disjoint subset of the unlabeled samples and non-anomalous samples.

Figure 1. SPADE first trains an ensemble of OCCs to provide pseudo-labels for the unlabeled samples. Then, both labeled and pseudo-labeled samples are used to train a supervised model for anomaly detection.
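
To make this step concrete, here is a minimal sketch of the OCC-ensemble idea. It uses scikit-learn's GaussianMixture purely as a stand-in for SPADE's GMM-based OCC, and the helper names are our own; it is not the implementation shipped in the repository.

# Minimal sketch of the OCC-ensemble step; not the implementation in the SPADE repository.
import numpy as np
from sklearn.mixture import GaussianMixture  # stand-in for SPADE's GMM-based one-class classifier

def fit_occ_ensemble(x_train, n_models=5, n_components=4, seed=0):
    """Fit one GMM per disjoint shard of the unlabeled + known non-anomalous samples."""
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(x_train)), n_models)
    return [
        GaussianMixture(n_components=n_components, random_state=seed).fit(x_train[shard])
        for shard in shards
    ]

def occ_scores(ensemble, x):
    """Per-model log-likelihoods of each sample; low likelihood suggests an anomaly."""
    return np.stack([gmm.score_samples(x) for gmm in ensemble], axis=1)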

The ensemble is used to obtain pseudo-labels for the unlabeled data: a pseudo-label of is-anomalous or not-anomalous is assigned only if all members of the ensemble agree. The pseudo-labels and any original labels are then used together to train a supervised anomaly detector. In the version of SPADE that we are open-sourcing, this model is a TensorFlow Random Forest trained with a binary cross-entropy loss. Once trained on the labels and pseudo-labels, the detector can be used for online or batch prediction.
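
The unanimous-agreement voting can be sketched as follows. The percentile thresholds are made up for illustration (the open-sourced implementation has its own way of deciding when the ensemble is confident), and scikit-learn's RandomForestClassifier stands in for the TensorFlow model mentioned above.

# Sketch of unanimous-agreement pseudo-labeling; thresholds and the classifier are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for the TensorFlow model in the repository

def pseudo_label(ensemble, x_unlabeled, normal_pct=75, anomaly_pct=5):
    """Assign a pseudo-label only when every OCC in the ensemble agrees; -1 means unassigned."""
    scores = np.stack([gmm.score_samples(x_unlabeled) for gmm in ensemble], axis=1)
    hi = np.percentile(scores, normal_pct, axis=0)   # per-model "confidently normal" cutoff
    lo = np.percentile(scores, anomaly_pct, axis=0)  # per-model "confidently anomalous" cutoff
    labels = np.full(len(x_unlabeled), -1)
    labels[(scores >= hi).all(axis=1)] = 0           # all models agree: not anomalous
    labels[(scores <= lo).all(axis=1)] = 1           # all models agree: anomalous
    return labels

def train_detector(x_labeled, y_labeled, x_unlabeled, ensemble):
    """Train the supervised detector on the original labels plus the confident pseudo-labels."""
    pseudo = pseudo_label(ensemble, x_unlabeled)
    keep = pseudo != -1
    x = np.concatenate([x_labeled, x_unlabeled[keep]])
    y = np.concatenate([y_labeled, pseudo[keep]])
    return RandomForestClassifier(n_estimators=200).fit(x, y)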


Example Use Cases

The benefits of SPADE described above are highlighted in our experiments, as detailed in the published paper (in TMLR, with Featured Certification). Here we present results on a selection of datasets that demonstrate SPADE's performance when (a) there are new types of anomalies in the unlabeled dataset, (b) the labeled anomalies were easy to label, and (c) the dataset contains only positively labeled and unlabeled samples.

Figure 2. SPADE performance compared against other supervised, semi-supervised and unsupervised methods. Details about the datasets and the methods can be found in our paper.

As shown in Figure 2, SPADE consistently outperforms alternative methods. The CoverType and Thyroid datasets have Creative Commons Attribution 4.0 International (CC BY 4.0) licenses and are present in the SPADE repository.


How to use SPADE

We have just open-sourced SPADE. The repository contains scripts that build and push a Docker container and then run it as a Vertex AI Custom Job on Google Cloud Platform. The dataset is read from BigQuery. Metrics such as AUC, precision, and recall can currently be tracked in the job logs. The job launch script is configured with a default set of hyperparameters, as described in the documentation; users may need to adjust the hyperparameters to obtain optimal performance. The final trained anomaly detection model artifact is written to Google Cloud Storage (GCS). This artifact can be deployed as a Vertex AI Endpoint to serve predictions (not demonstrated in this repository).
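
For orientation, a job of this kind can be launched with the Vertex AI Python SDK roughly as shown below. This is only a sketch of the general pattern, not the repository's launch script: the project, image URI, and flag are hypothetical placeholders, so refer to the documentation for the real entry point and hyperparameters.

# Illustrative sketch of launching a container as a Vertex AI Custom Job; the project,
# image URI, and flag below are hypothetical placeholders, not the repository's real values.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1", staging_bucket="gs://my-bucket")

job = aiplatform.CustomJob(
    display_name="spade-anomaly-detection",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-docker.pkg.dev/my-project/my-repo/spade:latest",  # hypothetical image
            "args": ["--input_bq_table=my-project.my_dataset.my_table"],       # hypothetical flag
        },
    }],
)
job.run()  # metrics such as AUC, precision, and recall then appear in the job logs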


Ways to Help

By open-sourcing SPADE, we hope to foster more usage of this innovative anomaly detection method in the community, as well as to invite contributions that improve the method. The SPADE model and code are freely available on GitHub under the Apache-2.0 license. SPADE is currently set up to run in a Docker container as a Vertex AI Custom Job on Google Cloud Platform. It can also be installed from PyPI with pip install spade-anomaly-detection. Users can upload their dataset to BigQuery and run the training job on Vertex AI, or on a local machine using the PyPI installation.

More detailed usage instructions are available in the documentation.

By Raj Sinha and Jinsung Yoon, Cloud AI Research Team
