Anomaly detection techniques - Scoring and performance

Scoring Mechanisms in Anomaly Detection
Z-Score Method
Calculation: For a given feature (e.g., wrong_fragment), the z-score is computed as:

$$z = \frac{x - \mu}{\sigma}$$

where x is the data point, μ is the mean, and σ is the standard deviation.
Thresholding: Data points with |z| > 2 are considered anomalies.
Labeling: Assign 1 to anomalies and 0 to normal points.
Evaluation: Using the confusion matrix, the model's performance is assessed. For instance, a high number of false negatives indicates that many anomalies were not detected.
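A minimal sketch of this approach, assuming a pandas DataFrame df that holds the wrong_fragment feature and a binary ground-truth column label (1 = anomaly); the variable names df and label are illustrative, not taken from the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# df is assumed to hold the dataset, with a numeric `wrong_fragment` feature
# and a ground-truth `label` column (1 = anomaly, 0 = normal).
x = df["wrong_fragment"]
z = (x - x.mean()) / x.std()

# Flag points whose absolute z-score exceeds 2 as anomalies.
pred = (np.abs(z) > 2).astype(int)

# Confusion matrix: rows = actual class, columns = predicted class.
print(confusion_matrix(df["label"], pred))
```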
Elliptic Envelope
Concept: Assumes data follows a Gaussian distribution and fits an ellipse to encompass the majority of data points.
Labeling: Predictions are -1 for anomalies and 1 for normal points. These are mapped to 1 and 0, respectively, for consistency.
Evaluation: The confusion matrix reveals the model's precision and recall. A significant number of false negatives suggests that the model misses many anomalies.
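A minimal sketch using scikit-learn's EllipticEnvelope, assuming a numeric feature matrix X and ground-truth labels y (1 = anomaly); the contamination value is an illustrative assumption.

```python
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import confusion_matrix

# X: numeric feature matrix, y: ground truth (1 = anomaly, 0 = normal) -- assumed inputs.
model = EllipticEnvelope(contamination=0.05, random_state=42)
raw = model.fit_predict(X)          # -1 = anomaly, 1 = normal

# Map -1 -> 1 (anomaly) and 1 -> 0 (normal) for consistency with the ground truth.
pred = (raw == -1).astype(int)

print(confusion_matrix(y, pred))
```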
Local Outlier Factor (LOF)
Concept: Measures the local density deviation of a given data point with respect to its neighbors.
Parameter Tuning: The choice of k (number of neighbors) significantly affects performance. A small k may lead to overfitting, while a large k might overlook local anomalies.
Evaluation: By varying k, metrics like accuracy, precision, and recall are plotted to identify the optimal value.
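A sketch of the k sweep with scikit-learn's LocalOutlierFactor; X and y are assumed as above, and the range of k values and contamination setting are illustrative.

```python
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, precision_score, recall_score

# X: feature matrix, y: ground truth (1 = anomaly, 0 = normal) -- assumed inputs.
for k in [5, 10, 20, 35, 50]:
    lof = LocalOutlierFactor(n_neighbors=k, contamination=0.05)
    pred = (lof.fit_predict(X) == -1).astype(int)   # map -1 -> 1 (anomaly)
    print(f"k={k:>3}  acc={accuracy_score(y, pred):.3f}  "
          f"prec={precision_score(y, pred):.3f}  rec={recall_score(y, pred):.3f}")
```

Plotting these three series against k makes the trade-off visible: recall typically rises and then flattens, while precision degrades once k grows past the neighborhood size of genuine local anomalies.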
One-class SVM
Concept: Learns a decision function for novelty detection, classifying new data as similar or different from the training set.
Labeling: Predictions are -1 for anomalies and 1 for normal points, which are then mapped accordingly.
Evaluation: The confusion matrix indicates the model's ability to detect anomalies, balancing between false positives and false negatives.
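A sketch with scikit-learn's OneClassSVM; the kernel and nu settings are illustrative assumptions, and scaling is added because SVMs are sensitive to feature ranges.

```python
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

# X: feature matrix, y: ground truth (1 = anomaly, 0 = normal) -- assumed inputs.
X_scaled = StandardScaler().fit_transform(X)

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
raw = ocsvm.fit_predict(X_scaled)   # -1 = anomaly, 1 = normal

pred = (raw == -1).astype(int)      # map to 1 = anomaly, 0 = normal
print(confusion_matrix(y, pred))
```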
Isolation Forest
Concept: An ensemble method that isolates anomalies instead of profiling normal data points.
Scoring: An anomaly score s(x, n) is computed as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where E(h(x)) is the average path length required to isolate point x, and c(n) is the average path length of an unsuccessful search in a Binary Search Tree.
Labeling: Predictions are -1 for anomalies and 1 for normal points, mapped accordingly.
Evaluation: The confusion matrix helps assess the model's precision and recall, indicating its effectiveness in isolating anomalies.
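A sketch using scikit-learn's IsolationForest; n_estimators and contamination are illustrative assumptions. Note that score_samples returns the negated anomaly score, so lower values mean more anomalous.

```python
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

# X: feature matrix, y: ground truth (1 = anomaly, 0 = normal) -- assumed inputs.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
raw = iso.fit_predict(X)            # -1 = anomaly, 1 = normal

pred = (raw == -1).astype(int)      # map to 1 = anomaly, 0 = normal
print(confusion_matrix(y, pred))

# score_samples returns the opposite of the anomaly score s(x, n): lower = more anomalous.
scores = iso.score_samples(X)
```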
Performance Metrics
For each method, the following metrics are computed:
Accuracy: Proportion of correct predictions (both anomalies and normal points).
Precision: Proportion of correctly identified anomalies out of all points labeled as anomalies.
Recall: Proportion of actual anomalies that were correctly identified.
These metrics are derived from the confusion matrix, which tabulates true positives, false positives, true negatives, and false negatives.
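A sketch of deriving the three metrics directly from the confusion matrix, assuming binary predictions pred and ground truth y with 1 = anomaly; both names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# y: ground truth, pred: model output, both with 1 = anomaly, 0 = normal (assumed inputs).
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all points
precision = tp / (tp + fp)                    # flagged anomalies that are real
recall    = tp / (tp + fn)                    # real anomalies that were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```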