① Model Selection

v6.1

② Alert Input

Paste JSON to validate
Fields detected: alert_id event_type source_ip destination_ip severity raw_log
Loading samples…

Upload a .json array or .csv file (up to 25 alerts for UI; use pipeline.py for larger runs).

Ctrl+Enter

Verdict

Run an investigation to see the verdict here.

Triage Summary

Run an investigation to see the triage summary here.

Executive Summary

One-sentence conclusion appears after investigation.

Alert Identity

Alert metadata appears here after investigation.

Top Signals

0 signals

Main drivers of the classification appear here after investigation.

+ pushes toward Malicious · − pushes toward Benign · High impact (≥66%) · Medium (33–66%) · Low (<33%). Numbers are signed contribution weights; bigger = stronger influence.
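A minimal sketch of how the legend above could be applied, assuming the stated impact bands; the function and field names here are hypothetical, not taken from the project code:

```python
def impact_band(weight: float) -> str:
    """Map a signed contribution weight to an impact band by magnitude."""
    magnitude = abs(weight)  # the sign only encodes direction, not strength
    if magnitude >= 0.66:
        return "High"
    if magnitude >= 0.33:
        return "Medium"
    return "Low"

def direction(weight: float) -> str:
    """Positive weights push toward Malicious, negative toward Benign."""
    return "Malicious" if weight > 0 else "Benign"
```

So a signal with weight −0.5 would render as a Medium-impact signal pushing toward Benign.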

Risk Threshold

0.50

Classification sensitivity. Alerts with malicious probability at or above the threshold are labelled Malicious; those below are labelled Benign.
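The thresholding rule above is simple enough to state as a one-line sketch (the function name is hypothetical):

```python
def classify(malicious_probability: float, threshold: float = 0.50) -> str:
    """Label an alert by comparing its malicious probability to the risk threshold."""
    # At or above the threshold -> Malicious; strictly below -> Benign.
    return "Malicious" if malicious_probability >= threshold else "Benign"
```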

0.00 1.00
Score
Verdict
Delta

Full Agent Reasoning

The agent's full reasoning chain will appear here after investigation.

MITRE ATT&CK Matrix

0 techniques detected

ATT&CK tactic & technique matrix appears here after investigation.

Investigation Trace

0 steps

The agent's subagent trace will stream here during investigation.

Session Timeline

0 alerts in history

Alerts you investigate will appear on this timeline.

Risk Gauge

Awaiting investigation…

Signal Distribution

0 signals

Signal mix appears here after investigation.

Attack Flow

Run an investigation to see the flow.

Signal Contribution

0 signals

Signal magnitudes appear here after investigation.

Risk Score Cascade

A step-by-step waterfall showing how the risk score was built will appear here.

Final Assessment

Summary appears after investigation.

Cross-Incident History

0 incidents
Time · Alert ID · Event Type · Source IP · Destination · Verdict · Conf · Tokens
Loading incident history…

300-Alert Benchmark

Primary · Deep Agent

150 Benign / 150 Malicious · Primary evaluation dataset. Two Deep Agent models compared.

20-Alert Sanity Set

Smoke test · Multi-model

10 Benign / 10 Malicious · Quick sanity test used to evaluate model candidates before committing to full benchmarks.

450-Alert Large-Scale

Legacy · Classic ReAct

225 Benign / 225 Malicious · Previous-generation benchmark run on the classic ReAct architecture before migration to Deep Agent.

Generate Report

No alert loaded

Pick a format for the alert currently loaded in the Verdict panel. Summary is a compact executive-style brief; Full includes the complete subagent trace and raw reasoning.

Summary Report

Compact
  • Alert ID + core metadata
  • Final verdict, risk score, attack type
  • Top Signals (explainability)
  • 1-line reasoning

Full Report

Detailed
  • Everything in Summary
  • Full input alert JSON
  • Subagent analysis trace
  • Complete agent reasoning, metrics

Run an investigation in the Investigate tab first to enable report generation.

Generated Reports

No reports yet. Generate one from the card above.

CERBERUS

MSCS 670 · Spring 2026

An autonomous Tier-1 SOC analyst that classifies security alerts as Malicious or Benign. Built on the LangChain deepagents framework — every alert goes through a lead orchestrator plus three parallel subagents and a calibrated risk scorer. No pre-agent bypass.

299/300
Accuracy · 300-alert benchmark
0.997
F1 Score · Precision 100% · Recall 99.3%
$0.026
Cost · Full 300-alert run
11s
Avg latency · Per alert

Agent Architecture

LangChain Deep Agents
Input
Alert JSON
event_type · raw_log · IPs · severity
Lead Orchestrator
Gemini 2.5 Flash
Emits 3 parallel task calls in a single turn
Subagent · Parallel
network-forensics
  • check_network_context
Subagent · Parallel
log-analysis
  • analyze_log_pattern
Subagent · Parallel
threat-intel
  • check_ip_reputation
  • query_threat_intelligence
  • geolocate_and_check_travel
Risk Synthesis
calculate_risk_score
Weighted evidence + sigmoid calibration → probability
Output
Verdict JSON
verdict · confidence · reasoning → SQLite history
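The Risk Synthesis step (weighted evidence plus sigmoid calibration) can be sketched as below. Only the name `calculate_risk_score` comes from the architecture; the signal names and weights are illustrative assumptions:

```python
import math

def calculate_risk_score(evidence: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Combine subagent evidence into a calibrated malicious probability.

    Hypothetical sketch: a weighted sum of signal scores is squashed
    through a sigmoid so the output is a probability in (0, 1).
    """
    z = sum(weights.get(name, 0.0) * score for name, score in evidence.items())
    return 1.0 / (1.0 + math.exp(-z))

# Example: strong threat-intel hit plus a suspicious log pattern.
evidence = {"ip_reputation": 0.9, "log_pattern": 0.7, "network_context": 0.2}
weights = {"ip_reputation": 2.0, "log_pattern": 1.5, "network_context": 1.0}
prob = calculate_risk_score(evidence, weights)  # well above a 0.50 threshold
```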

Model Comparison

Tested on 300-alert benchmark
Gemini 2.5 Flash RECOMMENDED
google/gemini-2.5-flash
Accuracy: 299/300 (99.7%)
F1 Score: 0.997
Precision: 100%
Recall: 99.3%
Avg Latency: ~12.6 s / alert
Avg Tokens: ~584 / alert
Cost (300 run): ~$0.026

Best overall accuracy and cost-performance balance. Used as primary.
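The headline metrics are internally consistent: precision 100% with one wrong verdict on 300 alerts implies a single false negative. The confusion counts below are inferred from the card above, not taken from the run logs:

```python
# Inferred confusion counts for the Gemini 2.5 Flash run:
# 150 malicious alerts, no false positives, one missed malicious alert.
tp, fp, fn, tn = 149, 0, 1, 150

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 299/300
precision = tp / (tp + fp)                          # 1.0
recall = tp / (tp + fn)                             # 149/150
f1 = 2 * precision * recall / (precision + recall)  # 298/299

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```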

GPT-4o-mini BACKUP
openai/gpt-4o-mini
Accuracy: 293/300 (97.7%)
F1 Score: 0.983
Precision: 99.1%
Recall: 97.5%
Avg Latency: ~24.1 s / alert
Avg Tokens: ~797 / alert
Cost (300 run): ~$0.036

Fallback when Gemini is unavailable. Slower and slightly less accurate.

Tech Stack

Python 3.11+ LangChain Deep Agents OpenRouter Gemini 2.5 Flash Flask SQLite Vanilla JS Inline SVG

Datasets

Dataset · Size · Purpose
Sanity set · 20 labeled alerts · Smoke tests during development
Benchmark set · 300 labeled alerts · Primary evaluation (150 B / 150 M)
Hidden test set · 500 alerts · Professor's final competition evaluation
MSCS 670 · Special Topics: Agentic AI · Spring 2026
Joaan Bin Jassim Academy for Defence Studies