Disclaimer: Opinions expressed are solely my own and do not reflect the views or opinions of my employer or any other affiliated entities. Any sponsored content featured on this blog is independent and does not imply endorsement by, nor relationship with, my employer or affiliated organisations.
A Quick Note Before We Start
When I wrote about how AI transforms detection engineering from narrow precision to broad coverage, I originally planned this as Part 2 of that series. The logic made sense: Part 1 shows how AI enables comprehensive detections, Part 2 would explain how to handle the resulting alert volume.
But as I started writing, I realized this piece actually belongs with a different blog I published a couple months ago: "Why SOC Analysts Ignore Your Playbooks".
Here's why: The detection engineering blog is about what you can now detect. This blog is about how you investigate what you detect. And the investigation problem traces back to the same root causes I identified in the playbooks blog: broken processes, ignored documentation, and tribal knowledge that walks out the door when your best analysts leave.
So consider this Part 2 of the playbooks series, not the detection engineering one. Though honestly, they're all connected; you can't deploy comprehensive detections without machine-executable investigation procedures, and you can't build those procedures if your playbooks are PDF documents nobody follows.
If you haven't read Part 1 yet, start there. It explains why playbooks are broken and how to fix them by coupling them with detections. This post takes that foundation and shows you how to transform those playbooks into logic that AI can actually execute.
Let's dive in.
The Problem We Thought We Solved
In Part 1, I argued that the biggest mistake in SecOps is shipping detections without playbooks. That shadow detection sitting in your SIEM that nobody understands? That's the root cause of your broken feedback loop, your alert fatigue, and your analyst burnout.
The fix seemed straightforward:
Build the playbook when you build the detection
Use GenAI to draft it in seconds
Automate the steps you can
Let AI learn from analyst behavior for the rest
And for human analysts, this works. You now have documented procedures, analysts can follow them (or at least reference them when needed), and new joiners can learn how your organization operates.
But then you try to hand these playbooks to an AI agent to execute autonomously.
And it all falls apart.
This edition is sponsored by Prophet Security
Put every alert through a complete investigation.
Prophet AI pulls the right context, follows a reproducible line of questioning, and returns a clear determination with linked evidence and an audit trail. Analysts stay in control while investigations finish faster and hold up under review.
See it on your data. Request a demo at prophetsecurity.ai
Why Your "Fixed" Playbooks Still Break AI
Let's take a typical playbook that follows my advice from Part 1. It's not a dusty PDF; it's embedded with the detection, it's maintained, and it actually reflects how your team works:
Suspicious Login Investigation Procedure
When you receive a suspicious login alert, follow these steps:
Check whether the login location matches the user's typical locations
Review their login history for the past 30 days in your identity provider
If the location is unusual, verify whether the user recently traveled
Check if the device fingerprint is recognized
If multiple red flags exist, contact the user's manager
If you cannot confirm legitimacy within 2 hours, force password reset
This is a GOOD playbook by Part 1 standards. It's clear, actionable, and tied to a specific detection. An experienced analyst knows exactly what to do.
But hand this to an AI agent, and watch what happens:
What exactly is an "unusual location"?
50km away? 500km? Different country? Different continent?
What if the user works remotely and travels frequently?
What if they're using a VPN that makes location data unreliable?
What constitutes "multiple red flags"?
Two flags? Three? Which combinations matter more?
Is unusual location + recognized device better or worse than normal location + unrecognized device?
How do you weight these factors?
What if your identity provider is temporarily unavailable?
Does the AI wait? For how long?
Does it proceed without that data? How does that affect its confidence?
Does it automatically escalate?
When should AI act autonomously vs. escalate to a human?
Can it force a password reset on its own? For which user types?
What if the "suspicious" login is the CEO accessing email at 2 AM?
Human analysts handle this ambiguity through experience, context, and judgment. They know that "unusual" for the CEO is different from "unusual" for a contractor. They know which data sources to trust. They adapt when systems are down.
AI doesn't have that context unless you make it explicit.

The Missing Layer: Machine-Executable Logic
This is what I've realized since Part 1: there's a gap between what humans need and what AI needs.
Humans need:
Clear guidance on what to investigate
Reference points for common scenarios
Flexibility to adapt based on context
Your Part 1 playbooks solve this
AI needs:
Explicit decision logic with no ambiguity
Quantifiable thresholds and confidence scores
Complete coverage of edge cases and fallbacks
Structured entity resolution across data sources
Your Part 1 playbooks don't solve this
So we need a new layer. Not to replace human-readable playbooks, but to sit underneath them.
Let me show you what this looks like.
The Same Playbook, But Machine-Executable
Here's how that suspicious login investigation transforms when designed for AI execution:
```yaml
playbook: suspicious_login_investigation
version: 2.3
owner: identity_security_team

# Data source requirements with structured fallbacks
required_data_sources:
  okta_logs:
    retention_days: 30
    required_fields: [user_id, src_ip, location, device_fingerprint, mfa_status]
    unavailable_action:
      type: cap_max_confidence
      max_confidence: 0.80
  vpn_logs:
    retention_days: 7
    unavailable_action:
      type: confidence_penalty
      delta: 0.10

# Entity resolution: how to connect data across systems
entity_resolution:
  user_identity:
    canonical_identifier: email
  crowdstrike_endpoint:
    steps:
      - {action: map, from: alert.email, to: active_directory.userPrincipalName}
      - {action: get, field: active_directory.lastLogonComputer}
      - {action: map, from: active_directory.lastLogonComputer, to: crowdstrike.hostname}
      - {action: get, field: crowdstrike.AID}
    fallback: {strategy: use_cache, max_age_hours: 24}

# Confidence scoring: weights sum to 1.0
confidence_scoring:
  method: weighted_sum
  normalize_weights: true
  factors:
    - name: location_deviation
      weight: 0.30
      scoring:
        - {when: "distance_km < 50 AND time_plausible", score: 0.80}
        - {when: "distance_km >= 500 OR impossible_travel", score: 0.20}
    - name: mfa_status
      weight: 0.25
    - name: device_recognition
      weight: 0.25
    - name: time_of_day_anomaly
      weight: 0.15
    - name: concurrent_activity
      weight: 0.05

# Complete decision ranges (no gaps: 0.0-1.0 covered)
decision_thresholds:
  auto_close_benign: {range: [0.95, 1.00], action: close_with_docs}
  auto_close_fp: {range: [0.90, 0.95], action: close_and_tune_detection}
  low_risk_action: {range: [0.75, 0.90], action: require_mfa_reauth}
  escalate_medium: {range: [0.60, 0.75], action: analyst_review, sla_hours: 4}
  escalate_high: {range: [0.40, 0.60], action: analyst_review, sla_hours: 2}
  escalate_critical: {range: [0.00, 0.40], action: incident_response, sla_minutes: 30}

# Safety rails for autonomous actions
governance:
  suspend_account:
    approval_required: true
    two_person_rule: true
    blast_radius_check: true
  isolate_endpoint:
    approval_required: true
    blast_radius_limit: 5

# Audit with privacy controls
audit_trail:
  capture: [data_sources_queried, confidence_scores, reasoning_chain, action_taken]
  privacy: {redact_pii: true, respect_residency: true}
```
See the difference?
No ambiguity. "Unusual location" becomes distance_km >= 500 OR impossible_travel with a confidence score of 0.20.
No gaps. Every possible confidence score from 0.0 to 1.0 maps to a specific action.
Explicit fallbacks. If Okta is down, max confidence caps at 0.80. If VPN logs are unavailable, confidence drops by 0.10.
Measurable weights. Location deviation is weighted 0.30, MFA status 0.25, and device recognition 0.25; you can validate these weights against historical outcomes.
Clear governance. AI can suggest account suspension, but requires human approval with two-person rule.
This isn't documentation for humans to read. It's executable logic for AI to run.
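To make that concrete, here is a minimal sketch of what an execution engine might do with this logic. The factor names, weights, and decision ranges mirror the YAML above; the function names and structure are illustrative assumptions, not any particular product's API.

```python
# Minimal sketch of executing the playbook's confidence scoring and thresholds.
# Weights and ranges mirror the YAML above; everything else is illustrative.

FACTOR_WEIGHTS = {
    "location_deviation": 0.30,
    "mfa_status": 0.25,
    "device_recognition": 0.25,
    "time_of_day_anomaly": 0.15,
    "concurrent_activity": 0.05,
}

# Decision ranges copied from decision_thresholds (higher score = more likely benign).
DECISION_THRESHOLDS = [
    (0.95, 1.00, "close_with_docs"),
    (0.90, 0.95, "close_and_tune_detection"),
    (0.75, 0.90, "require_mfa_reauth"),
    (0.60, 0.75, "analyst_review_sla_4h"),
    (0.40, 0.60, "analyst_review_sla_2h"),
    (0.00, 0.40, "incident_response_sla_30m"),
]

def weighted_confidence(factor_scores: dict[str, float], max_confidence: float = 1.0) -> float:
    """Weighted sum of per-factor scores, normalized over available factors
    and capped (e.g. when a required data source like Okta is unavailable)."""
    total_weight = sum(FACTOR_WEIGHTS[name] for name in factor_scores)
    score = sum(FACTOR_WEIGHTS[name] * value for name, value in factor_scores.items()) / total_weight
    return min(score, max_confidence)

def decide(confidence: float) -> str:
    """Map a confidence score to an action; ranges cover 0.0-1.0 with no gaps."""
    for low, high, action in DECISION_THRESHOLDS:
        if low <= confidence <= high:
            return action
    raise ValueError(f"No decision range covers confidence {confidence}")

# Example: nearby login, MFA passed, known device, odd hour, no concurrent sessions.
scores = {
    "location_deviation": 0.80,
    "mfa_status": 0.90,
    "device_recognition": 0.85,
    "time_of_day_anomaly": 0.40,
    "concurrent_activity": 0.70,
}
confidence = weighted_confidence(scores, max_confidence=0.80)  # Okta down -> cap at 0.80
print(confidence, decide(confidence))  # ~0.77 -> require_mfa_reauth
```

The point isn't this particular code; it's that every branch the AI can take is written down, testable, and auditable.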

Connecting Back to Part 1: The Full Picture
Remember in Part 1 when I talked about two approaches to fixing playbooks?
Approach 1: Build playbooks during detection engineering
Approach 2: Let AI observe analysts and generate playbooks from behavior
Both of these still apply. But now I'm adding a critical step:
Approach 1 Extended: Detection Engineering → Machine Logic
Build human-readable playbook with detection (Part 1)
Identify which steps can be automated
Transform those steps into machine-executable logic (Part 2)
Approach 2 Extended: Behavioral Learning → Quantified Rules
Let AI observe how analysts investigate alerts (Part 1)
Generate initial playbook from behavior patterns
Codify the decision logic into structured, quantifiable rules (Part 2)
The insight from Part 1 still holds: analysts going by "instinct" have built better processes in their heads.
The challenge in Part 2 is: how do we capture that instinct as explicit, measurable logic that AI can execute?
This is where those analysts who ignore playbooks actually become your most valuable asset. Watch how they handle edge cases. Ask them why they made each decision. What data points did they weigh most heavily? When did they escalate vs. auto-close?
That tribal knowledge you were trying to capture in Part 1? Now you're quantifying it for Part 2.
What This Enables (Beyond Just Automation)
When I published Part 1, the focus was on fixing broken processes. If playbooks are ignored, your feedback loop breaks, your detection tuning stops, and analyst burnout accelerates.
But machine-executable playbooks unlock something bigger: measurable, improvable AI decision-making.
In Part 1, I mentioned the vanity metrics problem: everyone obsesses over alert closure times, but nobody tracks whether AI is making correct decisions.
Machine-executable playbooks make real metrics possible:
AI Decision Accuracy
Of benign closures, how many were actually benign?
When AI says 95% confident, is it right 95% of the time?
Which alert types show the most false positives from AI triage?
Learning Over Time
Does accuracy improve week over week?
Do analyst overrides result in playbook improvements?
Is the system getting better at handling your specific environment?
Drift Detection
Is AI accuracy degrading as your environment changes?
Which confidence factors show the most drift?
Are updates keeping pace with infrastructure changes?
You can't measure any of this with ambiguous playbooks that analysts interpret differently on every shift. But with explicit, structured logic? You can track every decision, validate every confidence score, and continuously improve.
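As a rough illustration, here's what computing a couple of these metrics could look like once every AI decision is logged with its confidence score and later verified by an analyst. The record fields and function names are hypothetical; the point is that explicit playbooks make them trackable.

```python
# Illustrative sketch: decision-accuracy metrics from a log of verified AI decisions.
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    alert_type: str
    ai_action: str          # e.g. "close_with_docs", "analyst_review_sla_4h"
    ai_confidence: float    # score produced by the playbook's weighted_sum
    verified_benign: bool   # ground truth established after analyst review

def benign_closure_precision(records: list[DecisionRecord]) -> float:
    """Of alerts the AI auto-closed as benign, how many were actually benign?"""
    closed = [r for r in records if r.ai_action == "close_with_docs"]
    if not closed:
        return float("nan")
    return sum(r.verified_benign for r in closed) / len(closed)

def false_positive_rate_by_alert_type(records: list[DecisionRecord]) -> dict[str, float]:
    """Which alert types show the most incorrect benign closures from AI triage?"""
    by_type: dict[str, list[DecisionRecord]] = {}
    for r in records:
        if r.ai_action == "close_with_docs":
            by_type.setdefault(r.alert_type, []).append(r)
    return {t: 1 - sum(r.verified_benign for r in rs) / len(rs) for t, rs in by_type.items()}
```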
This is the feedback loop Part 1 was trying to fix, now operating at AI speed.
The Six Requirements for Machine-Executable Playbooks
Based on everything I've learned building these for real environments, here's what actually matters:
1. Explicit Decision Logic (No Ambiguity, No Gaps)
Your decision thresholds must cover every possible confidence score from 0.0 to 1.0. No "if unusual, escalate" nonsense. Define EXACTLY what triggers each action.
❌ Bad: If suspicious, escalate to analyst
✅ Good: an explicit decision_thresholds block that maps every confidence score from 0.0 to 1.0 to a specific action.
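For example, excerpted from the full playbook above (the range boundaries and action names are illustrative values you'd tune for your environment):

```yaml
decision_thresholds:
  auto_close_benign: {range: [0.95, 1.00], action: close_with_docs}
  auto_close_fp: {range: [0.90, 0.95], action: close_and_tune_detection}
  low_risk_action: {range: [0.75, 0.90], action: require_mfa_reauth}
  escalate_medium: {range: [0.60, 0.75], action: analyst_review, sla_hours: 4}
  escalate_high: {range: [0.40, 0.60], action: analyst_review, sla_hours: 2}
  escalate_critical: {range: [0.00, 0.40], action: incident_response, sla_minutes: 30}
```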
2. Data Topology and Entity Resolution
Your AI needs to know how to connect data across your specific tool stack. How do you go from alert email → Active Directory → CrowdStrike endpoint ID?
This was implicit tribal knowledge in Part 1's world. In Part 2, it must be spelled out explicitly.
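For example, here is the entity_resolution section from the playbook above again; the Active Directory and CrowdStrike field names are illustrative, so map them to your own stack:

```yaml
entity_resolution:
  user_identity:
    canonical_identifier: email
  crowdstrike_endpoint:
    steps:
      - {action: map, from: alert.email, to: active_directory.userPrincipalName}
      - {action: get, field: active_directory.lastLogonComputer}
      - {action: map, from: active_directory.lastLogonComputer, to: crowdstrike.hostname}
      - {action: get, field: crowdstrike.AID}
    fallback: {strategy: use_cache, max_age_hours: 24}
```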
3. Confidence Calibration (Validated Against Outcomes)
Remember in Part 1 when I said the feedback loop is broken because analysts ignore the step that says "send false positives back to detection engineering"?
Machine-executable playbooks fix this by requiring outcome validation:
If AI says 90% confident benign, track how often it's actually benign
Adjust confidence weights based on historical accuracy
Recalibrate when you detect drift
The confidence scores must be trustworthy, not aspirational.
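A hedged sketch of what that validation might look like, reusing the verified decision log from the earlier metrics sketch: bucket decisions by predicted confidence and compare against the observed benign rate. The bucket size and drift tolerance are illustrative choices, not values from the playbook.

```python
# Rough calibration check: does "95% confident" actually mean right 95% of the time?
def calibration_report(records, bucket_size: float = 0.10, tolerance: float = 0.05):
    """Group verified decisions into confidence buckets and compare the mean
    predicted confidence with the observed benign rate; flag drifted buckets."""
    buckets: dict[int, list] = {}
    for r in records:
        buckets.setdefault(int(r.ai_confidence / bucket_size), []).append(r)

    report = []
    for idx in sorted(buckets):
        rs = buckets[idx]
        predicted = sum(r.ai_confidence for r in rs) / len(rs)
        observed = sum(r.verified_benign for r in rs) / len(rs)
        report.append({
            "bucket": f"{idx * bucket_size:.2f}-{(idx + 1) * bucket_size:.2f}",
            "predicted": round(predicted, 3),
            "observed": round(observed, 3),
            "drifted": abs(predicted - observed) > tolerance,
        })
    return report
```

Drifted buckets are your signal to revisit the confidence weights in the playbook, not a reason to distrust the whole approach.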
4. Reproducible Metrics and Evaluation
When a vendor claims "improves coverage from 50% to 95%," ask: "95% of what, measured how?"
With machine-executable playbooks, you can test against:
Historical incidents with known outcomes
Purple team exercises covering MITRE ATT&CK techniques
Adversary emulation with documented expected results
Define variant coverage as: (# of distinct attack technique variants correctly triaged) / (total curated variants in test set)
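For example, if your curated test set contains 40 distinct technique variants and the AI triages 34 of them correctly, variant coverage is 34/40 = 85%, regardless of how many raw alerts that happens to represent.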
This separates real capability from marketing claims.
5. Safety Rails and Governance
In Part 1, I talked about how deterministic automation needs approval workflows. The same applies here, but even more critically.
AI Agents handle: Investigation, enrichment, triage decisions, confidence scoring, and recommending next steps
Deterministic automation handles: Response execution with defined approval workflows, rate limiting, blast radius checks
Why this matters: AI reasoning excels at weighing ambiguous evidence during investigations. However, response actions need predictability, auditability, and fail-safe controls.
Governance controls:
Confidence thresholds determine when AI can auto-close vs. escalate
Escalation criteria define what triggers human involvement
Override tracking captures when analysts disagree with AI (this feeds back into learning)
6. Complete Observability and Transparency
In Part 1, I said the problem with playbooks is they're ignored and never maintained. Part 2 solves this by making every AI decision traceable:
Complete audit trails showing data queried, confidence calculated, thresholds applied
Reasoning chains explaining every decision step
Metrics tracking accuracy, drift, and performance over time
Testing frameworks for validating playbook changes before deployment
Black boxes are unacceptable. If you can't see how AI reaches conclusions, you cannot trust it, improve it, or measure its effectiveness.
This is how you prevent Part 2 from becoming Part 1 all over again: broken processes that nobody maintains.
Two Paths to Implementation (Revisited from Part 1)
In Part 1, I outlined how to build playbooks alongside detections. Now let's extend that with two approaches to machine-executable logic:
Route 1: Customizable Intelligence
The approach: Build your own machine-executable playbooks for your specific environment. You control procedures, entity mappings, confidence weights, and thresholds.
Best for: Mature security operations with established procedures, engineering resources to invest, and unique tool stacks requiring customization.
What you get: Maximum control to encode your specific tribal knowledge and operational context, the "instinct" your best analysts use.
What it requires: Security engineering resources to build, test, and maintain playbooks over time. This is the formalization of the behavioral learning I mentioned in Part 1.
Route 2: Out-of-the-Box Intelligence
The approach: Pre-trained AI with built-in investigation procedures based on security best practices and collective intelligence across many deployments.
Best for: Faster time-to-value, teams lacking resources to build custom playbooks, organizations wanting proven procedures.
What you get: Working baseline from day one, proven procedures, faster deployment.
What you can customize: Thresholds, escalation criteria, and specific organizational context as you learn what works.
The Non-Negotiable Requirement
Regardless of path, transparency is mandatory.
Remember the RPA problem I mentioned in Part 1? RPA solutions failed in cybersecurity because the tech wasn't there and the environment was too unpredictable.
AI solves the unpredictability through non-deterministic reasoning. But only if you can:
✓ See the investigation logic and understand what the AI does
✓ Trace the complete reasoning chain from alert to decision
✓ Audit all decisions with confidence scores and data sources
✓ Modify procedures as your requirements change
✓ Measure effectiveness with accuracy, learning, and drift metrics
Critical questions to ask any AI SOC platform:
Can I see the investigation logic being applied?
How are confidence scores calculated?
What happens when data sources are unavailable?
How do you handle entity resolution across my specific tools?
Can I test changes before deploying them?
How do you measure and track accuracy over time?
Without answers to these, you're just automating chaos faster.
The Bottom Line: Completing the Transformation
In Part 1, I showed you why playbooks are broken and how to fix them by coupling them with detections. That solves the human problem.
But Part 2 is the necessary evolution: those human-readable playbooks don't work for AI execution.
Machine-executable SOPs bridge the gap between broader detections and manageable operations. They're what allow AI to triage thousands of alerts with the same quality your best analyst applies to dozens.
This isn't better documentation. It's a fundamental shift in how security knowledge gets operationalized: moving from ambiguous human guidelines to explicit, testable, measurable logic that can be continuously improved.
The connection to Part 1:
Analysts ignore playbooks → Build them with detections
Detections without playbooks break feedback loops → Automate what you can
Analysts go by instinct → Capture that tribal knowledge
Now codify that knowledge into logic AI can execute → Measure and improve systematically
The six requirements that matter:
Explicit decision logic with complete coverage (no gaps, no ambiguity)
Data topology and entity resolution (connecting data across your specific systems)
Confidence calibration (trustworthy thresholds validated against outcomes)
Reproducible metrics (defensible claims about coverage and accuracy)
Safety rails (governance preventing business disruption)
Complete observability (transparency enabling trust and improvement)
Whether you build custom playbooks or start with proven out-of-the-box intelligence, these requirements don't change. The platform must provide them, and you must be able to validate they're working in your specific environment.
Without Part 1, your processes stay broken.
Without Part 2, you're automating chaos faster.
With both, you're building a security operation that measurably improves over time.