LLMs in Security Operations: Helpful Sidekick or Hallucinating Intern?


Large language models (LLMs) are everywhere now. Your inbox, your SIEM, maybe even embedded in your security tool’s new “AI assistant” tab. It’s tempting to believe these tools are ready to triage alerts, write detections, and handle analyst fatigue all on their own.

They aren’t. Not yet.

But that doesn’t mean they’re useless. Like any tool, the trick is understanding their strengths, weaknesses, and sharp edges before you put them in production.

This post breaks down where LLMs can help in security operations, where they fail (spectacularly), and how to use them without losing your mind—or your SOC’s credibility.


Where LLMs Actually Help

1. Summarizing Noisy Alert Data

Given a blob of log data or an alert cluster, LLMs are surprisingly good at turning it into a plain-English summary.

“This alert is based on a PowerShell process executed by a domain user on a critical host. It used base64 encoding and contacted an external IP not previously seen in the environment.”

This is helpful in:

  • Slack summaries
  • Ticket descriptions
  • Post-incident writeups

2. Writing Detection Rules (Carefully)

LLMs can assist with rule generation when given:

  • Structured inputs (TTPs, Sigma templates)
  • Examples to work from

They’re great at:

  • Converting natural language like “alert on unusual login hours” into Sigma syntax
  • Translating Splunk queries into Elastic DSL (or vice versa)

Just don’t blindly trust the output. Always validate logic and run tests.
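
To make that concrete, here is a minimal, illustrative Sigma sketch of the kind of rule an LLM might draft for the encoded-PowerShell behavior from the summary example earlier (the “unusual login hours” idea from the first bullet needs a notion of “unusual” defined outside the rule itself). Every field name here assumes Sysmon-style process-creation logs and should be treated as an assumption to verify against your own source:

  title: Encoded PowerShell Command Line (illustrative sketch)
  status: experimental
  logsource:
      category: process_creation
      product: windows
  detection:
      selection:
          Image|endswith: '\powershell.exe'
          CommandLine|contains:
              - ' -enc '
              - ' -EncodedCommand'
      condition: selection
  falsepositives:
      - Administrative or backup scripts that legitimately use encoded commands
  level: medium

Even a rule this small bakes in assumptions (field names, flag spellings, what counts as “encoded”) that only hold for a specific log source.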

3. Reverse Engineering Logs

If you’ve ever stared at a Windows event and wondered what it actually means, LLMs can:

  • Describe obscure fields
  • Suggest what a given process + command line might be doing
  • Help with weird vendor logs you rarely see

4. Enriching Threat Intelligence

Feed it an IOC or phishing email and it can:

  • Summarize likely behavior
  • Recommend MITRE techniques
  • Draft enrichment notes or analyst commentary

Where LLMs Fall Apart

1. Hallucinating Syntax and Facts

LLMs are language prediction machines, not security engines. They’ll:

  • Invent detection logic that looks right but does nothing
  • Mix up fields (e.g., swapping source.ip and destination.ip)
  • Cite MITRE technique mappings that don’t exist

2. Misunderstanding Sequences of Events

They’re not great at:

  • Determining causality (e.g., what came first in a process tree)
  • Understanding context spread across multiple logs
  • Distinguishing expected from suspicious behavior in enterprise environments

3. Security Context Is Missing

Most LLMs don’t:

  • Know what’s normal in your environment
  • Understand real asset value, privilege level, or business context
  • Detect nuance in policy or team-specific suppression logic

They don’t know which service accounts are allowed to access your production buckets.
They don’t know that powershell.exe is part of your automated backup process on Wednesdays.
They can’t tell the difference between a test system and a domain controller.

In other words: they can mimic syntax, but not intent.

Without environment-specific context, an LLM might flag perfectly legitimate behavior as malicious—or worse, ignore something subtle but dangerous.

This is why plugging an LLM into your alert pipeline without strong context-aware rules, environment tagging, or human oversight is asking for trouble.


Real-World Example: ChatGPT in the SOC

We’ve used ChatGPT internally for:

  • Explaining strange PowerShell commands in alerts
  • Writing regex for detection tuning
  • Generating summaries of long event chains in postmortems

In one case, we asked ChatGPT to convert a detection idea into a Sigma rule:
“Alert on DNS queries to rare domains from a non-standard process.”

It generated something plausible. But upon testing it:

  • The process field was incorrect for our EDR source
  • The logic didn’t actually filter for “rare” domains—it just listed any external DNS queries
  • It missed edge cases like nslookup.exe spawning from a script
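
To make those gaps concrete, here is a rough reconstruction of the shape of rule it produced (not the verbatim output; the details are illustrative), with comments marking the problems:

  title: DNS Query to Rare Domain from Non-Standard Process
  status: experimental
  logsource:
      category: dns_query
      product: windows              # our EDR exposes DNS data under a different source and field set
  detection:
      selection:
          QueryName|contains: '.'   # matches essentially every lookup; nothing here models "rare"
      filter_browsers:
          Image|endswith:
              - '\chrome.exe'
              - '\msedge.exe'       # no handling for nslookup.exe spawned from a script
      condition: selection and not filter_browsers
  level: medium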

We revised the prompt and iterated until the output was usable. But it took validation, testing, and tuning before we’d trust it in production.

Bottom line: ChatGPT saved us time—but only once we treated it like a junior analyst, not a detection engineer.


Prompt Best Practices for Security Workflows

LLMs are only as good as the prompt you give them. Here’s how to get better results:

1. Be Structured

Instead of:

“Write a detection for suspicious login behavior”

Try:

“Write a Sigma rule to alert when a domain admin logs in from a country not seen in the last 30 days. Use ECS fields and include a false_positives section.”
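
Even the structured version still needs review. A hedged sketch of what a plausible response might look like, using ECS field names, with a static allow-list standing in for the 30-day baseline (which plain Sigma can’t track on its own) and a placeholder for how domain admins are identified:

  title: Domain Admin Logon from Previously Unseen Country
  status: experimental
  logsource:
      product: windows
      service: security
  detection:
      selection:
          event.code: '4624'
          user.name|contains: 'admin'          # placeholder; match your real domain-admin group instead
      filter_expected_countries:
          source.geo.country_iso_code:
              - 'US'
              - 'DE'                           # hypothetical baseline; ideally generated from the last 30 days of logons
      condition: selection and not filter_expected_countries
  falsepositives:
      - Administrator travel
      - New VPN or cloud egress points
  level: high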

2. Include Examples

Show what a good input and its expected output look like. The more structured the example, the better the response.

3. Give Context (When Safe)

Tell it:

  • What log format you use (ECS? Windows native?)
  • What tooling it’s for (Elastic? Splunk? Sentinel?)
  • What your intent is (triage alert? full detection? enrichment?)

4. Set the Role

Prompt like:

“You are a detection engineer helping to draft high-fidelity alert rules. Please provide the logic with comments explaining your assumptions.”

5. Always Validate Output

Check field names, logic flow, syntax, and assumptions.
Run it in test mode. Triage it like any other rule.


How to Use LLMs Safely in the SOC

Treat It Like a Junior Analyst

  • Never deploy suggestions without review
  • Use it for first drafts, summaries, and ideas
  • Build workflows around review, testing, and validation

Use Guardrails

  • Pre-define prompt structures and role-based contexts
  • Limit scope: detection assistant, log explainer, etc.
  • Pipe outputs into staging areas (not production alerts)

Don’t Expect ROI from Magic

  • LLMs are productivity boosters, not magic threat hunters
  • If your detection strategy is broken, this won’t fix it
  • If your data’s messy, LLMs will just make prettier messes

Final Thought

LLMs are powerful—but only when you respect their limits. They won’t replace your analysts. But they can absolutely speed up your work, reduce cognitive overhead, and make bad documentation a little less painful.

Use them for summaries, translations, and enrichment. Be skeptical of their outputs. And never forget: they’re still guessing.

Helpful sidekick? Absolutely.
Hallucinating intern? Also yes.

It’s your job to know the difference.

