Looking for a new opportunity as a DevOps Engineer
I am currently seeking a new opportunity as a DevOps Engineer, available from February 2026. I am open to remote or hybrid work from Prague, Czechia (Europe), in either a long-term, full-time position or a B2B contract at 300 EUR per day. Please feel free to contact me for further details. You can also review my professional background on my LinkedIn profile.
LangChain Agents for Infrastructure Automation
Most infrastructure automation is written as linear scripts: do this, then that, handle this error, exit. That model works fine for well-defined, predictable tasks. But when the task requires judgement — “figure out why this deployment is failing and fix it” — a linear script hits a wall.
LangChain agents offer a different model: a reasoning loop that can call tools, observe results, and decide what to do next. This tutorial builds a read-only Kubernetes diagnostic agent backed by Claude.
How the agent reasons#
The agent uses the ReAct pattern: Reason → Act → Observe, repeated until the agent has enough information to produce a final answer.
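To make the loop concrete, here is a minimal sketch of Reason → Act → Observe in plain Python. It is not how LangChain implements the pattern: `fake_policy` is a hypothetical stand-in for the model, and `TOOLS` holds stubbed functions instead of real kubectl calls.

```python
from typing import Callable

# Hypothetical stub tools; a real agent would shell out to kubectl here.
TOOLS: dict[str, Callable[[str], str]] = {
    "get_status": lambda arg: f"{arg}: 0/3 replicas ready",
    "get_logs": lambda arg: f"{arg}: FATAL out of memory",
}


def fake_policy(task: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: choose the next action from what has been seen so far.

    Returns (action, action_input). The action "finish" ends the loop and
    carries the final answer as its input.
    """
    if len(observations) == 0:
        return "get_status", "payments-api"
    if len(observations) == 1:
        return "get_logs", "payments-api"
    return "finish", f"Diagnosis after {len(observations)} observations: {observations[-1]}"


def react_loop(task: str, max_iterations: int = 10) -> str:
    observations: list[str] = []
    for _ in range(max_iterations):
        action, action_input = fake_policy(task, observations)  # Reason
        if action == "finish":
            return action_input
        observation = TOOLS[action](action_input)               # Act
        observations.append(observation)                        # Observe
    return "Stopped: iteration limit reached."
```

The iteration cap plays the same role as max_iterations on the real executor: if the policy never decides to finish, the loop still terminates.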

Prerequisites#
pip install langchain langchain-anthropic anthropic python-dotenv
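You also need kubectl configured with access to your cluster, plus helm if you use Helm releases. The agent imports used later (AgentExecutor, create_react_agent, and langchain.hub) come from the pre-1.0 LangChain layout, so pin langchain below 1.0 if a newer release fails to import them.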
# .env
ANTHROPIC_API_KEY=sk-ant-...
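Before starting the agent, it can help to fail fast when a prerequisite is missing. The sketch below is one way such a check might look; `preflight` is a hypothetical helper, not part of LangChain or kubectl.

```python
import os
import shutil


def preflight(binaries: list[str], env_vars: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            problems.append(f"missing binary: {name}")
    for var in env_vars:
        # Treat an unset or empty variable as missing.
        if not os.environ.get(var):
            problems.append(f"missing environment variable: {var}")
    return problems
```

Calling `preflight(["kubectl", "helm"], ["ANTHROPIC_API_KEY"])` after the .env file has been loaded, and exiting when the result is non-empty, avoids spending API calls on a run that cannot reach the cluster.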
Step 1 — Define the tools#
Each tool wraps a CLI command and returns its output as a string. The cluster-facing tools are read-only by design; the only side effect is the Slack notification.
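import json
import os
import subprocess
import urllib.request

from langchain.tools import tool

ALLOWED_NAMESPACES = {"prod", "staging", "monitoring"}


def _run(cmd: list[str]) -> str:
    """Run a command and return combined stdout+stderr."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        # Return the failure as text so the agent can observe it and react.
        return f"Error: {exc}"
    return result.stdout + result.stderr


def _namespace_error(parts: list[str]) -> str:
    """Return an error message, or an empty string when the namespace is allowed."""
    if "-A" in parts or "--all-namespaces" in parts:
        return "Error: querying all namespaces is not allowed."
    ns = None
    for i, part in enumerate(parts):
        if part in ("-n", "--namespace"):
            ns = parts[i + 1] if i + 1 < len(parts) else ""
        elif part.startswith("--namespace="):
            ns = part.split("=", 1)[1]
        elif part.startswith("-n") and not part.startswith("--"):
            ns = part[2:].lstrip("=")  # covers -nprod and -n=prod
    if ns is None:
        return "Error: pass an explicit namespace with -n."
    if ns not in ALLOWED_NAMESPACES:
        return f"Error: namespace '{ns}' is not in the allowed list."
    return ""


def _guarded_run(base: list[str], args: str) -> str:
    """Enforce the namespace allow-list, then run base + args."""
    parts = args.split()
    return _namespace_error(parts) or _run(base + parts)


@tool
def kubectl_get(resource_and_args: str) -> str:
    """
    Run 'kubectl get <resource_and_args> -o json'.
    Example input: 'deploy payments-api -n prod'
    """
    return _guarded_run(["kubectl", "get"], resource_and_args + " -o json")


@tool
def kubectl_describe(resource_and_args: str) -> str:
    """
    Run 'kubectl describe <resource_and_args>'.
    Example input: 'pod payments-api-7d9f8b-xkq2p -n prod'
    """
    return _guarded_run(["kubectl", "describe"], resource_and_args)


@tool
def kubectl_logs(pod_and_args: str) -> str:
    """
    Fetch logs from a pod. Always use --tail to limit output.
    Example input: 'payments-api-7d9f8b-xkq2p -n prod --tail=50'
    """
    return _guarded_run(["kubectl", "logs"], pod_and_args)


@tool
def helm_status(release_and_namespace: str) -> str:
    """
    Get Helm release status.
    Example input: 'payments-api -n prod'
    """
    return _guarded_run(["helm", "status"], release_and_namespace)


@tool
def slack_notify(message: str) -> str:
    """
    Post a message to the #incidents Slack channel.
    Use only for final summaries, not intermediate findings.
    """
    webhook = os.environ.get("SLACK_WEBHOOK_URL", "")
    if not webhook:
        return "SLACK_WEBHOOK_URL not set; message not sent."
    data = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=10)
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return f"Error posting to Slack: {exc}"
    return "Message posted to #incidents."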
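Large outputs, such as 'kubectl get ... -o json' on a big resource, can crowd the model's context window. One option, sketched here as an assumption and not something the tools above do, is to cap each observation before returning it. `truncate_output` and its 4000-character default are hypothetical.

```python
def truncate_output(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of long output and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    dropped = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]  # explicit index so half == 0 yields ""
    return f"{head}\n... [{dropped} characters omitted] ...\n{tail}"
```

Keeping both ends is a deliberate choice: for JSON the head carries metadata, while for logs and describe output the most recent events sit at the tail. The return line of `_run` would be the natural place to apply it.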
Step 2 — Build the agent#
from dotenv import load_dotenv
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_anthropic import ChatAnthropic

# Load ANTHROPIC_API_KEY (and SLACK_WEBHOOK_URL, if set) from .env
load_dotenv()

tools = [kubectl_get, kubectl_describe, kubectl_logs, helm_status, slack_notify]

llm = ChatAnthropic(
    model="claude-opus-4-6",
    max_tokens=4096,
)

# ReAct prompt from LangChain hub
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,                # prints the Reason/Act/Observe loop
    max_iterations=10,           # prevent runaway loops
    handle_parsing_errors=True,  # feed malformed model output back as an observation
)
Step 3 — Run a diagnostic task#
if __name__ == "__main__":
    result = agent_executor.invoke({
        "input": (
            "The payments-api deployment in the prod namespace has been in "
            "CrashLoopBackOff for 10 minutes. Investigate the root cause and "
            "post a summary to Slack."
        )
    })
    print("\n=== Final Answer ===")
    print(result["output"])
Agent reasoning trace (verbose output):
Thought: I need to check the deployment status first.
Action: kubectl_get
Action Input: deploy payments-api -n prod
Observation: {"status": {"readyReplicas": 0, "replicas": 3, ...}}
Thought: No ready replicas. I need to look at pod events.
Action: kubectl_describe
Action Input: pod payments-api-7d9f8b-xkq2p -n prod
Observation: ... Liveness probe failed: HTTP probe failed with statuscode: 500 ...
OOMKilled
Thought: OOMKill. I should check the actual logs before concluding.
Action: kubectl_logs
Action Input: payments-api-7d9f8b-xkq2p -n prod --tail=50
Observation: ... FATAL: runtime: out of memory ...
Thought: Root cause confirmed. Post summary to Slack.
Action: slack_notify
Action Input: *payments-api CrashLoopBackOff — Root cause: OOMKill*
Memory limit: 128Mi (too low). Recommend raising to 256Mi.
Investigate memory growth in connection pool module.
Observation: Message posted to #incidents.
Final Answer: Root cause is OOMKill due to memory limit set to 128Mi...
Total time: 47 seconds.
Safety constraints — do not skip these#
This is the part most tutorials skip. An autonomous agent that can run kubectl is dangerous if not constrained.
| Constraint | Implementation |
|---|---|
| Read-only by default | Cluster tools only run kubectl get, describe, and logs, plus helm status; no apply, delete, or patch |
| Namespace allow-list | Enforced inside each tool before running the command |
| Iteration cap | max_iterations=10 prevents runaway API loops |
| Audit log | Set verbose=True and pipe output to a structured log |
| Mutation requires human | Write operations live in a separate, human-approved workflow |
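The audit-log row can go further than verbose=True. As a rough sketch, each tool function could be wrapped so that every call is recorded as one JSON line. The `audited` decorator, the logger name `agent.audit`, and the field names are assumptions made for illustration; in the tool definitions the decorator would sit beneath `@tool` so that it wraps the plain function.

```python
import functools
import json
import logging
import time

audit_logger = logging.getLogger("agent.audit")  # hypothetical logger name


def audited(func):
    """Log one JSON line per call: tool name, input, duration, output size."""
    @functools.wraps(func)  # keep the name and docstring of the wrapped tool
    def wrapper(arg: str) -> str:
        started = time.monotonic()
        output = func(arg)
        audit_logger.info(json.dumps({
            "tool": func.__name__,
            "input": arg,
            "duration_s": round(time.monotonic() - started, 3),
            "output_chars": len(output),
        }))
        return output
    return wrapper


@audited
def echo_tool(arg: str) -> str:
    """Stand-in for a real tool such as kubectl_get."""
    return f"echo: {arg}"
```

Logging the output size instead of the output itself keeps secrets and large payloads out of the audit trail while still showing what the agent asked for and when.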
Scaling up with LangGraph#
For multi-step workflows (not just diagnosis but also remediation proposals), move the orchestration to LangGraph, which gives you explicit control over the state machine:
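import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # node outputs are appended, not overwritten
    next_step: str


# diagnose_node, propose_fix_node, and notify_node are node functions defined
# elsewhere; each takes the current state and returns a partial state update.
graph = StateGraph(AgentState)
graph.add_node("diagnose", diagnose_node)
graph.add_node("propose_fix", propose_fix_node)
graph.add_node("notify", notify_node)
graph.set_entry_point("diagnose")
graph.add_conditional_edges(
    "diagnose",
    lambda state: state["next_step"],
    {"propose": "propose_fix", "notify": "notify", "end": END},
)
# Assumed routing so the graph has no dead ends: a proposed fix is always
# followed by a notification, and the notification ends the run.
graph.add_edge("propose_fix", "notify")
graph.add_edge("notify", END)
app = graph.compile()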
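The graph above refers to diagnose_node, propose_fix_node, and notify_node without defining them. A minimal sketch of what they could look like follows, with stubbed logic in place of Claude calls. The message strings and the routing rule are invented for illustration, and `run_manually` only imitates the routing in plain Python (assuming a proposal is always followed by a notification) so the flow can be checked without LangGraph installed.

```python
def diagnose_node(state: dict) -> dict:
    """Stub diagnosis: route to a fix proposal when the input mentions OOM."""
    last = state["messages"][-1]
    if "OOM" in last:
        return {"messages": ["diagnosis: memory limit too low"], "next_step": "propose"}
    return {"messages": ["diagnosis: no root cause found"], "next_step": "notify"}


def propose_fix_node(state: dict) -> dict:
    return {"messages": ["proposal: raise the memory limit (needs human approval)"]}


def notify_node(state: dict) -> dict:
    return {"messages": [f"notify: {len(state['messages'])} messages summarised"]}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a partial update the way an operator.add reducer on messages would."""
    merged = dict(state)
    merged["messages"] = state["messages"] + update.get("messages", [])
    if "next_step" in update:
        merged["next_step"] = update["next_step"]
    return merged


def run_manually(first_message: str) -> list[str]:
    """Walk diagnose -> (propose_fix) -> notify by hand; LangGraph would do this for you."""
    state = {"messages": [first_message], "next_step": ""}
    state = apply_update(state, diagnose_node(state))
    if state["next_step"] == "propose":
        state = apply_update(state, propose_fix_node(state))
    if state["next_step"] in ("propose", "notify"):
        state = apply_update(state, notify_node(state))
    return state["messages"]
```

Each node returns only the keys it changes. With the `Annotated[list, operator.add]` annotation on messages, returned lists are concatenated onto the existing history, which is what `apply_update` mimics here.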
LangGraph for workflow structure + Claude for reasoning at each node = automation that is both auditable and adaptable.