# LangChain Agents for Infrastructure Automation
Most infrastructure automation is written as linear scripts: do this, then that, handle this error, exit. That model works fine for well-defined, predictable tasks. But when the task requires judgement — “figure out why this deployment is failing and fix it” — a linear script hits a wall.
LangChain agents offer a different model: a reasoning loop that can call tools, observe results, and decide what to do next. This tutorial builds a read-only Kubernetes diagnostic agent backed by Claude.
## How the agent reasons
The agent uses the ReAct pattern: Reason → Act → Observe, repeated until the agent has enough information to produce a final answer.
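Under the hood this is just a loop. A minimal sketch with a stubbed model and one fake tool (not LangChain's actual internals; all names here are illustrative):

```python
# Minimal sketch of a ReAct loop. A real agent parses the model's text
# output into an action; the stub here returns structured steps directly
# so the control flow is easy to see.

def fake_tool(query: str) -> str:
    return f"3 pods, 0 ready (query: {query})"

def stub_model(history: list[str]) -> dict:
    # Decide the next step from what has been observed so far.
    if not any("Observation" in h for h in history):
        return {"action": "kubectl_get", "input": "deploy payments-api -n prod"}
    return {"final": "No ready replicas; likely CrashLoopBackOff."}

def react_loop(max_iterations: int = 10) -> str:
    tools = {"kubectl_get": fake_tool}
    history: list[str] = []
    for _ in range(max_iterations):      # iteration cap prevents runaway loops
        step = stub_model(history)       # Reason
        if "final" in step:
            return step["final"]
        observation = tools[step["action"]](step["input"])  # Act
        history.append(f"Observation: {observation}")       # Observe
    return "Stopped: iteration limit reached."

print(react_loop())
```

The iteration cap in the loop is the same safety valve that `max_iterations` provides on LangChain's `AgentExecutor` later in this tutorial.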

## Prerequisites
```bash
pip install langchain langchain-anthropic anthropic langchainhub python-dotenv
```

(`langchainhub` is needed for the `hub.pull` call in Step 2.)
You also need `kubectl` configured with access to your cluster, and `helm` if you use Helm releases.
```bash
# .env
ANTHROPIC_API_KEY=sk-ant-...
```
## Step 1 — Define the tools
Each tool wraps a CLI command and returns its output as a string. Tools are read-only by design.
```python
import subprocess

from langchain.tools import tool

ALLOWED_NAMESPACES = {"prod", "staging", "monitoring"}

def _run(cmd: list[str]) -> str:
    """Run a command and return combined stdout+stderr."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        # Return the timeout as text so the agent can react to it
        # instead of crashing the whole run.
        return f"Error: '{' '.join(cmd)}' timed out after 30s."
    return result.stdout + result.stderr
```
```python
@tool
def kubectl_get(resource_and_args: str) -> str:
    """
    Run 'kubectl get <resource_and_args> -o json'.
    Example input: 'deploy payments-api -n prod'
    """
    parts = resource_and_args.split()
    # Enforce namespace allow-list
    if "-n" in parts:
        idx = parts.index("-n")
        if idx + 1 >= len(parts):
            return "Error: '-n' given without a namespace."
        ns = parts[idx + 1]
        if ns not in ALLOWED_NAMESPACES:
            return f"Error: namespace '{ns}' is not in the allowed list."
    return _run(["kubectl", "get"] + parts + ["-o", "json"])
```
```python
@tool
def kubectl_describe(resource_and_args: str) -> str:
    """
    Run 'kubectl describe <resource_and_args>'.
    Example input: 'pod payments-api-7d9f8b-xkq2p -n prod'
    """
    parts = resource_and_args.split()
    # Enforce the same namespace allow-list as kubectl_get
    if "-n" in parts:
        ns = parts[parts.index("-n") + 1]
        if ns not in ALLOWED_NAMESPACES:
            return f"Error: namespace '{ns}' is not in the allowed list."
    return _run(["kubectl", "describe"] + parts)
```
```python
@tool
def kubectl_logs(pod_and_args: str) -> str:
    """
    Fetch logs from a pod. Always use --tail to limit output.
    Example input: 'payments-api-7d9f8b-xkq2p -n prod --tail=50'
    """
    parts = pod_and_args.split()
    # Enforce the same namespace allow-list as kubectl_get
    if "-n" in parts:
        ns = parts[parts.index("-n") + 1]
        if ns not in ALLOWED_NAMESPACES:
            return f"Error: namespace '{ns}' is not in the allowed list."
    return _run(["kubectl", "logs"] + parts)
```
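Even with `--tail`, observations can blow up the model's context window. A small truncation helper (a sketch; the character limit is an arbitrary placeholder) can be applied to any tool's return value:

```python
def truncate(output: str, max_chars: int = 4000) -> str:
    """Clip tool output so one observation cannot flood the context
    window; keep the tail, where errors usually appear."""
    if len(output) <= max_chars:
        return output
    return "[...truncated...]\n" + output[-max_chars:]

print(truncate("short output"))  # short output passes through unchanged
```

Wrapping each tool's final `return _run(...)` in `truncate(...)` is enough; the agent still sees the most recent (and usually most relevant) lines.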
```python
@tool
def helm_status(release_and_namespace: str) -> str:
    """
    Get Helm release status.
    Example input: 'payments-api -n prod'
    """
    parts = release_and_namespace.split()
    # Enforce the same namespace allow-list as kubectl_get
    if "-n" in parts:
        ns = parts[parts.index("-n") + 1]
        if ns not in ALLOWED_NAMESPACES:
            return f"Error: namespace '{ns}' is not in the allowed list."
    return _run(["helm", "status"] + parts)
```
```python
@tool
def slack_notify(message: str) -> str:
    """
    Post a message to the #incidents Slack channel.
    Use only for final summaries, not intermediate findings.
    """
    import os, urllib.request, json
    webhook = os.environ.get("SLACK_WEBHOOK_URL", "")
    if not webhook:
        return "SLACK_WEBHOOK_URL not set — message not sent."
    data = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)
    return "Message posted to #incidents."
```
## Step 2 — Build the agent
```python
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_anthropic import ChatAnthropic

tools = [kubectl_get, kubectl_describe, kubectl_logs, helm_status, slack_notify]

llm = ChatAnthropic(
    model="claude-opus-4-6",
    max_tokens=4096,
)

# ReAct prompt from LangChain hub
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,               # prints the Reason/Act/Observe loop
    max_iterations=10,          # prevent runaway loops
    handle_parsing_errors=True,
)
```
## Step 3 — Run a diagnostic task
```python
if __name__ == "__main__":
    result = agent_executor.invoke({
        "input": (
            "The payments-api deployment in the prod namespace has been in "
            "CrashLoopBackOff for 10 minutes. Investigate the root cause and "
            "post a summary to Slack."
        )
    })
    print("\n=== Final Answer ===")
    print(result["output"])
```
Agent reasoning trace (`verbose=True` output):

```text
Thought: I need to check the deployment status first.
Action: kubectl_get
Action Input: deploy payments-api -n prod
Observation: {"status": {"readyReplicas": 0, "replicas": 3, ...}}
Thought: No ready replicas. I need to look at pod events.
Action: kubectl_describe
Action Input: pod payments-api-7d9f8b-xkq2p -n prod
Observation: ... Liveness probe failed: HTTP probe failed with statuscode: 500 ...
             OOMKilled
Thought: OOMKill. I should check the actual logs before concluding.
Action: kubectl_logs
Action Input: payments-api-7d9f8b-xkq2p -n prod --tail=50
Observation: ... FATAL: runtime: out of memory ...
Thought: Root cause confirmed. Post summary to Slack.
Action: slack_notify
Action Input: *payments-api CrashLoopBackOff — Root cause: OOMKill*
              Memory limit: 128Mi (too low). Recommend raising to 256Mi.
              Investigate memory growth in connection pool module.
Observation: Message posted to #incidents.
Final Answer: Root cause is OOMKill due to memory limit set to 128Mi...
```

Total time: 47 seconds.
## Safety constraints — do not skip these
This is the part most tutorials skip. An autonomous agent that can run `kubectl` is dangerous if not constrained.
| Constraint | Implementation |
|---|---|
| Read-only by default | Tools only run `get`, `describe`, `logs` — no `apply`, `delete`, `patch` |
| Namespace allow-list | Enforced inside each tool before running the command |
| Iteration cap | `max_iterations=10` prevents runaway API loops |
| Audit log | Set `verbose=True` and pipe output to a structured log |
| Mutation requires human | Write operations live in a separate, human-approved workflow |
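For the audit-log row, `verbose=True` alone only prints to the console. One way to make it durable is a small decorator that records every tool call as a structured event (a sketch: `AUDIT_TRAIL` and the stand-in `kubectl_get` body are placeholders, and a real version would append each event to a JSON-lines file or a log shipper):

```python
import json
import time

# In-memory trail for the sketch; persist each event in production.
AUDIT_TRAIL: list[dict] = []

def audited(tool_fn):
    """Record every tool invocation as a structured audit event."""
    def wrapper(arg: str) -> str:
        out = tool_fn(arg)
        event = {
            "ts": time.time(),
            "tool": tool_fn.__name__,
            "input": arg,
            "output_bytes": len(out),
        }
        AUDIT_TRAIL.append(event)
        print(json.dumps(event))  # pipe stdout to your log collector
        return out
    return wrapper

@audited
def kubectl_get(resource_and_args: str) -> str:
    return "{}"  # stand-in so the sketch runs without a cluster

kubectl_get("deploy payments-api -n prod")
```

Because the decorator sees the exact input string the agent produced, the trail answers "what did the agent actually run, and when" even after the verbose output is gone.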
## Scaling up with LangGraph
For multi-step workflows (not just diagnosis but also remediation proposals), move the orchestration to LangGraph, which gives you explicit control over the state machine:
```python
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # updates are appended, not overwritten
    next_step: str

graph = StateGraph(AgentState)

# diagnose_node, propose_fix_node, and notify_node are functions you define:
# each takes an AgentState and returns a partial state update.
graph.add_node("diagnose", diagnose_node)
graph.add_node("propose_fix", propose_fix_node)
graph.add_node("notify", notify_node)

graph.add_conditional_edges(
    "diagnose",
    lambda state: state["next_step"],
    {"propose": "propose_fix", "notify": "notify", "end": END},
)

graph.set_entry_point("diagnose")
app = graph.compile()
```
LangGraph for workflow structure + Claude for reasoning at each node = automation that is both auditable and adaptable.
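Each node is just a function that takes the current state and returns a partial update for LangGraph to merge. A sketch of what `diagnose_node` might look like (the routing keywords and message format are illustrative, not part of the tutorial's code):

```python
def diagnose_node(state: dict) -> dict:
    """Inspect the latest observation and route the workflow.
    Returns a partial state update; LangGraph merges 'messages' by
    appending and overwrites 'next_step'."""
    last = state["messages"][-1] if state["messages"] else ""
    if "OOMKilled" in last:
        next_step = "propose"   # root cause found: propose a fix
    elif "Error" in last:
        next_step = "notify"    # diagnosis failed: escalate to humans
    else:
        next_step = "end"
    return {
        "messages": [f"diagnosis routed to: {next_step}"],
        "next_step": next_step,
    }

print(diagnose_node({"messages": ["Observation: ... OOMKilled ..."]})["next_step"])  # propose
```

In a real node this decision would come from a Claude call over the collected observations; the point is only the shape: state in, partial update out, and the conditional edge reads `next_step` to pick the next node.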