Most infrastructure automation is written as linear scripts: do this, then that, handle this error, exit. That model works fine for well-defined, predictable tasks. But when the task requires judgement — “figure out why this deployment is failing and fix it” — a linear script hits a wall.

LangChain agents offer a different model: a reasoning loop that can call tools, observe results, and decide what to do next. This tutorial builds a read-only Kubernetes diagnostic agent backed by Claude.


How the agent reasons#

The agent uses the ReAct pattern: Reason → Act → Observe, repeated until the agent has enough information to produce a final answer.

LangChain ReAct agent loop
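Stripped of the framework, the loop is simple. Here is a minimal, framework-free sketch of it — the scripted `stub_model` and the `tools` dict are hypothetical stand-ins for Claude and the kubectl wrappers, not LangChain APIs:

```python
# Minimal ReAct-style loop: Reason -> Act -> Observe until a final answer.
def run_react(model, tools, question, max_iterations=10):
    transcript = [f"Question: {question}"]
    for _ in range(max_iterations):
        step = model("\n".join(transcript))                 # Reason
        if "final_answer" in step:
            return step["final_answer"]
        observation = tools[step["action"]](step["input"])  # Act
        transcript.append(f"Observation: {observation}")    # Observe
    return "Stopped: iteration limit reached."

# Scripted stand-in for the LLM: first inspect the deployment, then answer.
def stub_model(transcript):
    if "Observation" not in transcript:
        return {"action": "kubectl_get", "input": "deploy payments-api -n prod"}
    return {"final_answer": "0/3 replicas ready"}

tools = {"kubectl_get": lambda args: '{"readyReplicas": 0, "replicas": 3}'}
print(run_react(stub_model, tools, "Why is payments-api failing?"))
```

The real agent below delegates all of this bookkeeping to LangChain, but the shape of the loop — and the need for an iteration cap — is the same.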


Prerequisites#

pip install langchain langchain-anthropic anthropic python-dotenv

You also need kubectl configured with access to your cluster, and helm if you use Helm releases.

# .env
ANTHROPIC_API_KEY=sk-ant-...

Step 1 — Define the tools#

Each tool wraps a CLI command and returns its output as a string. Tools are read-only by design.

import subprocess
from langchain.tools import tool

ALLOWED_NAMESPACES = {"prod", "staging", "monitoring"}

def _run(cmd: list[str]) -> str:
    """Run a command and return combined stdout+stderr."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after 30s: {' '.join(cmd)}"
    return result.stdout + result.stderr

def _namespace_error(args: list[str]) -> str:
    """Return an error message if args target a disallowed namespace, else ''."""
    if "-n" in args:
        ns = args[args.index("-n") + 1]
        if ns not in ALLOWED_NAMESPACES:
            return f"Error: namespace '{ns}' is not in the allowed list."
    return ""

@tool
def kubectl_get(resource_and_args: str) -> str:
    """
    Run 'kubectl get <resource_and_args> -o json'.
    Example input: 'deploy payments-api -n prod'
    """
    parts = resource_and_args.split()
    if err := _namespace_error(parts):
        return err
    return _run(["kubectl", "get"] + parts + ["-o", "json"])

@tool
def kubectl_describe(resource_and_args: str) -> str:
    """
    Run 'kubectl describe <resource_and_args>'.
    Example input: 'pod payments-api-7d9f8b-xkq2p -n prod'
    """
    parts = resource_and_args.split()
    if err := _namespace_error(parts):
        return err
    return _run(["kubectl", "describe"] + parts)

@tool
def kubectl_logs(pod_and_args: str) -> str:
    """
    Fetch logs from a pod. Always use --tail to limit output.
    Example input: 'payments-api-7d9f8b-xkq2p -n prod --tail=50'
    """
    parts = pod_and_args.split()
    if err := _namespace_error(parts):
        return err
    return _run(["kubectl", "logs"] + parts)

@tool
def helm_status(release_and_namespace: str) -> str:
    """
    Get Helm release status.
    Example input: 'payments-api -n prod'
    """
    parts = release_and_namespace.split()
    if err := _namespace_error(parts):
        return err
    return _run(["helm", "status"] + parts)

@tool
def slack_notify(message: str) -> str:
    """
    Post a message to the #incidents Slack channel.
    Use only for final summaries, not intermediate findings.
    """
    import json
    import os
    import urllib.request

    webhook = os.environ.get("SLACK_WEBHOOK_URL", "")
    if not webhook:
        return "SLACK_WEBHOOK_URL not set — message not sent."
    data = json.dumps({"text": message}).encode()
    req = urllib.request.Request(webhook, data=data,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10):
            pass
    except OSError as exc:
        return f"Slack notification failed: {exc}"
    return "Message posted to #incidents."
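One caveat on the allow-list: the check above only recognizes the `-n` flag, but kubectl also accepts `--namespace prod` and `--namespace=prod`, either of which would slip past it. A more thorough extractor might look like this — `extract_namespace` is our own sketch, not a library function, and it assumes the cluster default namespace is "default" when no flag is given:

```python
def extract_namespace(args: list[str], default: str = "default") -> str:
    """Find the namespace targeted by a kubectl argument list.

    Handles '-n ns', '--namespace ns', and '--namespace=ns'; falls back
    to the given default when no namespace flag is present.
    """
    for i, arg in enumerate(args):
        if arg in ("-n", "--namespace") and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--namespace="):
            return arg.split("=", 1)[1]
    return default
```

Swap this into the allow-list check if you expect the model to vary its flag spelling; better yet, reject any flag form you don't explicitly recognize.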

Step 2 — Build the agent#

from langchain_anthropic import ChatAnthropic
from langchain.agents import AgentExecutor, create_react_agent
from langchain import hub

tools = [kubectl_get, kubectl_describe, kubectl_logs, helm_status, slack_notify]

llm = ChatAnthropic(
    model="claude-opus-4-6",
    max_tokens=4096,
)

# ReAct prompt from LangChain hub
prompt = hub.pull("hwchase17/react")

agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,       # prints the Reason/Act/Observe loop
    max_iterations=10,  # prevent runaway loops
    handle_parsing_errors=True,
)
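Note that `max_iterations` caps the number of loop turns, not wall-clock time — ten slow kubectl calls can still take minutes. If you need a hard time budget as well, one stdlib-only sketch is to run the agent in a worker thread and abandon it past a deadline (the abandoned thread is not killed, merely ignored; `seconds=120` is an arbitrary example value):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def invoke_with_deadline(fn, payload, seconds=120):
    """Run fn(payload), but give up after a wall-clock deadline."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, payload)
    try:
        return future.result(timeout=seconds)
    except FuturesTimeout:
        return {"output": f"Aborted: exceeded {seconds}s budget."}
    finally:
        pool.shutdown(wait=False)

# e.g. invoke_with_deadline(agent_executor.invoke, {"input": "..."}, seconds=120)
```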

Step 3 — Run a diagnostic task#

if __name__ == "__main__":
    result = agent_executor.invoke({
        "input": (
            "The payments-api deployment in the prod namespace has been in "
            "CrashLoopBackOff for 10 minutes. Investigate the root cause and "
            "post a summary to Slack."
        )
    })
    print("\n=== Final Answer ===")
    print(result["output"])

Agent reasoning trace (verbose output):

Thought: I need to check the deployment status first.
Action: kubectl_get
Action Input: deploy payments-api -n prod
Observation: {"status": {"readyReplicas": 0, "replicas": 3, ...}}

Thought: No ready replicas. I need to look at pod events.
Action: kubectl_describe
Action Input: pod payments-api-7d9f8b-xkq2p -n prod
Observation: ... Liveness probe failed: HTTP probe failed with statuscode: 500 ...
             OOMKilled

Thought: OOMKill. I should check the actual logs before concluding.
Action: kubectl_logs
Action Input: payments-api-7d9f8b-xkq2p -n prod --tail=50
Observation: ... FATAL: runtime: out of memory ...

Thought: Root cause confirmed. Post summary to Slack.
Action: slack_notify
Action Input: *payments-api CrashLoopBackOff — Root cause: OOMKill*
Memory limit: 128Mi (too low). Recommend raising to 256Mi.
Investigate memory growth in connection pool module.
Observation: Message posted to #incidents.

Final Answer: Root cause is OOMKill due to memory limit set to 128Mi...

Total time: 47 seconds.


Safety constraints — do not skip these#

This is the part most tutorials skip. An autonomous agent that can run kubectl is dangerous if not constrained.

Constraint                 Implementation
Read-only by default       Tools only run get, describe, logs — no apply, delete, patch
Namespace allow-list       Enforced inside each tool before running the command
Iteration cap              max_iterations=10 prevents runaway API loops
Audit log                  Set verbose=True and pipe output to a structured log
Mutation requires a human  Write operations live in a separate, human-approved workflow
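On the audit log: verbose=True only prints to stdout. For a durable trail, one option (a sketch of our own, not a LangChain API) is to wrap each underlying tool function in a decorator that emits a structured JSON record per call, before registering it with @tool:

```python
import functools
import json
import logging
import time

audit_log = logging.getLogger("agent.audit")

def audited(fn):
    """Wrap a tool function so every call is logged as a JSON record."""
    @functools.wraps(fn)
    def wrapper(arg: str) -> str:
        start = time.monotonic()
        output = fn(arg)
        audit_log.info(json.dumps({
            "tool": fn.__name__,
            "input": arg,
            "output_chars": len(output),
            "duration_s": round(time.monotonic() - start, 3),
        }))
        return output
    return wrapper

@audited
def echo_tool(arg: str) -> str:  # stand-in for a real kubectl wrapper
    return f"ran with {arg}"
```

Point `agent.audit` at a file or log shipper and you get a replayable record of every command the agent issued, with inputs and timings.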

Scaling up with LangGraph#

For multi-step workflows (not just diagnosis but also remediation proposals), move the orchestration to LangGraph, which gives you explicit control over the state machine:

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    next_step: str

# Each node is a plain function: it receives the current state and
# returns a partial state update. Bodies elided here.
def diagnose_node(state: AgentState) -> dict:
    ...  # run the diagnostic agent, set "next_step" based on findings

def propose_fix_node(state: AgentState) -> dict:
    ...  # draft a remediation proposal for human approval

def notify_node(state: AgentState) -> dict:
    ...  # post the final summary to Slack

graph = StateGraph(AgentState)
graph.add_node("diagnose", diagnose_node)
graph.add_node("propose_fix", propose_fix_node)
graph.add_node("notify", notify_node)

graph.add_conditional_edges(
    "diagnose",
    lambda state: state["next_step"],
    {"propose": "propose_fix", "notify": "notify", "end": END},
)
graph.add_edge("propose_fix", "notify")  # every path ends with a notification
graph.add_edge("notify", END)
graph.set_entry_point("diagnose")
app = graph.compile()

LangGraph for workflow structure + Claude for reasoning at each node = automation that is both auditable and adaptable.