zoff.tech

Google Cloud

How we build production AI on Google Cloud: GKE runtimes, governed sandboxes, identity, networking, observability, and agent-ready environments.

Google Cloud is a strong fit when the hard part is the governed runtime itself: Kubernetes, identity, networking, observability, and developer environments that need to be created and destroyed constantly.

We choose it when runtime control is the product. That usually means developer infrastructure, agent sandboxes, CI environments, or workloads where isolation and lifecycle management matter more than a quick chatbot surface.

Where Google Cloud fits

  • Developer-facing infrastructure with GKE as the control plane.
  • Agent or CI workloads that need isolated sandboxes with predictable network policy.
  • Teams that already operate Google Cloud and want AI features to inherit the same release and incident model.
  • Products where agents need their own governed execution environments instead of borrowing a human developer's permissions.

What we watch closely

  • Kubernetes complexity. GKE is powerful, but the product should hide namespace, policy, and lifecycle machinery from the user.
  • Environment cost. Branch and agent environments multiply quickly unless TTL, shared baselines, and cleanup are first-class behavior.
  • Identity edges. Human users, CI jobs, and agents need different scopes even when they touch the same environment.
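
The environment-cost point can be made concrete with a small sketch of TTL-based expiry. This is an illustration under stated assumptions, not a description of any particular implementation: the `Environment` record, its `pinned` flag for shared baselines, and the `expired` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Environment:
    # All fields are hypothetical; a real system would likely derive them
    # from namespace labels or annotations.
    name: str
    created_at: datetime
    ttl: timedelta
    pinned: bool = False  # e.g. a shared baseline that should never expire


def expired(envs, now):
    """Return the environments whose TTL has elapsed and that are not pinned."""
    return [e for e in envs if not e.pinned and now >= e.created_at + e.ttl]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    envs = [
        Environment("pr-101", now - timedelta(hours=30), timedelta(hours=24)),
        Environment("pr-102", now - timedelta(hours=2), timedelta(hours=24)),
        Environment("baseline", now - timedelta(days=90), timedelta(hours=1), pinned=True),
    ]
    for env in expired(envs, now):
        print(f"would tear down {env.name}")
```

Keeping the expiry decision a pure function of the environment list and the current time makes it easy to test and to run as a dry run before any teardown is wired in.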

Decisions we tend to make

  • Use GKE-native primitives when they save time, rather than abstracting across clouds too early.
  • Make teardown, TTL, and audit trails part of the core workflow.
  • Put metrics and logs in place before exposing self-serve environment creation.
  • Design agent sandboxes as governed users, not as an exception path.
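
One way to read "agents as governed users" is that every principal type goes through the same authorization check with its own scope set. The sketch below assumes a deliberately simple model; the `Principal` kinds, the action names, and the `SCOPES` table are hypothetical and would in practice map onto Kubernetes RBAC and Google Cloud IAM rather than an in-process dictionary.

```python
from enum import Enum


class Principal(Enum):
    HUMAN = "human"
    CI = "ci"
    AGENT = "agent"


# Hypothetical scope table: each principal type gets its own allowed actions,
# even when all three touch the same environment.
SCOPES = {
    Principal.HUMAN: frozenset({"create", "share", "snapshot", "tear_down", "read_logs"}),
    Principal.CI: frozenset({"create", "tear_down", "read_logs"}),
    Principal.AGENT: frozenset({"exec", "read_logs"}),
}


def is_allowed(principal, action):
    """Check an action against the principal's scope; unknown principals get nothing."""
    return action in SCOPES.get(principal, frozenset())


if __name__ == "__main__":
    for principal in Principal:
        print(principal.value, "may tear down:", is_allowed(principal, "tear_down"))
```

The agent goes through the same `is_allowed` path as a human or a CI job, so there is no separate exception path to audit.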

What we include in handover

  • Namespace, NetworkPolicy, and identity model for humans, CI jobs, and agents.
  • Environment lifecycle rules: create, share, snapshot, expire, tear down.
  • Metrics and logs tied to each workspace, PR, agent run, or user action.
  • Cost controls for idle environments, repeated builds, shared baselines, and cleanup.
  • Runbooks for failed provisioning, leaked resources, policy violations, and stuck teardown.
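
As an illustration of what the NetworkPolicy part of a handover might start from, here is a minimal sketch that builds a default-deny policy for one sandbox namespace. The function name, the policy name `default-deny-all`, and the example namespace are assumptions made for this example; the manifest fields themselves follow the standard `networking.k8s.io/v1` NetworkPolicy schema.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no allow rules denies ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace name; JSON output can be applied with kubectl.
    print(json.dumps(default_deny_policy("sandbox-pr-123"), indent=2))
```

A policy like this also blocks DNS lookups, so a real sandbox would usually layer explicit allow rules on top of it. It only takes effect on clusters where network policy enforcement is enabled.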

When we avoid it

If the product is mostly document workflow, back-office automation, or a narrow internal AI tool, Kubernetes can be more machinery than the problem needs. We use Google Cloud when governed runtime is central, not when we just need somewhere to host an API.

Related work

Microstax is the public example: isolated Kubernetes namespaces, shared reference environments, PR-based environments, logs, metrics, and agent-ready runtime primitives.