Pod Templates#

Pod templates define how workflow tasks execute as Kubernetes pods. After configuring pools and resource validation, create pod templates to specify scheduling constraints, security policies, and resource allocations that apply across your pools.

Why Use Pod Templates?#

Pod templates provide standardized configurations that simplify cluster management:

Target Specific Hardware

Use node selectors and tolerations to route workflows to the right GPU types, CPU architectures, or instance types.

Enforce Security Policies

Apply consistent security contexts, capabilities, and access controls across all workflow tasks.

Optimize Resource Allocation

Set appropriate resource requests and limits with conditional logic based on workflow requirements.

Simplify User Experience

Users select pools without needing to understand complex Kubernetes scheduling—templates handle all the details.

How It Works#

Template Application Flow#

1. Define Templates 📋

Create reusable specs

2. Reference in Pools 🔗

Attach to pools

3. Merge Templates 🔄

Combine specifications

4. Create K8s Pods

Build Kubernetes pods

Template Structure#

Pod templates use the standard Kubernetes PodSpec format with OSMO enhancements:

template_name:
  spec:
    nodeSelector:
      node-label: value
    tolerations:
      - key: taint-key
        effect: NoSchedule
    containers:
      - name: '{{USER_CONTAINER_NAME}}'
        resources:
          limits:
            cpu: '{{USER_CPU}}'
            memory: '{{USER_MEMORY}}'

Key Features#

  • Variable Substitution: Variables such as {{USER_CPU}} and {{WF_ID}} are resolved at runtime

  • Template Merging: Combine multiple templates; later ones override earlier ones

  • Conditional Logic: Use Jinja2 expressions for dynamic values (for example, to cap CPU at 2 when a user requests more, use {% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %})

Warning

Merge Behavior

  • Fields set in your templates override the corresponding base values

  • Lists are merged by name field (same name = recursive merge, different name = append)

  • Templates are applied in order (later overrides earlier)
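The merge rules above can be sketched in Python. This is a simplified illustration of the described semantics, not OSMO's actual implementation:

```python
# Hypothetical helper illustrating the merge rules: dict fields merge
# recursively, scalar values from later templates win, and lists merge
# by the "name" field (same name = recursive merge, otherwise append).

def merge(base, overlay):
    """Recursively merge `overlay` on top of `base`."""
    if isinstance(base, dict) and isinstance(overlay, dict):
        merged = dict(base)
        for key, value in overlay.items():
            merged[key] = merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(overlay, list):
        merged = list(base)
        by_name = {item.get("name"): i for i, item in enumerate(merged)
                   if isinstance(item, dict) and "name" in item}
        for item in overlay:
            name = item.get("name") if isinstance(item, dict) else None
            if name is not None and name in by_name:
                # Same name: recursive merge into the existing entry
                merged[by_name[name]] = merge(merged[by_name[name]], item)
            else:
                # Different name: append as a new entry
                merged.append(item)
        return merged
    return overlay  # scalars: the later template overrides the earlier one


base = {"containers": [
    {"name": "osmo-ctrl", "resources": {"limits": {"cpu": "1"}}}]}
overlay = {"containers": [
    {"name": "osmo-ctrl", "resources": {"limits": {"memory": "1Gi"}}},
    {"name": "user"}]}

result = merge(base, overlay)
print(result["containers"][0]["resources"]["limits"])
# -> {'cpu': '1', 'memory': '1Gi'}
```

Note how the two `osmo-ctrl` entries collapse into one container with both limits, while the differently named `user` entry is appended.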

Note

For detailed configuration fields and all available variables, see /api/configs/pod_template in the API reference.

Base Pod Specification Details

OSMO creates a base pod spec with three containers (osmo-init, osmo-ctrl, user container). Your templates are merged on top of it.

apiVersion: v1
kind: Pod
metadata:
  labels:
    osmo.workflow_id: <workflow name>
    osmo.submitted_by: <user name>
spec:
  containers:
    - name: '{{USER_CONTAINER_NAME}}'  # Your code runs here
      command: ["/osmo/bin/osmo_exec"]
    - name: osmo-ctrl  # Manages data transfer
  initContainers:
    - name: osmo-init  # Sets up environment

Practical Guide#

📄 Edit in your Helm values file

Everything in this section goes in your Helm values file under services.configs. Apply changes with helm upgrade.

Standard Pod Templates#

Create templates that target specific hardware and handle Kubernetes scheduling constraints.

Step 1: Understanding Template Variables

Special Variables
Resource Variables:
  • {{USER_CPU}} - CPU count

  • {{USER_GPU}} - GPU count

  • {{USER_MEMORY}} - Memory (e.g., "8Gi")

  • {{USER_STORAGE}} - Storage (e.g., "100Gi")

  • {{USER_CONTAINER_NAME}} - Name of user container

Workflow Variables:
  • {{WF_ID}} - Workflow name/ID

  • {{WF_UUID}} - Unique workflow ID

  • {{WF_TASK_NAME}} - Task name

  • {{WF_SUBMITTED_BY}} - Username

  • {{WF_POOL}} - Pool name

  • {{WF_PLATFORM}} - Platform name

Conditional Logic:
  • Use Jinja2: {% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}
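You can sanity-check a template expression locally with the jinja2 package before deploying. The variable values below are illustrative:

```python
# Render the CPU-capping expression from above with sample inputs
from jinja2 import Template

expr = "{% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}"

print(Template(expr).render(USER_CPU=8))  # -> 2 (capped)
print(Template(expr).render(USER_CPU=1))  # -> 1 (unchanged)
```

This confirms the expression caps large requests at 2 while passing smaller values through untouched.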

Step 2: Define Pod Templates in Helm Values

Add base templates for architecture, control container, and user container under services.configs.podTemplates:

services:
  configs:
    enabled: true
    podTemplates:
      # Target specific architecture
      default_amd64:
        spec:
          nodeSelector:
            kubernetes.io/arch: amd64
      # Control container
      default_ctrl:
        spec:
          containers:
            - name: osmo-ctrl
              resources:
                # Use user specified resources as limits
                limits:
                  cpu: '{{USER_CPU}}'
                  memory: '{{USER_MEMORY}}'
                  ephemeral-storage: '{{USER_STORAGE}}'
                # Cap ctrl container at 2 CPUs if user requests more
                requests:
                  cpu: '{% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}'
                  memory: 1Gi
                  ephemeral-storage: 4Gi
      # User container
      default_user:
        spec:
          containers:
            - name: '{{USER_CONTAINER_NAME}}'
              resources:
                limits:
                  cpu: '{{USER_CPU}}'
                  memory: '{{USER_MEMORY}}'
                  nvidia.com/gpu: '{{USER_GPU}}'
                  ephemeral-storage: '{{USER_STORAGE}}'
                requests:
                  cpu: '{{USER_CPU}}'
                  memory: '{{USER_MEMORY}}'
                  nvidia.com/gpu: '{{USER_GPU}}'
                  ephemeral-storage: '{{USER_STORAGE}}'

Step 3: Reference Templates in Pools

Add templates to your pool’s common_pod_template field:

services:
  configs:
    pools:
      my-pool:
        backend: default
        common_pod_template:
          - default_amd64
          - default_ctrl
          - default_user

Step 4: Apply

helm upgrade osmo deployments/charts/service -f my-values.yaml

Additional Examples#

GPU-Specific Templates - Target Specific GPU Types

Create templates for different GPU hardware (e.g., H100, L40):

services:
  configs:
    podTemplates:
      training_h100:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-H100
          tolerations:
            - key: training-dedicated
              value: h100
              effect: NoSchedule
      simulation_l40:
        spec:
          nodeSelector:
            nvidia.com/gpu.product: NVIDIA-L40
          tolerations:
            - key: simulation-dedicated
              value: l40
              effect: NoSchedule

CPU Instance Types - Target Specific Instance Classes

Target CPU-optimized instances:

services:
  configs:
    podTemplates:
      cpu_compute:
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: c5.4xlarge

Security Templates - Apply Security Contexts

Enforce security policies:

services:
  configs:
    podTemplates:
      secure_workload:
        spec:
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            fsGroup: 1000
          containers:
            - name: '{{USER_CONTAINER_NAME}}'
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                capabilities:
                  drop: [ALL]

Node Exclusion - Exclude Specific Nodes

Use node affinity to exclude specific nodes from user requests. This can help avoid GPU fragmentation within the cluster by packing user requests onto the same node before the scheduler falls back to other nodes.

services:
  configs:
    podTemplates:
      node_exclusion:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: NotIn
                        values: '{{USER_EXCLUDED_NODES}}'

Shared Memory - Add /dev/shm Volume

Add shared memory for workflows that require IPC (e.g., TensorRT, PyTorch).

services:
  configs:
    podTemplates:
      shared_memory:
        spec:
          containers:
            - name: '{{USER_CONTAINER_NAME}}'
              volumeMounts:
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 1Gi

Troubleshooting#

Template Not Found
  • Verify template name matches exactly in pool configuration

  • Check template exists: osmo config get POD_TEMPLATE <template_name>

Variable Substitution Errors
  • Ensure all variables used are valid OSMO variables

  • Check for typos in variable names (case-sensitive)

  • Review logs for specific variable resolution errors

Resource Constraints
  • Verify resource requests match available node capacity

  • Check nodeSelector labels exist on cluster nodes

  • Ensure tolerations match node taints

Debugging Tips
  • Start with simple templates and add complexity gradually

  • Validate YAML syntax before applying

  • Test with different workflow configurations

  • Review OSMO service logs for detailed errors
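To "validate YAML syntax before applying", one quick local check uses the PyYAML package. The snippet below mirrors the example keys in this guide:

```python
# Parse a values snippet locally before running `helm upgrade`;
# yaml.safe_load raises an error if the syntax is invalid.
import yaml

snippet = """
services:
  configs:
    enabled: true
    podTemplates:
      default_amd64:
        spec:
          nodeSelector:
            kubernetes.io/arch: amd64
"""

parsed = yaml.safe_load(snippet)
print(parsed["services"]["configs"]["enabled"])  # -> True
```

A parse error here is much cheaper to fix than a failed Helm release.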

Tip

Best Practices

  • Use descriptive template names (e.g., gpu_h100_training, cpu_inference)

  • Create modular templates for reusability across different pools (e.g., architecture, security, resources)

  • Use conditional logic to optimize resource requests

  • Add labels and annotations for monitoring

  • Test templates thoroughly before production use

Warning

  • Do not override image, command, or args fields in containers — OSMO manages these internally.

  • Template changes apply only to new workflows, NOT to already-running workflow tasks