Deploying ML Models on Kubernetes: High-Performance Serving with Triton Inference Server

Who This Article Is For

Teams that need to serve their own trained ML models in production
ML engineers setting up Triton Inference Server for the first time
DevOps engineers running GPU workloads on Kubernetes

Introduction

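Serving an ML model with FastAPI is convenient, but with no request batching each request runs on the GPU by itself, so GPU utilization stays low. Triton Inference Server is built to keep the GPU busy and supports dynamic batching and ensemble pipelines.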

This article was written with reference to the ML serving infrastructure at bluefoxdev.kr.

1. Triton Serving Architecture

[Triton Inference Server Layout]

Model repository structure:
  model_repository/
  ├── text_classifier/
  │   ├── config.pbtxt     # model configuration
  │   └── 1/               # version 1
  │       └── model.onnx
  ├── embedding_model/
  │   ├── config.pbtxt
  │   └── 1/
  │       └── model.pt
  └── ensemble_pipeline/
      ├── config.pbtxt     # ensemble definition
      └── 1/               # empty directory
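
The layout above can be scaffolded with a few lines of Python. This is only a minimal sketch: `scaffold_model_repository` is a hypothetical helper, the files it creates are empty placeholders, and the real `config.pbtxt` contents and model weights still have to be supplied.

```python
import tempfile
from pathlib import Path
from typing import Optional


def scaffold_model_repository(root: Path, models: dict[str, Optional[str]]) -> None:
    """Create <root>/<model>/config.pbtxt and <root>/<model>/1/ for each model.

    models maps a model name to its model file name, or to None for an
    ensemble, whose version directory stays empty.
    """
    for name, model_file in models.items():
        version_dir = root / name / "1"
        version_dir.mkdir(parents=True, exist_ok=True)
        (root / name / "config.pbtxt").touch()  # placeholder, to be filled in
        if model_file is not None:
            (version_dir / model_file).touch()  # placeholder for the real weights


# Build the example repository in a temporary directory and list it.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    scaffold_model_repository(repo, {
        "text_classifier": "model.onnx",
        "embedding_model": "model.pt",
        "ensemble_pipeline": None,
    })
    for path in sorted(repo.rglob("*")):
        print(path.relative_to(repo))
```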

[Dynamic Batching]
  Individual requests → queued → batch formed → executed on GPU
  Max queue delay: 5-20 ms (within the latency budget)
  Batch size: 8-64 (depending on GPU memory)
  Effect: GPU utilization 20% → 80%+
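
To make the queue-then-batch flow concrete, here is a simplified offline model of it. It is not Triton's actual scheduler: `Request` and `form_batches` are hypothetical names, and the rule (dispatch when the batch is full, or when a new arrival finds the oldest queued request past its delay budget) is an assumption chosen to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    arrival_ms: float  # arrival time in milliseconds


def form_batches(
    requests: list[Request],
    max_batch_size: int = 8,
    max_queue_delay_ms: float = 10.0,
) -> list[list[int]]:
    """Group request ids into batches under a size cap and a delay budget."""
    batches: list[list[int]] = []
    current: list[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Flush if the oldest queued request has exceeded its delay budget.
        if current and req.arrival_ms - current[0].arrival_ms > max_queue_delay_ms:
            batches.append([r.request_id for r in current])
            current = []
        current.append(req)
        # Flush as soon as the batch is full.
        if len(current) == max_batch_size:
            batches.append([r.request_id for r in current])
            current = []
    if current:
        batches.append([r.request_id for r in current])
    return batches


# Six requests: four arrive within 10 ms of each other, two arrive much later.
demo = [Request(i, t) for i, t in enumerate([0, 2, 4, 6, 30, 31])]
print(form_batches(demo, max_batch_size=8, max_queue_delay_ms=10.0))
# prints [[0, 1, 2, 3], [4, 5]]
```

A longer delay budget lets more requests share one GPU pass, at the cost of extra waiting for the first request in each batch.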

[Kubernetes GPU Resources]
  resources:
    limits:
      nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists

[Model A/B Deployment]
  Traffic split: canary 10% → 50% → 100%
  Metrics to monitor: error rate, latency, accuracy
  Automatic rollback: triggered when an anomaly is detected
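
The promote-or-roll-back decision can be sketched as a small function. The 10/50/100 steps come from the plan above; everything else (`CanaryMetrics`, `next_traffic_pct`, and the threshold values) is a hypothetical illustration and would need tuning for a real service.

```python
from dataclasses import dataclass

CANARY_STEPS = [10, 50, 100]  # traffic percentages from the rollout plan


@dataclass
class CanaryMetrics:
    error_rate: float      # fraction of failed requests, 0.0-1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds
    accuracy: float        # fraction of correct predictions, 0.0-1.0


def next_traffic_pct(
    current_pct: int,
    canary: CanaryMetrics,
    baseline: CanaryMetrics,
    max_error_increase: float = 0.01,  # assumed threshold
    max_latency_ratio: float = 1.2,    # assumed threshold
    max_accuracy_drop: float = 0.02,   # assumed threshold
) -> int:
    """Return the next canary traffic percentage, or 0 to roll back."""
    unhealthy = (
        canary.error_rate > baseline.error_rate + max_error_increase
        or canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
        or canary.accuracy < baseline.accuracy - max_accuracy_drop
    )
    if unhealthy:
        return 0  # roll back: send all traffic to the old model
    for step in CANARY_STEPS:
        if step > current_pct:
            return step
    return 100  # already fully promoted


baseline = CanaryMetrics(error_rate=0.01, p99_latency_ms=100.0, accuracy=0.90)
healthy = CanaryMetrics(error_rate=0.012, p99_latency_ms=110.0, accuracy=0.90)
print(next_traffic_pct(10, healthy, baseline))  # prints 50
```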

2. Triton Deployment Configuration

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        args:
        - tritonserver
        - --model-repository=s3://my-model-bucket/models
        - --http-port=8000
        - --grpc-port=8001
        - --metrics-port=8002
        - --log-verbose=0
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "16Gi"
          requests:
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
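
The readiness probe above polls `/v2/health/ready` on the HTTP port. The same check can be run from a script, for example in a CI step before smoke tests. The sketch below uses only the standard library; `triton_is_ready` and `wait_until_ready` are hypothetical helper names, and the base URL in the usage comment assumes the pod has been port-forwarded to localhost.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def triton_is_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as not ready.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    max_attempts: int = 30,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        if check():
            return True
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False


# Usage (assumes `kubectl port-forward` exposes port 8000 locally):
# ready = wait_until_ready(lambda: triton_is_ready("http://localhost:8000"))
```

Passing the check as a callable keeps the polling loop independent of the network, so it can be exercised without a running server.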

# Utility for generating config.pbtxt
def generate_triton_config(
    model_name: str,
    backend: str,  # "onnxruntime", "pytorch", "tensorrt"
    input_shapes: list[dict],
    output_shapes: list[dict],
    max_batch_size: int = 32,
    dynamic_batching: bool = True,
) -> str:
    """Render a config.pbtxt for one model.

    Each entry of input_shapes / output_shapes is a dict with the keys
    'name', 'dtype' (e.g. "TYPE_FP32"), and 'dims' (a list of ints).
    """

    def render_tensors(tensors: list[dict]) -> str:
        # Entries of a bracketed list must be comma-separated in protobuf text format.
        return ",\n".join(f"""  {{
    name: "{t['name']}"
    data_type: {t['dtype']}
    dims: {t['dims']}
  }}""" for t in tensors)

    inputs = render_tensors(input_shapes)
    outputs = render_tensors(output_shapes)

    # preferred_batch_size values must not exceed max_batch_size;
    # a queue delay of 10000 microseconds is 10 ms.
    dynamic_batch_config = """
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 10000
}""" if dynamic_batching else ""

    return f"""name: "{model_name}"
backend: "{backend}"
max_batch_size: {max_batch_size}

input [
{inputs}
]
output [
{outputs}
]
{dynamic_batch_config}

instance_group [
  {{
    kind: KIND_GPU
    count: 1
  }}
]"""

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def triton_inference(
    server_url: str,
    model_name: str,
    inputs: dict[str, np.ndarray],
    output_names: list[str],
) -> dict[str, np.ndarray]:
    """Triton gRPC inference client."""

    client = grpcclient.InferenceServerClient(url=server_url)

    triton_inputs = []
    for name, array in inputs.items():
        # Derive the Triton dtype from the array instead of assuming FP32.
        inp = grpcclient.InferInput(
            name, list(array.shape), np_to_triton_dtype(array.dtype)
        )
        inp.set_data_from_numpy(array)
        triton_inputs.append(inp)

    triton_outputs = [grpcclient.InferRequestedOutput(name) for name in output_names]

    response = client.infer(
        model_name=model_name,
        inputs=triton_inputs,
        outputs=triton_outputs,
    )

    return {name: response.as_numpy(name) for name in output_names}

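import random
from typing import Callable

def canary_deploy_model(
    old_model: str,
    new_model: str,
    traffic_pct: int = 10,
) -> Callable[[dict], dict]:
    """Canary deployment: split traffic between the old and new model."""
    if not 0 <= traffic_pct <= 100:
        raise ValueError("traffic_pct must be between 0 and 100")

    def route_request(request_data: dict) -> dict:
        # Send roughly traffic_pct percent of requests to the new model.
        if random.randint(1, 100) <= traffic_pct:
            return triton_inference("triton:8001", new_model, request_data, ["output"])
        else:
            return triton_inference("triton:8001", old_model, request_data, ["output"])

    return route_request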

Conclusion

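Compared with plain FastAPI serving, Triton can raise GPU utilization by 4x or more (roughly 20% to 80%+ in the figures above). Enabling dynamic batching trades 5-20 ms of added latency for about 3-5x higher throughput. Combined with the Kubernetes HPA (Horizontal Pod Autoscaler), the deployment can also scale automatically with traffic.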
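
For intuition on how the HPA picks a replica count, its documented rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that arithmetic; `desired_replicas` is a hypothetical function, and the metric values are made-up numbers. Out of the box the HPA scales on CPU and memory, so scaling on GPU utilization or a Triton metric from port 8002 would require a custom metrics pipeline, which is assumed here and not shown.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two replicas averaging 90% GPU utilization against a 60% target.
print(desired_replicas(2, current_metric=90.0, target_metric=60.0))  # prints 3
```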