LLM Testing and Evaluation Harness: Verifying AI Output Quality with Code

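Who this post is for

Teams that manually check results every time they change a prompt
Teams that need automated checks that LLM output has the expected format and content
Developers who want to integrate AI features into CI/CD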

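Introduction

You tweak a prompt slightly, and a different feature breaks. That is the reality of LLM app development. Prompts need regression tests just as code needs unit tests. With LLM-as-Judge, "a good response" becomes something you can score numerically against a rubric.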

This post was written with reference to LLM Quality Management at bluefoxdev.kr.

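1. LLM evaluation strategies

[Comparison of evaluation methods]

Rule-based evaluation:
  Check the output format (e.g. valid JSON) with a regex or parser
  Check that required keywords are present
  Check that response length is within range
  Pros: fast, deterministic
  Cons: inflexible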

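LLM-as-Judge:
  Evaluate the output with a stronger model
  Rubric-based score (1-10)
  Pairwise comparison (A vs B)
  Pros: understands nuance, can evaluate open-ended output
  Cons: cost, non-deterministic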

Human evaluation:
  Expert annotators
  Crowdsourcing
  Pros: most accurate
  Cons: slow, expensive

[Evaluation pipeline]
  Golden dataset → run prompt → automated evaluation → human review → CI pass/fail
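
The pairwise (A vs B) comparison listed under LLM-as-Judge can be sketched as below. This is a minimal illustration, not a definitive implementation. The `Judge` callable, `build_prompt`, and both stub judges are hypothetical names introduced here; a real judge would call an LLM. The sketch asks the judge twice with the presentation order swapped and only declares a winner when both verdicts agree, which is one common way to reduce the effect of a judge favoring whichever response appears first.

```python
from typing import Callable

# Hypothetical judge signature: takes a prompt string and returns "1" or "2"
# for whichever numbered response it prefers. A real judge would call an LLM.
Judge = Callable[[str], str]

def build_prompt(question: str, first: str, second: str) -> str:
    """Build a comparison prompt with the two responses in a fixed order."""
    return (
        f"Question: {question}\n"
        f"Response 1: {first}\n"
        f"Response 2: {second}\n"
        "Which response is better? Answer with 1 or 2."
    )

def pairwise_compare(question: str, a: str, b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    # Round 1: A is shown first, so a verdict of "1" means A wins.
    # Any verdict other than "1" is treated as "2" in this sketch.
    first_round = "A" if judge(build_prompt(question, a, b)).strip() == "1" else "B"
    # Round 2: B is shown first, so a verdict of "1" means B wins.
    second_round = "B" if judge(build_prompt(question, b, a)).strip() == "1" else "A"
    # Only a verdict that survives the order swap counts as a win.
    return first_round if first_round == second_round else "tie"

def longer_wins(prompt: str) -> str:
    """Stub judge for demonstration: prefers the longer single-line response."""
    lines = prompt.splitlines()
    r1 = lines[1].removeprefix("Response 1: ")
    r2 = lines[2].removeprefix("Response 2: ")
    return "1" if len(r1) >= len(r2) else "2"

def always_first(prompt: str) -> str:
    """Stub judge with pure position bias: always picks Response 1."""
    return "1"
```

With the `always_first` stub, the two rounds disagree and the result is "tie", so a purely position-biased judge cannot produce a winner on its own.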

2. Implementing the evaluation harness

import anthropic
import json
from dataclasses import dataclass, field
from typing import Callable

# Async client, so the judge call can be awaited without blocking the event loop
client = anthropic.AsyncAnthropic()

@dataclass
class TestCase:
    id: str
    input: str
    expected_contains: list[str] = field(default_factory=list)
    expected_format: str | None = None  # 'json', 'markdown', 'plain'
    min_quality_score: float = 7.0

@dataclass
class EvalResult:
    test_id: str
    output: str
    rule_checks: dict[str, bool]
    llm_score: float
    llm_feedback: str
    passed: bool

async def run_llm_judge(
    question: str,
    answer: str,
    rubric: str,
) -> tuple[float, str]:
    """Score response quality with LLM-as-Judge."""

    response = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=400,
        messages=[{
            "role": "user",
            "content": f"""Evaluate the following AI response.

Question: {question}
Response: {answer}

Evaluation criteria:
{rubric}

Reply with JSON only:
{{
  "score": 1.0-10.0,
  "feedback": "specific feedback",
  "strengths": ["strength 1"],
  "weaknesses": ["weakness 1"]
}}"""
        }]
    )

    # Assumes the judge returns bare JSON; any surrounding text raises JSONDecodeError
    result = json.loads(response.content[0].text)
    return float(result["score"]), result["feedback"]

def check_rules(output: str, test_case: TestCase) -> dict[str, bool]:
    """Run rule-based checks and return one boolean per check."""
    checks = {}
    
    for keyword in test_case.expected_contains:
        checks[f"contains_{keyword}"] = keyword.lower() in output.lower()
    
    if test_case.expected_format == "json":
        try:
            json.loads(output)
            checks["valid_json"] = True
        except json.JSONDecodeError:
            checks["valid_json"] = False
    
    checks["not_empty"] = len(output.strip()) > 0
    checks["reasonable_length"] = 10 <= len(output) <= 10000
    
    return checks

class EvalHarness:
    """LLM evaluation harness."""
    
    def __init__(self, prompt_fn: Callable, rubric: str):
        self.prompt_fn = prompt_fn
        self.rubric = rubric
        self.results: list[EvalResult] = []
    
    async def run_test(self, test_case: TestCase) -> EvalResult:
        output = await self.prompt_fn(test_case.input)
        
        rule_checks = check_rules(output, test_case)
        
        llm_score, feedback = await run_llm_judge(
            test_case.input, output, self.rubric
        )
        
        all_rules_pass = all(rule_checks.values())
        score_ok = llm_score >= test_case.min_quality_score
        
        result = EvalResult(
            test_id=test_case.id,
            output=output,
            rule_checks=rule_checks,
            llm_score=llm_score,
            llm_feedback=feedback,
            passed=all_rules_pass and score_ok,
        )
        self.results.append(result)
        return result
    
    async def run_suite(self, test_cases: list[TestCase]) -> dict:
        import asyncio
        if not test_cases:
            # Avoid a ZeroDivisionError in the averages below
            raise ValueError("test_cases must not be empty")
        # Run all cases concurrently; gather returns results in input order
        results = await asyncio.gather(*[self.run_test(tc) for tc in test_cases])

        passed = sum(1 for r in results if r.passed)
        avg_score = sum(r.llm_score for r in results) / len(results)

        return {
            "total": len(results),
            "passed": passed,
            "failed": len(results) - passed,
            "pass_rate": passed / len(results),
            "avg_llm_score": avg_score,
            "failed_tests": [r.test_id for r in results if not r.passed],
        }
    
    def compare_prompts(
        self,
        prompt_a_results: list[EvalResult],
        prompt_b_results: list[EvalResult],
    ) -> dict:
        """Compare two prompts (A/B) by mean judge score."""
        avg_a = sum(r.llm_score for r in prompt_a_results) / len(prompt_a_results)
        avg_b = sum(r.llm_score for r in prompt_b_results) / len(prompt_b_results)

        # Report equal averages as a tie instead of defaulting to B
        if avg_a == avg_b:
            winner = "tie"
        else:
            winner = "A" if avg_a > avg_b else "B"

        return {
            "prompt_a_score": avg_a,
            "prompt_b_score": avg_b,
            "winner": winner,
            "improvement": abs(avg_a - avg_b),
        }
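
The judge function above passes the model's reply straight to `json.loads`, which fails if the model wraps the JSON in a code fence or adds a sentence before it. A more tolerant parser might look like the sketch below. `parse_judge_json` is a hypothetical helper, not part of the harness above or of any library, and clamping the score to the 1-10 range is an assumption about how out-of-range scores should be handled.

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Parse a judge reply that should contain exactly one JSON object."""
    try:
        # Fast path: the reply is already bare JSON.
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take the span from the first "{" to the last "}".
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in judge reply")
        parsed = json.loads(match.group(0))
    if not isinstance(parsed, dict) or "score" not in parsed:
        raise ValueError("judge reply has no 'score' field")
    # Assumption: clamp out-of-range scores to the 1-10 scale the prompt asks for.
    parsed["score"] = min(10.0, max(1.0, float(parsed["score"])))
    return parsed
```

Raising `ValueError` on an unusable reply lets the caller decide whether to retry the judge call or mark the test case as failed.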

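Conclusion

LLM testing starts with a golden dataset of 10-20 cases. Write a rubric that defines "what makes a good response" for each case, and LLM-as-Judge can score it automatically. Once integrated into CI, regression tests run on every prompt change and catch quality degradation before it ships.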
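
One way to turn a suite summary into a CI pass/fail signal is a small gate function that maps the summary to a process exit code. The sketch below assumes a summary dictionary with the keys `pass_rate`, `avg_llm_score`, and `failed_tests`, as returned by `run_suite` above. The function name `ci_exit_code` and both default thresholds are hypothetical and should be tuned per project.

```python
import sys

def ci_exit_code(
    summary: dict,
    min_pass_rate: float = 0.9,   # assumed threshold, not from the original post
    min_avg_score: float = 7.0,   # assumed threshold, mirrors min_quality_score
) -> int:
    """Return 0 if the suite summary meets both thresholds, else 1."""
    failures = []
    if summary["pass_rate"] < min_pass_rate:
        failures.append(
            f"pass_rate {summary['pass_rate']:.2f} < {min_pass_rate:.2f}"
        )
    if summary["avg_llm_score"] < min_avg_score:
        failures.append(
            f"avg_llm_score {summary['avg_llm_score']:.2f} < {min_avg_score:.2f}"
        )
    # Report reasons on stderr so they show up in the CI job log.
    for reason in failures:
        print(f"EVAL GATE FAILED: {reason}", file=sys.stderr)
    if failures and summary.get("failed_tests"):
        print("failed tests: " + ", ".join(summary["failed_tests"]), file=sys.stderr)
    return 1 if failures else 0
```

In a CI job, the entry-point script would finish with `sys.exit(ci_exit_code(summary))`, so that a non-zero exit code fails the build.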