Measuring AI Code Generation Quality: Evaluating the Accuracy, Security, and Maintainability of LLM Code

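Who This Article Is For

Engineering teams that want to measure the impact of adopting AI code generation tools
Teams that need to verify the quality and security of LLM-generated code
Teams that need a standard methodology for evaluating AI coding assistants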

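Introduction

How do you know whether an LLM generates good code? "It seems to work" is not an answer. Measure it by test pass rate, security vulnerability rate, and the share of generated code that actually ships to production.

This article was written with reference to the AI code generation guide on bluefoxdev.kr.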

1. AI Code Quality Metrics

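[AI Code Generation Quality Metrics]

Functional correctness:
  Pass@1: share of problems whose first generated sample passes the tests
  Pass@k: share of problems where at least one of k samples passes
  HumanEval score: standard benchmark (0-100%)
  
  Approximate figures (they vary by model version and evaluation setup):
  GPT-4: ~87% Pass@1
  Claude Sonnet: ~85% Pass@1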
  
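Security:
  CWE vulnerability rate (SQL injection, XSS, etc.)
  Bandit/Semgrep scan results
  Rate of hardcoded secrets

Code quality:
  Cyclomatic complexity
  Code duplication rate
  Number of lint warnings
  Documentation coverage

Practical metrics:
  PR approval rate (AI code vs. human code)
  Bug rate (bugs traced to AI-written code)
  Number of review cycles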
  
[Measurement Tools]
  Functionality: pytest, jest (automated tests)
  Security: Bandit (Python), Semgrep (multi-language)
  Quality: SonarQube, pylint, eslint
  Benchmarks: HumanEval, MBPP, SWE-bench
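
Pass@k is usually not measured by drawing exactly k samples. A common approach, introduced alongside HumanEval, is to draw n >= k samples per problem, count the c samples that pass, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below shows that estimator; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the original article.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two problems, 10 samples each: 3 passing samples and 0 passing samples.
    print(mean_pass_at_k([(10, 3), (10, 0)], k=1))
```

With k = 1 the estimator reduces to c / n, so Pass@1 is simply the fraction of samples that pass.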

2. Automated Code Quality Evaluation Pipeline

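import json
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CodeQualityReport:
    pass_rate: float
    security_issues: list[dict]
    complexity_avg: float
    lint_errors: int
    overall_score: float

def _load_json(text: str, default):
    """Parse tool output as JSON; fall back to a default on empty or invalid output."""
    try:
        return json.loads(text) if text.strip() else default
    except json.JSONDecodeError:
        return default

async def evaluate_generated_code(
    code: str,
    tests: list[str],
    language: str = "python",
) -> CodeQualityReport:
    """Automatically evaluate the quality of LLM-generated code."""
    
    # 1. Save to files in a fresh temp directory so parallel runs do not collide.
    #    pytest only collects .py files; the tests can import the code as `eval_code`.
    #    Note: every tool used below is Python-specific.
    ext = {"python": "py"}.get(language, language)
    work_dir = Path(tempfile.mkdtemp(prefix="ai_eval_"))
    code_file = str(work_dir / f"eval_code.{ext}")
    test_file = str(work_dir / f"test_eval_code.{ext}")
    report_file = work_dir / "report.json"
    
    with open(code_file, "w") as f:
        f.write(code)
    with open(test_file, "w") as f:
        f.write("\n".join(tests))
    
    # 2. Run the tests (functional evaluation).
    #    Requires the pytest-json-report plugin, which writes its report to a file,
    #    not to stdout. pytest exits non-zero when tests fail, so the report is read
    #    regardless of the return code.
    try:
        subprocess.run(
            [sys.executable, "-m", "pytest", test_file, "--tb=short", "-q",
             "--json-report", f"--json-report-file={report_file}"],
            capture_output=True, text=True, timeout=30, cwd=work_dir,
        )
    except subprocess.TimeoutExpired:
        pass  # no report is written; counted as zero passing tests below
    
    test_data = _load_json(report_file.read_text(), {}) if report_file.exists() else {}
    summary = test_data.get("summary", {})
    passed = summary.get("passed", 0)
    total = summary.get("total", 0)
    pass_rate = passed / total if total else 0.0
    
    # 3. Security scan
    security_result = subprocess.run(
        ["bandit", "-f", "json", code_file],
        capture_output=True, text=True
    )
    security_data = _load_json(security_result.stdout, {})
    security_issues = security_data.get("results", [])
    
    # 4. Complexity analysis
    complexity_result = subprocess.run(
        ["radon", "cc", code_file, "-j"],
        capture_output=True, text=True
    )
    complexity_data = _load_json(complexity_result.stdout, {})
    complexities = [
        block["complexity"]
        for blocks in complexity_data.values()
        if isinstance(blocks, list)  # skip entries that are not block lists (e.g. unparsable files)
        for block in blocks
    ]
    complexity_avg = sum(complexities) / len(complexities) if complexities else 0
    
    # 5. Lint check
    lint_result = subprocess.run(
        ["pylint", code_file, "--output-format=json"],
        capture_output=True, text=True
    )
    lint_data = _load_json(lint_result.stdout, [])
    lint_errors = len([m for m in lint_data if m.get("type") == "error"])
    
    # 6. Compute the overall score.
    #    Each component is clamped at 0 so that many findings cannot push it negative.
    high_count = len([i for i in security_issues if i.get("issue_severity") == "HIGH"])
    security_penalty = high_count * 20
    complexity_penalty = max(0, complexity_avg - 10) * 5
    security_score = max(0.0, 1 - len(security_issues) / 10) * 25  # security: 25 points
    lint_score = max(0.0, 1 - lint_errors / 20) * 15               # quality: 15 points
    
    overall_score = max(0, min(100,
        pass_rate * 60 +              # functionality: 60 points
        security_score +
        lint_score -
        security_penalty - complexity_penalty
    ))
    
    return CodeQualityReport(
        pass_rate=pass_rate,
        security_issues=security_issues,
        complexity_avg=complexity_avg,
        lint_errors=lint_errors,
        overall_score=overall_score,
    )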

async def run_ai_code_benchmark(model: str, problems: list[dict]) -> dict:
    """Run an LLM code generation benchmark."""
    import anthropic
    
    if not problems:
        raise ValueError("problems must not be empty")
    
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    results = []
    
    for problem in problems:
        response = client.messages.create(
            model=model,
            max_tokens=2000,
            messages=[{
                "role": "user",
                "content": f"Implement the following problem in Python. Output only the code.\n\n{problem['description']}"
            }]
        )
        
        # extract_code is a helper that is not defined in this article; it is assumed
        # to strip any Markdown fence from the reply and return plain source code.
        generated_code = extract_code(response.content[0].text)
        report = await evaluate_generated_code(generated_code, problem["tests"])
        results.append(report)
    
    return {
        "model": model,
        "avg_pass_rate": sum(r.pass_rate for r in results) / len(results),
        "avg_security_issues": sum(len(r.security_issues) for r in results) / len(results),
        "avg_score": sum(r.overall_score for r in results) / len(results),
    }
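
The benchmark function above calls `extract_code`, which the article never defines. One plausible reading is that it pulls the source out of a Markdown-fenced reply, since models often wrap code in fences even when asked for code only. The following is a minimal sketch under that assumption, not the author's implementation; the regular expression and the fallback behavior are illustrative choices.

```python
import re

# The fence marker is built at runtime so it never appears literally in this example.
FENCE = "`" * 3
_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if there is none."""
    match = _BLOCK.search(reply)
    body = match.group(1) if match else reply
    # Normalize surrounding whitespace and end with exactly one newline.
    return body.strip() + "\n"


if __name__ == "__main__":
    reply = "Here you go:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
    print(extract_code(reply))
```

Only the first fenced block is used. A reply that splits its answer across several blocks would need a different policy, such as concatenating them.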

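Conclusion

The single most important metric when measuring AI code generation quality is the automated test pass rate. Integrating Bandit and Semgrep into the CI pipeline means AI-written code is scanned for security vulnerabilities automatically at the PR stage. LLM-generated code should always go through human review, but automated checks can filter out the obvious defects first.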
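
As one way to wire the security scan into CI, a small gate script can read a Bandit JSON report and fail the job when findings exceed a threshold. This is a sketch, assuming the report was produced by a prior step such as `bandit -r src -f json -o bandit.json`. The function name `gate` and the thresholds (no HIGH findings, at most 5 MEDIUM) are hypothetical and should be tuned per team.

```python
import json
import sys


def gate(report: dict, max_high: int = 0, max_medium: int = 5) -> list[str]:
    """Return human-readable failures for a Bandit JSON report; empty means pass."""
    counts = {"HIGH": 0, "MEDIUM": 0, "LOW": 0}
    for issue in report.get("results", []):
        severity = str(issue.get("issue_severity", "")).upper()
        if severity in counts:
            counts[severity] += 1

    failures = []
    if counts["HIGH"] > max_high:
        failures.append(f"{counts['HIGH']} HIGH severity findings (allowed: {max_high})")
    if counts["MEDIUM"] > max_medium:
        failures.append(f"{counts['MEDIUM']} MEDIUM severity findings (allowed: {max_medium})")
    return failures


def main(report_path: str) -> int:
    """Load the report, print any failures to stderr, and return a process exit code."""
    with open(report_path) as f:
        report = json.load(f)
    failures = gate(report)
    for message in failures:
        print(f"security gate failed: {message}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

A non-zero exit code is what makes the CI step fail, so the PR cannot merge until the findings are fixed or explicitly waived by a reviewer.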