LLM Semantic Caching: Answering Similar Questions at No Extra Cost

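Who This Article Is For

Teams looking to cut LLM API call costs
Developers who want to reuse one answer for similar questions
Teams that want to raise cache hit rates and improve response times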

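Introduction

"When will my delivery arrive?", "What is the expected delivery date?", "When does it get here?": the meaning is the same, but the exact strings differ. An exact-match cache treats them as three unrelated keys, so an answer cached for one is still a miss for the other two. A semantic cache uses embedding similarity to detect questions that mean the same thing.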

This article was written with reference to the LLM semantic caching guide on bluefoxdev.kr.
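
To make the exact-match limitation concrete, here is a minimal sketch that hashes three phrasings of the delivery question the way a typical exact-match cache would. The `exactKey` helper and the English sample strings are assumptions for illustration; the trim-and-lowercase normalization and MD5 hashing mirror the implementation in section 2.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: normalize (trim + lowercase), then hash to get the cache key.
function exactKey(question: string): string {
  return createHash('md5').update(question.trim().toLowerCase()).digest('hex');
}

const questions = [
  'When will my delivery arrive?',
  'What is the expected delivery date?',
  'When does it get here?',
];

// Three different strings produce three different keys, so none of them
// can reuse an answer that was cached for another.
const keys = new Set(questions.map(exactKey));
console.log(keys.size); // 3

// Only trivial variants (letter case, surrounding whitespace) collapse to one key.
console.log(exactKey('  WHEN WILL MY DELIVERY ARRIVE? ') === exactKey(questions[0])); // true
```

Normalization therefore only helps with formatting differences; paraphrases need the embedding-based tier.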

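1. Semantic Cache Architecture

[Two-tier cache structure]

L1: exact match (hash cache)
  Key: MD5(normalized question)
  Lookup: O(1), very fast
  Hit: the identical question

L2: semantic match (vector cache)
  Key: embedding vector
  Lookup: ANN (approximate nearest neighbor)
  Hit: a semantically similar question
  Threshold: cosine similarity > 0.92

LLM call (cache miss)
  Generate the response, then store it in both tiers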

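[Choosing the similarity threshold]
  > 0.95: very strict (only near-identical sentences)
  > 0.92: recommended (semantically equivalent sentences)
  > 0.85: loose (similar, but the meaning may differ)

  Varies by domain: find the best value with A/B tests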
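
Before running a live A/B test, the three candidate thresholds can be compared offline on labeled question pairs. The sketch below is one possible way to do that, not a prescribed method: `ScoredPair`, `evaluateThreshold`, and the sample scores are all hypothetical, and real similarities would come from your own embedding model and query logs.

```typescript
// A scored pair: cosine similarity between two questions, plus a human label
// saying whether one cached answer is actually valid for both.
interface ScoredPair {
  similarity: number;
  sameAnswer: boolean;
}

interface ThresholdReport {
  threshold: number;
  hitRate: number;      // share of pairs that would be served from the cache
  falseHitRate: number; // share of those hits that would return a wrong answer
}

function evaluateThreshold(pairs: ScoredPair[], threshold: number): ThresholdReport {
  const hits = pairs.filter(p => p.similarity > threshold);
  const falseHits = hits.filter(p => !p.sameAnswer);
  return {
    threshold,
    hitRate: pairs.length === 0 ? 0 : hits.length / pairs.length,
    falseHitRate: hits.length === 0 ? 0 : falseHits.length / hits.length,
  };
}

// Hand-made sample data for illustration only.
const samplePairs: ScoredPair[] = [
  { similarity: 0.97, sameAnswer: true },
  { similarity: 0.94, sameAnswer: true },
  { similarity: 0.93, sameAnswer: true },
  { similarity: 0.90, sameAnswer: false },
  { similarity: 0.88, sameAnswer: true },
  { similarity: 0.86, sameAnswer: false },
  { similarity: 0.80, sameAnswer: false },
];

const reports = [0.95, 0.92, 0.85].map(t => evaluateThreshold(samplePairs, t));
for (const report of reports) console.log(report);
```

On this toy sample, lowering the threshold from 0.92 to 0.85 doubles the hit rate but makes one in three hits wrong, which is the trade-off the threshold has to balance.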

[TTL strategy]
  FAQ: 7 days (stable answers)
  Product information: 1 day (changes often)
  Real-time data: do not cache (stock levels, weather)
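
The TTL list can be expressed as a small policy function. This is a sketch under the assumption that each question is already tagged with a content category; the `ContentCategory` names and the `ttlFor` helper are illustrative.

```typescript
// Content categories taken from the TTL list; the identifiers are illustrative.
type ContentCategory = 'faq' | 'product' | 'realtime';

const DAY_SECONDS = 24 * 60 * 60;

// Returns the TTL in seconds, or null when the answer must not be cached at all.
function ttlFor(category: ContentCategory): number | null {
  switch (category) {
    case 'faq':
      return 7 * DAY_SECONDS; // stable answers
    case 'product':
      return 1 * DAY_SECONDS; // changes often
    case 'realtime':
      return null;            // stock levels, weather: always call the LLM
  }
}

console.log(ttlFor('faq'));      // 604800
console.log(ttlFor('product'));  // 86400
console.log(ttlFor('realtime')); // null
```

A caller would skip the cache write entirely when the result is `null`, and otherwise pass the value as the optional `ttl` argument of `SemanticCache.set` in section 2.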

2. Semantic Cache Implementation

import Anthropic from '@anthropic-ai/sdk';
import { Redis } from 'ioredis';
import crypto from 'crypto';

const client = new Anthropic();
const redis = new Redis(process.env.REDIS_URL!);

type Embedding = number[];

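function cosineSimilarity(a: Embedding, b: Embedding): number {
  // Vectors of different lengths (e.g. from different embedding models) are not comparable
  if (a.length !== b.length) return 0;
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  // Avoid NaN from dividing by zero when either vector is all zeros
  if (normA === 0 || normB === 0) return 0;
  return dot / (normA * normB);
}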

async function getEmbedding(text: string): Promise<Embedding> {
  // PLACEHOLDER: the Claude Messages API does not return embeddings, so no LLM call is made here.
  // In practice, call an embedding API such as Voyage AI or OpenAI Embeddings.
  // Random vectors will practically never exceed the similarity threshold,
  // so the L2 tier only works once this function is replaced.
  return Array.from({ length: 1536 }, () => Math.random() - 0.5);
}

class SemanticCache {
  private readonly similarityThreshold = 0.92;
  private readonly maxCacheEntries = 10000;
  private readonly defaultTTL = 86400; // 24 hours

  async get(question: string): Promise<string | null> {
    // L1: exact match
    const exactKey = `cache:exact:${crypto.createHash('md5').update(question.trim().toLowerCase()).digest('hex')}`;
    const exactHit = await redis.get(exactKey);
    if (exactHit) {
      await redis.incr('cache:stats:l1_hits');
      return exactHit;
    }

    // L2: semantic match
    const embedding = await getEmbedding(question);
    const semanticHit = await this.findSemanticMatch(embedding);
    if (semanticHit) {
      await redis.incr('cache:stats:l2_hits');
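      // Promote to L1 so the same phrasing becomes an exact hit next time.
      // Note: this uses defaultTTL, not the remaining TTL of the matched entry.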
      await redis.set(exactKey, semanticHit, 'EX', this.defaultTTL);
      return semanticHit;
    }

    await redis.incr('cache:stats:misses');
    return null;
  }

  async set(question: string, answer: string, ttl?: number): Promise<void> {
    const normalizedQ = question.trim().toLowerCase();
    const exactKey = `cache:exact:${crypto.createHash('md5').update(normalizedQ).digest('hex')}`;
    const embedding = await getEmbedding(question);
    const entryId = crypto.randomUUID();

    const effectiveTTL = ttl ?? this.defaultTTL;

    await Promise.all([
      // Store in L1
      redis.set(exactKey, answer, 'EX', effectiveTTL),
      // Store in L2 (vector + answer)
      redis.set(`cache:vector:${entryId}`, JSON.stringify({
        question,
        answer,
        embedding,
        createdAt: Date.now(),
      }), 'EX', effectiveTTL),
      // Add to the index
      redis.sadd('cache:vector:index', entryId),
    ]);
  }

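  private async findSemanticMatch(queryEmbedding: Embedding): Promise<string | null> {
    // Linear scan over at most 1,000 entries; a production system would use an ANN index instead
    const entryIds = (await redis.smembers('cache:vector:index')).slice(0, 1000);
    if (entryIds.length === 0) return null;

    // Batch lookup: fetch all candidates in one round trip
    const entries = await redis.mget(entryIds.map(id => `cache:vector:${id}`));

    let bestMatch: string | null = null;
    let bestScore = this.similarityThreshold;
    const expiredIds: string[] = [];

    entries.forEach((entry, i) => {
      if (!entry) {
        // The entry's TTL elapsed, but its ID is still in the index set
        expiredIds.push(entryIds[i]);
        return;
      }
      const { answer, embedding } = JSON.parse(entry);
      const score = cosineSimilarity(queryEmbedding, embedding);
      if (score > bestScore) {
        bestScore = score;
        bestMatch = answer;
      }
    });

    // Prune expired IDs so the index set does not grow without bound
    if (expiredIds.length > 0) await redis.srem('cache:vector:index', ...expiredIds);

    return bestMatch;
  }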

  async getStats() {
    const [l1Hits, l2Hits, misses] = (await Promise.all([
      redis.get('cache:stats:l1_hits'),
      redis.get('cache:stats:l2_hits'),
      redis.get('cache:stats:misses'),
    ])).map(v => Number(v ?? 0));
    const total = l1Hits + l2Hits + misses;
    // Report 0.0% instead of NaN% before any lookup has been recorded
    const pct = (n: number) => `${(total === 0 ? 0 : (n / total) * 100).toFixed(1)}%`;
    return {
      l1HitRate: pct(l1Hits),
      l2HitRate: pct(l2Hits),
      overallHitRate: pct(l1Hits + l2Hits),
    };
  }
}

// LLM call wrapper
const cache = new SemanticCache();

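async function cachedLLMCall(question: string, systemPrompt: string): Promise<string> {
  // The cache key is derived from the question only, so this wrapper assumes a single,
  // fixed systemPrompt; answers generated under different prompts would be mixed up.
  const cached = await cache.get(question);
  if (cached !== null) return cached;

  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: 'user', content: question }],
  });

  const first = response.content[0];
  const answer = first?.type === 'text' ? first.text : '';
  // Do not cache empty answers (e.g. when the first content block is not text)
  if (answer) await cache.set(question, answer);
  return answer;
}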
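
The wrapper above passes `systemPrompt` to the model but looks up the cache by question alone. If several system prompts share one Redis instance, one way to keep their answers apart is to fold a hash of the prompt into the key. The sketch below illustrates that idea; `scopedExactKey` is a hypothetical helper and not part of the cache class shown earlier, and the same scoping would also have to be applied to the vector entries.

```typescript
import { createHash } from 'node:crypto';

function sha256(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// Hypothetical helper: builds an exact-match key that is unique per
// (system prompt, question) pair instead of per question only.
function scopedExactKey(systemPrompt: string, question: string): string {
  const scope = sha256(systemPrompt).slice(0, 16); // short, stable prompt fingerprint
  const normalized = question.trim().toLowerCase();
  return `cache:exact:${scope}:${sha256(normalized)}`;
}

const question = 'When will my delivery arrive?';
const supportKey = scopedExactKey('You are a support agent.', question);
const salesKey = scopedExactKey('You are a sales assistant.', question);

console.log(supportKey === salesKey); // false: different prompts, different cache entries
console.log(supportKey === scopedExactKey('You are a support agent.', '  ' + question.toUpperCase())); // true
```

Editing the system prompt then invalidates its old entries implicitly, because the fingerprint in the key changes.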

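Conclusion

The key to a semantic cache is the threshold setting. A value of 0.92 is generally safe for FAQ scenarios, but it should still be validated per domain. If the threshold is too low, the cache returns an answer written for a different question; if it is too high, paraphrases miss the cache. Answers that contain real-time data (stock levels, weather) are not cached. Even a cache hit rate of 30-50% removes that share of LLM calls, which can cut LLM API costs substantially.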
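
As a rough worked example of the savings, the sketch below estimates cost with and without the cache. Every number in it is hypothetical, and it assumes one embedding call per request that misses L1. The implementation in section 2 computes the embedding in both `get` and `set`, so a full miss would pay for two embeddings unless the vector is reused.

```typescript
interface Traffic {
  requests: number;  // total questions in the period
  l1HitRate: number; // share answered by exact match (0..1)
  l2HitRate: number; // share answered by semantic match (0..1)
}

interface UnitCosts {
  llmCall: number;       // average cost of one LLM call
  embeddingCall: number; // average cost of one embedding call
}

function estimateCost(traffic: Traffic, costs: UnitCosts) {
  const withoutCache = traffic.requests * costs.llmCall;
  // Every request that misses L1 needs an embedding for the L2 lookup
  const embeddingCalls = traffic.requests * (1 - traffic.l1HitRate);
  // Only requests that miss both tiers reach the LLM
  const llmCalls = traffic.requests * (1 - traffic.l1HitRate - traffic.l2HitRate);
  const withCache = llmCalls * costs.llmCall + embeddingCalls * costs.embeddingCall;
  const savedFraction = withoutCache === 0 ? 0 : 1 - withCache / withoutCache;
  return { withoutCache, withCache, savedFraction };
}

// Hypothetical traffic and prices: 40% overall hit rate (10% L1 + 30% L2).
const estimate = estimateCost(
  { requests: 100_000, l1HitRate: 0.1, l2HitRate: 0.3 },
  { llmCall: 0.01, embeddingCall: 0.0001 },
);
console.log(estimate);
```

With these made-up inputs the saved fraction comes out at roughly 39%, slightly below the 40% hit rate because the embedding lookups are not free.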