OCR + LLM Document Parsing Pipeline: From PDFs to Receipts

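Who this article is for

Teams that want to extract structured data from PDFs, scanned documents, and receipts
Developers who want to combine Textract or Tesseract with an LLM
Teams building a multilingual document processing pipeline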

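Introduction

Scanned contracts, piles of receipts, PDF invoices: the usual way to extract data from documents like these is to combine OCR with an LLM. OCR recognizes the text, and the LLM turns that unstructured text into structured data.

This article was written with reference to the OCR LLM document parsing pipeline guide on bluefoxdev.kr.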

1. Document parsing pipeline

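[Pipeline stages]

1. Input handling
   PDF → per-page image conversion (pdf2pic)
   Image → preprocessing (rotation correction, noise removal)
   Text-based PDF → direct text extraction (pdfjs)

2. OCR
   Scanned documents: AWS Textract (table/form recognition)
   Fast processing: Tesseract.js (local)
   High accuracy: Google Cloud Vision

3. LLM structured extraction
   Unstructured text → extraction into a JSON schema
   Claude: handles Korean documents well
   Validation: Zod schema validation

4. Post-processing
   Error correction (OCR misrecognition)
   Normalization (date and amount formats)
   Confidence score calculation

[Strategy by document type]
  Receipts: extract amounts, date, store name
  Invoices: extract the line-item table and totals
  Contracts: classify clauses, extract the parties
  Medical charts: HIPAA-compliant handling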
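
Stage 4 lists normalization of date and amount formats without showing what it might look like. The sketch below is one possible approach, assuming amounts are printed like "12,000원" and dates like "2024.03.05" or "2024년 3월 5일". The helper names normalizeAmount and normalizeDate are hypothetical and not part of the original pipeline.

```typescript
// Hypothetical post-processing helpers for stage 4 (normalization).
// The names and accepted formats are assumptions made for illustration.

/** Parse an amount such as "12,000원" or "₩ 3,500" into a number; null if it has no usable digits. */
function normalizeAmount(raw: string): number | null {
  // Keep digits, the decimal point, and the minus sign; drop currency marks, commas, and spaces.
  const cleaned = raw.replace(/[^0-9.\-]/g, '');
  if (!/\d/.test(cleaned)) return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

/** Convert dates such as "2024.03.05", "2024/3/5", or "2024년 3월 5일" to "YYYY-MM-DD"; null if unrecognized. */
function normalizeDate(raw: string): string | null {
  const match = raw.match(/(\d{4})\s*[.\-\/년]\s*(\d{1,2})\s*[.\-\/월]\s*(\d{1,2})/);
  if (!match) return null;
  const [, year, month, day] = match;
  const m = Number(month);
  const d = Number(day);
  // Range check only; this does not validate the day against the month length.
  if (m < 1 || m > 12 || d < 1 || d > 31) return null;
  return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`;
}
```

Returning null instead of throwing lets the caller decide whether an unparseable field should lower the confidence score or send the document to manual review.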

2. Document parsing implementation

import Anthropic from '@anthropic-ai/sdk';
import { TextractClient, DetectDocumentTextCommand, AnalyzeDocumentCommand } from '@aws-sdk/client-textract';
import { z } from 'zod';
import * as fs from 'fs';

const claude = new Anthropic();
const textract = new TextractClient({ region: 'ap-northeast-2' });

// Receipt schema. Optional fields use nullish() so that a null returned by the model
// for a missing value passes validation; optional() alone would reject null.
const ReceiptSchema = z.object({
  storeName: z.string(),
  storeAddress: z.string().nullish(),
  date: z.string(),
  items: z.array(z.object({
    name: z.string(),
    quantity: z.number().nullish(),
    price: z.number(),
  })),
  subtotal: z.number().nullish(),
  tax: z.number().nullish(),
  total: z.number(),
  paymentMethod: z.string().nullish(),
});

type Receipt = z.infer<typeof ReceiptSchema>;

// OCR with AWS Textract: returns the recognized lines joined by newlines
async function ocrWithTextract(imageBuffer: Buffer): Promise<string> {
  const response = await textract.send(new DetectDocumentTextCommand({
    Document: { Bytes: imageBuffer },
  }));

  return (response.Blocks ?? [])
    .filter(b => b.BlockType === 'LINE')
    .map(b => b.Text ?? '')
    .join('\n');
}

// Documents with tables (invoices): returns the raw Textract blocks
async function analyzeDocumentWithTables(imageBuffer: Buffer) {
  const response = await textract.send(new AnalyzeDocumentCommand({
    Document: { Bytes: imageBuffer },
    FeatureTypes: ['TABLES', 'FORMS'],
  }));
  return response.Blocks ?? [];
}

// Structured extraction with the LLM
async function extractReceiptData(rawText: string): Promise<Receipt | null> {
  const response = await claude.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Extract the information from the following receipt text and return it as JSON.

OCR text:
${rawText}

Return format:
{
  "storeName": "store name",
  "storeAddress": "address (null if absent)",
  "date": "YYYY-MM-DD",
  "items": [{"name": "item name", "quantity": quantity, "price": amount}],
  "subtotal": subtotal,
  "tax": tax,
  "total": total,
  "paymentMethod": "payment method"
}

The OCR output may contain misrecognized characters, so correct them using context. Return only the JSON.`,
    }],
    temperature: 0,
  });

  try {
    const text = response.content[0].type === 'text' ? response.content[0].text : '{}';
    const jsonMatch = text.match(/\{[\s\S]*\}/);
    if (!jsonMatch) return null;
    const parsed = JSON.parse(jsonMatch[0]);
    return ReceiptSchema.parse(parsed);
  } catch {
    return null;
  }
}

// Multimodal: pass the image directly (Claude Vision).
// media_type is hard-coded, so this assumes a JPEG input.
async function extractWithVision(imageBase64: string): Promise<Receipt | null> {
  const response = await claude.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image',
          source: { type: 'base64', media_type: 'image/jpeg', data: imageBase64 },
        },
        {
          type: 'text',
          text: 'Extract the information from this receipt image and return it as JSON. Include the fields storeName, date, items[], and total.',
        },
      ],
    }],
  });

  try {
    const text = response.content[0].type === 'text' ? response.content[0].text : '{}';
    const match = text.match(/\{[\s\S]*\}/);
    return match ? ReceiptSchema.parse(JSON.parse(match[0])) : null;
  } catch { return null; }
}

// Combined pipeline: try Vision first, fall back to Textract + LLM
async function processDocument(filePath: string): Promise<Receipt | null> {
  const buffer = await fs.promises.readFile(filePath);
  const base64 = buffer.toString('base64');

  // Method 1: Claude Vision (when image quality is good)
  const visionResult = await extractWithVision(base64);
  if (visionResult) return visionResult;

  // Method 2: Textract + LLM (when OCR accuracy is needed, or when Vision returned null)
  const ocrText = await ocrWithTextract(buffer);
  return extractReceiptData(ocrText);
}
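
Stage 4 of the pipeline also mentions a confidence score, which the code above does not compute. One simple, model-independent signal is arithmetic consistency: do the extracted numbers agree with each other? The sketch below is an assumption-laden illustration, not part of the original implementation. checkReceiptConsistency, ReceiptTotals, and ConsistencyReport are hypothetical names, each item's price is assumed to be the line amount (not a unit price), and the default tolerance of 1 is an arbitrary allowance for rounding.

```typescript
// Hypothetical consistency check used as a rough confidence signal.
// ReceiptTotals mirrors only the numeric fields of the Receipt type.
interface ReceiptTotals {
  items: { price: number; quantity?: number | null }[];
  subtotal?: number | null;
  tax?: number | null;
  total: number;
}

interface ConsistencyReport {
  score: number;      // fraction of applicable checks that passed, between 0 and 1
  failures: string[]; // human-readable description of each failed check
}

function checkReceiptConsistency(r: ReceiptTotals, tolerance = 1): ConsistencyReport {
  const close = (a: number, b: number) => Math.abs(a - b) <= tolerance;
  const failures: string[] = [];
  let checks = 0;

  // Check 1: line items should add up to the total, or to the subtotal on
  // receipts that list pre-tax amounts. An empty item list always fails.
  checks += 1;
  const itemSum = r.items.reduce((sum, item) => sum + item.price, 0);
  const matchesTotal = close(itemSum, r.total);
  const matchesSubtotal = r.subtotal != null && close(itemSum, r.subtotal);
  if (r.items.length === 0 || !(matchesTotal || matchesSubtotal)) {
    failures.push(`item sum ${itemSum} matches neither total nor subtotal`);
  }

  // Check 2: applies only when both subtotal and tax were extracted.
  if (r.subtotal != null && r.tax != null) {
    checks += 1;
    if (!close(r.subtotal + r.tax, r.total)) {
      failures.push(`subtotal + tax = ${r.subtotal + r.tax}, but total = ${r.total}`);
    }
  }

  return { score: (checks - failures.length) / checks, failures };
}
```

A result from processDocument can be passed in directly, since Receipt is structurally compatible with ReceiptTotals. A score below 1 does not prove the extraction is wrong (discounts or tips can break the arithmetic), so it is better treated as a trigger for review than as a hard rejection.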

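Wrapping up

The core idea of an OCR + LLM pipeline is that the LLM compensates for imperfect OCR output. Claude uses context to correct OCR misrecognitions (for example, "O원" read in place of "0원") and extracts structured JSON from unstructured text. When image quality is good, passing the image directly to Claude Vision and skipping the OCR stage is often the more accurate option, which is why the pipeline above tries Vision first and falls back to Textract only when that result fails validation.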