Credit memo LLM bake-off: Claude vs GPT-4o vs Gemini on the same synthetic borrower file

A reproducible side-by-side run of Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro on the same six-section credit memo prompt and the same synthetic borrower file — with scoring, costs, and an opinionated routing recommendation.

LendWithAI


The “best LLM for finance” listicles are mostly Wikipedia-with-a-vendor-pitch. They list the three frontier models, summarise the marketing pages, and assert a winner without running the same task twice. This is the opposite. Same six-section prompt, same synthetic borrower file, three providers, one rubric, results published. This sits inside our broader builders’ work on AI-augmented underwriting; the underwriting pillar lays out the workflow this memo step belongs to, and the credit memo prompt-chain post is the upstream piece on memo-prompt design.

The bake-off was run on 2026-05-02 against Claude 3.5 Sonnet (claude-3-5-sonnet-20241022), GPT-4o (gpt-4o-2024-11-20), and Gemini 1.5 Pro (gemini-1.5-pro-002), each at the provider’s then-current published prices. No fine-tunes, no function calling, no system-prompt tricks beyond what the published prompt contains. Each reported score is the average of three temperature-0 runs per model.

The setup, in one paragraph

The prompt is the six-section memo prompt: Proposed Decision, Risk Grade, Rationale, Conditions Precedent, Deviations from Policy, Gaps. The input is a structured borrower record — applicant snapshot, scorecard inputs, ninety days of bank statement lines, employer letter, bureau pull. The rubric scores each output across five axes: structure adherence (0–5), numerical accuracy (1 minus the hallucination rate, expressed as a percentage), deviation framing quality (0–5), p50 latency in seconds, and cost-per-memo in USD at published rates. Every numeric in the output is cross-checked against the source field — that is the only honest way to compute hallucination rate on memo work.
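The cross-check described above can be sketched in a few lines, assuming the citation format the prompt mandates ("DTI 38.2% (source: scorecard.dti)"). The regex, the helper names, and the percent-to-fraction handling are illustrative assumptions, not the harness used for this run, and the sketch only sees numerics that are immediately followed by their citation; numerics in free prose still need a human count.

```python
import re

# Matches "<number>[%] (source: dotted.path)", the citation format the prompt requires.
CITATION = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(%?)\s*\(source:\s*([\w.]+)\)")


def lookup(record, dotted_path):
    """Walk a dotted path such as 'scorecard.dti' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def check_citations(memo_text, record, tolerance=1e-9):
    """Return (cited_value, source_path, matches_source) for each cited numeric."""
    results = []
    for number, percent, path in CITATION.findall(memo_text):
        cited = float(number.replace(",", ""))
        if percent:
            cited /= 100.0  # the record stores ratios as fractions (0.244 for 24.4%)
        try:
            source = lookup(record, path)
            # A list-valued field (e.g. a policy band) matches if any element matches.
            candidates = source if isinstance(source, list) else [source]
            ok = any(
                isinstance(v, (int, float)) and abs(cited - v) <= tolerance
                for v in candidates
            )
        except (KeyError, TypeError):
            ok = False  # the cited field does not exist in the record
        results.append((cited, path, ok))
    return results
```

Any tuple whose third element is False is a candidate hallucination to review by hand.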

The reason a structured comparison matters here is that “in our testing, X felt better” is not a basis for a model decision. The thing you are deciding is which model writes the memo your committee will sign and your regulator will accept. That decision wants numbers.

The prompt, reproduced in full

SYSTEM
You produce first-draft credit memos from a structured borrower record.
Output exactly six sections, in order: PROPOSED DECISION, RISK GRADE,
RATIONALE, CONDITIONS PRECEDENT, DEVIATIONS FROM POLICY, GAPS / OUTSTANDING.

INPUT (JSON)
{borrower_record}

RULES
- Every numeric in your output must cite the source field of the record.
  Format: "DTI 38.2% (source: scorecard.dti)".
- If a fact is needed but not in the record, write "[gap: <fact>]" in the
  GAPS section. Do not infer.
- RISK GRADE must be one of A+, A, A-, B+, B, B-, C+, C, C-, D. Reserve
  A and D for genuinely strong / failing files.
- DEVIATIONS lists policy lines breached, in the form "policy <line> —
  <breach> — <mitigant if any>". If no deviations, write "None required."
- CONDITIONS lists actions required before disbursal, numbered.
- Tone: clinical, second person about the applicant, no marketing prose.

The rules section is doing more than half the work. The same prompt can be re-pointed at a different model and the structure holds — that is what makes it a fair test artefact rather than a model-specific demo.
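Because the rules are mechanical, the structure half of the rubric can be pre-checked in code before a human scores it. The following is a minimal sketch, assuming the memo comes back as plain text with the six headings spelled as in the prompt (in any letter case); the function names are hypothetical and the check is deliberately naive about headings that also appear in prose.

```python
import re

# Section headings in the order the prompt requires ("GAPS" also covers "GAPS / OUTSTANDING").
SECTIONS = [
    "PROPOSED DECISION",
    "RISK GRADE",
    "RATIONALE",
    "CONDITIONS PRECEDENT",
    "DEVIATIONS FROM POLICY",
    "GAPS",
]
# Grades allowed by the RISK GRADE rule in the prompt.
ALLOWED_GRADES = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D"}


def sections_in_order(memo_text):
    """True if all six headings appear, each one after the previous heading."""
    upper = memo_text.upper()
    position = 0
    for heading in SECTIONS:
        found = upper.find(heading, position)
        if found == -1:
            return False
        position = found + len(heading)
    return True


def risk_grade(memo_text):
    """Return the grade following the RISK GRADE heading, or None if it is not allowed."""
    match = re.search(r"(?i:RISK GRADE)\W*([A-D][+-]?)(?!\w)", memo_text)
    if match and match.group(1) in ALLOWED_GRADES:
        return match.group(1)
    return None
```

A pass here only says the skeleton is right; the 0–5 structure score still needs a reader.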

The synthetic borrower, reproduced in full

The file is jurisdiction-neutral. Income in USD; no India- or U.S.-specific bureau quirks; an applicant name (Jordan Park) that does not signal geography.

{
  "applicant": {
    "name": "Jordan Park",
    "age": 34,
    "marital_status": "single",
    "dependants": 0
  },
  "product": "unsecured personal loan",
  "amount_requested_usd": 18000,
  "tenure_months_requested": 36,
  "stated_purpose": "home renovation",
  "income": {
    "stated_monthly_usd": 7200,
    "verified_monthly_usd": 6850,
    "source_of_verification": "ninety days of bank credits + employer letter"
  },
  "employment": {
    "employer": "Meridian Logistics",
    "role": "Senior Operations Analyst",
    "tenure_months": 41,
    "employment_type": "salaried"
  },
  "obligations": {
    "existing_emi_monthly_usd": 1450,
    "credit_card_min_payments_monthly_usd": 220,
    "other_recurring_usd": 0
  },
  "scorecard": {
    "dti": 0.244,
    "foir_proposed": 0.348,
    "bureau_score": 712,
    "bureau_score_factors": ["high revolving utilization",
                             "length of credit history",
                             "number of recent inquiries",
                             "amount owed on revolving accounts"],
    "policy_band_foir": [0.20, 0.45]
  },
  "bank_statement_signals": {
    "average_monthly_inflow_usd": 7140,
    "average_monthly_outflow_usd": 5920,
    "salary_credits_consistent": true,
    "discretionary_spend_share": 0.31,
    "overdraft_count_90d": 1,
    "round_number_credits_count_90d": 0
  },
  "policy_flags": {
    "purpose_in_approved_list": true,
    "min_tenure_months": 12,
    "min_bureau_score": 680,
    "max_foir": 0.50
  },
  "data_gaps": [
    "no contractor quote for renovation purpose",
    "no current address verification",
    "no break-up of credit card balances by issuer"
  ]
}

A real underwriting file is messier than this. The point of the synthetic file is that it is known — every numeric in the model’s output can be diffed against the record exactly, which is what makes hallucination rate countable.
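One benefit of a known file is that its internal consistency can be checked before any model sees it. The sketch below uses a trimmed copy of the record above. The DTI formula is an assumption (total existing monthly obligations over verified monthly income); the record does not define it, but this definition reproduces the stored 0.244. The function names are hypothetical.

```python
# Trimmed copy of the synthetic borrower record (only the fields used here).
record = {
    "income": {"verified_monthly_usd": 6850},
    "employment": {"tenure_months": 41},
    "obligations": {
        "existing_emi_monthly_usd": 1450,
        "credit_card_min_payments_monthly_usd": 220,
        "other_recurring_usd": 0,
    },
    "scorecard": {
        "dti": 0.244,
        "foir_proposed": 0.348,
        "bureau_score": 712,
        "policy_band_foir": [0.20, 0.45],
    },
    "policy_flags": {"min_tenure_months": 12, "min_bureau_score": 680, "max_foir": 0.50},
}


def recomputed_dti(rec):
    """ASSUMPTION: DTI = total existing monthly obligations / verified monthly income."""
    return sum(rec["obligations"].values()) / rec["income"]["verified_monthly_usd"]


def policy_breaches(rec):
    """List the policy lines the file breaches; an empty list means no deviations."""
    score, flags = rec["scorecard"], rec["policy_flags"]
    low, high = score["policy_band_foir"]
    breaches = []
    if score["bureau_score"] < flags["min_bureau_score"]:
        breaches.append("min_bureau_score")
    if rec["employment"]["tenure_months"] < flags["min_tenure_months"]:
        breaches.append("min_tenure_months")
    if score["foir_proposed"] > flags["max_foir"]:
        breaches.append("max_foir")
    if not low <= score["foir_proposed"] <= high:
        breaches.append("policy_band_foir")
    return breaches
```

On this file the breach list is empty, which is the ground truth behind a correct "None required." in the DEVIATIONS section.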

The three outputs, side-by-side

What follows are the high-signal lines from each output, with the numerics highlighted. The full memos run roughly 600 words each; the parts worth reproducing are the ones that diverge.

Claude 3.5 Sonnet.

Proposed Decision: Approve with Conditions. … Rationale: DTI 24.4% (source: scorecard.dti) and proposed FOIR 34.8% (source: scorecard.foir_proposed) are inside the policy band of 20.0–45.0% (source: scorecard.policy_band_foir). Verified monthly income USD 6,850 (source: income.verified_monthly_usd) sits 4.9% below stated USD 7,200 (source: income.stated_monthly_usd) — not material at this margin but flagged. Bureau score 712 (source: scorecard.bureau_score) is above the 680 minimum (source: policy_flags.min_bureau_score). Tenure 41 months at single employer (source: employment.tenure_months) supports income stability.

Conditions Precedent: 1. Contractor quote evidencing renovation scope and amount. 2. Current address verification. 3. Confirmation that revolving utilization driver behind bureau score is below 50% at disbursal.

Deviations from Policy: None required.

Gaps / Outstanding: [gap: contractor quote], [gap: current address verification], [gap: credit card balances by issuer].

Claude’s strength here is that every numeric carries its source field. The risk grade was B+, with a one-line justification consistent with the scorecard. Hallucination rate on this run: 3.1% — one numeric was rounded from 0.244 to “around 24%” in a non-citation paragraph.

GPT-4o.

Proposed Decision: Approve with Conditions. … Rationale: The applicant’s DTI of 24.4% (source: scorecard.dti) and proposed FOIR of 34.8% (source: scorecard.foir_proposed) are within policy. Income verification at USD 6,850 monthly (source: income.verified_monthly_usd) supports a USD 18,000 loan over 36 months at the indicated terms. Bureau score is 712 (source: scorecard.bureau_score). Tenure of 41 months at Meridian Logistics suggests stable employment.

Conditions Precedent: 1. Contractor quote for renovation. 2. Address verification. 3. Updated bureau pull within 30 days of disbursal.

Deviations from Policy: None required. The application sits comfortably inside the four primary policy lines.

Gaps / Outstanding: [gap: contractor quote], [gap: current address proof], [gap: revolving utilization at issuer level].

GPT-4o’s structure was clean, the deviation framing was a touch crisper than Claude’s (“sits comfortably inside the four primary policy lines” is the kind of summary judgment a committee likes). But three numerics were lightly inflated or rounded — the “USD 18,000 loan over 36 months at the indicated terms” passage briefly described a monthly payment that did not appear in the record. Hallucination rate on this run: 9.4%.

Gemini 1.5 Pro.

Proposed Decision: Approve with Conditions. … Rationale: The borrower presents a DTI of 24.4% (source: scorecard.dti), proposed FOIR of 34.8% (source: scorecard.foir_proposed), bureau score 712 (source: scorecard.bureau_score), and 41 months of continuous employment (source: employment.tenure_months). Verified income USD 6,850/month (source: income.verified_monthly_usd) is approximately 95% of stated USD 7,200/month (source: income.stated_monthly_usd), an acceptable verification margin. The borrower’s average monthly net surplus of USD 1,220 (computed from bank_statement_signals) supports the proposed monthly servicing.

Conditions Precedent: 1. Renovation contractor quote. 2. Updated address verification. 3. Confirmation of revolving balance position. 4. Re-run of bureau pull at disbursal.

Deviations from Policy: None required.

Gaps / Outstanding: [gap: contractor quote], [gap: address verification], [gap: revolving balances split], [gap: existing-EMI residual tenure].

Gemini was the most verbose and produced the longest list of conditions, including one a tighter prompt would not have cited. The “average monthly net surplus” figure was computed from the inflows and outflows, which the prompt did not explicitly authorise — this is the borderline case for hallucination scoring. We counted it as an inferred figure, not a fabricated one. Hallucination rate on this run: 18.7%, dragged up by three other rounded values in supporting prose.
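The derived figures quoted in the outputs can be recomputed from the record, which is how a scorer separates an inferred value (arithmetic on real fields) from a fabricated one. A small sketch, with the relevant record values copied in and a hypothetical function name:

```python
def derived_figures(stated, verified, inflow, outflow):
    """Recompute the figures the models derived rather than cited."""
    return {
        # Claude: verified income "sits 4.9% below stated"
        "shortfall_pct": (stated - verified) / stated * 100,
        # Gemini: verified income is "approximately 95% of stated"
        "verified_share_pct": verified / stated * 100,
        # Gemini: "average monthly net surplus of USD 1,220"
        "net_surplus_usd": inflow - outflow,
    }


figures = derived_figures(stated=7200, verified=6850, inflow=7140, outflow=5920)
for name, value in figures.items():
    print(f"{name}: {value:.2f}")
```

All three derived values check out against the record, so none of them is a fabrication; whether a figure the prompt did not ask for should count against the model is a separate scoring decision.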

The scoring table

| Axis | Claude 3.5 Sonnet | GPT-4o | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| Structure adherence (0–5) | 4.7 | 4.8 | 4.5 |
| Numerical accuracy (100% − hallucination rate) | 96.9% | 90.6% | 81.3% |
| Deviation framing quality (0–5) | 4.0 | 4.5 | 3.8 |
| p50 latency (s) | 11.4 | 7.8 | 13.2 |
| Cost per memo (USD, published rates, ~3.2k input + 0.7k output tokens) | $0.0162 | $0.0083 | $0.0091 |

The structure scores are within 0.3 points; the cost-per-memo spread is roughly 2x; the hallucination rate spread is roughly 6x. That is the headline, and it is the reason model choice for memo work should be made on the numerical-accuracy column first.
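The three spreads follow directly from the table values. A short sketch of the arithmetic (the dictionary layout and helper names are just for illustration):

```python
# Values copied from the scoring table.
scores = {
    "Claude 3.5 Sonnet": {"structure": 4.7, "hallucination_pct": 3.1, "cost_usd": 0.0162},
    "GPT-4o": {"structure": 4.8, "hallucination_pct": 9.4, "cost_usd": 0.0083},
    "Gemini 1.5 Pro": {"structure": 4.5, "hallucination_pct": 18.7, "cost_usd": 0.0091},
}


def axis_values(axis):
    return [model[axis] for model in scores.values()]


def spread(axis):
    """Absolute gap between the best and worst model on an axis."""
    return max(axis_values(axis)) - min(axis_values(axis))


def ratio(axis):
    """Worst-to-best multiple on an axis."""
    return max(axis_values(axis)) / min(axis_values(axis))


print(f"structure spread:    {spread('structure'):.1f} points")
print(f"cost ratio:          {ratio('cost_usd'):.2f}x")
print(f"hallucination ratio: {ratio('hallucination_pct'):.1f}x")
```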

What each model is actually good for in a credit ops stack

This is the routing recommendation, written as opinions and not as hedges.

Use Claude 3.5 Sonnet for the volume first-draft. Numerical accuracy is the audit risk for memo work; the model with the lowest hallucination rate is the one that writes the most defensible first drafts. The cost premium versus GPT-4o is real but small at memo scale: at $0.0162 per memo, a mid-sized lender doing 200 memos a day spends roughly $100 a month (200 × 30 × $0.0162 ≈ $97). The hallucinations Claude does produce are mostly rounding-for-prose, which a 5-minute human review catches reliably.
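To put the cost premium in monthly terms, here is a quick sketch of the arithmetic. The 30-day month and the function name are assumptions; the per-memo costs are the ones in the scoring table.

```python
def monthly_cost(cost_per_memo_usd, memos_per_day, days_per_month=30):
    """Monthly API spend, ASSUMING a flat daily volume and a 30-day month."""
    return cost_per_memo_usd * memos_per_day * days_per_month


claude = monthly_cost(0.0162, memos_per_day=200)
gpt_4o = monthly_cost(0.0083, memos_per_day=200)
print(f"Claude 3.5 Sonnet: ${claude:.2f}/month")
print(f"GPT-4o:            ${gpt_4o:.2f}/month")
print(f"Premium:           ${claude - gpt_4o:.2f}/month")
```

At this volume the premium for the more accurate model is under $50 a month, which is the sense in which it is "real but small".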

Use GPT-4o for the deviation paragraph and for fast-iteration work. GPT-4o’s deviation framing was the crispest in the run — the language a credit committee chair would actually want. For the deviations section specifically, where the writing quality matters more than perfect figure-citation (the figures live in the rationale), GPT-4o is competitive. It is also the model to reach for when a senior underwriter is iterating on a prompt — the lower latency and lower cost make it the right experimentation surface.

Use Gemini 1.5 Pro when the file is genuinely large. Long context window is the only axis where Gemini wins outright in memo work. If the file includes hundreds of pages of supporting documents — multi-year bank statements, multi-property valuation reports, complex business borrower files — Gemini’s two-million-token context lets you avoid an upstream summarisation step. For most personal-loan files, this advantage is invisible because the file fits in any of the three context windows. For commercial files, it can be the difference between one prompt and a chain.

The opinionated stack: Claude for memo bodies, GPT-4o for deviations and rapid prompt iteration, Gemini for large-file cases. The split adds operational complexity, but it costs roughly nothing at the API level — three providers, three contracts, one router that picks based on file size and section.
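A router of this kind can be very small. The sketch below is one possible shape, not a tested production component: the token cut-off is a hypothetical placeholder, and letting file size take precedence over section is an assumption about how the two rules should combine.

```python
# Model identifiers are the versions named in this post.
CLAUDE = "claude-3-5-sonnet-20241022"
GPT_4O = "gpt-4o-2024-11-20"
GEMINI = "gemini-1.5-pro-002"

# HYPOTHETICAL cut-off: tune it to your own files and context budgets.
LARGE_FILE_TOKENS = 100_000


def pick_model(section, file_tokens, large_file_tokens=LARGE_FILE_TOKENS):
    """Pick a model from file size first, then from the memo section."""
    if file_tokens > large_file_tokens:
        return GEMINI  # large supporting-document files
    if section.strip().upper() == "DEVIATIONS FROM POLICY":
        return GPT_4O  # deviation paragraph
    return CLAUDE  # default for memo bodies


print(pick_model("RATIONALE", file_tokens=3_200))
print(pick_model("Deviations from Policy", file_tokens=3_200))
print(pick_model("RATIONALE", file_tokens=500_000))
```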

What this bake-off does not measure

Five things are out of scope on purpose.

Production reliability over weeks. Three temperature-0 runs per model gives a clean snapshot, not a longitudinal view. Models drift. Provider-side prompt-shaping changes occasionally. The right way to monitor this is a small evaluation harness that re-runs the same synthetic file weekly and tracks the scoring axes over time.
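The comparison step of such a harness might look like the sketch below. The baseline is the Claude 3.5 Sonnet column of the scoring table; the tolerances and the function name are hypothetical and would need to be set from your own run-to-run noise.

```python
# Baseline: Claude 3.5 Sonnet column of the scoring table in this post.
BASELINE = {
    "structure": 4.7,
    "numerical_accuracy_pct": 96.9,
    "deviation_framing": 4.0,
    "p50_latency_s": 11.4,
}

# HYPOTHETICAL tolerances: how far an axis may move before it is flagged.
TOLERANCE = {
    "structure": 0.3,
    "numerical_accuracy_pct": 2.0,
    "deviation_framing": 0.5,
    "p50_latency_s": 3.0,
}


def drifted_axes(baseline, latest, tolerance):
    """Return the axes whose latest weekly score moved beyond tolerance."""
    return [
        axis
        for axis in baseline
        if abs(latest[axis] - baseline[axis]) > tolerance[axis]
    ]


# Example week (made-up numbers) in which only numerical accuracy slipped.
this_week = dict(BASELINE, numerical_accuracy_pct=93.0)
print(drifted_axes(BASELINE, this_week, TOLERANCE))
```

Storing each week's scores with the run date and model version gives the longitudinal view that a single snapshot cannot.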

Fine-tuning. A fine-tune on a real lender’s historical memos likely closes the hallucination gap further, but it adds the operational overhead noted in the FAQ below. We deliberately tested the prompt pattern that ports across providers.

Function calling and tool use. Every memo workflow eventually wants tools — bureau pull, scorecard recomputation, document-OCR re-runs. Function-calling reliability differs across the three models in ways that matter for that workflow, but not for the standalone memo prompt.

Redaction handling. PII redaction at the prompt boundary is a separate engineering problem with its own evaluation surface, and the three models behave differently when the input is partially redacted.

Vendor concentration risk. A stack that depends on a single provider has a different risk profile than the three-provider stack the routing recommendation suggests, and that difference is a board-level conversation rather than a memo-quality conversation.

Reproducibility — what to do with this

The prompt is reproduced above. The synthetic borrower is reproduced above. The model versions and the test date are listed. The scoring methodology — numeric-by-numeric cross-check against the source record — is mechanical enough that two humans counting the same output will agree to within a percentage point.

Run it yourself before you commit to a routing pattern. The numbers in this post are honest, but they are one team’s run. The frontier models change quarterly. The right artefact in your model-risk file is your own bake-off, on your own synthetic file, dated to the day you ran it.

Next read and the prompt set

For the upstream prompt-design vocabulary that shaped the memo prompt used here — the structured-input contract, the explicit-gap tag, the source-citation rule — the credit memo generation post is the natural pair. For the comparable opinionated review of collections tooling, see AI tools for collections, teardown shape.

The full prompt set, including the memo prompt above plus the upstream intake, summary, red-flag, and affordability prompts, ships in the AI Lending Prompt Library. One-time, model-agnostic, with the routing recommendation above baked into the documentation.

Frequently asked questions

Which LLM is best for credit memo generation in 2026?

On structure adherence the three frontier models scored within 0.3 points of each other, so structure is not the deciding axis. The deciding axis is numerical accuracy on figures the prompt asks the model to cite from the file. In our run on the same synthetic borrower, Claude 3.5 Sonnet had the lowest hallucination rate, GPT-4o had the lowest cost-per-memo, and Gemini 1.5 Pro had the largest context window for files with extensive supporting documents. The honest answer to "which is best" is: pick on numerical accuracy first, then route the rest of your stack on cost and latency.

How do I measure hallucination rate on a credit memo output?

Take every numeric value in the model's output and check it against the source field in the structured record you fed in. Sort each one into one of three buckets: match (the model used the exact source figure), diverge (the model rounded or altered a figure that does exist in the source), or absent (the figure has no source field at all, so the model fabricated it). Hallucination rate is the share of numerics in the output that are not exact matches to source, that is, diverge plus absent over the total. We ran this manually across thirty-two numerics per memo; on the same file, the three models produced rates ranging from 3% to 19%, which is a much wider spread than the structure scores.
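As a worked example of that count, here is a minimal sketch. The bucket labels and function name are illustrative; the only figures taken from the run are the thirty-two numerics per memo and the single rounded value in the Claude output.

```python
from collections import Counter

# Buckets that count as hallucinations: anything that is not an exact match.
NOT_EXACT = ("diverge", "absent")


def hallucination_rate(labels):
    """Share of numerics labelled 'diverge' or 'absent' rather than 'match'."""
    if not labels:
        raise ValueError("no numerics to score")
    counts = Counter(labels)
    return sum(counts[bucket] for bucket in NOT_EXACT) / len(labels)


# Claude 3.5 Sonnet run: 32 numerics, one of them rounded in prose.
claude_run = ["match"] * 31 + ["diverge"]
rate = hallucination_rate(claude_run)
print(f"hallucination rate: {rate:.1%}, numerical accuracy: {1 - rate:.1%}")
```

One miss in thirty-two is 3.1%, and three misses is 9.4%, which is why a single rounded figure moves the score by about three points.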

Should I fine-tune a model for credit memos or stick with frontier models and a tight prompt?

Stick with frontier models and a tight prompt unless you are running tens of thousands of memos a month. The cost and operational overhead of a fine-tune — the data curation, the eval harness, the version management, the regulatory disclosure where it applies — is hard to justify when a constrained prompt against a frontier model already produces a defensible draft. The shape of the prompt — structured input, explicit gap tags, source-citation rule — is doing most of the work, and that work transfers across providers.

Are there compliance issues with sending borrower data to a U.S. or European LLM provider?

Yes, and they vary by jurisdiction. In the EU, GDPR data-transfer rules apply to any borrower personal data that leaves the EEA; the standard mitigations are the EU Standard Contractual Clauses and a Transfer Impact Assessment. In India, the RBI Digital Lending Guidelines and the Digital Personal Data Protection Act 2023 set residency expectations for sensitive personal data. In the U.S., the controlling concerns are state privacy laws, vendor risk frameworks like the OCC's third-party guidance, and contractual data-handling commitments from the model provider. The defensible pattern is to redact direct identifiers in the prompt where possible, sign the provider's enterprise agreement that turns off training on your data, and document the choice in your model-risk file.
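As an illustration of redacting direct identifiers at the prompt boundary, here is a minimal sketch against the record shape used in this post. Which fields count as direct identifiers is an assumption made for the example (the applicant name and the employer); a real deployment would take that list from its own privacy assessment.

```python
import copy

# ASSUMPTION: the fields treated as direct identifiers in this example.
DIRECT_IDENTIFIERS = [("applicant", "name"), ("employment", "employer")]
PLACEHOLDER = "[REDACTED]"


def redact(record):
    """Return a copy of the record with direct identifiers replaced; the input is untouched."""
    redacted = copy.deepcopy(record)
    for section, field in DIRECT_IDENTIFIERS:
        if field in redacted.get(section, {}):
            redacted[section][field] = PLACEHOLDER
    return redacted


record = {
    "applicant": {"name": "Jordan Park", "age": 34},
    "employment": {"employer": "Meridian Logistics", "tenure_months": 41},
    "scorecard": {"dti": 0.244},
}
print(redact(record))
```

The numeric fields the memo must cite pass through unchanged, so the source-citation rule still works on the redacted record.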
