
Early-warning signals for P2P lenders: the LLM repayment-message read that flags loans 30 days before they go bad

A weekly LLM review of borrower repayment messages and platform notes flags a meaningful share of the loans about to go 30+ days past due (DPD), earlier than the dashboard does. Included: the prompt, a five-tier severity rubric, and a 30-thread benchmark.

LendWithAI


The problem with platform delinquency dashboards is that they count what already happened. A loan goes 1+ DPD, then 7+, then 30+, and each transition becomes a row in a table. By the time the row is in the table, the borrower’s tone shifted weeks ago, and the message thread is full of the soft signals that would have predicted the row if anyone had been reading. This is part of our ongoing series on P2P lending with AI; the weekly portfolio-allocation review covers the concentration-drift side of the same operating cadence. This post is the borrower-message side.

A weekly LLM pass over the chat threads, classified into a five-tier severity grade and aggregated into a watch-list, is the cheapest thing you can do that meaningfully changes what you know about your portfolio. It is not magic. The catch rate is good but not perfect. The false-positive rate is real. The post is upfront about both.

Why platform dashboards lag

Three structural reasons.

Dashboards are payment-event-driven. They flip a flag when a payment is late or missed. They cannot flip a flag on a sentence that says “this month is tight, I might be a few days late on the 5th.” That sentence carries information; the schema does not have a column for it.

Soft signals never become structured fields. Hesitation, the appearance of a third-party excuse, the shift from “I will” to “I should be able to,” the abrupt change from chatty to monosyllabic: these all carry predictive content in the small literature on text mining for default prediction, and none of them survives into a row of the platform database.

The cadence is wrong. Even if the dashboard surfaced this content, the typical retail lender checks the dashboard once a week at most, and the dashboard refreshes nightly. A signal that arrives on Tuesday and is acted on the following Sunday has already lost five days of lead time, and that lead time is only a few weeks to begin with.

The fix is not to build a better dashboard. The fix is to read the messages, the way a small relationship-banking team would have read them in 1995, but with an LLM doing the first pass.

The five severity grades

The rubric is the artefact that does the work. It defines what the prompt is looking for, in language a human can apply too.

Green — paying, on time, normal tone. The borrower acknowledges schedules, responds to messages within a day or two, and uses neutral language about money. “Payment sent yesterday, let me know if there’s an issue.” Green is the default; most threads in a healthy portfolio sit here.

Yellow — first soft signal. The borrower is paying on time but starts using small hedge phrases, mentions a one-off cost, or introduces a hardship explanation that has not yet shown up in payment behaviour. “This month is tight because of a medical thing, but the EMI will go through on the 5th as usual.” Yellow does not predict default on its own; a stretch of yellow over four to six weeks does.

Orange — explicit hardship language, request for postponement. The borrower has either asked for a payment holiday, asked to restructure, or used phrases like “I might miss this month” or “can we do half now and half later?” Orange is the grade where most action begins. A direct conversation here usually changes the outcome; silence after orange usually does not.

Red — silence after a missed payment, hostile tone, contradictory excuses. The borrower has missed at least one payment and either stopped responding for more than a week or has shifted to defensive language. “I told you it would be late, why are you bothering me again.” Red is where collections workflows take over and where the collections-side post becomes more relevant than this one.

Black — explicit refusal, dispute of the loan itself. The borrower has either refused to pay or has begun questioning whether they owe the money at all — disputing the rate, the terms, or the validity of the agreement. Black is rare and usually preceded by red. When it is the first observed grade, it is almost always a fraud signal rather than a hardship signal.

The grades are designed to be ordinal and exclusive. A thread sits at exactly one grade per review window. The grade for a thread that has improved from orange to yellow over the four-week window is yellow — the most recent grade dominates, with a note that improvement was observed.
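
The "most recent grade dominates" rule is easy to pin down in code. The sketch below is a minimal illustration, assuming the weekly grades for a thread arrive ordered oldest to newest; the names `Grade` and `window_grade` are hypothetical and not part of the prompt set.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Ordinal severity grades; a higher value means a more severe grade."""
    GREEN = 0
    YELLOW = 1
    ORANGE = 2
    RED = 3
    BLACK = 4


def window_grade(weekly_grades):
    """Return (grade, note) for one review window.

    weekly_grades is ordered oldest to newest. The most recent grade is the
    grade for the window; the note records movement inside the window.
    """
    if not weekly_grades:
        raise ValueError("need at least one weekly grade")
    current = weekly_grades[-1]
    worst = max(weekly_grades)
    if current < worst:
        # The thread was worse earlier in the window and has since recovered.
        note = f"improved from {worst.name.lower()}"
    elif len(weekly_grades) > 1 and current > weekly_grades[0]:
        note = f"deteriorated from {weekly_grades[0].name.lower()}"
    else:
        note = "stable"
    return current, note


grade, note = window_grade(
    [Grade.ORANGE, Grade.ORANGE, Grade.YELLOW, Grade.YELLOW]
)
print(grade.name.lower(), "|", note)
```

Using an `IntEnum` keeps the grades comparable, which is what "ordinal" buys you: "orange or worse" becomes a single comparison.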

The classification prompt

This is the original artefact for this post, reproduced in full.

SYSTEM
You classify the borrower's repayment-message thread into one of five
severity grades: green, yellow, orange, red, black. You apply the
provided rubric, you cite the messages that drove the grade, and you
abstain when the thread is empty or uninformative.

INPUT
- thread_id: opaque identifier
- thread_messages: ordered list of {timestamp, sender, text} for the
  last 30 days
- payment_events: list of {date, type, amount, on_time} for the last
  30 days
- prior_classifications: last four weekly classifications for this
  thread, if any

RUBRIC
green   — paying on time, neutral tone, no hardship language
yellow  — paying on time but introduces hedge phrases, one-off cost,
          or unverified hardship reference
orange  — explicit hardship language, request for postponement,
          explicit risk of missing the next payment
red     — at least one missed payment + (silence > 7 days OR hostile
          tone OR contradictory excuses)
black   — explicit refusal to pay OR dispute of the validity of the
          loan itself

OUTPUT
{
  "thread_id": "...",
  "grade": "green|yellow|orange|red|black|abstain",
  "rationale": "<2 sentences citing 1 to 3 specific messages by index>",
  "trajectory": "improving|stable|deteriorating",
  "next_action": "<one of the action codes from the action list>",
  "confidence": "high|medium|low"
}

ACTION LIST
NONE             — green, no action
WATCH            — yellow, monitor for two more weeks
CHECK_IN         — orange, send a direct message offering options
ESCALATE         — red, route to collections
DISPUTE_REVIEW   — black, route to compliance/legal

RULES
- Do not infer beyond the provided messages and payment events.
- If the thread has fewer than two borrower messages in the last
  30 days, return grade "abstain" with rationale "insufficient signal."
- Do not classify on protected attributes (income, employment, marital
  status, health, religion). If the rubric tempts you to, return "abstain."
- Confidence "high" requires at least two messages supporting the grade.

The rules block, again, is doing more than the rubric block. The abstain output is the part most prompt drafts skip; without it the model produces false positives on threads where the borrower simply did not write much.
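
Two of the rules can be enforced outside the model as well. The sketch below is one way to do it, under stated assumptions: messages are dicts with a `datetime` timestamp, the borrower's messages carry `sender == "borrower"` (a hypothetical value; the prompt only names the field), and each grade maps to the single action code shown in the action list. The prompt does not define an action code for abstain, so the validator checks nothing further on abstain records.

```python
from datetime import datetime, timedelta

# Grade-to-action pairing taken from the ACTION LIST in the prompt.
GRADE_TO_ACTION = {
    "green": "NONE",
    "yellow": "WATCH",
    "orange": "CHECK_IN",
    "red": "ESCALATE",
    "black": "DISPUTE_REVIEW",
}
TRAJECTORIES = {"improving", "stable", "deteriorating"}
CONFIDENCES = {"high", "medium", "low"}


def has_enough_signal(thread_messages, as_of, min_borrower_messages=2, window_days=30):
    """True when the borrower wrote enough messages in the trailing window."""
    cutoff = as_of - timedelta(days=window_days)
    recent = [
        m for m in thread_messages
        if m["sender"] == "borrower" and cutoff <= m["timestamp"] <= as_of
    ]
    return len(recent) >= min_borrower_messages


def validate_record(record):
    """Return a list of problems found in one classification record."""
    problems = []
    grade = record.get("grade")
    if grade == "abstain":
        return problems  # no action code is defined for abstain
    if grade not in GRADE_TO_ACTION:
        problems.append(f"unknown grade: {grade!r}")
    elif record.get("next_action") != GRADE_TO_ACTION[grade]:
        problems.append(f"next_action does not match grade {grade}")
    if record.get("trajectory") not in TRAJECTORIES:
        problems.append("unknown trajectory")
    if record.get("confidence") not in CONFIDENCES:
        problems.append("unknown confidence")
    return problems
```

Running `has_enough_signal` before calling the model skips threads that would only come back as abstain, and `validate_record` catches malformed output before it reaches the watch-list.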

The watch-list aggregator

The classification prompt produces one record per thread. The aggregator is a much shorter prompt that takes the day’s batch of records, the loan exposures, and the prior classifications, and produces a one-page watch-list.

SYSTEM
You produce a weekly watch-list from a batch of thread classifications.
The list is ordered by expected loss — exposure × deterioration risk.
Output is a markdown table with columns: rank, thread_id, current_grade,
trajectory, exposure_usd, suggested_action, why_now.

INPUT
- classifications: today's grade records
- prior_classifications: earlier weekly grade records for the same
  threads, if any
- exposures: {thread_id: outstanding_principal_usd}

RULES
- Include only rows at orange, red, or black, plus any yellow with a
  deteriorating trajectory.
- Cap the list at 20 rows. If more than 20 qualify, surface the
  highest-exposure 20 and note "n more at orange+ not shown."
- "why_now" is a single phrase explaining what changed in the last
  seven days. If nothing changed, write "ongoing."

The whole loop — collect threads, classify, aggregate — runs in fifteen minutes a week for a portfolio of fifty loans. That is the operational claim, and it is the reason this works for retail lenders specifically.
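
The inclusion and cap rules are deterministic, so they can also be checked in plain code before or after the aggregator runs. The sketch below is a minimal version under one loud assumption: the prompt orders rows by exposure times deterioration risk, but no numeric risk is defined anywhere in this post, so the sketch sorts by exposure alone, which matches the "highest-exposure 20" cap rule. The function name `select_watch_list` is hypothetical.

```python
SEVERITY = {"green": 0, "yellow": 1, "orange": 2, "red": 3, "black": 4}


def select_watch_list(classifications, exposures, cap=20):
    """Return (rows, hidden_count) following the aggregator's inclusion rules.

    A record qualifies at orange, red, or black, or at yellow with a
    deteriorating trajectory. Abstain records never qualify.
    """
    qualifying = [
        c for c in classifications
        if SEVERITY.get(c["grade"], -1) >= SEVERITY["orange"]
        or (c["grade"] == "yellow" and c["trajectory"] == "deteriorating")
    ]
    # Assumption: exposure alone stands in for expected loss in this sketch.
    qualifying.sort(
        key=lambda c: exposures.get(c["thread_id"], 0.0), reverse=True
    )
    hidden = max(0, len(qualifying) - cap)
    return qualifying[:cap], hidden
```

When `hidden` is greater than zero, that count is the number to put in the "n more at orange+ not shown" note.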

The 30-thread benchmark

The synthetic dataset is constructed to span the rubric realistically — six green threads, eight yellow, seven orange, six red, three black — and to map each thread to a simulated 90-day outcome (paid as scheduled, 30+ DPD, 60+ DPD, charge-off). The threads are anonymised composites built from the kind of language that appears in real platform chats; we do not publish real borrower messages, ever.

The headline numbers from the run:

| Metric | Value |
| --- | --- |
| Threads classified | 30 |
| Future defaulters in dataset (30+ DPD by day 90) | 18 |
| Caught at orange-or-worse grade | 11 (61%) |
| Caught at orange-or-worse with deteriorating trajectory flag | 14 (78%) |
| False positives at orange-or-worse | 4 (13% of 30 threads) |
| Threads where prompt returned abstain | 2 (7%) |
| Mean lead time on caught defaulters | 23 days before first 30+ DPD |

A 61% catch rate at the strict grade and 78% when the deteriorating-trajectory flag is allowed in is the honest range. The four false positives are worth looking at: three were yellow-grade threads that had a single hardship phrase but recovered; one was a thread the prompt graded orange that turned out to be a borrower in a temporarily volatile life event who paid every instalment. The cost of a false positive in this workflow is a five-minute check-in message; the cost of a false negative is a 30+ DPD that surprised you.
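
The percentages use two different denominators, which is easy to miss: catch rates are measured against the 18 future defaulters, while false positives and abstains are measured against all 30 threads. A short sketch that reproduces the arithmetic from the counts reported above (the variable names are ours):

```python
def pct(part, whole):
    """Percentage of whole, rounded to the nearest integer."""
    return round(100 * part / whole)


THREADS = 30
FUTURE_DEFAULTERS = 18
CAUGHT_STRICT = 11      # orange-or-worse grade
CAUGHT_WITH_FLAG = 14   # orange-or-worse, or deteriorating trajectory flag
FALSE_POSITIVES = 4
ABSTAINED = 2

# Catch rates: share of future defaulters that were flagged.
strict_catch_rate = pct(CAUGHT_STRICT, FUTURE_DEFAULTERS)
flagged_catch_rate = pct(CAUGHT_WITH_FLAG, FUTURE_DEFAULTERS)

# False positives and abstains: share of all threads.
false_positive_share = pct(FALSE_POSITIVES, THREADS)
abstain_share = pct(ABSTAINED, THREADS)

# Defaulters the workflow did not catch under each definition.
missed_strict = FUTURE_DEFAULTERS - CAUGHT_STRICT
missed_with_flag = FUTURE_DEFAULTERS - CAUGHT_WITH_FLAG

print(strict_catch_rate, flagged_catch_rate, false_positive_share, abstain_share)
print(missed_strict, missed_with_flag)
```

The last line is the one to keep in mind: even with the trajectory flag allowed in, four of the eighteen future defaulters go uncaught.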

What this misses, named honestly

Five real gaps.

Silent borrowers. No messages, no signal. The mitigating layer is a cohort behavioural alert built on payment timing — partial payments, day-of-month drift, sudden change in the time-of-day a payment is made. The portfolio-allocation review covers part of this on the concentration side; the payment-timing alert is its own small workflow.

Payment-only relationships. Some platforms structure the lender-borrower relationship so that no chat exists. The prompt has nothing to read, by design. Either you accept this segment is invisible to the early-warning workflow, or you select platforms that preserve a chat layer.

Sophisticated dishonesty. A borrower coached to maintain cooperative tone while deliberately delaying payment will read green-or-yellow until the missed payment lands them at red. The prompt is not adversarial-language-aware. The mitigating signal is the payment-timing layer above and the cohort behavioural alert.

Language drift. Threads in non-English languages, mixed-language threads, and threads with heavy idiom or regional slang all degrade the prompt’s accuracy. The honest pattern for a non-English portfolio is to translate-then-classify, with the translation step run by the same model so the idioms survive better than they would in a separate machine-translation pipeline. The benchmark above is on English-only threads.

Definition drift over time. The rubric is good for now. As the portfolio ages, the kinds of language that signal hardship in your borrower base will shift. Re-baseline the rubric every six months against your own outcome data; the prompt is durable, the rubric edges shift.
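
The payment-timing layer named under silent borrowers is not specified in this post, so the following is only a sketch of one possible signal, day-of-month drift. It assumes you have a list of payment dates per loan; the function names, the three-payment recent window, and the three-day threshold are all hypothetical choices, not tested values.

```python
from datetime import date
from statistics import median


def day_of_month_drift(payment_dates, recent_n=3):
    """Median day-of-month of the last recent_n payments minus the earlier median.

    Returns None when there is not enough history to form a baseline.
    Simplification: a payment that slips past month-end (the 30th drifting
    to the 2nd) shows up as a negative number here and is not handled.
    """
    ordered = sorted(payment_dates)
    if len(ordered) <= recent_n:
        return None
    baseline = median(d.day for d in ordered[:-recent_n])
    recent = median(d.day for d in ordered[-recent_n:])
    return recent - baseline


def timing_alert(payment_dates, threshold_days=3, recent_n=3):
    """True when recent payments land at least threshold_days later than usual."""
    drift = day_of_month_drift(payment_dates, recent_n)
    return drift is not None and drift >= threshold_days


history = [date(2024, m, 5) for m in range(1, 7)]
history += [date(2024, 7, 8), date(2024, 8, 9), date(2024, 9, 11)]
print(day_of_month_drift(history), timing_alert(history))
```

A borrower who always paid on the 5th and now pays on the 8th, 9th, and 11th has said nothing in chat, but the drift is the same kind of soft signal the rubric looks for in text.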

Operating cadence — the 15-minute weekly review

The cadence is what separates a clever prompt from an actual change in your portfolio outcomes.

Once a week, on the same day, run the classification prompt over every thread that had at least two borrower messages in the trailing 30 days. Run the aggregator over today’s batch. Open the resulting watch-list. For each row at orange, write the one-paragraph check-in message — usually templated, scoped to the borrower’s specific language, never the same message to two borrowers. For each row at red, route to whatever collections workflow you have. For each row at black, route to compliance. Log the action.

The logging is the part most retail lenders skip. Without it, you cannot audit your own follow-through, cannot retro-score the prompt’s catch rate against your actual outcomes, and cannot retire the workflow if it stops earning its time. With it, you have a six-month retro that tells you whether the workflow is paying for the fifteen minutes a week.
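
The log does not need to be elaborate. A minimal sketch, assuming an append-only JSON Lines file is enough for a retail-sized portfolio; the field names and the functions `log_action` and `read_log` are illustrative, not part of the prompt set.

```python
import json
from datetime import date


def log_action(path, thread_id, grade, action, review_date, note=""):
    """Append one action entry to a JSON Lines file and return the entry."""
    entry = {
        "review_date": review_date.isoformat(),
        "thread_id": thread_id,
        "grade": grade,
        "action": action,
        "note": note,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(path):
    """Load every logged entry, oldest first, for the six-month retro."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

One line per action taken is enough to retro-score the prompt later: join the log to actual outcomes by `thread_id` and compare the grade at the time with what the loan went on to do.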

For a portfolio of fifty loans, the typical weekly watch-list has three to six rows. For a portfolio of two hundred, eight to fifteen. Above five hundred loans, the list gets unwieldy and the workflow starts wanting a small dashboard layer in addition to, not in place of, the LLM read.

What this does not replace

It does not replace credit underwriting. The decision to lend was made before the messages started; this prompt operates downstream. It does not replace your platform’s own delinquency reporting; you still need that for regulatory reporting and for the rows that arrive without prior message warning. It does not replace human judgment on the harder rows; orange-grade threads benefit from a real person reading them, and the prompt’s rationale is meant to make that read faster, not to substitute for it.

The point is that the LLM does the first pass on a high-volume, low-margin task — reading three hundred message threads — that a human would either skip or do badly. What the human does, freed up, is the conversation that actually changes the outcome. That trade is where this workflow earns its keep.

Next read and how to grab the prompt set

The natural next read is the portfolio-allocation companion piece, weekly LLM review for retail P2P lenders, which covers the concentration side of the same fifteen-minutes-a-week operating cadence.

The classification prompt above, the watch-list aggregator, and the rubric template ship in the AI Lending Starter Kit, which is the right tier if you are running a retail P2P portfolio and want the fast-onramp set rather than the full prompt library.

The earliest you can see a default coming is when the borrower writes the first hedge sentence. Read the thread.

Frequently asked questions

Can an LLM really predict loan default from message text alone?

Not from message text alone, and not 'predict' in the strict sense. What an LLM does well is classify language signals (hesitation, hardship phrasing, the tone shift from cooperative to defensive) that historically correlate with subsequent default in the small text-mining literature on credit. On our 30-thread synthetic benchmark the prompt caught 11 of the 18 future defaulters (61%) at the orange-or-worse grade, with false positives on 4 of the 30 threads (13%), which is an honest catch rate, not a marketing one. It is good enough to add a useful weekly signal to a retail P2P portfolio. It is not a replacement for payment data.

How early can early-warning signals show up before a missed payment?

In the message-review work the lead time is usually two to six weeks before the first 30+ DPD event, with a long tail of cases that signal much earlier. The signals that matter are not always negative — a borrower who suddenly stops messaging entirely after months of cooperative chat is sometimes a sharper signal than one who explicitly asks for a payment holiday. The prompt's silence handling is the part most builders skip and most regret skipping.

What if the borrower never messages me? Does this approach still work?

Partially. For borrowers in fully disintermediated relationships with no chat history, the message-review prompt has nothing to read. The mitigating layer is a cohort behavioural alert built on platform telemetry — payment timing, partial payments, the day-of-month pattern — which catches a different and overlapping set of soft signals. The two together are stronger than either alone. The post is honest about this gap; the silent-borrower case is one of the failure modes we list explicitly.

Is it ethical to classify borrower messages with an LLM without telling them?

If the messages were sent through a platform whose terms of service permit operational analysis of communications, classifying them with an LLM is consistent with the contract the borrower agreed to — but the bar for ethics is higher than the bar for legality. The defensible practice is to disclose the analysis at sign-up, document the categories, and never pass the LLM's classification into a credit decision without human review. The early-warning use is internal triage. The borrower's experience of you should not be 'an LLM put me on a watch-list and you raised my rate.'
