Evolve a basic prompt into a production-grade, reusable template for generating merchant risk assessments – through iterative prompt engineering.
⏱ 40 minutes
Exercise Overview
AnyCompany's risk teams assess thousands of merchants across Southeast Asia. Each assessment requires analyzing transaction data, chargeback rates, complaint history, and compliance status – then writing a clear narrative that non-technical stakeholders can act on.
Currently, analysts spend 30-45 minutes per merchant writing these manually. In this exercise, you'll build a prompt template that produces consistent, high-quality assessments in seconds.
⚙️ Setup: How to run this exercise
Use Kiro chat for this exercise. You'll paste prompts and observe how the output improves with each technique.
Keeping steps independent:
Start a New Session for each step (Steps 1–5, 7, 8)
Step 6 continues in the same session as Step 5
Each prompt includes an isolation instruction – a steering file is pre-configured to tell Kiro not to read previous step files, ensuring each technique is evaluated independently
Each step saves to a unique filename (step1-..., step2-...) so you can compare outputs side-by-side at the end
📦 First-time Kiro setup (do this once before starting)
Download and extract this zip into your workspace root folder. It adds steering files that configure Kiro for the exercises.
Extracting creates .kiro/steering/workshop-rules.md (AnyCompany context) and .kiro/steering/exercise-isolation.md (keeps exercise steps independent).
🎯 Exercise Approach
In this exercise, you'll iteratively refine a prompt through 6 steps – each applying a specific technique from the Advanced Prompting curriculum. At the end, you'll extract a reusable prompt template with variables, then test it against a completely different merchant profile to validate that it works at scale. The final deliverable is a production-ready prompt template that your team can deploy across thousands of merchant assessments.
Step 1: Establish the Zero-Shot Baseline
📘 Technique: Zero-Shot Prompting
Zero-shot means giving the model a task with no examples, no role, and minimal instruction. This establishes a baseline – you'll see what the model produces with almost no guidance, then improve from there.
In Kiro, start a New Session and paste:
PROMPT – Step 1: Zero-Shot
Assess the risk of this merchant based on the data below. Save the output as "step1-zero-shot.md" in a "lab6-risk-assessment" folder.
[PASTE MERCHANT DATA HERE]
MERCHANT DATA
🔍 Observe the output: The response is likely generic, unstructured, and missing key analysis. It may hallucinate details not in the data. Note what's missing – this is your improvement baseline.
💬 Discussion point: What's wrong with this output? Common issues:
No clear structure – hard to scan quickly
May include assumptions not supported by the data
No risk rating or actionable recommendation
Inconsistent depth – some areas over-analyzed, others ignored
Would look different every time you run it – not repeatable
Step 2: Add Role & Persona
📘 Technique: Role & Persona Prompting (Module 3)
Assigning a specific role shapes the model's vocabulary, reasoning depth, and what it considers important. A "risk analyst" will focus on different signals than a "customer support agent" looking at the same data.
Start a New Session in Kiro. This time, add a persona before the task:
PROMPT – Step 2: Persona
You are a Senior Merchant Risk Analyst at a Southeast Asian fintech company. You have 8 years of experience assessing payment merchants for fraud risk, operational risk, and compliance risk. You are known for being thorough, data-driven, and fair – you always distinguish between genuine business growth and suspicious patterns.
Assess the risk of this merchant based on the data below. Save the output as "step2-persona.md" in the "lab6-risk-assessment" folder.
[PASTE MERCHANT DATA HERE]
MERCHANT DATA
💬 Why does persona work? The model has been trained on millions of documents written by risk analysts. When you say "You are a Senior Merchant Risk Analyst," you're activating that specific knowledge cluster – the model draws on risk assessment frameworks, industry terminology, and analytical patterns it learned from real analyst writing.
Step 3: Add Few-Shot Examples
📘 Technique: Few-Shot Prompting (Module 1)
Providing 1-2 examples of the desired output teaches the model your exact format, tone, and level of detail. This is the single most effective technique for getting consistent, repeatable outputs.
Start a New Session in Kiro. Now include two short example assessments before the actual task:
PROMPT – Step 3: Few-Shot
You are a Senior Merchant Risk Analyst at a Southeast Asian fintech company. You have 8 years of experience assessing payment merchants for fraud risk, operational risk, and compliance risk.
Here are two examples of how merchant risk assessments should be written:
---
EXAMPLE 1 (LOW RISK):
Merchant: FreshDaily Grocers (MRC-1102) | Market: Malaysia | Category: Grocery
Assessment: FreshDaily Grocers demonstrates a healthy, stable transaction profile. Monthly volumes have grown steadily at 8-10% month-over-month, consistent with organic business expansion. Chargeback rate of 0.4% is well within the industry benchmark of 0.5-1.0%. Customer complaints are minimal (2-3/month) and resolved within SLA. KYC documentation is current and no compliance flags exist.
Risk Rating: 🟢 GREEN – No action required. Next review in 6 months.
---
EXAMPLE 2 (HIGH RISK):
Merchant: LuxeDeals Online (MRC-3391) | Market: Indonesia | Category: E-Commerce
Assessment: LuxeDeals Online presents significant risk indicators requiring immediate attention. Transaction volume spiked 400% in one month with no corresponding business explanation. Chargeback rate has reached 6.2%, far exceeding the 1.0% industry benchmark. 70% of chargebacks cite "unauthorized transaction," suggesting potential card-testing or account takeover fraud. The merchant has not responded to two compliance review requests.
Risk Rating: 🔴 RED – Recommend immediate PayLater suspension and enhanced monitoring. Escalate to Fraud Investigation team.
---
Now assess this merchant using the same format and depth. Save the output as "step3-few-shot.md" in the "lab6-risk-assessment" folder.
[PASTE MERCHANT DATA HERE]
MERCHANT DATA
🔍 Compare with Step 2: The output should now match the format of your examples – same structure, similar length, consistent tone. The model learned your "house style" from just 2 examples.
💬 Key insight: Few-shot examples are like showing a new analyst "here's how we write these reports." The model mimics the pattern. Notice how 2 examples (one GREEN, one RED) are enough – the model interpolates for AMBER cases on its own.
Step 4: Add Structured Output Requirements
📘 Technique: Structured Output (Module 4)
Defining exact sections and format ensures every assessment covers the same areas. This makes outputs comparable across merchants and scannable by busy stakeholders.
Start a New Session in Kiro. Now add explicit section requirements:
PROMPT – Step 4: Structured Output
You are a Senior Merchant Risk Analyst at a Southeast Asian fintech company with 8 years of experience in payment merchant risk assessment.
Produce a Merchant Risk Assessment Report with EXACTLY these sections:
1. MERCHANT SUMMARY
- One paragraph: who they are, what they do, how long on platform
2. TRANSACTION ANALYSIS
- Volume and GMV trends (highlight any anomalies)
- Average transaction size analysis
- PayLater adoption trends and risk implications
3. CHARGEBACK & DISPUTE ANALYSIS
- Current rate vs. industry benchmark
- Trend direction (improving/worsening)
- Root cause breakdown
- Dispute resolution effectiveness
4. CUSTOMER COMPLAINT ANALYSIS
- Volume trend and top categories
- SLA compliance
- Correlation with chargeback patterns
5. RISK FACTORS
- List each identified risk factor
- For each: severity (HIGH/MEDIUM/LOW) and supporting data point
6. MITIGATING FACTORS
- Any legitimate business explanations for the patterns observed
7. RISK RATING
- 🟢 GREEN (low risk) | 🟡 AMBER (elevated, monitor) | 🔴 RED (high, action required)
- One-sentence justification
8. RECOMMENDED ACTIONS
- Numbered list of specific, actionable next steps with owners and timelines
Save the output as "step4-structured.md" in the "lab6-risk-assessment" folder.
[PASTE MERCHANT DATA HERE]
MERCHANT DATA
🔍 Compare with Step 3: Every assessment now has the same 8 sections. You can compare Merchant A vs Merchant B side by side. Stakeholders know exactly where to look for the information they need.
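A side benefit of a fixed 8-section structure is that outputs become machine-checkable. As an illustrative sketch (not part of the exercise itself – the file path is just an example), a few lines of Python could verify that a saved report contains every required section:

```python
REQUIRED_SECTIONS = [
    "MERCHANT SUMMARY", "TRANSACTION ANALYSIS",
    "CHARGEBACK & DISPUTE ANALYSIS", "CUSTOMER COMPLAINT ANALYSIS",
    "RISK FACTORS", "MITIGATING FACTORS",
    "RISK RATING", "RECOMMENDED ACTIONS",
]

def missing_sections(report_text: str) -> list[str]:
    """Return the required section headings absent from a report."""
    upper = report_text.upper()
    return [s for s in REQUIRED_SECTIONS if s not in upper]

# Example usage (path is illustrative):
# text = open("lab6-risk-assessment/step4-structured.md").read()
# print(missing_sections(text))
```

An empty list means the structure requirement held; anything else tells you exactly which section the model skipped.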
Step 5: Add Grounding & Self-Critique
📘 Technique: Grounding & Self-Critique
Grounding instructions prevent the model from hallucinating facts not in the data. Self-critique makes the model review its own output for errors, bias, or unsupported claims – like having a second analyst review the report.
Start a New Session in Kiro. This is the last isolated step – after this, Step 6 continues in the same session. Add grounding rules and a self-review step:
PROMPT – Step 5: Grounding + Self-Critique
You are a Senior Merchant Risk Analyst at a Southeast Asian fintech company with 8 years of experience in payment merchant risk assessment.
CRITICAL GROUNDING RULES:
- Base your assessment ONLY on the data provided below. Do not infer, assume, or add information not present in the data.
- Every claim must reference a specific data point. Example: "Chargeback rate increased from 0.3% to 4.1% over 6 months" – not "chargebacks are high."
- If data is insufficient to assess an area, explicitly state: "[INSUFFICIENT DATA: need X to assess Y]"
- Do not speculate on intent or motivation. State patterns, not judgments about the merchant's character.
- Distinguish between correlation and causation. If two trends coincide, note the correlation but do not claim one caused the other.
Produce a Merchant Risk Assessment Report with these sections:
1. MERCHANT SUMMARY
2. TRANSACTION ANALYSIS
3. CHARGEBACK & DISPUTE ANALYSIS
4. CUSTOMER COMPLAINT ANALYSIS
5. RISK FACTORS (each with severity and supporting data point)
6. MITIGATING FACTORS
7. RISK RATING (GREEN / AMBER / RED with justification)
8. RECOMMENDED ACTIONS (numbered, with owners and timelines)
After completing the report, perform a SELF-REVIEW:
- Re-read your assessment and check: Is every claim supported by a specific data point from the input?
- Are there any assumptions or inferences that go beyond the data?
- Is the risk rating consistent with the evidence presented?
- Would a different analyst reading the same data reach the same conclusion?
If you find any issues, correct them before presenting the final report.
Save the output as "step5-grounded.md" in the "lab6-risk-assessment" folder.
[PASTE MERCHANT DATA HERE]
MERCHANT DATA
🔍 Compare with Step 4: The output should now cite specific numbers for every claim. The self-review section catches errors the model might have made. This is production-safe – auditable and defensible.
💬 Why self-critique matters for risk assessments: In regulated environments, every assessment may be audited. A report that says "chargebacks are concerning" is useless. A report that says "chargeback rate increased from 0.3% to 4.1% over 6 months, exceeding the 1.0% industry benchmark by 4x" is auditable. The grounding rules + self-critique ensure this level of rigor automatically.
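Grounding can also be spot-checked mechanically. The rough heuristic below (Python; the vague-term list is illustrative and will miss plenty) flags sentences that make a vague quantified claim without citing a number – the exact failure mode the grounding rules forbid:

```python
import re

# Illustrative list of vague quantifiers; extend for your domain
VAGUE_TERMS = ["high", "low", "significant", "concerning", "spiked", "many"]

def vague_sentences(report_text: str) -> list[str]:
    """Return sentences that use a vague quantifier but cite no number."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        has_vague = any(re.search(rf"\b{t}\b", sentence, re.I) for t in VAGUE_TERMS)
        has_number = re.search(r"\d", sentence)
        if has_vague and not has_number:
            flagged.append(sentence.strip())
    return flagged
```

This is a blunt lint, not a substitute for the model's self-review – but it catches the "chargebacks are high" pattern instantly.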
Step 6: Extract the Reusable Template
📘 Technique: Meta-Prompting (Module 5.2)
Meta-prompting asks the AI to analyze your conversation and produce a reusable artifact. Instead of manually extracting the template, you ask the model to do it – turning your iterative work into a production-ready template with variables. The quality of the template depends on how well you instruct the extraction.
In the same session from Step 5 (do not start a new one), paste this follow-up:
PROMPT – Step 6: Template Extraction
Now I want to turn this into a reusable template that any analyst on my team can use for ANY merchant – not just QuickMart Express.
Create a Markdown file called "merchant-risk-assessment-prompt-template.md" and save it in a "prompt-templates" folder.
The template should:
- Be completely self-contained – a new team member should be able to use it without additional training
- Use {{variables}} for all merchant-specific data (name, ID, market, transaction data, etc.)
- Include the persona, grounding rules, output structure, and self-review from our refined prompt
- Have clear usage instructions
Think about what makes a template truly production-ready and reusable at scale.
✅ Deliverable: Kiro will create a prompt-templates/merchant-risk-assessment-prompt-template.md file. This is your reusable artifact – open it and review the quality.
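The {{variables}} convention is what makes the template programmable: the same file can drive thousands of assessments once placeholders are filled mechanically. A minimal sketch (Python; the merchant values and variable names are examples, not a prescribed API):

```python
import re

def render_template(template: str, values: dict[str, str]) -> str:
    """Replace {{variable}} placeholders; raise if any are left unfilled."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for {{{{{name}}}}}")
        return values[name]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

# Hypothetical fill-in; real runs would load the full merchant data block
prompt = render_template(
    "Assess merchant {{merchant_name}} ({{merchant_id}}) in {{market}}.",
    {"merchant_name": "DemoMart", "merchant_id": "MRC-0000",
     "market": "Singapore"},
)
```

Failing loudly on a missing variable is deliberate – a silently half-filled prompt is exactly the kind of defect that produces ungrounded assessments.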
📤 Submit Your Template
Submit your template for automated scoring. How production-ready is it?
Resubmitting with the same name replaces your previous entry.
💡 How scoring works
Your template is sent to Amazon Bedrock, which evaluates how production-ready it is – structure, reusability, guardrails, quality controls, and domain relevance. This is the same LLM-as-Judge technique from the slides, now applied to your work. Scores appear on the leaderboard below.
🏆 Template Leaderboard
No submissions yet. Be the first!
🔒 Instructor Version – How would an expert extract this template?
After submitting, enter the passkey to reveal the production-grade extraction prompt. Compare it with what you used – notice the difference in specificity.
🔒 Instructor Version – Production-Grade Extraction Prompt
Compare this with the simplified prompt above. Notice how much more specific it is about variable definitions, data format examples, customization notes, and modifiable sections. This level of detail is what makes a template truly self-contained and usable by someone who wasn't in the room when it was built.
Excellent work. Now I want to turn this into a reusable template that any analyst on my team can use for ANY merchant – not just QuickMart Express.
Create a Markdown file called "merchant-risk-assessment-prompt-template.md" and save it in a "prompt-templates" folder. The file should contain a complete, self-contained prompt template with the following structure:
## HEADER
- Title: "Merchant Risk Assessment – Prompt Template"
- Version number, last updated date, purpose statement
- Brief usage instruction: copy from ---START PROMPT--- to ---END PROMPT---, replace variables, paste into LLM
## TEMPLATE USAGE GUIDE
A table with ALL variables used in the template. Columns:
| Variable | Description | Expected Format | Example |
Include every variable: {{merchant_name}}, {{merchant_id}}, {{market}}, {{merchant_category}}, {{onboarded_date}}, {{payment_channels}}, {{beneficial_owner}}, {{related_merchants}}, {{kyc_status}}, {{business_registration_status}}, {{analysis_period}}, {{transaction_data}}, {{chargeback_data}}, {{industry_chargeback_benchmark}}, {{complaint_data}}, {{complaint_sla}}, {{compliance_status}}, {{additional_context}}, {{currency}}
## DATA FORMAT EXAMPLES
Show the exact format expected for each complex variable (transaction_data, chargeback_data, complaint_data, compliance_status, additional_context) with realistic sample data.
## PREREQUISITES
Numbered list of what to prepare before using the template.
## ---START PROMPT--- / ---END PROMPT---
The actual prompt template containing:
- The persona we refined
- The grounding rules from Step 5
- The 8-section output format from Step 4
- Risk rating definitions (GREEN/AMBER/RED with specific criteria)
- The self-review instructions from Step 5
- A MERCHANT DATA INPUT block at the end with all {{variables}} organized by category
## CUSTOMIZATION NOTES (after ---END PROMPT---)
Three subsections:
1. **Market-Specific Adjustments** – Table of SEA markets (Singapore/MAS, Malaysia/BNM, Indonesia/OJK, Thailand/BOT, Vietnam/SBV, Philippines/BSP) with key regulatory considerations.
2. **Adjusting Risk Thresholds by Merchant Category** – Table of categories (Convenience, E-commerce, F&B, Digital Goods, Travel, Subscription) with typical chargeback benchmarks.
3. **Modifiable Sections** – Table showing which parts of the template can/cannot be modified and why. Grounding rules and self-review should be marked as "Do not modify."
The template must be self-contained โ a new team member should be able to use it without any additional context or training.
💡 Teaching point: The simplified prompt produces a decent template. The instructor version produces a production-grade one. The difference? Specificity – naming every variable, defining data formats, including customization notes. This is the gap between "good enough for a demo" and "ready for your team to use daily."
Step 7: Validate the Template with New Data
📘 Technique: Production Testing
A template is only useful if it works beyond the data it was built with. Test it against a completely different merchant – different market, category, and risk profile.
How to use your template
Open the file prompt-templates/merchant-risk-assessment-prompt-template.md that Kiro created in Step 6
Find the section between ---START PROMPT--- and ---END PROMPT---
Copy that entire block – this is your reusable prompt
Start a New Session in Kiro
Paste the prompt, then replace the entire merchant data section with the test data below – no need to replace variables one by one, just swap the whole data block
Add this instruction at the end: "Save the output as run1.md in the lab6-risk-assessment folder."
Test data โ a low-risk Indonesian merchant
TEST DATA – WarungMakan Digital (Indonesia, healthy growth)
Start a New Session for each run. Use the same template + test data each time, but change the output filename:
Run
Add this to the end of your prompt
Run 1
Save the output as "run1.md" in the "lab6-risk-assessment" folder.
Run 2
Save the output as "run2.md" in the "lab6-risk-assessment" folder.
Run 3
Save the output as "run3.md" in the "lab6-risk-assessment" folder.
🔍 Validate the outputs: This is a deliberately different profile – a low-risk Indonesian F&B merchant. Check across all 3 runs:
Are all 8 sections populated correctly?
Does it handle IDR currency (not just SGD)?
Does it correctly identify this as a low-risk (GREEN) merchant?
Are the recommended actions appropriate for a healthy F&B merchant?
Are the 3 runs consistent in structure and rating?
💬 If the template doesn't work well: That's valuable feedback. Go back to Step 6 and refine – maybe it needs better currency handling or market-aware regulatory references. This iterate-and-test cycle is exactly how production templates get hardened.
Step 8: Evaluate Your Template
Now use LLM-as-Judge to score each of your 3 runs from Step 7. You have run1.md, run2.md, and run3.md in your lab6-risk-assessment folder.
Score each run
Start a New Session in Kiro. Paste the rubric below and tell Kiro which file to evaluate:
PROMPT – Evaluate Run 1
You are a STRICT expert evaluator for merchant risk assessments. You have high standards and rarely give perfect scores. A score of 5 should be genuinely exceptional – most good outputs score 3-4.
Read the file "lab6-risk-assessment/run1.md" and score on 4 criteria (1-5 each):
1. **Completeness** (1-5): Are all 8 required sections present with SUBSTANTIVE content?
- 1 = Missing 3+ sections
- 2 = Missing 1-2 sections or several are just headers with one sentence
- 3 = All sections present but 2-3 are thin (under 2 sentences)
- 4 = All sections present with good detail, minor gaps
- 5 = RARE – every section has exceptional depth, derived insights, and cross-references between sections
2. **Data Grounding** (1-5): Does EVERY factual claim cite a SPECIFIC number from the input?
- 1 = Most claims are generic ("chargebacks are high")
- 2 = Some numbers cited but many vague claims remain
- 3 = Most claims cite data but 2-3 are generic or rounded
- 4 = Nearly all claims cite specific data, 1 minor gap
- 5 = RARE – zero vague statements, every single claim traces to an exact metric, includes calculated derived metrics (e.g., growth rates, ratios)
3. **Actionability** (1-5): Are recommendations SPECIFIC with named owners AND timelines?
- 1 = No actions or just "monitor the situation"
- 2 = Generic actions like "review the merchant" without specifics
- 3 = Some specific actions but missing owners OR timelines
- 4 = Most actions have owners and timelines, 1-2 are vague
- 5 = RARE – every action has: what to do, who does it, by when, and what triggers escalation
4. **Analytical Depth** (1-5): Does the assessment show REASONING beyond restating data?
- 1 = Just restates the input data in paragraph form
- 2 = Minimal interpretation, mostly data summary
- 3 = Some analysis (e.g., compares to benchmarks) but surface-level
- 4 = Good analysis with trend interpretation and risk implications
- 5 = RARE – identifies root causes, connects patterns across sections, calculates derived metrics, distinguishes correlation from causation
IMPORTANT: Be honest. Most good AI outputs score 14-17/20. A score of 20/20 should almost never be given. If you find yourself giving all 5s, you are being too lenient – re-read the "RARE" criteria.
Return your evaluation as JSON and save it as "eval-run1.md" in the "lab6-risk-assessment" folder:
{"completeness": X, "grounding": X, "actionability": X, "depth": X, "total": X, "strengths": "one sentence", "weaknesses": "one sentence – there is ALWAYS something to improve"}
💡 For runs 2 and 3: Start a New Session each time. Change run1.md → run2.md → run3.md and eval-run1.md → eval-run2.md → eval-run3.md in the prompt.
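Requesting JSON pays off here: once the three eval files exist, scores can be aggregated without re-reading any report. A small sketch (Python; it assumes each eval file contains just the JSON object the rubric asked for):

```python
import json
from statistics import mean

def summarize_evals(eval_jsons: list[str]) -> dict[str, float]:
    """Average each rubric criterion across a list of JSON eval strings."""
    evals = [json.loads(text) for text in eval_jsons]
    criteria = ["completeness", "grounding", "actionability", "depth", "total"]
    return {c: mean(e[c] for e in evals) for c in criteria}

# Example usage (paths illustrative):
# texts = [open(f"lab6-risk-assessment/eval-run{i}.md").read() for i in (1, 2, 3)]
# print(summarize_evals(texts))
```

The per-criterion averages drop straight into the score table below, and the same function works unchanged when you later compare template versions.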
Record your scores
| Run | Completeness | Grounding | Actionability | Depth | Total /20 |
| --- | --- | --- | --- | --- | --- |
| Run 1 | | | | | |
| Run 2 | | | | | |
| Run 3 | | | | | |
| Average | | | | | |
✅ What to look for:
All 3 runs score 17-19/20: Your template is production-ready and consistent – this is the ideal outcome
All 3 runs score the same: Excellent consistency – the template produces reliable results every time. This is what you want for production use.
Average total 14-16: Good but has room for improvement – check which criterion scored lowest and refine that part of the template
Scores vary by 3+ between runs: The template needs tighter constraints – add more structure, examples, or decision rules
Want to see real score differences? Try the bonus challenge below – switch to a different model and compare
💡 This is LLM-as-Judge – you're using one AI to evaluate another AI's output. This technique scales: you can evaluate 100 outputs in minutes instead of hours. The JSON format makes it easy to track scores over time and compare template versions.
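The "100 outputs in minutes" claim is structurally just a loop over a judge function. A hedged sketch (Python; `stub_judge` is a placeholder – in practice it would send the output plus the rubric to your model of choice and return the rubric JSON):

```python
import json
from typing import Callable

def batch_evaluate(outputs: dict[str, str],
                   judge: Callable[[str], str]) -> dict[str, dict]:
    """Run a judge over many outputs; judge returns the rubric JSON string."""
    return {name: json.loads(judge(text)) for name, text in outputs.items()}

def stub_judge(text: str) -> str:
    # Placeholder judge: rewards outputs that cite percentages.
    # A real judge would call an LLM with the rubric from Step 8.
    score = 4 if "%" in text else 2
    return json.dumps({"grounding": score, "total": score})

scores = batch_evaluate({"run1": "rate is 4.1%", "run2": "rate is high"},
                        stub_judge)
```

Because the judge is a plain function parameter, swapping the stub for a real model call changes nothing else in the pipeline.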
🎯 Bonus challenge (if time permits): The 3 runs above test consistency (same model, same template – does it produce reliable results?). For a different test, try running your template with a different model. In Kiro, you can switch models in the model selector. Try generating a run-nova.md or run-haiku.md and evaluate it with the same rubric. You'll likely see score differences – different models have different strengths. A cheaper model might score 14/20 where the default scores 17/20 – and that might be good enough for GREEN-rated merchants.
Reflection & Discussion
What You Built
Through 6 iterative steps, you evolved a 10-word prompt into a production-grade template that:
Produces consistent, structured assessments every time
Cites specific data points (auditable and defensible)
Includes self-review to catch errors before human review
Works for any merchant – just swap the variables
Can be customized per market and merchant category
Technique Recap
| Step | Technique | What It Fixed |
| --- | --- | --- |
| 1. Zero-Shot | Baseline | Established what "bad" looks like |
| 2. Persona | Role assignment | Better vocabulary, deeper analysis |
| 3. Few-Shot | Example-driven | Consistent format and tone |
| 4. Structured | Section requirements | Comparable, scannable outputs |
| 5. Grounding | Grounding rules + Self-Critique | No hallucination, auditable claims |
| 6. Meta-Prompt | Template extraction | Reusable at scale |
| 7. Validation | Test with new data | Confirmed template generalizes |
| 8. Evaluation | Rubric scoring | Measurable quality, consistency proof |
💡 Key takeaway: The prompt IS the product. In many business workflows, you don't need to build software – you need to build a great prompt template. A well-engineered template that takes 35 minutes to create can save your team hundreds of hours per month when deployed across thousands of merchant assessments.
Try It Yourself
To validate your template, try it with a completely different merchant profile โ a high-volume e-commerce merchant in Indonesia, or a small food stall in Thailand. Does the template still produce a useful assessment? If not, what needs adjusting?
💾 Save Your Game – AI Memory for Long Projects
You just completed 8 steps across multiple sessions. In a real project, you'd want the AI to "remember" this work next week. But AI has no memory between sessions – every new chat starts blank.
The fix: maintain two files as your project's persistent memory:
project-status.md
Current state – what exists, what's remaining, key decisions. Load this every session. Keep it compact (~2 pages).
session-log.md
History – what was done each session and why. Load only when needed (e.g., "when did we change the threshold?").
End-of-session prompt:
Update project-status.md with what we built today:
- List all files created or modified
- Update the "What's Remaining" section
- Note any key decisions we made
Then append a summary of today's session to session-log.md.
💡 Why be specific? "Update the status" is vague – the AI might miss details or append instead of replacing. The more specific your save prompt, the better your load next time. Think of it like writing a handover note for your future self.
What You Accomplished
🎓 Applied 6 advanced prompting techniques in a real business context
🎓 Experienced iterative prompt refinement – the core skill of prompt engineering
🎓 Produced a reusable, production-grade prompt template with variables
🎓 Learned to ground AI outputs in data and add self-critique for quality assurance
🎓 Evaluated your template with a rubric and LLM-as-Judge – measurable quality
🏗️ Built an artifact your team can deploy immediately for merchant risk assessments