AnyCompany Financial Group · Generative & Agentic AI on AWS
Module 1
Prompt Fundamentals Deep Dive
The 4 pillars that determine 80% of output quality
The 80/20 Rule of Prompting
80% of prompt quality comes from 4 fundamentals:
1. Clarity
Say exactly what you mean. If a colleague would ask "what do you mean?" — your prompt needs work.
2. Context
Give the AI the background it needs. Without context, it guesses — dangerous in finance.
3. Role Assignment
Tell the AI who to be. A "risk analyst" focuses on different signals than a "support agent."
4. Output Framing
Define what "done" looks like — format, length, structure, style.
Pillar 1: Clarity
Vague
Clear
"Summarize this report"
"Summarize this quarterly earnings report in 5 bullet points, focusing on revenue growth, cost changes, and risk factors"
"Help me with this data"
"Analyze this CSV of 500 transactions and identify the top 3 merchants by total volume"
"Write something about compliance"
"Draft a 200-word summary of MAS Notice 626 requirements for e-payment service providers"
Rule of thumb: The more specific your prompt, the less the AI has to guess.
Pillar 2: Context
Without context:
"Is this transaction suspicious?"
With context:
"This merchant is a convenience store
in Singapore, typically 50-80 txns/day
averaging $15 SGD. Today: 340 txns
averaging $4.50. Is this suspicious?"
Types of context: Domain · Data · Situational · Constraints
4 Types of Context
Type
What it tells the AI
Finance example
Domain
The industry, market, and business area
"In the context of Southeast Asian digital payments and PayLater services..."
Data
The specific numbers, records, or documents to analyze
"Here is the merchant's transaction history for the last 6 months: [data]"
Situational
Why you need this now — the trigger or event
"We are preparing for a quarterly board review" / "This merchant was flagged by our monitoring system"
Constraints
Rules, limits, and requirements the output must follow
"All amounts in SGD with 2 decimal places" / "Follow MAS Notice 626 guidelines"
Rule of thumb: If you skip Domain context, the AI gives generic answers. If you skip Data context, it hallucinates. If you skip Situational context, it guesses your purpose. If you skip Constraints, it ignores your standards.
Context in Action: Merchant Review
[DOMAIN]
You are reviewing an AnyCompany Pay merchant in
Singapore's food & beverage sector.
[DATA]
Merchant: Kopi Corner Pte Ltd (ID: MC-8842)
Monthly txn volume: 4,200 → 15,600 (6-month trend)
Avg transaction: $8.50 SGD
Chargeback rate: 0.3% → 4.1% (6-month trend)
Complaints: 12 in last 30 days (up from 2)
[SITUATIONAL]
Auto-flagged: chargeback rate exceeds 1.0% threshold.
Risk committee meets Friday.
[CONSTRAINTS]
- All amounts in SGD
- Reference AnyCompany's chargeback policy (max 1.0%)
- Use only the data provided above
- Include a GREEN/AMBER/RED risk rating
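The four labeled blocks above can be assembled programmatically once you review merchants regularly. A minimal Python sketch (the function name and field choices are illustrative, not part of any AnyCompany tooling):

```python
# Illustrative sketch: assemble the four context types into one labeled prompt.
def build_context_prompt(domain: str, data: str, situational: str, constraints: list[str]) -> str:
    """Combine DOMAIN, DATA, SITUATIONAL, and CONSTRAINTS blocks into a prompt."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"[DOMAIN]\n{domain}\n\n"
        f"[DATA]\n{data}\n\n"
        f"[SITUATIONAL]\n{situational}\n\n"
        f"[CONSTRAINTS]\n{constraint_lines}"
    )

prompt = build_context_prompt(
    domain="You are reviewing an AnyCompany Pay merchant in Singapore's F&B sector.",
    data="Merchant: Kopi Corner Pte Ltd (ID: MC-8842)\nChargeback rate: 0.3% -> 4.1% (6-month trend)",
    situational="Auto-flagged: chargeback rate exceeds 1.0% threshold. Risk committee meets Friday.",
    constraints=["All amounts in SGD", "Use only the data provided above",
                 "Include a GREEN/AMBER/RED risk rating"],
)
print(prompt)
```

Filling the blocks from your case data keeps every review consistent, so outputs stay comparable across merchants.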
Pillar 3: Role Assignment
Role
What changes in the output
Compliance Officer
Focuses on regulatory requirements, flags risks
Customer Support Agent
Empathetic language, resolution-focused
Financial Analyst
Numbers, trends, comparisons, frameworks
Fraud Investigator
Patterns, anomalies, evidence chains
Pro tip: Add experience level — "Senior Credit Risk Analyst with 10 years of experience in Southeast Asian consumer lending, specializing in PayLater products"
Pillar 4: Output Framing
Dimension
Example
Format
"Respond as a bullet list" / "Use a table"
Length
"In exactly 3 sentences" / "Under 200 words"
Structure
"Use sections: Summary, Analysis, Recommendation"
Style
"Write for a non-technical executive audience"
Constraints
"Do not include personal opinions"
Output Framing in Action
Same question: "What happened with PayLater chargebacks this quarter?"
❌ No output framing
PayLater chargebacks have been increasing this quarter. The rate went from about 2.1% to 2.8% which is a notable increase. There are several factors that could be contributing to this including seasonal spending patterns, new merchant onboarding, and changes in customer demographics. The team should look into this further and consider implementing additional fraud checks. It might also be worth reviewing the underwriting criteria for new merchants. Overall, while the increase is concerning, it's not yet at critical levels but should be monitored closely going forward.
⚠ Wall of text. No structure. Vague recommendations. Key numbers buried.
✅ With output framing
Summary: Chargeback rate rose 2.1% → 2.8% (+33%).
Root causes:
• New merchant cohort (Q2): 4.2% vs 1.8% established
• "Item not received" disputes +45% in Indonesia
Actions:
• Tighten onboarding for high-risk MCCs — Risk, 2 wks
• Add delivery confirmation for ID — Product, 1 wk
• Review top 20 merchants by volume — Ops, immediate
✓ Scannable. Key number upfront. Actions with owners & timelines.
All 4 Pillars Together
[ROLE] You are a Senior Merchant Risk Analyst at a
Southeast Asian digital payments company.
[CONTEXT] Below is the data for a flagged merchant.
Merchant: Kopi Corner Pte Ltd (ID: MC-8842)
Sector: Food & Beverage, Singapore
Monthly txn volume: 4,200 → 15,600 (6-month trend)
Avg transaction: $8.50 SGD
Chargeback rate: 0.3% → 4.1% (6-month trend)
Complaints: 12 in last 30 days (up from 2)
Top complaint: "Item not received" (45%)
[TASK] Produce a risk assessment with a
GREEN/AMBER/RED rating.
[OUTPUT] Format as:
1. Risk Rating (one sentence)
2. Transaction Analysis (cite numbers)
3. Chargeback Assessment (vs 1.0% threshold)
4. Recommended Actions (numbered, with owners)
Use ONLY the data provided. All amounts in SGD.
See the Difference: Merchant Review
❌ Without Technique
Prompt: "Review this merchant"
The merchant seems to be doing okay. Their transactions have been increasing which is good. There are some chargebacks but that's normal for this type of business. Overall they appear to be a reasonable merchant to work with. I would recommend continuing the relationship.
⚠ Vague, no data cited, no structure, no actionable recommendation, no risk rating
✅ With 4 Pillars
Prompt: Role + Context + Structure + Constraints
RISK RATING: AMBER
Transaction Analysis: Volume surged 271% (4,200→15,600) over 6 months. Avg txn flat at $8.50 suggests organic growth, not fraud.
Chargeback Alert: Rate escalated 0.3%→4.1%, exceeding the 1.0% benchmark by 4x. Primary driver: "Item not received" (45%).
Actions: 1) Issue chargeback warning letter (Merchant Ops, 5 days) 2) Reduce PayLater limit to minimum (Risk, immediate)
✓ Structured, data-cited, specific actions with owners and timelines
Module 2
Chain-of-Thought Reasoning
Making AI show its work — step by step
Why Chain-of-Thought?
Financial decisions require multi-step logic. CoT makes reasoning visible and auditable.
Without CoT
"Can this merchant afford a
$50K credit line?"
Answer: "Yes, they can afford it."
(No reasoning shown)
With CoT
"Think through this step by step."
Revenue: $120K
- Costs: $95K = $25K free cash
- Existing debt: $15K
= Available: $10K
New repayment: $5K
DSCR = 2.0x → Affordable
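The affordability math in the CoT answer is easy to verify outside the model. A quick Python check of the same figures (the 1.25x floor in the assertion is an illustrative lending threshold, not stated on the slide):

```python
# Recompute the CoT affordability check from the slide's numbers.
revenue = 120_000
costs = 95_000
existing_debt_service = 15_000
new_repayment = 5_000

free_cash = revenue - costs                    # 120K - 95K = 25K
available = free_cash - existing_debt_service  # 25K - 15K = 10K
dscr = available / new_repayment               # 10K / 5K = 2.0x

print(f"Available: ${available:,}  DSCR: {dscr:.1f}x")
assert dscr >= 1.25  # illustrative affordability floor, not a policy value
```

This is the real value of CoT in finance: because the model shows each step, you can recompute any of them.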
CoT Techniques
Technique
Trigger
Best for
Zero-Shot CoT
"Think step by step"
Quick calculations, simple logic
Few-Shot CoT
Provide example with reasoning
Consistent multi-step processes
Step-Back
"First identify key factors, then analyze"
Complex analysis needing prioritization
Self-Consistency
"Solve 3 ways, report majority"
High-stakes decisions
Finance rule: Any decision that could be audited should use CoT — the reasoning trail is your documentation.
Zero-Shot CoT Example
Just add "Think step by step" to any prompt — no examples needed.
❌ Without CoT
PROMPT: A merchant has revenue $120K, costs $95K, existing loan repayment $15K/month. They want a $50K credit line with $5K monthly repayments. Can they afford it?
AI RESPONSE:
Yes, they can afford it. The merchant has sufficient revenue to cover the additional repayment.
⚠ No math. No reasoning. Not auditable. Could be wrong.
✅ With "Think step by step"
PROMPT: Same data as left. + "Think through this step by step."
AI RESPONSE:
Step 1 — Free cash: $120K revenue − $95K costs = $25K
Step 2 — After existing debt service: $25K − $15K = $10K available
Step 3 — New repayment $5K → DSCR = $10K / $5K = 2.0x
Conclusion: Affordable, with 2.0x coverage on the new repayment.
Key insight: One added sentence turned an unverifiable "yes" into auditable math; every step can be checked against the inputs.
Step-Back Prompting Example
First identify principles, then apply them to the specific case.
Your prompt
PROMPT:
Before analyzing this insurance claim, first answer:
1. What are the 3 most important factors for travel insurance claim validity?
2. What are common red flags in Southeast Asia?
Now assess this claim using those factors:
Claimant: Sarah Tan (SG)
Trip: Bangkok, 3 nights
Claim: $4,200 SGD — laptop stolen from hotel
Filed: 12 days after return
Police report: Filed on last day of trip
Receipt: Online purchase, 2 months before trip
AI response
STEP 1 — KEY FACTORS:
• Timely filing & documentation • Official incident report
• Verifiable proof of ownership
STEP 2 — RED FLAGS:
• Filing delay >7 days • Police report on departure day
• High-value electronics • No hotel incident report
ASSESSMENT: HIGH RISK
• 12-day filing delay ❌
• Police report timing suspicious (last day) ❌
• $4,200 claim for 3-night trip ⚠
Recommend: Request hotel incident report + purchase proof before processing.
Why Step-Back works: The AI built a framework FIRST (factors + red flags), then applied it systematically — instead of jumping to "looks suspicious."
Self-Consistency for High Stakes
Solve 3 ways, report the majority — for decisions where being wrong is costly.
Your prompt
PROMPT:
Investigate this merchant for potential fraud.
Analyze using 3 independent approaches:
1. Velocity patterns (txn frequency vs norms)
2. Amount patterns (deviation from avg size)
3. Geographic patterns (location consistency)
For each: conclude FRAUDULENT / SUSPICIOUS / LEGITIMATE
Final assessment = majority conclusion.
If approaches disagree → flag for human review.
AI response
Approach 1 — Velocity: 340 txns/day vs 75 norm (+353%). Burst pattern at 2-4am. → FRAUDULENT
Approach 2 — Amount: Avg $4.50 vs $15 norm (−70%). Micro-transactions consistent with card testing. → FRAUDULENT
Approach 3 — Geographic: All transactions from Singapore IP. Consistent with merchant location. → LEGITIMATE
MAJORITY: 2/3 FRAUDULENT
Recommend: Suspend merchant, escalate to fraud team for full investigation.
Why 3 approaches? Approach 3 alone would have cleared this merchant. The majority vote catches what a single analysis misses.
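Once the model returns one verdict per approach, the majority vote itself can be deterministic code rather than another model judgment. A minimal sketch (the function name and the human-review fallback are illustrative):

```python
# Illustrative sketch: aggregate self-consistency verdicts by strict majority,
# escalating to human review when no verdict wins outright.
from collections import Counter

def majority_verdict(verdicts: list[str]) -> str:
    """Return the majority label, or HUMAN_REVIEW if there is no strict majority."""
    label, count = Counter(verdicts).most_common(1)[0]
    if count <= len(verdicts) // 2:  # e.g. a 1-1-1 split across 3 approaches
        return "HUMAN_REVIEW"
    return label

verdicts = ["FRAUDULENT", "FRAUDULENT", "LEGITIMATE"]  # the 3 approaches above
print(majority_verdict(verdicts))  # FRAUDULENT
```

Keeping the vote in code means the escalation rule is enforced exactly, not left to the model's judgment.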
See the Difference: Loan Decision
❌ Without CoT
Prompt: "Should we approve this PayLater application?"
Yes, I would recommend approving this PayLater application. The customer has a decent income and has been on the platform for a reasonable amount of time. Their payment history is mostly good with only minor issues.
⚠ No reasoning shown, no math, not auditable, "decent" and "reasonable" are not metrics
✅ With Chain-of-Thought
Prompt: "Think through each policy criterion step by step"
Step 1 — Income: $5,200 > $2,000 min ✅
Step 2 — DTI: ($400 loan + $3,000 req) / $5,200 = 65% > 40% max ❌
Step 3 — Late payments: 2 in 8 months ≤ 2 max ✅
Step 4 — History: 8 months > 6 min ✅
Decision: DECLINE. DTI of 65% exceeds 40% policy maximum. Suggest $1,500 counter-offer (DTI = 37%).
✓ Every criterion checked, math shown, auditable, includes counter-offer
Module 3
Role & Persona Prompting
Same data, dramatically different insights — just by changing who the AI "is"
The Persona Formula
You are [TITLE] at [COMPANY TYPE]
with [X years] of experience in [SPECIALTY].
You are known for [CHARACTERISTIC].
When [SITUATION], you always [BEHAVIOR].
Conservative Analyst
"Senior Risk Analyst, 12 years in consumer credit. Known for being cautious. When in doubt, recommends additional verification."
Growth Analyst
"Business Development Analyst, 6 years in merchant acquisition. Known for finding opportunities. Always highlights growth potential alongside risks."
The last two fields matter most: CHARACTERISTIC and BEHAVIOR shape the AI's judgment calls. "Cautious" vs "opportunity-focused" produces completely different recommendations from the same data.
Persona in Action: Same Merchant, Different Eyes
Data: Merchant with 271% volume growth + chargeback rate rising from 0.3% to 4.1%
🛡 Conservative Risk Analyst
RISK RATING: RED — IMMEDIATE ACTION
Primary concern: Chargeback rate at 4.1% is 4x above the 1.0% policy threshold. This trajectory suggests systemic issues, not isolated incidents.
Volume growth concern: 271% growth without proportional infrastructure investment raises fulfillment risk. Rapid scaling often precedes quality collapse.
Recommendation:
• Reduce PayLater limit to minimum immediately
• Issue formal chargeback warning letter
• Place on 30-day enhanced monitoring
• If no improvement → initiate offboarding review
🚀 Growth Analyst
Growth signal: 271% volume increase indicates strong product-market fit. This merchant is scaling fast — a potential top-10 partner within 6 months.
Chargeback context: 4.1% rate is elevated but common during rapid scaling. 45% are "item not received" — a logistics issue, not fraud. Solvable with delivery confirmation.
Recommendation:
• Maintain current limits to support growth
• Offer delivery tracking integration (reduces chargebacks 60%)
• Assign dedicated merchant success manager
• Review in 60 days with chargeback target of 1.5%
Multi-Agent Framing
Get 3 perspectives in one prompt — no need to schedule 3 meetings.
Your prompt
PROMPT:
Evaluate this PayLater policy change (raising the limit from $500 to $1,000) as 3 experts: 🛡 Risk Officer, 📊 Product Manager, ⚖ Compliance Head.
Each: 3 concerns + Recommendation + 1 metric to monitor
Then: Balanced synthesis of all three views.
AI response (synthesis excerpt)
🛡 RISK: "Doubling limits increases exposure by $12M. Default rate likely rises 0.3-0.5pp. Recommend: phased rollout to top-tier customers first."
📊 PRODUCT: "Competitors offer $1,500+. Current $500 limit is #1 reason for churn. Revenue uplift est. $2.4M/quarter."
⚖ COMPLIANCE: "MAS Notice requires affordability assessment above $500. Must add income verification step."
SYNTHESIS: Proceed with phased rollout ($750 first) with income verification. Monitor default rate weekly. Full $1,000 after 90-day review.
Why this works: Forces balanced analysis. No single perspective dominates. The synthesis is where the real insight lives.
Same Data, Different Audiences
Data: "PayLater default rate increased from 2.1% to 2.8% this quarter"
Audience
Persona
Output style
Board
"You are the CFO presenting to the board"
Strategic, 5-minute read
Ops Team
"You are the Ops Manager briefing your team"
Actionable, task-oriented
Regulators
"You are Compliance Head responding to MAS"
Formal, regulation-referenced
Customers
"You are a support specialist"
Simple, empathetic
💡 Practice activity (10 min): Pick the same data point above. Write prompts for 2 different audiences. Compare how the tone, detail level, and recommendations change.
Module 4
Structured Outputs & RAG
JSON extraction, document grounding, and meta-prompting
Why Structure Matters
Unstructured = Conversation
Different every time. Hard to compare. Can't feed into systems. Requires human parsing.
Structured = Form
Consistent format. Comparable across items. Machine-parseable. Scannable by busy stakeholders.
Finance use cases:
Invoice extraction → accounts payable system
Transaction categorization → reconciliation
Complaint classification → route to correct team
KYC document parsing → verification forms
How to Get Structured Output
Tell the AI exactly what shape the output should take. The more specific, the more consistent.
Technique
Prompt example
What you get
Named sections
"Use these sections: Summary, Risk Factors, Recommendation"
Same headings every time — scannable, comparable
Table format
"Present as a table with columns: Metric | Value | Benchmark | Status"
Aligned rows — easy to scan and compare across items
Forced rating
"Give a GREEN/AMBER/RED rating. Justify in exactly 2 sentences."
Consistent decision format across all reviews
Length control
"Executive summary: max 3 sentences. Detail section: max 200 words."
Right depth for the audience
Markdown output
"Save as .md with ## headings, bullet lists, and | tables"
AI-native format — low tokens, reusable, versionable
Pro tip: Combine techniques — "Use sections: Summary (3 sentences), Risk Table (Metric | Value | Benchmark), Actions (numbered, with owner and deadline). Return the risk rating as JSON at the end."
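If you follow the pro tip and ask for the risk rating as JSON at the end, the surrounding prose must be stripped before a downstream system can parse it. A minimal sketch (the response text here is invented for illustration):

```python
# Illustrative sketch: extract the trailing JSON rating from a model response.
import json
import re

response = """## Summary
Chargebacks exceed the 1.0% threshold; recommend enhanced monitoring.

{"risk_rating": "AMBER", "chargeback_rate": 4.1}"""

match = re.search(r"\{.*\}", response, re.DOTALL)  # grab the JSON-looking block
rating = json.loads(match.group(0))
print(rating["risk_rating"])  # AMBER
```

In production you would also handle the case where no JSON block is found, since models occasionally ignore format instructions.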
The Best Default Format: Markdown
When you ask AI to save output as a file or produce a reusable document, Markdown wins on every dimension:
Ask AI to "save as .md" — you get structured headings, tables, and lists with 60% fewer tokens than HTML. Readable by you, parseable by AI, and on Day 3 every artifact you create (SKILL.md, steering files) will be Markdown.
Why Markdown? The Numbers
Not just a preference — Markdown is measurably better for AI work:
60%
fewer tokens than HTML for the same content structure
35%
better RAG retrieval accuracy with clean Markdown vs unstructured text
61%
table extraction accuracy in Markdown vs 54% for HTML tables
llms.txt
new web standard (2024) — websites now serve Markdown specifically for AI agents
What this means for you: Your steering files and SKILL.md load on every AI request. Concise Markdown = lower cost, better accuracy, and outputs that are reusable across tools. Detailed sources in the interactive explainer.
The Grounding Problem
Without grounding rules, the AI mixes its training data with your documents — and you can't tell which is which.
❌ Without grounding
Fills gaps with plausible fiction — invents policy details that aren't yours
Uses "typically" and "usually" — hedging that masks guessing
Mixes sources invisibly — your doc + training data, no way to tell
Never says "I don't know" — answers confidently regardless
✅ With grounding rules
Every claim traces to a source — citations after each statement
Admits gaps explicitly — "[INSUFFICIENT DATA]" instead of inventing
No outside knowledge — only the provided documents
Audit-ready output — regulators can verify every claim
Why this matters in finance: If a customer disputes a charge based on AI-generated policy guidance that was hallucinated, your team has no defense. Grounding rules make every AI output traceable.
RAG — The 4 Grounding Rules
Add these rules to any prompt where accuracy matters:
CRITICAL RULES:
1. Base your answer ONLY on the provided documents
2. After each claim, cite: [Doc Name, Section]
3. If not in documents: "Not available in
provided documents"
4. Do NOT use outside knowledge
DOCUMENTS:
[Doc 1: PayLater Terms v3.2]
[Doc 2: MAS Notice PSN 06]
QUESTION: What are our obligations if a customer
misses 3 consecutive PayLater payments?
When to use: Policy lookups, compliance checks, contract review, regulatory Q&A, audit preparation — any task where the answer must come from YOUR documents.
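The four rules are boilerplate you can prepend automatically to any policy question. A sketch assuming documents arrive as a simple name-to-text mapping (the function name and <<< >>> delimiters are illustrative):

```python
# Illustrative sketch: wrap any question with the 4 grounding rules and
# clearly delimited documents, so instructions and data cannot blur together.
GROUNDING_RULES = """CRITICAL RULES:
1. Base your answer ONLY on the provided documents
2. After each claim, cite: [Doc Name, Section]
3. If not in documents: say "Not available in provided documents"
4. Do NOT use outside knowledge"""

def grounded_prompt(documents: dict[str, str], question: str) -> str:
    """Prepend the grounding rules and delimit each document by name."""
    doc_blocks = "\n\n".join(
        f"[Doc: {name}]\n<<<\n{text}\n>>>" for name, text in documents.items()
    )
    return f"{GROUNDING_RULES}\n\nDOCUMENTS:\n{doc_blocks}\n\nQUESTION: {question}"

prompt = grounded_prompt(
    {"PayLater Terms v3.2": "Section 8.3: After 3 missed payments the account enters Default."},
    "What are our obligations after 3 consecutive missed payments?",
)
```

Because the rules are constant, every grounded query in your team uses identical wording, which makes audit review far simpler.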
See the Difference: Policy Q&A
❌ Without RAG Grounding
Prompt: "What happens if a customer misses 3 PayLater payments?"
If a customer misses 3 consecutive PayLater payments, the account is typically suspended and the outstanding balance may be sent to collections. Late fees of around 1-2% per month are usually applied. The customer's credit score will be negatively impacted and they may be barred from future PayLater services.
⚠ Sounds plausible but NONE of this may be in your actual policy. "Typically" and "usually" = hallucination.
✅ With RAG Grounding
Prompt: "ONLY from provided documents" + policy doc attached
Per the PayLater Terms v3.2:
• After 3 missed payments, the account enters "Default" status [Section 8.3]
• A late fee of $5 or 1.5% (whichever higher) per missed payment [Section 7.1]
• [INSUFFICIENT DATA: credit score reporting not in provided document]
✓ Every claim cites a section. Admits what it doesn't know. No hallucination.
Meta-Prompting
Ask the AI to write prompts for you:
I need a prompt that helps our compliance team
assess whether a new MAS regulatory update
affects our PayLater product.
The prompt should:
- Work with any MAS circular
- Produce a structured impact assessment
- Identify operations that need to change
- Suggest a compliance timeline
Write the best possible prompt for this task.
The AI builds a tool for your team. Use the generated prompt repeatedly — it's a reusable asset.
Module 5
Evaluating Your Prompts
How do you KNOW your prompts are working?
Why Evaluate?
Prompts degrade over time — model updates change behavior
"It looks good" is not a metric — you need measurable quality
Compliance requires evidence that AI outputs meet standards
You need to compare version A vs version B objectively
The problem: Most teams deploy prompts based on "it looked good when I tested it once." That's like shipping software without tests.
Manual Evaluation: Rubrics
Criterion
1 (Poor)
3 (OK)
5 (Excellent)
Completeness
Missing 3+ sections
All sections, some thin
All sections thorough
Data grounding
Unsupported claims
Mostly grounded
Every claim cites data
Actionability
No recommendation
Vague recommendation
Specific actions + owners
Consistency
Different each run
Mostly consistent
Identical structure
Process: Run same prompt 5 times → score each → average = quality score
Scale it up: wrap the same rubric in an LLM-as-judge prompt, run it on 10 outputs, and compare scores between template versions.
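The run-score-average process is a few lines of Python once scores are collected. A sketch with invented scores for five runs of the same prompt:

```python
# Illustrative sketch: average rubric scores (1-5 per criterion) over 5 runs.
from statistics import mean

CRITERIA = ["completeness", "grounding", "actionability", "consistency"]

runs = [  # invented scores for five runs of one prompt
    {"completeness": 5, "grounding": 4, "actionability": 4, "consistency": 5},
    {"completeness": 4, "grounding": 4, "actionability": 3, "consistency": 5},
    {"completeness": 5, "grounding": 5, "actionability": 4, "consistency": 4},
    {"completeness": 4, "grounding": 3, "actionability": 4, "consistency": 5},
    {"completeness": 5, "grounding": 4, "actionability": 5, "consistency": 4},
]

per_run = [mean(r[c] for c in CRITERIA) for r in runs]  # one score per run
quality_score = mean(per_run)                            # overall quality
print(f"Quality score: {quality_score:.2f} / 5")
```

Recording the per-run scores, not just the average, also shows you how consistent the prompt is across runs.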
A/B Testing Prompts
Process
Same input data, two prompt versions
Run both 10 times each
Score with the judge prompt
Higher average score wins
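Comparing two versions then reduces to a mean comparison. A sketch with invented scores (the 0.2 tie margin is an illustrative choice, not a standard; pick one that suits your rubric):

```python
# Illustrative sketch: A/B-compare two prompt versions by mean judge score,
# declaring a tie when the margin is too small to act on.
from statistics import mean

scores_a = [3.8, 4.0, 3.5, 4.2, 3.9, 3.7, 4.1, 3.6, 4.0, 3.8]  # 10 runs, version A
scores_b = [4.4, 4.6, 4.2, 4.5, 4.3, 4.7, 4.4, 4.1, 4.6, 4.5]  # 10 runs, version B

margin = mean(scores_b) - mean(scores_a)
winner = "B" if margin > 0.2 else ("A" if margin < -0.2 else "TIE")
print(f"A: {mean(scores_a):.2f}  B: {mean(scores_b):.2f}  -> {winner}")
```

A tie means "keep testing", not "pick either": ten runs is a small sample, and small margins flip easily.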
When to Re-evaluate
After any model update
When users report quality issues
Monthly for production templates
After any template modification
Module 6 · NEW
From Manual Prompts to Automated Tools
You build the template once. The tools do the rest.
The Reality: Nobody Writes Long Prompts Every Day
You learn the techniques → build the template once → let the tools handle the rest.
Phase
What you do
Tool
1. Learn
Master the techniques (today)
Your brain
2. Build
Create a reusable template with {{variables}}
Kiro / any AI chat
3. Optimize
Let AI rewrite your prompt for better performance
Bedrock Prompt Optimization
4. Store & Share
Save versioned templates with metadata
Bedrock Prompt Management
5. Reuse
Fill in variables and run — no rewriting needed
Bedrock Console / API
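Filling {{variables}} needs no special tooling while you are still in the manual phase. A minimal sketch (the template text and variable names are illustrative):

```python
# Illustrative sketch: fill a {{variable}} template, failing loudly if any
# placeholder is left unfilled.
TEMPLATE = """[ROLE] You are a {{role}}.
[CONTEXT] {{merchant_data}}
[TASK] Produce a risk assessment with a GREEN/AMBER/RED rating.
Use ONLY the data provided. All amounts in SGD."""

def fill(template: str, values: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; raise if any remain unfilled."""
    out = template
    for name, value in values.items():
        out = out.replace("{{" + name + "}}", value)
    if "{{" in out:
        raise ValueError("unfilled template variables remain")
    return out

prompt = fill(TEMPLATE, {
    "role": "Senior Merchant Risk Analyst",
    "merchant_data": "Chargeback rate: 0.3% -> 4.1% (6-month trend)",
})
```

The same {{name}} syntax carries over directly when you later move the template into Bedrock Prompt Management.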
Bedrock Prompt Management
Your prompt library — stored, versioned, and shared across the team.
Manual (today)
Bedrock Prompt Management
Templates in markdown files
Stored as managed resources
Copy-paste to test
One-click testing across models
No version history
Immutable version snapshots
Manual comparison
Side-by-side model comparison
Share via email/Slack
Shared across team via API
No additional charge — you only pay for model tokens during testing.
Prompt Management: Key Features
Prompt Templates with {{variables}} — same syntax from the exercises. Define variables with descriptions and defaults.
Version Management — every change creates an immutable snapshot. Roll back anytime.
Multi-Model Testing — test across Claude, Nova, Llama side-by-side. Compare quality, latency, cost.
Up to 3 Prompt Variants — compare different versions of the same prompt to find the best performer.
Think of it as: Google Docs for prompts — versioned, shared, and always accessible. But with built-in testing across multiple AI models.
Prompt Optimization (Instructor Demo)
You write a basic prompt. Bedrock rewrites it for better performance — automatically.
Your prompt
"Assess this merchant's risk level"
6 words. No structure, no role, no constraints.
Bedrock's optimized version
"You are a Senior Risk Analyst
specializing in SEA digital payments.
Produce a risk assessment:
1. Rating (GREEN/AMBER/RED)
2. Transaction Pattern Analysis
3. Chargeback Assessment
4. Recommended Actions
Base analysis ONLY on provided data."
Persona + structure + grounding — applied automatically
How Prompt Optimization Works
Step 1: Submit your prompt (even a short, rough one)
Step 2: Bedrock analyzes the prompt components
Step 3: It rewrites with best practices — structure, constraints, model-specific formatting
Step 4: Compare original vs optimized output side-by-side
Step 5: Save the optimized version to your Prompt Management library
GA — April 2025. Supports Claude, Amazon Nova, Meta Llama, DeepSeek, Mistral. The techniques you learned today help you evaluate whether the optimized prompt is actually good.
The Bottom Line
Your concern
The solution
"I don't want to write long prompts every time"
Build the template once → reuse with {{variables}}
"I'm not sure my prompt is good enough"
Prompt Optimization rewrites it automatically
"My team needs to share and version prompts"
Prompt Management stores everything centrally
"Which model gives the best result?"
Multi-model testing compares side-by-side
For developers: Intelligent Prompt Routing auto-selects cheaper models for simple tasks (up to 30% cost savings). Prompt Flows chains prompts into automated workflows. These are covered in Day 3.
Deliverable: Reusable template for APPROVE/CONDITIONS/DECLINE credit narratives
Open the workshop site → Prompt Engineering Exercises
Wrap-up
Best Practices & Prompt Optimization
Common mistakes, optimization strategies, and recovery patterns
7 Prompt Mistakes Everyone Makes
Mistake
Why it hurts
Quick fix
The Kitchen Sink
Cramming 5 tasks into 1 prompt
One task per prompt, chain results
The Blank Canvas
No examples = AI guesses your format
Show 1-2 examples of desired output
The Trust Fall
No grounding = confident hallucinations
"ONLY from provided data"
The Vague Ask
"Analyze this" — analyze what, how, for whom?
Specify audience, format, length
The One-Shot Wonder
Expecting perfection on first try
Plan for 2-3 refinement turns
The Copy-Paste Trap
Using the same prompt for different models
Tune syntax per model family
The Set-and-Forget
Never re-testing after model updates
Monthly prompt health checks
The Draft-Score-Revise Loop
Don't accept the first output. Build a self-improving cycle into your prompt:
Step 1 — DRAFT: Write a merchant risk summary
using the data provided.
Step 2 — SCORE: Rate your draft on these criteria:
- Completeness (0-5): All required sections?
- Grounding (0-5): Every claim cites data?
- Actionability (0-5): Specific next steps?
Step 3 — REVISE: If total < 12, rewrite to fix
the lowest-scoring area. Max 2 revisions.
Output only the final version.
Result: The AI self-corrects before you even read it. Teams using this pattern report 40-60% fewer revision cycles.
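The same loop can also live in your code rather than inside one prompt. A sketch with the model call and the rubric scorer stubbed out (generate and score are placeholders, not a real API):

```python
# Illustrative sketch of draft -> score -> revise, with stubbed model calls.
def generate(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "DRAFT: merchant risk summary with cited data ..."

def score(draft: str) -> dict[str, int]:
    """Stub rubric scorer (0-5 per criterion); replace with a judge prompt."""
    return {"completeness": 4, "grounding": 4, "actionability": 4}

def draft_score_revise(task: str, threshold: int = 12, max_revisions: int = 2) -> str:
    draft = generate(task)
    for _ in range(max_revisions):
        scores = score(draft)
        if sum(scores.values()) >= threshold:
            break  # good enough, stop revising
        weakest = min(scores, key=scores.get)  # revise the lowest-scoring area
        draft = generate(f"{task}\nRevise to improve: {weakest}\n\nPrevious draft:\n{draft}")
    return draft

final = draft_score_revise("Write a merchant risk summary using the data provided.")
```

Capping revisions (max 2, as in the prompt version) keeps cost bounded while still catching most weak drafts.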
Break Big Tasks into Small Steps
Complex tasks fail when you ask for everything at once. Decompose instead:
❌ One Giant Prompt
"Analyze our Q2 transactions,
identify fraud patterns, calculate
loss exposure, compare to Q1,
draft a board summary, and
recommend 3 prevention measures."
6 tasks = shallow work on each
✅ Chained Prompts
Prompt 1: "Analyze Q2 transactions
and flag anomalies"
Prompt 2: "From these anomalies,
identify the top 3 fraud patterns"
Prompt 3: "Calculate loss exposure
for each pattern"
Prompt 4: "Draft a board summary
with prevention measures"
Each step gets full attention
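Chaining is just a loop in which each answer becomes the next prompt's input. A sketch with the model call stubbed out (ask is a placeholder for whatever chat API you use):

```python
# Illustrative sketch of prompt chaining: each step's output feeds the next.
def ask(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return f"<answer to: {prompt.splitlines()[0]}>"

steps = [
    "Analyze Q2 transactions and flag anomalies:",
    "From these anomalies, identify the top 3 fraud patterns:",
    "Calculate loss exposure for each pattern:",
    "Draft a board summary with prevention measures:",
]

result = "[Q2 transaction data]"
for step in steps:
    result = ask(f"{step}\n{result}")  # previous answer becomes next input
print(result)
```

In practice you would review each intermediate result before passing it on, which is exactly where chaining beats the one giant prompt.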
Tell the AI What NOT to Do
Positive instructions tell the AI what to include. Negative constraints prevent common failure modes:
Problem
Negative constraint to add
AI adds unsolicited opinions
"Do not include personal opinions or speculation"
AI uses data not in your input
"Do not reference any data outside the provided documents"
AI writes too much
"Do not exceed 300 words. Do not add a conclusion section"
AI hedges everything
"Do not use phrases like 'it depends' or 'generally speaking'"
AI explains obvious things
"Do not explain what PayLater is or how digital wallets work"
AI invents numbers
"If a metric is not in the data, write [DATA NOT AVAILABLE]"
Pro tip: After your first test run, note what went wrong and add a "Do NOT" line for each issue. Your prompt improves with every iteration.
Decision Rules: Override Subjective Judgment
Different models give different ratings for the same data. Decision rules enforce your policy:
❌ Without rules
Claude: "4.1% chargebacks = critically high" → RED
Llama: "271% growth offsets risk" → AMBER
Same data, different conclusions.
✅ With decision rules
"Apply this rule: RED if chargeback rate > 3.0%"
Claude: "Rule: 4.1% > 3.0%" → RED
Llama: "Rule: 4.1% > 3.0%" → RED
Both agree. Policy enforced.
Use when: Consistency matters more than creativity — risk ratings, credit decisions, compliance. If your company has a policy threshold, encode it in the prompt.
Structure Your Prompts Like Documents
Well-organized prompts produce well-organized outputs. Use clear sections and delimiters:
### ROLE
You are a Senior Payment Operations Analyst.
### CONTEXT
<<<
[Paste your transaction data or document here]
>>>
### TASK
Analyze the data for anomalies in Thailand and Vietnam.
### OUTPUT FORMAT
- Executive summary (3 sentences)
- Anomaly table: Market | Type | Severity | Evidence
- Recommended actions (numbered, with owner)
### CONSTRAINTS
- Use ONLY the data provided above
- All amounts in SGD
- Do not exceed 400 words
Why delimiters matter: Without clear separation, the AI may confuse your instructions with your data — especially dangerous when pasting policy documents.
Show, Don't Tell: The Power of Examples
One good example is worth 100 words of instruction:
❌ Telling
"Categorize each transaction as
high risk, medium risk, or low
risk based on amount, frequency,
and merchant type. Format as a
table with columns for transaction
ID, category, and reasoning."
50 words of instruction, AI still guesses your format
✅ Showing
"Categorize transactions like this:
| ID | Risk | Reason |
| T001 | HIGH | $12K single txn,
new merchant, no history |
| T002 | LOW | $45 recurring,
12-month pattern |
Now categorize these: [data]"
One example = perfect format every time
The 3-Round Prompt Improvement Workflow
Every production-quality prompt goes through this cycle:
Round
What you do
What improves
Round 1: Baseline
Write your first prompt using the 4 pillars. Run it 3 times.
You see what the AI gets right and wrong
Round 2: Fix failures
Add negative constraints for each failure. Add an example of good output. Run 3 more times.
Consistency jumps from ~60% to ~85%
Round 3: Polish
Add self-review step. Tighten length/format. Test with edge cases.
Production-ready at ~95% consistency
Total time: 15-20 minutes to go from first draft to production template. That template then saves hours every week.
Build a Team Prompt Library
Your best prompts are team assets, not personal notes. Treat them like shared templates:
What to include
Prompt name and purpose
The full prompt with {{variables}}
Which model and temperature to use
1-2 example outputs (good vs bad)
Known limitations and edge cases
Last tested date and model version
Starter library for finance
Merchant risk assessment
Transaction anomaly detection
Customer complaint classification
Policy document Q&A (RAG)
Board summary generator
Regulatory impact assessment
Start today: The template you built in the exercise is your first library entry. Share it with your team this week.
Why AI "Gets Dumber" Mid-Conversation
It's not a bug — it's a context window problem. Every AI has a limited "working memory."
What happens inside
Every message + every AI response stays in the context window
At 60-70% capacity, performance drops sharply — sudden cliffs, not gradual
AI compresses and deprioritizes earlier messages
"Lost in the Middle": AI remembers start and end best, forgets the middle
What you experience
AI contradicts instructions you gave 10 messages ago
AI re-introduces ideas you already rejected
AI ignores constraints from the start of the chat
Outputs get vague, generic, or repetitive
AI starts "hallucinating" more frequently
Key insight: Most people blame the AI for "getting stupid." The real problem is the conversation got too long. The fix is context management, not a better model.
5 Rules for Managing Long Conversations
Rule
Why it works
One task per session — don't mix debugging, writing, and analysis
Each session gets full attention capacity
Paste only what's relevant — don't dump entire documents
Reduces noise, keeps AI focused
Key instructions at start AND end — not buried in the middle
Exploits primacy + recency bias
Keep sessions under 15-20 turns — start fresh after that
Stays within the performance sweet spot
Use "session summaries" — ask AI to summarize, paste into new chat
Fresh context window with all the knowledge
The Session Summary Technique
When a conversation gets too long but you can't lose the state:
Step 1: Ask for a summary
PROMPT (in the old session):
Summarize our conversation so far:
• Key decisions we made
• Data and findings so far
• What we still need to do next
Format as a briefing I can paste into a new session.
Step 2: Start fresh with context
PROMPT (in the new session):
Here is the context from our previous session:
[PASTE SUMMARY HERE]
Continue from where we left off. The next step is to draft the risk committee report based on the findings above.
✓ Fresh context window + all accumulated knowledge = best of both worlds
Think of it as "saving your game." You compress hours of conversation into a focused briefing, then load it into a fresh session with full attention capacity.
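For teams that script their sessions, the two steps above can be captured as reusable strings. A minimal sketch; the wording comes straight from the prompts on this slide, and the function name is illustrative.

```python
# Sketch of the session-summary technique as reusable strings.

# Step 1: sent at the end of the old session.
SUMMARY_REQUEST = """Summarize our conversation so far:
- Key decisions we made
- Data and findings so far
- What we still need to do next
Format as a briefing I can paste into a new session."""

def new_session_prompt(summary: str, next_step: str) -> str:
    """Step 2: build the opening message for the fresh session
    from the saved briefing plus an explicit next step."""
    return ("Here is the context from our previous session:\n\n"
            f"{summary}\n\n"
            f"Continue from where we left off. The next step is to {next_step}.")
```

Naming the next step explicitly matters: a fresh session has zero memory, so "continue" alone gives it nothing to continue from.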
The Conversation Funnel
Start broad, then narrow. Each turn builds on context — but keep it focused.
The pattern
Turn 1 (Explore):
"Analyze this month's transaction data — identify top 3 trends"
Turn 2 (Deep-dive):
"Expand on trend #2 — the PayLater chargeback increase"
Turn 3 (Produce):
"Draft a 1-page summary for the risk committee"
Turn 4 (Polish):
"Make the tone more formal and add data citations"
Why it works
Each turn is focused on one thing
You review and correct at each step
Errors don't compound — you catch them early
4 focused turns > 1 massive prompt
When to reset: If Turn 3 goes wrong, don't keep correcting. Start a new session with: "Here's the data and the trend analysis. Draft a risk committee summary."
When to Start Fresh vs. Continue
🟢 Start a New Session
Switching to a completely different task
Conversation has gone off track
Testing a refined prompt cleanly
Session is longer than 15-20 turns
AI keeps repeating the same mistake
AI contradicts earlier instructions
🔵 Continue the Session
Iterating on the same output
Need AI to remember earlier context
Building step by step (funnel pattern)
Refining format or tone
Follow-up questions on same topic
Session is still under 15 turns
The 3-strike rule: If you've corrected the AI 3 times and it's still wrong — the context is working against you. Start fresh. It's faster than fighting a polluted conversation.
The #1 Misconception: "AI Remembers Me"
It doesn't. Each session is completely isolated. Here's what AI actually sees:
❌
What people think
"The AI remembers our conversation from last week"
"It knows what I worked on yesterday"
"I should keep this session open so it doesn't forget"
"My old tabs are giving it context"
✅
How it actually works
Each session starts with zero memory
AI only sees: your current message + this session's history
Old tabs/sessions have no effect on new ones
Closing old sessions is safe — it's cosmetic, not functional
The mental model: chat is ephemeral, files are permanent. The AI's "memory" is the files it created — reports, templates, skills. Those persist in your workspace. The conversation that produced them does not. When you need context in a new session, reference the files — not the old chat.
"Save Your Game" — AI Memory for Long Projects
For projects spanning weeks or months, you need two files — not one giant document:
project-status.md
Load every session — compact, ~2 pages
What exists now (file list, decisions)
What's remaining (next steps)
Key rules and constraints
Like a project brief — current state only
session-log.md
Load only when needed — grows over time
What was done each session
Why decisions were made
Technical details and gotchas
Like meeting minutes — history archive
When
What to say
Start session
"Here's my project context: [paste project-status.md]"
End session
"Update project-status.md with current state. Append today's work to session-log.md."
Look back
"Load session-log.md — when did we change the approval threshold?"
Why two files? A single status doc that grows every session wastes tokens. After 10 sessions, you're loading 20 pages of history on every request. Split it: load the brief (2 pages) always, load the history only when you need it. Same knowledge, 90% fewer tokens.
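The end-of-session ritual is simple enough to automate. A minimal sketch of the update step, assuming plain markdown files in a project folder; the function name is illustrative. Note the asymmetry: the status file is overwritten (current state only), the log is appended (history archive).

```python
# Sketch of the end-of-session "save your game" update.
from datetime import date
from pathlib import Path

def save_game(workdir: str, status: str, session_notes: str) -> None:
    """Overwrite the compact brief; append today's notes to the archive."""
    root = Path(workdir)
    # project-status.md holds current state only, so it is rewritten each time.
    (root / "project-status.md").write_text(status, encoding="utf-8")
    # session-log.md grows over time; load it only when you need the history.
    with open(root / "session-log.md", "a", encoding="utf-8") as log:
        log.write(f"\n## Session {date.today().isoformat()}\n{session_notes}\n")
```

This is why the brief stays at ~2 pages after 50 sessions while the log keeps the full "why" trail for questions like "when did we change the approval threshold?"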
Circuit Breaker Patterns
Pattern
Symptom
Fix
Repetition Loop
Same wrong output after correction
New session, rephrase
Hallucination Spiral
Inventing data
"Use ONLY provided data"
Over-Eager Helper
2,000 words for 5 bullets
"Exactly 5 bullets, under 20 words"
Format Drift
Format changes mid-output
"Continue EXACTLY same format"
Confidence Trap
Uncertain info as fact
"Prefix uncertain with [UNCERTAIN]"
Working Safely: Undo, Revert, Recover
What happens when the AI makes a mistake? You have safety nets at every level.
Safety net
How it works
When to use
Supervised Mode
Shows changes, asks for approval before applying
High-stakes outputs, first time using a skill
Revert Changes
Click to undo individual file changes
AI modified a file incorrectly
New Session
Start fresh — clean context, no polluted history
AI went off track, switching tasks
Autopilot + Review
AI works autonomously, you review after
Trusted skills, routine tasks
Think of it like "track changes" in Word. Supervised mode shows you what's about to change. You approve or reject. If you approve and it's wrong, you can still revert.
Two Controls, Two Different Jobs
These are independent settings — changing one does not affect the other.
Execution Mode
Controls how much freedom Kiro has over your files
Supervised
Shows changes, waits for your approval
Autopilot
Applies changes directly, you review after
Model Selection
Controls which AI brain answers your question
Auto
Kiro picks the best model per task (recommended)
Sonnet / Opus / Haiku
You choose a specific model
Our recommendation: Use Supervised + Auto for today's labs. Supervised lets you see what Kiro is doing. Auto picks the right model so you don't have to. Switch to Autopilot on Day 3 when you're comfortable.
Using Kiro for Business Users
Vibe mode: Describe what you want → Kiro writes and runs the code
File context: Drag CSVs, PDFs, JSON into chat
Iterative refinement: "Make the chart bigger" / "Add a percentage column"
New Session per task: Keep context focused
Remember: You don't need to understand the code Kiro writes. You just need to describe what you want clearly — using the 4 pillars from Module 1.
Quick Reference Card
Technique
Trigger Phrase
Zero-Shot CoT
"Think step by step before answering"
Expert Persona
"You are a Senior [ROLE] with X years in [SPECIALTY]"
Multi-Perspective
"Present the case FOR and AGAINST"
Structured Output
"Use EXACTLY these sections: 1... 2... 3..."
RAG Grounding
"Base your answer ONLY on the provided documents"
Self-Critique
"Review: Is every claim supported by data?"
Meta-Prompting
"Write the best prompt for [TASK]"
LLM-as-Judge
"Score this output against these criteria"
Negative Constraints
"Do NOT include / Do NOT use / Do NOT exceed"
Decision Rules
"If [metric] > [threshold] → MUST be [rating]"
Task Decomposition
Break 1 big prompt into 3-4 focused prompts
Draft-Score-Revise
"Draft, then score on [rubric], then revise if < threshold"
Show Don't Tell
Include 1-2 examples of desired output format
Preview
From Prompts to Workflow Automation
Everything you learned today becomes the foundation for autonomous AI agents
Your Prompt Skills = Agent Design Skills
Every technique you learned today maps directly to how AI agents are built:
Day 2: Prompt Technique
Day 3: Agent Component
What it does in an agent
Persona prompting
Agent role definition
Defines who the agent "is" and how it behaves
Structured output
Output contracts
Ensures consistent, usable results
Chain-of-Thought
Reasoning strategy
Agent thinks step-by-step before acting
RAG grounding
Knowledge base
Agent accesses your company's documents
Negative constraints
Guardrails
Prevents the agent from doing things it shouldn't
Prompt template
SKILL.md file
The template becomes a reusable, shareable skill
Key insight: You don't need to code to design an AI agent. You need to write great instructions — which is exactly what you practiced today.
Preview: Templates → Skills → Automation
Tomorrow you'll turn your prompt templates into automated workflows:
Today: Prompt template
You are a Senior Risk Analyst...
Analyze merchant data and produce:
1. Risk Rating (GREEN/AMBER/RED)
2. Transaction Analysis
3. Recommended Actions
Tomorrow: The same template as a reusable skill
✓ Auto-activates, shared, versioned ✓ Works in Kiro AND Claude Cowork
Day 3 covers: Workflow patterns (chaining, parallelization, routing, orchestration), the Kiro stack (steering + skills + hooks), and you'll design an agent for your team's workflow.
The 3-Day Journey
📚
Day 1
"What can AI do?"
Fundamentals, use cases, responsible AI
💬
Day 2 (Today)
"How do I talk to AI?"
Prompt engineering, templates, tools
🤖
Day 3 (Tomorrow)
"How do I make AI work on its own?"
Agentic AI, workflow automation, no code
💡 Homework: What repetitive task does your team do every week that could be automated? Come to Day 3 with a specific workflow — you'll design an AI agent for it.
Day 2 Outcomes
Design prompts using the 4 pillars (Clarity, Context, Role, Output)
Apply Chain-of-Thought and Self-Consistency for financial reasoning
Create expert personas for different audiences
Extract structured data and ground responses in documents
Evaluate prompt quality with rubrics and LLM-as-Judge
Use Bedrock tools to optimize and manage prompts at scale
Manage long conversations and know when to start fresh
Build reusable prompt templates — the foundation for AI agents
Identify a workflow from your team to automate on Day 3
Thank You
Tomorrow: Make AI Work On Its Own
Agentic AI · Workflow Automation · Agent Design · No Coding Required
💡 Homework: Come with a workflow your team does every week that could be automated
AnyCompany Financial Group · Generative & Agentic AI on AWS