How AI models process text, what it costs, and why different models perform differently, explained for finance teams.
LLMs don't process words; they process tokens. A token is a piece of text: roughly 4 characters of English, or about three-quarters of an average word.
Why this matters for cost: you pay per token, both for what you send (input) and what the AI generates (output). Longer prompts and longer outputs cost more.
| Content | Approximate tokens |
|---|---|
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |
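A quick way to sanity-check these numbers before calling any API is the common rule of thumb of roughly 4 characters per token for English text. A minimal sketch (the heuristic is an approximation, not any model's real tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Ballpark token count using the ~4 characters/token rule of thumb.

    Real tokenizers vary by model; use this only for rough budgeting.
    """
    return max(1, len(text) // 4)

# The short question from the table above: 20 characters -> ~5 tokens
print(estimate_tokens("Assess this merchant"))
```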
Models use subword tokenization; they break text into meaningful pieces, not whole words:
| Text | Tokens | Count | Note |
|---|---|---|---|
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
Each model family has its own tokenizer, so the same text may be a different number of tokens on different models.
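To see subword splitting in practice, you can run text through a tokenizer library. The sketch below uses tiktoken (an open-source tokenizer for OpenAI encodings) purely as an illustration; Claude, Nova, and Llama each use their own tokenizers, so the splits and counts you get will differ by model:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one example encoding

for text in ["chargebacks", "PayLater", "SGD", "$4,200"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {pieces} ({len(token_ids)} tokens)")
```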
Format also matters. The Markdown heading `## 4.2 Chargeback Thresholds` costs ~8 tokens. The HTML equivalent, `<h2 id="section-4-2">4.2 Chargeback Thresholds</h2>`, costs ~20 tokens, 2.5x more for the same information. This is why all your prompt templates, skills, and steering files use Markdown (.md): it's the most token-efficient structured format that both humans and AI can read.
Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture, all of which affect speed, quality, and cost.
Parameters are the "knowledge" stored in the model. More parameters mean more capacity for complex reasoning, but also slower responses and higher cost.
| Model size | Parameters | Analogy | Good for |
|---|---|---|---|
| Small | 1-17B | Junior analyst: fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst: balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant: thorough but expensive | Complex reasoning, multi-step analysis, research |
Different tasks need different trade-offs. Match the model to the job; not every task needs the most powerful option:
| Task type | What matters most | Model category | Examples on Bedrock |
|---|---|---|---|
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |
In the Merchant Risk Assessment demo, you may have noticed that different models can reach different conclusions from the same merchant data.
This is why model selection matters, and why we use decision rules in the prompt to enforce consistency across models.
AI models come in different sizes and specializations. Instead of memorizing specific model names (which change frequently), learn to match your task complexity to the right model tier.
| Tier | Characteristics | Speed | Cost | Examples (as of 2026) |
|---|---|---|---|---|
| ⚡ Lightweight (small models, 1-17B params) | Fast, cheap, good for simple tasks. Limited reasoning depth. | 🟢 Fastest | 🟢 Lowest | Nova Micro, Nova Lite, Haiku |
| 🎯 Balanced (mid-range, 17-70B params) | Good quality + reasonable speed. Handles structured analysis well. | 🟡 Fast | 🟡 Moderate | Nova Pro, Haiku 4.5, Llama Maverick, DeepSeek |
| 🧠 Capable (large models, 70B+ params) | Strong reasoning, nuanced analysis, reliable structured output. | 🟡 Moderate | 🔴 Higher | Claude Sonnet, Llama 70B, Gemini Pro |
| 🔬 Frontier (largest / reasoning-specialized) | Deepest reasoning, multi-step logic, handles ambiguity. Slowest and most expensive. | 🔴 Slower | 🔴 Highest | Claude Opus, Claude Sonnet (extended thinking), o3 |
| Finance task | Recommended tier | Why |
|---|---|---|
| Document classification (invoice vs receipt vs complaint) | ⚡ Lightweight | Simple pattern matching, speed matters, lowest cost |
| Invoice data extraction (fields → JSON) | ⚡ Lightweight | Structured extraction doesn't need deep reasoning |
| Transaction categorization & routing | ⚡ Lightweight | High volume, low complexity per item |
| Customer complaint response drafts | 🎯 Balanced | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | 🧠 Capable | Needs structured reasoning, data citation, actionable recommendations |
| Credit committee narrative | 🧠 Capable | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | 🧠 Capable or 🔬 Frontier | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | 🔬 Frontier | Deep reasoning across large datasets, highest accuracy needed |
| Bulk monthly assessments (200+ merchants) | ⚡ Lightweight | Cost-effective at scale: use the cheapest model that meets the quality threshold |
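In code, the "match the task to the tier" rule is often just a lookup table checked before each request. A minimal sketch; the model IDs are illustrative examples, so verify the current IDs in your Bedrock console:

```python
# Illustrative task -> model mapping; Bedrock model IDs change over time.
MODEL_BY_TASK = {
    "classification": "amazon.nova-micro-v1:0",   # lightweight tier
    "extraction": "amazon.nova-lite-v1:0",        # lightweight tier
    "risk_narrative": "anthropic.claude-sonnet-4-20250514-v1:0",  # capable tier
    "deep_analysis": "anthropic.claude-opus-4-20250514-v1:0",     # frontier tier
}

def pick_model(task_type: str) -> str:
    """Return the cheapest tier that meets the task's quality bar."""
    return MODEL_BY_TASK.get(task_type, MODEL_BY_TASK["risk_narrative"])

print(pick_model("classification"))  # amazon.nova-micro-v1:0
```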
Some models have a "thinking" or "extended thinking" mode where they reason through problems step-by-step internally before responding. This is different from Chain-of-Thought prompting: here the model does it automatically.
| Feature | Standard model | Reasoning model / extended thinking |
|---|---|---|
| How it works | Generates answer directly | Thinks internally first, then generates answer |
| Speed | Faster | Slower (thinking takes time) |
| Cost | Lower | Higher (thinking tokens count) |
| Best for | Most tasks: extraction, drafts, classification | Math, logic puzzles, complex multi-step analysis, ambiguous problems |
| Finance example | Extract invoice fields, draft complaint response | Calculate DSCR across multiple scenarios, assess regulatory impact across 3 jurisdictions |
| Tool | Model selection | What you control |
|---|---|---|
| Kiro (workshop) | Auto-selected by task | Your prompt quality (Kiro picks the model) |
| Claude (Cowork/Desktop) | Assigned by plan tier | Your prompt quality (Anthropic picks the model) |
| Cursor | You choose per conversation | Model selection + prompt quality |
| Bedrock Playground | Full control | Model, parameters, prompt: everything |
| Bedrock API | Full control + routing | Intelligent Prompt Routing can auto-select the cheapest model that meets quality |
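On the Bedrock API, a single Converse call gives you all of that control per request. A minimal sketch using boto3 (the region and model ID are placeholders; use whatever is enabled in your account):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

response = client.converse(
    modelId="amazon.nova-lite-v1:0",  # placeholder: any model enabled in your account
    messages=[{"role": "user", "content": [{"text": "Assess this merchant: ..."}]}],
    inferenceConfig={"maxTokens": 1200, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # reports inputTokens / outputTokens for cost tracking
```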
Every model has a knowledge cutoff date: it doesn't know about events after that date. Newer models have more recent cutoffs, but even the newest model won't know about last week's MAS circular.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $1.00 | $5.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $5.00 | $25.00 | Most complex tasks |
⚠️ Pricing accurate as of April 2026 and subject to change. Check aws.amazon.com/bedrock/pricing for current rates.
Using our workshop prompt template (~400 input tokens + ~800 output tokens):
| Model | Cost per assessment | 50/week | 200/month |
|---|---|---|---|
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.004400 | $0.220 | $0.880 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.022000 | $1.100 | $4.400 |
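The arithmetic behind these figures is simple enough to script yourself. A sketch that reproduces the table from the April 2026 prices above (treat the prices as inputs that will drift, not constants):

```python
# (input $, output $) per 1M tokens, from the pricing table above
PRICES = {
    "Nova Micro": (0.035, 0.14),
    "Nova Pro": (0.80, 3.20),
    "Claude Sonnet 4": (3.00, 15.00),
}

def cost_per_assessment(model: str, input_tokens: int = 400, output_tokens: int = 800) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in PRICES:
    each = cost_per_assessment(model)
    print(f"{model}: ${each:.6f} each, ${each * 200:.3f} for 200/month")
```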
| Strategy | Savings | How it works |
|---|---|---|
| Right-size your model | Up to 143x | Use Nova Micro for classification and Sonnet for complex analysis; don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time); perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models, complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template: pay full price once, 10% for every reuse |
| Task | Recommended model | Input price (per 1M tokens) |
|---|---|---|
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.04-0.06 |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80-1.00 |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00 |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-5.00 |
The context window is the maximum amount of text the model can process at once: your prompt plus the AI's response must fit within it.
| Model | Context window | Text equivalent | Practical meaning |
|---|---|---|---|
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |
| Content type | Tokens per page | Tokens per item |
|---|---|---|
| Plain English text | ~250/page | n/a |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | n/a |
| A typical email | n/a | ~200 tokens |
| A merchant risk assessment | n/a | ~800 tokens |
| A credit committee narrative | n/a | ~600 tokens |
| An invoice (extracted text) | n/a | ~300 tokens |
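With these rules of thumb you can check whether a batch of data will fit in a context window before sending anything. A rough sketch using the figures above:

```python
CONTEXT_WINDOW = 128_000   # e.g. Nova Micro or Llama 3.3 70B
TEMPLATE_TOKENS = 400      # engineered prompt template
OUTPUT_BUDGET = 800        # expected assessment output
CSV_TOKENS_PER_ROW = 50    # financial data rule of thumb

def fits(num_csv_rows: int) -> bool:
    """Check prompt + data + expected output against the context window."""
    total = TEMPLATE_TOKENS + num_csv_rows * CSV_TOKENS_PER_ROW + OUTPUT_BUDGET
    return total <= CONTEXT_WINDOW

print(fits(2_000))  # 101,200 tokens -> True
print(fits(3_000))  # 151,200 tokens -> False
```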
Bedrock provides a CountTokens API that lets you check how many tokens your input will use before you send the actual request. Counting is free; there is no charge for the call.
| What you can do | Why it matters |
|---|---|
| Estimate costs before sending requests | Know the cost before you commit, especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |
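A minimal sketch of calling the CountTokens API from boto3. The request shape below assumes the Converse-style payload for CountTokens; verify the exact parameters against the current boto3 reference for your SDK version:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

# Assumed shape: a Converse-style payload wrapped under "input" (check the docs).
response = client.count_tokens(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    input={
        "converse": {
            "messages": [
                {"role": "user", "content": [{"text": "Assess this merchant: ..."}]}
            ]
        }
    },
)

print(response["inputTokens"])  # the count, known before you pay for a real request
```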
AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.
| Term | What it means |
|---|---|
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |
For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.
| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
|---|---|---|---|
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |
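As arithmetic, the burndown rule is straightforward. A sketch of the rule described above (not an official AWS formula):

```python
def quota_tokens(input_tokens: int, output_tokens: int, output_burndown: int = 1) -> int:
    """Quota consumed by one request.

    output_burndown is 5 for newer Claude models (3.7+),
    1 for Nova, Llama, and older Claude models.
    """
    return input_tokens + output_tokens * output_burndown

print(quota_tokens(1_000, 100, output_burndown=5))  # 1,500 quota tokens
print(quota_tokens(1_000, 100))                     # 1,100 quota tokens
```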
Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated:
| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
|---|---|---|
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
Set max_tokens close to your expected output size. For a merchant risk assessment (~800 output tokens), set max_tokens to 1,000-1,200, not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
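The reservation table above follows the same logic. A sketch using the 8,000-token input and 1,000-token actual output implied by those numbers:

```python
def reservation(input_tokens: int, max_tokens: int, actual_output: int):
    reserved = input_tokens + max_tokens    # held when the request starts
    used = input_tokens + actual_output     # settled after the response
    return reserved, used, reserved - used  # waste blocks concurrent requests

for max_toks in (32_000, 1_250):
    reserved, used, wasted = reservation(8_000, max_toks, 1_000)
    print(f"max_tokens={max_toks:>6}: reserved {reserved:,}, used {used:,}, wasted {wasted:,}")
```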
Use Amazon CloudWatch to track your token consumption:
Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.
Understanding tokens isn't just about API pricing; it directly affects how you use AI tools like Kiro every day.
When you chat with Kiro (or any AI assistant), the tool sends all previous messages in the conversation as context to the model. This means:
| Message # | What gets sent to the model | Approx. tokens |
|---|---|---|
| 1st message | Steering files + your message | ~1,500 |
| 5th message | Steering + 4 previous exchanges + your message | ~8,000 |
| 15th message | Steering + 14 previous exchanges + your message | ~25,000+ |
| After summarization | Steering + compressed summary + your message | ~4,000 |
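You can model that growth with a running total. A rough sketch, assuming ~1,625 tokens per prior exchange on top of the 1,500-token steering-plus-message baseline (the per-exchange figure is an illustrative assumption chosen to match the table):

```python
BASELINE = 1_500             # steering files + your new message (from the table)
TOKENS_PER_EXCHANGE = 1_625  # illustrative average for one prior Q&A pair

def context_tokens(message_number: int) -> int:
    """Approximate tokens sent with the Nth message of a session."""
    return BASELINE + (message_number - 1) * TOKENS_PER_EXCHANGE

for n in (1, 5, 15):
    print(f"message {n}: ~{context_tokens(n):,} tokens")
```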
When the context reaches 80% of the model's limit, Kiro automatically summarizes older messages to free up space. This keeps the conversation going but compresses earlier details.
Kiro's steering files (.kiro/steering/*.md) provide persistent context, but as the official Kiro docs note, they consume tokens on every single message you send.
This is why Kiro offers inclusion modes, so you control what loads when:
| Mode | When it loads | Token impact | Use for |
|---|---|---|---|
| `always` | Every message, every session | 🔴 Constant: sent with every request | Essential rules only: company name, currency, PII policy |
| `fileMatch` | Only when matching files are in context | 🟡 Conditional: only when relevant | Component patterns, API standards, test conventions |
| `manual` | Only when you reference it with # | 🟢 On-demand: zero cost until used | Heavy reference docs, deployment guides, project status |
| `auto` | When your request matches the description | 🟡 Conditional: AI decides relevance | Specialized domain knowledge, complex workflows |
Consider a 30-message session with a 500-token always-on steering file: that file alone contributes 30 × 500 = 15,000 tokens across the session.
Multiply by 24 participants × multiple sessions per day and it adds up.
- Keep `always` files under 300 words: essential rules only, no lengthy explanations
- Move heavy docs to `manual`: project status, deployment guides, reference materials
- Use `fileMatch` for specialized rules: they only load when you're working on matching files

| Concept | Where you'll see it |
|---|---|
| Token estimation | Prompt Engineering: Understanding why prompt length matters for cost and quality |
| Model selection | Demo: Model Arena (compare 3 models on the same task) |
| Cost optimization | Prompt Management and Optimization |
| Right-sizing models | Agentic AI: Intelligent Prompt Routing in workflow automation |
| Context windows | Managing long conversations and knowing when to start fresh |
| Steering token cost | Kiro Stack: Creating steering files (keep `always` files concise, use `manual` for heavy docs) |