💰 Tokenization, Pricing & Model Selection

How AI models process text, what it costs, and why different models perform differently - explained for finance teams.

What Are Tokens?

LLMs don't process words - they process tokens. A token is a piece of text, roughly:

TOKEN ESTIMATION
1 token ≈ 4 characters ≈ ¾ of a word
"AnyCompany Financial Group" = 4 tokens
"The merchant's chargeback rate is 4.1%" = 9 tokens
A typical merchant risk assessment (500 words) ≈ 650 tokens

Why this matters for cost: You pay per token - both for what you send (input) and what the AI generates (output). Longer prompts and longer outputs cost more.
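Before reaching for the exact CountTokens API (covered later), you can script the rule of thumb above. A minimal sketch - the constants are heuristics, not exact counts:

```python
def tokens_from_chars(text: str) -> int:
    """Rough estimate: 1 token is about 4 characters."""
    return max(1, len(text) // 4)

def tokens_from_words(word_count: int) -> int:
    """Rough estimate: 1 token is about 3/4 of a word."""
    return round(word_count / 0.75)

print(tokens_from_chars("The merchant's chargeback rate is 4.1%"))  # ~9
print(tokens_from_words(500))  # ~667 - in line with the ~650 figure above
```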

Quick Token Estimation

| Content | Approximate tokens |
| --- | --- |
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |

How Tokenization Works

Models use subword tokenization - they break text into meaningful pieces, not whole words:

| Text | Tokens | Count | Note |
| --- | --- | --- | --- |
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
โš ๏ธ Key insight for finance: Numbers and special characters use more tokens than you'd expect. A table of financial data with lots of numbers costs more tokens than the same amount of narrative text.
๐Ÿ” Why do numbers cost more tokens?

Tokenizers learn from text - and most text is words, not numbers.

During training, the tokenizer (BPE) merges the most frequent character pairs into tokens. Common words like "merchant" and "chargeback" appear millions of times, so they become efficient 1-2 token units. But numbers are different:

| Text | Tokens | Why |
| --- | --- | --- |
| "merchant" | 1 token | Very common word - learned as a single unit |
| "chargeback" | 2 tokens | "charge" + "back" - common subwords |
| "$4,200.50" | 6+ tokens | "$" + "4" + "," + "200" + "." + "50" - each piece is separate |
| "SGD 18,642" | 5+ tokens | "SG" + "D" + " 18" + "," + "642" |

Three reasons numbers split into many tokens:

  1. Each digit combination is rare. "200" exists as a token, but "4,200" as a specific sequence is much rarer than common words. The tokenizer never learned to merge it.
  2. Punctuation breaks merging. The comma in "$4,200" and the period in "4,200.50" act as boundaries - each becomes its own token.
  3. Numbers have high entropy. After "$", the next character could be anything (4, 42, 4,200, 42,000). The tokenizer can't predict it, so it keeps numbers as small pieces.

The practical impact:

A CSV row like `MC-8842, Kopi Corner, $15,600, 4.1%, SGD, 2025-06-15` uses ~25-30 tokens - while the same length of English text uses only ~12 tokens.
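You can see this fragmentation yourself with any open BPE tokenizer. A quick sketch using OpenAI's tiktoken library (`pip install tiktoken`) - its vocabulary differs from the tokenizers used by Bedrock models, so exact counts will vary, but the pattern (common words merge, numbers fragment) is the same:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

for text in ["merchant", "chargeback", "$4,200.50", "SGD 18,642"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```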

What you can do:

  • Summarize data in narrative form when possible ("Revenue grew 271% from $4,200 to $15,600") rather than pasting raw CSV tables
  • Use the CountTokens API to check before sending large datasets
  • Extract only the relevant columns rather than sending entire spreadsheets

Why Different Models Tokenize Differently

Each model family has its own tokenizer - the same text may tokenize to a different number of tokens on different models.

💡 Document format affects token cost too

The format you use to give documents to AI matters for cost. The same section heading, `## 4.2 Chargeback Thresholds`, costs ~8 tokens in Markdown. The HTML equivalent, `<h2 id="section-4-2">4.2 Chargeback Thresholds</h2>`, costs ~20 tokens - 2.5x more for the same information. This is why all your prompt templates, skills, and steering files use Markdown (.md) - it's the most token-efficient structured format that both humans and AI can read.

Why Different Models Perform Differently

Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture - which affects speed, quality, and cost.

Model Size = Brain Size

Parameters are the "knowledge" stored in the model. More parameters = more capacity for complex reasoning, but also slower and more expensive.

| Model size | Parameters | Analogy | Good for |
| --- | --- | --- | --- |
| Small | 1-17B | Junior analyst - fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst - balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant - thorough but expensive | Complex reasoning, multi-step analysis, research |

Choosing the Right Model for the Task

Different tasks need different trade-offs. Match the model to the job - not every task needs the most powerful option:

| Task type | What matters most | Model category | Examples on Bedrock |
| --- | --- | --- | --- |
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |

💡 The key question: "What's the most cost-effective model that meets my quality threshold for this specific task?" - not "which model is the best overall." The Model Arena demo helps you answer this by comparing outputs side-by-side.

Why Does This Happen?

💡 The Three Factors

1. Parameters (size): More parameters = more "knowledge" stored in the model. A 70B model has seen more patterns during training than a 7B model. But more parameters means more computation per token → slower and more expensive.

2. Training data & techniques: Claude models are trained with extensive safety tuning (RLHF/DPO), which makes them more cautious and thorough. Llama models are trained as general-purpose open-source models. Nova models are optimized for AWS integration and cost efficiency.

3. Architecture optimizations: Some models use techniques like mixture-of-experts (MoE), where only a fraction of parameters activate per token - making them faster without losing quality. Others use knowledge distillation to compress a large model's knowledge into a smaller one.

What We Saw in the Demo

In the Merchant Risk Assessment demo, you may have noticed that the same prompt and merchant data produced noticeably different outputs across models - different depth, tone, and sometimes a different risk rating.

This is why model selection matters - and why we use decision rules in the prompt to enforce consistency across models.

Choosing the Right Model Tier for Finance Tasks

AI models come in different sizes and specializations. Instead of memorizing specific model names (which change frequently), learn to match your task complexity to the right model tier.

The 4 Model Tiers

| Tier | Characteristics | Speed | Cost | Examples (as of 2026) |
| --- | --- | --- | --- | --- |
| ⚡ Lightweight (small models, 1-17B params) | Fast, cheap, good for simple tasks. Limited reasoning depth. | 🟢 Fastest | 🟢 Lowest | Nova Micro, Nova Lite, Haiku |
| 🎯 Balanced (mid-range, 17-70B params) | Good quality + reasonable speed. Handles structured analysis well. | 🟡 Fast | 🟡 Moderate | Nova Pro, Haiku 4.5, Llama Maverick, DeepSeek |
| 🧠 Capable (large models, 70B+ params) | Strong reasoning, nuanced analysis, reliable structured output. | 🟡 Moderate | 🔴 Higher | Claude Sonnet, Llama 70B, Gemini Pro |
| 🔬 Frontier (largest / reasoning-specialized) | Deepest reasoning, multi-step logic, handles ambiguity. Slowest and most expensive. | 🔴 Slower | 🔴 Highest | Claude Opus, Claude Sonnet (extended thinking), o3 |

Which Tier for Which Finance Task?

| Finance task | Recommended tier | Why |
| --- | --- | --- |
| Document classification (invoice vs receipt vs complaint) | ⚡ Lightweight | Simple pattern matching, speed matters, lowest cost |
| Invoice data extraction (fields → JSON) | ⚡ Lightweight | Structured extraction doesn't need deep reasoning |
| Transaction categorization & routing | ⚡ Lightweight | High volume, low complexity per item |
| Customer complaint response drafts | 🎯 Balanced | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | 🧠 Capable | Needs structured reasoning, data citation, actionable recommendations |
| Credit committee narrative | 🧠 Capable | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | 🧠 Capable or 🔬 Frontier | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | 🔬 Frontier | Deep reasoning across large datasets, highest accuracy needed |
| Bulk monthly assessments (200+ merchants) | ⚡ Lightweight | Cost-effective at scale - use the cheapest model that meets the quality threshold |
✅ The golden rule: Start with the cheapest tier that might work. Test it. If quality isn't good enough, move up one tier. Don't start with Frontier for a task that Lightweight can handle - you'll pay up to 100x the cost for no quality gain.

Special Category: Reasoning Models

Some models have a "thinking" or "extended thinking" mode where they reason through problems step-by-step internally before responding. This is different from Chain-of-Thought prompting - the model does it automatically.

| Feature | Standard model | Reasoning model / extended thinking |
| --- | --- | --- |
| How it works | Generates answer directly | Thinks internally first, then generates answer |
| Speed | Faster | Slower (thinking takes time) |
| Cost | Lower | Higher (thinking tokens count) |
| Best for | Most tasks - extraction, drafts, classification | Math, logic puzzles, complex multi-step analysis, ambiguous problems |
| Finance example | Extract invoice fields, draft complaint response | Calculate DSCR across multiple scenarios, assess regulatory impact across 3 jurisdictions |

💡 When to use reasoning models: If your task involves math, multi-step logic, or comparing multiple documents, a reasoning model will outperform a standard model of the same size. For straightforward extraction and generation tasks, standard mode is faster and cheaper.

What This Means for Your Daily Tools

| Tool | Model selection | What you control |
| --- | --- | --- |
| Kiro (workshop) | Auto-selected by task | Your prompt quality - Kiro picks the model |
| Claude (Cowork/Desktop) | Assigned by plan tier | Your prompt quality - Anthropic picks the model |
| Cursor | You choose per conversation | Model selection + prompt quality |
| Bedrock Playground | Full control | Model, parameters, prompt - everything |
| Bedrock API | Full control + routing | Intelligent Prompt Routing can auto-select the cheapest model that meets quality |

💡 Key takeaway: In most AI tools, you don't choose the model - the tool does. Focus on writing great prompts and designing good workflows. The prompt engineering skills you learn work regardless of which model or tool you use. When you DO have model choice (Cursor, Bedrock), use the tier framework above.

Knowledge Cutoffs: Why They Matter for Finance

Every model has a knowledge cutoff date - it doesn't know about events after that date. Newer models have more recent cutoffs, but even the newest model won't know about last week's MAS circular.

Bedrock Pricing: What Finance Teams Need to Know

Pricing Model: Pay Per Token

PRICING FORMULA
Cost = (Input tokens × Input price) + (Output tokens × Output price)

Output tokens are typically 3-5x more expensive than input tokens, because generating text is computationally harder than reading it.
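The formula is simple enough to turn into a helper and sanity-check against the tables below. A sketch - the example prices are the Claude Sonnet 4 rates from the comparison table that follows:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# One merchant risk assessment on Claude Sonnet 4: ~400 input + ~800 output tokens
print(request_cost_usd(400, 800, 3.00, 15.00))        # 0.0132 -> $0.0132 per assessment
print(request_cost_usd(400, 800, 3.00, 15.00) * 200)  # 2.64   -> $2.64 for 200 per month
```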

Model Pricing Comparison (On-Demand, US regions)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $1.00 | $5.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $5.00 | $25.00 | Most complex tasks |

โš ๏ธ Pricing accurate as of April 2026 and subject to change. Check aws.amazon.com/bedrock/pricing for current rates.

Cost Per Merchant Risk Assessment

Using our workshop prompt template (~400 input tokens + ~800 output tokens):

| Model | Cost per assessment | 50/week | 200/month |
| --- | --- | --- | --- |
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.004400 | $0.220 | $0.880 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.022000 | $1.100 | $4.400 |

✅ The business case: Even Claude Sonnet 4 (high-quality) costs only $2.64/month for 200 merchant risk assessments. Compare that to an analyst spending 30 minutes each at $50/hour = $5,000/month. The AI is 99.95% cheaper.

Cost Optimization Strategies

| Strategy | Savings | How it works |
| --- | --- | --- |
| Right-size your model | Up to 143x | Use Nova Micro for classification, Sonnet for complex analysis - don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time) - perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models, complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template - pay full price once, then ~10% of the input price on each reuse |
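Prompt caching is the biggest lever when you reuse a long template. A sketch of how a cache point might be set in a Bedrock Converse call - the template and merchant data strings are hypothetical stand-ins, caching must be supported by your chosen model and region, and models enforce a minimum cacheable prefix length, so very short templates may not qualify:

```python
import boto3

client = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = "You are a Senior Risk Analyst... (the full reusable template)"
merchant_data = "MC-8842, Kopi Corner, $15,600, 4.1%, SGD, 2025-06-15"

response = client.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    system=[
        {"text": PROMPT_TEMPLATE},            # the part that never changes between calls
        {"cachePoint": {"type": "default"}},  # cache everything up to this marker
    ],
    # Only the per-merchant data changes, so the cached template is
    # billed at the reduced cache-read rate on every reuse.
    messages=[{"role": "user", "content": [{"text": merchant_data}]}],
)
```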

Right-Sizing Guide

| Task | Recommended model | Cost tier |
| --- | --- | --- |
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.035-0.06/1M input tokens |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80-1.00/1M input tokens |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00/1M input tokens |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-5.00/1M input tokens |

Context Windows: How Much Can the Model "See"?

The context window is the maximum amount of text the model can process at once - your prompt + the AI's response must fit within it.

| Model | Context window | Text equivalent | Practical meaning |
| --- | --- | --- | --- |
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |

💡 For finance: A typical merchant data file + prompt template + policy document fits easily within any model's context window. You'd only hit limits with very large documents (100+ page regulatory filings).

Token Estimation Quick Reference

| Content type | Tokens per page | Tokens per item |
| --- | --- | --- |
| Plain English text | ~250/page | - |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | - |
| A typical email | - | ~200 tokens |
| A merchant risk assessment | - | ~800 tokens |
| A credit committee narrative | - | ~600 tokens |
| An invoice (extracted text) | - | ~300 tokens |

Counting Tokens Before You Spend

Bedrock provides a CountTokens API that lets you check how many tokens your input will use - before you send the actual request. This is free (no charge for counting).

| What you can do | Why it matters |
| --- | --- |
| Estimate costs before sending requests | Know the cost before you commit - especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |

💡 Key point: Token counting is model-specific - the same text may produce different token counts on different models because each uses a different tokenizer. The CountTokens API returns the exact count for the model you specify.

Example: Count tokens with Python

PYTHON - CountTokens API

```python
import boto3

client = boto3.client("bedrock-runtime")

# Count tokens for a Converse-style request
response = client.count_tokens(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    input={
        "converse": {
            "messages": [
                {"role": "user", "content": [
                    {"text": "Assess this merchant's risk level based on the following data..."}
                ]}
            ],
            "system": [
                {"text": "You are a Senior Risk Analyst..."}
            ]
        }
    }
)

print(f"Input tokens: {response['inputTokens']}")
# Use this to estimate cost before running the actual inference
```

Token Quotas: Understanding Rate Limits

AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.

Key Terms

| Term | What it means |
| --- | --- |
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |

The Burndown Rate: Why Output Tokens Cost More Quota

For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.

| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
| --- | --- | --- | --- |
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |

⚠️ Important: You're only billed for actual tokens used (1,100 in the example above). The 5x burndown only affects your quota (rate limit), not your bill. But it means you can hit throttling limits faster with Claude 4+ models.

Why max_tokens Matters for Throughput

Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated. (The comparison below assumes a request with ~8,000 input tokens and ~1,000 actual output tokens.)

| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
| --- | --- | --- |
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
✅ Optimization tip: Set max_tokens close to your expected output size. For a merchant risk assessment (~800 tokens output), set max_tokens to 1,000-1,200 - not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
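In a Converse API call, this cap is the maxTokens field of inferenceConfig. A sketch sized for the ~800-token assessment output:

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    messages=[{"role": "user", "content": [{"text": "Assess this merchant's risk level..."}]}],
    # ~800 expected output tokens plus headroom - far less quota is
    # reserved up front than with a 4,096 or 32,000 cap.
    inferenceConfig={"maxTokens": 1200},
)

print(response["output"]["message"]["content"][0]["text"])
```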

Monitor Your Usage

Use Amazon CloudWatch to track your token consumption:

Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.
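If you prefer to pull the numbers programmatically, the same metrics are available via the CloudWatch API. A sketch - the AWS/Bedrock namespace publishes InputTokenCount and OutputTokenCount per model ID; the model ID below is just an example:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

# Daily input-token totals for one model over the past week
resp = cw.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-20250514-v1:0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=86_400,          # one datapoint per day
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))
```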

Data Privacy & Security

💡 With Amazon Bedrock:
  • Your data stays in your AWS account - it is not used to train the models
  • You control the region, encryption, and access
  • This is different from using ChatGPT or Claude.ai directly - Bedrock provides enterprise-grade data isolation
  • All API calls are logged and auditable via CloudTrail
  • You can restrict which models and regions are available to your team

Token Cost in AI Tools (Kiro, Claude, Cursor)

Understanding tokens isn't just about API pricing - it directly affects how you use AI tools like Kiro every day.

Every Message Sends Your Full History

When you chat with Kiro (or any AI assistant), the tool sends all previous messages in the conversation as context to the model. This means:

| Message # | What gets sent to the model | Approx. tokens |
| --- | --- | --- |
| 1st message | Steering files + your message | ~1,500 |
| 5th message | Steering + 4 previous exchanges + your message | ~8,000 |
| 15th message | Steering + 14 previous exchanges + your message | ~25,000+ |
| After summarization | Steering + compressed summary + your message | ~4,000 |

When the context reaches 80% of the model's limit, Kiro automatically summarizes older messages to free up space. This keeps the conversation going but compresses earlier details.
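To see why the numbers in the table climb so quickly, here's a rough model of that growth - the ~1,600 tokens per exchange is an assumed average (your message plus the AI's reply), not a fixed number:

```python
def input_tokens_for_message(n: int, steering: int = 1_500, per_exchange: int = 1_600) -> int:
    """Approximate input tokens sent with message n of a session."""
    return steering + per_exchange * (n - 1)

for n in (1, 5, 15):
    print(f"Message {n}: ~{input_tokens_for_message(n):,} input tokens")
# Message 1: ~1,500 / message 5: ~7,900 / message 15: ~23,900 -
# the same order of magnitude as the table above.
```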

💡 Practical tip: Start a new session when you switch to a genuinely different task. A fresh session resets the conversation history to zero - you only pay for steering files + your new message, not the accumulated history from the previous task.

Steering Files = Tokens on Every Request

Kiro's steering files (.kiro/steering/*.md) provide persistent context - but they consume tokens on every single message you send. The official Kiro docs state:

"Context files and agent resources consume tokens from your context window on every request, whether referenced or not."
- Kiro Docs: Context Management

This is why Kiro offers inclusion modes - so you control what loads when:

| Mode | When it loads | Token impact | Use for |
| --- | --- | --- | --- |
| always | Every message, every session | 🔴 Constant - sent with every request | Essential rules only: company name, currency, PII policy |
| fileMatch | Only when matching files are in context | 🟡 Conditional - only when relevant | Component patterns, API standards, test conventions |
| manual | Only when you reference it with # | 🟢 On-demand - zero cost until used | Heavy reference docs, deployment guides, project status |
| auto | When your request matches the description | 🟡 Conditional - AI decides relevance | Specialized domain knowledge, complex workflows |
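The mode is set in the steering file's front matter. A sketch of what a fileMatch steering file might look like - the inclusion and fileMatchPattern keys follow Kiro's front-matter convention, and the pattern and rules shown here are illustrative only:

```markdown
---
inclusion: fileMatch
fileMatchPattern: "src/components/**/*.tsx"
---

# Component Conventions

- Use functional components with typed props
- Format all currency values as SGD
```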

The Math: Why This Matters

Consider a 30-message session with a 500-token always-on steering file: the file is re-sent with every message, so it alone costs 500 × 30 = 15,000 tokens over the session - before counting your actual prompts and the AI's replies.

Multiply by 24 participants × multiple sessions per day - it adds up.

✅ Best practices for steering files:
  • Keep always files under 300 words - essential rules only, no lengthy explanations
  • Move heavy docs to manual - project status, deployment guides, reference materials
  • Use fileMatch for specialized rules - they only load when you're working on matching files
  • Write in Markdown - roughly 60% fewer tokens than HTML for the same structure (see above)
  • Start fresh sessions for new tasks - resets conversation history, keeps token usage low
๐Ÿ” Example: Our workshop steering files

In this workshop, we use three steering files with different inclusion modes:

| File | Mode | Why |
| --- | --- | --- |
| workshop-rules.md | always | ~200 words. Company name, domain context, currency defaults, PII rules. Needed in every interaction. |
| project-status.md | manual | ~2,000 words. Full project history, session notes, what's been built. Only loaded when the instructor needs continuity. |
| deployment.md | manual | ~500 words. S3 bucket names, CloudFront IDs, deploy commands. Only loaded when deploying. |

If project-status.md were set to always, it would add ~2,500 tokens to every single message - even when asking a simple question about slide formatting. By setting it to manual, those tokens are only spent when we actually need the project context.

Workshop Connection

| Concept | Where you'll see it |
| --- | --- |
| Token estimation | Prompt Engineering: Understanding why prompt length matters for cost and quality |
| Model selection | Demo: Model Arena - compare 3 models on the same task |
| Cost optimization | Prompt Management and Optimization |
| Right-sizing models | Agentic AI: Intelligent Prompt Routing in workflow automation |
| Context windows | Managing long conversations and knowing when to start fresh |
| Steering token cost | Kiro Stack: Creating steering files - keep always files concise, use manual for heavy docs |