💰 Tokenization, Pricing & Model Selection

How AI models process text, what it costs, and why different models perform differently - explained for finance teams.

What Are Tokens?

LLMs don't process words - they process tokens. A token is a piece of text, roughly:

TOKEN ESTIMATION
1 token ≈ 4 characters ≈ ¾ of a word
"AnyCompany Financial Group" = 4 tokens
"The merchant's chargeback rate is 4.1%" = 9 tokens
A typical merchant risk assessment (500 words) ≈ 650 tokens

Why this matters for cost: You pay per token - both for what you send (input) and what the AI generates (output). Longer prompts and longer outputs cost more.
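Before reaching for the exact CountTokens API (covered later), you can script the rule of thumb above. A minimal sketch - the constants are heuristics, not exact counts:

```python
def tokens_from_chars(text: str) -> int:
    """Rough estimate: 1 token is about 4 characters."""
    return max(1, len(text) // 4)

def tokens_from_words(word_count: int) -> int:
    """Rough estimate: 1 token is about 3/4 of a word."""
    return round(word_count / 0.75)

print(tokens_from_chars("The merchant's chargeback rate is 4.1%"))  # ~9
print(tokens_from_words(500))  # ~667 - in line with the ~650 figure above
```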

Quick Token Estimation

| Content | Approximate tokens |
| --- | --- |
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |

How Tokenization Works

Models use subword tokenization - they break text into meaningful pieces, not whole words:

| Text | Tokens | Count | Note |
| --- | --- | --- | --- |
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
โš ๏ธ Key insight for finance: Numbers and special characters use more tokens than you'd expect. A table of financial data with lots of numbers costs more tokens than the same amount of narrative text.
๐Ÿ” Why do numbers cost more tokens?

Tokenizers learn from text - and most text is words, not numbers.

During training, the tokenizer (BPE) merges the most frequent character pairs into tokens. Common words like "merchant" and "chargeback" appear millions of times, so they become efficient 1-2 token units. But numbers are different:

| Text | Tokens | Why |
| --- | --- | --- |
| "merchant" | 1 token | Very common word - learned as a single unit |
| "chargeback" | 2 tokens | "charge" + "back" - common subwords |
| "$4,200.50" | 6+ tokens | "$" + "4" + "," + "200" + "." + "50" - each piece is separate |
| "SGD 18,642" | 5+ tokens | "SG" + "D" + " 18" + "," + "642" |

Three reasons numbers split into many tokens:

  1. Each digit combination is rare. "200" exists as a token, but "4,200" as a specific sequence is much rarer than common words. The tokenizer never learned to merge it.
  2. Punctuation breaks merging. The comma in "$4,200" and the period in "4,200.50" act as boundaries - each becomes its own token.
  3. Numbers have high entropy. After "$", the next character could be anything (4, 42, 4,200, 42,000). The tokenizer can't predict it, so it keeps numbers as small pieces.

The practical impact:

A CSV row like `MC-8842, Kopi Corner, $15,600, 4.1%, SGD, 2025-06-15` uses ~25-30 tokens - while the same length of English text uses only ~12 tokens.
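You can see this fragmentation yourself with any open BPE tokenizer. A quick sketch using OpenAI's tiktoken library (`pip install tiktoken`) - its vocabulary differs from the tokenizers used by Bedrock models, so exact counts will vary, but the pattern (common words merge, numbers fragment) is the same:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

for text in ["merchant", "chargeback", "$4,200.50", "SGD 18,642"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```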

What you can do:

  • Summarize data in narrative form when possible ("Revenue grew 271% from $4,200 to $15,600") rather than pasting raw CSV tables
  • Use the CountTokens API to check before sending large datasets
  • Extract only the relevant columns rather than sending entire spreadsheets

Why Different Models Tokenize Differently

Each model family has its own tokenizer - the same text may tokenize to a different number of tokens on different models.

💡 Document format affects token cost too

The format you use to give documents to AI matters for cost. The same section heading, `## 4.2 Chargeback Thresholds`, costs ~8 tokens in Markdown. The HTML equivalent, `<h2 id="section-4-2">4.2 Chargeback Thresholds</h2>`, costs ~20 tokens - 2.5x more for the same information. This is why all your prompt templates, skills, and steering files use Markdown (.md) - it's the most token-efficient structured format that both humans and AI can read.

Why Different Models Perform Differently

Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture - which affects speed, quality, and cost.

Model Size = Brain Size

Parameters are the "knowledge" stored in the model. More parameters = more capacity for complex reasoning, but also slower and more expensive.

| Model size | Parameters | Analogy | Good for |
| --- | --- | --- | --- |
| Small | 1-17B | Junior analyst - fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst - balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant - thorough but expensive | Complex reasoning, multi-step analysis, research |

Choosing the Right Model for the Task

Different tasks need different trade-offs. Match the model to the job - not every task needs the most powerful option:

| Task type | What matters most | Model category | Examples on Bedrock |
| --- | --- | --- | --- |
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |

💡 The key question: "What's the most cost-effective model that meets my quality threshold for this specific task?" - not "which model is the best overall." The Model Arena demo helps you answer this by comparing outputs side-by-side.

Why Does This Happen?

💡 The Three Factors

1. Parameters (size): More parameters = more "knowledge" stored in the model. A 70B model has seen more patterns during training than a 7B model. But more parameters means more computation per token → slower and more expensive.

2. Training data & techniques: Claude models are trained with extensive safety tuning (RLHF/DPO), which makes them more cautious and thorough. Llama models are trained as general-purpose open-source models. Nova models are optimized for AWS integration and cost efficiency.

3. Architecture optimizations: Some models use techniques like mixture-of-experts (MoE), where only a fraction of parameters activate per token - making them faster without losing quality. Others use knowledge distillation to compress a large model's knowledge into a smaller one.

What We Saw in the Demo

In the Merchant Risk Assessment demo, you may have noticed that the same prompt and merchant data produced noticeably different outputs across models - different depth, tone, and sometimes a different risk rating.

This is why model selection matters - and why we use decision rules in the prompt to enforce consistency across models.

Choosing the Right Model Tier for Finance Tasks

AI models come in different sizes and specializations. Instead of memorizing specific model names (which change frequently), learn to match your task complexity to the right model tier.

The 4 Model Tiers

| Tier | Characteristics | Speed | Cost | Examples (as of 2026) |
| --- | --- | --- | --- | --- |
| ⚡ Lightweight (small models, 1-17B params) | Fast, cheap, good for simple tasks. Limited reasoning depth. | 🟢 Fastest | 🟢 Lowest | Nova Micro, Nova Lite, Haiku |
| 🎯 Balanced (mid-range, 17-70B params) | Good quality + reasonable speed. Handles structured analysis well. | 🟡 Fast | 🟡 Moderate | Nova Pro, Haiku 4.5, Llama Maverick, DeepSeek |
| 🧠 Capable (large models, 70B+ params) | Strong reasoning, nuanced analysis, reliable structured output. | 🟡 Moderate | 🔴 Higher | Claude Sonnet, Llama 70B, Gemini Pro |
| 🔬 Frontier (largest / reasoning-specialized) | Deepest reasoning, multi-step logic, handles ambiguity. Slowest and most expensive. | 🔴 Slower | 🔴 Highest | Claude Opus, Claude Sonnet (extended thinking), o3 |

Which Tier for Which Finance Task?

| Finance task | Recommended tier | Why |
| --- | --- | --- |
| Document classification (invoice vs receipt vs complaint) | ⚡ Lightweight | Simple pattern matching, speed matters, lowest cost |
| Invoice data extraction (fields → JSON) | ⚡ Lightweight | Structured extraction doesn't need deep reasoning |
| Transaction categorization & routing | ⚡ Lightweight | High volume, low complexity per item |
| Customer complaint response drafts | 🎯 Balanced | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | 🧠 Capable | Needs structured reasoning, data citation, actionable recommendations |
| Credit committee narrative | 🧠 Capable | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | 🧠 Capable or 🔬 Frontier | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | 🔬 Frontier | Deep reasoning across large datasets, highest accuracy needed |
| Bulk monthly assessments (200+ merchants) | ⚡ Lightweight | Cost-effective at scale - use the cheapest model that meets the quality threshold |
✅ The golden rule: Start with the cheapest tier that might work. Test it. If quality isn't good enough, move up one tier. Don't start with Frontier for a task that Lightweight can handle - you'll pay up to 100x the cost for no quality gain.

Special Category: Reasoning Models

Some models have a "thinking" or "extended thinking" mode where they reason through problems step-by-step internally before responding. This is different from Chain-of-Thought prompting - the model does it automatically.

| Feature | Standard model | Reasoning model / extended thinking |
| --- | --- | --- |
| How it works | Generates answer directly | Thinks internally first, then generates answer |
| Speed | Faster | Slower (thinking takes time) |
| Cost | Lower | Higher (thinking tokens count) |
| Best for | Most tasks - extraction, drafts, classification | Math, logic puzzles, complex multi-step analysis, ambiguous problems |
| Finance example | Extract invoice fields, draft complaint response | Calculate DSCR across multiple scenarios, assess regulatory impact across 3 jurisdictions |

💡 When to use reasoning models: If your task involves math, multi-step logic, or comparing multiple documents, a reasoning model will outperform a standard model of the same size. For straightforward extraction and generation tasks, standard mode is faster and cheaper.

What This Means for Your Daily Tools

| Tool | Model selection | What you control |
| --- | --- | --- |
| Kiro (workshop) | Auto-selected by task | Your prompt quality - Kiro picks the model |
| Claude (Cowork/Desktop) | Assigned by plan tier | Your prompt quality - Anthropic picks the model |
| Cursor | You choose per conversation | Model selection + prompt quality |
| Bedrock Playground | Full control | Model, parameters, prompt - everything |
| Bedrock API | Full control + routing | Intelligent Prompt Routing can auto-select the cheapest model that meets quality |

💡 Key takeaway: In most AI tools, you don't choose the model - the tool does. Focus on writing great prompts and designing good workflows. The prompt engineering skills you learn work regardless of which model or tool you use. When you DO have model choice (Cursor, Bedrock), use the tier framework above.

Knowledge Cutoffs: Why They Matter for Finance

Every model has a knowledge cutoff date - it doesn't know about events after that date. Newer models have more recent cutoffs, but even the newest model won't know about last week's MAS circular.

Bedrock Pricing: What Finance Teams Need to Know

Pricing Model: Pay Per Token

PRICING FORMULA
Cost = (Input tokens × Input price) + (Output tokens × Output price)

Output tokens are typically 3-5x more expensive than input tokens, because generating text is computationally harder than reading it.
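The formula is simple enough to turn into a helper and sanity-check against the tables below. A sketch - the example prices are the Claude Sonnet 4 rates from the comparison table that follows:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# One merchant risk assessment on Claude Sonnet 4: ~400 input + ~800 output tokens
print(request_cost_usd(400, 800, 3.00, 15.00))        # 0.0132 -> $0.0132 per assessment
print(request_cost_usd(400, 800, 3.00, 15.00) * 200)  # 2.64   -> $2.64 for 200 per month
```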

Model Pricing Comparison (On-Demand, US regions)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $1.00 | $5.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $5.00 | $25.00 | Most complex tasks |

โš ๏ธ Pricing accurate as of April 2026 and subject to change. Check aws.amazon.com/bedrock/pricing for current rates.

Cost Per Merchant Risk Assessment

Using our workshop prompt template (~400 input tokens + ~800 output tokens):

| Model | Cost per assessment | 50/week | 200/month |
| --- | --- | --- | --- |
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.004400 | $0.220 | $0.880 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.022000 | $1.100 | $4.400 |

✅ The business case: Even Claude Sonnet 4 (high-quality) costs only $2.64/month for 200 merchant risk assessments. Compare that to an analyst spending 30 minutes each at $50/hour = $5,000/month. The AI is 99.95% cheaper.

Cost Optimization Strategies

| Strategy | Savings | How it works |
| --- | --- | --- |
| Right-size your model | Up to 143x | Use Nova Micro for classification, Sonnet for complex analysis - don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time) - perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models, complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template - pay full price once, then ~10% of the input price on each reuse |
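Prompt caching is the biggest lever when you reuse a long template. A sketch of how a cache point might be set in a Bedrock Converse call - the template and merchant data strings are hypothetical stand-ins, caching must be supported by your chosen model and region, and models enforce a minimum cacheable prefix length, so very short templates may not qualify:

```python
import boto3

client = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = "You are a Senior Risk Analyst... (the full reusable template)"
merchant_data = "MC-8842, Kopi Corner, $15,600, 4.1%, SGD, 2025-06-15"

response = client.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    system=[
        {"text": PROMPT_TEMPLATE},            # the part that never changes between calls
        {"cachePoint": {"type": "default"}},  # cache everything up to this marker
    ],
    # Only the per-merchant data changes, so the cached template is
    # billed at the reduced cache-read rate on every reuse.
    messages=[{"role": "user", "content": [{"text": merchant_data}]}],
)
```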

Right-Sizing Guide

| Task | Recommended model | Cost tier |
| --- | --- | --- |
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.035-0.06/1M input tokens |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80-1.00/1M input tokens |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00/1M input tokens |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-5.00/1M input tokens |

Context Windows: How Much Can the Model "See"?

The context window is the maximum amount of text the model can process at once - your prompt + the AI's response must fit within it.

| Model | Context window | Text equivalent | Practical meaning |
| --- | --- | --- | --- |
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |

💡 For finance: A typical merchant data file + prompt template + policy document fits easily within any model's context window. You'd only hit limits with very large documents (100+ page regulatory filings).

Token Estimation Quick Reference

| Content type | Tokens per page | Tokens per item |
| --- | --- | --- |
| Plain English text | ~250/page | - |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | - |
| A typical email | - | ~200 tokens |
| A merchant risk assessment | - | ~800 tokens |
| A credit committee narrative | - | ~600 tokens |
| An invoice (extracted text) | - | ~300 tokens |

Counting Tokens Before You Spend

Bedrock provides a CountTokens API that lets you check how many tokens your input will use - before you send the actual request. This is free (no charge for counting).

| What you can do | Why it matters |
| --- | --- |
| Estimate costs before sending requests | Know the cost before you commit - especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |

💡 Key point: Token counting is model-specific - the same text may produce different token counts on different models because each uses a different tokenizer. The CountTokens API returns the exact count for the model you specify.

Example: Count tokens with Python

PYTHON - CountTokens API

```python
import boto3

client = boto3.client("bedrock-runtime")

# Count tokens for a Converse-style request
response = client.count_tokens(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    input={
        "converse": {
            "messages": [
                {"role": "user", "content": [
                    {"text": "Assess this merchant's risk level based on the following data..."}
                ]}
            ],
            "system": [
                {"text": "You are a Senior Risk Analyst..."}
            ]
        }
    }
)

print(f"Input tokens: {response['inputTokens']}")
# Use this to estimate cost before running the actual inference
```

Token Quotas: Understanding Rate Limits

AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.

Key Terms

| Term | What it means |
| --- | --- |
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |

The Burndown Rate: Why Output Tokens Cost More Quota

For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.

| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
| --- | --- | --- | --- |
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |

⚠️ Important: You're only billed for actual tokens used (1,100 in the example above). The 5x burndown only affects your quota (rate limit), not your bill. But it means you can hit throttling limits faster with Claude 4+ models.

Why max_tokens Matters for Throughput

Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated. (The comparison below assumes a request with ~8,000 input tokens and ~1,000 actual output tokens.)

| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
| --- | --- | --- |
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
✅ Optimization tip: Set max_tokens close to your expected output size. For a merchant risk assessment (~800 tokens output), set max_tokens to 1,000-1,200 - not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
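In a Converse API call, this cap is the maxTokens field of inferenceConfig. A sketch sized for the ~800-token assessment output:

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    messages=[{"role": "user", "content": [{"text": "Assess this merchant's risk level..."}]}],
    # ~800 expected output tokens plus headroom - far less quota is
    # reserved up front than with a 4,096 or 32,000 cap.
    inferenceConfig={"maxTokens": 1200},
)

print(response["output"]["message"]["content"][0]["text"])
```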

Monitor Your Usage

Use Amazon CloudWatch to track your token consumption:

Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.
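If you prefer to pull the numbers programmatically, the same metrics are available via the CloudWatch API. A sketch - the AWS/Bedrock namespace publishes InputTokenCount and OutputTokenCount per model ID; the model ID below is just an example:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

# Daily input-token totals for one model over the past week
resp = cw.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-sonnet-4-20250514-v1:0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=86_400,          # one datapoint per day
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))
```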

Data Privacy & Security

💡 With Amazon Bedrock:
  • Your data stays in your AWS account - it is not used to train the models
  • You control the region, encryption, and access
  • This is different from using ChatGPT or Claude.ai directly - Bedrock provides enterprise-grade data isolation
  • All API calls are logged and auditable via CloudTrail
  • You can restrict which models and regions are available to your team

Token Cost in AI Tools (Kiro, Claude, Cursor)

Understanding tokens isn't just about API pricing - it directly affects how you use AI tools like Kiro every day.

Every Message Sends Your Full History

When you chat with Kiro (or any AI assistant), the tool sends all previous messages in the conversation as context to the model. This means:

| Message # | What gets sent to the model | Approx. tokens |
| --- | --- | --- |
| 1st message | Steering files + your message | ~1,500 |
| 5th message | Steering + 4 previous exchanges + your message | ~8,000 |
| 15th message | Steering + 14 previous exchanges + your message | ~25,000+ |
| After summarization | Steering + compressed summary + your message | ~4,000 |

When the context reaches 80% of the model's limit, Kiro automatically summarizes older messages to free up space. This keeps the conversation going but compresses earlier details.
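To see why the numbers in the table climb so quickly, here's a rough model of that growth - the ~1,600 tokens per exchange is an assumed average (your message plus the AI's reply), not a fixed number:

```python
def input_tokens_for_message(n: int, steering: int = 1_500, per_exchange: int = 1_600) -> int:
    """Approximate input tokens sent with message n of a session."""
    return steering + per_exchange * (n - 1)

for n in (1, 5, 15):
    print(f"Message {n}: ~{input_tokens_for_message(n):,} input tokens")
# Message 1: ~1,500 / message 5: ~7,900 / message 15: ~23,900 -
# the same order of magnitude as the table above.
```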

💡 Practical tip: Start a new session when you switch to a genuinely different task. A fresh session resets the conversation history to zero - you only pay for steering files + your new message, not the accumulated history from the previous task.

Steering Files = Tokens on Every Request

Kiro's steering files (.kiro/steering/*.md) provide persistent context - but they consume tokens on every single message you send. The official Kiro docs state:

"Context files and agent resources consume tokens from your context window on every request, whether referenced or not."
- Kiro Docs: Context Management

This is why Kiro offers inclusion modes - so you control what loads when:

| Mode | When it loads | Token impact | Use for |
| --- | --- | --- | --- |
| always | Every message, every session | 🔴 Constant - sent with every request | Essential rules only: company name, currency, PII policy |
| fileMatch | Only when matching files are in context | 🟡 Conditional - only when relevant | Component patterns, API standards, test conventions |
| manual | Only when you reference it with # | 🟢 On-demand - zero cost until used | Heavy reference docs, deployment guides, project status |
| auto | When your request matches the description | 🟡 Conditional - AI decides relevance | Specialized domain knowledge, complex workflows |
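The mode is set in the steering file's front matter. A sketch of what a fileMatch steering file might look like - the inclusion and fileMatchPattern keys follow Kiro's front-matter convention, and the pattern and rules shown here are illustrative only:

```markdown
---
inclusion: fileMatch
fileMatchPattern: "src/components/**/*.tsx"
---

# Component Conventions

- Use functional components with typed props
- Format all currency values as SGD
```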

The Math: Why This Matters

Consider a 30-message session with a 500-token always-on steering file: the file is re-sent with every message, so it alone costs 500 × 30 = 15,000 tokens over the session - before counting your actual prompts and the AI's replies.

Multiply by 24 participants × multiple sessions per day - it adds up.

✅ Best practices for steering files:
  • Keep always files under 300 words - essential rules only, no lengthy explanations
  • Move heavy docs to manual - project status, deployment guides, reference materials
  • Use fileMatch for specialized rules - they only load when you're working on matching files
  • Write in Markdown - roughly 60% fewer tokens than HTML for the same structure (see above)
  • Start fresh sessions for new tasks - resets conversation history, keeps token usage low
๐Ÿ” Example: Our workshop steering files

In this workshop, we use three steering files with different inclusion modes:

| File | Mode | Why |
| --- | --- | --- |
| workshop-rules.md | always | ~200 words. Company name, domain context, currency defaults, PII rules. Needed in every interaction. |
| project-status.md | manual | ~2,000 words. Full project history, session notes, what's been built. Only loaded when the instructor needs continuity. |
| deployment.md | manual | ~500 words. S3 bucket names, CloudFront IDs, deploy commands. Only loaded when deploying. |

If project-status.md were set to always, it would add ~2,500 tokens to every single message - even when asking a simple question about slide formatting. By setting it to manual, those tokens are only spent when we actually need the project context.

Workshop Connection

| Concept | Where you'll see it |
| --- | --- |
| Token estimation | Prompt Engineering: Understanding why prompt length matters for cost and quality |
| Model selection | Demo: Model Arena - compare 3 models on the same task |
| Cost optimization | Prompt Management and Optimization |
| Right-sizing models | Agentic AI: Intelligent Prompt Routing in workflow automation |
| Context windows | Managing long conversations and knowing when to start fresh |
| Steering token cost | Kiro Stack: Creating steering files - keep always files concise, use manual for heavy docs |