The Business of AI, Decoded

Context Window & Tokens Explained: Why Chatbots “Forget” (and How to Fix It)

By Sapumal Herath • Owner & Blogger, AI Buzz • Last updated: March 13, 2026 • Difficulty: Beginner

You give an AI assistant a clear instruction… and 10 minutes later it ignores it. Or it “forgets” a decision from earlier in the conversation. Or it confidently contradicts something you already told it.

This isn’t (always) because the model is “bad.” Most of the time, it’s because you hit a basic limit: the context window.

This guide explains tokens and context windows in plain English, why chatbots forget, and the practical patterns that make your results more consistent—without needing to be an engineer.

Note: This article is for educational purposes only. It is not legal, security, or compliance advice. Always follow your organization’s policies when sharing text, screenshots, or documents with AI tools.

🎯 What are tokens? (plain English)

Tokens are the “chunks” of text that AI models process. A token might be:

  • a whole word
  • part of a word
  • punctuation or spaces

Because models operate on tokens, not “words,” token counts affect:

  • cost (for API usage)
  • limits (how much the model can read + write at once)
  • quality (longer inputs can cause the model to miss details)

Quick token cheat sheet (approximate)

  • 1 token ≈ 4 characters (English)
  • 100 tokens ≈ 75 words (English)
  • Non-English text often uses more tokens for the same number of characters.
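The cheat sheet above can be turned into a quick back-of-the-envelope estimator. This is only a rough sketch of the ≈4 characters/token and ≈75 words/100 tokens rules of thumb; a real tokenizer (such as a BPE tokenizer) will give different counts, especially for non-English text or code.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token
    rule of thumb for English. Real tokenizers will differ."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough estimate using the ~100 tokens per 75 words rule."""
    return round(word_count * 100 / 75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
print(estimate_tokens_from_words(75))
```

Use these numbers for budgeting only, never for billing: the provider's own tokenizer is the source of truth for what you are charged.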

🧠 What is a context window? (the “working memory”)

The context window is the total amount of information the model can “see” at one time—your current message, plus some or all of the conversation history, plus any documents/tool results included.

Think of it as the model’s working memory for the current task. It is not the model’s training data. And it is not permanent memory.

Important: The context window includes both:

  • input tokens (your messages + history + retrieved text)
  • output tokens (the model’s reply)

So if you provide a huge prompt, you leave less “room” for the model to respond—and less room for earlier context to remain visible.
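That trade-off is easy to see with a little arithmetic. The sketch below assumes a hypothetical 128,000-token window and a small safety margin; both numbers are illustrative, so check your own model's documentation for its real limits.

```python
def output_room(window: int, input_tokens: int, safety_margin: int = 64) -> int:
    """Tokens left for the model's reply once the prompt, history,
    and any retrieved text are counted against the window."""
    return max(0, window - input_tokens - safety_margin)

# A 100k-token prompt in a 128k window leaves limited room to answer:
print(output_room(128_000, 100_000))  # 27936
```

If the result hits zero, something gets dropped or truncated, which is exactly when "forgetting" shows up.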

🧭 At a glance

  • Tokens = how text is counted.
  • Context window = how much the model can use as working memory.
  • Why chatbots “forget” = older or less relevant info gets pushed out, summarized, or ignored.
  • Best fixes = context hygiene: pin requirements, summarize, use RAG for documents, separate tasks, and reset threads when needed.

🧩 The 5 most common reasons chatbots “forget”

When an assistant seems forgetful, it’s usually one (or more) of these:

1) You exceeded the context window (history gets squeezed)

As conversations get longer, the model may not be able to include every previous message. Something has to give.

2) Your instructions are competing (the model picks the wrong one)

If you gave one instruction early (“keep it short”) and later asked for detail (“explain deeply”), the model may drift or average them.

3) The “signal-to-noise ratio” got worse

Long pasted logs, repeated content, or giant transcripts can bury the important detail.

4) You changed tasks mid-thread (topic drift)

If you mix five jobs in one thread—research, writing, editing, planning, and policy—the model’s “working set” becomes messy.

5) The model guessed instead of admitting uncertainty

This is related to hallucinations: when the model can’t clearly see the needed info, it may fill gaps with confident-sounding text unless you force it to say “unclear.”

⚙️ The “context budget” model (simple and practical)

Imagine you have a fixed budget (the context window). You are spending that budget on:

  • Task instructions (what you want)
  • Constraints (style, format, rules)
  • Evidence (facts, docs, excerpts)
  • Conversation history (prior decisions)
  • Output (the answer you want back)

If you spend too much on evidence (copy/paste everything), you lose room for reasoning and output. If you spend too much on output, you lose room for history and evidence.

High-quality prompting is basically budgeting attention.
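The budget metaphor can be made literal. The numbers below are invented for illustration, but sketching your own split this way makes it obvious when evidence is crowding out everything else.

```python
# Hypothetical split of one context budget across the five categories above.
BUDGET = 8_000  # illustrative window size, not any specific model's limit

spend = {
    "instructions": 300,
    "constraints": 200,
    "evidence": 5_000,   # pasted excerpts dominate this example
    "history": 1_500,
    "output": 1_000,
}

total = sum(spend.values())
assert total <= BUDGET, f"Over budget by {total - BUDGET} tokens"
print(f"Used {total}/{BUDGET} tokens; evidence share: {spend['evidence'] / total:.0%}")
```

Here evidence alone eats well over half the budget, which is a hint to excerpt or retrieve instead of pasting.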

✅ Practical checklist: Make chatbots forget less (copy/paste)

📌 A) Pin your requirements (keep them “sticky”)

  • Put the most important constraints in a short block near the top: audience, tone, format, do/don’t rules.
  • Repeat only the essentials when the thread gets long: “Reminder: keep it under 700 words; include a checklist; avoid hype.”
  • Ask the model to restate your requirements before drafting: “Confirm the rules in bullets, then write.”

🧼 B) Improve “context hygiene”

  • Don’t paste everything. Paste only the relevant excerpt, and label where it came from (document name, section) so you can trace it later.
  • Remove duplicates, boilerplate, signatures, and irrelevant chat history.
  • Prefer structured inputs: bullet points, tables, and labeled sections.

🧾 C) Summarize and continue (when threads get long)

  • Every ~10–20 turns, ask: “Summarize the key decisions, constraints, and open questions in 10 bullets.”
  • Start a fresh thread with that summary as the “source of truth.”
  • This reduces drift and keeps the working memory clean.
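The "summarize and continue" pattern looks like this in code. This is a toy sketch: the summary string here is a stand-in, where a real workflow would ask the model itself to write the 10-bullet summary.

```python
def compress_history(turns: list[str], keep_last: int = 4) -> list[str]:
    """Collapse older turns into a single summary line and keep the
    most recent turns verbatim. The summary text is a placeholder;
    in practice the model writes it on request."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = f"[Summary of {len(older)} earlier turns: decisions, constraints, TODOs]"
    return [summary] + recent

history = [f"turn {i}" for i in range(10)]
print(compress_history(history))
```

The point of the design is that the working set stays small and recent, while the decisions from earlier turns survive in compressed form.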

📚 D) Use RAG for long documents (instead of pasting full docs)

  • If you regularly work with policies, manuals, or knowledge bases, use a retrieval workflow (RAG) so the model can pull only the needed sections.
  • Require citations or section references inside your workflow when possible.
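The core idea of retrieval can be shown without any AI at all. The toy ranker below scores document chunks by word overlap with the question; real RAG systems use embeddings, but the budgeting principle is the same: send only the top-k relevant sections, not the whole document. The example policy text is invented.

```python
def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank chunks by shared words with the question
    and keep the top k. A real system would use embedding similarity."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

policy = [
    "Refunds are issued within 30 days of purchase.",
    "Our office hours are 9am to 5pm on weekdays.",
    "Shipping to Europe takes 5 to 7 business days.",
]
print(top_chunks("How long do refunds take after purchase?", policy, k=1))
```

Instead of spending thousands of tokens on the full policy, the prompt carries one relevant sentence, leaving the rest of the budget for instructions and output.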

🧑‍⚖️ E) Force “Observation vs Inference”

  • Add: “If you can’t find the info in the provided context, say ‘unclear’ and list what’s missing.”
  • This reduces made-up details when the model’s context is incomplete.

🧪 Mini-labs (2 no-code exercises)

Mini-lab 1: The “Pin the spec” pattern

Goal: keep the model consistent across a long thread.

Copy/paste prompt:

  • “Pinned spec (do not forget):
  • Audience: ____
  • Format: ____
  • Must include: ____
  • Must avoid: ____
  • Word count: ____
  • Step 1: Repeat the pinned spec back to me in 5 bullets.
  • Step 2: Produce the output.”

What good looks like: the model restates the rules correctly and stays aligned in the draft.

Mini-lab 2: The “Summarize + restart” reset

Goal: stop drift when a conversation gets long.

Steps:

  1. Ask: “Summarize everything important so far: decisions, constraints, facts, and TODOs. Keep it under 150 tokens.”
  2. Open a new chat and paste that summary as the first message.
  3. Continue the task from the clean summary.

What good looks like: fewer contradictions, better focus, and less “forgetting.”

🚩 Red flags (you need to change your workflow)

  • You paste full documents repeatedly instead of excerpting or retrieving.
  • The assistant starts contradicting earlier constraints (length, tone, format).
  • You’re mixing unrelated tasks in one thread and quality keeps dropping.
  • The assistant stops admitting uncertainty and starts “confident guessing.”


🏁 Conclusion

Chatbots don’t “forget” the way humans forget. They lose access to earlier information when it no longer fits cleanly inside the context window—or when it gets buried under noise.

The fix is practical: pin the spec, keep context clean, summarize and restart, and use retrieval instead of pasting everything. If you do those four things, the same model will suddenly feel much more reliable.

❓ Frequently Asked Questions: Context Window & Tokens

1. Does a larger context window always produce better AI outputs?

Not always. Research consistently shows that most models experience “lost in the middle” degradation — where information placed in the center of a very long context window is processed less reliably than information at the beginning or end. A 1 million token context window does not guarantee uniform attention across all content. For critical information, position it at the start or end of your prompt — and validate outputs against a known result when working with very long contexts.

2. Can sensitive data “leak” between different users’ sessions through a shared context window?

Not through the context window itself — but through poorly implemented session management. Each API call creates an isolated context. However, if an application incorrectly persists or shares session state between users — a common implementation error — one user’s context can bleed into another’s. This is a critical AI Security and DLP concern for any multi-user AI application.

3. Do tokens cost the same across all AI providers — and how does this affect total cost of ownership?

No — token pricing varies significantly across providers and even across models from the same provider. Input tokens and output tokens are often priced differently, with output tokens typically costing more. For high-volume production systems, token cost is a primary driver of total cost of ownership — making token efficiency a core engineering concern alongside accuracy. Factor this into your Buy vs. Build analysis before committing to a specific model.
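The input/output price split is easy to model. The rates below are placeholders chosen for illustration, not any provider's actual pricing; look up current rates before doing a real cost analysis.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices.
    Output tokens are typically priced higher than input tokens."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Hypothetical rates: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(input_tokens=2_000, output_tokens=800,
                    in_price_per_m=3.0, out_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0180
```

At scale the asymmetry matters: in this example the 800 output tokens cost twice as much as the 2,000 input tokens, which is why trimming verbose replies often saves more than trimming prompts.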

4. Can an attacker exploit the context window to extract system prompt instructions from a deployed AI application?

Yes — this is a well-documented attack called “system prompt extraction.” Through carefully crafted prompt injection sequences, an attacker can sometimes cause the model to repeat or paraphrase its system prompt — revealing proprietary instructions, guardrail logic, or confidential configuration details. Mitigate this by instructing the model never to repeat its system prompt and testing for this vulnerability during every LLM Red Teaming exercise.

5. How does context window size affect the performance of a RAG system — and is bigger always better?

Bigger is not always better for RAG. A very large context window can actually reduce RAG system performance by allowing the system to stuff too many retrieved chunks into the context — diluting the most relevant information with marginally relevant content. Optimal RAG performance typically comes from precise retrieval of a small number of highly relevant chunks, not from maximizing the volume of retrieved content that fits in the available context window.
