Large Language Models — How Chatbots Learn Language and Why They Hallucinate

You type a question; seconds later a confident paragraph arrives: grammatically fluent, occasionally wrong, sometimes stunningly useful. The technology behind this experience is a large language model (LLM), a neural network trained on vast amounts of text to predict the next word (technically, the next token) in a sequence. There is no explicit database of facts inside and no human curator approving each sentence, only pattern compression at scale producing behavior that feels like understanding.

This guide explains how LLMs are built and trained, why they hallucinate plausible falsehoods, how they relate to broader AGI debates, and what practical expectations fit 2026 capabilities without mysticism or doom.

Language modeling — the core task

At root, an LLM learns P(next token | previous tokens), a probability distribution over the vocabulary for the next position. Given “The capital of France is,” most of the probability mass lands on “Paris.” Given “The capital of France is renowned for its,” the likely continuations shift to cuisine, architecture, and so on.

Training minimizes cross-entropy loss: the model is penalized whenever it assigns low probability to the token that actually comes next in the training corpus. As training data and parameter counts scale up, surprisal on held-out text falls and perplexity (the exponential of the average per-token loss) improves. Better perplexity correlates with useful downstream behavior but does not guarantee factual reliability.
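To make the loss concrete, here is a minimal sketch of cross-entropy and perplexity computed from the probabilities a model assigned to the tokens that really came next. The two probability lists are invented for illustration and do not come from any real model.

```python
import math

def cross_entropy(probs_of_actual_tokens):
    """Average negative log-probability (in nats) given to the true next tokens."""
    return -sum(math.log(p) for p in probs_of_actual_tokens) / len(probs_of_actual_tokens)

def perplexity(probs_of_actual_tokens):
    """Exponentiated cross-entropy; lower means the model was less surprised."""
    return math.exp(cross_entropy(probs_of_actual_tokens))

# Hypothetical probabilities two models gave to the same four true next tokens.
confident_model = [0.9, 0.8, 0.95, 0.85]
surprised_model = [0.2, 0.1, 0.3, 0.05]

print(perplexity(confident_model))  # close to 1: rarely surprised
print(perplexity(surprised_model))  # much larger: often surprised
```

A model that always put probability 0.5 on the right token would have a perplexity of exactly 2, as if it were flipping a coin between two candidates at every step.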

Tokens are subword units produced by schemes such as Byte Pair Encoding or SentencePiece; “unbelievable” might split into “un”, “believ”, “able.” Vocabulary sizes of roughly 32k to 128k or more balance efficiency against coverage. Everything, including code, emoji, and Chinese characters, maps to token IDs.
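The idea behind Byte Pair Encoding can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair. This is a toy version under simplifying assumptions (a tiny invented word list, no byte-level handling, no special tokens); production tokenizers are considerably more involved, and the splits it produces depend entirely on the training words.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Invented training words, chosen only to share some substrings.
training_words = ["unbelievable", "believable", "unable", "believe", "able", "unstable"]
merges = learn_bpe(training_words, num_merges=10)
print(tokenize("unbelievable", merges))
```

Whatever the split turns out to be, joining the pieces always reproduces the original word, which is why any text can be mapped to token IDs and back.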

Generation is autoregressive: the model samples or greedily selects one token, appends it to the sequence, and repeats until a stop condition is met. Temperature and top-p sampling control randomness; a higher temperature yields more varied but more erratic output.

There is no separate module labeled “truth.” Factual statements emerge because the training text repeatedly associated “Paris” with “capital of France.”
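Temperature and top-p are easier to grasp in code. The sketch below assumes the model has already produced raw scores (logits) for each candidate token; the function names and the example logits are illustrative, not any vendor's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index after applying temperature and top-p."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(softmax(logits, temperature=0.5))
print(softmax(logits, temperature=2.0))
print(sample_next(logits, temperature=1.0, top_p=0.9))
```

With a very small top_p only the single most likely token survives the filter, which is equivalent to greedy selection.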

Transformers — architecture that scaled

Before 2017, RNNs and LSTMs processed sequences step-by-step — hard to parallelize on GPUs. “Attention Is All You Need” (Vaswani et al.) introduced the Transformer: self-attention layers letting each token weigh relationships to all others in context window — parallelizable, scales with compute.

Key components:

Self-attention — queries, keys, values matrices; each position attends to relevant positions — “it” links to antecedent noun across long distances better than old architectures.

Multi-head attention — parallel attention patterns — syntax in one head, coreference in another — emergent not hand-designed.

Feed-forward layers — per-position MLP transformations.

Layer normalization and residuals — training stability for deep stacks (dozens to 100+ layers in frontier models).

Positional encoding supplies order information, since self-attention without it is insensitive to token order (it treats the input as an unordered set). The schemes have evolved to rotary embeddings (RoPE) and ALiBi, which extend context handling.

These blocks are stacked in three main configurations: encoder-only (BERT: bidirectional, good for classification), decoder-only (GPT: autoregressive generation), and encoder-decoder (T5: translation, summarization). Chatbots are predominantly decoder-only models scaled to massive size.
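The self-attention component listed above can be sketched with plain Python lists. This is a single head with a causal mask (each token attends only to itself and earlier tokens, as in decoder-only models); the identity projection matrices and the three toy token vectors are assumptions made purely to keep the numbers readable.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v, causal=True):
    """x holds one row per token. Returns (outputs, attention_weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # similarity of every query with every key
    weights = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Hide future positions so token i cannot look ahead.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        weights.append(softmax(scaled))
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical projection weights
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three toy token embeddings
outputs, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the weight matrix sums to 1 and says how much that token draws on each visible position; multi-head attention runs several such computations in parallel with different learned projections.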

Pretraining — the trillion-token diet

Pretraining exposes the model to internet-scale corpora: web crawls (Common Crawl), books, Wikipedia, code repositories (GitHub), scientific papers, and licensed datasets. Frontier runs consume trillions of tokens, take weeks on thousands of GPUs, and cost tens to hundreds of millions of dollars.

Data quality matters — deduplication, filtering toxic or low-quality pages, upsampling curated sources (Wikipedia, textbooks) improves efficiency. Data contamination — benchmark answers appearing in training text — inflates eval scores; labs hold out test sets carefully; skeptics remain.

The objective is pure next-token prediction. Along the way the model internalizes grammar, facts (statistically), reasoning patterns that appear in text, coding idioms, and cultural references. Whether this amounts to an implicit world model is debated; skeptics call it sophisticated mimicry without grounded understanding.

Scaling laws (Kaplan, Hoffmann et al.) — loss improves predictably with model size, data, compute — guided GPT-3 to GPT-4 investments. Diminishing returns and data walls emerging mid-2020s — synthetic data, multimodal video, reinforcement loops supplement raw text.
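Two widely quoted rules of thumb from the scaling-law literature give a feel for the numbers: training compute is roughly 6 FLOPs per parameter per token, and compute-optimal training uses on the order of 20 tokens per parameter. Both are approximations rather than exact laws, and the sketch below treats them as assumptions.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb dataset size; the ratio of 20 is an approximation, not a constant of nature."""
    return n_params * tokens_per_param

n_params = 70 * 10**9                        # a 70B-parameter model
n_tokens = compute_optimal_tokens(n_params)  # 1.4 trillion tokens
print(f"{n_tokens:.2e} tokens, {training_flops(n_params, n_tokens):.2e} FLOPs")
```

Doubling the parameter count under this rule also doubles the data, so compute grows about fourfold, which is one reason frontier runs get expensive so quickly.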

Fine-tuning and alignment — from completion to assistant

Raw pretrained models complete text; they do not necessarily produce helpful dialogue. Supervised Fine-Tuning (SFT) trains on instruction-response pairs crafted by human writers, teaching the model the assistant format: answer the question, politely.

Reinforcement Learning from Human Feedback (RLHF) — humans rank model outputs; reward model learned; policy optimized to prefer high-ranked responses — reduces toxic or unhelpful patterns, increases perceived helpfulness.
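The reward-model step can be illustrated with a pairwise preference loss in the Bradley-Terry style, a common textbook formulation. Which loss any particular lab uses is not stated here, so treat this as one plausible sketch; the reward scores are made up.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Probability the reward model assigns to the human-preferred response winning."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human ranking; small when the ranking is respected."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical scalar rewards for two responses to the same prompt.
print(pairwise_loss(2.0, 0.0))  # reward model agrees with the human ranking: low loss
print(pairwise_loss(0.0, 2.0))  # reward model disagrees: high loss
```

Note what the loss measures: agreement with human rankings. Nothing in it checks whether the preferred response is true, only whether people preferred it.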

Constitutional AI, RLAIF variants — AI or principles guide preference learning at scale.

System prompts — hidden instructions setting behavior per deployment — “You are a helpful assistant…” — not magic safety, shapes tone and refusals.

Alignment reduces but does not eliminate hallucination or jailbreaks — optimization targets plausible human approval, not ground truth.

Hallucination — why fluent lies happen

Hallucination — model generates false statements confidently — fake citations, wrong dates, nonexistent products, invented legal citations (lawyers sanctioned for submitting ChatGPT fabrications).

Mechanisms:

No grounded retrieval by default. Parametric memory compresses training facts, so recall is imperfect: similar entities get conflated and gaps are filled with plausible patterns, such as a “STOC” conference paper title synthesized from the style of real papers.

Optimization for plausibility not verification — loss during training rewards predicting likely continuations, not checking external databases each token.

Context pressure — long conversations lose early details; model confabulates continuity.

Adversarial or ambiguous prompts — edge cases underrepresented in training.

Mitigations:

Retrieval-Augmented Generation (RAG) fetches documents from a search index or company knowledge base and conditions generation on the retrieved passages. It reduces errors but cannot eliminate them, especially when retrieval returns the wrong passages.

Tool use lets the model call a calculator, code interpreter, SQL query, or web search. This is the architecture behind AI agents: external tools supply ground truth for bounded tasks.

Citation requirements — force quotes from sources; verify overlap post-hoc.

Uncertainty calibration — train model to say “I don’t know” — imperfect; users prefer confident wrong answers unless UI rewards humility.

Smaller domain-specific models — medical or legal fine-tunes on vetted corpora — narrower hallucination surface, still not zero.

Acceptance: LLMs are probabilistic language interfaces, not oracle databases — verify consequential facts independently.
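The retrieval half of RAG, mentioned among the mitigations above, can be sketched without any model at all. This toy version scores documents by word overlap (cosine similarity over word counts) where real systems typically use learned embeddings; the knowledge-base sentences and the prompt wording are invented for illustration, and the generation call itself is left out.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return up to k documents that share at least one word with the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query, passages):
    """Assemble the text that would be sent to the model."""
    if not passages:
        return f"Question: {query}\nNo supporting passages were found; say you do not know."
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return ("Answer using only the passages below and cite them by number.\n"
            f"{numbered}\nQuestion: {query}")

knowledge_base = [  # hypothetical company documents
    "Refund requests are accepted within 30 days of purchase.",
    "The office closes on public holidays.",
    "Standard shipping takes five business days.",
]
question = "How do I request a refund for my purchase?"
print(build_prompt(question, retrieve(question, knowledge_base)))
```

The failure mode is visible even here: a question phrased with different words than the document retrieves nothing, and the model is then back to relying on parametric memory unless the prompt tells it to admit ignorance.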

Context windows and memory

The context window is the number of tokens the model can attend to at once. GPT-3 launched with roughly 2k tokens and its successors moved to 4k; 2026 frontier models claim 128k to 1M+ via sparse attention, ring attention, and memory compression, which enables long-document analysis in one shot.

Limits remain — “lost in the middle” phenomenon — information buried mid-context recalled worse than start/end; architectural and training fixes partial.

Persistent memory products — cross-session user profiles stored externally — not inherent LLM capability — privacy implications tie to local vs. cloud deployment.

Multimodal models — vision, audio, video

GPT-4V, Gemini, Claude 3+ ingest images — describe, OCR, reason about diagrams — extends token paradigm — patch embeddings from vision encoder fused into transformer stack.

Audio in/out — voice mode pipelines speech-to-text, LLM, text-to-speech — latency and prosody improving.

Video understanding — frame sampling or native video tokens — training cost explosive.

Multimodality does not automatically fix hallucination — visual misdescription occurs; deepfakes easier — cybersecurity awareness essential.

Code, math, and reasoning — uneven strengths

LLMs write code productively, from autocomplete to scaffolding whole projects, and results improve when compile errors are fed back in agent loops. Because they are trained on GitHub, they can reproduce licensed code patterns, which raises IP and security-audit concerns.

Math — arithmetic errors on multi-step problems without tool use; better with chain-of-thought prompting (“think step by step”) and formal tools (Python, Wolfram).

Reasoning benchmarks — GSM8K, MATH, ARC — scores rising — still fail puzzles humans find easy — spatial, physical intuition weak.

Debate: emergent reasoning vs. memorized templates — likely mixture — scale unlocks new behaviors on some tasks abruptly.

Parameters, quantization, and efficiency

Parameters are the learned weights. GPT-3 has 175B; frontier models are rumored to be 500B to 1T+ mixture-of-experts (MoE) designs, in which only a subset of the weights activates per token, an efficiency trick.

Quantization — 16-bit, 8-bit, 4-bit weights — smaller memory, faster inference, slight quality loss — enables local models on consumer GPUs.

Distillation — small student mimics large teacher — mobile deployment.

Speculative decoding — draft model proposes tokens, big model verifies — latency reduction.

Inference cost per query matters commercially — drives model tiering (mini vs. flagship).
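The quantization trade-off described above comes down to simple arithmetic plus rounding. The sketch below estimates weight memory at different bit widths and shows a basic symmetric 8-bit round trip. The 7B model size and the sample weights are illustrative assumptions, and real schemes usually quantize in small groups with extra tricks to limit quality loss.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone (activations and caches come on top)."""
    return n_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

for bits in (16, 8, 4):
    print(bits, "bit:", weight_memory_gb(7_000_000_000, bits), "GB")

sample_weights = [0.5, -1.27, 0.01, 1.0]  # made-up values
quantized, scale = quantize_int8(sample_weights)
print(quantized, dequantize(quantized, scale))
```

Going from 16-bit to 4-bit cuts a hypothetical 7B-parameter model from 14 GB to 3.5 GB of weights, which is the difference between needing a datacenter card and fitting on a consumer GPU; the rounding error per weight is at most half the scale.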

Training infrastructure — chips, cloud, energy

Training runs orchestrated on GPU clusters (Nvidia H100/H200, AMD MI300) in cloud regions — parallelism strategies — data parallel, tensor parallel, pipeline parallel — network bandwidth bottleneck at scale.

Energy and water consumption — datacenter siting controversies — renewable pledges — environmental policy angle.

Export controls on advanced chips shape which countries train frontier models — geopolitical dimension overlapping semiconductor supply chains.

Open vs. closed weights

Closed API models — OpenAI, Anthropic, Google — weights secret; access via API; rapid improvement; vendor lock-in; safety filtering centralized.

Open-weight models — Meta Llama series, Mistral, Qwen, DeepSeek — downloadable; customizable; self-host; misuse risk (spam, malware assistance) — community fine-tunes proliferate.

Open source debate nuanced — weights open ≠ training data open — reproducibility limited — still enables academic and startup innovation.

Enterprises choose based on privacy, cost predictability, compliance, capability — often hybrid — API for hard tasks, local for sensitive docs.

Safety, bias, and censorship

Training data reflects internet biases — gender, race, culture stereotypes replicated or amplified — mitigation via fine-tuning, filtering, evaluation suites — incomplete.

Refusal behavior — decline harmful requests — jailbreak community finds bypasses — cat-and-mouse.

Content moderation — political sensitivity varies by vendor and jurisdiction — no universal standard.

Deepfake and misinformation — synthetic text cheap — detection arms race — societal not purely technical problem.

Alignment research connects to long-term AGI risk — today’s hallucination is tomorrow’s autonomous agent error at scale if uncorrected.

Practical use patterns that work

Drafting and editing — emails, outlines, tone adjustment — human final review.

Summarization — meetings, long reports — verify key figures against source.

Coding assistance — boilerplate, tests, regex — run tests; security review.

Tutoring — Socratic hints — verify curriculum accuracy.

Structured extraction — pull fields from messy text into JSON — validate schema.

Brainstorming — divergent ideas — discard garbage freely.
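For the structured-extraction pattern above, "validate schema" can be as simple as the following sketch. The invoice fields are a hypothetical schema chosen for illustration, and the raw strings stand in for whatever text a model returned.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "currency": str}

def parse_extraction(raw_output):
    """Parse model output as JSON and check it against REQUIRED_FIELDS.

    Returns (record, errors); record is None when the output is not a JSON object.
    """
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return record, errors

good = '{"invoice_number": "A-17", "total": 99.5, "currency": "EUR"}'
bad = '{"invoice_number": "A-17", "total": "99.5"}'
print(parse_extraction(good))
print(parse_extraction(bad))
```

A non-empty error list is a signal to retry or route to a human. Passing the check only proves the shape is right, not that the extracted values match the source document.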

Patterns failing without safeguards:

Unverified legal/medical/financial advice — high stakes hallucination cost.

Autonomous decision-making — loan approvals without human — regulatory and ethical barriers.

Real-time factual news — training cutoff; no live knowledge unless tools connected.

Prompting interaction — interface not magic

Users interact via prompts — see dedicated prompt engineering guide — clarity, examples, decomposition improve outputs — does not change underlying uncertainty.

System design should assume outputs are drafts — UI nudges verification — cite sources, show confidence, link retrieval passages.

Token economics — why answers cost money

Every generated token consumes compute, so inference cost scales with model size and output length. Providers typically price input and output tokens separately, per million tokens, which makes long contexts and verbose chain-of-thought expensive. Business models therefore favor subscriptions with hidden caps or enterprise contracts.

Understanding tokens explains product limits — message length caps, summarization chunking — not arbitrary meanness — GPU memory and margin mathematics. Local models shift cost to hardware electricity — tradeoff analysis for high-volume use.
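The margin mathematics is straightforward to reproduce. The prices below are placeholders, not any provider's actual rates; substitute real figures from a current price list.

```python
def query_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request when input and output tokens are priced per million."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

def monthly_cost(queries_per_day, cost_per_query, days=30):
    return queries_per_day * cost_per_query * days

# Hypothetical prices: 3 per million input tokens, 15 per million output tokens.
per_query = query_cost(input_tokens=2_000, output_tokens=500,
                       input_price_per_m=3.0, output_price_per_m=15.0)
print(per_query)
print(monthly_cost(queries_per_day=10_000, cost_per_query=per_query))
```

In this made-up example the 500 output tokens cost more than the 2,000 input tokens, which is why asking for concise answers moves the bill more than trimming the question does.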

Prompt designers internalize brevity: ask for concise answers when they suffice. It is an environmental micro-kindness at scale.

Fine-tuning versus RAG — when to choose what

Teams debate investment paths:

Fine-tuning (supervised or LoRA adapters) — bake domain style and vocabulary into weights — good for consistent tone, format, specialized jargon — bad for rapidly changing facts — retrain needed — risk catastrophic forgetting if done naively.

RAG — keep facts in updatable document index — model retrieves fresh passages — good for policy manuals, product catalogs — bad if retrieval fails silently — hybrid common: small fine-tune plus RAG plus tool calls.

Prompt engineering alone — cheapest first step — often sufficient — see practical prompt guide — do not leap to fine-tune without eval proving gap.

Decision matrix depends on change frequency, privacy (can docs leave premises?), latency budget, and team ML depth.

Sociolinguistics and multilingual behavior

LLMs are trained on uneven language corpora in which English dominates, so quality gaps persist for Spanish, Hindi, and Swahili relative to English. Targeted data campaigns are narrowing them, but localization is not automatic: cultural nuance, idioms, and formality registers still go wrong, and human native review is still needed for customer-facing global products.

Code-switching and dialect fairness — AAVE, regional variants — benchmark gaps documented — alignment efforts ongoing — not solved — deploy carefully in sensitive communications.

Watermarking and detection — arms race

Labs experiment with watermarking generated text using statistically detectable signatures, but paraphrase models can bypass them, and detectors produce false positives on student essays. The academic integrity crisis remains unresolved, and policy responses vary from institution to institution.

Audio/image deepfakes parallel — multimodal LLM stacks increase forgery capability — detection never perfect — societal adaptation (confirm channels, provenance standards) required alongside technical filters.

Evaluation — how labs measure progress

Benchmarks: MMLU (multitask knowledge), HumanEval (code), HELM holistic, custom enterprise evals — gaming risk — train on test inadvertently or overfit benchmark formats.

LLM-as-judge — another model scores answers — scalable, biased.

Human evaluation is the gold standard but expensive, and vendor claims are often cherry-picked.

Users should evaluate on their tasks — generic leaderboard ≠ your workflow success.

Long-context retrieval benchmarks

Researchers run needle-in-a-haystack tests: hide a fact in a long document, then ask the model to retrieve it. Frontier models have improved but still fail with multiple needles and counterfactual distractors, and architectural progress (ring attention, memory caches) is ongoing. Do not assume a 128k window equals perfect recall; product UX should highlight uncertainty and cite source spans when RAG is used.
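A needle-in-a-haystack harness is short enough to sketch. The `ask_model` argument is a hypothetical callable standing in for a real model call; the `edge_reader` stub below is not a language model at all, just a toy that only "sees" the start and end of the document so the harness has something to measure.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth between 0.0 (start) and 1.0 (end)."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

def recall_by_depth(ask_model, filler_sentences, needle, question, expected, depths):
    """Return {depth: True/False} for whether the expected answer came back."""
    results = {}
    for depth in depths:
        document = build_haystack(filler_sentences, needle, depth)
        answer = ask_model(document, question)
        results[depth] = expected.lower() in answer.lower()
    return results

def edge_reader(document, question):
    """Toy stand-in for a model: only the first and last 200 characters are visible to it."""
    visible = document[:200] + " " + document[-200:]
    return "The code is 7421." if "7421" in visible else "I don't know."

filler = ["The sky was grey that morning."] * 50
results = recall_by_depth(edge_reader, filler, "The vault code is 7421.",
                          "What is the vault code?", "7421", depths=[0.0, 0.5, 1.0])
print(results)
```

Swapping the stub for a real model call and sweeping document length as well as depth gives the kind of grid these benchmarks usually report; extending it to several needles or to distractor facts is where current models tend to struggle.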

Legal discovery and due diligence workflows adopting LLMs must human-verify every extracted clause against PDF page — automation assist not replacement — malpractice risk real — courts sanctioning lawyers for unchecked chat output set precedent.

Future directions — beyond autocomplete

Agents — LLM plans, uses tools, loops until goal — error accumulation without verification — active research.

World models and robotics — language grounded in physical interaction — data harder than text scrape.

Continuous learning — update weights from new events without catastrophic forgetting — unsolved at frontier scale.

Formal verification — prove properties of generated code — niche but growing.

Regulation — EU AI Act transparency duties; watermarking debates; liability for harms — landscape evolving.

LLMs likely remain components in larger systems rather than standalone oracles — composition with retrieval, tools, human oversight — agent frameworks embody this.

Mental model for the curious non-engineer

Think of an LLM as lossy compression of humanity’s writing — decompression generates novel text statistically similar to training distribution — astonishing for creative and linguistic tasks; unreliable as single source of truth; improvable with external verification loops; not conscious, not AGI, not meaningless either — a new interface layer between intent and articulated language requiring calibrated trust.

When someone says “the AI knows,” translate to “the model assigns high probability to sequences resembling authoritative text it absorbed.” When someone says “the AI is useless,” counter that workflow design separates toy from tool — same model writes nonsense or useful memo depending on human process around it.

Closing frame

Large language models reshaped expectations of software: conversation as interface, code as dialog, knowledge work augmented rather than replaced. Understanding next-token prediction explains both the wizardry and the failure modes: fluency without guaranteed fidelity. Use them generously for drafts and narrowly for facts; connect retrieval and tools for grounding; keep humans responsible for consequential claims. Hallucination is not a bug awaiting a single patch. It is inherent to an architecture optimized for plausible language, and it is managed by process, not by worship or rejection.

Lumen is edited by Leo Hartmann. Related: AGI Explained · AI Agents in 2026 · Local AI Models and Privacy · Prompt Engineering Guide