A
-
Agentic AI
Agentic AI enables LLMs to plan, use tools, and act in closed-loop cycles with memory and safety controls, turning models into goal‑directed systems for real-world tasks.
-
AI Hallucination
AI hallucination is when a generative model confidently outputs false, fabricated, or unsupported content. It stems from likelihood-driven decoding without grounding and is mitigated with RAG, citations, constrained decoding, tool use, and verification.
B
C
-
Constitutional AI
Constitutional AI aligns models to an explicit set of principles, enabling self‑critique and revision (and optionally AI feedback) to reduce harmful or non‑compliant outputs and speed policy updates.
-
Chain-of-Thought (CoT)
Chain-of-thought (CoT) prompts models to show intermediate reasoning steps, improving multi-step problem solving and interpretability for math, logic, planning, and code tasks.
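A minimal sketch of the zero-shot variant: the trigger phrase "Let's think step by step" is one common formulation, and the question is made up.

```python
# Zero-shot chain-of-thought prompting: appending a reasoning trigger
# nudges the model to emit intermediate steps before its final answer.
def cot_prompt(question: str) -> str:
    return (
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

prompt = cot_prompt("A train travels 60 km in 1.5 hours. What is its average speed?")
```
Few-shot CoT instead prepends worked examples whose answers show their reasoning.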
D
-
Direct Preference Optimization (DPO)
DPO aligns LLMs using human preference pairs—no reward model or RL required—by training the policy to prefer chosen responses while constraining divergence from a reference model, giving stable, cost-effective alignment for helpfulness and safety.
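The per-pair objective follows directly from sequence log-probabilities; a plain-Python sketch (the log-prob values and beta are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l      : policy log-prob of the chosen / rejected response
    ref_logp_w / ref_logp_l : same under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```
At initialization (policy equals reference) the margin is zero and the loss is exactly log 2.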
-
Diffusion Model
A diffusion model generates data by reversing a gradual noising process, denoising step by step—often in latent space—and can be conditioned (e.g., by text) for controllable image, video, or audio synthesis.
E
F
-
Function Calling
Function calling lets LLMs emit structured tool invocations with validated arguments to safely call APIs and code, enabling reliable, auditable agent workflows.
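A minimal sketch of the runtime side: the tool name, argument schema, and model output below are hypothetical, and real systems validate against richer schemas (e.g., JSON Schema).

```python
import json

# Hypothetical tool registry: the model emits a structured call such as
# {"name": "get_weather", "arguments": {"city": "Paris"}} and the runtime
# validates the arguments before executing anything.
TOOLS = {
    "get_weather": {
        "required": {"city": str},
        "fn": lambda city: f"Sunny in {city}",
    }
}

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)
    spec = TOOLS[call["name"]]
    args = call["arguments"]
    for arg, typ in spec["required"].items():  # validate before executing
        if not isinstance(args.get(arg), typ):
            raise ValueError(f"bad argument: {arg}")
    return spec["fn"](**args)

result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
```
Rejecting malformed calls before execution is what makes the workflow auditable.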
-
FlashAttention
FlashAttention is an IO‑aware, exact attention algorithm that tiles work into GPU SRAM and fuses kernels to cut memory traffic, accelerating long‑context Transformer training and inference.
G
-
Graph Retrieval-Augmented Generation (Graph RAG)
Graph RAG organizes knowledge as a graph and retrieves connected subgraphs for LLMs, enabling multi-hop reasoning, disambiguation, and explainable, cited answers across documents.
-
Grouped-Query Attention (GQA)
Grouped-Query Attention shares keys/values across groups of query heads, shrinking KV caches and bandwidth to speed LLM inference and extend context with minimal quality loss; MQA is the single-group limit.
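The cache savings follow directly from the head counts; a back-of-envelope sketch with illustrative (not model-specific) dimensions:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Two tensors (K and V) per layer, each [kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * seq_len * head_dim * dtype_bytes

# Illustrative config: 32 layers, head_dim 128, 4096-token context, fp16.
# MHA keeps one K/V head per query head (32); GQA shares each K/V head
# across a group of query heads (8 kv heads here); MQA is the 1-head limit.
mha = kv_cache_bytes(32, 32, 128, 4096)
gqa = kv_cache_bytes(32, 8, 128, 4096)
mqa = kv_cache_bytes(32, 1, 128, 4096)
```
With these numbers GQA shrinks the cache 4x and MQA 32x, which is where the bandwidth and context-length headroom comes from.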
H
I
-
In-Context Learning (ICL)
ICL lets LLMs infer tasks from prompt-only examples—no weight updates—enabling zero/few-shot classification, extraction, and reasoning with schema-following in a single session.
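A minimal few-shot prompt builder; the sentiment-labeling task and example strings are made up.

```python
# In-context learning: the model infers the task purely from the examples
# in the prompt; no weights are updated.
def few_shot_prompt(examples, query):
    lines = [f"Input: {x}\nLabel: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("great movie", "positive"), ("waste of time", "negative")],
    "loved every minute",
)
```
The trailing "Label:" leaves exactly one slot for the model to complete, which is also how schema-following is induced.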
-
Instruction Tuning
Instruction tuning fine-tunes LMs on instruction–response pairs to improve adherence, helpfulness, and controllability, and often precedes preference tuning (DPO/RLHF) for tone and safety.
J
K
-
Key-Value Cache (KV Cache)
A KV cache stores past attention keys/values so LLMs reuse them at each step, cutting latency, enabling continuous batching, and supporting longer contexts with paging and quantization.
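A toy single-head version of the mechanism (lists instead of tensors, no projections): each decoding step appends one key/value pair and attends over all cached entries instead of recomputing them.

```python
import math

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append this step's key/value, then attend over the whole cache.
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        m = max(scores)                                # stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        return [sum(w * val[i] for w, val in zip(weights, self.values)) / z
                for i in range(len(v))]

cache = KVCache()
out1 = cache.step(q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0])
out2 = cache.step(q=[0.0, 1.0], k=[0.0, 1.0], v=[4.0, 5.0])
```
Without the cache, step t would recompute keys and values for all t prefix tokens; with it, each step does O(t) attention over stored entries.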
L
-
Low-Rank Adaptation (LoRA)
LoRA fine-tunes LLMs by training small low-rank adapters on top of frozen weights, slashing memory and compute while preserving base capabilities and enabling modular, swappable domain skills.
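The update can be sketched with nested lists: the frozen weight W is adapted by a low-rank product scaled by alpha/r (the tiny matrices below are illustrative).

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(W, A, B, alpha):
    # Effective weight: W + (alpha / r) * B @ A, with B: d_out x r, A: r x d_in.
    r = len(A)                      # rank = number of rows of A
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2x2 base weight
A = [[1.0, 0.0]]                    # trainable, rank r = 1
B = [[0.5], [0.0]]                  # trainable
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
```
Only A and B (2·d·r parameters instead of d²) are trained, and the product can be merged into W at deploy time or kept separate as a swappable adapter.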
M
-
Mixture of Experts (MoE)
Mixture of Experts (MoE) scales model capacity by routing each token to a small subset of expert networks, delivering high quality per compute with sparse activation at the cost of added routing, memory, and distributed systems complexity.
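A toy top-k router over scalar "experts" (real experts are feed-forward networks and gate logits come from a learned projection; everything below is illustrative):

```python
import math

def route(gate_logits, k=2):
    # Pick the k highest-scoring experts; softmax-normalize their weights.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i],
                 reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]   # (expert_id, weight)

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * 10]

def moe_forward(x, gate_logits, k=2):
    # Only the routed experts run: sparse activation, dense capacity.
    return sum(w * experts[i](x) for i, w in route(gate_logits, k))

y = moe_forward(1.0, gate_logits=[0.0, 2.0, -1.0, 2.0], k=2)
```
Only 2 of 4 experts execute per token here; production MoE layers add load-balancing losses and capacity limits to keep routing stable.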
-
Model Quantization
Model quantization reduces precision (e.g., INT8/INT4) for weights and activations to shrink memory and speed inference, enabling cheaper, longer-context LLM serving on smaller hardware with minimal accuracy loss.
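A sketch of the simplest scheme, symmetric per-tensor INT8: one scale maps the largest absolute weight to 127, and dequantization recovers values to within half a step (the weights below are made up).

```python
def quantize_int8(weights):
    # Symmetric quantization: q = round(w / scale), with scale set so the
    # largest-magnitude weight maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.1, -0.5, 0.25, 1.27]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```
Practical INT4/INT8 schemes refine this with per-channel or per-group scales and outlier handling to keep accuracy loss minimal.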
N
O
P
-
PagedAttention
PagedAttention organizes LLM KV caches into fixed-size pages (blocks) to reduce fragmentation, enable continuous batching, and support long contexts by dynamically mapping logical positions to physical pages across GPU/CPU memory.
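A toy sketch of the paging scheme: logical token positions map through a per-sequence block table to fixed-size physical pages, so the cache grows page by page instead of needing one large contiguous allocation (page size and pool size are illustrative).

```python
PAGE_SIZE = 4

class PagedKV:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # physical page pool
        self.block_table = []                # logical page -> physical page
        self.store = {}                      # (physical page, slot) -> kv entry

    def append(self, pos, kv):
        page_idx, slot = divmod(pos, PAGE_SIZE)
        if page_idx == len(self.block_table):        # sequence grew a page
            self.block_table.append(self.free.pop(0))
        phys = self.block_table[page_idx]
        self.store[(phys, slot)] = kv

    def get(self, pos):
        page_idx, slot = divmod(pos, PAGE_SIZE)
        return self.store[(self.block_table[page_idx], slot)]

kv = PagedKV(num_pages=8)
for pos in range(6):                 # 6 tokens -> 2 pages of size 4
    kv.append(pos, f"kv{pos}")
```
Because pages are uniform, freed pages from finished sequences are immediately reusable by others, which is what makes continuous batching efficient.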
Q
R
-
Retrieval-Augmented Generation (RAG)
RAG grounds LLM outputs in retrieved documents via sparse, dense, or hybrid search, improving factuality, citations, and freshness without retraining the model.
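The retrieve-then-generate loop can be sketched with toy dense retrieval (the 2-d "embeddings" and passages below are made up; real systems use learned encoders and ANN indexes):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=2):
    # Dense retrieval: rank passages by cosine similarity to the query.
    return sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:k]

corpus = [
    {"text": "The Eiffel Tower is in Paris.", "vec": [0.9, 0.1]},
    {"text": "Photosynthesis occurs in chloroplasts.", "vec": [0.1, 0.9]},
    {"text": "Paris is the capital of France.", "vec": [0.8, 0.2]},
]
docs = retrieve([1.0, 0.0], corpus, k=2)
prompt = "Answer using only these passages:\n" + "\n".join(d["text"] for d in docs)
```
The grounded prompt then goes to the LLM; because knowledge lives in the corpus, freshness comes from re-indexing rather than retraining.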
-
ReAct Prompting
ReAct prompting interleaves reasoning with tool actions and observations (Thought → Action → Observation), letting LLM agents plan, act, verify, and revise for grounded, auditable multi-step tasks.
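The loop can be sketched with a scripted stand-in for the model (the lookup tool, question, and transcript format below are all illustrative):

```python
# Toy ReAct loop: each model turn is either an Action the runtime executes
# (feeding back an Observation) or a final Answer.
TOOL = {"capital of France": "Paris"}

def scripted_model(transcript):
    # Stand-in "model": act first, answer once an observation is available.
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    return "Answer: Paris"

def react_loop(model, max_turns=4):
    transcript = "Question: What is the capital of France?"
    for _ in range(max_turns):
        turn = model(transcript)
        transcript += "\n" + turn
        if turn.startswith("Answer:"):
            return transcript
        query = turn.split("lookup[")[1].rstrip("]")   # execute the action
        transcript += f"\nObservation: {TOOL[query]}"
    return transcript

log = react_loop(scripted_model)
```
The full Thought/Action/Observation transcript is what makes the agent's behavior auditable.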
-
Reinforcement Learning from Human Feedback (RLHF)
RLHF aligns language models by training a reward model on human preferences and optimizing the policy with RL (e.g., PPO) under a KL constraint to a reference, improving helpfulness, safety, and style.
S
-
Speculative Decoding
Speculative decoding speeds up LLM inference by letting a fast draft model propose tokens that a larger model verifies, accepting matching prefixes to cut latency and cost while preserving the target model’s output distribution.
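A toy greedy version of the accept/verify step (the stand-in "target model" below just continues a fixed sequence; real implementations verify all draft positions in one batched forward pass and use rejection sampling for non-greedy decoding):

```python
def speculative_step(draft_tokens, target_next_token):
    """Accept draft tokens while they match the target's greedy choice,
    then append the target's own next token."""
    accepted = []
    for tok in draft_tokens:
        if tok == target_next_token(accepted):
            accepted.append(tok)
        else:
            break                                  # first mismatch: stop
    accepted.append(target_next_token(accepted))   # target's correction
    return accepted

# Stand-in target model: greedily continues the sequence "a b c d ...".
VOCAB = "abcdefgh"
def target_next_token(prefix):
    return VOCAB[len(prefix)]

out = speculative_step(["a", "b", "x", "d"], target_next_token)
```
Here the draft gets two tokens right, so one verification pass yields three tokens instead of one, while the output matches what the target alone would produce.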
-
Small Language Model (SLM)
A Small Language Model (SLM) is a compact LLM optimized for low latency and memory via distillation, pruning, quantization, and domain tuning—often combined with RAG and tool use to deliver strong accuracy on targeted tasks under tight compute budgets.
-
Structured Output
Structured output constrains LLMs to emit schema‑valid JSON or similar formats, boosting reliability, safety, and integration by replacing free‑form text with machine‑validated fields and types.
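A minimal validation sketch: the field names and types below are illustrative, and production systems typically use JSON Schema plus constrained decoding rather than post-hoc checks.

```python
import json

SCHEMA = {"name": str, "age": int}

def parse_structured(reply: str):
    # Reject the model's reply unless it is valid JSON matching the schema.
    obj = json.loads(reply)
    for field, typ in SCHEMA.items():
        if not isinstance(obj.get(field), typ):
            raise ValueError(f"schema violation: {field}")
    return obj

record = parse_structured('{"name": "Ada", "age": 36}')
```
Downstream code can then consume `record` as typed data instead of parsing free-form text.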
T
-
Tree-of-Thought (ToT)
Tree-of-Thought structures reasoning as a search over branching steps. An LLM expands candidate thoughts, a controller scores and prunes them, and the best path is selected—improving multi‑step math, logic, planning, and code, at higher compute cost.
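The expand/score/prune cycle can be sketched as beam search over a toy task (build a number equal to a target by appending digits; the task, scorer, and beam width are illustrative):

```python
def tot_search(expand, score, root, depth, beam):
    # Beam-style tree-of-thought: expand all frontier nodes, score the
    # children, and keep only the best `beam` per depth.
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in expand(node)]
        frontier = sorted(candidates, key=score)[:beam]   # lower is better
    return frontier[0]

TARGET = 42
expand = lambda n: [n * 10 + d for d in range(10)]   # append one digit
score = lambda n: abs(n - TARGET)

# The beam must be wide enough to keep the first digit 4, whose one-step
# score (|4 - 42| = 38) looks worse than 9, 8, 7, 6, or 5.
best = tot_search(expand, score, root=0, depth=2, beam=6)
```
With a narrower beam the search greedily commits to a first digit that cannot reach 42, which is exactly the trade-off between compute and solution quality the entry describes.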
-
Text Embedding
A text embedding is a dense vector that encodes the meaning of text for similarity search, clustering, and RAG. Encoders map sentences or documents to vectors so systems can compare them by cosine similarity or dot product for retrieval and downstream tasks.
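A toy illustration of the vector-space idea: hash each token into a small fixed-size vector and L2-normalize, so cosine similarity reduces to a dot product. Real embeddings are learned by neural encoders; this only shows the geometry.

```python
import math
import zlib

DIM = 8

def embed(text):
    # Toy deterministic "embedding": bucket tokens by CRC32 hash, then
    # L2-normalize so cosine similarity is just a dot product.
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))   # unit-norm vectors

sim_same = cosine(embed("the cat sat"), embed("the cat sat"))
sim_diff = cosine(embed("cat"), embed("dog"))
```
Identical texts score 1.0; unrelated texts score lower (here bounded in [0, 1] because all components are non-negative).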
-
Toolformer
Toolformer teaches LMs to autonomously invoke external tools during generation by training on interleaved tool-call traces, boosting factuality and arithmetic/lookup reliability for agents and assistants.
U
V
-
Vision-Language Model (VLM)
A Vision-Language Model (VLM) jointly learns from images and text to understand and generate multimodal content, enabling captioning, VQA, grounding, and retrieval via vision encoders, cross-attention connectors, and instruction-tuned language decoders.
-
Vector Database
A vector database indexes high‑dimensional embeddings for fast similarity search with metadata filters and hybrid retrieval—foundational for RAG, semantic search, recommendations, and multimodal search at scale.
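A toy in-memory version of the core operation: brute-force nearest-neighbor search with a metadata filter. Real vector databases replace the linear scan with ANN indexes such as HNSW; the vectors and metadata below are made up.

```python
import math

class VectorIndex:
    def __init__(self):
        self.items = []

    def add(self, vec, metadata):
        self.items.append((vec, metadata))

    def search(self, query, k=1, where=None):
        # Apply the metadata filter, then rank by Euclidean distance.
        pool = [it for it in self.items if where is None or where(it[1])]
        pool.sort(key=lambda item: math.dist(query, item[0]))
        return [meta for _, meta in pool[:k]]

idx = VectorIndex()
idx.add([0.0, 0.0], {"id": 1, "lang": "en"})
idx.add([1.0, 1.0], {"id": 2, "lang": "fr"})
idx.add([0.1, 0.1], {"id": 3, "lang": "fr"})
hits = idx.search([0.0, 0.0], k=1, where=lambda m: m["lang"] == "fr")
```
The filter excludes the globally nearest item (id 1), so the nearest matching item (id 3) is returned, which is the filtered-retrieval behavior RAG pipelines rely on.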
