Fundamentals
Getting Started with LLMs
A comprehensive introduction to Large Language Models — architecture, training, capabilities, and practical setup
Published: 2026-04-01 · Last updated: 2026-04-10
Getting Started with Large Language Models
Large Language Models (LLMs) are artificial intelligence systems trained on vast amounts of text data to understand, generate, and manipulate human language. They have become the cornerstone of modern AI, powering everything from chatbots and code assistants to research tools and creative writing aids.
This guide covers what LLMs are, how they work under the hood, the landscape of available models, and how to get started using them in your own projects.
What are LLMs?
LLMs are deep learning models based on the Transformer architecture (introduced in the 2017 paper "Attention Is All You Need"). They are characterized by several key properties:
- Scale: Ranging from hundreds of millions to trillions of parameters
- Pre-training: Learned self-supervised from massive text corpora spanning web pages, books, code, and scientific papers
- Emergent Abilities: Capabilities that appear unpredictably at scale — reasoning, code generation, translation — that were not explicitly trained for
- Few-shot & Zero-shot Learning: Ability to perform tasks from natural language instructions without task-specific fine-tuning
- Multimodality: Modern LLMs increasingly handle text, images, audio, video, and structured data
A Brief History
| Era | Milestone | Significance |
|---|---|---|
| 2017 | Transformer architecture | Replaced RNNs/CNNs as the dominant sequence model |
| 2018 | GPT, BERT | Proved pre-training + fine-tuning paradigm |
| 2019 | GPT-2 (1.5B) | Showed scaling improves coherence and task ability |
| 2020 | GPT-3 (175B), T5 | Demonstrated few-shot learning at scale |
| 2021 | InstructGPT, Codex | Alignment via RLHF; code generation |
| 2022 | ChatGPT, StableLM | Conversational AI goes mainstream |
| 2023 | GPT-4, Llama, Claude, Mistral | Multimodal models; open-source renaissance |
| 2024 | Llama 3, Claude 3, Gemini | Frontier models rival human experts |
| 2025-2026 | Reasoning models, agentic systems | Long-horizon planning, tool use, autonomy |
How LLMs Work
The Transformer Architecture
The Transformer replaced the sequential processing of RNNs with self-attention, enabling parallel computation across entire sequences. Here's the core idea:
import torch
import torch.nn as nn
import math
class MultiHeadSelfAttention(nn.Module):
"""Self-attention mechanism: each token attends to all others."""
def __init__(self, d_model: int, num_heads: int):
super().__init__()
assert d_model % num_heads == 0
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
batch_size = x.size(0)
# Linear projections
Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attention = torch.softmax(scores, dim=-1)
# Weighted sum of values
output = torch.matmul(attention, V)
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
return self.W_o(output)
Key Components
| Component | Purpose | Details |
|---|---|---|
| Tokenizer | Text → subword tokens | Byte-pair encoding (BPE), SentencePiece, or tiktoken; vocabulary sizes 32K–200K tokens |
| Embedding Layer | Tokens → dense vectors | Learnable embedding matrices; positional encodings add sequence order information |
| Attention Mechanism | Contextual token mixing | Scaled dot-product attention; multi-head enables attending to different subspace features |
| Feed-Forward Network | Non-linear transformation | Typically 2-layer MLP with GeLU activation; ~4x hidden dimension expansion |
| Layer Normalization | Training stability | Pre-LN or Post-LN placement; RMSNorm is common in modern models |
| Residual Connections | Gradient flow | Skip connections around each sub-layer |
| Output Head | Token prediction | Linear layer projecting to vocabulary; softmax for probability distribution |
Training Pipeline
- Pre-training: Model learns to predict the next token (autoregressive LM) on trillions of tokens. This takes weeks on thousands of GPUs and costs millions of dollars for frontier models.
- Supervised Fine-tuning (SFT): Model is further trained on high-quality instruction-response pairs to follow human directions.
- Alignment (RLHF/DPO): Reinforcement Learning from Human Feedback or Direct Preference Optimization aligns outputs with human values — helpfulness, honesty, harmlessness.
- Safety Filtering: Additional guardrails prevent harmful outputs, jailbreaks, and policy violations.
Types of LLMs
By Architecture
| Type | Examples | Strengths | Weaknesses |
|---|---|---|---|
| Decoder-only | GPT-4, Llama 3, Claude, Mistral | Autoregressive generation, versatile, dominant paradigm | Can hallucinate; unidirectional attention |
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional understanding, excellent for classification | Cannot generate text natively |
| Encoder-Decoder | T5, BART, Flan-T5 | Sequence-to-sequence tasks, translation | Less common now; outperformed by decoder-only at scale |
| Mixture of Experts (MoE) | Mixtral 8x7B, Grok, Qwen MoE | Sparse activation; high capacity with lower compute | Complex routing; harder to serve |
By Access Model
| Type | Examples | Characteristics |
|---|---|---|
| Closed / API-only | GPT-4o, Claude Sonnet, Gemini Pro | State-of-the-art performance; pay-per-use; limited customization |
| Open-weight | Llama 3.1 405B, Mistral Large, Qwen 2.5 | Downloadable weights; self-hostable; community fine-tunes |
| Open-weight + permissive license | Mistral, Gemma, OLMo | Commercial-friendly licenses; research-friendly |
| Fully open | OLMo, Pythia | Weights + training data + code; maximum transparency |
By Size Class
| Class | Parameters | Hardware | Use Cases |
|---|---|---|---|
| Tiny | < 1B | CPU, edge devices | On-device inference, IoT, mobile |
| Small | 1–7B | Single consumer GPU | Personal assistants, lightweight apps |
| Medium | 7–70B | 1–4 GPUs | Enterprise chatbots, code assistants, RAG |
| Large | 70–200B | GPU cluster or high-end single GPU | Research, high-quality generation |
| Frontier | 200B+ | Massive clusters | State-of-the-art benchmarks, general intelligence tasks |
Getting Started
Prerequisites
- Python 3.10+ installed
- 8 GB RAM minimum (16 GB recommended)
- For local models: a GPU with 8+ GB VRAM (NVIDIA recommended for CUDA support)
Option 1: Quick Start with Hugging Face Transformers
# Install core libraries
pip install transformers torch accelerate
# Quick sentiment analysis
python -c "
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love working with LLMs!')
print(result) # [{'label': 'POSITIVE', 'score': 0.9998}]
"
Option 2: Text Generation with Open-Source Models
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto" # Automatically use GPU if available
)
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain quantum computing in 3 sentences."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Option 3: Using API Providers
# OpenAI
pip install openai
# Anthropic
pip install anthropic
# Google
pip install google-generativeai
# OpenAI example
from openai import OpenAI
client = OpenAI(api_key="YOUR_API_KEY")
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What is machine learning?"}],
temperature=0.7,
max_tokens=500
)
print(response.choices[0].message.content)
Option 4: Local Inference with Ollama (No GPU Required)
# Install Ollama: https://ollama.com
ollama run llama3.2:3b "Explain the difference between supervised and unsupervised learning"
# Try different models
ollama run mistral "Write a Python function to sort a list"
ollama run qwen2.5:7b "Summarize the key points of attention mechanisms"
Popular Model Providers
| Provider | Flagship Models | Access | Notable Features |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o mini, o1, o3 | API, ChatGPT | Strong reasoning, tool use, multimodal |
| Anthropic | Claude 3.5/4 Sonnet, Opus | API, Claude.ai | Safety-focused, long context, computer use |
| Gemini 2.0/2.5 Pro, Flash | API, AI Studio | Native multimodal, long context window | |
| Meta | Llama 3.1/3.3, Llama 4 | Open weights | Strong open-source ecosystem |
| Mistral AI | Mistral Large, Mixtral | API + open weights | Efficient MoE architecture |
| Alibaba | Qwen 2.5, QwQ | API + open weights | Strong multilingual and code abilities |
| DeepSeek | DeepSeek V3, R1 | API + open weights | Competitive reasoning, open weights |
Essential Concepts to Learn Next
Understanding LLMs goes beyond just running them. These topics will deepen your expertise:
- Prompt Engineering — Crafting effective instructions
- Tokenization & Embeddings — How text becomes vectors
- Transformer Architecture — Deep dive into attention mechanisms
- Fine-tuning & LoRA — Adapting models to your domain
- RAG Systems — Grounding LLMs in your data
- Inference Optimization — Running models efficiently
Quick Reference: Key Terms
| Term | Definition |
|---|---|
| Token | A subword unit; roughly 0.75 words in English |
| Context Window | Maximum input + output tokens the model can process |
| Temperature | Controls randomness (0 = deterministic, 1 = creative) |
| Top-p | Nucleus sampling; filters low-probability tokens |
| Hallucination | Confident but incorrect or fabricated output |
| Fine-tuning | Further training on domain-specific data |
| RLHF | Reinforcement Learning from Human Feedback |
| RAG | Retrieval-Augmented Generation — grounding with external data |
Related docs
Model Training and Pre-training
The complete LLM training pipeline — data preparation, distributed training, optimization techniques, and checkpoint management
LLM Applications and Use Cases
A survey of real-world LLM applications — from chatbots and code assistants to scientific research and creative industries
LLM Architectures Overview
Compare decoder-only, encoder-only, encoder-decoder, and MoE architectures — understanding the design space of modern language models