Getting Started with LLMs

A comprehensive introduction to Large Language Models — architecture, training, capabilities, and practical setup

Published: 2026-04-01 · Last updated: 2026-04-10

Large Language Models (LLMs) are artificial intelligence systems trained on vast amounts of text data to understand, generate, and manipulate human language. They have become the cornerstone of modern AI, powering everything from chatbots and code assistants to research tools and creative writing aids.

This guide covers what LLMs are, how they work under the hood, the landscape of available models, and how to get started using them in your own projects.

What are LLMs?

LLMs are deep learning models based on the Transformer architecture (introduced in the 2017 paper "Attention Is All You Need"). They are characterized by several key properties:

  • Scale: Ranging from hundreds of millions to trillions of parameters
  • Pre-training: Trained with self-supervised objectives on massive text corpora spanning web pages, books, code, and scientific papers
  • Emergent Abilities: Capabilities that appear unpredictably at scale — reasoning, code generation, translation — that were not explicitly trained for
  • Few-shot & Zero-shot Learning: Ability to perform tasks from natural language instructions without task-specific fine-tuning
  • Multimodality: Modern LLMs increasingly handle text, images, audio, video, and structured data
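
To make the few-shot idea concrete, here is an illustrative prompt pair for a sentiment task; the reviews and labels are invented for the example:

```python
# Illustrative prompts only; the reviews and labels are invented.

# Zero-shot: the task is described, with no solved examples.
zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery dies within an hour.\n"
    "Sentiment:"
)

# Few-shot: two solved examples precede the real query.
few_shot = (
    "Classify the sentiment of each review as positive or negative.\n\n"
    "Review: Great screen and fast shipping.\n"
    "Sentiment: positive\n\n"
    "Review: Stopped working after two days.\n"
    "Sentiment: negative\n\n"
    "Review: The battery dies within an hour.\n"
    "Sentiment:"
)
```

The model completes the text after the final "Sentiment:"; the in-context examples steer it toward the desired output format without any weight updates.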

A Brief History

| Era | Milestone | Significance |
|---|---|---|
| 2017 | Transformer architecture | Replaced RNNs/CNNs as the dominant sequence model |
| 2018 | GPT, BERT | Proved the pre-training + fine-tuning paradigm |
| 2019 | GPT-2 (1.5B) | Showed scaling improves coherence and task ability |
| 2020 | GPT-3 (175B), T5 | Demonstrated few-shot learning at scale |
| 2021 | InstructGPT, Codex | Alignment via RLHF; code generation |
| 2022 | ChatGPT, PaLM | Conversational AI goes mainstream |
| 2023 | GPT-4, Llama, Claude, Mistral | Multimodal models; open-source renaissance |
| 2024 | Llama 3, Claude 3, Gemini | Frontier models rival human experts |
| 2025–2026 | Reasoning models, agentic systems | Long-horizon planning, tool use, autonomy |

How LLMs Work

The Transformer Architecture

The Transformer replaced the sequential processing of RNNs with self-attention, enabling parallel computation across entire sequences. Here's the core idea:

import torch
import torch.nn as nn
import math

class MultiHeadSelfAttention(nn.Module):
    """Self-attention mechanism: each token attends to all others."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        batch_size = x.size(0)
        
        # Linear projections
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention = torch.softmax(scores, dim=-1)
        
        # Weighted sum of values
        output = torch.matmul(attention, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
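
To see what the class above computes, here is a dependency-free sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, on toy 2-token inputs; the numbers are chosen only for illustration, and the learned projections are omitted:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(Q, K, V):
    """Single-head softmax(Q K^T / sqrt(d_k)) V on plain Python lists."""
    d_k = len(K[0])
    weights, out = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output row is a convex combination of the value rows.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0]]   # token 1 "looks for" key 1, token 2 for key 2
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = toy_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each token's output pulls mostly from the value row whose key best matches its query, which is exactly the "contextual mixing" the attention layer performs.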

Key Components

| Component | Purpose | Details |
|---|---|---|
| Tokenizer | Text → subword tokens | Byte-pair encoding (BPE), SentencePiece, or tiktoken; vocabulary sizes 32K–200K tokens |
| Embedding Layer | Tokens → dense vectors | Learnable embedding matrices; positional encodings add sequence order information |
| Attention Mechanism | Contextual token mixing | Scaled dot-product attention; multi-head enables attending to different subspace features |
| Feed-Forward Network | Non-linear transformation | Typically 2-layer MLP with GeLU activation; ~4x hidden dimension expansion |
| Layer Normalization | Training stability | Pre-LN or Post-LN placement; RMSNorm is common in modern models |
| Residual Connections | Gradient flow | Skip connections around each sub-layer |
| Output Head | Token prediction | Linear layer projecting to vocabulary; softmax for probability distribution |
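
The tokenizer row can be made concrete with a toy greedy longest-match tokenizer. This is a simplification: real BPE applies merge rules learned from a corpus, and the vocabulary below is invented for the example:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization (illustrative only;
    real BPE instead applies merge rules learned from a corpus)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # no piece matched: unknown character
            tokens.append("<unk>")
            i += 1
    return tokens

vocab = {"token", "iza", "tion", "un", "able"}   # invented toy vocabulary
pieces = greedy_tokenize("tokenization", vocab)  # ['token', 'iza', 'tion']
```

One word can become several tokens, which is why token counts, not word counts, determine context-window usage.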

Training Pipeline

  1. Pre-training: Model learns to predict the next token (autoregressive LM) on trillions of tokens. This takes weeks on thousands of GPUs and costs millions of dollars for frontier models.
  2. Supervised Fine-tuning (SFT): Model is further trained on high-quality instruction-response pairs to follow human directions.
  3. Alignment (RLHF/DPO): Reinforcement Learning from Human Feedback or Direct Preference Optimization aligns outputs with human values — helpfulness, honesty, harmlessness.
  4. Safety Filtering: Additional guardrails prevent harmful outputs, jailbreaks, and policy violations.
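
Step 1 can be illustrated by the loss it minimizes: cross-entropy on the next token, i.e. -log p(correct token). A minimal sketch with an invented distribution over a tiny vocabulary:

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy for a single prediction: -log p(correct next token)."""
    return -math.log(probs[target])

# Invented probabilities over a tiny vocabulary after the prefix "the cat".
probs = {"sat": 0.5, "ran": 0.3, "the": 0.15, "mat": 0.05}

loss = next_token_loss(probs, "sat")      # confident and right: low loss
bad_loss = next_token_loss(probs, "mat")  # only 5% assigned: high loss
```

Pre-training averages this loss over trillions of such predictions; a perfectly confident correct prediction gives a loss of exactly zero.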

Types of LLMs

By Architecture

| Type | Examples | Strengths | Weaknesses |
|---|---|---|---|
| Decoder-only | GPT-4, Llama 3, Claude, Mistral | Autoregressive generation, versatile, dominant paradigm | Can hallucinate; unidirectional attention |
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional understanding, excellent for classification | Cannot generate text natively |
| Encoder-Decoder | T5, BART, Flan-T5 | Sequence-to-sequence tasks, translation | Less common now; outperformed by decoder-only at scale |
| Mixture of Experts (MoE) | Mixtral 8x7B, Grok, Qwen MoE | Sparse activation; high capacity with lower compute | Complex routing; harder to serve |
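
The MoE row can be sketched with a toy top-2 router: a gate scores the experts, only the two best run, and their outputs are mixed by renormalized gate weights. The experts and gate logits here are invented placeholders; real experts are feed-forward networks inside each layer:

```python
import math

def top2_route(gate_logits):
    """Select the two highest-scoring experts; renormalize their softmax weights."""
    top2 = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:2]
    exps = [math.exp(gate_logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, experts, gate_logits):
    """Run only the selected experts and mix their outputs (sparse activation)."""
    return sum(weight * experts[i](x) for i, weight in top2_route(gate_logits))

# Four invented toy 'experts' standing in for feed-forward networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 1, lambda x: x * x]
y = moe_layer(3.0, experts, gate_logits=[0.1, 2.0, 0.2, 1.0])  # experts 1 and 3 fire
```

Only 2 of the 4 experts execute per input, which is how MoE models hold more parameters than they activate per token.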

By Access Model

| Type | Examples | Characteristics |
|---|---|---|
| Closed / API-only | GPT-4o, Claude Sonnet, Gemini Pro | State-of-the-art performance; pay-per-use; limited customization |
| Open-weight | Llama 3.1 405B, Mistral Large, Qwen 2.5 | Downloadable weights; self-hostable; community fine-tunes |
| Open-weight + permissive license | Mistral, Gemma, OLMo | Commercial-friendly licenses; research-friendly |
| Fully open | OLMo, Pythia | Weights + training data + code; maximum transparency |

By Size Class

| Class | Parameters | Hardware | Use Cases |
|---|---|---|---|
| Tiny | < 1B | CPU, edge devices | On-device inference, IoT, mobile |
| Small | 1–7B | Single consumer GPU | Personal assistants, lightweight apps |
| Medium | 7–70B | 1–4 GPUs | Enterprise chatbots, code assistants, RAG |
| Large | 70–200B | GPU cluster, or a single high-end GPU with quantization | Research, high-quality generation |
| Frontier | 200B+ | Massive clusters | State-of-the-art benchmarks, general intelligence tasks |
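
A handy rule of thumb behind the hardware column: weight memory is parameter count times bytes per parameter. The sketch below covers weights only; KV cache, activations, and framework overhead come on top:

```python
def weight_memory_gb(n_params, bits_per_param=16):
    """Back-of-the-envelope memory for the weights alone; KV cache,
    activations, and runtime overhead are extra."""
    return n_params * bits_per_param / 8 / 1e9

weight_memory_gb(7e9)                     # 14.0 GB: a 7B model in fp16
weight_memory_gb(7e9, bits_per_param=4)   #  3.5 GB: the same model 4-bit quantized
weight_memory_gb(70e9, bits_per_param=4)  # 35.0 GB: why a quantized 70B fits one big GPU
```

This is why quantization moves a model down one hardware class: halving or quartering bits per parameter shrinks the weight footprint proportionally.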

Getting Started

Prerequisites

  • Python 3.10+ installed
  • 8 GB RAM minimum (16 GB recommended)
  • For local models: a GPU with 8+ GB VRAM (NVIDIA recommended for CUDA support)

Option 1: Quick Start with Hugging Face Transformers

# Install core libraries
pip install transformers torch accelerate

# Quick sentiment analysis
python -c "
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love working with LLMs!')
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
"

Option 2: Text Generation with Open-Source Models

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"  # Automatically use GPU if available
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in 3 sentences."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Option 3: Using API Providers

# OpenAI
pip install openai

# Anthropic
pip install anthropic

# Google
pip install google-generativeai

# OpenAI example
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

Option 4: Local Inference with Ollama (No GPU Required)

# Install Ollama: https://ollama.com
ollama run llama3.2:3b "Explain the difference between supervised and unsupervised learning"

# Try different models
ollama run mistral "Write a Python function to sort a list"
ollama run qwen2.5:7b "Summarize the key points of attention mechanisms"

Major Providers

| Provider | Flagship Models | Access | Notable Features |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o mini, o1, o3 | API, ChatGPT | Strong reasoning, tool use, multimodal |
| Anthropic | Claude 3.5/4 Sonnet, Opus | API, Claude.ai | Safety-focused, long context, computer use |
| Google | Gemini 2.0/2.5 Pro, Flash | API, AI Studio | Native multimodal, long context window |
| Meta | Llama 3.1/3.3, Llama 4 | Open weights | Strong open-source ecosystem |
| Mistral AI | Mistral Large, Mixtral | API + open weights | Efficient MoE architecture |
| Alibaba | Qwen 2.5, QwQ | API + open weights | Strong multilingual and code abilities |
| DeepSeek | DeepSeek V3, R1 | API + open weights | Competitive reasoning, open weights |

Essential Concepts to Learn Next

Understanding LLMs goes beyond just running them. Topics such as fine-tuning, RLHF, and retrieval-augmented generation (RAG) will deepen your expertise; the key terms below are a starting point.

Quick Reference: Key Terms

| Term | Definition |
|---|---|
| Token | A subword unit; roughly 0.75 words in English |
| Context Window | Maximum input + output tokens the model can process |
| Temperature | Controls sampling randomness; 0 is near-deterministic, higher values increase diversity |
| Top-p | Nucleus sampling; restricts sampling to the smallest set of tokens whose cumulative probability reaches p |
| Hallucination | Confident but incorrect or fabricated output |
| Fine-tuning | Further training on domain-specific data |
| RLHF | Reinforcement Learning from Human Feedback |
| RAG | Retrieval-Augmented Generation — grounding with external data |
