Advanced Technical

Energy & Environmental Impact of LLMs

The environmental cost of LLMs — training energy, inference energy, carbon footprint, water usage, and sustainable AI practices

Published: 2026-04-25 · Last updated: 2026-04-25

Energy & Environmental Impact of LLMs

The environmental cost of Large Language Models is substantial and growing. Training a single frontier model can consume as much energy as hundreds of homes use in a year, and the cumulative inference cost of serving billions of requests daily is orders of magnitude larger. This guide provides a comprehensive analysis of the environmental impact of LLMs and practical strategies for reducing it.

The Scale of the Problem

Energy Consumption at a Glance

Activity	Estimated Energy	Equivalent	CO2 Emissions
Training GPT-4-class model	~50 GWh	5,000 US homes for 1 year	~20,000 tonnes CO2e
Training Llama 3.1 405B	~30 GWh	3,000 US homes for 1 year	~12,000 tonnes CO2e
Single GPT-4 response (avg)	~0.003 kWh	Charging a phone once	~1.5 g CO2e
1M GPT-4 responses	~3,000 kWh	100 US homes for 1 day	~1,500 kg CO2e
1B GPT-4 responses	~3 GWh	Small town for 1 day	~1,500 tonnes CO2e
Daily global LLM inference (est.)	~500 GWh	Denmark's daily electricity	~200,000 tonnes CO2e
Annual global LLM inference (est.)	~180 TWh	Argentina's annual electricity	~72 million tonnes CO2e

Estimates based on published research and provider disclosures. Actual figures vary and are often not publicly disclosed.

The Inference Problem

While training gets the most attention, inference dominates total environmental impact for widely-used models:

Total lifecycle energy of a deployed LLM:

Training:        ████████████████████████████████████████  30 GWh  (5%)
Inference (1yr): ████████████████████████████████████████████████████████████████████████████████████  570 GWh  (95%)

Key insight: A model used by millions of people daily has an inference
footprint that dwarfs its training cost within weeks.

Training Energy Breakdown

Where Does Training Energy Go?

def estimate_training_energy(
    model_parameters: int,        # e.g., 405_000_000_000
    training_tokens: int,          # e.g., 15_000_000_000_000
    gpu_type: str = "H100",
    gpu_count: int = 24000,
    training_days: int = 54,
    pue: float = 1.15,            # Power Usage Effectiveness of datacenter
) -> dict:
    """Estimate the energy consumption of training an LLM."""

    # GPU power consumption
    gpu_tdp = {
        "H100": 700,    # Watts (SXM5)
        "A100": 400,    # Watts (SXM4)
        "A100_80G": 500,
        "L40S": 350,
        "B200": 1000,
    }

    gpu_watts = gpu_tdp.get(gpu_type, 700) * gpu_count

    # Total compute energy (GPU + overhead)
    training_hours = training_days * 24
    compute_kwh = (gpu_watts / 1000) * training_hours * pue
    compute_mwh = compute_kwh / 1000
    compute_gwh = compute_mwh / 1000

    # Additional energy costs
    data_prep_energy = compute_gwh * 0.02      # Data processing and preparation
    experiment_energy = compute_gwh * 0.10      # Failed runs, hyperparameter search
    evaluation_energy = compute_gwh * 0.03      # Evaluation and benchmarking

    total_gwh = compute_gwh + data_prep_energy + experiment_energy + evaluation_energy

    # Carbon footprint (varies by grid)
    carbon_intensity = {
        "us_average": 0.386,    # kg CO2e per kWh
        "eu_average": 0.230,
        "norway": 0.023,
        "france": 0.052,
        "china": 0.555,
        "india": 0.708,
    }

    carbon_footprint = {
        region: round(total_gwh * 1_000_000 * intensity / 1000)
        for region, intensity in carbon_intensity.items()
    }

    return {
        "compute_energy_gwh": round(compute_gwh, 1),
        "data_prep_energy_gwh": round(data_prep_energy, 1),
        "experiment_energy_gwh": round(experiment_energy, 1),
        "evaluation_energy_gwh": round(evaluation_energy, 1),
        "total_energy_gwh": round(total_gwh, 1),
        "carbon_footprint_tonnes": carbon_footprint,
        "flops_estimate": round(model_parameters * training_tokens * 6 / 1e24, 2),  # ZettaFLOPs
        "efficiency_flops_per_watt_hour": round(model_parameters * training_tokens * 6 / (compute_kwh * 3600), 1),
    }

# Estimate for a 405B model
energy = estimate_training_energy(
    model_parameters=405_000_000_000,
    training_tokens=15_000_000_000_000,
    gpu_type="H100",
    gpu_count=24000,
    training_days=54,
)

print(f"Total training energy: {energy['total_energy_gwh']} GWh")
print(f"Carbon footprint (US grid): {energy['carbon_footprint_tonnes']['us_average']} tonnes CO2e")
print(f"Carbon footprint (Norway): {energy['carbon_footprint_tonnes']['norway']} tonnes CO2e")
# Total training energy: ~32.5 GWh
# Carbon footprint (US grid): ~12,500 tonnes CO2e
# Carbon footprint (Norway): ~750 tonnes CO2e

Training Energy by Model Size

Model Size	Approx. GPU Hours	Energy (GWh)	CO2e (US grid, tonnes)	CO2e (EU grid, tonnes)
7B	~30,000	~0.02	~8	~5
13B	~100,000	~0.07	~27	~16
70B	~1,000,000	~0.7	~270	~161
405B	~30,000,000	~20-30	~8,000-12,000	~4,600-7,000
1T+ (est.)	~100,000,000+	~70-100	~27,000-39,000	~16,000-23,000

Inference Energy

Per-Request Energy Cost

def estimate_inference_energy_per_request(
    model_size_params: int,
    input_tokens: int,
    output_tokens: int,
    gpu_type: str = "H100",
    gpu_utilization: float = 0.6,
    requests_per_gpu_per_second: float = 10,
    pue: float = 1.15,
) -> dict:
    """Estimate energy per inference request."""

    gpu_tdp_watts = {"H100": 700, "A100": 400, "L40S": 350, "B200": 1000}
    gpu_watts = gpu_tdp_watts.get(gpu_type, 700)

    # Energy per second per GPU (including overhead)
    energy_per_second = gpu_watts * pue  # Joules/second = Watts

    # Requests processed per second
    requests_per_second = requests_per_gpu_per_second

    # Energy per request
    energy_per_request_joules = energy_per_second / requests_per_second
    energy_per_request_kwh = energy_per_request_joules / 3_600_000

    total_tokens = input_tokens + output_tokens

    return {
        "energy_per_request_wh": round(energy_per_request_joules / 3600, 4),
        "energy_per_request_kwh": round(energy_per_request_kwh, 6),
        "energy_per_token_wh": round(energy_per_request_joules / (3600 * total_tokens), 6),
        "co2_per_request_g": round(energy_per_request_kwh * 386, 4),  # US grid
        "equivalent": {
            "phone_charging_seconds": round(energy_per_request_joules / 5 * 60, 1),  # 5W phone charger
            "led_bulb_seconds": round(energy_per_request_joules / 10 * 60, 1),  # 10W LED bulb
        }
    }

# Compare different models
scenarios = [
    ("Llama 70B (H100, batched)", 70_000_000_000, 500, 200, "H100", 0.6, 25),
    ("Llama 405B (H100, batched)", 405_000_000_000, 500, 200, "H100", 0.6, 5),
    ("GPT-4.1 Mini (shared infra)", None, 500, 200, "H100", 0.6, 100),
    ("GPT-4.1 (shared infra)", None, 500, 200, "H100", 0.6, 15),
]

print(f"{'Scenario':<40} {'Energy/Req (Wh)':>16} {'CO2/Req (g)':>14}")
print("-" * 72)
for name, params, inp, out, gpu, util, rps in scenarios:
    result = estimate_inference_energy_per_request(params or 100_000_000_000, inp, out, gpu, util, rps)
    print(f"{name:<40} {result['energy_per_request_wh']:>16.4f} {result['co2_per_request_g']:>14.4f}")

Annual Inference Energy at Scale

Daily Requests	Small Model (7B)	Medium (70B)	Large (405B)	GPT-4-class
10,000	~10 kWh/day	~50 kWh/day	~200 kWh/day	~300 kWh/day
100,000	~100 kWh/day	~500 kWh/day	~2,000 kWh/day	~3,000 kWh/day
1,000,000	~1 MWh/day	~5 MWh/day	~20 MWh/day	~30 MWh/day
10,000,000	~10 MWh/day	~50 MWh/day	~200 MWh/day	~300 MWh/day
100,000,000	~100 MWh/day	~500 MWh/day	~2 GWh/day	~3 GWh/day

Water Usage

Datacenters consume vast amounts of water for cooling. This is often overlooked in environmental discussions.

Water Consumption Estimates

Activity	Water Usage	Notes
Training a 405B model	~300,000-500,000 liters	Equivalent to 200-350 car washes
Training a 70B model	~10,000-20,000 liters	~10-15 car washes
1M inference requests (70B)	~500-1,000 liters	Depends on datacenter cooling efficiency
1M inference requests (405B)	~2,000-5,000 liters	Larger models = more heat = more water
Annual inference (100M req/day)	~180-500 million liters	Small city's annual water usage

Water Usage Calculator

def estimate_water_usage(
    total_compute_kwh: float,
    datacenter_wue: float = 1.8,  # Water Usage Effectiveness (liters per kWh)
) -> dict:
    """Estimate water consumption for compute."""
    water_liters = total_compute_kwh * datacenter_wue

    # Contextualize
    shower_liters = 60        # Average 8-minute shower
    car_wash_liters = 150     # Average commercial car wash
    pool_liters = 50000       # Average residential pool

    return {
        "water_liters": round(water_liters, 0),
        "water_gallons": round(water_liters * 0.264, 0),
        "equivalent_showers": round(water_liters / shower_liters),
        "equivalent_car_washes": round(water_liters / car_wash_liters),
        "equivalent_pools": round(water_liters / pool_liters, 1),
    }

# Training a 405B model
water = estimate_water_usage(30_000_000)  # 30 GWh in kWh
print(f"Water usage: {water['water_liters']:,.0f} liters ({water['water_gallons']:,.0f} gallons)")
print(f"Equivalent to {water['equivalent_showers']:,} showers")
print(f"Equivalent to {water['equivalent_car_washes']:,} car washes")

Sustainable AI Practices

1. Choose Efficient Models

def compare_sustainability_options(
    task_requirements: dict,
) -> dict:
    """Compare environmental impact of different model choices."""

    models = {
        "GPT-4.1": {"params": "unknown", "energy_per_req_kwh": 0.05, "capability_score": 95},
        "GPT-4.1 Mini": {"params": "unknown", "energy_per_req_kwh": 0.008, "capability_score": 88},
        "GPT-4.1 Nano": {"params": "unknown", "energy_per_req_kwh": 0.003, "capability_score": 80},
        "Claude Sonnet 4": {"params": "unknown", "energy_per_req_kwh": 0.04, "capability_score": 93},
        "Claude Haiku 3.5": {"params": "unknown", "energy_per_req_kwh": 0.01, "capability_score": 82},
        "Llama 3.1 70B": {"params": "70B", "energy_per_req_kwh": 0.015, "capability_score": 82},
        "Llama 3.1 405B": {"params": "405B", "energy_per_req_kwh": 0.08, "capability_score": 90},
        "Distilled 7B": {"params": "7B", "energy_per_req_kwh": 0.002, "capability_score": 65},
    }

    min_capability = task_requirements.get("min_capability_score", 70)

    results = []
    for name, info in models.items():
        if info["capability_score"] >= min_capability:
            annual_kwh = info["energy_per_req_kwh"] * task_requirements["daily_requests"] * 365
            annual_co2_kg = annual_kwh * 0.386  # US grid
            results.append({
                "model": name,
                "capability": info["capability_score"],
                "energy_per_request_wh": info["energy_per_req_kwh"] * 1000,
                "annual_energy_mwh": round(annual_kwh / 1000, 1),
                "annual_co2_tonnes": round(annual_co2_kg / 1000, 1),
            })

    return sorted(results, key=lambda x: x["annual_co2_tonnes"])

# Example: 100K requests/day, minimum capability 80
results = compare_sustainability_options({
    "daily_requests": 100_000,
    "min_capability_score": 80,
})

print(f"{'Model':<25} {'Capability':>12} {'CO2/yr (tonnes)':>18}")
print("-" * 57)
for r in results:
    print(f"{r['model']:<25} {r['capability']:>12} {r['annual_co2_tonnes']:>18.1f}")

2. Optimize Inference

inference_optimization_strategies:
  quantization:
    description: "Reduce precision from FP16 to INT8 or INT4"
    energy_savings: "30-60%"
    quality_impact: "Minimal for INT8, noticeable for INT4"
    tools: ["bitsandbytes", "AWQ", "GGUF", "TensorRT-LLM"]

  speculative_decoding:
    description: "Use small model to draft, large model to verify"
    energy_savings: "30-50% (fewer forward passes in large model)"
    quality_impact: "None (same output)"
    tools: ["vLLM speculative decoding", "Medusa", "EAGLE"]

  caching:
    description: "Cache responses for repeated/similar queries"
    energy_savings: "20-60% (depends on query repeat rate)"
    quality_impact: "None for exact matches, requires similarity threshold for fuzzy"
    tools: ["semantic cache", "Redis embedding cache"]

  batch_sizing:
    description: "Optimize batch size for GPU utilization"
    energy_savings: "10-30% better utilization"
    quality_impact: "None"
    tools: ["vLLM continuous batching", "TGI dynamic batching"]

  early_exit:
    description: "Exit transformer layers early for easy queries"
    energy_savings: "20-40% for simple queries"
    quality_impact: "Quality depends on exit threshold tuning"
    tools: ["Custom layer-exit logic", "PAB (Patience-Based)"]

  model_cascading:
    description: "Route easy queries to small models, hard to large"
    energy_savings: "40-70% (if most queries are easy)"
    quality_impact: "Maintains quality if routing is accurate"
    tools: ["LLM router", "confidence-based routing"]

3. Carbon-Aware Scheduling

import requests
from datetime import datetime, timedelta

def get_carbon_intensity(region: str = "US-CAISO") -> dict:
    """Get current and forecasted grid carbon intensity."""
    # Using Electricity Maps API or similar
    response = requests.get(
        f"https://api.electricitymap.org/v3/carbon-intensity/latest",
        params={"zone": region},
        headers={"auth-token": "YOUR_TOKEN"},
    )
    data = response.json()
    return {
        "carbon_intensity_gco2_per_kwh": data["carbonIntensity"],
        "fossil_fuel_percentage": data.get("fossilFuelPercentage"),
        "timestamp": data["datetime"],
    }

def schedule_training_optimal(
    training_hours_needed: int,
    region: str = "US-CAISO",
    forecast_hours: int = 72,
) -> dict:
    """Find the most carbon-efficient window for training."""
    # Get forecast
    forecast = requests.get(
        f"https://api.electricitymap.org/v3/carbon-intensity/forecast",
        params={"zone": region},
        headers={"auth-token": "YOUR_TOKEN"},
    )
    data = forecast.json()

    # Find window with lowest average carbon intensity
    best_start = None
    best_avg = float("inf")

    for i in range(len(data["forecast"]) - training_hours_needed):
        window = data["forecast"][i:i + training_hours_needed]
        avg_intensity = sum(w["carbonIntensity"] for w in window) / len(window)

        if avg_intensity < best_avg:
            best_avg = avg_intensity
            best_start = i

    start_time = datetime.fromisoformat(data["forecast"][best_start]["datetime"])
    end_time = start_time + timedelta(hours=training_hours_needed)

    return {
        "optimal_start": start_time.isoformat(),
        "optimal_end": end_time.isoformat(),
        "avg_carbon_intensity": round(best_avg, 1),
        "carbon_savings_vs_worst_pct": round(
            (1 - best_avg / max(sum(w["carbonIntensity"] for w in data["forecast"][j:j+training_hours_needed]) / training_hours_needed
                                for j in range(len(data["forecast"]) - training_hours_needed))) * 100,
            1
        ),
    }

4. Green Infrastructure Choices

Provider	Renewable Energy	Carbon Neutral	Water-Efficient Cooling	Notes
Google Cloud	100% renewable (matched)	Yes	Yes (some regions)	24/7 carbon-free energy by 2030 goal
AWS	Growing renewables	Partially	Improving	Climate Pledge: net-zero by 2040
Azure	100% renewable (matched)	Yes (by 2030 target)	Yes (some regions)	Carbon negative by 2030
CoreWeave	Partial	No	Limited	GPU-specialized, growing green commitment
Lambda Labs	Limited	No	Limited	GPU cloud, less transparency
On-prem (renewable-powered)	Depends on your energy source	Depends	Depends	Maximum control, requires investment

5. Report and Track Your Impact

# sustainability-report.yaml
reporting_framework:
  metrics_to_track:
    training:
      - "Total GPU hours consumed"
      - "Total energy (kWh/GWh)"
      - "Carbon footprint (tonnes CO2e) by grid region"
      - "Water consumption (liters)"
      - "Number of failed experiments (avoidable waste)"
    inference:
      - "Energy per request (kWh)"
      - "Energy per token (kWh)"
      - "Monthly/annual total energy"
      - "Carbon footprint per 1K requests"
      - "Cache hit rate (energy saved)"
    optimization:
      - "Quantization level and energy saved"
      - "Speculative decoding speedup"
      - "Model cascading energy savings"
      - "Percentage of requests served by efficient models"

  reporting_cadence:
    monthly: "Internal dashboard review"
    quarterly: "Leadership summary"
    annually: "Public sustainability report"

  targets:
    - "Reduce energy per request by 50% year-over-year"
    - "Route 80% of queries through efficient model cascade"
    - "Achieve 95%+ carbon-free energy for training by 2027"
    - "Publish annual environmental impact assessment"

Cross-References

Cost Management & Optimization — Energy savings often align with cost savings
Inference Optimization & Quantization — Technical methods to reduce inference energy
Speculative Decoding — Energy-efficient decoding technique
Knowledge Distillation — Create smaller, more efficient models
Model Comparison Guide — Factor efficiency into model selection