Advanced Technical

Energy & Environmental Impact of LLMs

The environmental cost of LLMs — training energy, inference energy, carbon footprint, water usage, and sustainable AI practices

Published: 2026-04-25 · Last updated: 2026-04-25

Energy & Environmental Impact of LLMs

The environmental cost of Large Language Models is substantial and growing. Training a single frontier model can consume as much energy as hundreds of homes use in a year, and the cumulative inference cost of serving billions of requests daily is orders of magnitude larger. This guide provides a comprehensive analysis of the environmental impact of LLMs and practical strategies for reducing it.

The Scale of the Problem

Energy Consumption at a Glance

ActivityEstimated EnergyEquivalentCO2 Emissions
Training GPT-4-class model~50 GWh5,000 US homes for 1 year~20,000 tonnes CO2e
Training Llama 3.1 405B~30 GWh3,000 US homes for 1 year~12,000 tonnes CO2e
Single GPT-4 response (avg)~0.003 kWhCharging a phone once~1.5 g CO2e
1M GPT-4 responses~3,000 kWh100 US homes for 1 day~1,500 kg CO2e
1B GPT-4 responses~3 GWhSmall town for 1 day~1,500 tonnes CO2e
Daily global LLM inference (est.)~500 GWhDenmark's daily electricity~200,000 tonnes CO2e
Annual global LLM inference (est.)~180 TWhArgentina's annual electricity~72 million tonnes CO2e

Estimates based on published research and provider disclosures. Actual figures vary and are often not publicly disclosed.

The Inference Problem

While training gets the most attention, inference dominates total environmental impact for widely-used models:

Total lifecycle energy of a deployed LLM:

Training:        ████████████████████████████████████████  30 GWh  (5%)
Inference (1yr): ████████████████████████████████████████████████████████████████████████████████████  570 GWh  (95%)

Key insight: A model used by millions of people daily has an inference
footprint that dwarfs its training cost within weeks.

Training Energy Breakdown

Where Does Training Energy Go?

def estimate_training_energy(
    model_parameters: int,        # e.g., 405_000_000_000
    training_tokens: int,          # e.g., 15_000_000_000_000
    gpu_type: str = "H100",
    gpu_count: int = 24000,
    training_days: int = 54,
    pue: float = 1.15,            # Power Usage Effectiveness of datacenter
) -> dict:
    """Estimate the energy consumption of training an LLM."""

    # GPU power consumption
    gpu_tdp = {
        "H100": 700,    # Watts (SXM5)
        "A100": 400,    # Watts (SXM4)
        "A100_80G": 500,
        "L40S": 350,
        "B200": 1000,
    }

    gpu_watts = gpu_tdp.get(gpu_type, 700) * gpu_count

    # Total compute energy (GPU + overhead)
    training_hours = training_days * 24
    compute_kwh = (gpu_watts / 1000) * training_hours * pue
    compute_mwh = compute_kwh / 1000
    compute_gwh = compute_mwh / 1000

    # Additional energy costs
    data_prep_energy = compute_gwh * 0.02      # Data processing and preparation
    experiment_energy = compute_gwh * 0.10      # Failed runs, hyperparameter search
    evaluation_energy = compute_gwh * 0.03      # Evaluation and benchmarking

    total_gwh = compute_gwh + data_prep_energy + experiment_energy + evaluation_energy

    # Carbon footprint (varies by grid)
    carbon_intensity = {
        "us_average": 0.386,    # kg CO2e per kWh
        "eu_average": 0.230,
        "norway": 0.023,
        "france": 0.052,
        "china": 0.555,
        "india": 0.708,
    }

    carbon_footprint = {
        region: round(total_gwh * 1_000_000 * intensity / 1000)
        for region, intensity in carbon_intensity.items()
    }

    return {
        "compute_energy_gwh": round(compute_gwh, 1),
        "data_prep_energy_gwh": round(data_prep_energy, 1),
        "experiment_energy_gwh": round(experiment_energy, 1),
        "evaluation_energy_gwh": round(evaluation_energy, 1),
        "total_energy_gwh": round(total_gwh, 1),
        "carbon_footprint_tonnes": carbon_footprint,
        "flops_estimate": round(model_parameters * training_tokens * 6 / 1e24, 2),  # ZettaFLOPs
        "efficiency_flops_per_watt_hour": round(model_parameters * training_tokens * 6 / (compute_kwh * 3600), 1),
    }

# Estimate for a 405B model
energy = estimate_training_energy(
    model_parameters=405_000_000_000,
    training_tokens=15_000_000_000_000,
    gpu_type="H100",
    gpu_count=24000,
    training_days=54,
)

print(f"Total training energy: {energy['total_energy_gwh']} GWh")
print(f"Carbon footprint (US grid): {energy['carbon_footprint_tonnes']['us_average']} tonnes CO2e")
print(f"Carbon footprint (Norway): {energy['carbon_footprint_tonnes']['norway']} tonnes CO2e")
# Total training energy: ~32.5 GWh
# Carbon footprint (US grid): ~12,500 tonnes CO2e
# Carbon footprint (Norway): ~750 tonnes CO2e

Training Energy by Model Size

Model SizeApprox. GPU HoursEnergy (GWh)CO2e (US grid, tonnes)CO2e (EU grid, tonnes)
7B~30,000~0.02~8~5
13B~100,000~0.07~27~16
70B~1,000,000~0.7~270~161
405B~30,000,000~20-30~8,000-12,000~4,600-7,000
1T+ (est.)~100,000,000+~70-100~27,000-39,000~16,000-23,000

Inference Energy

Per-Request Energy Cost

def estimate_inference_energy_per_request(
    model_size_params: int,
    input_tokens: int,
    output_tokens: int,
    gpu_type: str = "H100",
    gpu_utilization: float = 0.6,
    requests_per_gpu_per_second: float = 10,
    pue: float = 1.15,
) -> dict:
    """Estimate energy per inference request."""

    gpu_tdp_watts = {"H100": 700, "A100": 400, "L40S": 350, "B200": 1000}
    gpu_watts = gpu_tdp_watts.get(gpu_type, 700)

    # Energy per second per GPU (including overhead)
    energy_per_second = gpu_watts * pue  # Joules/second = Watts

    # Requests processed per second
    requests_per_second = requests_per_gpu_per_second

    # Energy per request
    energy_per_request_joules = energy_per_second / requests_per_second
    energy_per_request_kwh = energy_per_request_joules / 3_600_000

    total_tokens = input_tokens + output_tokens

    return {
        "energy_per_request_wh": round(energy_per_request_joules / 3600, 4),
        "energy_per_request_kwh": round(energy_per_request_kwh, 6),
        "energy_per_token_wh": round(energy_per_request_joules / (3600 * total_tokens), 6),
        "co2_per_request_g": round(energy_per_request_kwh * 386, 4),  # US grid
        "equivalent": {
            "phone_charging_seconds": round(energy_per_request_joules / 5 * 60, 1),  # 5W phone charger
            "led_bulb_seconds": round(energy_per_request_joules / 10 * 60, 1),  # 10W LED bulb
        }
    }

# Compare different models
scenarios = [
    ("Llama 70B (H100, batched)", 70_000_000_000, 500, 200, "H100", 0.6, 25),
    ("Llama 405B (H100, batched)", 405_000_000_000, 500, 200, "H100", 0.6, 5),
    ("GPT-4.1 Mini (shared infra)", None, 500, 200, "H100", 0.6, 100),
    ("GPT-4.1 (shared infra)", None, 500, 200, "H100", 0.6, 15),
]

print(f"{'Scenario':<40} {'Energy/Req (Wh)':>16} {'CO2/Req (g)':>14}")
print("-" * 72)
for name, params, inp, out, gpu, util, rps in scenarios:
    result = estimate_inference_energy_per_request(params or 100_000_000_000, inp, out, gpu, util, rps)
    print(f"{name:<40} {result['energy_per_request_wh']:>16.4f} {result['co2_per_request_g']:>14.4f}")

Annual Inference Energy at Scale

Daily RequestsSmall Model (7B)Medium (70B)Large (405B)GPT-4-class
10,000~10 kWh/day~50 kWh/day~200 kWh/day~300 kWh/day
100,000~100 kWh/day~500 kWh/day~2,000 kWh/day~3,000 kWh/day
1,000,000~1 MWh/day~5 MWh/day~20 MWh/day~30 MWh/day
10,000,000~10 MWh/day~50 MWh/day~200 MWh/day~300 MWh/day
100,000,000~100 MWh/day~500 MWh/day~2 GWh/day~3 GWh/day

Water Usage

Datacenters consume vast amounts of water for cooling. This is often overlooked in environmental discussions.

Water Consumption Estimates

ActivityWater UsageNotes
Training a 405B model~300,000-500,000 litersEquivalent to 200-350 car washes
Training a 70B model~10,000-20,000 liters~10-15 car washes
1M inference requests (70B)~500-1,000 litersDepends on datacenter cooling efficiency
1M inference requests (405B)~2,000-5,000 litersLarger models = more heat = more water
Annual inference (100M req/day)~180-500 million litersSmall city's annual water usage

Water Usage Calculator

def estimate_water_usage(
    total_compute_kwh: float,
    datacenter_wue: float = 1.8,  # Water Usage Effectiveness (liters per kWh)
) -> dict:
    """Estimate water consumption for compute."""
    water_liters = total_compute_kwh * datacenter_wue

    # Contextualize
    shower_liters = 60        # Average 8-minute shower
    car_wash_liters = 150     # Average commercial car wash
    pool_liters = 50000       # Average residential pool

    return {
        "water_liters": round(water_liters, 0),
        "water_gallons": round(water_liters * 0.264, 0),
        "equivalent_showers": round(water_liters / shower_liters),
        "equivalent_car_washes": round(water_liters / car_wash_liters),
        "equivalent_pools": round(water_liters / pool_liters, 1),
    }

# Training a 405B model
water = estimate_water_usage(30_000_000)  # 30 GWh in kWh
print(f"Water usage: {water['water_liters']:,.0f} liters ({water['water_gallons']:,.0f} gallons)")
print(f"Equivalent to {water['equivalent_showers']:,} showers")
print(f"Equivalent to {water['equivalent_car_washes']:,} car washes")

Sustainable AI Practices

1. Choose Efficient Models

def compare_sustainability_options(
    task_requirements: dict,
) -> dict:
    """Compare environmental impact of different model choices."""

    models = {
        "GPT-4.1": {"params": "unknown", "energy_per_req_kwh": 0.05, "capability_score": 95},
        "GPT-4.1 Mini": {"params": "unknown", "energy_per_req_kwh": 0.008, "capability_score": 88},
        "GPT-4.1 Nano": {"params": "unknown", "energy_per_req_kwh": 0.003, "capability_score": 80},
        "Claude Sonnet 4": {"params": "unknown", "energy_per_req_kwh": 0.04, "capability_score": 93},
        "Claude Haiku 3.5": {"params": "unknown", "energy_per_req_kwh": 0.01, "capability_score": 82},
        "Llama 3.1 70B": {"params": "70B", "energy_per_req_kwh": 0.015, "capability_score": 82},
        "Llama 3.1 405B": {"params": "405B", "energy_per_req_kwh": 0.08, "capability_score": 90},
        "Distilled 7B": {"params": "7B", "energy_per_req_kwh": 0.002, "capability_score": 65},
    }

    min_capability = task_requirements.get("min_capability_score", 70)

    results = []
    for name, info in models.items():
        if info["capability_score"] >= min_capability:
            annual_kwh = info["energy_per_req_kwh"] * task_requirements["daily_requests"] * 365
            annual_co2_kg = annual_kwh * 0.386  # US grid
            results.append({
                "model": name,
                "capability": info["capability_score"],
                "energy_per_request_wh": info["energy_per_req_kwh"] * 1000,
                "annual_energy_mwh": round(annual_kwh / 1000, 1),
                "annual_co2_tonnes": round(annual_co2_kg / 1000, 1),
            })

    return sorted(results, key=lambda x: x["annual_co2_tonnes"])

# Example: 100K requests/day, minimum capability 80
results = compare_sustainability_options({
    "daily_requests": 100_000,
    "min_capability_score": 80,
})

print(f"{'Model':<25} {'Capability':>12} {'CO2/yr (tonnes)':>18}")
print("-" * 57)
for r in results:
    print(f"{r['model']:<25} {r['capability']:>12} {r['annual_co2_tonnes']:>18.1f}")

2. Optimize Inference

inference_optimization_strategies:
  quantization:
    description: "Reduce precision from FP16 to INT8 or INT4"
    energy_savings: "30-60%"
    quality_impact: "Minimal for INT8, noticeable for INT4"
    tools: ["bitsandbytes", "AWQ", "GGUF", "TensorRT-LLM"]

  speculative_decoding:
    description: "Use small model to draft, large model to verify"
    energy_savings: "30-50% (fewer forward passes in large model)"
    quality_impact: "None (same output)"
    tools: ["vLLM speculative decoding", "Medusa", "EAGLE"]

  caching:
    description: "Cache responses for repeated/similar queries"
    energy_savings: "20-60% (depends on query repeat rate)"
    quality_impact: "None for exact matches, requires similarity threshold for fuzzy"
    tools: ["semantic cache", "Redis embedding cache"]

  batch_sizing:
    description: "Optimize batch size for GPU utilization"
    energy_savings: "10-30% better utilization"
    quality_impact: "None"
    tools: ["vLLM continuous batching", "TGI dynamic batching"]

  early_exit:
    description: "Exit transformer layers early for easy queries"
    energy_savings: "20-40% for simple queries"
    quality_impact: "Quality depends on exit threshold tuning"
    tools: ["Custom layer-exit logic", "PAB (Patience-Based)"]

  model_cascading:
    description: "Route easy queries to small models, hard to large"
    energy_savings: "40-70% (if most queries are easy)"
    quality_impact: "Maintains quality if routing is accurate"
    tools: ["LLM router", "confidence-based routing"]

3. Carbon-Aware Scheduling

import requests
from datetime import datetime, timedelta

def get_carbon_intensity(region: str = "US-CAISO") -> dict:
    """Get current and forecasted grid carbon intensity."""
    # Using Electricity Maps API or similar
    response = requests.get(
        f"https://api.electricitymap.org/v3/carbon-intensity/latest",
        params={"zone": region},
        headers={"auth-token": "YOUR_TOKEN"},
    )
    data = response.json()
    return {
        "carbon_intensity_gco2_per_kwh": data["carbonIntensity"],
        "fossil_fuel_percentage": data.get("fossilFuelPercentage"),
        "timestamp": data["datetime"],
    }

def schedule_training_optimal(
    training_hours_needed: int,
    region: str = "US-CAISO",
    forecast_hours: int = 72,
) -> dict:
    """Find the most carbon-efficient window for training."""
    # Get forecast
    forecast = requests.get(
        f"https://api.electricitymap.org/v3/carbon-intensity/forecast",
        params={"zone": region},
        headers={"auth-token": "YOUR_TOKEN"},
    )
    data = forecast.json()

    # Find window with lowest average carbon intensity
    best_start = None
    best_avg = float("inf")

    for i in range(len(data["forecast"]) - training_hours_needed):
        window = data["forecast"][i:i + training_hours_needed]
        avg_intensity = sum(w["carbonIntensity"] for w in window) / len(window)

        if avg_intensity < best_avg:
            best_avg = avg_intensity
            best_start = i

    start_time = datetime.fromisoformat(data["forecast"][best_start]["datetime"])
    end_time = start_time + timedelta(hours=training_hours_needed)

    return {
        "optimal_start": start_time.isoformat(),
        "optimal_end": end_time.isoformat(),
        "avg_carbon_intensity": round(best_avg, 1),
        "carbon_savings_vs_worst_pct": round(
            (1 - best_avg / max(sum(w["carbonIntensity"] for w in data["forecast"][j:j+training_hours_needed]) / training_hours_needed
                                for j in range(len(data["forecast"]) - training_hours_needed))) * 100,
            1
        ),
    }

4. Green Infrastructure Choices

ProviderRenewable EnergyCarbon NeutralWater-Efficient CoolingNotes
Google Cloud100% renewable (matched)YesYes (some regions)24/7 carbon-free energy by 2030 goal
AWSGrowing renewablesPartiallyImprovingClimate Pledge: net-zero by 2040
Azure100% renewable (matched)Yes (by 2030 target)Yes (some regions)Carbon negative by 2030
CoreWeavePartialNoLimitedGPU-specialized, growing green commitment
Lambda LabsLimitedNoLimitedGPU cloud, less transparency
On-prem (renewable-powered)Depends on your energy sourceDependsDependsMaximum control, requires investment

5. Report and Track Your Impact

# sustainability-report.yaml
reporting_framework:
  metrics_to_track:
    training:
      - "Total GPU hours consumed"
      - "Total energy (kWh/GWh)"
      - "Carbon footprint (tonnes CO2e) by grid region"
      - "Water consumption (liters)"
      - "Number of failed experiments (avoidable waste)"
    inference:
      - "Energy per request (kWh)"
      - "Energy per token (kWh)"
      - "Monthly/annual total energy"
      - "Carbon footprint per 1K requests"
      - "Cache hit rate (energy saved)"
    optimization:
      - "Quantization level and energy saved"
      - "Speculative decoding speedup"
      - "Model cascading energy savings"
      - "Percentage of requests served by efficient models"

  reporting_cadence:
    monthly: "Internal dashboard review"
    quarterly: "Leadership summary"
    annually: "Public sustainability report"

  targets:
    - "Reduce energy per request by 50% year-over-year"
    - "Route 80% of queries through efficient model cascade"
    - "Achieve 95%+ carbon-free energy for training by 2027"
    - "Publish annual environmental impact assessment"

Cross-References

Sustainability Checklist

  • Measure and track energy consumption per inference request
  • Choose the smallest model that meets your quality requirements
  • Implement response caching for repeated queries
  • Use quantization (INT8) to reduce inference energy 30-50%
  • Deploy speculative decoding where possible
  • Implement model cascading (small model first, escalate only when needed)
  • Schedule training during low-carbon-intensity windows
  • Prefer cloud providers with strong renewable energy commitments
  • Track water consumption alongside energy
  • Publish an annual sustainability report for your AI systems
  • Set year-over-year reduction targets for energy per request
  • Consider the full lifecycle (training + inference) when comparing models