Advanced Technical
Energy & Environmental Impact of LLMs
The environmental cost of LLMs — training energy, inference energy, carbon footprint, water usage, and sustainable AI practices
Published: 2026-04-25 · Last updated: 2026-04-25
Energy & Environmental Impact of LLMs
The environmental cost of Large Language Models is substantial and growing. Training a single frontier model can consume as much energy as hundreds of homes use in a year, and the cumulative inference cost of serving billions of requests daily is orders of magnitude larger. This guide provides a comprehensive analysis of the environmental impact of LLMs and practical strategies for reducing it.
The Scale of the Problem
Energy Consumption at a Glance
| Activity | Estimated Energy | Equivalent | CO2 Emissions |
|---|---|---|---|
| Training GPT-4-class model | ~50 GWh | 5,000 US homes for 1 year | ~20,000 tonnes CO2e |
| Training Llama 3.1 405B | ~30 GWh | 3,000 US homes for 1 year | ~12,000 tonnes CO2e |
| Single GPT-4 response (avg) | ~0.003 kWh | Charging a phone once | ~1.5 g CO2e |
| 1M GPT-4 responses | ~3,000 kWh | 100 US homes for 1 day | ~1,500 kg CO2e |
| 1B GPT-4 responses | ~3 GWh | Small town for 1 day | ~1,500 tonnes CO2e |
| Daily global LLM inference (est.) | ~500 GWh | Denmark's daily electricity | ~200,000 tonnes CO2e |
| Annual global LLM inference (est.) | ~180 TWh | Argentina's annual electricity | ~72 million tonnes CO2e |
Estimates based on published research and provider disclosures. Actual figures vary and are often not publicly disclosed.
The Inference Problem
While training gets the most attention, inference dominates total environmental impact for widely-used models:
Total lifecycle energy of a deployed LLM:
Training: ████████████████████████████████████████ 30 GWh (5%)
Inference (1yr): ████████████████████████████████████████████████████████████████████████████████████ 570 GWh (95%)
Key insight: A model used by millions of people daily has an inference
footprint that dwarfs its training cost within weeks.
Training Energy Breakdown
Where Does Training Energy Go?
def estimate_training_energy(
model_parameters: int, # e.g., 405_000_000_000
training_tokens: int, # e.g., 15_000_000_000_000
gpu_type: str = "H100",
gpu_count: int = 24000,
training_days: int = 54,
pue: float = 1.15, # Power Usage Effectiveness of datacenter
) -> dict:
"""Estimate the energy consumption of training an LLM."""
# GPU power consumption
gpu_tdp = {
"H100": 700, # Watts (SXM5)
"A100": 400, # Watts (SXM4)
"A100_80G": 500,
"L40S": 350,
"B200": 1000,
}
gpu_watts = gpu_tdp.get(gpu_type, 700) * gpu_count
# Total compute energy (GPU + overhead)
training_hours = training_days * 24
compute_kwh = (gpu_watts / 1000) * training_hours * pue
compute_mwh = compute_kwh / 1000
compute_gwh = compute_mwh / 1000
# Additional energy costs
data_prep_energy = compute_gwh * 0.02 # Data processing and preparation
experiment_energy = compute_gwh * 0.10 # Failed runs, hyperparameter search
evaluation_energy = compute_gwh * 0.03 # Evaluation and benchmarking
total_gwh = compute_gwh + data_prep_energy + experiment_energy + evaluation_energy
# Carbon footprint (varies by grid)
carbon_intensity = {
"us_average": 0.386, # kg CO2e per kWh
"eu_average": 0.230,
"norway": 0.023,
"france": 0.052,
"china": 0.555,
"india": 0.708,
}
carbon_footprint = {
region: round(total_gwh * 1_000_000 * intensity / 1000)
for region, intensity in carbon_intensity.items()
}
return {
"compute_energy_gwh": round(compute_gwh, 1),
"data_prep_energy_gwh": round(data_prep_energy, 1),
"experiment_energy_gwh": round(experiment_energy, 1),
"evaluation_energy_gwh": round(evaluation_energy, 1),
"total_energy_gwh": round(total_gwh, 1),
"carbon_footprint_tonnes": carbon_footprint,
"flops_estimate": round(model_parameters * training_tokens * 6 / 1e24, 2), # ZettaFLOPs
"efficiency_flops_per_watt_hour": round(model_parameters * training_tokens * 6 / (compute_kwh * 3600), 1),
}
# Estimate for a 405B model
energy = estimate_training_energy(
model_parameters=405_000_000_000,
training_tokens=15_000_000_000_000,
gpu_type="H100",
gpu_count=24000,
training_days=54,
)
print(f"Total training energy: {energy['total_energy_gwh']} GWh")
print(f"Carbon footprint (US grid): {energy['carbon_footprint_tonnes']['us_average']} tonnes CO2e")
print(f"Carbon footprint (Norway): {energy['carbon_footprint_tonnes']['norway']} tonnes CO2e")
# Total training energy: ~32.5 GWh
# Carbon footprint (US grid): ~12,500 tonnes CO2e
# Carbon footprint (Norway): ~750 tonnes CO2e
Training Energy by Model Size
| Model Size | Approx. GPU Hours | Energy (GWh) | CO2e (US grid, tonnes) | CO2e (EU grid, tonnes) |
|---|---|---|---|---|
| 7B | ~30,000 | ~0.02 | ~8 | ~5 |
| 13B | ~100,000 | ~0.07 | ~27 | ~16 |
| 70B | ~1,000,000 | ~0.7 | ~270 | ~161 |
| 405B | ~30,000,000 | ~20-30 | ~8,000-12,000 | ~4,600-7,000 |
| 1T+ (est.) | ~100,000,000+ | ~70-100 | ~27,000-39,000 | ~16,000-23,000 |
Inference Energy
Per-Request Energy Cost
def estimate_inference_energy_per_request(
model_size_params: int,
input_tokens: int,
output_tokens: int,
gpu_type: str = "H100",
gpu_utilization: float = 0.6,
requests_per_gpu_per_second: float = 10,
pue: float = 1.15,
) -> dict:
"""Estimate energy per inference request."""
gpu_tdp_watts = {"H100": 700, "A100": 400, "L40S": 350, "B200": 1000}
gpu_watts = gpu_tdp_watts.get(gpu_type, 700)
# Energy per second per GPU (including overhead)
energy_per_second = gpu_watts * pue # Joules/second = Watts
# Requests processed per second
requests_per_second = requests_per_gpu_per_second
# Energy per request
energy_per_request_joules = energy_per_second / requests_per_second
energy_per_request_kwh = energy_per_request_joules / 3_600_000
total_tokens = input_tokens + output_tokens
return {
"energy_per_request_wh": round(energy_per_request_joules / 3600, 4),
"energy_per_request_kwh": round(energy_per_request_kwh, 6),
"energy_per_token_wh": round(energy_per_request_joules / (3600 * total_tokens), 6),
"co2_per_request_g": round(energy_per_request_kwh * 386, 4), # US grid
"equivalent": {
"phone_charging_seconds": round(energy_per_request_joules / 5 * 60, 1), # 5W phone charger
"led_bulb_seconds": round(energy_per_request_joules / 10 * 60, 1), # 10W LED bulb
}
}
# Compare different models
scenarios = [
("Llama 70B (H100, batched)", 70_000_000_000, 500, 200, "H100", 0.6, 25),
("Llama 405B (H100, batched)", 405_000_000_000, 500, 200, "H100", 0.6, 5),
("GPT-4.1 Mini (shared infra)", None, 500, 200, "H100", 0.6, 100),
("GPT-4.1 (shared infra)", None, 500, 200, "H100", 0.6, 15),
]
print(f"{'Scenario':<40} {'Energy/Req (Wh)':>16} {'CO2/Req (g)':>14}")
print("-" * 72)
for name, params, inp, out, gpu, util, rps in scenarios:
result = estimate_inference_energy_per_request(params or 100_000_000_000, inp, out, gpu, util, rps)
print(f"{name:<40} {result['energy_per_request_wh']:>16.4f} {result['co2_per_request_g']:>14.4f}")
Annual Inference Energy at Scale
| Daily Requests | Small Model (7B) | Medium (70B) | Large (405B) | GPT-4-class |
|---|---|---|---|---|
| 10,000 | ~10 kWh/day | ~50 kWh/day | ~200 kWh/day | ~300 kWh/day |
| 100,000 | ~100 kWh/day | ~500 kWh/day | ~2,000 kWh/day | ~3,000 kWh/day |
| 1,000,000 | ~1 MWh/day | ~5 MWh/day | ~20 MWh/day | ~30 MWh/day |
| 10,000,000 | ~10 MWh/day | ~50 MWh/day | ~200 MWh/day | ~300 MWh/day |
| 100,000,000 | ~100 MWh/day | ~500 MWh/day | ~2 GWh/day | ~3 GWh/day |
Water Usage
Datacenters consume vast amounts of water for cooling. This is often overlooked in environmental discussions.
Water Consumption Estimates
| Activity | Water Usage | Notes |
|---|---|---|
| Training a 405B model | ~300,000-500,000 liters | Equivalent to 200-350 car washes |
| Training a 70B model | ~10,000-20,000 liters | ~10-15 car washes |
| 1M inference requests (70B) | ~500-1,000 liters | Depends on datacenter cooling efficiency |
| 1M inference requests (405B) | ~2,000-5,000 liters | Larger models = more heat = more water |
| Annual inference (100M req/day) | ~180-500 million liters | Small city's annual water usage |
Water Usage Calculator
def estimate_water_usage(
total_compute_kwh: float,
datacenter_wue: float = 1.8, # Water Usage Effectiveness (liters per kWh)
) -> dict:
"""Estimate water consumption for compute."""
water_liters = total_compute_kwh * datacenter_wue
# Contextualize
shower_liters = 60 # Average 8-minute shower
car_wash_liters = 150 # Average commercial car wash
pool_liters = 50000 # Average residential pool
return {
"water_liters": round(water_liters, 0),
"water_gallons": round(water_liters * 0.264, 0),
"equivalent_showers": round(water_liters / shower_liters),
"equivalent_car_washes": round(water_liters / car_wash_liters),
"equivalent_pools": round(water_liters / pool_liters, 1),
}
# Training a 405B model
water = estimate_water_usage(30_000_000) # 30 GWh in kWh
print(f"Water usage: {water['water_liters']:,.0f} liters ({water['water_gallons']:,.0f} gallons)")
print(f"Equivalent to {water['equivalent_showers']:,} showers")
print(f"Equivalent to {water['equivalent_car_washes']:,} car washes")
Sustainable AI Practices
1. Choose Efficient Models
def compare_sustainability_options(
task_requirements: dict,
) -> dict:
"""Compare environmental impact of different model choices."""
models = {
"GPT-4.1": {"params": "unknown", "energy_per_req_kwh": 0.05, "capability_score": 95},
"GPT-4.1 Mini": {"params": "unknown", "energy_per_req_kwh": 0.008, "capability_score": 88},
"GPT-4.1 Nano": {"params": "unknown", "energy_per_req_kwh": 0.003, "capability_score": 80},
"Claude Sonnet 4": {"params": "unknown", "energy_per_req_kwh": 0.04, "capability_score": 93},
"Claude Haiku 3.5": {"params": "unknown", "energy_per_req_kwh": 0.01, "capability_score": 82},
"Llama 3.1 70B": {"params": "70B", "energy_per_req_kwh": 0.015, "capability_score": 82},
"Llama 3.1 405B": {"params": "405B", "energy_per_req_kwh": 0.08, "capability_score": 90},
"Distilled 7B": {"params": "7B", "energy_per_req_kwh": 0.002, "capability_score": 65},
}
min_capability = task_requirements.get("min_capability_score", 70)
results = []
for name, info in models.items():
if info["capability_score"] >= min_capability:
annual_kwh = info["energy_per_req_kwh"] * task_requirements["daily_requests"] * 365
annual_co2_kg = annual_kwh * 0.386 # US grid
results.append({
"model": name,
"capability": info["capability_score"],
"energy_per_request_wh": info["energy_per_req_kwh"] * 1000,
"annual_energy_mwh": round(annual_kwh / 1000, 1),
"annual_co2_tonnes": round(annual_co2_kg / 1000, 1),
})
return sorted(results, key=lambda x: x["annual_co2_tonnes"])
# Example: 100K requests/day, minimum capability 80
results = compare_sustainability_options({
"daily_requests": 100_000,
"min_capability_score": 80,
})
print(f"{'Model':<25} {'Capability':>12} {'CO2/yr (tonnes)':>18}")
print("-" * 57)
for r in results:
print(f"{r['model']:<25} {r['capability']:>12} {r['annual_co2_tonnes']:>18.1f}")
2. Optimize Inference
inference_optimization_strategies:
quantization:
description: "Reduce precision from FP16 to INT8 or INT4"
energy_savings: "30-60%"
quality_impact: "Minimal for INT8, noticeable for INT4"
tools: ["bitsandbytes", "AWQ", "GGUF", "TensorRT-LLM"]
speculative_decoding:
description: "Use small model to draft, large model to verify"
energy_savings: "30-50% (fewer forward passes in large model)"
quality_impact: "None (same output)"
tools: ["vLLM speculative decoding", "Medusa", "EAGLE"]
caching:
description: "Cache responses for repeated/similar queries"
energy_savings: "20-60% (depends on query repeat rate)"
quality_impact: "None for exact matches, requires similarity threshold for fuzzy"
tools: ["semantic cache", "Redis embedding cache"]
batch_sizing:
description: "Optimize batch size for GPU utilization"
energy_savings: "10-30% better utilization"
quality_impact: "None"
tools: ["vLLM continuous batching", "TGI dynamic batching"]
early_exit:
description: "Exit transformer layers early for easy queries"
energy_savings: "20-40% for simple queries"
quality_impact: "Quality depends on exit threshold tuning"
tools: ["Custom layer-exit logic", "PAB (Patience-Based)"]
model_cascading:
description: "Route easy queries to small models, hard to large"
energy_savings: "40-70% (if most queries are easy)"
quality_impact: "Maintains quality if routing is accurate"
tools: ["LLM router", "confidence-based routing"]
3. Carbon-Aware Scheduling
import requests
from datetime import datetime, timedelta
def get_carbon_intensity(region: str = "US-CAISO") -> dict:
"""Get current and forecasted grid carbon intensity."""
# Using Electricity Maps API or similar
response = requests.get(
f"https://api.electricitymap.org/v3/carbon-intensity/latest",
params={"zone": region},
headers={"auth-token": "YOUR_TOKEN"},
)
data = response.json()
return {
"carbon_intensity_gco2_per_kwh": data["carbonIntensity"],
"fossil_fuel_percentage": data.get("fossilFuelPercentage"),
"timestamp": data["datetime"],
}
def schedule_training_optimal(
training_hours_needed: int,
region: str = "US-CAISO",
forecast_hours: int = 72,
) -> dict:
"""Find the most carbon-efficient window for training."""
# Get forecast
forecast = requests.get(
f"https://api.electricitymap.org/v3/carbon-intensity/forecast",
params={"zone": region},
headers={"auth-token": "YOUR_TOKEN"},
)
data = forecast.json()
# Find window with lowest average carbon intensity
best_start = None
best_avg = float("inf")
for i in range(len(data["forecast"]) - training_hours_needed):
window = data["forecast"][i:i + training_hours_needed]
avg_intensity = sum(w["carbonIntensity"] for w in window) / len(window)
if avg_intensity < best_avg:
best_avg = avg_intensity
best_start = i
start_time = datetime.fromisoformat(data["forecast"][best_start]["datetime"])
end_time = start_time + timedelta(hours=training_hours_needed)
return {
"optimal_start": start_time.isoformat(),
"optimal_end": end_time.isoformat(),
"avg_carbon_intensity": round(best_avg, 1),
"carbon_savings_vs_worst_pct": round(
(1 - best_avg / max(sum(w["carbonIntensity"] for w in data["forecast"][j:j+training_hours_needed]) / training_hours_needed
for j in range(len(data["forecast"]) - training_hours_needed))) * 100,
1
),
}
4. Green Infrastructure Choices
| Provider | Renewable Energy | Carbon Neutral | Water-Efficient Cooling | Notes |
|---|---|---|---|---|
| Google Cloud | 100% renewable (matched) | Yes | Yes (some regions) | 24/7 carbon-free energy by 2030 goal |
| AWS | Growing renewables | Partially | Improving | Climate Pledge: net-zero by 2040 |
| Azure | 100% renewable (matched) | Yes (by 2030 target) | Yes (some regions) | Carbon negative by 2030 |
| CoreWeave | Partial | No | Limited | GPU-specialized, growing green commitment |
| Lambda Labs | Limited | No | Limited | GPU cloud, less transparency |
| On-prem (renewable-powered) | Depends on your energy source | Depends | Depends | Maximum control, requires investment |
5. Report and Track Your Impact
# sustainability-report.yaml
reporting_framework:
metrics_to_track:
training:
- "Total GPU hours consumed"
- "Total energy (kWh/GWh)"
- "Carbon footprint (tonnes CO2e) by grid region"
- "Water consumption (liters)"
- "Number of failed experiments (avoidable waste)"
inference:
- "Energy per request (kWh)"
- "Energy per token (kWh)"
- "Monthly/annual total energy"
- "Carbon footprint per 1K requests"
- "Cache hit rate (energy saved)"
optimization:
- "Quantization level and energy saved"
- "Speculative decoding speedup"
- "Model cascading energy savings"
- "Percentage of requests served by efficient models"
reporting_cadence:
monthly: "Internal dashboard review"
quarterly: "Leadership summary"
annually: "Public sustainability report"
targets:
- "Reduce energy per request by 50% year-over-year"
- "Route 80% of queries through efficient model cascade"
- "Achieve 95%+ carbon-free energy for training by 2027"
- "Publish annual environmental impact assessment"
Cross-References
- Cost Management & Optimization — Energy savings often align with cost savings
- Inference Optimization & Quantization — Technical methods to reduce inference energy
- Speculative Decoding — Energy-efficient decoding technique
- Knowledge Distillation — Create smaller, more efficient models
- Model Comparison Guide — Factor efficiency into model selection
Sustainability Checklist
- Measure and track energy consumption per inference request
- Choose the smallest model that meets your quality requirements
- Implement response caching for repeated queries
- Use quantization (INT8) to reduce inference energy 30-50%
- Deploy speculative decoding where possible
- Implement model cascading (small model first, escalate only when needed)
- Schedule training during low-carbon-intensity windows
- Prefer cloud providers with strong renewable energy commitments
- Track water consumption alongside energy
- Publish an annual sustainability report for your AI systems
- Set year-over-year reduction targets for energy per request
- Consider the full lifecycle (training + inference) when comparing models