
Write a PREreview

A Time-Resolved, SLO-Aware and Bi-Objective Framework to Measure and Minimize LLM Serving’s Carbon and Water Footprints

Published
Server
Preprints.org
DOI
10.20944/preprints202510.0957.v1

Studies of the environmental footprint of large language model (LLM) inference often disagree because they mix incompatible system boundaries, ignore latency and throughput service level objectives (SLOs), and optimize carbon without accounting for water. We present a provider-agnostic framework that unifies scope-transparent measurement with time-resolved, bi-objective orchestration under realistic SLOs. Measurement follows production practice and reports daily medians at a comprehensive serving boundary that includes active accelerators, host CPU/DRAM, provisioned idle, and facility overhead via PUE. Consumptive water is computed as site plus source. Carbon is location-based (LB) by default with a market-based (MB) sensitivity. Optimization is cast as a mixed‑integer linear program, solved over 288 five‑minute windows per day. For each prompt profile, the solver selects region, batch size, and phase‑aware hardware for prefill and decode while enforcing p95 Time To First Token/Time Per Output Token (TTFT/TPOT) and capacity constraints. Because grid carbon intensity (CIF) and electricity water intensity (EWIF) are only weakly correlated, the policy is dual‑objective by design and balances carbon and water explicitly. Applied to four representative models using public per‑prompt energy tables and per‑region multipliers, a single SLO‑aware policy reduces comprehensive‑boundary medians by 57-59% for energy, 59-60% for consumptive water, and 78-80% for LB CO₂, with SLOs met in every window. For a day with 500M queries on GPT‑4o, median‑scaled totals drop from 0.344 to 0.145 GWh, 1.196 to 0.490 ML, and 121 to 25 tCO₂ (LB). The framework also reproduces the production‑observed accelerator‑only versus comprehensive gap (narrow/comprehensive approximately 0.417), enabling direct translation across studies.
Pareto analyses show when routing alone and when joint routing, batching, and token‑length controls deliver concurrent reductions in carbon and water at fixed quality of service. The combination of time‑resolved control, comprehensive accounting, and dual‑objective optimization yields a deployable template for decarbonization and water stewardship in LLM serving.
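As a quick sanity check for reviewers, the GPT‑4o day totals quoted in the abstract can be converted into percentage reductions and per‑query figures. The sketch below uses only the numbers stated above. The variable and function names are illustrative and do not come from the preprint, and the per‑query values are simple averages of the median‑scaled daily totals, not figures reported by the authors.

```python
# Median-scaled daily totals for GPT-4o as quoted in the abstract
# (500M queries per day): (baseline, optimized).
QUERIES = 500_000_000

totals = {
    "energy_GWh": (0.344, 0.145),
    "water_ML": (1.196, 0.490),
    "co2_t_LB": (121.0, 25.0),
}


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_pct(b, a) for name, (b, a) in totals.items()}

# Unit conversions: 1 GWh = 1e9 Wh, 1 ML = 1e9 mL, 1 t = 1e6 g.
per_query_baseline = {
    "energy_Wh": totals["energy_GWh"][0] * 1e9 / QUERIES,
    "water_mL": totals["water_ML"][0] * 1e9 / QUERIES,
    "co2_g_LB": totals["co2_t_LB"][0] * 1e6 / QUERIES,
}

if __name__ == "__main__":
    for name, pct in reductions.items():
        print(f"{name}: {pct:.1f}% reduction")
    for name, value in per_query_baseline.items():
        print(f"baseline {name} per query: {value:.3f}")
```

The computed reductions (roughly 58% for energy, 59% for water, and 79% for LB CO₂) fall inside the ranges the abstract reports across the four models, so the headline example is internally consistent.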

You can write a PREreview of A Time-Resolved, SLO-Aware and Bi-Objective Framework to Measure and Minimize LLM Serving’s Carbon and Water Footprints. A PREreview is a review of a preprint and can range from a few sentences to an extensive report, similar to a peer-review report organized by a journal.

Before you start

We will ask you to log in with your ORCID iD. If you don't have an iD, you can create one.

What is an ORCID iD?

An ORCID iD is a unique identifier that distinguishes you from others with the same or a similar name.
