Orphaned MLflow Training Artifacts and Model Checkpoints in Object Storage
Annapurna Mungara
CER
CER-0330

Service Category
AI
Cloud Provider
AWS
Service Name
AWS S3
Inefficiency Type
Unused Resource
Explanation

Machine learning experimentation workflows — particularly those managed through experiment tracking platforms — generate large volumes of artifacts in object storage. Every training run produces model checkpoints, evaluation outputs, feature snapshots, and tensor logs. Hyperparameter tuning and AutoML workflows amplify this by creating hundreds or thousands of individual runs, each writing its own set of artifacts to locations in S3. When experiments are abandoned, models are never promoted to production, or team members depart, these artifacts remain in storage indefinitely because there is no native lifecycle management for ML experiment artifacts — cleanup must be implemented manually.

The cost impact is driven entirely by object storage capacity charges, which accumulate per GB-month regardless of whether the artifacts are referenced, the experiments are active, or the models are registered. Critically, even when experiment metadata is deleted through the tracking platform, the underlying artifacts in object storage are not automatically purged — they must be removed separately. For organizations training large models, checkpoint files alone can reach hundreds of gigabytes each, and production training pipelines may checkpoint every few hours. Without retention policies, it is common for ML artifact storage costs to grow unchecked and eventually rival or exceed compute costs.

Relevant Billing Model

The billing waste from orphaned ML artifacts occurs in the underlying cloud provider's object storage, not in the ML platform itself. Storage costs are billed separately by the cloud provider. The key cost dimensions are:

  • Storage capacity charges — billed per GB-month for as long as objects remain in storage. On AWS, S3 Standard storage is priced at $0.023 per GB-month for the first 50 TB. Comparable rates apply on other providers.
  • Request and retrieval charges — incurred when artifacts are written or read, but the dominant ongoing cost for orphaned artifacts is the at-rest storage charge.
  • No automatic cleanup — when ML experiment runs are deleted, artifacts in object storage persist and continue to incur charges until explicitly removed.

Because each training run, checkpoint, and experiment version writes separate objects, artifact counts and total storage volume grow rapidly during active experimentation and persist indefinitely unless governed by retention policies.
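A back-of-envelope estimate shows how quickly the at-rest charge compounds. The checkpoint size, cadence, and retention period below are hypothetical figures for illustration; the $0.023 per GB-month rate is S3 Standard pricing for the first 50 TB, as noted above.

```python
# Back-of-envelope estimate of the at-rest cost of orphaned checkpoints.
# Checkpoint size, cadence, and retention below are hypothetical;
# $0.023/GB-month is the S3 Standard rate for the first 50 TB.

S3_STANDARD_PER_GB_MONTH = 0.023

def monthly_storage_cost(total_gb: float, rate: float = S3_STANDARD_PER_GB_MONTH) -> float:
    """Monthly at-rest S3 charge for a stored volume, ignoring request charges."""
    return total_gb * rate

# Example: a pipeline checkpointing every 6 hours at 50 GB per checkpoint
# accumulates (24 / 6) * 50 = 200 GB/day; after 90 days of unmanaged retention:
accumulated_gb = (24 / 6) * 50 * 90          # 18,000 GB
cost = monthly_storage_cost(accumulated_gb)  # $414.00/month, and still growing
print(f"{accumulated_gb:.0f} GB -> ${cost:.2f}/month")
```

Note that this is the charge for a single pipeline's checkpoints; hyperparameter sweeps multiply it by the number of parallel runs.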

Detection
  • Identify experiment runs in the ML tracking system that have not been accessed or modified beyond a defined retention period.
  • Review artifact storage locations to determine which paths are not linked to any registered or production-deployed model.
  • Assess the total storage footprint of ML artifact directories and compare it against the set of active, referenced experiments.
  • Evaluate whether any artifact retention or lifecycle policies are currently in place for ML experiment storage locations.
  • Identify artifact storage associated with failed, cancelled, or otherwise incomplete training runs.
  • Review storage locations for intermediate model checkpoints that are no longer needed after a final model has been saved.
  • Confirm whether experiments from former team members or decommissioned projects still have artifacts persisting in object storage.
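The first checks above can be sketched as a reconciliation pass. Run records here are plain dicts for illustration; in practice they would come from `mlflow.tracking.MlflowClient().search_runs()`, whose run info exposes `end_time` (epoch milliseconds), `status`, and `artifact_uri`. The registry set and 90-day retention window are assumptions.

```python
# Sketch: flag runs that are stale or failed and back no registered model.
# Run dicts stand in for MLflow RunInfo objects; the retention window and
# registered_run_ids set are assumptions for illustration.
import time

RETENTION_DAYS = 90

def find_orphaned_runs(runs, registered_run_ids, now_ms=None,
                       retention_days=RETENTION_DAYS):
    """Return IDs of runs outside the retention window (or failed/killed)
    whose run_id is not linked to any registered model."""
    now_ms = now_ms or int(time.time() * 1000)
    cutoff_ms = now_ms - retention_days * 24 * 3600 * 1000
    orphans = []
    for run in runs:
        ended = run.get("end_time")  # None if the run never finished
        stale = ended is not None and ended < cutoff_ms
        failed = run.get("status") in ("FAILED", "KILLED")
        unreferenced = run["run_id"] not in registered_run_ids
        if (stale or failed) and unreferenced:
            orphans.append(run["run_id"])
    return orphans
```

Each flagged run's `artifact_uri` points at an S3 prefix; its footprint can then be summed with a boto3 `list_objects_v2` paginator to size the cleanup opportunity before anything is deleted.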
Remediation
  • Delete artifacts associated with failed, abandoned, or unreferenced experiment runs, ensuring that both the tracking metadata and the underlying objects in storage are removed.
  • Remove intermediate model checkpoints that are no longer needed — retain only the final model artifacts for runs that produced registered or deployed models.
  • Implement artifact retention policies that automatically flag or archive experiment artifacts older than a defined retention window.
  • Transition infrequently accessed but still-needed artifacts to lower-cost storage tiers (such as infrequent access or archive classes) to reduce ongoing per-GB charges.
  • Establish a recurring audit process to reconcile ML artifact storage against the model registry, identifying and cleaning up orphaned storage paths on a regular cadence.
  • Define clear ownership and lifecycle governance for ML experiments, including cleanup responsibilities when team members transition or projects are decommissioned.
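The retention and tiering steps above can be encoded as an S3 lifecycle rule. This is a minimal sketch: the prefix and day thresholds are assumptions, and applying the rule would use boto3's `put_bucket_lifecycle_configuration` call, shown in a comment but not executed here.

```python
# Sketch of a lifecycle rule implementing the retention policy above.
# Prefix, bucket name, and day thresholds are hypothetical values.

def artifact_lifecycle_rule(prefix, ia_after_days=30, expire_after_days=180):
    """Build an S3 lifecycle rule that tiers experiment artifacts to
    Standard-IA, then expires them after the retention window."""
    return {
        "ID": f"mlflow-artifacts-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
        ],
        "Expiration": {"Days": expire_after_days},
    }

rule = artifact_lifecycle_rule("mlflow/artifacts/")
# To apply (not executed in this sketch):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-ml-bucket",  # hypothetical bucket name
#     LifecycleConfiguration={"Rules": [rule]},
# )
```

Because lifecycle rules apply by prefix, final artifacts for registered or deployed models should be copied outside the covered prefix before the expiration rule is enabled, or they will be deleted along with the experiment debris.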