In the world of distributed systems, reliability is paramount. etcd, a widely used key-value store often critical to infrastructure, has made strides in enhancing this aspect. While etcd's reliability has been robust thanks to the Raft consensus protocol, the same couldn't be said for upgrades/downgrades – until now.
The Challenge of etcd Downgrades
Historically, downgrading etcd has been a complex and unsupported process. There is no way to safely downgrade etcd data after it was touched by a newer version. Upgrades, while reliable, weren't easily reversible, often requiring external tools and backups. This lack of flexibility posed a significant challenge for users who encountered issues after upgrading.
Enter etcd 3.6: A New Era of Downgrade Support
etcd 3.6 introduces a groundbreaking solution: built-in downgrade support. This innovation not only simplifies the upgrade and downgrade processes but also significantly enhances etcd's reliability.
How Does It Work?
- Storage Versioning: A new storage version (SV) is persisted within the etcd data file. This version indicates compatibility, ensuring safe upgrades and downgrades.
- Schema Evolution: A comprehensive schema tracks all fields in the data file and acts as a source of truth about which version a particular was introduced in, allowing etcd to understand and manipulate data across versions.
- etcdutl migrate: A dedicated command-line tool, etcdutl migrate, streamlines skip-level upgrade and downgrade process, eliminating the need for complex manual steps.
Benefits for Users
The introduction of downgrade support in etcd 3.6 offers a range of benefits for users:
- Improved Reliability: Upgrades can be safely reverted, reducing the risk of data loss or operational disruption.
- Simplified Management: The upgrade and downgrade processes are streamlined, reducing the complexity of managing etcd clusters.
- Increased Flexibility: Users have greater flexibility in managing their etcd environments, allowing them to experiment with new versions and roll back if necessary.
Under the Hood: Technical Details
To achieve downgrade support, etcd 3.6 implements a strict storage versioning policy. This means that etcd data is versioned, etcd will no longer be allowed to load data generated by version higher than its own, and must rely on cluster downgrade process instead. This ensures that all the DB and WAL files would not have any information that could be incorrectly interpreted.
During the downgrade process, new fields from the higher version in DB files will be cleaned up. The etcd protocol version will be lowered to allow older versions to join. All new features, rpcs and fields would not be used thus preventing older members from interpreting replicated logs differently. This also means that entries added to the Wal log file should be compatible with lower versions. When a wal snapshot happens, all older incompatible entries should be applied, so they no longer need to be read and the storage version can be downgraded.
The etcdutl migrate command tool is added to simplify etcd data upgrade and downgrade process on 2+ minor version upgrades/downgrades scenarios, by validating the WAL log compatibility with the target version, and executing any necessary schema changes to the DB file and updating the storage version.
Implementation Milestones
The rollout of downgrade support is planned in three milestones:
- Snapshot Storage Versions: Storage versioning is implemented for snapshots.
- Version Annotations: etcd code is annotated with versions, and a schema is created for the data file.
- Full Downgrade Support: Downgrades can be fully implemented using the established storage versioning and schema.
We are currently working on finishing the third milestone.
Looking Ahead
etcd 3.6 marks a significant step forward in the reliability and manageability of etcd clusters. The introduction of downgrade support empowers users with greater flexibility and control over their etcd environments. As etcd continues to evolve, we can expect further enhancements to the upgrade and downgrade processes, further solidifying its position as a critical component in modern distributed systems.
By Siyuan Zhang – Software Engineer