Explanation
During the enactment of the Centrifuge Runtime upgrade 1012 we introduced the collator selection pallet.
This pallet implements a custom session manager that determines which collators are allowed to produce blocks in the parachain.
Because the static validators had to be copied from the session to the collator selection invulnerables list, the upgrade required a migration.
Unfortunately, there was an issue with the migration trigger. The code was meant to execute only against a specific version of the runtime, and we accidentally did not update that target value to match the new runtime version, so the migration code never executed.
As a result, the list of invulnerables was never populated. With an empty list, the set of collators for the following session would also be empty, so no collator would be allowed to produce blocks and the chain would halt.
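To make the failure mode concrete, here is a minimal sketch in plain Rust. It is a simplified model, not the actual Centrifuge runtime code: `ChainState`, `migrate_if_version`, and `next_collators` are hypothetical names, and the exact form of the version check in the real migration is an assumption.

```rust
// Hypothetical, simplified model of the failure. Not the real runtime code.

/// Minimal stand-in for the relevant chain state.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ChainState {
    /// Runtime version currently enacted.
    pub spec_version: u32,
    /// Static validators known to the session.
    pub session_validators: Vec<String>,
    /// Invulnerables list of the collator selection pallet.
    pub invulnerables: Vec<String>,
}

/// Version-gated migration: it only runs when the enacted runtime version
/// equals the hard-coded target. Returns true if the migration ran.
pub fn migrate_if_version(state: &mut ChainState, target_version: u32) -> bool {
    if state.spec_version != target_version {
        // A stale target means the migration is silently skipped.
        return false;
    }
    state.invulnerables = state.session_validators.clone();
    true
}

/// Collators for the next session. In this simplified model they are
/// exactly the invulnerables.
pub fn next_collators(state: &ChainState) -> Vec<String> {
    state.invulnerables.clone()
}
```

In this model, calling `migrate_if_version` with a target that does not match the enacted version leaves `invulnerables` empty, so `next_collators` returns an empty list, which corresponds to the halt described above.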
Fix
At this point we were on a tight deadline of 6 hours (session time) to “manually” propose the changes that the migration would have made before the next session started; otherwise, block production on the chain would halt.
Six hours might seem like a lot of time, but democracy fast-track proposals have a minimum voting period of 3 hours, and any code that had to be executed that way needed to go through that flow. This meant that we only had 3 hours to identify the issue, build an action plan, and hope to get enough council votes (across timezones) to move the motion to the public referenda.
Three council motions were proposed, two of which had to go through democracy voting:
- Set QueuedKeys raw storage for each active collator (Subscan)
- Collator allow list for each invulnerable (Subscan)
- Set invulnerables (Subscan)
Learnings
- We will standardize how we implement migrations in runtime upgrades across circles, so that we do not depend on runtime versions to decide whether a migration should run and instead use storage state.
- We will improve how we verify and test runtimes before they are enacted, by ensuring that there are automated sanity tests that span multiple sessions.
- Given the current size of the council, it would be worth proposing a few changes to the council fast-track logic to make the process smoother while keeping it secure, for example:
  - Reduce the council threshold to 50% for fast track: This would mean that at least 5 councillors are needed to push a motion forward (instead of the current 7). This is also relevant for issues that happen at a time when there is no overlap across timezones.
  - Reduce the fast-track voting period to 1-2 hours instead of 3 hours.
- Have explicit council channels that specify the urgency and priority of the issue at hand.
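The first two learnings could look roughly like the sketch below, written in plain Rust against a simplified model. All names (`Storage`, `migrate_if_unpopulated`, `first_halted_session`) are hypothetical, and the assumption that the next session's collators are exactly the invulnerables is a simplification for illustration only.

```rust
// Hypothetical sketch. Names and the rotation model are illustrative only.

/// Minimal stand-in for the relevant storage items.
#[derive(Debug, Default, Clone)]
pub struct Storage {
    /// Collators active in the current session.
    pub active_collators: Vec<String>,
    /// Invulnerables list of the collator selection pallet.
    pub invulnerables: Vec<String>,
}

/// Storage-gated migration: the decision to run depends on storage state,
/// not on the runtime version. Returns true if it wrote anything.
pub fn migrate_if_unpopulated(storage: &mut Storage) -> bool {
    if !storage.invulnerables.is_empty() || storage.active_collators.is_empty() {
        // Already migrated, or nothing to copy: do nothing.
        return false;
    }
    storage.invulnerables = storage.active_collators.clone();
    true
}

/// Sanity check spanning several sessions: rotates `sessions` times and
/// returns the first session (1-based) that ends up with no collators.
pub fn first_halted_session(mut storage: Storage, sessions: u32) -> Option<u32> {
    for session in 1..=sessions {
        // Simplified rotation: the new active set is the invulnerables list.
        storage.active_collators = storage.invulnerables.clone();
        if storage.active_collators.is_empty() {
            return Some(session);
        }
    }
    None
}
```

Because the guard reads storage, running the migration a second time is a no-op, and a stale version constant cannot cause it to be skipped. The multi-session check matters because, in this model, the current session still has collators and the problem only shows up after the first rotation.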
Thank you all for your understanding,