Post-Mortem of Centrifuge Runtime Upgrade 1012 - 30/09/22

mikiquantum · September 30, 2022, 8:03pm

Explanation

During the enactment of the Centrifuge Runtime upgrade 1012 we introduced the collator selection pallet.
This pallet implements a custom session manager of the collators allowed to produce blocks in the parachain.
Since we needed to copy the static validators from the session to the collator selection invulnerable list, we needed to run a migration.
Unfortunately, there was an issue with the migration trigger. The code was meant to execute only against a specific runtime version, and we accidentally did not update that target value to match the new runtime version, so the migration code never executed.
The impact: if the list of invulnerables is not populated, the list of next collators is empty in the following session, so no collator is allowed to produce blocks and the chain halts.
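To make the failure mode concrete, here is a minimal toy model in plain Python. It is not the actual runtime code and does not use any Substrate API: the `Chain` class, the function names, the account names and the version numbers are all made up for illustration. It only shows how a migration guarded by a stale version target is skipped silently and leaves the next collator set empty.

```python
# Toy model only: plain Python stand-ins, not Substrate/FRAME code.
from dataclasses import dataclass, field


@dataclass
class Chain:
    observed_version: int                  # version the migration guard looks at
    session_validators: list               # static validators held by the session
    invulnerables: list = field(default_factory=list)  # collator selection list


def version_gated_migration(chain: Chain, target_version: int) -> bool:
    """Copy validators into invulnerables only if the version check matches."""
    if chain.observed_version != target_version:
        return False                       # skipped silently: nothing is migrated
    chain.invulnerables = list(chain.session_validators)
    return True


def next_session_collators(chain: Chain) -> list:
    """In this model the next collator set is taken from the invulnerables."""
    return list(chain.invulnerables)


# Hypothetical numbers: the guard sees 12, but the target was left at 11.
chain = Chain(observed_version=12, session_validators=["alice", "bob"])
STALE_TARGET = 11
ran = version_gated_migration(chain, STALE_TARGET)

print(ran, next_session_collators(chain))  # False []
```

With the target set to the value the guard actually observes, the same call returns `True` and the next collator set is `["alice", "bob"]`.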

Fix

At this point we were on a tight deadline of 6 hours (session time) to “manually” propose the changes that the migration would have made before the next session started; otherwise block production on the chain would halt.

Six hours might seem like a lot of time, but democracy fast-track proposals have a minimum voting period of 3 hours, and any code executed that way has to go through that flow. This meant that we only had 3 hours to identify the issue, build an action plan and hope to get enough council votes (across timezones) to move the motion to a public referendum.
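The time budget described above is simple arithmetic; the sketch below spells it out. The two durations come from the post, while the function name and the `elapsed` parameter are only illustrative, and the model assumes the clock starts at the session boundary and that the referendum is enacted as soon as the voting period ends.

```python
from datetime import timedelta

SESSION_LENGTH = timedelta(hours=6)         # from the post: one session
MIN_FAST_TRACK_VOTING = timedelta(hours=3)  # from the post: minimum voting period


def time_left_for_council(elapsed: timedelta) -> timedelta:
    """Time remaining to diagnose, plan and pass the council motion."""
    return SESSION_LENGTH - MIN_FAST_TRACK_VOTING - elapsed


budget = time_left_for_council(timedelta(0))
print(budget)  # 3:00:00
```

Every hour spent on diagnosis comes straight out of that 3-hour budget, for example `time_left_for_council(timedelta(hours=1))` leaves 2 hours.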

Three council motions were proposed, two of which had to go through democracy voting:

Set QueuedKeys raw storage for each active collator: Subscan
Collator allow list for each invulnerable: Subscan
Set invulnerables: Subscan

Learnings

We will standardize how we implement migrations in runtime upgrades across circles, so that we do not depend on runtime versions to check whether a migration should run and instead use storage state.
We will improve how we verify and test runtimes before they are enacted, by ensuring that there are automated sanity tests that span multiple sessions.
Due to the current size of the council, it would be interesting to propose a few changes to the council fast-track logic that can make the process smoother while keeping it secure, for example:
- Reduce the council threshold to 50% for fast-track: This would mean that at least 5 councillors are needed to push forward a motion (instead of the current number of 7). This is also relevant for issues that happen at a time when there is no overlap across timezones.
- Reduce fast-track to 1-2 hours instead of 3 hours
- Have explicit council channels that specify the urgency and priority of the issue at hand
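As a rough illustration of the first two learnings (triggering migrations from storage state and running sanity tests across several sessions), here is a toy sketch in plain Python. It is not the planned implementation and uses no Substrate API: `Chain`, `migrate_if_needed`, `rotate_session` and `survives_sessions` are hypothetical names, and the session model is deliberately simplified.

```python
# Toy model only: plain Python stand-ins, not Substrate/FRAME code.
from dataclasses import dataclass, field


@dataclass
class Chain:
    session_validators: list
    invulnerables: list = field(default_factory=list)
    collators: list = field(default_factory=list)


def migrate_if_needed(chain: Chain) -> bool:
    """Run the copy only when storage shows it has not happened yet."""
    if chain.invulnerables:                # already populated: idempotent no-op
        return False
    chain.invulnerables = list(chain.session_validators)
    return True


def rotate_session(chain: Chain) -> None:
    """Toy session change: the next collator set comes from the invulnerables."""
    chain.collators = list(chain.invulnerables)


def survives_sessions(chain: Chain, sessions: int) -> bool:
    """Sanity check: every simulated session must keep at least one collator."""
    for _ in range(sessions):
        rotate_session(chain)
        if not chain.collators:
            return False
    return True


upgraded = Chain(session_validators=["alice", "bob"], collators=["alice", "bob"])
migrate_if_needed(upgraded)
skipped = Chain(session_validators=["alice", "bob"], collators=["alice", "bob"])

print(survives_sessions(upgraded, 3), survives_sessions(skipped, 3))  # True False
```

Because the guard reads the state it is about to change, running it a second time is a no-op, and a check that only looks at the block right after the upgrade would miss the `skipped` case, which fails only once a session rotates.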

Thank you all for your understanding.

ImdioR · September 30, 2022, 8:30pm

Good day MQ
Awesome. Thank you.

lucasvo · October 2, 2022, 10:02pm

Thanks for the detailed explanation @mikiquantum. It was a pretty hectic night - thanks to all the council members @prankstr25, @Ash, @Yarosl6, @vedhavyas, @branan as well for your quick reaction and help!

I think reducing the fast-track threshold to 50% is a good idea, given the larger council we already have. I am a bit worried about a reduced fast-track time, though: it gives very little time to react in case this is ever abused.

Tjure07 · October 3, 2022, 5:51am

Should we move this conversation to “RFC” to discuss the proposed changes or should it be clarified internally in the council?

lucasvo · October 3, 2022, 8:19am

No, a new thread should be created if these changes are being proposed. This should stay on topic (the faulty runtime upgrade).

