Post-Mortem of Centrifuge Runtime Upgrade 1012 - 30/09/22

Explanation

During the enactment of the Centrifuge Runtime upgrade 1012 we introduced the collator selection pallet.
This pallet implements a custom session manager of the collators allowed to produce blocks in the parachain.
Since we needed to copy the static validators from the session to the collator selection invulnerable list, we needed to run a migration.
Unfortunately there was an issue with the migration trigger. The code was meant to execute only against a specific version of the runtime, and we accidentally didn’t update that target value to match the new runtime version, so the migration code never executed.
The impact: if the list of invulnerables is not populated, the list of next collators is empty in the following session, so no collator is allowed to produce blocks, causing a chain halt.
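To make the failure mode concrete, here is a minimal sketch in plain Rust. It is a simplified model, not the actual runtime code: the `ChainState` struct, the function names, and the stale target value `1011` are all hypothetical and chosen only for illustration.

```rust
// Simplified, hypothetical model of the chain state; not the real pallets.
struct ChainState {
    spec_version: u32,
    session_validators: Vec<&'static str>,
    invulnerables: Vec<&'static str>,
}

// Hypothetical stale target: the gate was not bumped to the new version (1012).
const MIGRATION_TARGET_VERSION: u32 = 1011;

/// Version-gated migration: copies the static validators into the
/// invulnerables list only when the runtime version equals the hard-coded
/// target. Returns true if the migration actually ran.
fn version_gated_migration(state: &mut ChainState) -> bool {
    if state.spec_version != MIGRATION_TARGET_VERSION {
        return false; // gate does not match, nothing is migrated
    }
    state.invulnerables = state.session_validators.clone();
    true
}

/// Collators for the next session, modeled here as the invulnerables list only.
fn next_collators(state: &ChainState) -> Vec<&'static str> {
    state.invulnerables.clone()
}
```

With `spec_version` set to 1012, `version_gated_migration` returns `false` and `next_collators` is empty, which corresponds to the halt scenario described above.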

Fix

At this point we were on a tight deadline of 6 hours (session time) to “manually” propose the changes that the migration would have made before the next session started; otherwise block production on the chain would halt.

Six hours might seem like a lot of time, but democracy fast-track proposals have a minimum voting period of 3 hours, so any code that had to be executed that way needed to go through that flow. This meant that we only had 3 hours to identify the issue, build an action plan and hope that we would get enough council votes (across timezones) to move the motion to public referenda.
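The time budget can be written down as a small calculation. This is only a sketch of the arithmetic; the function name `reaction_budget_hours` is hypothetical.

```rust
/// Hours left to diagnose the issue and gather council votes, given the
/// session length and the minimum fast-track voting period (both in hours).
/// Returns None when the voting period alone is longer than the session.
fn reaction_budget_hours(session_hours: u32, min_voting_hours: u32) -> Option<u32> {
    session_hours.checked_sub(min_voting_hours)
}
```

For the values in this incident, `reaction_budget_hours(6, 3)` gives `Some(3)`.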

There were three council motions proposed, two of which had to go through democracy voting:

Learnings

  • We will standardize how we implement migrations in runtime upgrades across circles, so that we do not depend on runtime versions to check whether a migration should run and instead use storage state.
  • We will improve how we verify and test runtimes before they are enacted, by ensuring that there are automated sanity tests that span multiple sessions.
  • Given the current size of the council, it would be interesting to propose a few changes to the council fast-track logic so that the process is smoother and still secure, for example:
    • Reduce the council threshold for fast track to 50%: at least 5 councillors would be needed to push a motion forward (instead of the current 7). This is also relevant for issues that happen at a time when there is no overlap across timezones.
    • Reduce the fast-track voting period to 1-2 hours instead of 3 hours.
    • Have explicit council channels that specify the urgency and priority of the issue at hand.
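The first two learnings can be illustrated with a minimal sketch in plain Rust. It is a simplified model under stated assumptions, not the actual runtime or test code: the `Storage` struct and both function names are hypothetical, and the session rotation is reduced to "the first session uses the pre-upgrade validators, every later session uses the invulnerables list".

```rust
// Simplified, hypothetical model of the relevant storage; not the real pallets.
struct Storage {
    session_validators: Vec<&'static str>,
    invulnerables: Vec<&'static str>,
}

/// Storage-gated migration: runs whenever the invulnerables list is still
/// empty, independent of the runtime version. Calling it again is a no-op.
/// Returns true if the migration actually ran.
fn storage_gated_migration(storage: &mut Storage) -> bool {
    if !storage.invulnerables.is_empty() {
        return false; // already migrated
    }
    storage.invulnerables = storage.session_validators.clone();
    true
}

/// Multi-session sanity check: simulates `sessions` consecutive sessions and
/// returns the index of the first session with no collators, if any.
fn first_halted_session(storage: &Storage, sessions: u32) -> Option<u32> {
    // Session 0 still runs with the pre-upgrade validator set.
    let mut active = storage.session_validators.clone();
    for session in 0..sessions {
        if active.is_empty() {
            return Some(session);
        }
        // From the next session on, collators come from the invulnerables list.
        active = storage.invulnerables.clone();
    }
    None
}
```

In this model, a check over a single session passes even when the invulnerables list is empty, while a check over two or more sessions reports a halt at session 1. That is the reason for having the sanity tests span multiple sessions.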

Thank you all for your understanding,

Good day MQ
Awesome. Thank you.

Thanks for the detailed explanation @mikiquantum. It was a pretty hectic night - thanks to all the council members @prankstr25, @Ash, @Yarosl6, @vedhavyas, @branan as well for your quick reaction and help!

I think reducing the threshold for a fast track to 50%, given the larger council we already have, is a good idea. I am a bit worried about a reduced fast-track time. It gives very little time to react in case this is ever abused.

Should we move this conversation to “RFC” to discuss the proposed changes or should it be clarified internally in the council?

No, a new thread should be created if these changes are being proposed. This should stay on topic (the faulty runtime upgrade).