<aside> ✏️ 18. Nov 2022 - Philipp Wassibauer
</aside>
Dune.com experienced multiple incidents while migrating to our new Query Execution Service (QES). We are sorry for the disruptions this has caused to our community. In this Post Mortem we want to share what caused the outages, what we learnt from them and what we will do to prevent them in the future.
One of Dune's fundamental precepts is transparency. Every day, we look to make the world of on-chain data more transparent and accessible. We want to hold ourselves accountable to the same standards. It is therefore important to us to be as open as possible and to share what happened with everyone in our community, and we will strive to do the same for future outages.
The root cause of the incidents was the new service needing to rebuild our cache of query results, coupled with an overly rushed migration plan. This overloaded our data infrastructure, leading to long queues and extra load on many of our systems.
The situation was exacerbated by the new architecture's larger memory requirements for processing large query results through our GraphQL services. In combination with the higher load, this started crashing our GraphQL services, and each crash led to multiple people experiencing "Failed to fetch" errors while loading Dashboards or Queries.
A further side effect of the overload was an increase in intermittent session management ("JWT expired") errors. The "JWT expired" bug happens predominantly on inactive tabs that poll our backend. In specific edge cases the token expires while the tab is idle, which causes this error. When the system is under more load, more tabs sit idle while people do something else with their time, which increases the chance of hitting this bug and makes it more prevalent.
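To illustrate the kind of check that guards against a token expiring on an idle tab, here is a minimal sketch in Python. It is not Dune's actual session code. The function names (`jwt_expiry`, `needs_refresh`, `make_unsigned_token`) and the 60-second leeway are assumptions chosen for illustration. The idea is to read the standard `exp` claim from the token and refresh it before the next poll if it has expired or is about to.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> float:
    """Return the `exp` claim (seconds since epoch) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore padding before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return float(payload["exp"])


def needs_refresh(token: str, now: Optional[float] = None, leeway_s: float = 60.0) -> bool:
    """True if the token has expired or will expire within `leeway_s` seconds.

    The leeway value is a hypothetical default, not a documented setting.
    """
    current = time.time() if now is None else now
    return jwt_expiry(token) - current <= leeway_s


def make_unsigned_token(exp: float) -> str:
    """Build a throwaway token for demonstration only (the signature is a dummy)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{b64({'alg': 'none'})}.{b64({'exp': exp})}.sig"
```

A polling client could call `needs_refresh(token)` before each request and fetch a new token first when it returns `True`, so that a tab that has been idle for a long time does not send an already expired token.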
Very overloaded Ethereum (V1) Replicas
8th Nov 17:00 - 23:30
9th Nov 4:00 - 11:00
Overloaded Ethereum (V1) Replicas
14th Nov 4:00 - 21:00