This document summarizes a one-year effort to improve the stability, reliability, and performance of an app and platform with over 1 million users and 2 million euros in monthly revenue. The main steps were building dashboards and improving logs to monitor performance, setting alarms to detect issues, prioritizing a backlog of improvements, holding weekly refactoring sessions, and analyzing latency metrics such as averages and percentiles to identify and address bottlenecks. These efforts brought clear benefits for clients: far less downtime, less time spent debugging, faster app performance, and issues detected before users were affected.
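
As an illustration of why latency percentiles are worth tracking alongside averages, here is a minimal sketch in Python. It is not taken from the platform described above: the function names `percentile` and `summarize`, the nearest-rank method, and the sample data are all assumptions chosen for the example.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the mean and a few common percentiles for a list of latencies."""
    return {
        "mean": mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical data: 98 fast requests (100 ms) and 2 very slow ones (5000 ms).
latencies = [100] * 98 + [5000] * 2
summary = summarize(latencies)
print(summary)
```

In this made-up sample the mean is 198 ms, which matches no real request, while p50 and p95 stay at 100 ms and p99 jumps to 5000 ms. The slow tail is visible only in the high percentile, which is the kind of signal that can point to a bottleneck affecting a small share of users.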