Walmart.com generates millions of events per second. At WalmartLabs, I work on a team called the Customer Backbone (CBB), where we wanted to move to a platform capable of processing this event volume in real time and storing the state/knowledge of potentially every Walmart customer generated by that processing. Kafka Streams' event-driven architecture seemed like the obvious choice. However, Walmart's scale poses a few challenges:

• The clusters need to be large, with all the operational problems that entails.
• Infinite retention on changelog topics wastes valuable disk space.
• Standby task recovery after a node failure is slow (changelog topics hold GBs of data).
• Kafka Streams has no support for repartitioning.

As part of our event-driven development, and to address the challenges above, I'm going to talk about some bold new ideas we developed as features/patches to Kafka Streams to deal with the scale required at Walmart:

• Cold Bootstrap: When a Kafka Streams node fails, instead of recovering from the changelog topic, we bootstrap the standby directly from the active's RocksDB instance using JSch, with careful offset management guaranteeing zero event loss.
• Dynamic Repartitioning: We added support for repartitioning in Kafka Streams, redistributing state among the new partitions. We can now elastically scale to any number of partitions and any number of nodes.
• Cloud/Rack/AZ-aware task assignment: Active and standby tasks for the same partition are never assigned to the same rack.
• Decreased Partition Assignment Size: With a cluster as large as ours (>400 nodes and 3 stream threads per node), the partition assignment for the Kafka Streams cluster grows to a few hundred MBs, so rebalances take a long time to settle.

Key Takeaways:
• A basic understanding of Kafka Streams.
• Productionizing Kafka Streams at scale.
• Using Kafka Streams as a distributed NoSQL DB.
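To make the rack/AZ-aware placement rule concrete, here is a minimal sketch of the idea. This is not the actual patch to Kafka Streams' task assignor (which is Java and considerably more involved); it is a hypothetical Python model showing only the core invariant: a partition's standby task never lands in the same rack/AZ as its active task. The function name `assign_tasks` and the round-robin placement are assumptions for illustration.

```python
# Hypothetical sketch of rack/AZ-aware assignment (not the real Kafka Streams patch).
# nodes: list of (node_id, rack) pairs; partitions: iterable of partition ids.
def assign_tasks(partitions, nodes):
    """Return {partition: (active_node, standby_node)} such that the standby
    is placed in a different rack/AZ than the active whenever possible."""
    assignment = {}
    for i, partition in enumerate(partitions):
        # Round-robin the active task across all nodes.
        active_id, active_rack = nodes[i % len(nodes)]
        # Prefer standby candidates in a *different* rack than the active.
        candidates = [n for n, rack in nodes if rack != active_rack]
        if not candidates:
            # Degenerate single-rack cluster: fall back to any other node.
            candidates = [n for n, _ in nodes if n != active_id]
        standby_id = candidates[i % len(candidates)]
        assignment[partition] = (active_id, standby_id)
    return assignment
```

With two AZs, e.g. `nodes = [("n1", "az1"), ("n2", "az2"), ("n3", "az1"), ("n4", "az2")]`, every partition's active and standby end up in different AZs, so losing an entire rack/AZ still leaves one full copy of the state available for cold bootstrap.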