Apple Siri is the world’s largest virtual assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. A.I. and machine learning are used to personalize your experience throughout your day. The more you use, the more helpful it can be. At this scale, the information we are processing is very humongous. We use Apache Spark to get the job done.
In this talk, we will discuss the architecture of Siri data pipelines, and in particular, how Apache Spark is used to aggregate the data coming from different data centers globally into one source of truth for analytical use-cases for ML model building and productizing. We will talk about the specific techniques we use at Siri to scale, and various pitfalls we have found along the way. As part of the OSS community, we contributed back many features and bug fixes during the process; as a result, all the Spark users can get the significant run time improvement and resource savings.