This presentation covers the common challenges in building an analytics platform (an audience platform is the chosen use case) and offers guidelines and recommendations on how to address them. It starts by motivating the need for such a platform and describing the components that make it up, then walks through common design options for these components and suggests alternatives. The presentation concludes with a design proposal being evaluated for the audience platform at InMobi.
2. Motivation
➔Audience Analytics platform is extremely critical
➔Segmentation
➔Rule Based
➔Inferred based on Sciences Modeling
➔Third Party
➔Targeting
➔Maximize CTR and CVR
3. Challenges
➔Scale
➔Billions of Ad requests/day, Peak 25K rps, 800M Users
➔ Multiple Input Sources and Types
➔Fact Data, Dimension Data
➔ Multiple Consumers
➔Reporting, Segmentation and Targeting, Inferences
4. Challenges
➔ Data Curation
➔Define and Measure Data Quality
➔Track sources and possibly assign confidence
➔Governance and Licensing restrictions
➔ Consistent Querying Interface
6. Activity Data
➔Records actual activity
➔Time-series data
➔ Immutable, actual facts
➔Comprises Dimensions and Measures
➔Measures
➔Ad requests, Impressions, Clicks, Conversion, ...
7. Dimension Data
➔Domain-specific metadata (user, location, app, etc.)
➔Each domain will have its own schema
➔User (uid, age, gender, interests etc)
➔Location (Lat/Long – zip/city/country, etc)
➔Device (Handset model, OS, version etc)
➔Mutable (but possibly slowly changing)
8. ETL
➔Need to ingest data from different sources
➔Transform the data into a format optimized for storage and easy querying
➔Query interface for different consumers
9. ETL - Ingestion
➔Naive -- Have custom ingestion flows
➔Quick to develop
➔Could be highly optimized
➔Not scalable
➔Have a generic framework
➔Streamlined and scalable
➔Might need more processing
10. ETL - Storage
➔Naive -- Storage schema closely coupled with ingestion schema
➔Multiple representations of the same data: age could be DOB or years
➔Consistent representation is a must
➔Would require transformation from the input schema to the storage schema
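As a rough illustration of the input-to-storage transformation above, consider the age/DOB case. This is a hypothetical adapter (the field names `dob` and `age_years` and the reference date are assumptions, not InMobi's actual schema) that maps either representation onto one canonical column:

```python
from datetime import date

def normalize_age(record, as_of=date(2014, 1, 1)):
    """Hypothetical adapter: map either a 'dob' or an 'age_years'
    input field onto a single canonical 'age' column (in years)."""
    if "dob" in record:
        dob = record["dob"]
        # Subtract one if the birthday has not yet occurred in the as_of year.
        age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    elif "age_years" in record:
        age = int(record["age_years"])
    else:
        age = None  # unknown; downstream curation may drop or flag the record
    return {"age": age}
```

The same pattern generalizes: one small adapter per input feed, all converging on the published storage schema.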
11. ETL - Storage
➔Location – Lat/Long, Zip, City, Country
➔Need to store at the lowest possible granularity (Lat/Long)
➔GPS readings come with an accuracy that needs to be recorded
➔Queries are almost always nearness queries, not exact matches
12. ETL - Storage
➔Quadtile representation
➔Use leading bits for the tile id, remaining bits for storing accuracy
➔Transform all location information to such ids
➔Nearness with Lat/Long distance is a cross-product join
➔With tiles, we can translate this into equi-joins (of course, with some loss of accuracy)
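A minimal sketch of the quadtile encoding described above, following the standard web-map quadkey scheme (bit-interleaved x/y tile coordinates at a fixed zoom level); the trailing accuracy bits mentioned in the slide are omitted for brevity, and the zoom level is an assumed parameter:

```python
import math

def quadtile_id(lat, lon, zoom=16):
    """Map a lat/long to a quadtile id at the given zoom level.
    Nearby points share tile ids (and id prefixes across zoom levels),
    so nearness queries can become equi-joins on the id."""
    lat = max(min(lat, 85.05112878), -85.05112878)  # Web-Mercator limits
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.log(math.tan(math.radians(lat)) +
             1.0 / math.cos(math.radians(lat))) / math.pi) / 2.0 * n)
    x, y = min(x, n - 1), min(y, n - 1)
    tile = 0
    for i in range(zoom - 1, -1, -1):  # interleave y/x bits, MSB first
        tile = (tile << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return tile
```

Two records landing in the same (or an adjacent) tile can then be matched with a plain equality join on `quadtile_id`, instead of a cross-product distance computation.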
13. ETL - Querying
➔Naive -- Users are aware of the multiple feeds and schemas, and query each appropriately
➔Extremely difficult as schemas change and new feeds get added
➔Closely coupled with the internal representation -- not good
14. ETL - Querying
➔Having a consistent, published schema
➔Enables exploration and discovery
➔Well-defined querying interfaces that abstract out the internal representation
➔Provide primitives (for example, UDFs for nearness calculations) for easier querying
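The nearness primitive mentioned above could be as simple as a great-circle distance function, exposed to consumers (e.g. as a Hive UDF) so they never touch raw Lat/Long math. A sketch of the underlying calculation, using the haversine formula:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two Lat/Long points,
    via the haversine formula; the kind of primitive a query layer
    could wrap as a UDF for nearness predicates."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

A consumer query then reads as `haversine_km(u.lat, u.lon, poi.lat, poi.lon) < 5` rather than re-deriving the trigonometry in every pipeline.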
16. Ingestion Server
● Curation to filter out dubious records
● Adapters for transformation
● REST based ingestion server
– Support multiple compression types
– Support multiple serialization formats
– Handle rate-limiting/throttling
– Bulk/Streaming inputs
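As a rough illustration of the compression and serialization handling listed above, here is a hypothetical payload decoder for such a REST endpoint (the magic-byte check and the JSON-only branch are simplifying assumptions, not the actual implementation):

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # leading bytes of any gzip stream

def decode_payload(raw, content_type="application/json"):
    """Hypothetical ingestion-server helper: transparently decompress
    gzip payloads, then deserialize by the declared content type."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    if content_type == "application/json":
        return json.loads(raw)
    # A real server would also dispatch on Avro, Thrift, CSV, etc.
    raise ValueError("unsupported serialization format: %s" % content_type)
```

Rate limiting and bulk/streaming handling would sit above this layer, at the HTTP framework level.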
17. Storage and Querying
● Possibly a different schema than the ingestion schema
● Columnar storage format (Parquet/ORC)
● Predominantly Hive-friendly
● No direct access to internal storage; access only through an HQL-like query layer
● Export option for other use cases (e.g. an online store)
18. Tech Stack
● Pig for most pipeline tasks
● Grill for analytics interface
● Hive as the primary execution engine
● Tez as the runtime environment
● ORC/Parquet for the storage format