
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assisting millions of active users in real-time"

Nowadays many companies are becoming data rich and data intensive. They have millions of users generating billions of interactions and events per day. These massive streams of complex events can be processed and reacted upon, e.g. to offer new products, trigger next best actions, communicate with users, or detect fraud; the quicker we can do it, the more value we can generate. Our presentation is based on our recent experience building a real-time data analytics platform for telco events. This platform was jointly built by GetInData and Kcell, the leading telco in Kazakhstan, in just a few months, and it currently runs in production at the scale of 10M subscribers and 160K events per second on a still small cluster. It is used as a backbone for personalized marketing campaigns, fraud detection, and cross-sell and up-sell, following the behavior of millions of users in real time and reacting to it instantly. We will share how we built such a platform using current best-of-breed open-source projects like Flink, Kafka, and NiFi. We won't skimp on the details of how we designed and optimized our Flink applications for high load and performance. We will also describe the challenges we faced during development and give some tips on what to pay attention to when developing similar solutions, not only for telco but also for banks, e-commerce, IoT, and other industries.



  1. Assisting Millions of Users in Real-Time. Flink Forward Berlin 2018
  2. The Speakers: Who are these guys? Alexey Brodovshuk @alexeybrod, Krzysztof Zarzycki @k_zarzyk
  3. About Kcell: Kcell JSC is part of the largest Scandinavian telecommunications holding, TeliaCompany. Kcell has a strong software development team and lots of experience in building services and products. We like innovation. More than 10,000,000 subscribers; the largest GSM operator in Kazakhstan; great network coverage: 4G (40%), 3G (73%), 2G (96%) of the population. There is an ongoing company-wide digital transformation: not only telco.
  4. Business Needs: Assisting Millions of Users in Real-Time. Input (SMS events, voice usage events, data usage events, roaming events, location events) → Process → Actions.
  5. Use Cases: use case scenarios, just a few of many. Balance Top-Up case: if a subscriber tops up her balance too often in a short period of time, we can offer her a less expensive tariff or auto-payment services. (Case → Trigger → UI)
  6. Roaming cases. Roaming case: trigger to the marketing platform if a subscriber visited country X and/or registered in visited mobile network Y and the device type is Z. Fraud case in roaming: send an email to the anti-fraud unit if a subscriber registered in roaming but the balance at that moment is zero; this situation is impossible in the standard case.
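The two roaming rules above boil down to simple predicates over event fields. A minimal sketch in plain Java (the `RoamingEvent` type and its field names are illustrative assumptions, not the platform's actual data model):

```java
// Hypothetical event type carrying the fields the two roaming rules inspect.
record RoamingEvent(String country, String visitedNetwork, String deviceType, double balance) {}

class RoamingRules {
    // Fraud case: subscriber is registered in roaming but the balance is zero.
    static boolean isRoamingFraud(RoamingEvent e) {
        return e.balance() == 0.0;
    }

    // Marketing case: visited country X and/or network Y, and device type is Z.
    static boolean isMarketingTrigger(RoamingEvent e, String x, String y, String z) {
        return (e.country().equals(x) || e.visitedNetwork().equals(y))
                && e.deviceType().equals(z);
    }
}
```

In the real platform such predicates run inside a Flink job and emit notifications rather than booleans, but the business logic has this shape.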
  7. Old System: Why did we start to look for a new solution? It was an external vendor solution: (1) a black-box solution, so Kcell developers can't fix, tweak, or optimize it; (2) scalability issues: limited to ~2,000 events/sec and can't support all needed data sources; (3) not reliable: multiple incidents that took too much time to resolve.
  8. Scale: required system throughput is 165K events/second, 10M subscribers, 22.5 TB/month.
  9. About GetInData: Big Data. Passion. Experience. Roots at Spotify; focus on Big Data from day 1; production experience; contributions to Apache Flink.
  10. New Solution: real-time stream processing. Pipeline: ingestion (HTTP push/pull, FTP, NFS, MQ) → events hub → events processing → outgestion (HTTP push/pull, FTP, MQ).
  11. New Solution: the same pipeline, with Flink as the events-processing engine.
  12. New Solution (Operations): Web UI, monitoring, and security on top of the same pipeline. Admin UI (triggers workbench) with a Kafka-based API; monitoring via the ELK stack for logs and InfluxDB/Grafana for metrics; security via FreeIPA, Kerberos, LDAP/AD.
  13. New Solution (Data Lake): data lake and sub-second OLAP analytics. Historical storage (HDFS) with batch (Spark) and SQL (Hive) to keep history, report, and explore; a column-oriented data store for OLAP (Druid) and interactive BI.
  14. Processing Flow: real-time stream processing. Raw call events and data usage events are ingested, transformed into transformed events, and processed against local state (RocksDB). The Admin UI submits/stops triggers via a control topic; outgestion delivers notification events and HTTP calls.
  15. Dynamic Rules Design: some treats for Squirrels.
  16. Dynamic Rules Design, key points: ● We want to run 100s of triggers/business rules. ● A typical approach is one job per rule. ● That won't work in our case: ○ running 100s of topologies/jobs multiplies resource cost; ○ data is pulled from Kafka 100s of times; ○ state (user features) is replicated 100s of times; ○ starting a rule requires deploying a job.
  17. Dynamic Rules Design, key points: ● Our approach: one job runs all triggers/rules ○ and consumes all the sources. ● Trigger "templates" are still coded in Java. ● Rules can be added/removed without restarting the application. ● 100s of rules run efficiently.
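The "one job, dynamic rules" idea can be illustrated outside Flink with a plain Java sketch: a control channel adds or removes rule instances in a shared registry while events keep flowing, so no redeployment is needed. In the real platform this registry lives inside the Flink job and is fed by the control topic; all names here are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// One engine evaluates all registered rules against every event, so
// rules can be added or removed at runtime without restarting anything.
class DynamicRuleEngine<E> {
    private final Map<String, Predicate<E>> rules = new ConcurrentHashMap<>();

    // In the real job, "control topic" messages would call these.
    void addRule(String id, Predicate<E> rule) { rules.put(id, rule); }
    void removeRule(String id) { rules.remove(id); }

    // Returns the ids of all rules that fired for this event.
    List<String> evaluate(E event) {
        List<String> fired = new ArrayList<>();
        rules.forEach((id, rule) -> { if (rule.test(event)) fired.add(id); });
        return fired;
    }
}
```

The Flink equivalent keys events by user and holds the rule set in operator state updated from the control stream, but the control-plane/data-plane split is the same.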
  18. Dynamic Rules Design, the overview: billing events and roaming events → sort by time → deduplicate → Router (late events split off) → Apply Triggers (CoProcessFunction, keyed by user, fed by the control topic): Trigger 1, Trigger 2, State Updater → notification events.
  19. Dynamic Rules Design, pros and cons. Pros: shared resources and costs (CPU, RAM, state, shuffle; data is pulled from Kafka once); no job restart to start new rules (rules are started by business, no IT involved); shared state (customer features built once can be seen by all rules). Cons: one bad rule affects the whole system (watermarks are shared, failures are shared); the rule template still needs to be coded in the job (no way to use SQL, Table API, or CEP); can be tricky to debug (code is shared, code paths are enabled externally).
  20. Dynamic Rules Design, issue: lagging sources slow down all rules. Problem: Source A (highly unordered, late) and Source B (ordered, low latency) feed a single triggers group, so all notifications are late. Solution: split the triggers into groups per source: Triggers Group 1 on Source A produces late notifications, Triggers Group 2 on Source B produces low-latency notifications.
  21. Join Streams Solution: joining Stream X and Stream Y. Issue: no data = no watermark.
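The "no data = no watermark" issue: a join's combined watermark is the minimum over its inputs, so a silent stream stalls all downstream event-time progress. One common mitigation, sketched here in plain Java rather than Flink's actual API, is to advance an idle source's watermark from the wall clock after a timeout:

```java
// Sketch of an idleness-aware watermark tracker for one source.
// If no event arrives within idleTimeoutMs, the watermark advances
// with processing time so downstream joins are not blocked forever.
class IdleAwareWatermark {
    private final long idleTimeoutMs;
    private long lastEventTimestamp = Long.MIN_VALUE;
    private long lastEventArrivalMs;

    IdleAwareWatermark(long idleTimeoutMs, long nowMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.lastEventArrivalMs = nowMs;
    }

    void onEvent(long eventTimestamp, long nowMs) {
        lastEventTimestamp = Math.max(lastEventTimestamp, eventTimestamp);
        lastEventArrivalMs = nowMs;
    }

    long currentWatermark(long nowMs) {
        if (nowMs - lastEventArrivalMs > idleTimeoutMs) {
            // Source looks idle: fall back to wall clock minus the timeout.
            return Math.max(lastEventTimestamp, nowMs - idleTimeoutMs);
        }
        return lastEventTimestamp;
    }
}
```

Later Flink releases added built-in support for this pattern via `WatermarkStrategy.withIdleness(...)`, which marks a silent stream as idle so it no longer holds back the combined watermark.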
  22. Flink Changes Wishlist: what could be even better? Dynamic topologies: attach a new branch to an existing topology that receives the same data. Cheaper topologies: a graph of topologies that pass data locally in Flink, in other words local proxying/fan-out of Kafka traffic, sharing inputs between topologies. Dynamic SQL.
  23. Decisions Made: some decisions our team made before or during project implementation: a streaming-first approach; Apache Kafka as the event hub; Apache Flink for powerful real-time analytics.
  24. Performance: Apache Avro; keep state local to the process; ingest reference data (e.g. subscriber profile data, as events) for local joins and enrichment: ● no need to query external systems while processing, not at >100K events/sec; ● data-time correlation correctness.
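Keeping reference data local can be sketched as a small enrichment cache fed by a profile-update stream, so per-event processing is an in-memory lookup rather than a remote call on the hot path. The types and field names below are illustrative assumptions; in the real job this state would live in Flink's keyed state (RocksDB):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: subscriber profiles arrive as a stream of updates and are
// held locally, so enriching an event never queries an external system.
class LocalEnrichment {
    record Profile(String tariff, String segment) {}
    record Enriched(String subscriberId, String eventType, Profile profile) {}

    private final Map<String, Profile> profiles = new ConcurrentHashMap<>();

    // Applied for every record on the profile/reference stream.
    void onProfileUpdate(String subscriberId, Profile p) {
        profiles.put(subscriberId, p);
    }

    // Applied for every event on the main stream; profile is null if unknown.
    Enriched enrich(String subscriberId, String eventType) {
        return new Enriched(subscriberId, eventType, profiles.get(subscriberId));
    }
}
```

At >100K events/sec even a few milliseconds of external lookup per event would dominate the pipeline, which is why the deck insists on local state.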
  25. Ease of Use: NiFi for data ingestion (no coding), ● but not for CEP; a Web UI for configuring triggers.
  26. Reliability and battle-tested techniques: Flink on YARN, with HDFS; HA for redundancy and running ~24/7; InfluxDB & Grafana for monitoring and alerting; ELK for log collection and aggregation. Security: Kerberos and AD thanks to FreeIPA; Apache Ranger for authorization.
  27. Extensibility and cost-efficiency: one platform for the whole enterprise, with batch (ad-hoc) queries too (Spark, Hive/Presto) and online analytics (OLAP). Open-source technologies; HDP as a license-free distribution; just start with a bunch of servers.
  28. Our Collaboration: two heads are better than one. Joint development team: not a vendor solution, development as one team. Code quality: code review and automated tools for code-quality control. Agile practices: distant geographic locations, but everyday standups. Go live quickly: <4 months to the first production case running 24/7! Deliver: DevOps/automation. Knowledge sharing: constant knowledge exchange in areas of expertise. Testing: a separate testing environment; automated unit/E2E tests.
  29. Future Work: we have already done a lot, but more great things are coming. Goal: make it a company-wide, self-service go-to place for data analysis. 2018 Q4: more data sources (geolocation data, equipment logs) and more triggers. 2019 Q1: Data Lake. 2019 Q2: Machine Learning, adding machine learning and other tools that will enhance our platform even more. Bright future: real-time BI (an intraday view on business and operations); Customer 360 view (call center, clickstream, communication… all in one place, ready for behavioral analysis); data monetization (monetize valuable insights from our combined rich data sources); predictive maintenance and network optimization (to lower operational costs and make better investments); and many more.
  30. Questions? Flink Forward Berlin 2018. Contact us: