
Grokking Techtalk #37: Data intensive problem

At some point in your software engineering career, you will have to deal with data, and your success depends on how much data your software can handle. Starting from a simple problem that requires processing a large amount of data, this talk shows how to approach this kind of issue and how to design and choose an efficient solution.

About speaker:

Hồ is a Senior Software Engineer at AXON, where he helps design and develop complex distributed systems, including image and video encoding and a distributed file conversion system. Besides coding, Hồ likes to read manga and meet friends in his free time.



  1. Ho Nguyen • Senior Software Engineer • Technical Interests: • Solution & code design • Distributed systems • Video/Image encoding • Hobbies • Movies & music • Manga & anime (One Piece, Dragon Ball...) • Coffee lover
  2. Data-intensive problem • Ho Nguyen, Senior Software Engineer
  3. Outline • Simple problem • When the data is big • More problems • Approaches
  4. Simple problem
  5. Program diagram
  6. Complete code
  7. Face Detection
  8. When the data is big
  9. How big is the data? • A data set of 2 billion records of unique URLs • Assuming the previous program needs 2 seconds per URL => concurrency = 0.5 URL/s • (2 s × 2×10⁹ URLs) / (3600 × 24) = 46,296 days ≈ 127 years
  10. What concurrency number do we need to complete the dataset in X days?
  11. What concurrency number do we need? • Goal: X = 7 days • 2 billion URLs • Current concurrency: 0.5 URL/s • 2×10⁹ / (X × 3600 × 24) = 2×10⁹ / (7 × 3600 × 24) ≈ 3,307 URLs/s
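The two calculations on the slides above can be checked with a few lines of Python (a back-of-the-envelope sketch; the 2 s/URL figure is the slide's assumption):

```python
# Back-of-the-envelope check of the slide's numbers.
TOTAL_URLS = 2_000_000_000      # 2 billion unique URLs
SECONDS_PER_URL = 2             # assumed single-process cost per URL
SECONDS_PER_DAY = 3600 * 24
TARGET_DAYS = 7

# Sequential runtime at 0.5 URL/s
days_sequential = TOTAL_URLS * SECONDS_PER_URL / SECONDS_PER_DAY
# Concurrency needed to finish within the target
required_rate = TOTAL_URLS / (TARGET_DAYS * SECONDS_PER_DAY)

print(round(days_sequential))   # 46296 days (about 127 years)
print(round(required_rate))     # 3307 URLs/s
```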
  12. How to increase concurrency? • Optimize code performance • Increase hardware resources (CPU, RAM, Disk, Network…) aka scale-up • Scale-out • Cloning to multiple processes (X-axis) • Splitting by functions (Y-axis) • Data partitioning (Z-axis)
  13. Optimize code • Pros • Most effective if we find a bottleneck whose removal closes the roughly 6,600× gap (0.5 to ≈3,307 URLs/s) • Saves infrastructure cost • Cons • Time-consuming and uncertain
  14. Scale-up • Pros • Easy to apply • Cons • Takes time to find a suitable hardware configuration • Expensive and limited • Still need to optimize code and redesign to take advantage of hardware resources once you can no longer scale up
  15. Scale-out by cloning (X-axis) • Pros • Can use all hardware resources • Not limited by hardware • Cons • More complex than scale-up • Concurrency problems • (diagram: Node 1, Node 2, Node 3)
  16. Scale-out by splitting (Y-axis) • Review the workflow
  17. Scale-out by splitting (Y-axis) • Download and resize images using the CPU • Face detection is faster on the GPU • Reference: https://sites.google.com/site/facedetectionongpu/
  18. Scale-out by splitting (Y-axis) • (diagram: three cloned "Download and Process Image" workers along the X-axis feeding two "Face Detection" workers split along the Y-axis)
  19. Scale-out by splitting (Y-axis) • Pros • Reuses the strengths of each kind of hardware • Cons • Complex • Concurrency problems
  20. Scale-out by data-partitioning (Z-axis) • Data schema:
      ID | URL                        | Done
      1  | https://abc.com/image1.jpg | 1
      2  | https://abc.com/image2.jpg | 0
      3  | https://abc.com/image3.jpg | 0
      4  | https://abc.com/image4.jpg | 0
  21. Scale-out by data-partitioning (Z-axis) • Key hashing:
      Partition 1: ID 1 | https://abc.com/image1.jpg | 0
                   ID 3 | https://abc.com/image3.jpg | 0
      Partition 2: ID 2 | https://abc.com/image2.jpg | 0
                   ID 4 | https://abc.com/image4.jpg | 0
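A minimal sketch of the key-hashing split on the slide above, assuming integer record IDs and using id mod m as a stand-in for a real hash function (an implementation choice, not part of the slides):

```python
# Key-hash partitioning (Z-axis): records with the same key always land
# on the same partition, so each shard owns a disjoint set of IDs.
def partition_for(record_id: int, num_partitions: int) -> int:
    # id mod m stands in for a real hash function here
    return record_id % num_partitions

records = {
    1: "https://abc.com/image1.jpg",
    2: "https://abc.com/image2.jpg",
    3: "https://abc.com/image3.jpg",
    4: "https://abc.com/image4.jpg",
}

shards = {0: [], 1: []}
for rid in records:
    shards[partition_for(rid, 2)].append(rid)

print(shards)  # {0: [2, 4], 1: [1, 3]}: the same split as the slide's table
```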
  22. Scale-out by data-partitioning (Z-axis) • Range based:
      Partition 1: ID 1 | https://abc.com/image1.jpg | 0
                   ID 2 | https://abc.com/image2.jpg | 0
      Partition 2: ID 3 | https://abc.com/image3.jpg | 0
                   ID 4 | https://abc.com/image4.jpg | 0
  23. Scale-out by data-partitioning (Z-axis) • Pros • Increases database performance • Reduces locking contention • Cons • Increases maintenance and infrastructure cost • Hard to automate scaling
  24. Summary • Skip the code-optimization approach • Skip the scale-up approach • Focus on scale-out approaches • We can increase the number of processes/machines to raise the concurrency number • We can split into 2 services: Downloader and Face Detection • We may need data partitioning to optimize database performance
  25. Current approach
  26. High-concurrency problems
  27. Race condition • Cause • The same URL is processed twice or more • Impact • Wasted resources • Data corruption • Fake concurrency (duplicated work instead of extra throughput)
  28. Race condition: how to solve? • Distributed locks • Pros • N/A • Cons • Pessimistic locking hurts performance • Hard to apply because we need to synchronize multiple nodes • Poor fault tolerance • Data sharding • Pros • High performance because the load is shared (physical shards) • Cons • Hard to scale • Increases maintenance & infrastructure cost • Queue/Worker • Pros • Easy to implement • Easy to scale • Good fault tolerance • Reusable communication • Cons • The load concentrates on the queue, so it can become a bottleneck
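The Queue/Worker option above can be sketched with an in-process `queue.Queue` standing in for a real message broker (RabbitMQ, SQS, etc.): because each URL is dequeued exactly once, no two workers race on the same URL.

```python
import queue
import threading

# One shared queue of URLs; the broker delivers each message to a single
# worker, which removes the race condition by construction.
url_queue: queue.Queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker() -> None:
    while True:
        url = url_queue.get()
        if url is None:          # poison pill: stop this worker
            return
        # Placeholder for the real work: download + face detection
        with results_lock:
            results.append(url)

urls = [f"https://abc.com/image{i}.jpg" for i in range(1, 5)]
for u in urls:
    url_queue.put(u)

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()
for _ in workers:
    url_queue.put(None)          # one poison pill per worker
for w in workers:
    w.join()

print(sorted(results))           # every URL processed exactly once
```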
  29. Race condition: root cause • Race conditions only occur between Downloaders => if we find a way to hand each downloader a disjoint set of unique URLs, the race condition is solved for the whole system.
  30. Fault tolerance • Faults • Network fault • Network interruption • IP blocking • Service crash • Problems • Can data be lost? • Can the service restart and continue with the remaining tasks?
  31. Fault tolerance criteria:
      Given                           | When                             | Then
      A service crashed               | It restarted                     | No rework (continue on remaining items only)
      Downloader service is running   | It crashed                       | No downloaded image is lost
      FaceDetector service is running | It crashed                       | No detection result is lost
      Downloader is downloading       | A network error happens          | The Downloader retries the download
      Downloader is downloading       | The network error is IP blocking | Rotate the proxy to change the IP
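The last two rows of the criteria above can be sketched as a retry loop; `download_with_retry`, the error types, and the proxy names are hypothetical placeholders for illustration, not a real client API:

```python
import itertools

# Hypothetical error types a downloader might distinguish.
class NetworkError(Exception): pass
class IpBlockedError(NetworkError): pass

def download_with_retry(url, download, proxies, max_attempts=3):
    """Retry transient faults; rotate the proxy when the IP looks blocked."""
    proxy_cycle = itertools.cycle(proxies)
    proxy = next(proxy_cycle)
    for _ in range(max_attempts):
        try:
            return download(url, proxy)
        except IpBlockedError:
            proxy = next(proxy_cycle)   # IP blocking: switch to the next proxy
        except NetworkError:
            pass                        # transient fault: retry, same proxy
    raise NetworkError(f"gave up on {url} after {max_attempts} attempts")

# Demo with a fake downloader: the first proxy is blocked, the second works.
calls = []
def fake_download(url, proxy):
    calls.append(proxy)
    if proxy == "proxy-a":
        raise IpBlockedError()
    return f"bytes of {url}"

data = download_with_retry("https://abc.com/image1.jpg",
                           fake_download, ["proxy-a", "proxy-b"])
print(calls)   # ['proxy-a', 'proxy-b']: the proxy rotated after the block
```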
  32. Service communication • How do the services communicate? • Do we need a load balancer?
  33. Service communication methods:
      Type         | Method                          | Pros                                                                           | Cons
      Synchronous  | HTTP                            | Familiar and simple to use                                                     | Needs a load balancer; tight coupling; locks a thread waiting for the response
      Synchronous  | RPC                             | Higher performance than HTTP                                                   |
      Asynchronous | Queue messaging (one-to-one)    | High performance; failure isolation; acts as a load balancer; reduced coupling | Extra maintenance cost; queue may become a bottleneck
      Asynchronous | Publish/Subscribe (one-to-many) |                                                                                | Not needed here: we only need one-to-one communication
  34. Summary • Find an approach that distributes unique URLs to the downloaders • The approach should pass the fault tolerance criteria • Use the communication methods table to choose the final solution
  35. High-concurrency approaches
  36. Approach 1: Range-based physical shard
      n: total URLs; m: number of partitions; i ∈ [0..m−1]: partition number
      k = ⌊n/m⌋: number of URLs in a partition
      start_i = k × i
      end_i = start_i + k                 for 0 ≤ i < m − 1
      end_i = start_i + k + (n mod m)     for i = m − 1
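A direct translation of the formula above, as a sketch: partitions are half-open ranges [start, end) of URL ids, and the last partition absorbs the n mod m remainder.

```python
# Range-based sharding: partition i of m owns URL ids [start, end).
def partition_range(n: int, m: int, i: int) -> tuple:
    k = n // m                      # URLs per partition (floor)
    start = k * i
    if i < m - 1:
        end = start + k
    else:
        end = start + k + n % m     # last partition takes the remainder
    return (start, end)

# Example: 10 URLs over 3 partitions
print([partition_range(10, 3, i) for i in range(3)])  # [(0, 3), (3, 6), (6, 10)]
```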
  37. Approach 1: Range-based physical shard • Race condition: solved • Fault tolerance: no rework; the image must be downloaded again if a crash happens during face detection; a partition can be abandoned • Communication: HTTP/gRPC • Pros • Non-locking at the DB level • Cons • Takes time to prepare • Hard to scale out/adjust • Needs a load balancer
  38. Approach 2: Logical shard
      n: total URLs; m: number of processes; id ∈ [0..m−1]: process id
      k = ⌊n/m⌋
      i: processing URL number within a process
      i ∈ [0..k−1]               for id < m − 1
      i ∈ [0..k−1 + (n mod m)]   for id = m − 1
      f(i): the id of the URL to pick => f(i) = id × k + i
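The pick function f(i) above can be sketched as follows (assuming URL ids are 0-based and contiguous; the last process also handles the n mod m leftover):

```python
# Logical sharding: process `pid` of m deterministically picks its own URLs,
# so no two processes ever touch the same URL id.
def urls_for_process(n: int, m: int, pid: int) -> list:
    k = n // m
    count = k if pid < m - 1 else k + n % m     # last process takes the remainder
    return [pid * k + i for i in range(count)]  # f(i) = id * k + i

# 10 URLs over 3 processes: every id 0..9 is picked by exactly one process
print([urls_for_process(10, 3, p) for p in range(3)])
```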
  39. Approach 2: Logical shard • Race condition: solved • Fault tolerance: no rework; the image must be downloaded again if a crash happens during face detection; a partition can be abandoned • Communication: HTTP/gRPC • Pros • Non-locking at the DB level • Simple implementation • Cons • Hard to scale out/adjust • High database throughput • Extra state to maintain: total URLs, current URL id, …
  40. Approach 3: Queue/Worker x Logical sharding
      n: total URLs; m: number of processes; id ∈ [0..m−1]: process id
      k = ⌊n/m⌋
      i: processing URL number within a process
      i ∈ [0..k−1]               for id < m − 1
      i ∈ [0..k−1 + (n mod m)]   for id = m − 1
      f(i): the id of the URL to pick => f(i) = id × k + i
  41. Approach 3: Queue/Worker x Logical sharding • Race condition: solved • Fault tolerance: no rework; failure isolation; a node is replaceable • Communication: messaging • Pros • Easy to scale • Easy fault tolerance • Failure isolation • Asynchronous • Cons • Extra infrastructure • High throughput on the queue
  42. END
  43. Questions • How to measure and debug the services? • What is the deployment process?
  44. Q&A • THANK YOU FOR YOUR ATTENTION
