How to extract valuable information from real-time data feeds

  1. How to extract valuable information from real-time data feeds. Gene Leybzon, February 2016
  2. “The critical challenge is using this data when it is still in motion – and extracting valuable information from it.” - Frédéric Combaneyre, SAS IoT Challenge
  3. Role of Data Streams
     • Detect events of interest and trigger appropriate actions
     • Aggregate information for monitoring
     • Sensor data cleansing and validation
     • Real-time predictive and optimized operations (support for real-time decision making)
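
The first three roles above (event detection, aggregation, and cleansing) can share one piece of running state. The following is a minimal sketch, not taken from the slides: it assumes a single numeric sensor, uses Welford's online algorithm for the running mean and standard deviation, and flags a reading as an event of interest when it is more than `threshold` standard deviations from the history. The names `RunningStats`, `detect_events`, `threshold`, and `warmup` are illustrative.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until at least two readings exist.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_events(readings, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running baseline."""
    stats = RunningStats()
    for i, x in enumerate(readings):
        if stats.n >= warmup and stats.std > 0:
            if abs(x - stats.mean) / stats.std > threshold:
                yield i, x
                continue  # keep the outlier out of the baseline (cleansing)
        stats.update(x)
```

For example, `list(detect_events([20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 35.0, 20.0]))` flags only the 35.0 reading. Excluding flagged values from the baseline is a design choice; a stream whose level genuinely shifts would need a windowed or decaying baseline instead.
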
  4. Platforms
  5. Google Cloud Platform
  6. AWS IoT Initiative
  7. SAS
  8. Role of “Pipelines”
     • Transform data: convert the data into another format, for example, converting a captured device signal voltage to a calibrated unit measure of temperature.
     • Aggregate and compute data: by combining data you can add checks, such as averaging data across multiple devices to avoid acting on a single spurious device, or ensuring you still have actionable data if a single device goes offline. By adding computation to your pipeline, you can apply streaming analytics to data while it is still in the processing pipeline.
     • Enrich data: you can combine the device-generated data with other metadata about the device, or with other datasets, such as weather or traffic data, for use in subsequent analysis.
     • Move data: you can store the processed data in one or more final storage locations.
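
The four pipeline roles above can be sketched as chained Python generators. This is an illustrative sketch only: the linear voltage-to-temperature calibration, the `OUTDOOR_TEMP_C` lookup, and every function name are assumptions made for the example, and the aggregation step treats its input as one already-collected window of events.

```python
# Hypothetical external dataset used for enrichment (e.g. a weather feed).
OUTDOOR_TEMP_C = {"plant-a": 5.0}


def transform(raw_events):
    """Transform: convert raw voltage to degrees Celsius.

    Assumes a linear calibration of 100 degC per volt with a 0.5 V offset.
    """
    for e in raw_events:
        yield {"device": e["device"], "temp_c": (e["volts"] - 0.5) * 100.0}


def aggregate(readings):
    """Aggregate: average one window across devices.

    Averaging keeps a single spurious device from driving decisions and
    still produces a value if one device is missing from the window.
    """
    temps = [r["temp_c"] for r in readings]
    if temps:
        yield {"avg_temp_c": sum(temps) / len(temps), "devices": len(temps)}


def enrich(records, site="plant-a"):
    """Enrich: attach data from another dataset to each record."""
    for r in records:
        yield {**r, "site": site, "outdoor_temp_c": OUTDOOR_TEMP_C.get(site)}


def move(records, sink):
    """Move: write processed records to a final storage location (a list here)."""
    for r in records:
        sink.append(r)
    return sink


window = [
    {"device": "dev-1", "volts": 0.70},
    {"device": "dev-2", "volts": 0.74},
    {"device": "dev-3", "volts": 0.72},
]
store = move(enrich(aggregate(transform(window))), [])
```

Because each stage is a generator, records flow through one at a time and a stage can be swapped or reordered without touching the others.
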
  9. Architecture
  10. What do we want from a stream architecture?
      • Fault tolerance against hardware failures and human errors
      • Support for a variety of use cases that include low-latency querying as well as updates
      • Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
      • Extensibility, so that the system is manageable and can accommodate newer features easily
      • Consistency: data is the same across the cluster
      • Availability: ability to access the cluster even if a node in the cluster goes down
      • Partition tolerance: the cluster continues to function even if there is a "partition" (communications break) between two nodes
  11. CAP Theorem
      “It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
      • Consistency (all nodes see the same data at the same time)
      • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
      • Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)”
  12. Facing the CAP Theorem (diagram; axes: Consistency, Availability, Partition Tolerance; labels: ∅, Cassandra, Riak, Couchbase, MongoDB, λ, Paxos, Zab, Raft)
  13. λ-Architecture
  14. Limitations of the λ-Architecture
      • One-way data flow (it doesn’t transact and make per-event decisions on the streaming data, nor does it respond immediately to the events coming in)
      • Eventual consistency
      • NoSQL
      • Complexity
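
To make the one-way flow and the complexity concrete, here is a toy sketch of the λ-architecture as it is commonly described: an append-only master dataset, a batch view recomputed from scratch, a real-time view updated incrementally, and queries that merge the two. The class and method names are illustrative, and a per-key counter stands in for whatever view a real system would compute.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []                # append-only master dataset
        self.batch_view = Counter()     # rebuilt from scratch by the batch layer
        self.realtime_view = Counter()  # covers events since the last batch run

    def ingest(self, key):
        # Events only flow inward; nothing is decided or answered per event.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the whole view, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        # Serving layer: merge the batch and real-time views at read time.
        return self.batch_view[key] + self.realtime_view[key]


lc = LambdaCounter()
for key in ["a", "a", "b"]:
    lc.ingest(key)
before_batch = lc.query("a")
lc.run_batch()
lc.ingest("a")
after_batch = lc.query("a")
```

Even in this toy, the counting logic exists twice (incrementally in `ingest`, from scratch in `run_batch`), which is the source of the complexity listed above.
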
  15. Out-of-the-Box Solutions
  16. Druid Project
      • Designed for low latency
      • Open-sourced in 2012
      • Long history of data
      • Scale: more than 500K events/sec on average
  17. Druid data store
  18. Apache Samza
      • Distributed stream processing framework
      • Simple API
      • Fault tolerance
      • Manages stream state
      • Guarantees that messages are processed in the order they were written to a partition, and that no messages are ever lost
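
The per-partition ordering guarantee above can be illustrated without Samza itself. The sketch below is a conceptual model, not the Samza API: it assumes messages are routed to a partition by a stable hash of their key and that a consumer tracks its position with an offset, so messages for one key are read back in the order they were written. `PartitionedLog` and `Consumer` are hypothetical names.

```python
import zlib


class PartitionedLog:
    """Append-only log split into partitions; order is kept per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash sends every message with the same key to the
        # same partition, which is what preserves per-key ordering.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p


class Consumer:
    """Reads one partition in order, remembering how far it has read."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.offset = 0  # a real system would checkpoint this durably

    def poll(self):
        messages = self.log.partitions[self.partition][self.offset:]
        self.offset += len(messages)
        return messages


log = PartitionedLog(num_partitions=4)
partition = log.append("sensor-7", 1)
log.append("sensor-7", 2)
log.append("sensor-7", 3)
consumer = Consumer(log, partition)
values = [value for _, value in consumer.poll()]
```

Restarting a consumer from its last checkpointed offset is what lets such a design avoid losing messages after a failure; ordering across different partitions is not guaranteed.
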
  19. Apache Samza
  20. Samza Architecture
  21. VoltDB
  22. Stream Databases and Pipelines Building Blocks
  23. PipelineDB (example of usage)
  24. AWS Kinesis
  25. Apache Cassandra
      • Decentralized (every node in the cluster has the same role)
      • No single point of failure
      • Scalable
      • Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications
      • Fault-tolerant
      • Tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable"
      • Hadoop integration, integration with MapReduce
      • Query language
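
Tunable consistency is usually reasoned about with a simple overlap rule: with N replicas, a read that waits for R replicas is guaranteed to see the latest acknowledged write that waited for W replicas whenever R + W > N. The sketch below only does that arithmetic; it does not talk to Cassandra. `ONE`, `QUORUM`, and `ALL` are Cassandra consistency level names, while the helper functions are made up for the example.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


# With 3 replicas: QUORUM/QUORUM overlaps, ONE/ONE does not.
strong = reads_see_latest_write("QUORUM", "QUORUM", 3)
weak = reads_see_latest_write("ONE", "ONE", 3)
```

Lower levels trade this guarantee for latency and availability, which is the same consistency-versus-availability tension the CAP slides describe.
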
  26. Apache Flink
      • High performance
      • Low latency
      • Support for out-of-order events
      • Flexible streaming windows
      • Fault tolerance
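
Out-of-order support and windowing are typically built on event time and watermarks. The following is a conceptual sketch in plain Python, not the Flink API: it assumes integer event timestamps, tumbling windows of a fixed `size`, and a watermark that trails the largest timestamp seen by `max_out_of_order`. A window is emitted once the watermark passes its end, and events arriving for an already emitted window are dropped.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window, tolerating bounded disorder."""

    def __init__(self, size, max_out_of_order):
        self.size = size
        self.max_out_of_order = max_out_of_order
        self.watermark = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> event count

    def process(self, event_time):
        """Add one event; return the (window_start, count) pairs that closed."""
        start = event_time - event_time % self.size
        if start + self.size <= self.watermark:
            return []  # too late: this window was already emitted
        self.open_windows[start] += 1
        self.watermark = max(self.watermark, event_time - self.max_out_of_order)
        closed = sorted(
            s for s in self.open_windows if s + self.size <= self.watermark
        )
        return [(s, self.open_windows.pop(s)) for s in closed]


counter = TumblingWindowCounter(size=10, max_out_of_order=5)
# Event 8 arrives after 12 but is still counted; event 3 arrives too late.
results = [counter.process(t) for t in [1, 12, 8, 16, 3, 27]]
```

A larger `max_out_of_order` accepts more disorder at the cost of emitting each window later, which is the latency trade-off behind any watermark-based design.
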
  27. Stream Processing Algorithms
  28. Popular Stream Algorithms
      • Finding frequent items
      • Estimating the number of distinct items
      • Statistics
      • Finding “signal”
      • Error correction
      • Filtering
      • Anomaly detection
      • Incremental learning
      • Data clustering
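
As one example from the list above, finding frequent items can be done in a single pass with bounded memory. The slides do not name an algorithm; the sketch below uses the Misra-Gries summary, which keeps at most `k - 1` counters and guarantees that any item occurring more than n/k times in a stream of length n is still present at the end. The stored counts are lower bounds, not exact frequencies.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# "a" occurs 6 times out of 10, so it must survive with k = 2.
majority = misra_gries("abacadaeaa", k=2)
```

A second pass over the data (or an exact counter for the surviving candidates) is needed if true frequencies are required, since items below the n/k threshold can also remain in the summary.
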
  29. Machine Learning from Stream Data
  30. How is ML from stream data different from traditional ML techniques?
      • Takes recent history into account
      • The ML model is updatable (“evolves” as new data comes in)
  31. Two Approaches to Adapt ML to Stream Data
      • Incremental algorithms (both support vector machines and neural networks can work incrementally)
      • Periodic retraining with a new data batch
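
The incremental approach can be shown with the simplest possible model. This is a minimal sketch, not a method from the slides: a linear regression updated by one stochastic gradient step per arriving example, so the model evolves without storing or revisiting past data. `OnlineLinearRegression`, `learn_one`, and the learning rate are illustrative choices.

```python
class OnlineLinearRegression:
    """Linear model trained one example at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def learn_one(self, x, y):
        """Update the model with a single (x, y) pair; return the error before the update."""
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error
        return error


model = OnlineLinearRegression(n_features=1)
# Simulated stream drawn from y = 2x + 1, with no noise.
for _ in range(2000):
    for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
        model.learn_one([x], 2.0 * x + 1.0)
```

With a constant learning rate, recent examples always influence the weights, so the model follows a drifting relationship. Periodic retraining, the second approach, would instead refit from scratch on a recent batch.
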
  32. Questions?

Editor's notes

  • https://aws.amazon.com/iot/how-it-works/#shadows
  • https://en.wikipedia.org/wiki/CAP_theorem
  • http://www.slideshare.net/gakhov/bbuzz-overview-part1
  • http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
    https://www.mapr.com/developercentral/lambda-architecture
  • http://radar.oreilly.com/2015/02/improving-on-the-lambda-architecture-for-streaming-analysis.html
  • https://en.wikipedia.org/wiki/Druid_(open-source_data_store)
  • https://github.com/pipelinedb/pipelinedb
  • https://flink.apache.org/features.html
    https://flink.apache.org/
  • Considerations:
    Data Horizon
    Data Obsolescence