Insight Recent Demo

Crowd Detector
Reza Asad
Insight Data Engineering June 2015

Motivation
● Avoid waiting time in crowded areas.

Data
● Let's imagine we had data about people's locations.
● This could be collected from people's cell phones.
● How can we use such data?

Naive Approach

Demo

Data
● But such data is not available to me ...
● Solution: engineer the data!
● Take data from Yelp
● Perform a random walk
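
As a rough illustration of the random-walk idea, the sketch below generates a short path of simulated positions around one starting coordinate. The function name, the step size, and the starting point are all hypothetical; the slides do not say how the walk was parameterized or which Yelp fields were used.

```python
import random


def random_walk(lat, lon, steps, step_size=0.0005, seed=None):
    """Return a list of (lat, lon) positions starting at the given point.

    step_size is an assumed maximum per-step offset in degrees.
    """
    rng = random.Random(seed)
    path = [(lat, lon)]
    for _ in range(steps):
        # Move by a small uniform offset in each coordinate.
        lat += rng.uniform(-step_size, step_size)
        lon += rng.uniform(-step_size, step_size)
        path.append((lat, lon))
    return path


# Hypothetical starting point: a business location in San Francisco.
path = random_walk(37.7749, -122.4194, steps=5, seed=42)
for position in path:
    print(position)
```

Running one such walk per simulated person, each seeded at a different business location, would give a stream of location messages to feed into the pipeline.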

Pipeline
Data

Engineering Challenges
● Choosing K?

Engineering Challenges
● The area of SF: 46.87 mi²
● For the purpose of this project each cluster is 0.09 mi²
● This means k is roughly 500
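
The estimate of k follows from dividing the city area by the per-cluster area. A minimal check of that arithmetic, with variable names chosen here for illustration, and assuming square cells for the side-length figure:

```python
import math

SF_AREA_SQ_MI = 46.87        # area of San Francisco, from the slide
CLUSTER_AREA_SQ_MI = 0.09    # target area per cluster, from the slide

# Number of clusters needed to tile the city at that granularity.
k_estimate = SF_AREA_SQ_MI / CLUSTER_AREA_SQ_MI

# If each cluster were a square cell (an assumption), this is its side length.
cell_side_mi = math.sqrt(CLUSTER_AREA_SQ_MI)

print(round(k_estimate))       # 521
print(round(cell_side_mi, 2))  # 0.3
```

The exact quotient is about 521, which the slides round to 500.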

Engineering Challenges
● Parameters to tune:
– Time it takes to produce the messages
– Processing time for k-means in Spark Streaming
– The update interval for a fixed data point in the database

Goal
● Tune the parameters in order to have a stable system
● The total delay after processing each batch must be constant and comparable to the batch interval.
● You can check this in the Spark API
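
One way to turn the stability goal into a check is sketched below: given the total delay recorded for a series of batches, it tests that the delays stay close to the batch interval and do not trend upward. The function name and both thresholds (`slack`, `max_growth`) are assumptions made for this sketch, not values from the talk.

```python
def is_stable(total_delays, batch_interval, slack=1.5, max_growth=0.1):
    """Heuristic stability check on per-batch total delays (in seconds).

    bounded:     no delay exceeds slack * batch_interval
    not_growing: the mean of the later half is at most max_growth (10%)
                 above the mean of the earlier half
    """
    if len(total_delays) < 2:
        raise ValueError("need at least two batches")
    half = len(total_delays) // 2
    early = sum(total_delays[:half]) / half
    late = sum(total_delays[half:]) / (len(total_delays) - half)
    bounded = max(total_delays) <= slack * batch_interval
    not_growing = late <= early * (1 + max_growth)
    return bounded and not_growing


# Hypothetical measurements for a 2-second batch interval.
steady = [1.9, 2.1, 2.0, 2.2, 2.0, 2.1]
backlog = [2.0, 3.0, 4.5, 6.0, 8.0, 10.0]

print(is_stable(steady, batch_interval=2.0))   # True
print(is_stable(backlog, batch_interval=2.0))  # False
```

A steadily growing delay, as in the second list, means batches are queuing up faster than they are processed, which is the unstable case the tuning is meant to avoid.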

Tackling Challenges
● Having multiple producers and consumers ✔
● Kafka is fast with sending messages and is not the bottleneck
● Establishing some safe limits:
– Using spark.streaming.receiver.maxRate to control the input rate ✔
– Understanding the complexity of the process in Spark Streaming ✔
– Choosing the right batch interval ✔
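
`spark.streaming.receiver.maxRate` caps the number of records per second that each receiver accepts. A minimal sketch of deriving a cap from a measured processing capacity follows; the helper name, the capacity figure, the receiver count, and the 80% headroom are all illustrative assumptions rather than the values used in the project.

```python
def safe_max_rate(capacity_records_per_sec, num_receivers, headroom_pct=80):
    """Per-receiver rate cap that keeps total input below measured capacity.

    capacity_records_per_sec: records per second the job was observed to
        process without the delay growing (assumed to be measured beforehand).
    headroom_pct: share of that capacity to actually use, as a percentage.
    """
    if num_receivers < 1:
        raise ValueError("num_receivers must be at least 1")
    return capacity_records_per_sec * headroom_pct // (100 * num_receivers)


# Hypothetical numbers: 10,000 records/s capacity shared by 4 receivers.
rate = safe_max_rate(10000, num_receivers=4)

# The value can be passed to spark-submit as a configuration flag.
conf_flag = f"--conf spark.streaming.receiver.maxRate={rate}"
print(conf_flag)  # --conf spark.streaming.receiver.maxRate=2000
```

Leaving some headroom below the measured capacity gives the job room to absorb slow batches without the delay building up.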

Raw Data

Data Process
● Data filtering in Spark Streaming

Data Process

About Me
● Long time ago - B.S. in pure math, University of Toronto
● More recent - M.S. in applied math, University of British Columbia
● The exciting now - A data engineer who wants to go camping with other data engineers

Más contenido relacionado

La actualidad más candente

For over eight years, the Sensu community has been using Sensu to monitor their applications and infrastructure at scale. Sensu Go became generally available at the beginning of this year, and was designed to be more portable, easier and faster to deploy, and most importantly: more scalable than ever before! In this talk, Sensu CTO Sean Porter will share Sensu Go scaling patterns, best practices, and case studies. He’ll also explain our design and architectural choices and talk about our plan to take things even further.

Keynote: Scaling Sensu Go

Keynote: Scaling Sensu Go

Keynote: Scaling Sensu Go

Kiwi.com Reaches Cruising Altitude with Scylla

Kiwi.com Reaches Cruising Altitude with Scylla

Kiwi.com Reaches Cruising Altitude with Scylla

In this webinar, learn how a long-time Industrial IT Consultant helps his customer make the leap into providing visibility of their processes to everyone in the plant. This journey led to the discovery of untapped opportunity to improve operations, reduce energy consumption, and minimize plant downtime. The collection of data from the individual sensors has led to powerful Grafana dashboards shared across the organization.

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

Time Series data at Capital One consists of Infrastructure, Application, and Business Process Metrics. The combination of these metrics are what the internal stakeholders rely on for observability which allows them to deliver better service and uptime for their customers, so protecting this critical data with a proven and tested recovery plan is not a “nice to have” but a “must have.” In this talk, the members of IT staff, Saravanan Krisharaju, Rajeev Tomer, and Karl Daman will share how they built a fault-tolerant solution based on InfluxEnterprise and AWS that collects and stores metrics and events. They added to this, Machine Learning, which uses the collected time series to model predictions which are then brought back into InfluxDB time series database for real-time access. This Capital One team shares the journey they took to architect and build this solution as well as plan and execute on their disaster recovery plan.

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Distributed tracing is used to analyze performance and error cases in service oriented architectures. The Observability team at Airbnb recently created Upshot, a data pipeline that uses Flink to analyze over 40 million trace events per minute. Summaries of the resulting data are sent to Druid, Datadog, and other downstream datastores. This talk will focus on how we use Flink and how we analyzed and addressed scaling issues we encountered while building Upshot.

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

Session 03 data_migration_at_scale_by_sameer

Session 03 data_migration_at_scale_by_sameer

Session 03 data_migration_at_scale_by_sameer

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

NodeTime Tool Review

NodeTime Tool Review

NodeTime Tool Review

Apache Beam lets you write data pipelines over unbounded, out-of-order, global-scale data that are portable across diverse backends including Apache Flink, Apache Apex, Apache Spark, and Google Cloud Dataflow. But not all use cases are pipelines of simple "map" and "combine" operations. Beam's new State API adds scalability and consistency to fine-grained stateful processing, all with Beam's usual portability. Examples of new use cases unlocked include: * Microservice-like streaming applications * Aggregations that aren't natural/efficient as an associative combiner * Fine control over retrieval and storage of intermediate values during aggregation * Output based on customized conditions, such as limiting to only "significant" changes in a learned model (resulting in potentially large cost savings in subsequent processing) This talk will introduce the new state and timer features in Beam and show how to use them to express common real-world use cases in a backend-agnostic manner.

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Slack in the Age of Prometheus

Slack in the Age of Prometheus

Slack in the Age of Prometheus

Golang is a lightweight, new open-source language which has several features that make automated and manual testing easier. Due to feature-rich standard library support, it provides a desirable environment for running and writing tests. Go describes a way to write automated tests that are automatically excluded from the compiled executable. Thus this test suite runs at the development time. It also displays which lines were exercised by tests, and which were not and provides complete code coverage analysis.

Golang testing

GoWitek Consulting Pvt.Ltd

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Sumanth Pasupuleti

Html5 devconf nodejs_devops_shubhra

Html5 devconf nodejs_devops_shubhra

Html5 devconf nodejs_devops_shubhra

Lambda - Building On-prem GPU Training Infrastructure

Lambda - Building On-prem GPU Training Infrastructure

Lambda - Building On-prem GPU Training Infrastructure

Stephen Balaban

High-throughput DNA sequencing is a key data acquisition technology which enables dozens of important applications, from oncology to personalized diagnostics. We extended work presented last year to port additional portions of the standard genomics data processing pipeline to Flink. Our Flink-based processor consists of two distinct specialized modules (reader and writer) that are loosely linked via Kafka streams, thus allowing for easy composability and integration into already existing Hadoop workflows. To extend our work we had to manage the dynamical creation and detection of the data streams: the set of output files is not known in advance by the writer, which learns it at running time. Particular care had to be taken to handle the finite nature of the genomic streams: since we use some already existing Hadoop output formats, we had to properly handle the flow of end-of-streams markers through Flink and Kafka, in order to have the final output files correctly finalized.

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered and written out in real-time with an ease matching that of batch processing. However the real challenge of matching the prowess of batch ETL remains in doing joins, in maintaining state and to have the data be paused or rested dynamically. Netflix has a microservices architecture. Different microservices serve and record different kind of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when we want to combine the events coming from one high-traffic microservice to another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations. Historically we have done this joining of large volume data-sets in batch. However we asked ourselves if the data is being generated in real-time, why must it not be processed downstream in real time? Why wait a full day to get information from an event that was generated a few mins ago? In this talk, we will share how we solved a complex join of two high-volume event streams using Flink. We will talk about maintaining large state, fault tolerance of a stateful application and strategies for failure recovery.

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

In 2016, we introduced Alibaba’s compute engine Blink which was based on our private branch of flink. It enalbed many large scale applications in Alibaba’s core business, such as search, recommendation and ads. With the deep and close colaboration with the flink community, we are finally close to contribute our improvements back to the flink community. In this talk, we will present our key contributions to flink runtime recently, such as the new YARN cluster mode for Flip-6, fine-grained failover for Flip-1, async i/o for Flip-12, incremental checkpoint, and the further improvements plan from Alibaba in the near future. Moreover, we will show some production use cases to illustrate how flink works in Alibaba’s large scale online applications, which includes real-time ETL as well as online machine learning. This talk is presented by Alibaba.

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

La actualidad más candente (18)

Keynote: Scaling Sensu Go

Keynote: Scaling Sensu Go

Keynote: Scaling Sensu Go

Kiwi.com Reaches Cruising Altitude with Scylla

Kiwi.com Reaches Cruising Altitude with Scylla

Kiwi.com Reaches Cruising Altitude with Scylla

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

How Sensor Data Can Help Manufacturers Gain Insight to Reduce Waste, Energy C...

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Why Architecting for Disaster Recovery is Important for Your Time Series Data...

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

PEARC17: Evaluation of Intel Omni-Path on the Intel Knights Landing Processor

Session 03 data_migration_at_scale_by_sameer

Session 03 data_migration_at_scale_by_sameer

Session 03 data_migration_at_scale_by_sameer

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

Streaming Sensor Data with Grafana and InfluxDB | Ryan Mckinley | Grafana

NodeTime Tool Review

NodeTime Tool Review

NodeTime Tool Review

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview

Slack in the Age of Prometheus

Slack in the Age of Prometheus

Slack in the Age of Prometheus

Golang testing

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Cassandra Meetup Nov 2019 - Cassandra Resiliency

Html5 devconf nodejs_devops_shubhra

Html5 devconf nodejs_devops_shubhra

Html5 devconf nodejs_devops_shubhra

Lambda - Building On-prem GPU Training Infrastructure

Lambda - Building On-prem GPU Training Infrastructure

Lambda - Building On-prem GPU Training Infrastructure

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Flink Forward Berlin 2017: Francesco Versaci - Integrating Flink and Kafka in...

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...

Destacado

Sobre el autor

IAF training.PDF

IAF training.PDF

IAF training.PDF

KAMARUL HIDAYAT WAHID

electronic-structure_of_aluminum_nitride_-_theory

electronic-structure_of_aluminum_nitride_-_theory

electronic-structure_of_aluminum_nitride_-_theory

Stephen Loughin

Naim Ahmed

2.5. rúbrica de evaluación individual tutores en red intef

2.5. rúbrica de evaluación individual tutores en red intef

2.5. rúbrica de evaluación individual tutores en red intef

How To Capitalize On Opportunities While Minimizing Risk

How To Capitalize On Opportunities While Minimizing Risk

How To Capitalize On Opportunities While Minimizing Risk

Administracion por objetivos 1 yohana

Administracion por objetivos 1 yohana

Administracion por objetivos 1 yohana

Success story: Kiran Mazumdar

Success story: Kiran Mazumdar

Success story: Kiran Mazumdar

Historia del internet en el salvador

Historia del internet en el salvador

Historia del internet en el salvador

Gabriel Flamenco

Destacado (9)

Sobre el autor

IAF training.PDF

IAF training.PDF

IAF training.PDF

electronic-structure_of_aluminum_nitride_-_theory

electronic-structure_of_aluminum_nitride_-_theory

electronic-structure_of_aluminum_nitride_-_theory

Naim Ahmed

2.5. rúbrica de evaluación individual tutores en red intef

2.5. rúbrica de evaluación individual tutores en red intef

2.5. rúbrica de evaluación individual tutores en red intef

How To Capitalize On Opportunities While Minimizing Risk

How To Capitalize On Opportunities While Minimizing Risk

How To Capitalize On Opportunities While Minimizing Risk

Administracion por objetivos 1 yohana

Administracion por objetivos 1 yohana

Administracion por objetivos 1 yohana

Success story: Kiran Mazumdar

Success story: Kiran Mazumdar

Success story: Kiran Mazumdar

Historia del internet en el salvador

Historia del internet en el salvador

Historia del internet en el salvador

Similar a Insight Recent Demo

Vast volume of our processed data is Time Series data and once you start working with distributed systems, you start tackling many scale and performance problems: How to handle missing data?Should I handle both serving and backed process or separating them out? Best Performance for Money? In the talk we will tell the tale of all of the transformations we’ve made to our data model@Windward, some of the problems we’ve handled, review the multiple data persistency layers like: S3, MongoDB, Apache Cassandra, MySQL. And I’ll try my best NOT to answer the question “Which one of them is the Best?"

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

Spark RDDs are almost identical to Scala collection, just in a distributed manner, all of the transformations and actions are derived from the Scala collections API. As Martin Odersky mentioned, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power comes a great many data problems: at first you’ll start tackling the concept of partitioning, then the actual data becomes the next thing to worry about. In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL). When you start tackling many scale and performance problems, many questions arise: > How to handle missing data? > Should the system handle both serving and backend processes, or should we separate them out? > Which solution is cheaper? > How do we get the best performance for money spent? In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!

Scala like distributed collections - dumping time-series data with apache spark

Scala like distributed collections - dumping time-series data with apache spark

Scala like distributed collections - dumping time-series data with apache spark

Building real time Data Pipeline using Spark Streaming

Building real time Data Pipeline using Spark Streaming

Building real time Data Pipeline using Spark Streaming

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...

Codemotion Tel Aviv

Performance Characterization and Optimization of In-Memory Data Analytics on ...

Performance Characterization and Optimization of In-Memory Data Analytics on ...

Performance Characterization and Optimization of In-Memory Data Analytics on ...

Ahsan Javed Awan

Near Data Computing Architectures: Opportunities and Challenges for Apache Spark

Near Data Computing Architectures: Opportunities and Challenges for Apache Spark

Near Data Computing Architectures: Opportunities and Challenges for Apache Spark

Ahsan Javed Awan

Scale-out big data processing frameworks like Apache Spark have been designed to use on off the shelf commodity machines where each machine has the modest amount of compute , memory and storage capacity. Recent advancement in the hardware technology motivates understanding Spark performance on novel hardware architectures. Our earlier work has shown that the performance of Spark based data analytics is bounded by the frequent accesses to the DRAM. In this talk, we argue in favor of Near Data Computing Architectures that enable processing the data where it resides (e.g Smart SSDs and Compute Memories) for Apache Spark. We envision a programmable logic based hybrid near-memory and near-storage compute architecture for Apache Spark. Furthermore we discuss the challenges involved to achieve 10x performance gain for Apache Spark on NDC architectures.

Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...

Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...

Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...

NetflixOSS Meetup season 3 episode 1

NetflixOSS Meetup season 3 episode 1

NetflixOSS Meetup season 3 episode 1

Ruslan Meshenberg

As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...

At Qubole, users run Spark at scale on cloud (900+ concurrent nodes). At such scale, for efficiently running SLA critical jobs, tuning Spark configurations is essential. But it continues to be a difficult undertaking, largely driven by trial and error. In this talk, we will address the problem of auto-tuning SQL workloads on Spark. The same technique can also be adapted for non-SQL Spark workloads. In our earlier work[1], we proposed a model based on simple rules and insights. It was simple yet effective at optimizing queries and finding the right instance types to run queries. However, with respect to auto tuning Spark configurations we saw scope of improvement. On exploration, we found previous works addressing auto-tuning using Machine learning techniques. One major drawback of the simple model[1] is that it cannot use multiple runs of query for improving recommendation, whereas the major drawback with Machine Learning techniques is that it lacks domain specific knowledge. Hence, we decided to combine both techniques. Our auto-tuner interacts with both models to arrive at good configurations. Once user selects a query to auto tune, the next configuration is computed from models and the query is run with it. Metrics from event log of the run is fed back to models to obtain next configuration. Auto-tuner will continue exploring good configurations until it meets the fixed budget specified by the user. We found that in practice, this method gives much better configurations compared to configurations chosen even by experts on real workload and converges soon to optimal configuration. In this talk, we will present a novel ML model technique and the way it was combined with our earlier approach. Results on real workload will be presented along with limitations and challenges in productionizing them. [1] Margoor et al,'Automatic Tuning of SQL-on-Hadoop Engines' 2018,IEEE CLOUD

Auto-Pilot for Apache Spark Using Machine Learning

Auto-Pilot for Apache Spark Using Machine Learning

Auto-Pilot for Apache Spark Using Machine Learning

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) Ease of development for the team (already familiar with spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from spark batch jobs, and 4) Spark support from infrastructure teams within the company. In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited, to the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink , the impact that we had on our customers, and most importantly, the challenges we faced. Take-aways for the audience: 1) A great example of stream processing large, personalization datasets at scale. 2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...

Seattle Spark Meetup Mobius CSharp API

Seattle Spark Meetup Mobius CSharp API

Seattle Spark Meetup Mobius CSharp API

Debugging data pipelines @OLA by Karan Kumar

Debugging data pipelines @OLA by Karan Kumar

Debugging data pipelines @OLA by Karan Kumar

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Profiling & Testing with Spark

Profiling & Testing with Spark

Profiling & Testing with Spark

Roger Rafanell Mas

LCU14 310- Cisco ODP --------------------------------------------------- Speaker: Robbie King Date: September 17, 2014 --------------------------------------------------- ★ Session Summary ★ Cisco to present their experience using ODP to provide portable accelerated access to crypto functions on various SoCs. --------------------------------------------------- ★ Resources ★ Zerista: http://lcu14.zerista.com/event/member/137757 Google Event: https://plus.google.com/u/0/events/ckmld1hll5jjijq11frbqmptet8 Video: https://www.youtube.com/watch?v=eFlTmslVK-Y&list=UUIVqQKxCyQLJS6xvSmfndLA Etherpad: http://pad.linaro.org/p/lcu14-310 --------------------------------------------------- ★ Event Details ★ Linaro Connect USA - #LCU14 September 15-19th, 2014 Hyatt Regency San Francisco Airport --------------------------------------------------- http://www.linaro.org http://connect.linaro.org

LCU14 310- Cisco ODP v2

LCU14 310- Cisco ODP v2

LCU14 310- Cisco ODP v2

Developing high frequency indicators using real time tick data on apache supe...

Developing high frequency indicators using real time tick data on apache supe...

Developing high frequency indicators using real time tick data on apache supe...

Zekeriya Besiroglu

As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK

Getting the best performance out of your Java applications can often be a challenge due to the managed environment nature of the Java Virtual Machine and the non-deterministic behaviour that this introduces. Automatic garbage collection (GC) can seriously affect the ability to hit SLAs for the 99th percentile and above. This session will start by looking at what we mean by speed and how the JVM, whilst extremely powerful, means we don’t always get the performance characteristics we want. We’ll then move on to discuss some critical features and tools that address these issues, i.e. garbage collection, JIT compilers, etc. At the end of the session, attendees will have a clear understanding of the challenges and solutions for low-latency Java.

Get Lower Latency and Higher Throughput for Java Applications

Get Lower Latency and Higher Throughput for Java Applications

Get Lower Latency and Higher Throughput for Java Applications

Similar a Insight Recent Demo (20)

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...

Scala like distributed collections - dumping time-series data with apache spark

Building real time Data Pipeline using Spark Streaming

S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...

Performance Characterization and Optimization of In-Memory Data Analytics on ...

Near Data Computing Architectures: Opportunities and Challenges for Apache Spark

Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...

NetflixOSS Meetup season 3 episode 1

Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...

Auto-Pilot for Apache Spark Using Machine Learning

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...

Seattle Spark Meetup Mobius CSharp API

Debugging data pipelines @OLA by Karan Kumar

Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber

Profiling & Testing with Spark

LCU14 310- Cisco ODP v2

Developing high frequency indicators using real time tick data on apache supe...

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK

Get Lower Latency and Higher Throughput for Java Applications

Insight Recent Demo

1. Crowd Detector
Reza Asad
Insight Data Engineering, June 2015

2. Motivation ● Avoid waiting time in crowded areas.

3. Data
● Let's imagine we had data about people's locations.
● This could be collected from people's cell phones.
● How can we use such data?

4. Naive Approach

6. Data
● But such data is not available to me ...
● Solution: engineer the data!
● Take location data from Yelp
● Perform a random walk from those locations
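The slide does not say how the random walk was implemented, so the following is only a minimal sketch of the idea: start from a known coordinate (for example a business location taken from Yelp) and repeatedly apply a small random offset to simulate a person moving. The function name `random_walk`, the step size `step_deg`, and the starting coordinate are assumptions made for illustration, not details from the talk.

```python
import random


def random_walk(start_lat, start_lon, steps, step_deg=0.0005, seed=None):
    """Return a list of (lat, lon) points produced by a simple random walk.

    Each step moves the point by at most step_deg degrees in each direction.
    The step size is a hypothetical value chosen for illustration.
    """
    rng = random.Random(seed)
    lat, lon = start_lat, start_lon
    path = [(lat, lon)]
    for _ in range(steps):
        lat += rng.uniform(-step_deg, step_deg)
        lon += rng.uniform(-step_deg, step_deg)
        path.append((lat, lon))
    return path


# Hypothetical starting point, roughly in downtown San Francisco.
walk = random_walk(37.7749, -122.4194, steps=10, seed=42)
```

Passing a fixed `seed` makes the simulated path reproducible, which helps when replaying the same synthetic stream through the pipeline.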

7. Pipeline Data

8. Engineering Challenges ● Choosing K?

9. Engineering Challenges
● The area of SF: 46.87 mi²
● For the purpose of this project each cluster is 0.09 mi²
● This means k is roughly 500
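The estimate of k follows from dividing the total area by the area of one cluster. A short check of that arithmetic, using only the two figures from the slide (the variable names are illustrative):

```python
SF_AREA_SQ_MI = 46.87        # area of San Francisco, from the slide
CLUSTER_AREA_SQ_MI = 0.09    # target area per cluster, from the slide

# Number of clusters needed to tile the city at this cluster size.
k_estimate = SF_AREA_SQ_MI / CLUSTER_AREA_SQ_MI

print(round(k_estimate))  # prints 521
```

The exact quotient is about 521, which the slide rounds to roughly 500.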

10. Engineering Challenges
● Parameters to tune:
– Time it takes to produce the messages
– Processing time for k-means in Spark Streaming
– The update interval for a fixed data point in the database

11. Goal
● Tune the parameters in order to have a stable system
● The total delay after processing each batch must be constant and comparable to the batch interval.
● You can check this in the Spark API
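To see why the total delay must stay constant, it helps to model batches as a queue: if a batch takes longer to process than the batch interval, the next batch has to wait, and that wait accumulates. The sketch below is a simplified model written for this note, assuming one batch is processed at a time; it does not call Spark, and the function name and sample numbers are hypothetical.

```python
def total_delays(processing_times, batch_interval):
    """Simulate the total delay (scheduling delay + processing time) per batch.

    Assumes batches arrive every batch_interval seconds and are processed
    one at a time, in order.
    """
    delays = []
    scheduling_delay = 0.0
    for processing in processing_times:
        delays.append(scheduling_delay + processing)
        # Whatever exceeds the batch interval is carried over as waiting time.
        scheduling_delay = max(0.0, scheduling_delay + processing - batch_interval)
    return delays


# Stable: processing (1.5 s) fits inside the 2 s batch interval.
stable = total_delays([1.5, 1.5, 1.5, 1.5], batch_interval=2.0)

# Unstable: processing (2.5 s) exceeds the interval, so the delay keeps growing.
unstable = total_delays([2.5, 2.5, 2.5, 2.5], batch_interval=2.0)

print(stable)    # [1.5, 1.5, 1.5, 1.5]
print(unstable)  # [2.5, 3.0, 3.5, 4.0]
```

In the stable case the total delay is flat and below the batch interval; in the unstable case it grows by 0.5 s with every batch, which is the pattern the tuning is meant to avoid.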

12. Tackling Challenges
● Having multiple producers and consumers ✔
● Kafka is fast at sending messages and is not the bottleneck
● Establishing some safe limits:
– Using spark.streaming.receiver.maxRate to control the input rate ✔
– Understanding the complexity of the process in Spark Streaming ✔
– Choosing the right batch interval ✔
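`spark.streaming.receiver.maxRate` caps the number of records per second that each receiver accepts, so together with the batch interval and the number of receivers it bounds the size of every batch. The sketch below works out that bound in plain Python; the rate, interval, and receiver count are hypothetical values, not the ones used in the talk.

```python
def max_records_per_batch(max_rate_per_receiver, batch_interval_s, num_receivers):
    """Upper bound on records in one batch when each receiver is rate-limited."""
    return max_rate_per_receiver * batch_interval_s * num_receivers


# Hypothetical settings. In a PySpark job the key/value pair below could be
# applied with SparkConf().set(key, value) before creating the context.
streaming_conf = {
    "spark.streaming.receiver.maxRate": "1000",  # records per second per receiver
}
BATCH_INTERVAL_S = 2   # assumed batch interval in seconds
NUM_RECEIVERS = 3      # assumed number of Kafka receivers

cap = max_records_per_batch(
    int(streaming_conf["spark.streaming.receiver.maxRate"]),
    BATCH_INTERVAL_S,
    NUM_RECEIVERS,
)
print(cap)  # prints 6000
```

Knowing this cap gives a worst-case input size for the k-means step, which is one way to pick a batch interval that the processing time can keep up with.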

14. Data Process
● Data filtering in Spark Streaming

15. Data Process

16. About Me
● Long time ago - B.S. in pure math, University of Toronto
● More recent - M.S. in applied math, University of British Columbia
● The exciting now - A data engineer who wants to go camping with other data engineers