Real-Time Big Data with Apache Kafka, Spark Streaming, Scala, and Elasticsearch
By
S Annu Ahmed (122N1A0573)
V Indu Priyanka (122N1A0532)
S Ravindra (122N1A0572)
M Imran Basha (122N1A0556)
P B Sravanthi (122N1A0558)
B Baby Likhitha (122N1A0514)
Contents:
• From Data Mining to Big Data
• Introduction to Data
• What is Big Data
• Hadoop
• Scala
• Spark Streaming
• Elastic Search
From Data Mining to Big Data
In the early 90s, a buzzword called Data Mining appeared.
Many years later, we have another one called Big Data.
Well, what’s the difference?
Status of Data Mining and Machine Learning
Over the years, we have developed all kinds of effective methods for
classification, clustering, and regression. We also have
good integrated tools for data mining
(e.g., Weka, R, scikit-learn).
However, mining useful information remains difficult for
some real-world applications.
What’s Big Data?
• Though many definitions are available, we consider the situation where the data are larger than the capacity of a single computer.
• I think this is a main difference between data mining and big data.
• So in a sense we are talking about distributed data mining or machine learning.
[Figure: panels (a) and (b), distributed systems]
What is Data?
“A set of values that may be qualitative or quantitative in nature.”
What is Big Data?
“Data so large and voluminous that it overwhelms the existing data
storage and processing infrastructure is said to be big enough to be
called Big Data.”
What is Real-Time Big Data?
The demand for stream processing is growing quickly.
Processing big volumes of data is often not enough:
data has to be processed fast, so that a firm can react to changing
business conditions in real time.
Parameters of big data:
• Huge amount of data
• Complex data, much of it unstructured
• Speed at which data is generated
Big Data versus Fast Data
Big data is one of the most used buzzwords. You can best define it by thinking of three
Vs: big data is not just about Volume, but also about Velocity and Variety.
• Often, masses of structured and semi-structured historical data are stored in Hadoop (Volume + Variety).
• On the other side, stream processing is used for fast data requirements (Velocity + Variety).
• We focus on real-time and stream processing.
Challenges…
• Capturing
• Privacy and security
• Data access and information sharing
• Duration
• Storage
• Search
• Analysis and visualization
What We Need?
• Fault tolerance
• Failure detection
• Low latency, distributed processing, data locality
• Data centers
• Partition awareness
• Elasticity
• Parallelism
Hadoop
Apache Hadoop is an open-source framework for distributed storage
and processing of large data sets on commodity hardware. It includes
HDFS (storage) and MapReduce (processing).
[Figure: Hadoop, consisting of Hadoop HDFS and Hadoop MapReduce]
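As a rough illustration of the MapReduce model named above, the following sketch runs the map, shuffle, and reduce phases of a word count inside a single Python process. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "fast data"]))
```

In a real cluster the map and reduce calls would run on different machines, close to the data blocks they read; here everything runs in order in one process to keep the idea visible.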
Apache Hadoop Ecosystem
[Figure: Apache Hadoop ecosystem overview]
Let’s recall the basic concepts of
messaging systems
Point-to-Point Messaging
(Queue)
Publish-Subscribe Messaging
(Topic)
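The difference between the two messaging styles can be sketched in a few lines of Python. This is a toy in-memory model, not any real messaging library; the class names `PointToPointQueue` and `Topic` and their methods are hypothetical.

```python
from collections import deque


class PointToPointQueue:
    """Queue semantics: each message is delivered to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # The message is removed as soon as one receiver takes it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Publish-subscribe semantics: every subscriber gets its own copy."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def inbox(self, name):
        return list(self._inboxes[name])


if __name__ == "__main__":
    queue = PointToPointQueue()
    queue.send("order-1")
    print(queue.receive(), queue.receive())

    topic = Topic()
    topic.subscribe("billing")
    topic.subscribe("audit")
    topic.publish("order-1")
    print(topic.inbox("billing"), topic.inbox("audit"))
```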
Apache Kafka
Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
  • Designed for processing real-time activity stream data, e.g. logs and metrics collections
  • Written in Scala
• Features
  • Persistent messaging
  • High throughput
  • Supports both queue and topic semantics
  • Uses ZooKeeper to form a cluster of nodes (producer/consumer/broker)
  • And many more…
How it works
[Diagram: real-time transfer between a Producer, Kafka Brokers, ZooKeeper, and four consumers: Consumer1 and Consumer2 (Group1), Consumer3 and Consumer4 (Group2). Other labels: Update Consumed Message offset, Queue Topology, Topic Topology.]
The broker does not push messages to consumers; consumers poll messages from the broker.
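The pull model and the per-group offset can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: one partition, no networking, and the offset advanced inside `poll` itself (a real Kafka client commits offsets as a separate step). The `Broker` class and its methods are hypothetical and are not the Kafka client API.

```python
class Broker:
    """Toy single-partition broker: an append-only log plus one offset per consumer group."""

    def __init__(self):
        self.log = []       # persisted messages, never removed on read
        self.offsets = {}   # group name -> index of the next message to read

    def produce(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=10):
        """Called by a consumer; the broker itself never pushes anything."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        # Record how far this group has read (the consumed message offset).
        self.offsets[group] = start + len(batch)
        return batch


if __name__ == "__main__":
    broker = Broker()
    for message in ["a", "b", "c"]:
        broker.produce(message)

    # Consumers in the same group share one offset: queue-like behaviour.
    print(broker.poll("group1", max_messages=2))
    print(broker.poll("group1", max_messages=2))
    # A different group has its own offset and sees every message: topic-like behaviour.
    print(broker.poll("group2"))
```

Because messages stay in the log after being read, a second group can consume the same data independently, which is how one system can offer both queue and topic semantics.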
About Apache Spark
• Initially started at UC Berkeley in 2009
• Fast, general-purpose cluster computing system
• Reported to be 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce
• Especially popular for running iterative machine learning algorithms
• Provides high-level APIs in
  • Java
  • Scala
  • Python
• Integrates with Hadoop and its ecosystem and can read existing Hadoop data
Why Spark, why not Hadoop?
Spark Streaming
Makes it easy to build scalable fault-tolerant
streaming applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive
queries.
[Figure: Spark Streaming, with the labels “zillions of bytes” and “gigabytes per second”]
Input & Output Sources
[Figure: Spark Streaming input and output sources; surviving labels: Kinesis, S3]
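One way to picture how streaming can be combined with batch processing is micro-batching: the stream is cut into small batches, and an ordinary batch computation runs on each one. The sketch below shows that idea in plain Python and does not use the Spark API. It cuts batches by record count to stay deterministic, whereas Spark Streaming cuts them by time interval; the function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an iterable of records into consecutive lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_counts(stream, batch_size):
    """Run a batch word count on each micro-batch and keep a running total."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals.update(word for line in batch for word in line.split())
        # Emit a snapshot of the totals after every batch.
        yield dict(totals)


if __name__ == "__main__":
    events = ["a b", "b", "c a", "a"]
    for snapshot in running_word_counts(events, batch_size=2):
        print(snapshot)
```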
Scala
• Scala was created by Martin Odersky, who began designing it in 2001; the first release followed in 2003.
• Scala is a language that addresses the major needs of the modern developer.
• It is a statically typed, mixed-paradigm JVM language with a succinct, elegant, and flexible syntax, a sophisticated type system, and idioms that promote scalability from small, interpreted scripts to large, sophisticated applications.
Why Scala?
• Functional
• Object-oriented programming
• On the JVM
• Static typing: easier to control performance
Why Scala? (continued)
• Scala is compelling because it feels like a dynamically typed scripting language, due to its succinct syntax and type inference.
• Yet Scala gives you all the benefits of static typing, a modern object model, functional programming, and an advanced type system.
• Scala’s aim to provide advanced constructs for the abstraction and composition of components is shared by several recent research efforts.
What is Elasticsearch?
• In short, it can be thought of as “search engine software”.
• It gives you a realistic way to run your own search engine service (like Bing or Google), but over private, sensitive, or confidential data and documents that you don’t want on the public web.
• It is a great extra capability for your company, enterprise, app, startup, or client.
• Elasticsearch is an open-source, distributed application that runs on top of Lucene, is written in Java, and exposes a REST API.
• Apache Lucene is a leading open-source search library, and it holds its own even when compared against the most expensive commercial alternatives.
• Very fast search
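To make the REST API point concrete, the sketch below builds (but does not send) two HTTP requests with Python's standard library: one that stores a document and one that runs a full-text `match` query. It assumes a recent Elasticsearch version reachable at `http://localhost:9200`; the index name `docs`, the field `title`, and the helper function names are hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def index_request(index, doc_id, document):
    """Build a PUT request that stores `document` at /<index>/_doc/<doc_id>."""
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a _search request whose JSON body is a full-text `match` query."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = search_request("docs", "title", "kafka")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the example runs without a server.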
Where did Elasticsearch come from?
• Originally there was a search project called Compass, which was primarily worked on by @kimchy.
• Compass also relied on Lucene, but was not distributed.
• kimchy decided to write Elasticsearch to be distributed from the get-go, so you could say it was built with the cloud in mind.
• Add more servers and they play together nicely: they know how to work together to split up the workload (search queries can be resource intensive and expensive in terms of memory and disk requirements).
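How a cluster might split up the workload can be sketched as two steps: route each document to a shard by hashing its id, then spread the shards across whatever nodes are available. This is a simplified illustration, not Elasticsearch's actual implementation: CRC32 stands in for its hash function, round-robin stands in for its shard allocation, and the function names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def assign_shards(num_shards, nodes):
    """Spread shards over the available nodes round-robin."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


if __name__ == "__main__":
    # The same four shards, first on two nodes and then on four.
    print(assign_shards(4, ["node1", "node2"]))
    print(assign_shards(4, ["node1", "node2", "node3", "node4"]))
    print(shard_for("doc-42", 4))
```

The shard a document lands in depends only on its id and the shard count, so adding servers changes where shards live without changing which shard holds a given document.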
Elasticsearch is an advanced distributed app
• It has some very cool properties and abilities when it comes to operations that involve lots of nodes.
• It scales extremely gracefully.
• It has its own optimized binary protocol and makes its own “internal network”…
• …as long as you know what you are doing when it comes to configuration.
• It is open source.