1. Real-time big data with Apache Kafka, Spark
Streaming, Scala, Elasticsearch.
By
S Annu Ahmed(122N1A0573)
V Indu Priyanka(122N1A0532)
S Ravindra(122N1A0572)
M Imran Basha(122N1A0556)
P B Sravanthi(122N1A0558)
B Baby Likhitha(122N1A0514)
2. Contents:
• From Data Mining to Big Data
• Introduction to Data
• What is Big Data
• Hadoop
• Scala
• Spark Streaming
• Elasticsearch
3. From Data Mining to Big Data
In the early 90s, a buzzword called
Data Mining appeared
Many years later, we have another one
called Big Data
Well, what’s the difference?
4. Status of Data Mining and Machine
Learning
Over the years, we have developed all kinds of effective methods for
classification, clustering, and regression. We also have
good integrated tools for data mining
(e.g., Weka, R, scikit-learn)
However, mining useful information remains difficult for
some real-world applications
5. What’s Big Data?
• Though many definitions are
available, we consider the
situation where the data are larger
than the capacity of a single
computer
• I think this is a main
difference between data
mining and big data
• So in a sense we are talking
about distributed data
mining or machine learning
(a), (b): distributed
systems
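To make the "data larger than one computer" idea concrete, here is a minimal sketch of computing a mean when no single machine holds all the data. It assumes each partition lives on a different node and only a small summary travels over the network; `PartitionedMean`, `localSummary`, and `globalMean` are hypothetical names used for illustration, and the "nodes" are simulated with in-memory sequences.

```scala
object PartitionedMean {
  // Each "node" sees only its own partition and returns a small summary
  // (sum, count) instead of the raw values.
  def localSummary(partition: Seq[Double]): (Double, Long) =
    (partition.sum, partition.size.toLong)

  // The coordinator merges the per-partition summaries; it never needs
  // to load the full data set.
  def globalMean(partitions: Seq[Seq[Double]]): Double = {
    val (total, count) = partitions
      .map(localSummary)
      .foldLeft((0.0, 0L)) { case ((s, n), (ps, pn)) => (s + ps, n + pn) }
    if (count == 0) Double.NaN else total / count
  }

  def main(args: Array[String]): Unit = {
    // Three simulated nodes, each holding a slice of the data.
    val partitions = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0), Seq(6.0))
    println(globalMean(partitions))
  }
}
```

The same pattern (local work, then a cheap merge) underlies most distributed data mining; algorithms whose intermediate results cannot be merged this easily are the hard cases.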
6. What is Data?
“A set of values that may be qualitative or quantitative in nature”
What is Big Data?
“Data so large and voluminous that it overwhelms the existing data
storage and processing infrastructure is said to be big enough to be
called Big Data”
What is Real-time Big Data?
The demand for stream processing is growing rapidly.
The reason is that processing big volumes of data is often not enough.
Data has to be processed fast, so that a firm can react to changing
business conditions in real time.
7. Parameters of big data:
Volume: huge amount of data
Variety: complex data, much of it unstructured
Velocity: speed at which data is generated
8. Big Data versus Fast Data
Big data is one of the most used buzzwords. You can best define it by thinking of three
Vs: Big data is not just about Volume, but also about Velocity and Variety.
Often, masses of structured and semi-structured historical
data are stored in Hadoop (Volume + Variety).
On the other side, stream processing is used for fast data
requirements (Velocity + Variety).
We focus on real-time and stream processing.
9. Challenges…
Capturing
Privacy and security
Data access and sharing Information
Duration
Storage
Search
Analysis & Visualization
10. What We Need?
• Fault tolerance
• Failure detection
• Low latency, distributed, data locality
• Data centers
• Partition awareness
• Elasticity
• Parallelism
11. Apache Hadoop is an open-source framework for distributed storage
and processing of large data sets on commodity hardware. It includes
HDFS and MapReduce
[Diagram: Hadoop = Hadoop HDFS + Hadoop MapReduce]
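The MapReduce model can be illustrated with the classic word count. The following is only a sketch of the programming model using plain Scala collections, not the Hadoop API: `WordCountSketch`, `mapPhase`, and `reducePhase` are hypothetical names, and everything runs in one process rather than across a cluster.

```scala
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle + reduce phase: group the pairs by word and sum the counts.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(lines.flatMap(mapPhase))

  def main(args: Array[String]): Unit = {
    wordCount(Seq("big data", "fast data")).foreach(println)
  }
}
```

In a real Hadoop job the map calls would run on the nodes that hold the HDFS blocks, and the grouping step would move data between machines.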
17. Apache Kafka: Overview
An Apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g.
logs, metrics collections
• Written in Scala
Features
Persistent messaging
High throughput
Supports both queue and topic semantics
Uses ZooKeeper for forming a cluster of nodes
(producer/consumer/broker) and many more…
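The combination of persistent messaging with queue and topic semantics can be sketched with an append-only log and one read offset per consumer group. This is a hypothetical in-memory stand-in for a single topic, not the Kafka client API; `TopicLog`, `publish`, and `poll` are illustrative names, and there is no networking, partitioning, or durability here.

```scala
import scala.collection.mutable

// Hypothetical in-memory model of one topic; NOT the Kafka client API.
final class TopicLog[A] {
  private val log = mutable.ArrayBuffer.empty[A]
  // Next offset to read, tracked separately for each consumer group.
  private val offsets = mutable.Map.empty[String, Int]

  // Producers append; a message stays in the log after it has been read.
  // Returns the offset assigned to the message.
  def publish(message: A): Int = {
    log += message
    log.size - 1
  }

  // A group reads from its own offset, so every group sees every message
  // (topic semantics), while callers sharing a group name split the
  // messages between them (queue semantics).
  def poll(group: String, max: Int = 10): List[A] = {
    val from = offsets.getOrElse(group, 0)
    val batch = log.slice(from, from + max).toList
    offsets(group) = from + batch.size
    batch
  }
}

object TopicLogDemo {
  def main(args: Array[String]): Unit = {
    val topic = new TopicLog[String]
    topic.publish("page-view")
    topic.publish("click")
    println(topic.poll("analytics"))
    println(topic.poll("audit", max = 1))
  }
}
```

Because reading only advances an offset and never deletes anything, a new consumer group can join later and replay the stream from the beginning.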
20. About Apache Spark
Initially started at UC Berkeley in 2009
Fast and general-purpose cluster computing system
Claimed to be up to 10x (on disk) to 100x (in memory) faster than Hadoop MapReduce
Popular for running iterative machine learning algorithms.
Provides high-level APIs in
Java
Scala
Python
Integrates with Hadoop and its ecosystem and can read existing Hadoop data.
22. Spark Streaming
Makes it easy to build scalable fault-tolerant
streaming applications.
Ease of Use
Fault Tolerance
Combine streaming with batch and interactive
queries.
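Spark Streaming's classic model treats a stream as a sequence of small batches, which is what lets the same logic serve both batch and streaming jobs. The sketch below imitates that idea with plain Scala collections; it is not the Spark API, and `MicroBatchSketch`, `Event`, `countByKey`, and `microBatches` are hypothetical names chosen for illustration.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, key: String)

  // Ordinary batch logic: the same function can be applied to a historical
  // data set or to each micro-batch of a stream.
  def countByKey(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.key).map { case (k, es) => (k, es.size) }

  // Cut the stream into fixed-width time windows and run the batch logic
  // on each window. Returns (window start in ms, counts), ordered by time.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(e => e.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (window, es) => (window * intervalMs, countByKey(es)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0, "a"), Event(500, "a"), Event(1200, "b"))
    microBatches(events, intervalMs = 1000).foreach(println)
  }
}
```

A real streaming engine would receive events continuously and emit each window's result as soon as the interval closes; here the whole input is known up front to keep the example self-contained.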
26. Scala
Scala was created by Martin Odersky, who began its design in
2001; the first public release followed in 2004
Scala is the language that addresses the major needs of
the modern developer.
It is a statically typed, mixed-paradigm, JVM language
with a succinct, elegant, and flexible syntax, a
sophisticated type system, and idioms that promote
scalability from small, interpreted scripts to large,
sophisticated applications.
27. Why Scala?
• Functional
• Object-oriented programming
• On the JVM
• Static typing - easier to control performance
28. Why Scala? (continued)
Scala is compelling because it feels like a dynamically
typed scripting language, due to its succinct syntax and
type inference.
Yet Scala gives you all the benefits of static typing, a
modern object model, functional programming, and an
advanced type system.
Scala's aim to provide advanced constructs for the
abstraction and composition of components is shared by
several recent research efforts.
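A short example may help show how these traits combine in practice. The sketch below uses only the Scala standard library; the `Shape` hierarchy and the function names are hypothetical, chosen to show object-oriented modelling, functional style, and type inference side by side.

```scala
object ScalaFlavor {
  // Object-oriented: a sealed trait with two case-class implementations.
  sealed trait Shape { def area: Double }
  final case class Circle(radius: Double) extends Shape {
    def area: Double = math.Pi * radius * radius
  }
  final case class Rect(width: Double, height: Double) extends Shape {
    def area: Double = width * height
  }

  // Functional: transform a collection with a function instead of a loop.
  def totalArea(shapes: List[Shape]): Double = shapes.map(_.area).sum

  // Pattern matching; the compiler checks that every Shape is covered.
  def describe(shape: Shape): String = shape match {
    case Circle(r)  => s"circle of radius $r"
    case Rect(w, h) => s"rectangle $w x $h"
  }

  def main(args: Array[String]): Unit = {
    // No type annotation needed: the element type is inferred.
    val shapes = List(Circle(1.0), Rect(2.0, 3.0))
    shapes.foreach(s => println(s"${describe(s)} has area ${s.area}"))
    println(totalArea(shapes))
  }
}
```

The code reads much like a script, yet a call such as `totalArea(List("oops"))` would be rejected at compile time.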
29. What is Elasticsearch?
In short, it can be thought of as “search engine software”
It provides the realistic potential for you to run your own search engine
service (like a Bing or a Google) but with, say, private, sensitive, or
confidential data/documents that you don’t want on the public web
A great extra capability for your company, enterprise, app, startup, or client
Elasticsearch is an open-source, distributed search server that runs on
top of Lucene; it is written in Java and exposes a REST API
Apache Lucene is arguably the best open-source search library, and probably one
of the best search engines available, and holds its own even when
compared against the most expensive commercial alternatives
Very fast search
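Much of the speed of Lucene-based search comes from the inverted index, which maps each term to the documents that contain it. The following is a deliberately tiny sketch of that data structure, not Lucene or the Elasticsearch API; `InvertedIndexSketch`, `tokenize`, `build`, and `search` are hypothetical names, and real engines add scoring, analysis, and compressed on-disk storage.

```scala
object InvertedIndexSketch {
  // Very simple analyzer: lowercase, then split on non-alphanumerics.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  // Map each term to the set of document ids that contain it.
  def build(docs: Map[Int, String]): Map[String, Set[Int]] =
    docs.toSeq
      .flatMap { case (id, text) => tokenize(text).map(term => (term, id)) }
      .groupBy(_._1)
      .map { case (term, pairs) => (term, pairs.map(_._2).toSet) }

  // AND query: ids of documents that contain every query term.
  def search(index: Map[String, Set[Int]], query: String): Set[Int] = {
    val postings = tokenize(query).map(t => index.getOrElse(t, Set.empty[Int]))
    if (postings.isEmpty) Set.empty[Int] else postings.reduce(_ intersect _)
  }

  def main(args: Array[String]): Unit = {
    val docs = Map(
      1 -> "Fast search with Lucene",
      2 -> "Distributed search",
      3 -> "Spark streaming"
    )
    val index = build(docs)
    println(search(index, "fast search"))
  }
}
```

A query only touches the entries for its own terms, so lookup cost depends on the query rather than on scanning every document.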
30. Where did Elasticsearch come from?
Originally there was a search application project called
Compass, which was primarily worked on by Shay Banon (@kimchy)
Compass also relied on Lucene, but was not distributed
kimchy decided to write Elasticsearch to be distributed from the
get-go, and so you could say it was built with the cloud in mind
Add more servers and they play together nicely, and they know
how to work together to split up the workload (and search
queries can be resource intensive and expensive in terms of
memory/disk requirements)
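One common way to split the workload is hash-based routing: each document is assigned to a shard by hashing a routing value, which by default is the document id. The sketch below shows the idea only. It assumes `String.hashCode` as the hash, whereas Elasticsearch uses its own hash function and routing formula; `ShardRoutingSketch`, `shardFor`, and `distribute` are hypothetical names.

```scala
object ShardRoutingSketch {
  // Assumption: String.hashCode stands in for the real routing hash.
  // floorMod keeps the result in [0, numShards) even for negative hashes.
  def shardFor(docId: String, numShards: Int): Int =
    java.lang.Math.floorMod(docId.hashCode, numShards)

  // Group document ids by the shard that would hold them.
  def distribute(docIds: Seq[String], numShards: Int): Map[Int, Seq[String]] =
    docIds.groupBy(id => shardFor(id, numShards))

  def main(args: Array[String]): Unit = {
    val ids = (1 to 10).map(i => s"doc-$i")
    distribute(ids, numShards = 3).toSeq.sortBy(_._1).foreach {
      case (shard, docs) => println(s"shard $shard: ${docs.mkString(", ")}")
    }
  }
}
```

Because the shard is a pure function of the id, any node can work out where a document lives without consulting a central lookup table.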
31. Elasticsearch is an advanced distributed application
It has some very cool properties and abilities when it
comes to operations that involve lots of nodes
It scales extremely gracefully
It has its own optimized binary protocol and makes its
own “internal network”
…as long as you know what you are doing when it
comes to configuration
It is open source