3. BEING READY FOR APACHE KAFKA
by Michael G. Noll, Confluent Inc.
http://www.slideshare.net/miguno/being-ready-for-apache-kafka-apache-big-data-europe-2015
4. Apache Kafka is publish-subscribe messaging
rethought as a distributed commit log.
[Diagram: producers write to a Kafka cluster of brokers coordinated by ZooKeeper; a topic is an ordered log running from oldest to newest, and each consumer reads from it at its own position.]
5. ABOUT KAFKA FROM JAY KREPS
• A consumer just maintains an “offset,” which is the log entry number
for the last record it has processed on each of these partitions. So,
changing the consumer’s position to go back and reprocess data is as
simple as restarting the job with a different offset. Adding a second
consumer for the same data is just another reader pointing to a
different position in the log.
• Kafka supports replication and fault-tolerance, runs on cheap,
commodity hardware, and is glad to store many TBs of data per
machine.
• LinkedIn keeps more than a petabyte of Kafka storage online,
and a number of applications make good use of this long retention
pattern for exactly this purpose.
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
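The offset idea described above can be sketched in a few lines. This is a toy in-memory model, not Kafka's actual implementation: a consumer keeps only an offset into an append-only log, so reprocessing is just restarting from an earlier offset, and a second consumer is just another reader at a different position.

```python
# Toy sketch of the commit-log/offset idea (illustrative, not Kafka code).

class CommitLog:
    def __init__(self):
        self._records = []  # append-only: records are never mutated

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]


class Consumer:
    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset  # the only state a consumer maintains

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

c1 = Consumer(log)            # reads from the beginning
c2 = Consumer(log, offset=2)  # a second reader at a different position
print(c1.poll())  # ['click', 'view', 'purchase']
print(c2.poll())  # ['purchase']
c1.offset = 0     # "rewind" to reprocess everything
print(c1.poll())  # ['click', 'view', 'purchase']
```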
6. USING KAFKA
• DEB and RPM packages are available via the Confluent Platform
(http://www.confluent.io/developer)
• Recommended Python client: kafka-python
(https://github.com/mumrah/kafka-python)
• Confluent Kafka-REST is available via Confluent
Platform
• Monitoring is important: host metrics (CPU, memory,
disk I/O and usage, network I/O), Kafka metrics
(consumer lag, replication stats, message latency, GC),
ZooKeeper metrics (request latency, number of
outstanding requests)
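Of the Kafka metrics listed above, consumer lag is the one most teams alert on. A minimal sketch of how it is derived, using made-up sample offsets rather than values pulled from a real cluster: lag per partition is the broker's log-end offset minus the consumer group's committed offset.

```python
# Hypothetical sketch of the "consumer lag" metric: per-partition lag is
# the log-end offset (newest message) minus the committed consumer offset.
# All offset values below are made-up sample data.

def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag and the total lag across partitions."""
    lag = {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }
    return lag, sum(lag.values())


log_end = {0: 1500, 1: 980, 2: 2100}    # broker-side newest offsets
committed = {0: 1500, 1: 950, 2: 1800}  # consumer-group committed offsets
per_partition, total = consumer_lag(log_end, committed)
print(per_partition)  # {0: 0, 1: 30, 2: 300}
print(total)          # 330
```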
7. NEW IN KAFKA 0.9.0
• Copycat is a new framework for loading structured data into and
out of Kafka
• Kafka Streams is a library that supports basic operations (join/
filter/map/…), windowing, schema and proper time modelling
(event time vs. processing time)
• New unified consumer Java API
• ZooKeeper dependency is removed from clients
11. LAMBDA ARCHITECTURE
• Batch layer that provides the following functionality:
• managing the master dataset, an immutable, append-only
set of raw data.
• pre-computing arbitrary query functions, called batch views.
• Serving layer (NoSQL stores such as HBase, Apache Druid, etc.)
• This layer indexes the batch views so that they can be
queried ad hoc with low latency.
• Speed layer (Apache Storm, Spark Streaming, etc.)
• This layer accommodates all requests that are subject to
low latency requirements. Using fast and incremental
algorithms, the speed layer deals with recent data only.
12. LAMBDA ARCHITECTURE
• Pros
• Retains the input data unchanged
• Takes into account the problem of reprocessing data
(the code changes, and you need to reprocess)
• Cons
• Maintaining code that needs to produce the same result
from two complex distributed systems is painful
• Different and diverging programming paradigms
13. KAPPA ARCHITECTURE
• On July 2, 2014, Jay Kreps of LinkedIn coined the term Kappa
Architecture
• His proposal is simple:
• Use Kafka (or other system) that will let you retain the full log
of the data you need to reprocess.
• When you want to do the reprocessing, start a second instance
of your stream processing job that starts processing from the
beginning of the retained data, but direct this output data to a
new output table.
• When the second job has caught up, switch the application to
read from the new table.
• Stop the old version of the job, and delete the old output table.
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
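The reprocessing steps above can be sketched with in-memory stand-ins for the Kafka topic, the stream jobs, and the output tables (all names and data here are illustrative, not a real stream-processing API):

```python
# Sketch of the Kappa reprocessing switch: run job version n+1 from the
# beginning of the retained log into a NEW table, then swap the app over.

def run_job(input_topic, transform):
    """Process the retained log from the beginning into a fresh output table."""
    return {key: transform(value) for key, value in input_topic}


input_topic = [("user1", 10), ("user2", 20)]  # retained log of raw events

# Job version n and the output table the app currently reads from.
tables = {"serving": run_job(input_topic, lambda v: v * 2)}

# Code changed: start job version n+1 from the beginning of the log,
# directing its output to a new table instead of overwriting the old one.
new_table = run_job(input_topic, lambda v: v * 3)

# Once version n+1 has caught up, switch the app to the new table
# (and then stop the old job and delete the old table).
tables["serving"] = new_table
print(tables["serving"])  # {'user1': 30, 'user2': 60}
```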
14. KAPPA ARCHITECTURE
[Diagram: an app reads from output table n (written by stream-processing job version n) or output table n+1 (written by job version n+1); both jobs consume the same input topic from the Kafka cluster and write to the serving DB.]
LAMBDA ARCHITECTURE
[Diagram: an app reads from both a speed table (written by a stream-processing job) and a batch table (written by a batch-processing job); both jobs consume the same input topic from the Kafka cluster and write to the serving DB.]
15. KAPPA ARCHITECTURE
• You need to reprocess only when you change the code.
• Check whether the new version is working OK; if not,
revert to the old output table.
• You can mirror a Kafka topic to HDFS, so you are
not limited by the Kafka retention configuration.
• You have only one codebase to maintain, with a single
framework.
• The real advantage is allowing your team to
develop, test, debug and operate their systems on
top of a single processing framework.
16. USE CASES: IOT - OBD II
• One of the clients installs On-Board Devices in the cars of
its customers.
• ASPGems implemented an API to collect all the
information in real time and inject it into Kafka.
• The business rules are implemented in a CEP
(complex event processing) engine running on Apache
Spark Streaming.
• For MPP (massively parallel processing) they use
Elasticsearch.
17. CATCH THEM IN THE ACT
FRAUD DETECTION IN REAL-TIME
by Seshika Fernando, WSO2
http://events.linuxfoundation.org/sites/events/files/slides/Fraud%20Detection%20in%20Real-time%20-%20Seshika%20Fernando.pdf
18. FRAUD: A TRILLION DOLLAR PROBLEM
• Survey results
• $3.5–4 trillion in global losses per year (5% of global GDP)
• Payment Fraud Only
• Merchants are losing around $250B globally
• Cost of Fraud is around 0.68% of Revenue for Retailers (2014)
• Steep rise in Fraud in eCommerce (0.85% of Revenue) and
mCommerce (1.36% of Revenue) with a movement of
payments to newer channels
20. FRAUD SCORING
• Use combinations of rules
• Give weights to each rule
• Derive a single number that reflects many fraud indicators
• Use a threshold to reject transactions
• Example: Score = 0.001 * itemPrice + 0.1 * itemQuantity +
2.5 * isFreeEmail + 5 * riskyCountry + 8 * suspiciousIPRange +
5 * suspiciousUsername + 3 * highTransactionVelocity
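The weighted-rule score above translates directly into a small function. The weights come from the slide's example; the sample transaction and the rejection threshold are made up for illustration:

```python
# Fraud scoring as on the slide: weighted sum of rule indicators, compared
# against a threshold. The transaction and threshold are illustrative.

WEIGHTS = {
    "itemPrice": 0.001, "itemQuantity": 0.1, "isFreeEmail": 2.5,
    "riskyCountry": 5, "suspiciousIPRange": 8,
    "suspiciousUsername": 5, "highTransactionVelocity": 3,
}

def fraud_score(txn):
    """Combine many fraud indicators into a single number."""
    return sum(weight * txn.get(feature, 0) for feature, weight in WEIGHTS.items())


txn = {"itemPrice": 2000, "itemQuantity": 3, "isFreeEmail": 1,
       "riskyCountry": 0, "suspiciousIPRange": 1,
       "suspiciousUsername": 0, "highTransactionVelocity": 1}
score = fraud_score(txn)       # 0.001*2000 + 0.1*3 + 2.5 + 8 + 3 = 15.8
print(score > 10)              # reject if above a chosen threshold -> True
```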
21. LEARN FROM DATA
• Utilize machine learning techniques to identify
‘unknown’ point anomalies (e.g. k-means clustering)
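One common way to use k-means for point anomalies, sketched below in pure Python with fixed initial centroids for determinism: cluster the data, then flag points unusually far from their nearest centroid. The data points and the distance threshold are illustrative assumptions, not from the talk.

```python
# Hedged sketch of k-means point-anomaly detection: cluster, then flag
# points far from every centroid. Data and threshold are made up.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, centroids, iters=10):
    """Plain Lloyd's algorithm with fixed initial centroids."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids


points = [(1, 1), (1.2, 0.9), (0.8, 1.1),   # normal cluster A
          (5, 5), (5.1, 4.9), (4.9, 5.2),   # normal cluster B
          (9, 0)]                            # the anomaly
centroids = kmeans(points, centroids=[(0, 0), (6, 6)])
anomalies = [p for p in points if min(dist(p, c) for c in centroids) > 2.0]
print(anomalies)  # [(9, 0)]
```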
22. MARKOV MODELS FOR FRAUD DETECTION
• Markov Models are stochastic models used to model
randomly changing systems
[Diagram: incoming events are classified, the classifications update a probability matrix, and incoming sequences are compared against that matrix to produce alerts.]
23. MARKOV MODEL: CLASSIFICATION
Example: Each transaction is classified along
the following three qualities and expressed
as a 3-letter token, e.g., HNN
• Amount spent: Low, Normal and High
• Whether the transaction includes a high-price item:
Normal and High
• Time elapsed since the last transaction: Large,
Normal and Small
24. MARKOV MODEL: PROBABILITY
        LNL   LNH   LNS   LHL   HHL   …
LNL     0.97  0.54  0.2   0.09  0.07
LNH     0.8   0.6   0.18  0.65  0.11
LNS     0.07  0.83  0.95  0.15  0.12
…
• Compare the probabilities of incoming transaction
sequences with thresholds and flag fraud as appropriate
• Can use direct probabilities or more complex metrics (Miss
Rate Metric, Miss Probability Metric, Entropy Reduction
Metric, …)
• Update Markov Probability table with incoming transactions
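The three slides above fit together as a small pipeline: classify each transaction into a 3-letter token, maintain a transition-count matrix, and flag sequences whose transition probability falls below a threshold. The sketch below uses direct probabilities (the simplest of the metrics listed); the classification cut-offs, training data, and threshold are made-up assumptions:

```python
# Sketch of Markov-model fraud detection: tokenize, update transition
# counts, score incoming transitions. All numeric cut-offs are illustrative.

from collections import defaultdict

def classify(txn):
    """3-letter token: amount, high-price item, elapsed time (as on the slide)."""
    amount = "L" if txn["amount"] < 20 else ("H" if txn["amount"] > 500 else "N")
    high_item = "H" if txn["has_high_price_item"] else "N"
    elapsed = "S" if txn["elapsed_min"] < 5 else ("L" if txn["elapsed_min"] > 1440 else "N")
    return amount + high_item + elapsed


counts = defaultdict(lambda: defaultdict(int))  # transition-count matrix

def update(prev_token, token):
    counts[prev_token][token] += 1

def transition_prob(prev_token, token):
    total = sum(counts[prev_token].values())
    return counts[prev_token][token] / total if total else 0.0


# Train on a sequence of "normal" behaviour.
normal = ["LNL", "LNL", "LNL", "LNS", "LNL"]
for prev, cur in zip(normal, normal[1:]):
    update(prev, cur)

# Score an incoming transaction: an unseen transition gets probability 0.
incoming = classify({"amount": 900, "has_high_price_item": True, "elapsed_min": 1})
p = transition_prob("LNL", incoming)
print(incoming)   # HHS
print(p < 0.05)   # True -> flag as suspicious
```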
25. DIG DEEPER
• Access historical data using
• expressive querying
• easy filtering
• useful visualisations
• to isolate incidents and unearth connections
26. NLP STRUCTURED DATA INVESTIGATION
ON NON-TEXTUAL DATA WITH MLLIB
by Casey Stella, Hortonworks
http://events.linuxfoundation.org/sites/events/files/slides/NLP_on_non_textual_data.pdf
27. WORD2VEC
• Word2Vec is a vectorization model created by Google that attempts
to learn relationships between words automatically given a large
corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in
the vector space with cosine similarity.
• Uses a neural network to learn vector representations.
• Recent work by Pennington, Socher, and Manning shows that the
word2vec model is equivalent to weighting a word co-occurrence
matrix based on window distance and lowering the dimension by
matrix factorization.
• Read more: http://radimrehurek.com/2014/12/making-sense-of-word2vec/
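The "near neighbours by cosine similarity" bullet above can be shown concretely. The 3-dimensional vectors below are made up purely for illustration (real word2vec vectors typically have 100+ dimensions learned from a corpus):

```python
# Cosine similarity over toy word vectors: nearest neighbours in vector
# space are the "similar words". The vectors are fabricated examples.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.85, 0.82, 0.15),
    "apple": (0.1, 0.2, 0.9),
}

def most_similar(word, k=2):
    """Rank the other words by cosine similarity to `word`, highest first."""
    neighbours = ((other, cosine(vectors[word], v))
                  for other, v in vectors.items() if other != word)
    return sorted(neighbours, key=lambda pair: -pair[1])[:k]


print(most_similar("king"))  # 'queen' ranks above 'apple'
```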
28. CLINICAL DATA AS SENTENCES
• Clinical encounters form a sort of sentence over time. For a
given encounter:
• Vitals are measured (e.g. height, weight, BMI).
• Labs are performed and results are recorded (e.g. blood tests).
• Procedures are performed.
• Diagnoses are made (e.g. Diabetes).
• Drugs are prescribed.
• Each of these can be considered clinical “words” and the
encounter forms a clinical “sentence”.
• Idea: We can use word2vec to investigate connections between
these clinical concepts.
29. DEMO FOR KAGGLE COMPETITION
• Practice Fusion Diabetes Classification
(https://www.kaggle.com/c/pf2012-diabetes)
• Given a de-identified data set of patient electronic
health records, build a model to determine who
has a diabetes diagnosis, as defined by ICD9 codes
• There are a total of 9,948 patients in the training
set and 4,979 patients in the test set.
• Ingested and preprocessed these records
into 197,340 clinical “sentences”
30. SYNONYMS
• Sentence:
• dx::042 rx::benzoyl_peroxide_topical rx::morphine
from pyspark.mllib.feature import Word2Vec

# Fit a 100-dimensional word2vec model on the clinical "sentences".
word2vec = Word2Vec()
word2vec.setSeed(0)
word2vec.setVectorSize(100)
model = word2vec.fit(sentences)

def print_synonyms_filt(clinical_concept, model, prefix):
    # findSynonyms returns (word, cosine similarity) pairs, closest first;
    # keep only words matching the prefix (or all words if prefix is None).
    synonyms = model.findSynonyms(clinical_concept, 10000)
    for word, cosine_distance in synonyms:
        if prefix is None or word.startswith(prefix):
            print("{}: {}".format(cosine_distance, word))
31. RESULTS EXAMPLE:
ATHEROSCLEROSIS OF THE AORTA
• Hearing Loss
• From an article from the Journal of Atherosclerosis in 2012:
• Sensorineural hearing loss seemed to be associated with vascular endothelial
dysfunction and an increased cardiovascular risk
• Knee Joint Replacements
• These procedures are common among those with osteoarthritis and there has
been a solid correlation between osteoarthritis and atherosclerosis in the literature.
print_synonyms_filt('dx::440.0', model, None)
0.930721402168: dx::v12.71 -- Personal history of peptic ulcer disease
0.926115810871: dx::533.40 -- Chronic or unspecified peptic ulcer of
unspecified site with hemorrhage, without mention of obstruction
0.91034334898: dx::153.6 -- Malignant neoplasm of ascending colon
0.90947073698: dx::238.75 -- Myelodysplastic syndrome, unspecified
0.907130658627: dx::389.10 -- Sensorineural hearing loss, unspecified
0.90490090847: dx::428.30 -- Diastolic heart failure, unspecified
0.902494549751: dx::v43.65 -- Knee joint replacement