SlideShare a Scribd company logo
1 of 28
Spark, and
Kafka timestamp offset
charsyam@naver.com
Common Way: Kafka and Spark Streaming
Server #1
Server #2
Server ...
Server ...
Log
Aggregation
Server
Syslogd
Kafka
Logstash
Spark
Streaming
Kafka Log
Kafka Log Segments
Segment is a file
If you set prealloc flag, segment
will be 1GB.
What is timestamp Index?
Fetching Kafka Logs by Timestamp.(From Kafka-0.10.0.2)
You can query from specific timestamp range of message.(very useful)
ex) From 2018-09-08 00:00:00 To 2018-09-08-23:59:59
Related KIPs
KIP-32 - Add timestamps to Kafka message : 0.10.0.0
KIP-33 - Add a time based log index : 0.10.1.0
Kafka Log Files
00000000000155593652.log Kafka Message Data File
00000000000155593652.index OffsetIndex File
00000000000155593652.timeindex TimeIndex File
There are many files for 1 segment.
.txnindex, .snapshot, .deleted, .cleaned, .swap, etc
Kafka Index Files
OffsetIndex.scala
TimeIndex.scala
There are different Index Files
When Kafka append Logs.
@nonthreadsafe
def append(largestOffset: Long,
largestTimestamp: Long,
shallowOffsetOfMaxTimestamp: Long,
records: MemoryRecords): Unit = {
……
val appendedBytes = log.append(records)
……
offsetIndex.append(largestOffset, physicalPosition)
timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)
……
}
When Kafka writes Logs #1
● Using MMAP(so OS Page Cache is very important for performance)
● Offset is stored as relative offset
○ Current Offset - Base Offset
● Offset will be returned Absolute Offset
When Kafka writes Logs #2
OffsetIndex
Append
Int(4 bytes)
Offset
Int(4 bytes)
Position
When Kafka writes Logs #3
TimeIndex
Append
Long(8 bytes)
Timestamp
Int(4 bytes)
Position
How to fetch by timestamp #1
def fetchOffsetsByTimestamp(targetTimestamp: Long): Option[TimestampOffset] = {
……
val targetSeg = {
val earlierSegs = segmentsCopy.takeWhile(_.largestTimestamp < targetTimestamp)
if (earlierSegs.length < segmentsCopy.length)
Some(segmentsCopy(earlierSegs.length))
else
None
}
targetSeg.flatMap(_.findOffsetByTimestamp(targetTimestamp, logStartOffset))
}
}
How to fetch by timestamp #2
def findOffsetByTimestamp(timestamp: Long, startingOffset: Long = baseOffset): Option[TimestampOffset] = {
// Get the index entry with a timestamp less than or equal to the target timestamp
val timestampOffset = timeIndex.lookup(timestamp)
val position = offsetIndex.lookup(math.max(timestampOffset.offset, startingOffset)).position
// Search the timestamp
Option(log.searchForTimestamp(timestamp, position, startingOffset)).map { timestampAndOffset =>
TimestampOffset(timestampAndOffset.timestamp, timestampAndOffset.offset)
}
}
Using BinarySearch For Search
Simpe Question!!!
● Why offset index is needed?
● How to use binary search in Index File?
How to query by timestamp in spark
Convert timestamp to OffsetRange
Just Create KafkaRDD with OffsetRange
Create KafkaRDD with OffsetRange
KafkaUtils.createRDD[K, V](spark.sparkContext, kafkaParamsMap, offsetRanges, PreferConsistent)
Convert timestamp to OffsetRange
val consumer = createKafkaConsumer(props)
val startOffset = consumer.offsetsForTimes(topicMap)
val endOffset = consumer.offsetsForTimes(topicMap)
KafkaConsumer.offsetsForTimes
Client API
offsetsForTimes
KafkaConsumer.java
getOffsetsByTimes
Fetcher.java
retrieveOffsetsByTimes
Fetcher.java
Broker API
handleListOffsetRequest
KafkaApis.scala
Log Append Flows
Broker API
handleProduceRequest
KafkaApis.scala
appendRecords
ReplicaManager.scala
appendToLocalLog
ReplicaManager.scala
appendRecordsToLeade
r
Partition.scala
appendAsLeader
Log.scala
append
Log.scala
append
LogSegment.scala
When Segmnet is rolled?
def shouldRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long, now: Long): Boolean = {
val reachedRollMs = timeWaitedForRoll(now, maxTimestampInMessages) > maxSegmentMs - rollJitterMs
size > maxSegmentBytes - messagesSize ||
(size > 0 && reachedRollMs) ||
offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(maxOffsetInMessages)
}
When Segmnet is rolled?
def shouldRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long, now: Long): Boolean = {
val reachedRollMs = timeWaitedForRoll(now, maxTimestampInMessages) > maxSegmentMs - rollJitterMs
size > maxSegmentBytes - messagesSize ||
(size > 0 && reachedRollMs) ||
offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(maxOffsetInMessages)
}
1] size > maxSegmentBytes - messageSize
2] size > 0 && reachedRollMs
3] offsetIndex.isFull
4] timeIndex.isFull
5] canCovertToRelativeOffset is false
One Cent for Using Kafka Timestamp offset
● As a default, timestamp is set as sending time by client.
● So it is not a time that log is created.
○ You should specify to use timestamp as created time of log.
Thank you!
Quiz
● If timestamp is older than last timeIndex
○ How Kafka handles this?
Original
00000000317.log TimeIndex
Log1 Offset: 317
Timestamp: 10000
…...
Log100 Offset: 416
Timestamp: 20000
10000, 317
20000, 416
Append Log with old timestamp
00000000317.log TimeIndex
Log1 Offset: 317
Timestamp: 10000
…...
Log100 Offset: 416
Timestamp: 20000
10000, 317
20000, 416
Log101 Offset: 417
Timestamp: 20
Log200 Offset: 516
Timestamp: 30000
…...
30000, 516
Please remember!!!
● If you use too early timestamp.
○ It can remove your all kafka-logs.
● If you use too later timestamp.
○ It is never removed.

More Related Content

What's hot

Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaGuozhang Wang
 
whats new in java 8
whats new in java 8 whats new in java 8
whats new in java 8 Dori Waldman
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingSigmoid
 
Prezo at-mesos con2015-final
Prezo at-mesos con2015-finalPrezo at-mesos con2015-final
Prezo at-mesos con2015-finalSharma Podila
 
Transaction Support in Pulsar 2.5.0
Transaction Support in Pulsar 2.5.0Transaction Support in Pulsar 2.5.0
Transaction Support in Pulsar 2.5.0StreamNative
 
"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017Alex Borysov
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubySATOSHI TAGOMORI
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
 
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSSCloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSSAWS Vietnam Community
 
Heat optimization
Heat optimizationHeat optimization
Heat optimizationRico Lin
 
Using Libvirt with Cluster API to manage baremetal Kubernetes
Using Libvirt with Cluster API to manage baremetal KubernetesUsing Libvirt with Cluster API to manage baremetal Kubernetes
Using Libvirt with Cluster API to manage baremetal KubernetesHimani Agrawal
 
"Enabling Googley microservices with gRPC" at JEEConf 2017
"Enabling Googley microservices with gRPC" at JEEConf 2017"Enabling Googley microservices with gRPC" at JEEConf 2017
"Enabling Googley microservices with gRPC" at JEEConf 2017Alex Borysov
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
 

What's hot (17)

SPARQLstream and Morph-streams
SPARQLstream and Morph-streamsSPARQLstream and Morph-streams
SPARQLstream and Morph-streams
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Graph Everything
Graph EverythingGraph Everything
Graph Everything
 
Building a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache KafkaBuilding a Replicated Logging System with Apache Kafka
Building a Replicated Logging System with Apache Kafka
 
whats new in java 8
whats new in java 8 whats new in java 8
whats new in java 8
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Prezo at-mesos con2015-final
Prezo at-mesos con2015-finalPrezo at-mesos con2015-final
Prezo at-mesos con2015-final
 
Fun421 stephens
Fun421 stephensFun421 stephens
Fun421 stephens
 
Transaction Support in Pulsar 2.5.0
Transaction Support in Pulsar 2.5.0Transaction Support in Pulsar 2.5.0
Transaction Support in Pulsar 2.5.0
 
"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017"Enabling Googley microservices with gRPC." at Devoxx France 2017
"Enabling Googley microservices with gRPC." at Devoxx France 2017
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In Ruby
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSSCloud Solution Day 2016: Microservices on Mesos & Netflix OSS
Cloud Solution Day 2016: Microservices on Mesos & Netflix OSS
 
Heat optimization
Heat optimizationHeat optimization
Heat optimization
 
Using Libvirt with Cluster API to manage baremetal Kubernetes
Using Libvirt with Cluster API to manage baremetal KubernetesUsing Libvirt with Cluster API to manage baremetal Kubernetes
Using Libvirt with Cluster API to manage baremetal Kubernetes
 
"Enabling Googley microservices with gRPC" at JEEConf 2017
"Enabling Googley microservices with gRPC" at JEEConf 2017"Enabling Googley microservices with gRPC" at JEEConf 2017
"Enabling Googley microservices with gRPC" at JEEConf 2017
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series Database
 

Similar to Kafka timestamp offset_final

Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offsetDaeMyung Kang
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
 
London hug-samza
London hug-samzaLondon hug-samza
London hug-samzahuguk
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNblueboxtraveler
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration StoryJoan Viladrosa Riera
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Bartosz Konieczny
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...Matt Stubbs
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Timothy Spann
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafkaconfluent
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingAbdelhamide EL ARIB
 

Similar to Kafka timestamp offset_final (20)

Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)Kafka Connect and Streams (Concepts, Architecture, Features)
Kafka Connect and Streams (Concepts, Architecture, Features)
 
London hug-samza
London hug-samzaLondon hug-samza
London hug-samza
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡Apache Spark Structured Streaming + Apache Kafka = ♡
Apache Spark Structured Streaming + Apache Kafka = ♡
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!Apache Kafka - Scalable Message Processing and more!
Apache Kafka - Scalable Message Processing and more!
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark StreamingReal-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
Real-time Data Pipeline: Kafka Streams / Kafka Connect versus Spark Streaming
 

More from DaeMyung Kang

How to use redis well
How to use redis wellHow to use redis well
How to use redis wellDaeMyung Kang
 
The easiest consistent hashing
The easiest consistent hashingThe easiest consistent hashing
The easiest consistent hashingDaeMyung Kang
 
How to name a cache key
How to name a cache keyHow to name a cache key
How to name a cache keyDaeMyung Kang
 
Integration between Filebeat and logstash
Integration between Filebeat and logstash Integration between Filebeat and logstash
Integration between Filebeat and logstash DaeMyung Kang
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advanceDaeMyung Kang
 
Massive service basic
Massive service basicMassive service basic
Massive service basicDaeMyung Kang
 
Data Engineering 101
Data Engineering 101Data Engineering 101
Data Engineering 101DaeMyung Kang
 
How To Become Better Engineer
How To Become Better EngineerHow To Become Better Engineer
How To Become Better EngineerDaeMyung Kang
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lakeDaeMyung Kang
 
webservice scaling for newbie
webservice scaling for newbiewebservice scaling for newbie
webservice scaling for newbieDaeMyung Kang
 
Internet Scale Service Arichitecture
Internet Scale Service ArichitectureInternet Scale Service Arichitecture
Internet Scale Service ArichitectureDaeMyung Kang
 

More from DaeMyung Kang (20)

Count min sketch
Count min sketchCount min sketch
Count min sketch
 
Redis
RedisRedis
Redis
 
Ansible
AnsibleAnsible
Ansible
 
Why GUID is needed
Why GUID is neededWhy GUID is needed
Why GUID is needed
 
How to use redis well
How to use redis wellHow to use redis well
How to use redis well
 
The easiest consistent hashing
The easiest consistent hashingThe easiest consistent hashing
The easiest consistent hashing
 
How to name a cache key
How to name a cache keyHow to name a cache key
How to name a cache key
 
Integration between Filebeat and logstash
Integration between Filebeat and logstash Integration between Filebeat and logstash
Integration between Filebeat and logstash
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
 
Massive service basic
Massive service basicMassive service basic
Massive service basic
 
Data Engineering 101
Data Engineering 101Data Engineering 101
Data Engineering 101
 
How To Become Better Engineer
How To Become Better EngineerHow To Become Better Engineer
How To Become Better Engineer
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
 
Redis acl
Redis aclRedis acl
Redis acl
 
Coffee store
Coffee storeCoffee store
Coffee store
 
Scalable webservice
Scalable webserviceScalable webservice
Scalable webservice
 
Number system
Number systemNumber system
Number system
 
webservice scaling for newbie
webservice scaling for newbiewebservice scaling for newbie
webservice scaling for newbie
 
Internet Scale Service Arichitecture
Internet Scale Service ArichitectureInternet Scale Service Arichitecture
Internet Scale Service Arichitecture
 
Bloomfilter
BloomfilterBloomfilter
Bloomfilter
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Kafka timestamp offset_final

  • 1. Spark, and Kafka timestamp offset charsyam@naver.com
  • 2. Common Way: Kafka and Spark Streaming Server #1 Server #2 Server ... Server ... Log Aggregation Server Syslogd Kafka Logstash Spark Streaming
  • 4. Kafka Log Segments Segment is a file If you set prealloc flag, segment will be 1GB.
  • 5. What is timestamp Index? Fetching Kafka Logs by Timestamp.(From Kafka-0.10.0.2) You can query from specific timestamp range of message.(very useful) ex) From 2018-09-08 00:00:00 To 2018-09-08-23:59:59
  • 6. Related KIPs KIP-32 - Add timestamps to Kafka message : 0.10.0.0 KIP-33 - Add a time based log index : 0.10.1.0
  • 7. Kafka Log Files 00000000000155593652.log Kafka Message Data File 00000000000155593652.index OffsetIndex File 00000000000155593652.timeindex TimeIndex File There are many files for 1 segment. .txnindex, .snapshot, .deleted, .cleaned, .swap, etc
  • 9. When Kafka append Logs. @nonthreadsafe def append(largestOffset: Long, largestTimestamp: Long, shallowOffsetOfMaxTimestamp: Long, records: MemoryRecords): Unit = { …… val appendedBytes = log.append(records) …… offsetIndex.append(largestOffset, physicalPosition) timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp) …… }
  • 10. When Kafka writes Logs #1 ● Using MMAP(so OS Page Cache is very important for performance) ● Offset is stored as relative offset ○ Current Offset - Base Offset ● Offset will be returned Absolute Offset
  • 11. When Kafka writes Logs #2 OffsetIndex Append Int(4 bytes) Offset Int(4 bytes) Position
  • 12. When Kafka writes Logs #3 TimeIndex Append Long(8 bytes) Timestamp Int(4 bytes) Position
  • 13. How to fetch by timestamp #1 def fetchOffsetsByTimestamp(targetTimestamp: Long): Option[TimestampOffset] = { …… val targetSeg = { val earlierSegs = segmentsCopy.takeWhile(_.largestTimestamp < targetTimestamp) if (earlierSegs.length < segmentsCopy.length) Some(segmentsCopy(earlierSegs.length)) else None } targetSeg.flatMap(_.findOffsetByTimestamp(targetTimestamp, logStartOffset)) } }
  • 14. How to fetch by timestamp #2 def findOffsetByTimestamp(timestamp: Long, startingOffset: Long = baseOffset): Option[TimestampOffset] = { // Get the index entry with a timestamp less than or equal to the target timestamp val timestampOffset = timeIndex.lookup(timestamp) val position = offsetIndex.lookup(math.max(timestampOffset.offset, startingOffset)).position // Search the timestamp Option(log.searchForTimestamp(timestamp, position, startingOffset)).map { timestampAndOffset => TimestampOffset(timestampAndOffset.timestamp, timestampAndOffset.offset) } } Using BinarySearch For Search
  • 15. Simpe Question!!! ● Why offset index is needed? ● How to use binary search in Index File?
  • 16. How to query by timestamp in spark Convert timestamp to OffsetRange Just Create KafkaRDD with OffsetRange
  • 17. Create KafkaRDD with OffsetRange KafkaUtils.createRDD[K, V](spark.sparkContext, kafkaParamsMap, offsetRanges, PreferConsistent)
  • 18. Convert timestamp to OffsetRange val consumer = createKafkaConsumer(props) val startOffset = consumer.offsetsForTimes(topicMap) val endOffset = consumer.offsetsForTimes(topicMap)
  • 20. Log Append Flows Broker API handleProduceRequest KafkaApis.scala appendRecords ReplicaManager.scala appendToLocalLog ReplicaManager.scala appendRecordsToLeade r Partition.scala appendAsLeader Log.scala append Log.scala append LogSegment.scala
  • 21. When Segmnet is rolled? def shouldRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long, now: Long): Boolean = { val reachedRollMs = timeWaitedForRoll(now, maxTimestampInMessages) > maxSegmentMs - rollJitterMs size > maxSegmentBytes - messagesSize || (size > 0 && reachedRollMs) || offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(maxOffsetInMessages) }
  • 22. When Segmnet is rolled? def shouldRoll(messagesSize: Int, maxTimestampInMessages: Long, maxOffsetInMessages: Long, now: Long): Boolean = { val reachedRollMs = timeWaitedForRoll(now, maxTimestampInMessages) > maxSegmentMs - rollJitterMs size > maxSegmentBytes - messagesSize || (size > 0 && reachedRollMs) || offsetIndex.isFull || timeIndex.isFull || !canConvertToRelativeOffset(maxOffsetInMessages) } 1] size > maxSegmentBytes - messageSize 2] size > 0 && reachedRollMs 3] offsetIndex.isFull 4] timeIndex.isFull 5] canCovertToRelativeOffset is false
  • 23. One Cent for Using Kafka Timestamp offset ● As a default, timestamp is set as sending time by client. ● So it is not a time that log is created. ○ You should specify to use timestamp as created time of log.
  • 25. Quiz ● If timestamp is older than last timeIndex ○ How Kafka handles this?
  • 26. Original 00000000317.log TimeIndex Log1 Offset: 317 Timestamp: 10000 …... Log100 Offset: 416 Timestamp: 20000 10000, 317 20000, 416
  • 27. Append Log with old timestamp 00000000317.log TimeIndex Log1 Offset: 317 Timestamp: 10000 …... Log100 Offset: 416 Timestamp: 20000 10000, 317 20000, 416 Log101 Offset: 417 Timestamp: 20 Log200 Offset: 516 Timestamp: 30000 …... 30000, 516
  • 28. Please remember!!! ● If you use too early timestamp. ○ It can remove your all kafka-logs. ● If you use too later timestamp. ○ It is never removed.