Slides for the talk given on 20-07-2019 at Nairobi JVM. The talk covered building data pipelines with Apache Kafka as a message broker (enterprise service bus) and Apache Spark as a distributed computing engine that enables efficient processing of large volumes of data.
9. 7 Vs of Big data
● Volume
● Velocity
● Veracity
● Variety
● Variability
● Visualization
● Value
10. What could go wrong?
● Data loss
● Data duplication
● Data corruption
● Data formatting
● Latency
● Process failure
● Flow control
● Data velocity
● Data volume
13. Apache Kafka in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Kafka is a distributed streaming platform capable of handling trillions of
events a day. It was initially conceived as a messaging system.
14. Apache kafka
Kafka provides the following three key capabilities of a streaming platform:
● Publish and subscribe to streams of records, similar to a message queue
or enterprise messaging system (EMS).
● Store streams of records in a fault tolerant durable way.
● Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
● Building real-time streaming data pipelines that reliably get data
between systems or applications
● Building real-time streaming applications that transform or react to
the streams of data
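The publish/subscribe and storage capabilities above can be pictured with a plain-JVM analogy: a minimal append-only log sketch (stdlib only; `MiniLog`, `append`, and `poll` are illustrative names, not Kafka classes).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal analogy for Kafka's model: an append-only log that producers
// write to and consumers read from at their own offset.
public class MiniLog {
    private final List<String> records = new ArrayList<>();

    // Producer side: append a record; the log keeps it (durability in
    // real Kafka comes from replicated disk storage).
    public synchronized int append(String record) {
        records.add(record);
        return records.size() - 1; // offset of the new record
    }

    // Consumer side: read everything from a given offset onward.
    // Reading does not remove records, so many consumers can each
    // keep their own position in the same log.
    public synchronized List<String> poll(int fromOffset) {
        return new ArrayList<>(records.subList(fromOffset, records.size()));
    }

    public static void main(String[] args) {
        MiniLog topic = new MiniLog();
        topic.append("event-1");
        topic.append("event-2");
        // Two independent consumers at different offsets:
        System.out.println(topic.poll(0)); // [event-1, event-2]
        System.out.println(topic.poll(1)); // [event-2]
    }
}
```

The key difference from a classic message queue is visible here: records are retained and addressed by offset, so multiple consumer groups can replay the same stream independently.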
19. KAFKA APIS
● Producer API: allows applications to publish a stream of records to one or more
Kafka topics.
● Consumer API: allows applications to subscribe to one or more topics and
process the stream of records provided to them.
● Streams API: allows an application to act as a stream processor, consuming an
input stream from one or more topics and producing an output stream to one or
more output topics, effectively transforming the input streams to output
streams
● Connector API: allows building and running reusable producers or consumers that
connect Kafka topics to existing applications or data systems.
● AdminClient API: supports managing and inspecting topics, brokers, ACLs, and
other Kafka objects.
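A minimal sketch of the Producer and Consumer APIs in Java. This assumes the `kafka-clients` dependency on the classpath and a broker running at `localhost:9092`; the topic name `test` and group id `demo-group` are assumptions, and the snippet needs a live broker to actually run.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
    public static void main(String[] args) {
        // Producer API: publish one record to the "test" topic.
        Properties prodProps = new Properties();
        prodProps.put("bootstrap.servers", "localhost:9092");
        prodProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        prodProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
            producer.send(new ProducerRecord<>("test", "key-1", "hello"));
        }

        // Consumer API: subscribe to the topic and poll for records.
        Properties consProps = new Properties();
        consProps.put("bootstrap.servers", "localhost:9092");
        consProps.put("group.id", "demo-group");
        consProps.put("auto.offset.reset", "earliest");
        consProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
            consumer.subscribe(Collections.singletonList("test"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```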
20. KAFKA CLI TOOLS
● Topic creation: bin/kafka-topics.sh --create --zookeeper
localhost:2181 --replication-factor 1 --partitions 5 --topic test
● Publish to topic: bin/kafka-console-producer.sh --broker-list
localhost:9092 --topic test
● Consume from topic: bin/kafka-console-consumer.sh --bootstrap-server
localhost:9092 --topic test --from-beginning
22. Apache Spark in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Spark is a unified analytics engine for large-scale data processing. Its
design focuses on speed, ease of use, generality, and deployment across multiple
environments. It provides support for the following languages: Scala, Java,
Python, R, and SQL.
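As a sketch of what that ease of use looks like in practice, here is a minimal word count on Spark's Java RDD API (assuming the `spark-core` dependency on the classpath; the input path `input.txt`, output path `counts-output`, and app name are illustrative):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process on all cores; the same code
        // runs unchanged on a cluster with a different master URL.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt"); // assumed input path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum); // sums counts per word across partitions
            counts.saveAsTextFile("counts-output"); // assumed output path
        }
    }
}
```

The transformations (`flatMap`, `mapToPair`, `reduceByKey`) are lazy; Spark builds an execution plan and only runs it, in parallel across partitions, when the `saveAsTextFile` action is called.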
27. Apache Cassandra in ARCHITECTURE
[Pipeline diagram: SOURCE → INGESTION → PROCESSING → STORAGE → VISUALIZATION]
Apache Cassandra is a distributed NoSQL database for managing large amounts of
structured data across many commodity servers, providing a highly available
service with no single point of failure.