11.
FROM confluentinc/cp-kafka-connect:7.1.1
RUN confluent-hub install --no-prompt neo4j/kafka-connect-neo4j:2.0.2
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-salesforce:2.0.4
…
docker build . -t alexwoolford/kafka-connect:1.0.1
docker push alexwoolford/kafka-connect:1.0.1
https://woolford.io/2021-05-17-confluent-cloud-to-neo4j-aura/
12.
13.
14. // upsert the event node and link it into the contact's event linked list
MATCH (contact:Contact {email: event.email})
MERGE (unsubscribeEvent:EmailUnsubscribe:Event {timestamp: apoc.date.fromISO8601(event.unsubscribe_timestamp)})
MERGE (contact)-[:LAST]->(unsubscribeEvent)
// if the contact has no FIRST event yet, this event is it
// (FOREACH over a 0- or 1-element list is the conditional-write idiom)
WITH contact, unsubscribeEvent, CASE WHEN NOT ((contact)-[:FIRST]->()) THEN [1] ELSE [] END AS needsFirst
FOREACH (i IN needsFirst | MERGE (contact)-[:FIRST]->(unsubscribeEvent))
WITH contact, unsubscribeEvent
// re-point LAST: delete the old LAST relationship and chain the old tail to the new event
MATCH (unsubscribeEvent)<-[:LAST]-(contact)-[oldRel:LAST]->(oldLast)
DELETE oldRel
MERGE (oldLast)-[:NEXT]->(unsubscribeEvent)
#TODO: Kafka/Neo graphic
When to use Kafka and Neo4j together: transactional
#TODO: name, email
Timeliness examples:
Supply chain: if two companies use the same component and there’s a shortage, the company that discovers the shortage first has a huge advantage: it can buy up the available inventory. That lets it keep shipping finished goods AND prevents its competitor from doing so.
Clickstream: every click shows intent.
What is long-polling?
Plugin deprecation: RIP, CDC (for now)
Streams plugin:
proper eventing
CDC (Debezium-style before/after, schema info)
no need to deploy/monitor/manage external Connect cluster
includes procedures, e.g. publish/consume directly to Kafka from function call (streams.publish and streams.consume)
being deprecated
edge-case where data loss is possible (asynchronous producer)
not available on Aura
no schema registry support
Connect plugin:
no data loss edge-case
Connect source/sink from over 100 different technologies
state is stored in Kafka, so it works on Aura
long polling depends on a timestamp or incrementing-integer property to detect new data
no CDC (today)
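A minimal sketch of what that long polling looks like for the Neo4j source connector (connector class and the $lastCheck parameter are from the 2.x connector docs; the topic name and query here are made up for illustration, so adjust to your version and model):

```json
{
  "connector.class": "streams.kafka.connect.source.Neo4jSourceConnector",
  "topic": "email-events",
  "neo4j.streaming.poll.interval.msecs": 5000,
  "neo4j.streaming.property": "timestamp",
  "neo4j.streaming.from": "LAST_COMMITTED",
  "neo4j.source.query": "MATCH (e:Event) WHERE e.timestamp > $lastCheck RETURN e.email AS email, e.timestamp AS timestamp"
}
```

The connector re-runs the query on each poll, substituting the last committed value of the streaming property for $lastCheck — which is exactly why the property needs to be a timestamp or an incrementing integer.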
Streams: everything inside Neo4j JVM; Neo4j can take a while to start
Connect: outer turquoise box == connect JVM; red box == connect tasks
See /Users/alexwoolford/PycharmProjects/scratch/kafka_outtage.py for a practical example of the data-loss edge case.
See Neo4j-Streams deprecation notice: https://neo4j.com/labs/kafka/4.1/consumer/
#TODO: add source/sink labels to Neo4j-Streams
Drama
More detail (e.g. licensing, etc…) available at: https://docs.google.com/spreadsheets/d/1h2DBG5kqzeihDXnPZVdP93QbtYt8yau8Ri86C4_6B9I/edit?usp=sharing
Each Connect instance runs inside an OS (typically a stripped-down version of Linux inside a Docker container).
The instance has plugins installed inside it. There are more than 200 possible plugin types to choose from.
In addition to plugins there are also single message transforms (SMTs). These are used to do [typically simple] stateless manipulations to the payload before it’s written to Kafka or sunk to some other system.
SMTs are optional. In the example on the slide, no SMT is used in the top job (x).
A connector job consists of one or more tasks. These are often spread over multiple instances for parallelization and fault tolerance.
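To make the pieces above concrete, here is a sketch of a sink job config with one SMT and two tasks (the connector class and ReplaceField transform are real; the topic and field names are made up for illustration, and older Connect versions spell the "exclude" property "blacklist"):

```json
{
  "connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
  "topics": "email-unsubscribe",
  "tasks.max": "2",
  "transforms": "dropInternal",
  "transforms.dropInternal.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.dropInternal.exclude": "internal_id"
}
```

With tasks.max at 2, Connect can spread the two tasks over two workers for parallelism and fault tolerance; the SMT strips the internal_id field from every record before it reaches the sink.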
# TODO: add antennas
Are connectors running in distributed or standalone mode? They should be running in distributed mode.
Are the correct number of tasks configured for the required throughput? Don't exceed 20 tasks per worker in production.
Are Connect workers configured correctly? See https://docs.confluent.io/home/connect/self-managed/userguide.html#configuring-workers
Have you read the Monitoring Connect Operations Guide? See https://docs.confluent.io/platform/current/connect/monitoring.html
Have you configured a dead letter queue to handle bad records?
Are task statuses monitored via the REST API to ensure tasks haven’t failed?
In a CDC source use case, is a "native" CDC connector used instead of the JDBC source connector? The JDBC source connector puts added load on the source system. Unfortunately, this isn’t an option for the Connect Neo4j source.
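On the task-status point: a small sketch of how you might flag failed tasks from the JSON that Connect’s GET /connectors/&lt;name&gt;/status endpoint returns (the payload shape below mirrors that endpoint; the connector name and trace are invented):

```python
import json

def failed_tasks(status):
    """Given the parsed JSON from Connect's GET /connectors/<name>/status
    endpoint, return (task_id, first trace line) pairs for FAILED tasks."""
    return [
        (task["id"], task.get("trace", "").splitlines()[0] if task.get("trace") else "")
        for task in status.get("tasks", [])
        if task["state"] == "FAILED"
    ]

# example payload shaped like the real endpoint's response
status = json.loads("""
{
  "name": "neo4j-sink",
  "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
  "tasks": [
    {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    {"id": 1, "state": "FAILED", "worker_id": "10.0.0.2:8083",
     "trace": "org.apache.kafka.connect.errors.ConnectException: boom"}
  ]
}
""")
print(failed_tasks(status))  # → [(1, 'org.apache.kafka.connect.errors.ConnectException: boom')]
```

Wire this up to whatever alerting you already have; the point is that task failure is not surfaced anywhere unless you poll for it.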
Don’t use Zookeeper.
MERGE works best when there’s an index.
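For example, backing the unsubscribe query above with a uniqueness constraint on the MERGE’d property (Neo4j 4.4+ syntax; older versions use ON/ASSERT instead of FOR/REQUIRE):

```cypher
CREATE CONSTRAINT contact_email IF NOT EXISTS
FOR (c:Contact) REQUIRE c.email IS UNIQUE;
```

Without the backing index, every MERGE is a label scan; with it, the lookup half of MERGE is an index seek.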
Show how to access Connect logs
Talk about Neo4j locking w/ connect
# TODO: locking bullet
The same data might be stored multiple ways to provide different access patterns.
Show my clickstream and the two connectors.
Show how easy it is to plug enrichment logic into events in Kafka, and then use those events to enrich the graph.
Snatch 18:20
Show Streams visualization by pasting RTD topology into visualizer
[main] INFO io.woolford.rtd.stream.RtdStreamer - Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [rtd-bus-position])
      --> KSTREAM-TRANSFORM-0000000001
    Processor: KSTREAM-TRANSFORM-0000000001 (stores: [busPositionStore])
      --> KSTREAM-FILTER-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-FILTER-0000000002 (stores: [])
      --> KSTREAM-MAPVALUES-0000000003
      <-- KSTREAM-TRANSFORM-0000000001
    Processor: KSTREAM-MAPVALUES-0000000003 (stores: [])
      --> KSTREAM-SINK-0000000004
      <-- KSTREAM-FILTER-0000000002
    Sink: KSTREAM-SINK-0000000004 (topic: rtd-bus-position-enriched)
      <-- KSTREAM-MAPVALUES-0000000003
https://zz85.github.io/kafka-streams-viz/
Note that ‘mapValues’ cannot change the record key, so it never triggers a repartition; ‘map’ can change the key, so downstream stateful operations may force one. Implication: use mapValues if you can.
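A sketch of the difference in the DSL (hypothetical stream and types, just to show the shape of the two calls):

```java
// mapValues: key untouched, no repartition needed downstream
KStream<String, Position> enriched =
    positions.mapValues(pos -> enrich(pos));

// map: can change the key, so a downstream join/aggregate forces a repartition
KStream<String, Position> rekeyed =
    positions.map((k, pos) -> KeyValue.pair(pos.getRouteId(), pos));
```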
Docs: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#stateless-transformations
Stateful operations: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#stateful-transformations
Aggregating: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-aggregating
Joining: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-joins
Windowing: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-windowing
Custom: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-process
TODO: split into two slides so it separates Kafka Streams into its own slides and shows which to use where
Flavors of windowing: hopping, tumbling, session, sliding
See https://developer.confluent.io/learn-kafka/kafka-streams/windowing/
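The window-assignment arithmetic is easy to sketch. This mirrors how hopping windows are aligned to multiples of the advance interval (a tumbling window is just a hopping window where advance == size); it’s a toy model, not the Kafka Streams implementation:

```python
def hopping_window_starts(ts_ms, size_ms, advance_ms):
    """Return the start times of all hopping windows containing ts_ms.
    Windows are [start, start + size) and aligned to multiples of advance."""
    start = ts_ms - (ts_ms % advance_ms)  # latest aligned start <= ts
    starts = []
    while start > ts_ms - size_ms and start >= 0:
        starts.append(start)
        start -= advance_ms
    return sorted(starts)

def tumbling_window_start(ts_ms, size_ms):
    # a tumbling window is a hopping window with advance == size,
    # so each timestamp falls into exactly one window
    return hopping_window_starts(ts_ms, size_ms, size_ms)[0]

print(hopping_window_starts(7, 5, 2))  # → [4, 6]: overlapping windows [4,9) and [6,11)
print(tumbling_window_start(7, 5))     # → 5: the single window [5,10)
```

The overlap is the whole point of hopping windows: one event contributes to several aggregates, which is what gives you smooth moving averages.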
For sessionization, consider adding to https://github.com/alexwoolford/snowplow-kafka-streams
https://docs.confluent.io/platform/current/connect/transforms/overview.html
https://jsonpath.com/ <- handy to test JSON path.
https://github.com/alexwoolford/kafka-connect-transform-jolt
#TODO: add diagram showing where SMT gets executed. Also, show multiple transforms; mention ability to write your own.
Connect has a plugins folder (the plugin.path worker property); connector plugin jars go in there.
Go to Dockerhub and get the latest version of confluentinc/cp-kafka-connect
http deepthought.woolford.io:8083/connector-plugins
#TODO: “create a connect worker image…”
Caveats: Kafka only guarantees ordering within a partition, so if your events come from different topics, the connectors had better not fall behind, or events can be applied out of order.
If strict ordering is an absolute must-have, then we’d need to have all the events for any given customer in a single partition, and use APOC’s
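Why one partition per key preserves order: Kafka’s default partitioner hashes the record key, so every record with the same key lands on the same partition. Toy sketch only — the real default partitioner uses murmur2 over the key bytes, while crc32 here just stands in to show the same-key, same-partition property:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # real Kafka: murmur2(key_bytes) % num_partitions;
    # crc32 is used here only because it's in the stdlib and deterministic
    return zlib.crc32(key) % num_partitions

p = partition_for(b"customer-42", 6)
# every event for customer-42 hashes to the same partition,
# so the relative order of that customer's events is preserved
assert all(partition_for(b"customer-42", 6) == p for _ in range(100))
```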
https://dmccreary.medium.com/how-to-explain-index-free-adjacency-to-your-manager-1a8e68ec664a
Discussion of BTree when querying an index.
https://en.wikipedia.org/wiki/Strangler_fig
Martin Fowler: use strangler to avoid the risk of a massive re-write
Consider showing Snowplow recommender API
https://zz85.github.io/kafka-streams-viz/
This is particularly useful if you find yourself working on a Kafka Streams job that was written by someone else.
select ID_RESP_H, getGeoForIp(ID_RESP_H) from CONN emit changes;
Allows downstream consumers to restore state after a crash or system failure.
Great blog article that explains the detail: https://towardsdatascience.com/log-compacted-topics-in-apache-kafka-b1aa1e4665a7
Show “cleanup policy” in C3.
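The retention rule behind compaction is easy to sketch: only the latest value per key survives, and a null value (a tombstone) deletes the key. Toy model only — real compaction also keeps an uncompacted head of the log:

```python
def compact(log):
    """Toy model of a compacted topic: log is a list of (key, value) records,
    where value None is a tombstone. Returns the surviving latest value per key."""
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)   # tombstone: delete the key
        else:
            latest[key] = value     # later records shadow earlier ones
    return latest

log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k2", None)]
print(compact(log))  # → {'k1': 'v2'}
```

This is exactly why a compacted topic can act as a changelog for restoring state: replaying it from the beginning yields the current value of every live key.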
https://docs.confluent.io/current/schema-registry/avro.html#summary
Quick demo: https://github.com/alexwoolford/multiple-event-types-demo
#TODO: change to non-binary: MFX
SMT (single message transform):
simple stateless transformations
Streams DSL:
9/10 use cases
aggregations, joins, windowing, custom processors
Streams PAPI:
more flexible; harder to use
Possible to combine DSL and PAPI in the same streaming job.
https://docs.confluent.io/platform/current/streams/developer-guide/dsl-api.html
https://docs.confluent.io/platform/current/streams/developer-guide/processor-api.html
^^ show layers (SMT, DSL, PAPI) and where those components exist
https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
The Kafka API has become a standard, and can be consumed in many guises (not just Apache Kafka or Confluent).
#TODO: add a graph version of this, and show how Cypher is becoming a standard (e.g. Neptune, Memgraph)