Before we begin, let us see what everyone’s experience with Kafka is.
Can you please raise your hand if you run Kafka right now, at any scale?
Now keep your hand raised if you process 1 million messages a day.
1 billion.
1 trillion.
Okay cool, we’ve got a pretty good mix here. One last question before we move on: how many of you have a ‘Kakfa’ typo somewhere in your automation source code for Kafka?
Hello everybody. Welcome to our presentation.
My name is Maulin, and I am here with my colleagues Thomas and Sanat; we are from the Kafka team at PayPal.
I’m here to talk about the challenges we face at PayPal managing multiple geo-distributed Kafka clusters and the solutions we apply.
This is the agenda for this session.
We’ll start with some details about Kafka @ PayPal in its present state.
Then we will talk about how we enabled TLS and ACLs at PayPal and share some related performance numbers.
Finally, we will conclude by highlighting some of the cool and challenging things we are working on.
Our Kafka ecosystem processes 400 billion messages a day across more than 50 clusters, with over 5,000 topics and 7 petabytes of total disk space.
We’ve been running Kafka for a while now, starting with version 0.8 and our current version is 1.1.
Our tech stack has grown enormously over the past few years, with clients using languages ranging from Java to Python to Node.js.
We also have many different application frameworks connected to Kafka.
Our clusters are multi-tenant, which means each cluster is generic and hosts multiple use-cases.
Our Kafka ecosystem is also distributed across multiple security and availability zones.
Let’s take a look at data pipelines.
At PayPal, some use-cases for Kafka are user behavioral tracking, experimental testing such as A/B testing, merchant SLA monitoring, and risk & compliance analytics.
All these use-cases generate data in the form of business events, or application logs, or application metrics, or any combination of the three.
This data flows through Kafka via batch processing or real-time streaming, and it ends up in the frameworks-and-platforms layer, where it is used for analytics or other processing.
Additionally, it is very common for flows to have multiple hops where data is pumped into Kafka, consumed by a framework, and then additional data is pumped back into Kafka and consumed by yet another framework.
Thank you Maulin. Hi everyone, I am Thomas.
As Maulin mentioned before, the Kafka team maintains a large Kafka ecosystem at PayPal, with over 400 billion messages processed by Kafka every day.
As a fintech company, security has always been our highest priority.
So securing Kafka at PayPal became the biggest project for the Kafka team this year.
Now, I am going to talk about how we enabled mutual TLS at PayPal.
Before moving to Kafka TLS, let's quickly go through some terminologies.
SSL and its successor, TLS, are protocols for establishing authenticated and encrypted connections between networked computers.
Although the SSL protocol was deprecated with the release of TLS 1.0 in 1999, it is still common to refer to these related technologies as “SSL” or “SSL/TLS.”
In terms of SSL keys, we basically mean the private key and the public key.
The public key is used to encrypt, while the private key is used to decrypt.
Public keys can be made available to anyone, hence the term public. The private key, on the other hand, must never be shared.
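To make that encrypt/decrypt asymmetry concrete, here is a minimal sketch using only JDK classes. The class name, the message, and the RSA-2048 choice are just illustrative; TLS uses these same primitives inside a more elaborate handshake.

```java
import javax.crypto.Cipher;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Arrays;

public class PublicKeyDemo {
    /** Encrypts with the public key, decrypts with the private key, and checks the round trip. */
    static boolean roundTrip(byte[] message) throws Exception {
        // Generate an RSA key pair: the public half may be shared, the private half must not be.
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair pair = kpg.generateKeyPair();

        // Anyone holding the public key can encrypt...
        Cipher encrypt = Cipher.getInstance("RSA");
        encrypt.init(Cipher.ENCRYPT_MODE, pair.getPublic());
        byte[] ciphertext = encrypt.doFinal(message);

        // ...but only the private-key holder can decrypt.
        Cipher decrypt = Cipher.getInstance("RSA");
        decrypt.init(Cipher.DECRYPT_MODE, pair.getPrivate());
        byte[] recovered = decrypt.doFinal(ciphertext);

        return Arrays.equals(message, recovered);
    }

    public static void main(String[] args) throws Exception {
        byte[] message = "hello kafka".getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrip(message)); // prints true
    }
}
```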
An SSL certificate provides a verified link between a public key and the entity it claims to belong to.
A Certificate Authority (CA) is the third party that signs certificates.
A trusted CA is a well-known certificate issuer.
Let’s take a look at how to enable TLS for open source Kafka.
There are four main steps to get this done.
Mapping back to the SSL terminology, the steps are: generate the keys and certificates, create a CA, and sign the certificates.
Then configure the related Kafka properties.
The first three steps can be done from the command line.
Let’s just take a look at how many commands you need to run to get this done.
As you can see, you need to run eight keytool and openssl commands to complete the first three steps for a single host.
After running all these commands, you end up with two things.
A keystore, which contains the private key and the certificates with their corresponding public keys.
A truststore, which contains certificates from the other parties you expect to communicate with, or from the Certificate Authorities that you trust.
Now we have the keys and certificates as JKS keystore and truststore files.
The next step is to configure the Kafka broker and client properties.
There are seven properties to configure on the broker side to tell the broker where to load the keystore and truststore files, along with the related credentials.
The client-side configuration is very similar.
However, with one-way TLS a keystore is not necessary on the client side.
After this step, everything is ready for a TLS connection.
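As a rough sketch, the broker and client configuration look like this. The property names are standard Kafka SSL settings; the paths, passwords, and hostname are placeholder values.

```properties
# server.properties (broker side)
listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=changeit
# ssl.client.auth=required  # uncomment to require mutual TLS

# client.properties (one-way TLS: truststore only, no keystore needed)
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/kafka.client.truststore.jks
ssl.truststore.password=changeit
```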
Let's take a look at how one-way TLS works.
As you can see in the diagram, there is a file-based keystore on the Kafka broker and a truststore on the client application.
When the Kafka client connects to the broker, the broker presents the public certificate from its keystore, and the client application verifies it against its truststore.
If the verification succeeds, the TLS connection is established.
This is the most common scenario for TLS connections. However, while the client application now trusts the broker it connects to, the broker still knows nothing about the application.
That's why we need mutual auth.
This diagram shows the workflow for mutual TLS.
As you can see, compared with the previous slide, both the Kafka broker and the Kafka client application now have a file-based keystore and truststore.
Server authentication happens exactly as in the one-way case. On top of that, the client presents the certificate from its keystore, and the broker verifies it against its truststore to authenticate the client.
After both authentications succeed, the TLS connection is established, and it is guaranteed that the client and the broker know each other.
Is Kafka secured now? Yes. But is this what we want at PayPal?
Maybe not. With what open-source Kafka provides out of the box, it is not easy for us to roll out Kafka TLS.
Let's take a look at our challenges in enabling TLS for Kafka.
Due to InfoSec and AppSec restrictions, file-based security material is not allowed at PayPal.
Deploying this security material to thousands of broker hosts and hundreds of client hosts is also a big challenge for us.
On top of that, key rotation and credential security require extra thought.
Before moving to the solution, let us take one step back and look at what we already have at PayPal.
Let me introduce PayPal's Key Management Service. As you can guess from the name, it is an in-house key management service, similar to HashiCorp Vault or AWS KMS.
The Key Management Service acts as a CA that issues certificates for all internal applications, and it manages key rotation.
I also want to show you how clients connect to Kafka brokers at PayPal.
Let me introduce another service, the Kafka configuration service. Basically, clients send a request to the configuration service with a topic name, and it returns all the properties required to reach that topic.
We built this service because we wanted to abstract the Kafka clusters away from the Kafka clients. So instead of connecting to Kafka with a hard-coded bootstrap server list, a Kafka client fetches its entire configuration from the config service and uses it to connect.
With the config service, Kafka clients no longer need to worry about the bootstrap server list; all they need is the topic name. And the Kafka team can easily maintain the clusters by adding and removing nodes without worrying about customer impact.
Given these challenges and the two services we already have, our approach is very clear: we need a way to fetch the keystore and truststore from the Key Management Service and load them on the client and broker side.
We changed the Kafka source code on both the client side and the broker side and introduced two interfaces for customized keystore and truststore loading.
With an implementation class for these interfaces, you can load the keystore and truststore from wherever you want, whether on disk or in memory.
At PayPal, the Kafka team provides these implementation classes for clients.
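To make the idea concrete, here is a minimal sketch using only JDK classes. The interface name, method, and in-memory example below are illustrative assumptions, not PayPal's actual internal code. The point is that java.security.KeyStore can be loaded from any InputStream, so the bytes can come from a KMS response instead of a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.security.KeyStore;

/** Hypothetical loader abstraction: where the keystore bytes come from is up to the implementation. */
interface KeystoreLoader {
    KeyStore loadKeystore() throws Exception;
}

/**
 * Illustrative implementation that loads a keystore from bytes already in memory,
 * as they would be right after a fetch from a key-management service.
 */
class InMemoryKeystoreLoader implements KeystoreLoader {
    private final byte[] keystoreBytes;
    private final char[] password;

    InMemoryKeystoreLoader(byte[] keystoreBytes, char[] password) {
        this.keystoreBytes = keystoreBytes;
        this.password = password;
    }

    @Override
    public KeyStore loadKeystore() throws Exception {
        KeyStore ks = KeyStore.getInstance("JKS");
        // KeyStore.load accepts any InputStream, so no file on disk is required.
        ks.load(new ByteArrayInputStream(keystoreBytes), password);
        return ks;
    }
}

public class LoaderDemo {
    /** Builds an empty JKS in memory (standing in for a KMS response) and reloads it via the loader. */
    static KeyStore demo() throws Exception {
        char[] pw = "changeit".toCharArray();
        KeyStore empty = KeyStore.getInstance("JKS");
        empty.load(null, pw);                 // initialize an empty store
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        empty.store(out, pw);                 // serialize to bytes without touching disk
        return new InMemoryKeystoreLoader(out.toByteArray(), pw).loadKeystore();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("entries: " + demo().size()); // prints "entries: 0" for the empty store
    }
}
```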
Now let's take a look at the TLS workflow at PayPal.
The Kafka client requests its configuration from the configuration service, and the config service returns the configuration along with the keystore loader and truststore loader classes.
The loader classes fetch the keystore and truststore from the Key Management Service and load them to connect to the Kafka broker.
Clients do not even notice the change, because nothing needs to change on their side.
Let me show you how simple this interface is to use.
You only need one configuration entry, and your connection to Kafka is secured with SSL!
You don't need to worry about locations and credentials; everything is handled inside the loader class.
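For illustration, the client-facing result is roughly a single property pointing at the loader class. The property name and class below are hypothetical stand-ins for the internal ones:

```properties
# Hypothetical: the real property name and loader class are PayPal-internal.
ssl.keystore.loader.class=com.paypal.kafka.security.KmsKeystoreLoader
```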
The Kafka client (producer or consumer) initiates the authentication.
The Kafka server then authenticates the client.
In Kafka, authentication happens when the connection to the broker is established, whereas authorization is verified on each request.
These handlers are configured through the client JAAS config and update the credentials on the Subject.
AuthenticateCallbackHandler.handle(Callback), configured via sasl.login.callback.handler.class (org.apache.kafka.common.security.auth.AuthenticateCallbackHandler), loads the token/credentials and returns them through the Callback.
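In stock Kafka (2.0 and later), the client-side wiring for a custom login callback handler looks roughly like this; the handler class name is a placeholder:

```properties
security.protocol=SASL_SSL
sasl.mechanism=OAUTHBEARER
sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required;
# Loads the token/credentials and returns them through the callback
sasl.login.callback.handler.class=com.example.auth.MyLoginCallbackHandler
```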
This is the workflow if we integrate with the in-house key management system as-is.
You can see that with the current infrastructure, clients would need to call the key management system, convert the result into file-based security material, and then connect to the Kafka cluster.
We think that conversion step is unnecessary.
Instead, the KMS call happens inside the loader implementation that we provide.
Alright! We are nearing the end of the presentation, and I would like to highlight some of the cool and challenging things we are working on in the Kafka team.
With that, I would like to thank you all for listening! We can take questions if you have any. Thank you!