Kerberizing Spark: Spark Summit East
Jorge López-Malla Matute (jlopezm@stratio.com)
Abel Rincón Matarranz (arincon@stratio.com)
INDEX
1. Kerberos
● Introduction
● Key concepts
● Workflow
● Impersonation
2. Use Case
● Definition
● Workflow
● Crossdata in production
3. Stratio Solution
● Prerequisites
● Driver side
● Executor side
● Final result
4. Demo time
● Demo
● Q&A
Presentation
JORGE LÓPEZ-MALLA
After working with traditional processing methods, I started to do some R&D Big Data projects and fell in love with the Big Data world. Currently I'm working on some awesome Big Data projects and tools at Stratio.
SKILLS
Presentation
ABEL RINCÓN MATARRANZ
SKILLS
Our company
Our product
Kerberos
• What is Kerberos
○ Authentication protocol / standard / service
■ Secure
■ Single sign-on
■ Trust based
■ Mutual authentication
Kerberos key concepts
• Client/Server → Do you need an explanation???
• Principal → Identifies a unique client or service
• Realm → Identifies an environment, company, domain …
○ DEMO.EAST.SUMMIT.SPARK.ORG
• KDC → The actor that manages the tickets
• TGT → Ticket that holds the client session
• TGS → Ticket that holds the client-service session
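For example (illustrative principals, not from the deck): a user principal looks like user1@DEMO.EAST.SUMMIT.SPARK.ORG, while a service principal usually carries a host instance, e.g. crossdata/server1.demo@DEMO.EAST.SUMMIT.SPARK.ORG.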
Kerberos Workflow
1. The client retrieves its principal and secret
2. The client performs a TGT request
3. The KDC returns the TGT
4. The client requests a TGS using the TGT
5. The KDC returns the TGS
6. The client requests a service session using the TGS
7. The service establishes a secure connection directly with the client
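From a JVM client's point of view, steps 1-3 are usually hidden behind a keytab login. A minimal sketch using Hadoop's UserGroupInformation (the principal and keytab path are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
// Obtain a TGT from the KDC, using the keytab as the client secret
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "user1@DEMO.EAST.SUMMIT.SPARK.ORG",
  "/etc/security/keytabs/user1.keytab")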
Kerberos workflow 2
(Diagram: Client, Service and Backend each authenticate against the AS/KDC.)
● Client: principal user1 → TGT (user1) → TGS user1-service1 (tgsUS)
● Service: principal Service1 → TGT (Service1) → TGS service1-backend1 (tgsSB)
● Backend: principal backend1 → TGT (backend1)
● The client presents tgsUS to the service as user1; the service presents tgsSB to the backend as service1.
Kerberos workflow - Impersonation
(Diagram: the same flow, but now the service reaches the backend on behalf of the client.)
● Client: principal user1 → TGT (user1) → TGS user1-service1 (tgsUS)
● Service: principal Service1 → TGT (Service1) → TGS service1-backend1 (tgsSB)
● Backend: principal backend1 → TGT (backend1)
● The client presents tgsUS as user1; the service presents tgsSB to the backend as service1 impersonating user1.
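Hadoop-based services implement this impersonation with proxy users: the service authenticates with its own keytab and then acts on behalf of the end user. A minimal sketch with Hadoop's UserGroupInformation (principal, keytab path and user name are illustrative; the service principal also needs proxy grants in the Hadoop configuration):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

val serviceUgi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "service1@DEMO.EAST.SUMMIT.SPARK.ORG",
  "/etc/security/keytabs/service1.keytab")
// Impersonate user1 while authenticating with service1's credentials
val proxyUgi = UserGroupInformation.createProxyUser("user1", serviceUgi)
proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // any storage access here is performed (and audited) as user1
  }
})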
Use Case
• Stratio Crossdata is a distributed framework and a fast, general-purpose computing system powered by Apache Spark
• It can be used both as a library and as a server.
• Crossdata Server: provides a multi-user environment for SparkSQL, giving a reliable architecture with high availability and scalability out of the box
• To do so, it uses both native queries and Spark
• Crossdata Server has a single long-lived SparkContext to execute all its Spark queries
• Crossdata can use YARN, Mesos or Standalone as its resource manager
Crossdata as Server
(Diagram: the Crossdata shell sends "sql> select * from table1" to the Crossdata server, which acts as the Spark Driver. The Master schedules Executor-0 and Executor-1 on Worker-1 and Worker-2; their Task-0 and Task-1 read from HDFS, and the server returns the result to the shell.)
sql> select * from table1
-------------------------
|id | name |
-------------------------
|1 | John Doe |
Crossdata in production
• Production projects need runtime impersonation to comply with AAA (Authentication, Authorization and Audit) at the storage layer.
• Crossdata allows several users per execution.
• None of Spark's resource managers allows us to impersonate at runtime.
• Moreover, Standalone as a resource manager does not provide any Kerberos feature.
Prerequisites
• The keytab has to be accessible on all the cluster machines
• The keytab must provide proxy grants (in Hadoop this typically means hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups entries in core-site.xml)
• The Hadoop client configuration must be located in the cluster
• Each user, both proxy and real, must have a home directory in HDFS
Introduction
• Spark accesses the storage system both in the Driver and in the Executors.
• On the Driver side, both Spark Core and SparkSQL access the storage system.
• Executors always access the storage via a Task.
• As Streaming uses the same classes as Spark Core and SparkSQL, the same solution is usable by Streaming jobs.
KerberosUser (Utils)
object KerberosUser extends Logging with UserCache {

  // Globally sets the proxy user to impersonate (kept in the UserCache)
  def setProxyUser(user: String): Unit = proxyUser = Option(user)

  // Public method: retrieves the UGI for a user, only when Kerberos is configured
  def getUserByName(name: Option[String]): Option[UserGroupInformation] = {
    if (getConfiguration.isDefined) {
      userFromKeyTab(name)
    }
    else None
  }

  // Chooses between the real (keytab) user and a cached/new proxy user
  private def userFromKeyTab(proxyUser: Option[String]): Option[UserGroupInformation] = {
    if (realUser.isDefined) realUser.get.checkTGTAndReloginFromKeytab()
    (realUser, proxyUser) match {
      case (Some(_), Some(proxy)) => users.get(proxy).orElse(loginProxyUser(proxy))
      case (Some(_), None) => realUser
      case (None, None) => None
    }
  }

  // Configuration: reads principal and keytab from the Spark conf
  private lazy val getConfiguration: Option[(String, String)] = {
    val principal = env.conf.getOption("spark.executor.kerberos.principal")
    val keytab = env.conf.getOption("spark.executor.kerberos.keytab")
    (principal, keytab) match {
      case (Some(p), Some(k)) => Option(p, k)
      case _ => None
    }
  }
}
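The users cache, realUser and loginProxyUser come from UserCache and are not shown on the slides; a plausible sketch of such a helper (an assumption, not the actual Stratio code):

private def loginProxyUser(proxy: String): Option[UserGroupInformation] = {
  realUser.map { real =>
    // Wrap the real (keytab) user in a proxy UGI and cache it for reuse
    val ugi = UserGroupInformation.createProxyUser(proxy, real)
    users.put(proxy, ugi)
    ugi
  }
}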
Wrappers (Utils)
// Runs funct(inputParameters) under doAs when a (proxy) user is available
def executeSecure[U, T](proxyUser: Option[String],
                        funct: (U => T),
                        inputParameters: U): T = {
  KerberosUser.getUserByName(proxyUser) match {
    case Some(user) =>
      user.doAs(new PrivilegedExceptionAction[T]() {
        @throws(classOf[Exception])
        def run: T = funct(inputParameters)
      })
    case None => funct(inputParameters)
  }
}

// Runs a deferred, parameterless computation under doAs
def executeSecure[T](exe: ExecutionWrp[T]): T = {
  KerberosUser.getUser match {
    case Some(user) =>
      user.doAs(new PrivilegedExceptionAction[T]() {
        @throws(classOf[Exception])
        def run: T = exe.value
      })
    case None => exe.value
  }
}

// Defers evaluation of a by-name block so it can run inside doAs
class ExecutionWrp[T](wrp: => T) {
  lazy val value: T = wrp
}
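Later slides call these wrappers as KerberosFunction.executeSecure(...), so they presumably live in a KerberosFunction object. An illustrative call, with a made-up function and path:

val count = KerberosFunction.executeSecure(
  Some("user1"),
  (path: String) => sc.textFile(path).count(),
  "/user/user1/events.log")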
Driver Side
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
...
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition] = {
checkpointRDD.map(_.partitions).getOrElse {
if (partitions_ == null) {
partitions_ = KerberosFunction.executeSecure(new ExecutionWrp(getPartitions))
partitions_.zipWithIndex.foreach { case (partition, index) =>
require(partition.index == index,
s"partitions($index).partition == ${partition.index}, but it should equal $index")
}
}
partitions_
}
}
Callout: wrapping a parameterless method with ExecutionWrp.
class PairRDDFunctions[K, V](self: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
...
def saveAsHadoopDataset(conf: JobConf): Unit = self.withScope {
// Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
val internalSave: (JobConf => Unit) = (conf: JobConf) => {
val hadoopConf = conf
val outputFormatInstance = hadoopConf.getOutputFormat
val keyClass = hadoopConf.getOutputKeyClass
val valueClass = hadoopConf.getOutputValueClass
...
val writeToFile = (context: TaskContext, iter: Iterator[(K, V)]) => {
….
}
self.context.runJob(self, writeToFile)
writer.commitJob()
}
KerberosFunction.executeSecure(internalSave, conf)
}
Driver Side
Callouts: the RDD save function for a Hadoop datastore; the inner wrapper function that will run in the cluster; the Kerberos-authenticated save execution.
Driver Side
class InMemoryCatalog(
    conf: SparkConf = new SparkConf,
    hadoopConfig: Configuration = new Configuration)
...
override def createDatabase(
    dbDefinition: CatalogDatabase,
    ignoreIfExists: Boolean): Unit = synchronized {
  def inner: Unit = {
    try { // the try keyword is elided on the slide but required by the catch below
      ...
      val location = new Path(dbDefinition.locationUri)
      val fs = location.getFileSystem(hadoopConfig)
      fs.mkdirs(location)
    } catch {
      case e: IOException =>
        throw new SparkException(s"Unable to create database ${dbDefinition.name} as failed " +
          s"to create its directory ${dbDefinition.locationUri}", e)
    }
    catalog.put(dbDefinition.name, new DatabaseDesc(dbDefinition))
  }
  KerberosFunction.executeSecure(KerberosUser.principal, new ExecutionWrp(inner))
}
Callout: Spark creates a directory in HDFS; the call is wrapped so it runs as the Kerberos (proxy) user.
* Interface used to load a [[Dataset]] from external storage systems (e.g. file systems,
...
class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
...
def load(): DataFrame = {
load(Seq.empty: _*) // force invocation of `load(...varargs...)`
}
...
def load(paths: String*): DataFrame = {
val proxyuser = extraOptions.get("user")
if (proxyuser.isDefined) KerberosUser.setProxyUser(proxyuser.get)
val dataSource = KerberosFunction.executeSecure(proxyuser,
DataSource.apply,
sparkSession,
source,
paths, userSpecifiedSchema, Seq.empty, None, extraOptions.toMap)
val baseRelation = KerberosFunction.executeSecure(proxyuser, dataSource.resolveRelation, false)
KerberosFunction.executeSecure(proxyuser, sparkSession.baseRelationToDataFrame, baseRelation)
}
Driver Side
Callouts: get the user from the Dataset options; the method for loading data from sources without a path; obtaining the baseRelation from the datasource.
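With this change a caller selects the impersonated user through the reader options; an illustrative call (format and path are assumed):

val df = sparkSession.read
  .format("parquet")
  .option("user", "user1") // proxy user picked up inside load()
  .load("/user/user1/input.parquet")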
* Interface used to write a [[Dataset]] to external storage systems (e.g. file systems,
...
class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
...
/**
* Saves the content of the [[DataFrame]] as the specified table.
...
def save(): Unit = {
assertNotBucketed("save")
...
val maybeUser = extraOptions.get("user")
def innerWrite(modeData: (SaveMode, DataFrame)): Unit = {
val (mode, data) = modeData
dataSource.write(mode, data)
}
if (maybeUser.isDefined) KerberosUser.setProxyUser(maybeUser.get)
KerberosFunction.executeSecure(maybeUser, innerWrite, (mode, df))
}
Driver Side
Callouts: get the user from the Dataset options; the method for saving data to external sources; wrapping the save execution.
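The write path is symmetric; again illustrative:

df.write
  .format("parquet")
  .option("user", "user1") // proxy user picked up inside save()
  .save("/user/user1/output.parquet")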
class DAGScheduler(...){
…
…
KerberosUser.getMaybeUser match {
case Some(user) => properties.setProperty("user", user)
case _ =>
}
...
val tasks: Seq[Task[_]] = try {
stage match {
case stage: ShuffleMapStage =>
partitionsToCompute.map { id =>
...
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, stage.latestInfo.taskMetrics, properties)
...
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics)
}
...
Driver Side
private[spark] abstract class Task[T](
val stageId: Int,
val stageAttemptId: Int,
val partitionId: Int,
// The default value is only used in tests.
val metrics: TaskMetrics = TaskMetrics.registered,
@transient var localProperties: Properties = new Properties) extends Serializable {
...
final def run(
...
try {
val proxyUser =
Option(Executor.taskDeserializationProps.get().getProperty("user"))
KerberosFunction.executeSecure(proxyUser, runTask, context)
} catch {
…
def runTask(context: TaskContext): T
Executor Side
Callouts: the properties are loaded on the Driver side; the proxy user is read and the execution wrapped; runTask is the method implemented by Task subclasses.
Demo time
Next Steps
• Merge this code into Apache Spark (SPARK-16788)
• Pluggable authorization
• Pluggable secret management (why always use Hadoop delegation tokens?)
• Distributed cache
• ...
Q & A