END-TO-END DATA GOVERNANCE
WITH APACHE AVRO AND ATLAS
Barbara Eckman, Ph.D.
Principal Data Architect
Comcast
Mission
Gather, organize, make sense of Comcast data,
and make it universally accessible through
Platforms, Solutions, Products.
• Dozens of tenants and stakeholders
• Millions of messages/second captured
• Tens of PB of long term data storage
• Thousands of cores of distributed compute
Motivating example
• Record representing an IP Video player download session.
– Time download session created
– List of timestamps when video download started
– Endtime, status (failure, success)
– Device info
– Customer Account info
– Asset info (e.g. movie, TV show, etc. being downloaded)
– Fragments: total, completed
– Failure events
• Suppose I want to integrate this data with :
– Customer experience data -- join on customer account
– IP player analytics session (latency, buffering) -- join on device, asset id
– Comcast network traffic data -- join on device
Data integration shop of horrors!
• Customer account ids:
– Xcal
– Xbo
– Billing
– “Service”
• Device ids:
– Physical
– DeviceId
– Xcal
– Xbo
• Asset ids:
– ProviderId, assetId
– streamId
– recordingId
– EAS_URI
– mediaGuid
– mediaId
– assetContentId
– programId
– platformId
Data and Schema
Governance to the Rescue!
Questions: Schema Governance
• I want to join your data with my data—but
what does your data mean???
• Is it safe to join my data with yours (Avoid
“Frankendata”)?
• What attributes in your data match attributes
in mine? (i.e., potential join fields)
• Can I change my schema without breaking
systems of others that rely on it?
Questions: Data Governance
• Where can I find data about X?
– How is this data structured?
– Who produced it?
• How has the data changed in its journey
from ingest to where I’m viewing it?
• Where are the derivatives of my original
data to be found?
Outline
• Avro for Schema Governance
– What is Apache Avro
– How Comcast best practices (BPs) make Avro even stronger
• Atlas for Data Governance
– Metadata Browser and Schema registry
– Comcast extensions to Atlas
– Platforms and Atlas work together for lineage
Apache Avro
• A data serialization system
– A JSON-based schema language
– A compact serialized format
• APIs in a bunch of languages
• Benefits:
– Cross-language support for dynamic data access
– Simple but expressive schema definition and
evolution
Avro Types
• primitive
– string
– bytes
– int & long
– float & double
– boolean
– null (enables
optional fields)
• complex
– record
– array
– map: string -> Type
– union (aka “choice”)
– fixed<N> (byte arrays,
e.g. uuid)
– enum:
[“larry”|“moe”|“curly”]
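As a rough illustration of how these types combine, the sketch below parses a small record schema with Python's standard json module and reports which fields are optional. The two fields mirror the log event example in this deck; the record name, namespace, and the is_optional helper are made up for illustration. It follows the conventional Avro pattern for an optional field: a union whose first branch is "null", paired with a null default.

```python
import json

# Hypothetical wrapper record: only the two fields come from the deck;
# the record name and namespace are invented for this sketch.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "LogEvent",
  "namespace": "example.logging",
  "doc": "Generic application log event (illustrative).",
  "fields": [
    {"name": "loggerName", "type": "string", "default": "unknown",
     "doc": "The name of the logger class that generated the message."},
    {"name": "message", "type": ["null", "string"], "default": null,
     "doc": "The formatted message of the log event."}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def is_optional(field):
    """True when the field is a union starting with "null" and defaults to null."""
    field_type = field["type"]
    return (
        isinstance(field_type, list)
        and field_type[0] == "null"
        and "default" in field
        and field["default"] is None
    )


optional = {f["name"]: is_optional(f) for f in schema["fields"]}
print(optional)
```

Here loggerName is required (it always carries a string, falling back to its default), while message may be absent.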
Simple Avro Example: Generic Application Log Event
{ "name": "loggerName",
"type": "string",
"default": "unknown",
"doc": "The name of the logger class that generated the message."},
{ "name": "message",
"type": ["null", "string"],
"default": null,
"doc": "The formatted message of the log event."}
I want to change my schema…will it
mess others up??
• Avro serialized data is never used without its
schema.
• We have empirically determined a complete
set of rules for schema evolution and
backward compatibility, for example:
– You may add an optional field to your schema
– You may not change the type of any field
I want to change my schema…will it
mess others up??
• Some chains of compatible changes produce
incompatible results
– remove optional string field X
– add new optional int field X
– This sequence has changed the type of field X
from string to int!
• We can use this data in reasoning about
schema evolution, e.g. “is this schema
compatible with the schema 3 versions back?”
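The chain problem above can be sketched in a few lines. This is a deliberately simplified check, not the full Comcast rule set or Avro's own schema resolution: it only encodes the two example rules from the previous slide (an added field needs a default; a field's type may not change). The function names and the sample schemas are assumptions made for the illustration.

```python
def fields_by_name(schema):
    """Index a record schema's fields by field name."""
    return {f["name"]: f for f in schema["fields"]}


def compatible(old, new):
    """Simplified check: can a reader using `new` handle data written with `old`?"""
    old_fields, new_fields = fields_by_name(old), fields_by_name(new)
    for name, field in new_fields.items():
        if name not in old_fields:
            if "default" not in field:
                return False  # an added field must carry a default
        elif old_fields[name]["type"] != field["type"]:
            return False      # the type of an existing field may not change
    return True               # removed fields are simply ignored by the reader


def record(*fields):
    """Build a throwaway record schema (hypothetical name) from field dicts."""
    return {"type": "record", "name": "Example", "fields": list(fields)}


ts = {"name": "ts", "type": "long", "default": 0}
x_str = {"name": "X", "type": ["null", "string"], "default": None}
x_int = {"name": "X", "type": ["null", "int"], "default": None}

v1 = record(ts, x_str)  # original schema with optional string X
v2 = record(ts)         # remove optional string field X
v3 = record(ts, x_int)  # add new optional int field X

stepwise = [compatible(v1, v2), compatible(v2, v3)]
end_to_end = compatible(v1, v3)
print(stepwise, end_to_end)
```

Each single step passes, but comparing v3 directly against v1 fails, which is why keeping every version in the registry matters when asking whether a schema is compatible with one several versions back.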
Avro Schema Creation Best Practices
• Each schema is checked for compliance with
Comcast conventions on compile
• Each schema is reviewed and approved by at
least one human being
• Comcast conventions:
– doc comments required to document every
attribute
– All attributes must have default values
– Unnecessary complexity is discouraged (YAGNI
principle)
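A compile-time convention check of the kind described above might look like the sketch below. It covers only the two mechanical conventions (every attribute has a doc comment and a default value); the lint function, its message format, and the sample schema are assumptions, not the actual Comcast tooling.

```python
def lint(schema, path=""):
    """Return a list of convention violations in an Avro schema (parsed JSON)."""
    problems = []
    if isinstance(schema, list):  # union: check every branch
        for branch in schema:
            problems += lint(branch, path)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema.get("fields", []):
                where = f"{path}{schema['name']}.{field['name']}"
                if not str(field.get("doc", "")).strip():
                    problems.append(f"{where}: missing doc")
                if "default" not in field:
                    problems.append(f"{where}: missing default")
                # Recurse into nested records, arrays, maps, and unions.
                problems += lint(field["type"], f"{where}/")
        elif kind == "array":
            problems += lint(schema["items"], path)
        elif kind == "map":
            problems += lint(schema["values"], path)
    return problems


# Illustrative schema: the second field breaks both conventions.
schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "loggerName", "type": "string", "default": "unknown",
         "doc": "The name of the logger class that generated the message."},
        {"name": "level", "type": "string"},
    ],
}

problems = lint(schema)
for problem in problems:
    print(problem)
```

A check like this handles the automated part; the YAGNI judgment still needs the human reviewer.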
Avro Schema Creation Best Practices
Data governance policy on updates:
– Data must always match a schema in the schema
registry or be traceable to such a schema
– Updates to schemas of data “in flight” or “at rest”
are not permitted, though re-publication of
enriched data is permitted.
– Schemas in the registry may be evolved at will, but
non-compatible changes should be kept to a bare
minimum
AND MOST IMPORTANTLY…
Philosophy of modular core
subschema reuse
• Github repo of core subschemas
– application (running on a device)
– customerAccount
– device
– error
– geolocation
– header (timestamp, uuid, hostname)
– logEvent
– moneyTrace (i.e., distributed message tracing)
– monitoringEvent
– networkInterface
– more added as needed
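To make the reuse idea concrete, here is a minimal sketch of how an event schema that refers to core subschemas by name could be expanded into one self-contained schema. The CORE registry contents, the core namespace, and the expand function are hypothetical. The sketch inlines a named type only at its first reference and leaves later references as names, since Avro allows a named type to be defined only once per schema.

```python
import copy

# Hypothetical stand-in for the repo of core subschemas, keyed by full name.
CORE = {
    "core.Header": {
        "type": "record", "name": "Header", "namespace": "core",
        "fields": [{"name": "timestamp", "type": "long", "default": 0,
                    "doc": "Event time in epoch milliseconds."}],
    },
    "core.Device": {
        "type": "record", "name": "Device", "namespace": "core",
        "fields": [{"name": "deviceId", "type": "string", "default": "",
                    "doc": "Identifier of the device."}],
    },
}


def expand(schema, registry, seen=None):
    """Inline each registry reference the first time it appears."""
    seen = set() if seen is None else seen
    if isinstance(schema, str):
        if schema in registry and schema not in seen:
            seen.add(schema)
            return expand(copy.deepcopy(registry[schema]), registry, seen)
        return schema  # primitive type, or a name already defined earlier
    if isinstance(schema, list):  # union
        return [expand(branch, registry, seen) for branch in schema]
    if isinstance(schema, dict):
        out = dict(schema)
        if "fields" in out:
            out["fields"] = [
                dict(f, type=expand(f["type"], registry, seen))
                for f in out["fields"]
            ]
        for key in ("items", "values"):  # array and map element types
            if key in out:
                out[key] = expand(out[key], registry, seen)
        return out
    return schema


event = {
    "type": "record", "name": "PlayerSession", "namespace": "example",
    "fields": [
        {"name": "header", "type": "core.Header", "doc": "Common header."},
        {"name": "device", "type": "core.Device", "doc": "Playing device."},
        {"name": "peer", "type": ["null", "core.Device"], "default": None,
         "doc": "Optional second device."},
    ],
}

expanded = expand(event, CORE)
print(expanded["fields"][0]["type"]["name"], expanded["fields"][2]["type"])
```

Because every team pulls device or customerAccount from the same definition, join fields carry the same name, type, and documentation wherever they appear.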
Recap: Avro benefits
• Benefits to the Business
– Data integration to answer business questions
• Benefits to data and schema governance
– Data standardization
– Data integrity
– Standardized documentation
– Data lineage
– Clear compatibility criteria
Avro in the context of data governance:
Apache Atlas
Data Engineering Analytics Platform (DEAP) Architecture
[Architecture diagram; labels: User-Facing Solutions and Products, Analytics, PER, Platforms]
Athene noctua, “little owl”
Companion of Athena, Greek goddess of Wisdom,
Justice, and Good Governance.
Questions: Data Governance
• Data Discovery
– Where can I find data about X?
• Data Lineage
– How has the data changed in its journey from ingest to where I’m viewing it?
– Where are the derivatives of my original data to be found?
• Data Security
– Can I control who sees/changes my data? (esp. PII)
Apache Atlas
• Data Discovery, Lineage
– Browser UI
– REST and Java APIs
– Synchronous and asynchronous messaging
• Integrated Security (Apache Ranger)
• Schema Registry as well as Metadata Repo
Open Source
Extensible
Atlas Extensions for Schema Registry
• New typedefs: avro_schema, avro_record,
avro_enum, etc
• Extensions to Kafka topic type
– sizing parameters
• Reciprocal links between topics and schemas
• Schema evolution
– Versioning
– schema lineage process
– B/F/Full compatibility
• Better handling of JSON-valued attributes in UI
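For readers unfamiliar with Atlas typedefs, the sketch below builds a type-definition payload of the general shape accepted by the Atlas v2 typedefs REST endpoint (to my understanding, a POST to /api/atlas/v2/types/typedefs). Only the type name avro_schema comes from the slide. The attribute names, the attr helper, and the choice of DataSet as supertype are assumptions for illustration and are not the actual Comcast definitions.

```python
import json


def attr(name, type_name, optional=True):
    """Build one attribute definition in the Atlas v2 typedef JSON layout."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": "SINGLE",
        "isOptional": optional,
        "isUnique": False,
        "isIndexable": False,
    }


# Hypothetical attributes; DataSet already supplies name, qualifiedName, owner.
avro_schema_def = {
    "name": "avro_schema",
    "superTypes": ["DataSet"],
    "typeVersion": "1.0",
    "attributeDefs": [
        attr("namespace", "string"),
        attr("versionId", "int"),
        attr("compatibility", "string"),  # e.g. backward, forward, or full
        attr("avro_notation", "string"),  # the schema JSON itself
    ],
}

payload = {
    "enumDefs": [],
    "structDefs": [],
    "classificationDefs": [],
    "entityDefs": [avro_schema_def],
}

print(json.dumps(payload, indent=2))
```

The payload is only printed here; sending it to a running Atlas instance would additionally need the server URL and credentials.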
Avro Schema Registry for Data at
Ingest
• Browsable hierarchical schemas
• Avro schemas corresponding to kafka topics
(or no kafka topics)
• Schema lineage using Atlas processes
• Schema evolution and compatibility
• Kafka topic search capabilities
• Contact person for each kafka topic/schema
• API for fully-expanded avro schemas for serde
Data Lineage for Data at Rest
• Compressed avro binary files in long-term object
storage
• Hive or parquet tables derived directly from avro
data
• Data derived/aggregated from hive tables
• Avro-serialized JSON data in NoSQL
• etc
For any data item in any community data store, we
must be able to identify the corresponding avro
schema in the registry
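The requirement above amounts to a walk up the lineage graph until an entity with a linked schema is reached. The sketch below shows that walk over a tiny in-memory graph. The entity names, the schema identifier, and the ENTITIES structure are all hypothetical; in the real system this information would live in Atlas.

```python
from collections import deque

# Hypothetical lineage: name -> (upstream entity names, linked avro schema or None)
ENTITIES = {
    "kafka:player-sessions": ([], "avro:PlayerSession.v3"),
    "s3:raw/player-sessions/": (["kafka:player-sessions"], "avro:PlayerSession.v3"),
    "hive:player_sessions": (["s3:raw/player-sessions/"], None),
    "hive:daily_failures": (["hive:player_sessions"], None),
}


def schemas_for(entity):
    """Breadth-first walk upstream; stop at the nearest entities linked to a schema."""
    found = []
    queue = deque([entity])
    visited = {entity}
    while queue:
        current = queue.popleft()
        upstream, schema = ENTITIES[current]
        if schema is not None:
            if schema not in found:
                found.append(schema)
            continue  # no need to look past a governed ancestor
        for parent in upstream:
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return found


result = schemas_for("hive:daily_failures")
print(result)
```

An aggregated Hive table two hops away from the raw data still resolves to the schema of the topic it was ultimately derived from.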
Atlas Extensions for Data Lineage
• S3 objects (also linked to schemas)
• Hive tables and avro schemas linked
• Link to slack help channel in UI
• Java library to facilitate creation and linking of
entities, using asynchronous messaging
– Create an S3 pseudodir entity, link it to its avro schema
– Update kafka sizing
Platforms work together for data governance
[Diagram: how the platforms exchange data and lineage]
• HEADWATERS: stream data with avro schemas
• CLOUDBRIDGE: transformation creates stream data
• OBJECT STORAGE: batch data with avro schemas
• ATHENE: schemas, metadata, lineage
• Lineage links recorded for: Parquet file, Hive table, S3 pseudodir
Demo here
Visualizing Data Lineage
What’s Next
• More types
– Independent kafka2S3 service that messages Atlas
– More transforms as first-class processes
• Generic transform library with avro schemas expressing
inputs and outputs
– Zeppelin and other notebooks
– Other data sources in our hybrid environment
• Circonus checks
Summary
• Avro for Schema Governance
– Our lingua franca for end-to-end metadata
– Comcast BPs make Avro even stronger
• Atlas for Data Governance
– Metadata Browser and Schema registry
– Comcast typedef extensions to Atlas
• Avro schemas, records, enums, maps, arrays, etc
• Add attributes to already existing types (eg kafka topics)
• Data sources outside Hadoop ecosystem
• Transforms, schema versioning, lineage, compatibility
– Platforms and Atlas work together for lineage
Challenges and Solutions
• Challenge: Creating conformant avro schemas is not trivial
– Solution: Detailed documentation, sample code in Java, Python, C#, Go, etc.; team of reviewers
• Challenge: Avro schemas are annoying to create in a text editor
– Solution: Avro schema builder UI (in beta now)
• Challenge: Avro manual schema review process was originally too slow
– Solution: Trained more reviewers, streamlined processes; Avro schema reviewer UI (in beta now)
• Challenge: Not everyone can publish data in Avro, but they can in JSON
– Solution: Use Apache NiFi to convert from JSON to JSON-serialized avro
• Challenge: No one likes to be “governed”; some developers saw creating Avro schemas as a drag on their productivity
– Solution: Top-down mandate; developers came to recognize the benefits of a schema in data handling: excitement from analysts, end-user applications
• Challenge: Finding data governance tooling to handle metadata and lineage for our hybrid environment
– Solution: Apache Atlas: open source, extensible solution!
• Challenge: Need a schema registry that allows browsing, highlights our BPs, also serves up avro for serde
– Solution: Atlas: create avro_schema type, extend kafka topic type, represent schema evolution
• Challenge: Informing Atlas of changes where there aren’t built-in “hooks”
– Solution: Extensibility!! Java library hides Atlas-specific syntax, makes it easy for other platforms
My collaborators
Vadim Vaks
Sr. Solutions Architect
Hortonworks
Thank You!
Barbara_Eckman@Cable.Comcast.com

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

End-to-end Data Governance with Apache Avro and Atlas

  • 1. END-TO-END DATA GOVERNANCE WITH APACHE AVRO AND ATLAS Barbara Eckman, Ph.D. Principal Data Architect Comcast
  • 2. Mission Gather, organize, make sense of Comcast data, and make it universally accessible through Platforms, Solutions, Products. • Dozens of tenants and stakeholders • Millions of messages/second captured • Tens of PB of long term data storage • Thousands of cores of distributed compute
  • 3. Motivating example • Record representing an IP Video player download session. – Time download session created – List of timestamps when video download started – Endtime, status (failure, success) – Device info – Customer Account info – Asset info (ie movie, tv show, etc being downloaded) – Fragments: total, completed – Failure events • Suppose I want to integrate this data with : – Customer experience data -- join on customer account – IP player analytics session (latency, buffering) -- join on device, asset id – Comcast network traffic data -- join on device
  • 4. Data integration shop of horrors! • Customer account ids: – Xcal – Xbo – Billing – “Service” • Device ids: – Physical – DeviceId – Xcal – Xbo • Asset ids: – ProviderId, assetId – streamId – recordingId – EAS_URI – mediaGuid – mediaId – assetContentId – programId – platformId
  • 5. Data and Schema Governance to the Rescue!
  • 6. Questions: Schema Governance • I want to join your data with my data—but what does your data mean??? • Is it safe to join my data with yours (Avoid “Frankendata”)? • What attributes in your data match attributes in mine? (ie potential join fields) • Can I change my schema without breaking systems of others that rely on it?
  • 7. Questions: Data Governance • Where can I find data about X? – How is this data structured? – Who produced it? • How has the data changed in its journey from ingest to where I’m viewing it? • Where are the derivatives of my original data to be found?
  • 8. Outline • Avro for Schema Governance – What is Apache Avro – How Comcast BPs make Avro even stronger • Atlas for Data Governance – Metadata Browser and Schema registry – Comcast extensions to Atlas – Platforms and Atlas work together for lineage
  • 9. Apache Avro • A data serialization system – A JSON-based schema language – A compact serialized format • APIs in a bunch of languages • Benefits: – Cross-language support for dynamic data access – Simple but expressive schema definition and evolution
  • 10. Avro Types • primitive – string – bytes – int & long – float & double – boolean – Null (enables optional fields) • complex – record – array – map: string -> Type – union (aka “choice”) – fixed<N> (byte arrays, eg uuid) – enum: [“larry”|“moe”|“curly”]
  • 11. Simple Avro Example: Generic Application Log Event { "name": "loggerName", "type": "string", "default": "unknown", "doc": "The name of the logger class that generated the message." }, { "name": "message", "type": ["null", "string"], "default": null, "doc": "The formatted message of the log event." }
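
The two field definitions on this slide are fragments of a larger record schema. As a minimal sketch, the Python below wraps them in a complete record and shows how the declared defaults could fill in missing fields. The record name `LogEvent`, the namespace `example.logging`, and the helper `apply_defaults` are assumptions for illustration only; they do not come from the talk, and a real Avro library applies defaults during schema resolution on read rather than through a helper like this.

```python
import json

# Hypothetical wrapper record: the slide shows only the two field definitions,
# so the record name, namespace, and record-level doc are illustrative.
LOG_EVENT_SCHEMA = """
{
  "type": "record",
  "name": "LogEvent",
  "namespace": "example.logging",
  "doc": "A generic application log event.",
  "fields": [
    {"name": "loggerName", "type": "string", "default": "unknown",
     "doc": "The name of the logger class that generated the message."},
    {"name": "message", "type": ["null", "string"], "default": null,
     "doc": "The formatted message of the log event."}
  ]
}
"""


def apply_defaults(schema: dict, datum: dict) -> dict:
    """Fill in any field missing from datum with the schema's default."""
    filled = dict(datum)
    for field in schema["fields"]:
        if field["name"] not in filled:
            filled[field["name"]] = field["default"]
    return filled


schema = json.loads(LOG_EVENT_SCHEMA)
event = apply_defaults(schema, {"message": "disk nearly full"})
print(event)
```

Because `message` is a union of `null` and `string` with a `null` default, it behaves as an optional field, which is the pattern slide 10 refers to when it says null enables optional fields.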
  • 12. I want to change my schema…will it mess others up?? • AVRO serialized data is never used without its schema. • We have empirically determined a complete set of rules for schema evolution and backward compatibility, for example: – You may add an optional field to your schema – You may not change the type of any field
  • 13. I want to change my schema…will it mess others up?? • Some chains of compatible changes produce incompatible results – remove optional string field X – add new optional int field X – This sequence has changed the type of field X from string to int! • We can use this data in reasoning about schema evolution, e.g. “is this schema compatible with the schema 3 versions back?”
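
The chain-of-changes problem on slides 12 and 13 can be sketched in a few lines of Python. This is a simplified illustration, not the complete rule set the talk mentions: it encodes only the two example rules (optional fields may be added or removed; a field's type may not change) and treats "optional" as "declares a default". The function names and the three schema versions are hypothetical.

```python
def is_optional(field):
    """A field is treated as optional here if it declares a default."""
    return "default" in field


def compatible(old, new):
    """Pairwise check under two simplified rules: fields may be added or
    removed only if optional, and a field's type may never change."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    for name in old_fields.keys() | new_fields.keys():
        before, after = old_fields.get(name), new_fields.get(name)
        if before is None or after is None:
            # Field was added or removed: allowed only when optional.
            if not is_optional(before or after):
                return False
        elif before["type"] != after["type"]:
            return False
    return True


def compatible_with_all(history, candidate):
    """Check the candidate against every earlier version, not just the last."""
    return all(compatible(old, candidate) for old in history)


ID = {"name": "id", "type": "string"}
v1 = {"fields": [ID, {"name": "X", "type": ["null", "string"], "default": None}]}
v2 = {"fields": [ID]}  # optional string field X removed
v3 = {"fields": [ID, {"name": "X", "type": ["null", "int"], "default": None}]}

print(compatible(v1, v2), compatible(v2, v3))  # each single step passes
print(compatible(v1, v3))                      # X changed from string to int
print(compatible_with_all([v1, v2], v3))
```

Each step passes on its own, yet v3 is incompatible with v1. Checking a candidate against the whole version history is one way to answer the "schema 3 versions back" question.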
  • 14. Avro Schema Creation Best Practices • Each schema is checked for compliance with Comcast conventions on compile • Each schema is reviewed and approved by at least one human being • Comcast conventions: – doc comments required to document every attribute – All attributes must have default values – Unnecessary complexity is discouraged (YAGNI principle)
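
A convention check of the kind described on this slide could look roughly like the sketch below. It covers only the two mechanical conventions listed (every attribute has a doc comment and a default value); the function name `lint_fields` and the sample schema are hypothetical, and the talk does not describe how Comcast's actual compile-time check is implemented.

```python
def lint_fields(schema):
    """Return a list of convention violations for a record schema (a dict)."""
    problems = []
    for field in schema.get("fields", []):
        name = field.get("name", "<unnamed>")
        # Convention: doc comments are required on every attribute.
        if not str(field.get("doc", "")).strip():
            problems.append(f"{name}: missing doc")
        # Convention: every attribute must declare a default value.
        if "default" not in field:
            problems.append(f"{name}: missing default")
    return problems


sample = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "hostname", "type": "string", "default": "unknown",
         "doc": "Host that emitted the event."},
        {"name": "retries", "type": "int"},  # violates both conventions
    ],
}

for problem in lint_fields(sample):
    print(problem)
```

A check like this can run in a build step, leaving judgment calls such as unnecessary complexity to the human reviewers.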
  • 15. Avro Schema Creation Best Practices Data governance policy on updates: – Data must always match a schema in the schema registry or be traceable to such a schema – Updates to schemas of data “in flight” or “at rest” are not permitted, though re-publication of enriched data is permitted. – Schemas in the registry may be evolved at will, but non-compatible changes should be kept to a bare minimum AND MOST IMPORTANTLY…
  • 16. Philosophy of modular core subschema reuse • Github repo of core subschemas – application (running on a device) – customerAccount – device – error – geolocation – header (timestamp, uuid, hostname) – logEvent – moneyTrace (ie distributed message tracing) – monitoringEvent – networkInterface – more added as needed
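
One way to picture subschema reuse: a team's schema refers to core subschemas by name, and a tool expands those references into a single self-contained schema. The sketch below assumes a small in-memory registry; the `core.*` names, the field lists, and the `expand` helper are hypothetical (only the `header` contents of timestamp, uuid, and hostname come from the slide), and the talk does not say how the real repository resolves references.

```python
# Hypothetical registry of core subschemas, keyed by full name.
CORE = {
    "core.header": {
        "type": "record", "name": "header", "namespace": "core",
        "fields": [
            {"name": "timestamp", "type": "long", "default": 0},
            {"name": "uuid", "type": "string", "default": ""},
            {"name": "hostname", "type": "string", "default": "unknown"},
        ],
    },
    "core.device": {
        "type": "record", "name": "device", "namespace": "core",
        "fields": [{"name": "deviceId", "type": "string", "default": ""}],
    },
}


def expand(node, registry):
    """Recursively replace named references with their registry definitions.

    Simplification: a subschema referenced twice is inlined twice, which a
    real Avro parser would reject as a duplicate definition, and recursive
    types are not handled.
    """
    if isinstance(node, str):
        return expand(registry[node], registry) if node in registry else node
    if isinstance(node, list):  # union
        return [expand(branch, registry) for branch in node]
    if isinstance(node, dict):
        out = dict(node)
        for key in ("type", "items", "values"):
            if key in out:
                out[key] = expand(out[key], registry)
        if "fields" in out:
            out["fields"] = [expand(f, registry) for f in out["fields"]]
        return out
    return node


download_session = {
    "type": "record", "name": "DownloadSession",
    "fields": [
        {"name": "header", "type": "core.header"},
        {"name": "device", "type": ["null", "core.device"], "default": None},
        {"name": "status", "type": "string", "default": "unknown"},
    ],
}

expanded = expand(download_session, CORE)
print([f["name"] for f in expanded["fields"][0]["type"]["fields"]])
```

Because every team joins on the same `device` and `customerAccount` definitions, the identifier mismatches from the "shop of horrors" slide are avoided at the schema level.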
  • 17. Recap: Avro benefits • Benefits to the Business – Data integration to answer business questions • Benefits to data and schema governance – Data standardization – Data integrity – Standardized documentation – Data lineage – Clear compatibility criteria
  • 18. Avro in the context of data governance: Apache Atlas
  • 21. Athene noctua, “little owl” Companion of Athena, Greek goddess of Wisdom, Justice, and Good Governance.
  • 22. Questions: Data Governance • Data Discovery: Where can I find data about X? • Data Lineage: How has the data changed in its journey from ingest to where I’m viewing it? Where are the derivatives of my original data to be found? • Data Security: Can I control who sees/changes my data? (esp. PII)
  • 23. Apache Atlas • Data Discovery, Lineage – Browser UI – REST and Java APIs – Synchronous and asynchronous messaging • Integrated Security (Apache Ranger) • Schema Registry as well as Metadata Repo • Open source, extensible
  • 24. Atlas Extensions for Schema Registry • New typedefs: avro_schema, avro_record, avro_enum, etc • Extensions to Kafka topic type – sizing parameters • Reciprocal links between topics and schemas • Schema evolution – Versioning – schema lineage process – B/F/Full compatibility • Better handling of JSON-valued attributes in UI
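
To make "new typedefs" concrete, here is a rough sketch of what a type definition payload for an `avro_schema` entity type might look like. The overall shape (`entityDefs`, `superTypes`, `attributeDefs`) follows the Atlas v2 typedef JSON format as commonly documented; the specific attributes (`namespace`, `versionId`, `doc`, `schemaJson`) and the helper names are assumptions for illustration and are not Comcast's actual model.

```python
import json


def attr(name, type_name="string", optional=True):
    """Build one attribute definition in the Atlas typedef style."""
    return {
        "name": name,
        "typeName": type_name,
        "isOptional": optional,
        "cardinality": "SINGLE",
        "isUnique": False,
        "isIndexable": True,
    }


def avro_schema_typedef():
    """Hypothetical typedef for an avro_schema entity type."""
    return {
        "entityDefs": [{
            "name": "avro_schema",
            "superTypes": ["DataSet"],
            "attributeDefs": [
                attr("namespace", optional=False),
                attr("versionId", "int"),
                attr("doc"),
                attr("schemaJson"),  # full schema text, usable for serde
            ],
        }]
    }


payload = avro_schema_typedef()
print(json.dumps(payload, indent=2))
```

The sketch only builds and prints the payload; registering it with a running Atlas instance is left out because the deployment details are not given in the talk.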
  • 25. Avro Schema Registry for Data at Ingest • Browsable hierarchical schemas • Avro schemas corresponding to kafka topics (or no kafka topics) • Schema lineage using Atlas processes • Schema evolution and compatibility • Kafka topic search capabilities • Contact person for each kafka topic/schema • API for fully-expanded avro schemas for serde
  • 26. Data Lineage for Data at Rest • Compressed avro binary files in long-term object storage • Hive or parquet tables derived directly from avro data • Data derived/aggregated from hive tables • Avro-serialized JSON data in NoSQL • etc For any data item in any community data store, we must be able to identify the corresponding avro schema in the registry
  • 27. Atlas Extensions for Data Lineage • S3 objects (also linked to schemas) • Hive tables and avro schemas linked • Link to slack help channel in UI • Java library to facilitate creation and linking of entities, using asynchronous messaging – create S3pseudo, link it to avro – Update kafka sizing
  • 28. Platforms work together for data governance [Diagram] • ATHENE: schemas, metadata, lineage • HEADWATERS: stream data with avro schemas • CLOUDBRIDGE: transformation creates batch data with avro schemas • OBJECT STORAGE • Diagram labels: Stream data; Parquet file, lineage links; Hive table, lineage links; S3 pseudodir, lineage links
  • 32. What’s Next • More types – Independent kafka2S3 service that messages Atlas – More transforms as first-class processes • Generic transform library with avro schemas expressing inputs and outputs – Zeppelin and other notebooks – Other data sources in our hybrid environment • Circonus checks
  • 33. Summary • Avro for Schema Governance – Our lingua franca for end-to-end metadata – Comcast BPs make Avro even stronger • Atlas for Data Governance – Metadata Browser and Schema registry – Comcast typedef extensions to Atlas • Avro schemas, records, enums, maps, arrays, etc • Add attributes to already existing types (eg kafka topics) • Data sources outside Hadoop ecosystem • Transforms, schema versioning, lineage, compatibility – Platforms and Atlas work together for lineage
  • 34. Challenges and Solutions
    – Challenge: Creating conformant avro schemas is not trivial. Solution: Detailed documentation; sample code in Java, Python, C#, Go, etc.; team of reviewers.
    – Challenge: Avro schemas are annoying to create in a text editor. Solution: Avro schema builder UI (in beta now).
    – Challenge: Avro manual schema review process was originally too slow. Solution: Trained more reviewers, streamlined processes; Avro schema reviewer UI (in beta now).
    – Challenge: Not everyone can publish data in Avro, but they can in JSON. Solution: Use Apache NiFi to convert from JSON to JSON-serialized avro.
    – Challenge: No one likes to be “governed”; some developers saw creating Avro schemas as a drag on their productivity. Solution: Top-down mandate; developers came to recognize the benefits of a schema in data handling: excitement from analysts, end-user applications.
    – Challenge: Finding data governance tooling to handle metadata and lineage for our hybrid environment. Solution: Apache Atlas: open source, extensible solution!
    – Challenge: Need a schema registry that allows browsing, highlights our BPs, also serves up avro for serde. Solution: Atlas: create avro_schema type, extend kafka topic type, represent schema evolution.
    – Challenge: Informing Atlas of changes where there aren’t built-in “hooks”. Solution: Extensibility!! Java library hides Atlas-specific syntax, makes it easy for other platforms.
  • 35. My collaborators Vadim Vaks Sr. Solutions Architect Hortonworks

Editor's notes

  1. 10s of Millions of Customers 100s of Millions of Devices
  2. I want to join your data: most of the time in analytics is spent gathering, integrating, and understanding data (“futzing with data”). Is it safe to join my data with yours (avoid “Frankendata”)? Is my device_id the same as your “device_id”? Without common schemas, data integration is nearly impossible. Can I change my schema without breaking systems of others that rely on it? Schemas can change in unpredictable and non-backward-compatible ways.
  3. Serialized in either binary or JSON. APIs in: Java, Python, Go, C, C++, C#, …