SlideShare una empresa de Scribd logo
1 de 257
Descargar para leer sin conexión
Hadoop Application
Architectures:
Architecting a Next
Generation Data Platform
Strata Data Conference, Singapore 2017
Ted Malaska | @tedmalaska
Jonathan Seidman | @jseidman
Questions?
tiny.cloudera.com/sgquestions
Logistics
▪ Break at 3:00 – 3:30 PM
▪ Questions at the end of each section
▪ Slides at tiny.cloudera.com/app-arch-singapore-2017
▪ Code at https://github.com/hadooparchitecturebook/Taxi360
Questions?
tiny.cloudera.com/sgquestions
About the book
▪ @hadooparchbook
▪ hadooparchitecturebook.com
▪ github.com/hadooparchitecturebook
▪ slideshare.com/hadooparchbook
Questions?
tiny.cloudera.com/sgquestions
About the presenters
▪ BNet Group Architect at Blizzard
▪ Cloudera Principal Solution Architect
▪ Architect at FINRA
▪ Contributor to:
▪ Apache Spark,
▪ Hadoop,
▪ Hive,
▪ Sqoop,
▪ Yarn,
▪ Flume,
▪ Etc.
Ted Malaska
Questions?
tiny.cloudera.com/sgquestions
About the presenters
▪ Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously Technical Lead on the big data team at
Orbitz
▪ Co-founder of the Chicago Hadoop User Group and
Chicago Big Data
Jonathan Seidman
Case Study Overview
Internet of Things and Entity 360
Questions?
tiny.cloudera.com/sgquestions
Customer 360
Questions?
tiny.cloudera.com/sgquestions
Connected Cars
Questions?
tiny.cloudera.com/sgquestions
Entity (Taxi) 360 View
Geo-location/
Traffic Data
Customer Data
Maintenance
Data
Other Data
Sources
Streaming
Vehicle Data
Questions?
tiny.cloudera.com/sgquestions
In the old days
Data
Warehouse
Extract Transform
(SQL)
Load Data Mart
Questions?
tiny.cloudera.com/sgquestions
At the time of our Book
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP,	CRM,	RDBMS,	MACHINES FILES,	IMAGES,	VIDEOS,	LOGS,	CLICKSTREAMS EXTERNAL	DATA	SOURCES
Questions?
tiny.cloudera.com/sgquestions
Today...
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Enabling a Range of New Use Cases…
Fraud Detection Market
Transactions
Internet of
Things
Network Security
Questions?
tiny.cloudera.com/sgquestions
Challenges
And so many more
Questions?
tiny.cloudera.com/sgquestions
Challenges – Architectural Considerations
▪ Reliable and scalable ingress of multiple data types and sources:
- High volume event data? Batch data?
▪ Reliable and scalable storage to support multiple workloads and access patterns
- Historical data? Real-time search? Analytics?
▪ Processing engines (for background processing):
- Stream processing? Batch processing?
▪ Data Modeling
- Modeling data for real-time random access? Analytic access? Batch access?
- Deployment Considerations
- On-premise, cloud?
Case Study
Requirements
Overview
Questions?
tiny.cloudera.com/sgquestions
Requirements
▪ Allow users (technical and non-technical) to analyze and visualize data…
Questions?
tiny.cloudera.com/sgquestions
Requirements
▪ Provide analysts with query capabilities via a standard interface…
Questions?
tiny.cloudera.com/sgquestions
Requirements
▪ Provide developers the ability to perform batch processing on historical data…
Questions?
tiny.cloudera.com/sgquestions
Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
- Reliable and scalable storage to support modeling and processing of multiple data
formats.
High level architecture
Walkthrough
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Data Sources
And Data Acquisition
Considerations
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Key to Customer 360 Success
Your project is only as good as the quality and variety of data sources
Geo-location/
Traffic Data
Customer DataMaintenance
Data
Other Data
Sources
Streaming
Vehicle Data
Files
CSV? XML?
JSON?
Twitter?
Mainframe?
Database Salesforce?
MQTT
Questions?
tiny.cloudera.com/sgquestions
High Level Architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Producers – Push and Pull
▪ Code
- Agents
- Log Aggregation
Questions?
tiny.cloudera.com/sgquestions
Producers – Push and Pull
▪ Code
- Directly to pipeline
- To aggregation layer
- REST
- Kafka Producer
Questions?
tiny.cloudera.com/sgquestions
Producers – Push and Pull
▪ Agents
- Snap
- Prometheus
- Telegraf
Questions?
tiny.cloudera.com/sgquestions
Producers – Push and Pull
▪ Log Aggregations
- Syslog
- Logger Implementation
Questions?
tiny.cloudera.com/sgquestions
Kafka Clients
Apache Kafka Clients Ecosystem Clients
Questions?
tiny.cloudera.com/sgquestions
REST Proxy
Talking to Non-native Kafka Apps and Outside the Firewall
REST Proxy
Non-Java Applications
Native Kafka Java Applications
REST / HTTP
Simplifies administrative
actions
Simplifies message creation
and consumption
Provides a RESTful
interface to a Kafka
cluster
Questions?
tiny.cloudera.com/sgquestions
Kafka Connect
Streaming Data Capture
JDBC
Logs
MQTT
RDBMS
Key/Value
HDFS
Kafka Connect API
Kafka
Sources Sinks
Fault tolerant
Manage hundreds of data
sources and sinks
Preserves data schema
Part of Apache Kafka
project
Includes simple
transformations
Questions?
tiny.cloudera.com/sgquestions
Ecosystem of Connectors
Databases
This image cannot currently be displayed.
Datastore/File Store
Analytics Applications / Other
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Why Schema Validation?
▪ Schema declaration
- Regulation (GDPR)
- Consistency
▪ Common formats
▪ Additional Information
- Routing
- Filtering
- Transformations
Questions?
tiny.cloudera.com/sgquestions
How to Implement Schema Validation
▪ Kafka Schema Registry
▪ Custom implementation
▪ Other solutions
Questions?
tiny.cloudera.com/sgquestions
Schema Registry
Elastic
Cassandra
HDFS
Example Consumers
Serializer
Source 1
Serializer
Source 2
!
Kafka Topic!
Schema Registry
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)Prevent backwards incompatible changes
Get different data sources to talk the same language
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Goals for our Transport Layer
▪ Buffering
▪ Partitioning
▪ Speed
▪ Replayability
▪ High throughput
▪ Reliable
Questions?
tiny.cloudera.com/sgquestions
Goals for our Transport Layer
▪ To meet these goals we want some kind of publish-subscribe queue:
- Kafka
- Kinesis
- RabbitMQ
- Azure Queues
- Azure Service Bus
- Google Pub/Sub
- etc…
Questions?
tiny.cloudera.com/sgquestions
Goals for our Transport Layer
▪ In general you have two decisions:
- Cloud service or your own managed component
- Guarantee Levels
Questions?
tiny.cloudera.com/sgquestions
Buffering Data
▪ What do we mean by “buffering” and why do we need it?
event,event,event,event,event,event…
This is bad!
▪ Network partitions happen
▪ Producers and
Consumers work at
different rates
▪ Reliable storage is hard
Stream processing is hard
Lets do one at a time
Questions?
tiny.cloudera.com/sgquestions
Buffering Data – Message Brokers
Publisher
Publisher
Publisher
Message
Queue
Subscriber
Subscriber
Subscriber
Questions?
tiny.cloudera.com/sgquestions
What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or “streaming data platform”
0 1 2 3 4 5 6 7 8
Data
Source
Data
Consumer
A
Data
Consumer
B
Questions?
tiny.cloudera.com/sgquestions
Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
Partition 0
Partition 1
Partition 2
Data
Source
Topic
Questions?
tiny.cloudera.com/sgquestions
Consumers
In Our Architecture
Taxi Trip Data
Producer
Kafka
taxi-trip-input
Topic
Stream
Processing
(Analytic)
Stream
Processing
(Lookup)
Stream
Processing
(Search)
Stream
Processing
(Long Term)
Questions?
tiny.cloudera.com/sgquestions
Input Events
CMT,2009-01-05 08:31:55,2009-01-05 8:37:50,1,0.90000000000000002,-73.977936999999997,
40.745919000000001,,,-
73.983609000000001,40.755051000000002,Credit,5.2999999999999998,0,,0.79000000000000004,0,6.0
899999999999999
vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,
Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,
surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt
Questions?
tiny.cloudera.com/sgquestions
Kafka Considerations – Reliability
▪ But remember there are tradeoffs…
Questions?
tiny.cloudera.com/sgquestions
Kafka Considerations – Reliability
▪ Different reliability levels for topics:
Taxi Trip Data
Kafka
taxi-trip-input
Twitter customer-sentiment
100% – dups
are ok
(“At least
once”)
<=100%
(“At most
once”)
News Flash:
Kafka’s Exactly Once
Producer is on the way
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Replication
ProducerBroker
Partition1
Partition2
Partition3
Leader
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Replication
ProducerBroker
Partition1
Partition2
Partition3
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Replication
ProducerBroker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability– Replication
ProducerBroker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Leader
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Replication
ProducerBroker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Questions?
tiny.cloudera.com/sgquestions
View from the Consumer
Kafka Broker Cluster
Partition A
Partition B
Partition C
Consumer Group A
Consumer Group B
Consumer
Consumer
Consumer
Consumer
A
B
A
B
A
B
Questions?
tiny.cloudera.com/sgquestions
View from the Consumer
Kafka Broker Cluster
Partition A
Partition B
Partition C
Consumer Group A
Consumer Group B
Consumer
Consumer
Consumer
Consumer
A
B
A
B
A
B
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Replication
▪ So how does this relate to our application?
kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 3 
--create --topic taxi-trip-input
kafka-topics --zookeeper ZKHOST:ZKPORT –partition 2 --replication-factor 1 
--create –topic customer-sentiment
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Producers
Taxi Trip Data
Kafka
taxi_trip_input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Message
failure?
Producer
Resend
message
acks=all
Questions?
tiny.cloudera.com/sgquestions
Kafka Reliability – Producers
▪ What about duplicates?
Taxi Trip Data
Kafka
taxi_trip_input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Producer
ID Message
1000 2009-01-04 03:02:00,1,2.629,...
1001 2009-01-04 03:38:00,3,4.549…
1001 2009-01-04 03:38:00,3,4.549…
Questions?
tiny.cloudera.com/sgquestions
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Questions?
tiny.cloudera.com/sgquestions
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Higher
throughput
Higher
throughput
More
resources
(memory)
More
resources
(file handles)
Producer
Questions?
tiny.cloudera.com/sgquestions
How many partitions?
▪ Adding partitions late in the game is painful
▪ Basic formula:
total desired throughput / throughput of slowest consumer or producer
▪ Or ~25GB disk space
▪ Not too many because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown / recovery
- 1000 – 4000 partitions per broker max
- Producers will produce smaller batches – lower throughput
Questions?
tiny.cloudera.com/sgquestions
Kafka Scaling – Producers
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Producer
Questions?
tiny.cloudera.com/sgquestions
Guarding Against Message Loss
▪ Producer – What happens if the producer loses connection to Kafka and the buffer
overflows?
- You get an exception. You can choose to… block? Write to file?
▪ Source – What happens if events are lost before getting sent to producer?
- Once again use some kind of buffer to provide sufficient retention of data.
Stream Processing
Considerations
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Questions?
tiny.cloudera.com/sgquestions
Streaming agenda
▪ What do we mean by streaming?
▪ Streaming use-cases
▪ Streaming semantics
▪ Which streaming engine to choose?
▪ Streaming in our use-case
What do we mean by
streaming?
Questions?
tiny.cloudera.com/sgquestions
What do we mean by streaming?
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
Questions?
tiny.cloudera.com/sgquestions
What do we mean by streaming?
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
Questions?
tiny.cloudera.com/sgquestions
But, there’s no free lunch
Constant low
milliseconds & under
Low milliseconds to
seconds, delay in
case of failures
10s of seconds or
more, re-run in case
of failures
Real-time Near real-time Batch
“Difficult” architectures, lower
latency
“Easier” architectures, higher
latency
Streaming use-cases
Questions?
tiny.cloudera.com/sgquestions
What are Streaming Use Cases
- Atomic Enrichment
- Notifications
- Joining
- Partitioning
- Windowing
- Ingestion
- Aggregation & Counting
- Incremental Machine Learning
- Progressive Analytics
Questions?
tiny.cloudera.com/sgquestions
Atomic Enrichment
- Isolated Transforms
- Transform with context
- FlatMap & Filtering & Routing
Questions?
tiny.cloudera.com/sgquestions
Atomic Enrichment
- Isolated Transforms
Normal Queue
Basic Transform
Incoming Events
Transform Logic
Transformed Events
Queue
Questions?
tiny.cloudera.com/sgquestions
Atomic Enrichment
- Transform with context
Normal Queue
Context Transform
Incoming Events
Transform Logic
Transformed Events
Queue
Cached Context
Cached Source
Questions?
tiny.cloudera.com/sgquestions
Atomic Enrichment
- Flap Map & Filtering
Normal Queue
FlapMap Transform
Incoming Events
Transform Logic
Transformed Events
Queue
Cached Context
Transformed Events
Queue
Cached Source
Questions?
tiny.cloudera.com/sgquestions
Notifications
- Making a decision on a event
- Send out a notification to an action taker
Questions?
tiny.cloudera.com/sgquestions
Notifications
Normal Queue
Processor
Incoming Events
Transform Logic
Notification Queue
Cached Context
Rules & Logic
Recording Decisions
Queue
Cached Store
Storage
Action Taker
Research & Machine
Learning
Questions?
tiny.cloudera.com/sgquestions
Joining
- Joining big to small table
- Broadcast join
- Joining two large tables
- Getting latest on same key from different feeds
- Windowing
Questions?
tiny.cloudera.com/sgquestions
Windowing
- Being able to process a set of events between a start and ending time period
- A sliding window
Questions?
tiny.cloudera.com/sgquestions
Windowing
- Real Data Example
- ???
Questions?
tiny.cloudera.com/sgquestions
Windowing
- Processing Time
- Time the event reaches your system
- Event Time
- Time the event was created
Questions?
tiny.cloudera.com/sgquestions
Windowing
- Window Length
- Slide Interval
- Watermarks
Questions?
tiny.cloudera.com/sgquestions
Ingestion
- Taking values from a stream into a data store
- File System
- Object Store
- NoSQL
- NoSQL Time Series
- Document Store
- Lucene based System
- Spanner System
Questions?
tiny.cloudera.com/sgquestions
Ingestion
- File Systems & Object Store
- Normally you want larger files
- Normally you want high compression
- Normally you want deduping
- Window of deduping
- 100% deduping in this case is difficult
- Think about sequence numbering from source
Questions?
tiny.cloudera.com/sgquestions
Ingestion
- NoSQL, Lucene, and Spanner Systems
- If you use the key deduping is given for free
- Different levels of latency and through put
- NoSQL
- Document
- Spanner
- Lucene
- Partitioning on the way in
Questions?
tiny.cloudera.com/sgquestions
Ingestion
- NoSQL Time Series
- Dumb inserts
- You may need Aggregation across metrics
- To increase throughput you may want to buffer writes
- Context
- Row is a key time and data points There
- Row1 Time1, DataPoint1
- Row1 Time2, DataPoint2
- Is slower then
- Row1 Time1, DataPoint1, Time2, DataPoint2
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Counting value with respect to an entity ID
- Counting totals and with respect to window times
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- This is were we talk about Lamdba
- There are many definitions
- Common but not correct: Jobs that involve both Batch & Streaming
- Correct: Perfect count is not possible with streaming so we use a combination of streaming and
batch to show the right value
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Why is streaming not perfect
Incrementing Speed Layer NoSQL
Get Event to Process
Increment
Acknowledge Event
Get Event Again
Increment
Acknowledge Event
Value: 10
Value: 12
Value: 12
Value: 12
Value: 14
Value: 14
+2
+2
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Why is streaming not perfect (more)
- Duping in general
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- What is Lamdba
Time Window of Events
Streaming Results
So over time lots of errors
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- What is Lamdba
Time Window of Events
Streaming Results
Batch Corrections Batch Corrections Batch Corrections Batch Corrections
Less room for error.
So in theory more correct
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- How is lamdba Resolved with out batch?
- Failure problem
- Deduping problem
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Failure Problem Solved by Adding internal State
- <Spark Streaming Example>
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Failure Problem Solved by Adding internal State
Micro Batch Layer
Get Batch Y
Count Batch Y
Update StateByKey
Foreach Value Put
Acknowledge Batch
Reset State to Start of X Batch
Get Batch Y
Count Batch Y Value
Update StateByKey
Foreach Value Put
Acknowledge Batch
NoSQL
Value: 12
Value: 12
Value: 12
Value: 14
Value: 14
Value: 14
Value: 14
Value: 14
Value: 14
Value: 14
Value: 14
Micro Batch Layer
Result as Batch X
Value = 10
Result as Batch X
Value = 12
Micro Batch Layer
Result as Batch X
Value = 10
Result as Batch X
Value = 12
Micro Batch Layer
Result as Batch X
Value = 10
Result as Batch X
Value = 12
Put(14)
Put(14)
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Deduping Problem Solved by
- Some deduping can happen with in a batch
- But this is not good enough
- Source Sequence Numbers
- With Order, Partitioning, and Internal State
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Deduping with Sequence Numbers
Source Sequence Value
A 1 10
B 1 100
A 2 10
B 2 100
A 3 10
B 2 100
B 3 100
Seq of A Value of A Seq of B Value of B
1 10 - -
1 10 1 100
2 20 1 100
2 20 2 200
3 30 2 200
3 30 2 200
3 30 3 300
Questions?
tiny.cloudera.com/sgquestions
Aggregation & Counting
- Closing Thoughts on Lamdba
Questions?
tiny.cloudera.com/sgquestions
Incremental Machine Learning
- Clustering
- Recommendations
- Neural Networks
Questions?
tiny.cloudera.com/sgquestions
Incremental Machine Learning
- Clustering
- ???
Questions?
tiny.cloudera.com/sgquestions
Incremental Machine Learning
- Recommendations
- ???
Questions?
tiny.cloudera.com/sgquestions
Incremental Machine Learning
- Neural Networks
- ???
Questions?
tiny.cloudera.com/sgquestions
Progressive Analytics
- Running SQL on a stream
- The idea of streams and tables being one
Questions?
tiny.cloudera.com/sgquestions
Progressive Analytics
- Table and Streaming
- Tables are a set of columns and rows
- They represent the current state of things
- Stream is a set of change longs
- The combined total of the changed logs equals a table
Questions?
tiny.cloudera.com/sgquestions
Progressive Analytics
- Table and Streaming
Source Sequence Value
A 1 10
B 1 100
A 2 10
B 2 100
A 3 10
B 2 100
B 3 100
Source Sequence Value
A 2 20
B 1 100
Questions?
tiny.cloudera.com/sgquestions
Progressive Analytics
- Table and Streaming
Source Sequence Value
A 1 10
B 1 100
A 2 10
B 2 100
A 3 10
B 2 100
B 3 100
Source Sequence Value
A 3 30
B 2 200
Streaming semantics
Questions?
tiny.cloudera.com/sgquestions
Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
Questions?
tiny.cloudera.com/sgquestions
Semantics of our architecture
Source System 1
Destination
systemSource System 2
Source System 3
Ingest Extract Streaming
engine
Push
Message broker
Questions?
tiny.cloudera.com/sgquestions
Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
- MongoDB
- Duid.io
▪ Document based
- ES & Solr
▪ NoSQL-SQL
- Kudu, CockRoachdb
Questions?
tiny.cloudera.com/sgquestions
Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
- MongoDB
- Duid.io
▪ Document based
- ES & Solr
▪ NoSQL-SQL
- Kudu, CockRoachdb
De-duplication at file level
Semantics at key/record level
Which streaming
engine to choose?
Questions?
tiny.cloudera.com/sgquestions
Requirements
▪ Fault-tolerant and distributed
▪ Effectively once semantics
▪ Handle processing time vs. event time
▪ Allow stateful transformations
Questions?
tiny.cloudera.com/sgquestions
Spark Streaming
▪ Micro batch based architecture
▪ Allows stateful transformations
▪ Feature rich
- Windowing
- Sessionization
- ML
- SQL (Structured Streaming)
Questions?
tiny.cloudera.com/sgquestions
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count Print
First
Batch
Second
Batch
Questions?
tiny.cloudera.com/sgquestions
DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Print
Source Receiver
RDD
partitions
RDD
Parition
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batch
Second
Batch
Stateful
RDD 1
Print
Stateful
RDD 2
Stateful
RDD 1
Questions?
tiny.cloudera.com/sgquestions
Spark Streaming - Gaps
▪ Not as low of a latency
- Efforts towards reducing latency e.g. RISElab’s Drizzle
▪ Global consistent execution state
- Stop overall execution of distributed computation
- Eagerly persist records in transit meaning larger snapshots
Questions?
tiny.cloudera.com/sgquestions
Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)
Questions?
tiny.cloudera.com/sgquestions
Flink - ABS
Operator
Buffer
Questions?
tiny.cloudera.com/sgquestions
Operator
Buffer
Operator
Buffer
Flink - ABS
Barrier 1A Hit
Barrier 1B
Still Behind
Questions?
tiny.cloudera.com/sgquestions
Operator
Buffer
Flink - ABS
Both Barriers
Hit
Operator
Buffer
Barrier 1A Hit
Barrier 1B
Still Behind
Questions?
tiny.cloudera.com/sgquestions
Operator
Buffer
Flink - ABS Both Barriers
Hit
Operator
Buffer Barrier is
combined and
can move on
Buffer can be
flushed out
Questions?
tiny.cloudera.com/sgquestions
Storm
▪ Old school
▪ Didn’t manage state – had to use Trident
▪ No good support for batch processing
Questions?
tiny.cloudera.com/sgquestions
Samza
▪ Good integration with Kafka
▪ Doesn’t support batch
▪ Forked by Kafka Streams
Questions?
tiny.cloudera.com/sgquestions
Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allowed interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management
Questions?
tiny.cloudera.com/sgquestions
Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or in memory hash map)
▪ Handles late events
▪ Stream-to-stream joins
Questions?
tiny.cloudera.com/sgquestions
Topic
Partition 1
Partition 2
Task 1 Re-partition topic
Partition 1
Partition 2
Task 3
Task 2
Task 4
Kafka Streams architecture
Questions?
tiny.cloudera.com/sgquestions
Apache Beam
▪ Abstraction on top of Streaming Engines
▪ Best support for Google Dataflow
Questions?
tiny.cloudera.com/sgquestions
Others
▪ Apache Apex
▪ Heron
Storage Layer
Considerations
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Data Modeling
Questions?
tiny.cloudera.com/sgquestions
Structured Landing Zones
- Long Term Storage
- Local - Remote
- Styles
- Relational
- Nested
- Speed Layer
- Time Series
- Reverse Index
- Graph
HDFS
Questions?
tiny.cloudera.com/sgquestions
Basic of GFS => HDFS
- NameNode
- Metadata of all the files/blocks
- Which data node they are assigned too
- Replication management
- Data Nodes
- Metadata for each block location on disk
Questions?
tiny.cloudera.com/sgquestions
Basic of GFS => HDFS
Client
NameNode
DataNode A
DataNode B
DataNode C
A
B
C
Write Path
A. Ask Name Node for Location to Write
B. Write to DataNode with NN Instructions
C. DataNode does replication
D. Confirms file is persisted to client
Questions?
tiny.cloudera.com/sgquestions
Basic of GFS => HDFS
- File are immutable
- File can be of any type
- Files are block up into Blocks (128MB -> 1GB)
- Metadata cost is at the Block not the data size
- File may be splittable or may not be when reading
Object Stores
Questions?
tiny.cloudera.com/sgquestions
Object Store (Like and not like HDFS)
- Like a HDFS
- Contains files
- Break up large files
- Not like a HDFS
- Not really a file system and is more Key value like a NoSQL
- Doesn’t have any metadata limit problem
- Traversing Folder directories is more work
- There is no rename, only copy and delete
- Eventually consist issues with listing files
- (seen with things like MR and Spark)
- Can be mostly addressed with EMRFS
Questions?
tiny.cloudera.com/sgquestions
Object Store (Many Ideas Continue)
- Because it is like a File System
- Almost everything you learned from HDFS stores true
- Hive works the same
- Files and file formats are the same
Questions?
tiny.cloudera.com/sgquestions
Object Store (Thinking Remote)
- Unlike HDFS the storage is always remote
- Not on the same nodes as the execution
- Which allow you to save money in the cloud
- Execution nodes are expensive vs storage only
- Network will be used to Read and Write
- In fact you are normally throttled well before the network limit of your node
- You will want the highest rates of compression possible
- To save money on storage
- To read and write faster
Background Information
Questions?
tiny.cloudera.com/sgquestions
Compression Styles and Entropy
Columns
Rows
Questions?
tiny.cloudera.com/sgquestions
Compression Styles and Entropy
Block
Block
Block
Column
Column
Column
Column
Column
Column
Row
Group
Row
Group
Questions?
tiny.cloudera.com/sgquestions
Compression Codecs
- Snappy: 2x-3x : Fast Read, Fast Write
- Lzo : 2x-3x : Fast Read, Fast Write
- Gzip : ~8x: ~Fast Read, Normal Write
- Default : ~8x: ~Fast Read, Normal Write
- BZip2 : ~10x ~Fast Read, Slow Write
- Others ..
- Always be skeptical
- All data compresses differently
- Use your own data
Questions?
tiny.cloudera.com/sgquestions
Introducing the Hive Metastore
- Hive Metastore
- Adds a table like metadata layer over a file system, block store, NoSql, or other
- Allows for SQL access
- Allows for greater security options
- Allows for external metadata
- Allows for partitioning
Questions?
tiny.cloudera.com/sgquestions
Typical Hive Table
- ParentFolder
- TableFolder
- Date=20171212
- DataFiles
- DataFiles
- Date=20171211
- DataFiles
- DataFiles
Questions?
tiny.cloudera.com/sgquestions
Access Patterns
- Partitioning
- Filter push down
- Indexing should be considered poor
- Ideal for large scans
Relational Storage
Questions?
tiny.cloudera.com/sgquestions
Thinking about Object/Tables
1. Lets start off easy
1. Use Case: We are a Netflix type company and we have a log of users and movies watched
that looks something like this:
User ID Age Account Start
Date
Category Of User Movie Watched Movie Category Start Time Events List
Bob 42 12/12/2012 Basic Die Hard Action 5/4/2016 12:00 Play 0, pause at
15, FF at 40 to 55,
E at 90
Kat 31 12/12/2012 Platum Beauty and the
Beast
Family 5/4/2016 12:00 Play 0, pause at
15, FF at 40 to 55,
E at 90
Questions?
tiny.cloudera.com/sgquestions
Thinking about Object/Tables
1. To make this into objects we need to do some separation
User
User_id
Age
St_dt
Category
Movie
Movie_id
Title
Category
Watch_session
Watch_id
St_dt
En_dt
User_id
Movie_id
Watch_Events
Watch_id
St_dt
Type
Duration
Category_Typ
Category_id
Stream_rt
Is_feature_enabled
1 *
*
1
1
*
1*
Questions?
tiny.cloudera.com/sgquestions
Query Considerations
- Data is normally big so
- Partition respectively to access patterns
- Join with care
- Consider sampling or local testing before experimenting
- Data is files
- Latency to accessibility it high – seconds, minutes or more.
Questions?
tiny.cloudera.com/sgquestions
Look for big tables
User
User_id
Age
St_dt
Category
Movie
Movie_id
Title
Category
Watch_session
Watch_id
St_dt
En_dt
User_id
Movie_id
Watch_Events
Watch_id
St_dt
Type
Duration
Category_Typ
Category_id
Stream_rt
Is_feature_enabled
1 *
*
1
1
*
1*
Questions?
tiny.cloudera.com/sgquestions
Mutation Patterns
- File is written once and can not be mutated
- Fine for append or snapshot use cases
- Mutation will require a compaction
Questions?
tiny.cloudera.com/sgquestions
Compaction Recap
Key Time Value
A 1 101
B 1 101
C 1 101
D 1 101
E 1 101
F 1 101
G 1 101
Key Time Value
A 2 102
D 2 102
F 2 102
F 3 103
H 3 103
Key Time Value
A 2 102
B 1 101
C 1 101
D 2 102
E 1 101
F 3 103
G 1 101
H 3 103
Questions?
tiny.cloudera.com/sgquestions
View Strategies
Hive Relational Model
Hive Nested Model
Models
Hive Normal Views
Hive Materialized Table
Views
Use in the cases where the view requires
a join that is done through a shuffle
Use only for tables that filter
records/columns or use for marking fields
Nested Structures
Questions?
tiny.cloudera.com/sgquestions
Nested
▪ Less Space than Denormalization
▪ Still have tables but the cost of joins is all but gone
▪ Also great for cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL
Questions?
tiny.cloudera.com/sgquestions
Nested Example
CREATE TABLE fact_contacts (id BIGINT, name STRING, address
STRING) STORED AS PARQUET;
CREATE TABLE dim_phones
(
contact_id BIGINT
, category STRING
, international_code STRING
, area_code STRING
, exchange STRING
, extension STRING
, mobile BOOLEAN
, carrier STRING
, current BOOLEAN
, service_start_date TIMESTAMP
, service_end_date TIMESTAMP
)
Questions?
tiny.cloudera.com/sgquestions
Nested Example
CREATE TABLE contacts_detailed_phones
(
id BIGINT, name STRING, address STRING
, phone ARRAY < STRUCT <
category: STRING
, international_code: STRING
, area_code: STRING
, exchange: STRING
, extension: STRING
, mobile: BOOLEAN
, carrier: STRING
, current: BOOLEAN
, service_start_date: TIMESTAMP
, service_end_date: TIMESTAMP
>>
) STORED AS PARQUET;
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html
Questions?
tiny.cloudera.com/sgquestions
De-normalized vs Nested
- Nested Pros
- Co-location
- Faster to group by
- Faster to window
- Joins are free
- Less data
- Better compression
- Tables and Columns can be read with out penalty from one not read
- Great for limiting the effort are Cartesian Joins
- Nested Cons
- Size limitation of parent row
- Adding child requires the re-write the the whole parent record
Questions?
tiny.cloudera.com/sgquestions
Options for appending Nested
- It is all about the parent record
- We can add more then one Partition key for the parent
- In our use case
- User & watch month or day
Questions?
tiny.cloudera.com/sgquestions
Storage and In Memory
- Also don’t limit the idea of nested to just tables
- In Spark they can be used as in memory constructs to
- conserve on networking
- In memory cost
Questions?
tiny.cloudera.com/sgquestions
Nested Writing Example in Spark
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" }
]
Questions?
tiny.cloudera.com/sgquestions
Nested Writing Example in Spark
val jsonDF = hiveContext.read.json(jsonRDD)
jsonDF.write.parquet("./parquet")
hiveContext.createExternalTable("jsonNestedTable", "./parquet")
Time Series
Questions?
tiny.cloudera.com/sgquestions
Time Series Options
▪ HBase
▪ Cassandra
▪ InfluxDB
▪ Data Dog
▪ Many More
Questions?
tiny.cloudera.com/sgquestions
Entity Centric Time Series
▪ Partition by Entity ID
▪ Order by Time
▪ Allows for free windowing
▪ Allows for fetching of single time window of single entity at web scale
Questions?
tiny.cloudera.com/sgquestions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Questions?
tiny.cloudera.com/sgquestions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Rest Call Short Scan
Questions?
tiny.cloudera.com/sgquestions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Mapper Mapper Mapper
Questions?
tiny.cloudera.com/sgquestions
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Mapper
Mapper Mapper
Questions?
tiny.cloudera.com/sgquestions
What is meant by Bucketing and Sorting
- Partitioning on a Key
- Then sorting on that key + another field(s)
- Example
- User_id + Watch Event Time
Questions?
tiny.cloudera.com/sgquestions
Example of Bucketed Sorted
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Questions?
tiny.cloudera.com/sgquestions
Good for Appending Nested
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Cust-A, 50
Cust-A, 60
Cust-B, 50
Cust-B, 60
Cust-C, 50
Cust-D, 50
Cust-G, 50
Existing DataNew Data
Questions?
tiny.cloudera.com/sgquestions
Good for Appending Nested
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Cust-A, 50
Cust-A, 60
Cust-B, 50
Cust-B, 60
Cust-C, 50
Cust-D, 50
Cust-G, 50
Existing DataNew Data
Shuffle Join
Questions?
tiny.cloudera.com/sgquestions
Good for Appending Nested
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Cust-B, 50
Cust-B, 60
Existing DataNew Data
Cust-A, 50
Cust-A, 60
Cust-C, 50
Cust-D, 50
Cust-G, 50
Merge Join
Questions?
tiny.cloudera.com/sgquestions
Good for Appending Nested
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-A, 50
Cust-A, 60
Cust-C, 50
Merge Join
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-A, 50
Cust-A, 60
Cust-C, 50
Order Retained
NoSQL
Questions?
tiny.cloudera.com/sgquestions
What is a NoSQL
- It’s not NO SQL
- It’s not a Database
- Think of it more like a
- HashMap
- Log
- Bucketed and Ordered
Questions?
tiny.cloudera.com/sgquestions
Hash Map
- There is a Key and a Value
- It is really fast to grab a key/value
- It is really fast to add a key/value
- Iteration is also possible
Key Value
A 1
B 1
C 1
D 1
E 1
F 1
G 1
Client
Questions?
tiny.cloudera.com/sgquestions
Log with Compactions
- When new records come in they don’t rewrite the old
- They compact in
Key Time Value
A 1 101
B 1 101
C 1 101
D 1 101
E 1 101
F 1 101
G 1 101
Key Time Value
A 2 102
D 2 102
F 2 102
F 3 103
H 3 103
Key Time Value
A 2 102
B 1 101
C 1 101
D 2 102
E 1 101
F 3 103
G 1 101
H 3 103
Questions?
tiny.cloudera.com/sgquestions
HDFS
Log with Compactions
- Write Path
- Get Local for Record (Cached)
- First to WAL
- Then to Memstore
- Sorting & batching
- Flush to New Hfile
- Later Hfiles will be compacted
Client
Master
RegionServer
Memstore
HFiles New HFiles
HFiles
WAL
Questions?
tiny.cloudera.com/sgquestions
HDFS
Ordered
- All Records Columns are ordered
- Ordering allows for simpler indexing
- Ordering allows for simpler compactions
- We will also use this ordering
- Windowing
- Time series
- Local scanning
Client
Master
RegionServer
Memstore
HFiles New HFiles
HFiles
Questions?
tiny.cloudera.com/sgquestions
Bucketing or Partitions
- HBase
- Out of the Box:
- Range
- Desired:
- Salt
- Cassandra
- Out of the Box:
- HashMod
- Bucketed HashMod
Questions?
tiny.cloudera.com/sgquestions
So what about SQL
- Well SQL could totally work
- CQL for cassandra
- Hive and SparkSQL on HBase
- Why is it not the best idea
- Built more for point look ups
- Scans are not as fast as parquet
- However the mutability may be more important than speed
- Partitioning is not simple
- It must be put into the key
Questions?
tiny.cloudera.com/sgquestions
Let’s talk about CAP for a Minute
- Strong Consistency
- HBase & Kudu
- Variable Consistency
- Cassandra
Questions?
tiny.cloudera.com/sgquestions
HBase Model
Client
Master
Region Server 1
Region Server 2
- Region Server owns range splits
- Region Server 1 fails
- Master needs to figure that out
- Master needs to assign new Region Server to own splits
- Region Server 2 has to get organized
- Region Server 2 is read to server reads and writes
Questions?
tiny.cloudera.com/sgquestions
Cassandra Model
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Random Node)
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Random Node)
Questions?
tiny.cloudera.com/sgquestions
Cassandra Model
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Client
Questions?
tiny.cloudera.com/sgquestions
Cassandra Model (Common Models)
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Client
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Client
Client
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Replica Node
(Has Replica)
Client
3 Write - 1 Read
1 Write - 3 Read
1 Write - 1 Read
Questions?
tiny.cloudera.com/sgquestions
What do they share (Head Types)
- Master and a HA
- Leader Election
Master Hot Standby
Client
Leader Elected Follower Follower
Client
ZooKeeper
Elections
Questions?
tiny.cloudera.com/sgquestions
What do they share (Headless)
Headless Headless Headless
Client
Headless Headless Headless
Client
Questions?
tiny.cloudera.com/sgquestions
What do they share (CAP theorem)
Consistency
Questions?
tiny.cloudera.com/sgquestions
What do they share (CAP theorem)
Consistency
Eventual Consistency
Questions?
tiny.cloudera.com/sgquestions
What do they share (CAP theorem)
Consistency
Eventual Consistence
- Cheating the CAP Theorem
- Cassandra is a good model
- Where they expand the definition of failure
with variable consistence
- CAP still holds but …
Indexed Search
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Features)
- We don’t have enough time in this whole class
- Ordering logic
- NGrams
- Weights
- Text Indexing
- Translations
- Facets *
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Facets)
- Facets are a side effect of out wonderful indexes
- It allows us to counts all the document that below to given indexes to produce
- Grouped Counts
- Charts and Graphs (kibana or Banana)
- People will also call this access pattern cubing a dataset
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Kibana & Banana)
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Facets Example)
- Time Series Example
Document
ID
Hour of Day User State Event
1 12 4201 MD click
2 12 4202 VA click
3 12 4203 VA click
4 1 4201 MD click
5 1 4202 VA view
6 2 4204 CA click
7 2 4205 VA view
8 2 4201 MD click
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Facets Example)
Hour of
Day
12 1 2 3
1 4 5
2 6 7 8 9
Document
ID
Hour of
Day
User State Event
1 12 4201 MD click
2 12 4202 VA click
3 12 4203 VA click
4 1 4201 MD click
5 1 4202 VA view
6 2 4204 CA click
7 2 4205 VA view
8 2 4201 MD click
9 2 4204 CA click
User
4201 1 4 8
4202 2 5
4203 3
4204 6 9
4205 7
State
MD 1 4 8
VA 2 3 5 7
CA 6 9
Event
click 1 2 3 4 6 8 9
view 5 7
Questions?
tiny.cloudera.com/sgquestions
Lucene Indexing (Facets Example)
- Events per hour
- Simple array count
Hour of
Day
12 1 2 3
1 4 5
2 6 7 8 9
Questions?
tiny.cloudera.com/sgquestions
- Events per hour by State
- Simple array count
Lucene Indexing (Facets Example)
State
MD 1 4 8
VA 2 3 5 7
CA 6 9
Hour of
Day
12 1 2 3
1 4 5
2 6 7 8 9
Questions?
tiny.cloudera.com/sgquestions
- Note the bucketing and ordered pattern
Lucene Indexing (Facets Example)
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7State
MD 1 4 8
VA 2 3 5 7
CA 6 9
Hour of
Day
12 1 2 3
1 4 5
2 6 7 8 9
Questions?
tiny.cloudera.com/sgquestions
- Note the bucketing and ordered pattern
Lucene Indexing (Facets Example)
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7
+1 CA
Questions?
tiny.cloudera.com/sgquestions
- Note the bucketing and ordered pattern
Lucene Indexing (Facets Example)
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7
+1
VA
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7
+1 MD
Hour of
Day 2
State
MD
State
VA
State CA
6 1 2 6
7 4 3 9
8 8 5
9 7
+1 CA
Questions?
tiny.cloudera.com/sgquestions
Partitioning
- SolR and Elastic Search partition the document o land on all nodes
- This means
- You have the power of the cluster when querying
- This mean you are accessing the cluster when querying
Questions?
tiny.cloudera.com/sgquestions
Writing Latency
- Lucene Indexing is more expensive then NoSQL work
- Think of it as micro batching
- Larger batches ~= better throughput
- Compaction is also invalid
- Deletes impact storage and performance until they are compacted
Questions?
tiny.cloudera.com/sgquestions
Storage Cost
- TTL is your friend
- Think of Lucene based systems as great if
- You dataset is manageable in size
- You have a good TTL strategy
- You have a boat load of money
Graphs
Questions?
tiny.cloudera.com/sgquestions
Thinking in terms of Graphs
- Nodes and Edges
Node:1
…
Node:3
…
Node:2
…
Node:0
…
Friend
Child
Father
CoachWife
Questions?
tiny.cloudera.com/sgquestions
Thinking in terms of Graphs
- Use cases
- Querying
- Cassandra with Sparkle
- Neo4j
- Batch operations
- Giraph
- GraphX
- GraphLab
Questions?
tiny.cloudera.com/sgquestions
BSP Bulk Synchronous Parallel
- Process every Node Atomically
- Node gets all messages sent to it
- Nodes can mutate them selves and their edges
- Nodes can send messages to other nodes
- But nothing is received yet
- BSP waits until all the Node processing is done
- Then send messages to the right partition
- Repeat
Honorable Mentions
Kudu
CockroachDB
Druid.io
Questions?
tiny.cloudera.com/sgquestions
Why Honorable Mentions
Questions?
tiny.cloudera.com/sgquestions
Kudu
1. Replace the Region Servers with Tablet Servers
2. Replace block format HFile files with a parquet like TFiles
3. Replace the byte array focused HBase API with one that is more JDBC friendly
4. Tight integration with Spark SQL and Impala for SQL
5. Completely rewrite the compaction process to make for perfectly sized files with our having major
compactions but always manageable micro compactions.
Questions?
tiny.cloudera.com/sgquestions
Kudu
Master
Tablet Server
SSD
Tablet Server
SSD
Tablet Server
SSD
Tablet Server
SSD
Metadata
Splits
Questions?
tiny.cloudera.com/sgquestions
CoackroachDB
1. Think about Kudu but
1. Nested tables
2. Transitions
Questions?
tiny.cloudera.com/sgquestions
CockroachDB
SQL
Structured Data API
Distributed Monolithic KV Store
Tablet Server
Store 1
SSD
RocksD
B
Store 2
SSD
RocksD
B
Tablet Server
Store 3
SSD
RocksD
B
Store 4
SSD
RocksD
B
Tablet Server
Store 5
SSD
RocksD
B
Store 6
SSD
RocksD
B
Tablet Server
Store 7
SSD
RocksD
B
Store 8
SSD
RocksD
B
Questions?
tiny.cloudera.com/sgquestions
Druid.IO
1. Not a general SQL type store
2. More focused on metric
1. Where larger aggregations may be required
2. Many variables for same metric tag combinations
Questions?
tiny.cloudera.com/sgquestions
Druid.IO
Client
Broker Cluster
Broker
Node
Broker
Node
Broker
Node
Real Time Cluster
Real Time
Node
Real Time
Node
Real Time
Node
History Cluster
History
Node
History
Node
History
Node
Pluggable
Storage
Batch Ingestion
Streaming
Ingestion
ZookKeeper
Metadata Storage
Query Planning and
Response Preparation
Try to optimize read
path based on time
request of the query
Async large batching
Main Ingestion Path
Short TTL for hot
memory cache
House keeping services
Data Access
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Data Access
Batch Processing
Questions?
tiny.cloudera.com/sgquestions
Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ For example,
- Kudu -> HDFS Nested is batch processing
- KMeans calculation, etc.
Questions?
tiny.cloudera.com/sgquestions
Types of Processing
▪ ETL
▪ Analytics
▪ Graph Analytics
▪ Machine Learning
▪ Deep Learning
Questions?
tiny.cloudera.com/sgquestions
Batch processing options
▪ Spark (+ MLlib)
▪ MapReduce (+ Mahout)
▪ Flink (+ Flink ML)
▪ Distributed query engines
Questions?
tiny.cloudera.com/sgquestions
Spark
▪ Pretty popular
▪ Much faster than MapReduce
▪ Thriving community
Questions?
tiny.cloudera.com/sgquestions
MapReduce
▪ Sloooooow
Questions?
tiny.cloudera.com/sgquestions
Flink
▪ Pretty popular
▪ Batch is a special case of Streaming
▪ Developing community
Questions?
tiny.cloudera.com/sgquestions
Just SQL
▪ Apache Spark SQL
▪ Hive
▪ MR, Spark, Tez
▪ Impala
▪ Presto
Questions?
tiny.cloudera.com/sgquestions
Distributed Machine Learning
▪ Apache Spark MlLib
▪ Apache Spark Custom
▪ H2o
▪ Data Robot
▪ Google Tensorflow
▪ Microsoft Congnitive
Questions?
tiny.cloudera.com/sgquestions
GPUs vs CPUs
▪ GPU are much better for floating point operations
▪ Consider Gradient Descent
▪ Encourages hardware to be separated from storage
Data Access
Interactive Access
Questions?
tiny.cloudera.com/sgquestions
Types of data access
▪ REST server/APIs for querying entities and aggregates
▪ UI for displaying search facets
▪ SQL engine
REST servers
Considerations
Questions?
tiny.cloudera.com/sgquestions
REST Servers
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…
val server = new Server(port)
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
"com.sun.jersey.api.core.PackagesResourceConfig")
sh.setInitParameter("com.sun.jersey.config.property.packages",
"com.hadooparchitecturebook.taxi360.server.hbase")
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true”)
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*”)
server.start()
server.join()
UI
Considerations
Questions?
tiny.cloudera.com/sgquestions
UI requirements
Something that can
▪ Represent search results really well
▪ Integrates with Apache Solr, query engines, etc.
Questions?
tiny.cloudera.com/sgquestions
UI options
▪ Hue
▪ Banana
▪ Kibana
Questions?
tiny.cloudera.com/sgquestions
Hue In Our Use Case
▪ Because it’s included
▪ Please look at the others
Data Access
SQL Engines
Questions?
tiny.cloudera.com/sgquestions
SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large scale aggregation
▪ Optionally integrates with Multiple Storage Systems for real-time updates to SQL
tables
Questions?
tiny.cloudera.com/sgquestions
Apache Hive
▪ Good JDBC integration
▪ Not really low latency, even when using Tez
▪ Doesn’t integrate with Kudu
▪ Can run with MapReduce, Spark, or Tez
Questions?
tiny.cloudera.com/sgquestions
Presto
▪ Low latency SQL engine from Facebook
▪ Provides JDBC/ODBC access
▪ Is only in-memory, large aggregations can lead to OOM errors
▪ Doesn’t integrate with Kudu
Questions?
tiny.cloudera.com/sgquestions
Apache Impala
▪ Low latency SQL access
▪ Provides JDBC/ODBC access
▪ Excellent concurrency support
▪ Integrates with Kudu for real-time SQL
Questions?
tiny.cloudera.com/sgquestions
Apache Drill
▪ Similar in architecture to Impala
▪ Provides JDBC/ODBC access
▪ Doesn’t integrate with Kudu
Questions?
tiny.cloudera.com/sgquestions
Spark SQL
▪ Builds on top of Spark
▪ JDBC/ODBC access only via Spark Thrift Server
- Doesn’t scale well with larger number of concurrent users
- Doesn’t fully provide secure access.
Summary
Questions?
tiny.cloudera.com/sgquestions
High level architecture
Data Sources Streaming Pipes
Schema Validation
Enrichment
Stream Processing
Routing
StorageProducers
Transport
Replication
Long Term SQL
Speed Layer SQL
Time Series
Reverse Indexed
Stream
Access
SQL
Machine Learning
Request Response
Batch Processing
Code
Agents
Log Aggregators
Where else to find us?
Questions?
tiny.cloudera.com/sgquestions
Other Sessions
▪ Top Five Mistakes When Writing Streaming Applications – Wednesday, 11:15
Thank you!
@hadooparchbook
tiny.cloudera.com/app-arch-newyork
Mark Grover | @mark_grover
Gwen Shapira | @gwenshap
Jonathan Seidman | @jseidman

Más contenido relacionado

La actualidad más candente

Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/HudiVinoth Chandar
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingMichael Rainey
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016StampedeCon
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsMark Rittman
 
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroHBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroCloudera, Inc.
 
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsReal-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsMichael Rainey
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 

La actualidad más candente (20)

Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
 
The Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data PlatformsThe Future of Analytics, Data Integration and BI on Big Data Platforms
The Future of Analytics, Data Integration and BI on Big Data Platforms
 
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend MicroHBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
HBaseCon 2012 | HBase Security for the Enterprise - Andrew Purtell, Trend Micro
 
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c AdaptorsReal-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
Real-Time Data Replication to Hadoop using GoldenGate 12c Adaptors
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 

Similar a Architecting a Next Generation Data Platform – Strata Singapore 2017

Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platformhadooparchbook
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksDatabricks
 
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...Databricks
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond RelationalLynn Langit
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDatabricks
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIData Con LA
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsCloudera, Inc.
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023Timothy Spann
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaGuido Schmutz
 
Webinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS ApplicationsWebinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS ApplicationsTechcello
 

Similar a Architecting a Next Generation Data Platform – Strata Singapore 2017 (20)

Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting next generation big data platform
Architecting next generation big data platformArchitecting next generation big data platform
Architecting next generation big data platform
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
Productionizing Machine Learning with Apache Spark, MLflow and ONNX from the ...
 
Beyond Relational
Beyond RelationalBeyond Relational
Beyond Relational
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ DevicesDelivering Insights from 20M+ Smart Homes with 500M+ Devices
Delivering Insights from 20M+ Smart Homes with 500M+ Devices
 
Stargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data APIStargate, the gateway for some multi-models data API
Stargate, the gateway for some multi-models data API
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
Webinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS ApplicationsWebinar How to Achieve True Scalability in SaaS Applications
Webinar How to Achieve True Scalability in SaaS Applications
 

Más de Jonathan Seidman

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Jonathan Seidman
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_finalJonathan Seidman
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 

Más de Jonathan Seidman (14)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 

Último

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Último (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Architecting a Next Generation Data Platform – Strata Singapore 2017