Always-on Ingestion
Continuous Ingestion for Data at Scale
Arvind Prabhakar
Big Data Day LA, June 2015
© 2015 StreamSets Inc., All rights reserved
About me
❏  Founder/CTO

Apache Software Foundation
❏  Flume - PMC Chair
❏  Sqoop - PMC Chair
❏  Storm - PMC, Committer
❏  MetaModel - Mentor
❏  Sentry - Mentor
❏  NiFi - Mentor
❏  ASF Member

Previously...
❏  Cloudera
❏  Informatica

@aprabhakar
Some Background
What is Data Ingestion?
❏  Acquiring data from various sources
❏  Storing acquired data where it can be processed
Why do we need Data Ingestion?
❏  Data is consumed away from where it is produced
❏  Consuming systems are often distributed and remote
How is Data Ingestion Implemented?
❏  Manually, with scripts, with rudimentary automation
❏  Higher level frameworks like Flume, Kafka, etc.
[Diagram: data sources feeding the enterprise data infrastructure]
Sources: Logs, Files, Click Streams, Sensors, Devices, Database Logs, Social Data, Streams, Feeds, Other
Enterprise Data Infrastructure
❏  Raw Storage (HDFS, S3)
❏  EDW, NoSQL (Hive, Impala, HBase, Cassandra, Redshift)
❏  Search (Solr, ElasticSearch)
Data Ingestion Challenge
Ever-increasing data volumes and rates...
Data sources are physically distributed and transient...
Data structures and semantics are constantly changing...
Lot more than moving data!
Data Ingest should be agile
❏  Welcome new data sources as they emerge
❏  Incorporate changes to existing sources as needed
Data Ingest should be safe and reliable
❏  Protect your downstream from silent data corruption
❏  Ensure that there is no data loss in your infrastructure
Data Ingest should scale as needed
❏  Data ingest must never become a bottleneck
❏  Data ingest must scale without significant cost or effort
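The "safe and reliable" goals above can be illustrated with a minimal sketch. It assumes a hypothetical record shape (dicts with `id` and `ts` fields). Records that fail validation are diverted to a rejected list instead of being dropped or passed downstream, so nothing is lost or corrupted silently. The field names and the `ingest` function are illustrative assumptions, not part of any tool named in this talk.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"id": int, "ts": float}


def is_valid(record: Dict[str, Any]) -> bool:
    """Return True when every required field is present with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def ingest(
    records: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into (accepted, rejected); no record is silently discarded."""
    accepted: List[Dict[str, Any]] = []
    rejected: List[Dict[str, Any]] = []
    for record in records:
        (accepted if is_valid(record) else rejected).append(record)
    return accepted, rejected
```

Keeping the rejected records, rather than discarding them, is what lets an operator inspect and replay them once the source or the schema has been fixed.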
What can you do?
»  Design Wisely
»  Operate Cautiously
»  Update Liberally
●  Pick the right technology and toolset
●  Instrument and monitor mercilessly
●  Anticipate and understand the changes in your environment
Here is how...
Picking the right technology
Ingest Mode      Description                                  Example
Manual/Scripted  File copying using CLI or GUI interface      Cloudera HUE, Hadoop FS client
Batch Transport  Bulk data transport using tools              Sqoop, DistCp
Micro-batching   Transport of small batches of data           Sqoop/Sqoop2 (Storm, etc...)
Pipelining       Flow-like transport of event streams         Flume, Scribe
Message-Queue    Publish-Subscribe like transport of events   Kafka, Kinesis
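To make the micro-batching mode in the table concrete, here is a minimal sketch that groups a stream of events into small fixed-size batches. The `micro_batches` helper and the batch size are illustrative assumptions. Real tools typically also flush on a time interval, which this sketch leaves out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size events, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        # Pull up to batch_size events; an empty list means the stream is done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, `list(micro_batches(range(7), 3))` yields two full batches and one final partial batch.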
Sqoop
Overview
❏  CLI Tool
❏  Oriented towards structured data stores
❏  Runs map-only job to transport data
Advantages
❏  Propagates metadata
❏  Cluster based parallel scaling capability
❏  Simple and easy to understand/operate
❏  Rich set of connectors available for use
❏  Supports popular formats like Avro, sequence file etc.
Disadvantages
❏  Not a service
❏  Direct access to production data stores from cluster
❏  Requires access to data store credentials
❏  Connector functionality is not consistent between different connectors
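The "cluster based parallel scaling" point can be pictured as dividing the key range of a table among parallel tasks, each of which transfers one slice. The sketch below is a simplified illustration of that idea and not Sqoop's actual implementation. The `split_range` function and its parameters are hypothetical.

```python
from typing import List, Tuple


def split_range(min_key: int, max_key: int, num_tasks: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_key, max_key] into contiguous inclusive slices.

    Returns at most num_tasks (low, high) pairs that together cover the range
    exactly once, with slice sizes differing by at most one.
    """
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if max_key < min_key:
        return []
    total = max_key - min_key + 1
    tasks = min(num_tasks, total)  # never create more slices than keys
    base, extra = divmod(total, tasks)
    slices: List[Tuple[int, int]] = []
    low = min_key
    for i in range(tasks):
        size = base + (1 if i < extra else 0)  # first `extra` slices get one more key
        slices.append((low, low + size - 1))
        low += size
    return slices
```

Each slice could then become a bounded query (for example, a range condition on the key column) executed by one task, assuming the key values are spread reasonably evenly.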
Sqoop 2
Overview
❏  Sqoop Service with CLI and JSON/REST interface
❏  Oriented towards structured data stores
❏  Runs chained Map-Reduce jobs for data transport and conversion
Advantages
❏  Propagates metadata
❏  Cluster based parallel scaling capability
❏  Simple and easy to understand/operate
❏  Supports popular formats like Avro, sequence file etc.
❏  Consistent functionality across connectors
❏  Secure handling of credentials with RBAC security
Disadvantages
❏  Considered pre-production quality before 2.0.0 release. Currently at 1.99.6.
❏  May not have connectivity at par with Sqoop 1.
Flume
Overview
❏  Distributed pipeline system for efficient transport of large volumes of data
❏  Built in support for contextual routing, filtering, replication and multiplexing
Advantages
❏  Guaranteed delivery semantics
❏  Low-latency reliable data transfer
❏  Declarative configuration with no coding necessary for common use-cases
❏  Fully extendable and customizable
❏  Integrates with most commonly used end-points
Disadvantages
❏  Non-trivial configuration
❏  Complex topology configuration can be hard to build and maintain
❏  Custom end-point implementation requires significant code complexity
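Contextual routing and replication, mentioned in the overview, can be sketched in a few lines. This is a conceptual illustration only and does not use Flume's API or configuration syntax. The event shape (a dict with `headers` and `body`), the `route` function, and the channel names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def route(
    events: Iterable[Event],
    header: str,
    mapping: Dict[str, List[str]],
    default: List[str],
) -> Dict[str, List[Event]]:
    """Place each event on the channels mapped to its header value.

    A header value mapped to several channels replicates the event to each;
    events with an unmapped or missing header value go to the default channels.
    """
    channels: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        value = event.get("headers", {}).get(header)
        for name in mapping.get(value, default):
            channels[name].append(event)
    return dict(channels)
```

Mapping one header value to two channel names gives replication, while mapping different values to different channels gives multiplexing, so both behaviours fall out of the same lookup.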
Kafka
Overview
❏  Distributed and efficient publish-subscribe messaging system
❏  Used for democratization of data between applications
Advantages
❏  Strong retention and ordering semantics
❏  Dynamic cluster based scalability and throughput
❏  Low-level APIs for building consumers and producers
❏  Variety of open source producers and consumers available on GitHub
❏  Allows reprocessing of consumed data
Disadvantages
❏  Delivery guarantee owned by producers and consumers
❏  Opaque pub-sub design can cause applications to be highly coupled
❏  Minimal metadata support
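The retention, ordering, and reprocessing properties listed above follow from a log-with-offsets model, which this toy in-memory sketch illustrates. It is not the Kafka client API. The `Log` class and its methods are hypothetical names, and a real broker adds partitions, persistence, and replication.

```python
from typing import Dict, List


class Log:
    """Append-only log; each consumer tracks its own read position (offset)."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, message: str) -> int:
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return the next messages for this consumer, in order, and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer's offset so retained messages can be reprocessed."""
        if not 0 <= offset <= len(self._messages):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset
```

Because reading does not remove messages, what a consumer sees depends on how it manages its own offset, which is one way to read the "delivery guarantee owned by producers and consumers" point.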
Typical Examples
For Structured Data
Simple
❏  Sqoop for Batch transport
❏  Sqoop 2 for micro-batch transport
Intermediate
❏  Flume for Directory Spooling
Advanced
❏  Custom Database Log Shipping implementation
For Streaming Event Data
Simple
❏  Flume based Aggregation
❏  Kafka based pub-sub for applications
Intermediate
❏  Flume + Kafka based aggregation and pub-sub
Advanced
❏  Kafka + Storm for pub-sub and preparation
For more information...
❏  Apache Sqoop: http://sqoop.apache.org
❏  Current Version: Sqoop1 - 1.4.6; Sqoop2 - 1.99.6
❏  Apache Flume: http://flume.apache.org
❏  Current Version: Flume 1.6.0
❏  Apache Kafka: http://kafka.apache.org
❏  Current Version: Kafka 0.8.2.1
Thank You!
My Contact Information:
●  Email: arvind at streamsets dot com
●  Twitter: @aprabhakar
●  Website: www.streamsets.com

End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 

Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabhakar of StreamSets

  • 1. Always-on Ingestion
      Continuous Ingestion for Data at Scale
      Arvind Prabhakar
      Big Data Day LA, June 2015
      © 2015 StreamSets Inc., All rights reserved
  • 2. About me
      ❏ Founder/CTO
      Apache Software Foundation
        ❏ Flume - PMC Chair
        ❏ Sqoop - PMC Chair
        ❏ Storm - PMC, Committer
        ❏ MetaModel - Mentor
        ❏ Sentry - Mentor
        ❏ NiFi - Mentor
        ❏ ASF Member
      Previously...
        ❏ Cloudera
        ❏ Informatica
      @aprabhakar
  • 3. Some Background
      What is Data Ingestion?
        ❏ Acquiring data from various sources
        ❏ Storing acquired data where it can be processed
      Why do we need Data Ingestion?
        ❏ Data is consumed away from where it is produced
        ❏ Consuming systems are often distributed and remote
      How is Data Ingestion Implemented?
        ❏ Manually, with scripts, with rudimentary automation
        ❏ Higher level frameworks like Flume, Kafka, etc.
      [Diagram] Sources: Logs, Files, Click Streams, Sensors, Devices, Database Logs, Social Data Streams, Feeds, Other
      [Diagram] Enterprise Data Infrastructure: Raw Storage (HDFS, S3); EDW, NoSQL (Hive, Impala, HBase, Cassandra, Redshift); Search (Solr, Elasticsearch)
  • 4. Data Ingestion Challenge
      ❏ Ever increasing data volumes and rates...
      ❏ Data sources are physically distributed and transient...
      ❏ Data structures and semantics are constantly changing...
  • 5. Lot more than moving data!
      Data Ingest should be agile
        ❏ Welcome new data sources as they emerge
        ❏ Incorporate changes to existing sources as needed
      Data Ingest should be safe and reliable
        ❏ Protect your downstream from silent data corruption
        ❏ Ensure that there is no data loss in your infrastructure
      Data Ingest should scale as needed
        ❏ Data ingest must never become a bottleneck
        ❏ Data ingest must scale without significant cost or effort
  • 6. What can you do?
        » Design Wisely
        » Operate Cautiously
        » Update Liberally
      Here is how...
        ● Pick the right technology and toolset
        ● Instrument and monitor mercilessly
        ● Anticipate and understand the changes in your environment
  • 7. Picking the right technology
      Ingest Mode     | Description                                | Example
      Manual/Scripted | File copying using CLI or GUI interface    | Cloudera HUE, Hadoop FS client
      Batch Transport | Bulk data transport using tools            | Sqoop, DistCp
      Micro-batching  | Transport of small batches of data         | Sqoop/Sqoop2 (Storm, etc.)
      Pipelining      | Flow-like transport of event streams       | Flume, Scribe
      Message-Queue   | Publish-Subscribe like transport of events | Kafka, Kinesis
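The difference between the batch transport and micro-batching modes listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Sqoop or any other tool named on the slide is implemented; the function name `micro_batches` and the sample data are assumptions made up for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical source of seven records.
rows = range(7)

# Batch transport: everything moves in one large transfer.
bulk = list(micro_batches(rows, batch_size=7))

# Micro-batching: the same records move as several small transfers.
small = list(micro_batches(rows, batch_size=3))
```

With a small batch size, downstream systems see data sooner, at the cost of more transfers to schedule and track.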
  • 8. Sqoop
      Overview
        ❏ CLI Tool
        ❏ Oriented towards structured data stores
        ❏ Runs map-only job to transport data
      Advantages
        ❏ Propagates metadata
        ❏ Cluster based parallel scaling capability
        ❏ Simple and easy to understand/operate
        ❏ Rich set of connectors available for use
        ❏ Supports popular formats like Avro, sequence file etc.
      Disadvantages
        ❏ Not a service
        ❏ Direct access to production data stores from cluster
        ❏ Requires access to data store credentials
        ❏ Connector functionality is not consistent between different connectors
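The "cluster based parallel scaling" point can be pictured as dividing a table's key range among parallel map tasks, each of which imports its own slice. Sqoop 1 exposes this through its `--split-by` and `--num-mappers` options. The Python below is a rough sketch of that idea under simplifying assumptions (an integer key, evenly sized ranges); the function names, the table `orders`, and the column `id` are hypothetical, and this is not Sqoop's own splitter.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into contiguous inclusive sub-ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    num_splits = min(num_splits, total)  # never create empty ranges
    base, extra = divmod(total, num_splits)
    ranges = []
    start = lo
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def split_queries(table: str, column: str, lo: int, hi: int, num_splits: int) -> List[str]:
    """Build one bounded query per parallel task (illustrative SQL only)."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {a} AND {column} <= {b}"
        for a, b in split_ranges(lo, hi, num_splits)
    ]


# Hypothetical table with keys 1..10, imported by four parallel tasks.
ranges = split_ranges(1, 10, 4)
queries = split_queries("orders", "id", 1, 10, 4)
```

Each task reads directly from the source database, which is why the slide lists direct access to production data stores and credentials among the disadvantages.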
  • 9. Sqoop 2
      Overview
        ❏ Sqoop Service with CLI and JSON/REST interface
        ❏ Oriented towards structured data stores
        ❏ Runs chained Map-Reduce jobs for data transport and conversion
      Advantages
        ❏ Propagates metadata
        ❏ Cluster based parallel scaling capability
        ❏ Simple and easy to understand/operate
        ❏ Supports popular formats like Avro, sequence file etc.
        ❏ Consistent functionality across connectors
        ❏ Secure handling of credentials with RBAC security
      Disadvantages
        ❏ Considered pre-production quality before 2.0.0 release. Currently at 1.99.6.
        ❏ May not have connectivity at par with Sqoop 1.
  • 10. Flume
      Overview
        ❏ Distributed pipeline system for efficient transport of large volumes of data
        ❏ Built in support for contextual routing, filtering, replication and multiplexing
      Advantages
        ❏ Guaranteed delivery semantics
        ❏ Low-latency reliable data transfer
        ❏ Declarative configuration with no coding necessary for common use-cases
        ❏ Fully extendable and customizable
        ❏ Integrates with most commonly used end-points
      Disadvantages
        ❏ Non-trivial configuration
        ❏ Complex topology configuration can be hard to build and maintain
        ❏ Custom end-point implementation requires significant code complexity
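Contextual routing and multiplexing, as mentioned in the Flume overview, amount to choosing destination channels from an event's headers. In Flume itself this is declared in the agent configuration rather than coded. The Python below is only a toy model of the concept: the class `MultiplexingRouter`, the `deliver` helper, the `region` header, and the channel names are all assumptions invented for the example and are not part of Flume's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


class MultiplexingRouter:
    """Pick destination channels from the value of one event header."""

    def __init__(self, header: str, mapping: Dict[str, List[str]], default: List[str]):
        self.header = header
        self.mapping = mapping
        self.default = default

    def route(self, headers: Dict[str, str]) -> List[str]:
        # Events with a missing or unmapped header value go to the default channels.
        value: Optional[str] = headers.get(self.header)
        return self.mapping.get(value, self.default)


def deliver(events: Iterable[dict], router: MultiplexingRouter) -> Dict[str, List[str]]:
    """Append each event body to every channel the router selects for it."""
    channels: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        for channel in router.route(event["headers"]):
            channels[channel].append(event["body"])
    return dict(channels)


# Hypothetical routing table: EU events are also copied to an audit channel.
router = MultiplexingRouter(
    header="region",
    mapping={"us": ["hdfs_us"], "eu": ["hdfs_eu", "audit"]},
    default=["hdfs_other"],
)
events = [
    {"headers": {"region": "us"}, "body": "click-1"},
    {"headers": {"region": "eu"}, "body": "click-2"},
    {"headers": {}, "body": "click-3"},
]
routed = deliver(events, router)
```

Mapping one header value to several channels gives replication for that subset of events, while the default list catches anything the mapping does not cover.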
  • 11. Kafka
      Overview
        ❏ Distributed and efficient publish-subscribe messaging system
        ❏ Used for democratization of data between applications
      Advantages
        ❏ Strong retention and ordering semantics
        ❏ Dynamic cluster based scalability and throughput
        ❏ Low-level APIs for building consumers and producers
        ❏ Variety of open source producers and consumers available on GitHub
        ❏ Allows reprocessing of consumed data
      Disadvantages
        ❏ Delivery guarantee owned by producers and consumers
        ❏ Opaque pub-sub design can cause applications to be highly coupled
        ❏ Minimal metadata support
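Two of the points above, ordered retention and the ability to reprocess consumed data, follow from modelling a topic partition as an append-only log in which each consumer tracks its own read offset. The Python below is a minimal in-memory sketch of that model. The classes `TopicLog` and `Consumer` are hypothetical names for illustration and do not correspond to any Kafka client API.

```python
from typing import List


class TopicLog:
    """Append-only, ordered record log that retains everything it receives."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, value: str) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int) -> List[str]:
        """Return up to max_records records starting at offset, in write order."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Reader that owns its position in the log."""

    def __init__(self, log: TopicLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[str]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the read position, for example back to 0 to reprocess."""
        if offset < 0:
            raise ValueError("offset must not be negative")
        self.offset = offset


log = TopicLog()
for value in ["a", "b", "c"]:
    log.append(value)

consumer = Consumer(log)
first = consumer.poll(max_records=2)
second = consumer.poll()

# Reading does not remove records, so rewinding replays them in order.
consumer.seek(0)
replayed = consumer.poll()
```

Because the position lives with the consumer rather than the log, the delivery guarantee is something producers and consumers have to manage themselves, which is the first disadvantage the slide lists.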
  • 12. Typical Examples
      For Structured Data
        Simple
          ❏ Sqoop for Batch transport
          ❏ Sqoop 2 for micro-batch transport
        Intermediate
          ❏ Flume for Directory Spooling
        Advanced
          ❏ Custom Database Log Shipping implementation
      For Streaming Event Data
        Simple
          ❏ Flume based Aggregation
          ❏ Kafka based pub-sub for applications
        Intermediate
          ❏ Flume + Kafka based aggregation and pub-sub
        Advanced
          ❏ Kafka + Storm for pub-sub and preparation
  • 13. For more information...
      ❏ Apache Sqoop: http://sqoop.apache.org
        ❏ Current Version: Sqoop1 - 1.4.6; Sqoop2 - 1.99.6
      ❏ Apache Flume: http://flume.apache.org
        ❏ Current Version: Flume 1.6.0
      ❏ Apache Kafka: http://kafka.apache.org
        ❏ Current Version: Kafka 0.8.2.1
  • 14. Thank You!
      My Contact Information:
        ● Email: arvind at streamsets dot com
        ● Twitter: @aprabhakar
        ● Website: www.streamsets.com