2. Something About Me
• Big Data Delivery Lead at Optum (UHG)
• Previously at IBM and FAO of the UN
• Current fields of expertise are Big Data, ML/DL and DevOps
• Past experience in JVM language development (Java,
Groovy, Scala), test automation, CI/CD
• Author of the upcoming book “Hands-on Deep Learning
with Apache Spark”
• I love preparing home-made pizza
4. Agenda
• Challenges of Data Ingestion from the Edge
• StreamSets Data Collector
– Features
– Core concepts
• StreamSets Data Collector Edge
– Overview
• Demos
– SDC (Kafka enablement)
– SDC Edge (Android + Elasticsearch + Kibana or Grafana)
5. Challenges of Data Ingestion from the Edge
• An ever-increasing amount of data is generated outside the
data center or cloud.
• New scenarios (Industry 4.0, IoT, connected cars, smartphones).
• It isn’t always easy to get data out of source systems or perform
analytics right where it’s generated.
• Getting data into central big data systems is an arduous task
involving a large number of disjointed, poorly instrumented and
often hand-coded technologies.
• Limited resources (memory, CPU, connectivity).
• Unexpected changes (Data Drift).
• Live management of thousands of edge pipelines: difficult to
operate at scale.
6. What’s StreamSets Data Collector (SDC)?
• It is a tool to design complex data flows with minimal coding
and maximum flexibility.
• It provides real-time data flow statistics and metrics for each
flow stage.
• It provides automated error handling and alerting.
• It is easy to use (drag-and-drop from a web UI).
• It ensures zero downtime when upgrading the underlying
infrastructure.
• It handles data serialization.
• It is Open Source.
7. SDC Use Cases
• Apache Kafka Enablement
– Connecting applications to Kafka without writing a single line of code.
• Hadoop Ingestion
– Easy, continuous data ingestion into Hadoop and its surrounding
ecosystem.
• Cloud Migration
– Migrate data onto or across cloud providers.
• Search Enablement
– Easy population of your search solution of choice with data from any
source.
8. SDC Core Concepts
• Origin
– Represents the source for the pipeline.
• Processor
– It's a stage that represents a type of data processing that you want to
perform.
• Destination
– Represents the target for a pipeline.
• Executor
– It’s a stage that triggers a task when it receives an event.
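The four stage types above can be pictured with a small conceptual sketch. This is plain Python, not the SDC API: every function name, the record shape and the "batch-written" event are assumptions made up for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict  # a record is modeled as a plain dict in this sketch


def origin() -> Iterator[Record]:
    """Origin: the source for the pipeline (here, two hard-coded readings)."""
    yield {"sensor": "a", "temp_c": 20.0}
    yield {"sensor": "b", "temp_c": 50.0}


def processor(records: Iterable[Record]) -> Iterator[Record]:
    """Processor: one type of data processing (here, add a Fahrenheit field)."""
    for record in records:
        yield {**record, "temp_f": record["temp_c"] * 9 / 5 + 32}


def destination(records: Iterable[Record], sink: List[Record],
                on_event: Callable[[dict], None]) -> None:
    """Destination: the target for the pipeline; emits an event when done."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    # Hypothetical event, used only to show how an executor gets triggered.
    on_event({"type": "batch-written", "count": count})


def make_executor(log: List[str]) -> Callable[[dict], None]:
    """Executor: triggers a task when it receives an event (here, a log entry)."""
    def executor(event: dict) -> None:
        log.append(f"{event['type']}: {event['count']} records")
    return executor


def run_pipeline():
    """Wire origin -> processor -> destination, with an executor on events."""
    sink, log = [], []
    destination(processor(origin()), sink, make_executor(log))
    return sink, log
```

Calling run_pipeline() returns the records that reached the destination and the log entries written by the executor. In SDC itself this wiring is done by drag-and-drop in the web UI rather than in code.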
9. SDC Origins
• Cloud platforms
• Local and remote file systems
• HTTP and REST API
• Kafka
• Hadoop
• Relational databases
• MQTT
14. What’s SDC Edge?
• It is an ultra lightweight agent that can run pipelines
designed in SDC to ship data in and out of systems.
• It is written in Go and compiles down to a <5MB executable
that has no dependencies.
• It is Open Source.
• No dependency on external IoT Gateways.
• Can perform routing and filtering logic on edge pipelines
(architected for Edge Analytics).
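As a rough idea of what routing and filtering logic on an edge pipeline means, here is a minimal sketch in plain Python. It does not use SDC Edge itself: the field name temp_c, the 40.0 threshold and the lane names "alerts" and "metrics" are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional

Record = dict  # a record is modeled as a plain dict in this sketch


def route(record: Record, threshold: float = 40.0) -> Optional[str]:
    """Return the lane for a record, or None to filter it out."""
    temp = record.get("temp_c")
    if temp is None:
        return None  # filtering: drop records without a reading
    # routing: hot readings go to "alerts", everything else to "metrics"
    return "alerts" if temp >= threshold else "metrics"


def dispatch(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Send each record to its lane; filtered records are discarded."""
    lanes: Dict[str, List[Record]] = {"alerts": [], "metrics": []}
    for record in records:
        lane = route(record)
        if lane is not None:
            lanes[lane].append(record)
    return lanes
```

Doing this on the device means only the records that matter have to cross a constrained network link.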
15. What’s SDC Edge?
• It runs natively on different platforms.
• It supports leading messaging protocols including HTTP,
MQTT, CoAP, WebSockets and Kafka.
• It can detect and handle data drift.
• Multiple pipelines can run at the same time per agent.
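To make the data drift bullet concrete, here is one simple way to detect and tolerate a changed record structure. It is a sketch in plain Python under stated assumptions, not how SDC Edge implements drift handling: the expected field set and the helper names detect_drift and handle are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def detect_drift(expected: Set[str], record: dict) -> Dict[str, List[str]]:
    """Compare a record's fields with the expected ones."""
    seen = set(record)
    return {
        "added": sorted(seen - expected),    # fields that appeared unexpectedly
        "missing": sorted(expected - seen),  # fields that disappeared
    }


def handle(record: dict, expected: Set[str]) -> Tuple[dict, Dict[str, List[str]]]:
    """Keep the pipeline running: normalize the record and report the drift."""
    drift = detect_drift(expected, record)
    # Keep only the expected fields; fill missing ones with None.
    normalized = {field: record.get(field) for field in sorted(expected)}
    return normalized, drift
```

For example, if a sensor starts sending "temperature" instead of "temp_c", handle() reports one added and one missing field and still emits a record with the expected shape.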
16. SDC Edge Use Cases
• Internet of Things (IoT)
– Reliably ingest and apply machine learning and other analytic
techniques to data aggregated from huge populations of IoT sensors
and devices.
• Cybersecurity
– Ingest and apply advanced analytics to the vast quantities of data
collected across a corporate network in order to detect imminent threats
or attacks in progress.
19. SDC Edge: other topics
• Performance
• Security
• CI/CD
• REST API
• Logging
• Pipeline deployment
20. Useful Links
StreamSets Data Collector docs:
https://streamsets.com/documentation/datacollector/latest/help/#datacollector/UserGuide/GettingStarted
StreamSets Data Collector on GitHub:
https://github.com/streamsets/datacollector
StreamSets Data Collector Edge docs:
https://streamsets.com/products/sdc-edge
StreamSets Data Collector Edge on GitHub:
https://github.com/streamsets/datacollector-edge
Sdc-user Google group:
https://groups.google.com/a/streamsets.com/forum/#!forum/sdc-user
Ask StreamSets: https://ask.streamsets.com/questions/