Data Pipelines With StreamSets

Jowanza Joseph

How to create data pipelines with StreamSets.

Data Pipelines With StreamSets
Jowanza Joseph
@jowanza

Agenda
About me
The Problem Space
Streaming
StreamSets
Demo
Questions

About Me
Software Engineer at One Click Retail
Scala / Spark / Mesos / Kubernetes
Author: Apache Spark Fieldbook
Cyclist
Husband and father

Scalability
Complexity
Observability
Extendability

Goals
Data Provenance
Guaranteed Delivery
Configurable
Extendable
Multi-Protocol Support
DAG
Distributed
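
The "DAG" goal above can be illustrated with a minimal sketch, assuming a pipeline is a set of named stages plus a map of which stages each one depends on. Every name here (`read_source`, `trim_values`, `count_records`, `write_sink`, `run_pipeline`) is hypothetical and chosen for illustration; none of it is a StreamSets API. The standard-library `graphlib.TopologicalSorter` (Python 3.9+) orders the stages so that each one runs only after its upstream stages have produced output.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each takes a dict of upstream results keyed by stage name.
def read_source(_inputs):
    return [{"id": 1, "value": " a "}, {"id": 2, "value": "b"}]

def trim_values(inputs):
    return [{**r, "value": r["value"].strip()} for r in inputs["read_source"]]

def count_records(inputs):
    return len(inputs["read_source"])

def write_sink(inputs):
    return {"records": inputs["trim_values"], "count": inputs["count_records"]}

STAGES = {
    "read_source": read_source,
    "trim_values": trim_values,
    "count_records": count_records,
    "write_sink": write_sink,
}

# Edges of the DAG: stage name -> set of stages it depends on.
DEPENDS_ON = {
    "read_source": set(),
    "trim_values": {"read_source"},
    "count_records": {"read_source"},
    "write_sink": {"trim_values", "count_records"},
}

def run_pipeline(stages, depends_on):
    """Run every stage once, upstream stages first."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on[name]}
        results[name] = stages[name](inputs)
    return results

results = run_pipeline(STAGES, DEPENDS_ON)
```

Keeping the dependency map separate from the stage functions is one possible design choice: it lets the graph be changed as configuration without touching stage code, and `TopologicalSorter` raises `CycleError` if the map stops being acyclic.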
