These slides were presented by Avinash Ramineni of Clairvoyant to the Atlanta Apache Spark User Group on Wednesday, March 22, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238109721/
Slide 4
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS ?
Slide 5
Challenges
• Data in Silos
• Data acquires different perspectives as it is moved between systems
• Data availability delays
• Legacy systems struggle to handle the Volume, Velocity, and Veracity of data
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
Slide 6
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Slide 8
Self-Service at all Levels
Ingest → Organize → Enrich → Analyze → Dashboards
Ingest → Organize → Enrich → Analyze → Insights
Slide 9
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring Your Own Cluster (BYOC)
• HA / DR
• Open Source Stack
Slide 10
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
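The separation above can be sketched as a couple of Spark configuration lines that point a compute cluster at shared S3-backed storage; the bucket name and credentials provider here are assumptions, not from the deck:

```
# spark-defaults.conf — point Spark SQL at a shared, S3-backed warehouse
spark.sql.warehouse.dir                        s3a://shared-data-lake/warehouse
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.InstanceProfileCredentialsProvider
```

With storage externalized like this, a cluster can be resized or replaced without moving data, which is what makes the hardware-upgrade and DR points above simpler.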
Slide 11
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables a usage-based cost model
(Diagram: Marketing, Personalization, and Main clusters each attached to centralized/common S3 storage)
Slide 13
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
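A minimal sketch of the "DB Ingestor" pattern above: pull only rows newer than the last ingested watermark. `sqlite3` stands in for the source database, and the table/column names (`events`, `updated_at`) are illustrative assumptions:

```python
import sqlite3

def ingest_increment(conn, last_watermark):
    """Incremental DB ingest: fetch only rows newer than the last watermark."""
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest row seen, so the next run resumes here.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Demo with an in-memory stand-in for the source database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, "a", 100), (2, "b", 200), (3, "c", 300)])

rows, wm = ingest_increment(conn, last_watermark=100)  # rows 2 and 3 are new
```

The stream and file ingestors follow the same shape: track a position (Kafka offset, file checkpoint) and pull only what is new.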
Slide 14
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
• Landing, Raw, Derived and Delivery
• Schema stored with data (no guesswork)
• Platform Jobs
• Converting text to Parquet
• Saving streaming data as Parquet
• Derivatives
• Compaction
• Standardization
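The logical buckets above can be sketched as a path convention; the bucket name `data-lake` and the source/dataset naming are assumptions for illustration:

```python
# The four logical zones the storage layer is carved into (from the slide).
ZONES = ("landing", "raw", "derived", "delivery")

def storage_path(zone, source, dataset):
    """Build the logical-bucket prefix for a dataset (bucket name is an assumption)."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://data-lake/{zone}/{source}/{dataset}"

print(storage_path("raw", "crm", "accounts"))
# s3://data-lake/raw/crm/accounts
```

Platform jobs then move data between zones: text-to-Parquet conversion writes from `landing` into `raw`, and derivative jobs write from `raw` into `derived`.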
Slide 15
Architecture – Data Delivery Layer
• Data Delivery
• SQL - Spark Thrift Server / Impala
• Tableau, SQL IDE, Applications
• Self Service
• Derivatives
• Represented via SQL on the Delivery Layer
• Stored in Derived Storage Layer
• Metadata driven
• Derived Layer Generators
• Long running Spark Job
• Derivative Refresh
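A hedged sketch of "represented via SQL on the Delivery Layer": a derivative materialized in the derived zone is exposed to consumers as a SQL view. All schema, table, and column names here are illustrative, not from the deck:

```sql
-- Hypothetical derivative exposed through the delivery layer.
CREATE VIEW delivery.daily_clicks AS
SELECT user_id, dt, COUNT(*) AS clicks
FROM derived.click_events
GROUP BY user_id, dt;
```

Because the derivative is metadata-driven, a refresh can regenerate the backing table in the derived layer without consumers changing the SQL they query.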
Slide 17
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
• Concurrency
• Partition strategy
• Cache Tables
• Compression Codec for Parquet
• Snappy vs gzip
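The "Cache Tables" and "Compression Codec" bullets can be sketched as Spark SQL statements issued through the Thrift Server session (e.g. via beeline); the table name is an illustrative assumption:

```sql
-- Snappy decompresses faster; gzip compresses smaller. Pick per workload.
SET spark.sql.parquet.compression.codec=snappy;

-- Pin a hot table in executor memory for concurrent BI queries.
CACHE TABLE delivery.daily_clicks;
```

Caching trades executor memory for latency, so it suits small, frequently queried delivery tables rather than large raw ones.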
Slide 18
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
• IAM Roles
• Sentry
• Support for Spark
• Kerberos
• Spark Thrift Server
• Navigator
• Support for Spark
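The "IAM Roles" bullet can be sketched as a policy granting one department's cluster read access to its prefix of the shared bucket; the bucket name and prefix are assumptions:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:ListBucket", "s3:GetObject"],
    "Resource": [
      "arn:aws:s3:::data-lake",
      "arn:aws:s3:::data-lake/raw/marketing/*"
    ]
  }]
}
```

Attaching such a policy to an instance profile role means each BYOC cluster gets data access scoped at the storage layer, without sharing credentials.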
Slide 19
Key Takeaways - General
• Rapidly Changing Technology
• Feature addition
• Documentation
• Bugs
• Jar hell
• Small files
• Performance Issues
• Compaction
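The small-files/compaction takeaway reduces to simple arithmetic: coalesce each partition's many small files into a few files near the block size. The 128 MiB target here is a common choice, assumed rather than stated in the deck:

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # target output file size (assumption)

def compaction_file_count(total_bytes):
    """How many output files to coalesce a partition's small files into."""
    return max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

# e.g. thousands of small streaming files totalling 1 GiB collapse into 8 files
print(compaction_file_count(1024 * 1024 * 1024))  # 8
```

In Spark this count would typically feed `coalesce()`/`repartition()` before rewriting the partition, cutting the per-file open/list overhead that causes the performance issues above.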
Slide 20
Key Takeaways - General
• Partition Strategy
• Parquet Files
• Balancing parallelism and throughput
• Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
• Deployment practices
• Monitoring and Alerting
• Information Security Policies
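The partition-strategy takeaway can be sketched as Hive-style date partitioning, which keeps partition counts bounded and lets query engines prune by date; the path and table names are illustrative assumptions:

```python
from datetime import date

def partition_path(table_root, d):
    """Hive-style date partition path: enables pruning on year/month/day predicates."""
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(partition_path("s3://data-lake/derived/clicks", date(2017, 3, 22)))
# s3://data-lake/derived/clicks/year=2017/month=03/day=22
```

The balancing act from the slide: too few partitions limits parallelism, while too many produces the small-files problem the compaction jobs exist to fix.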