Demonstration


Sean Murphy


Outline
● Some comments on what we're trying to show
○ high level cluster configuration
○ an example application that might use this config
■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data

Cluster configuration
● Four EC2 nodes
○ m1.medium instances
■ realistically a bit small for real world
● 3 nodes part of Cassandra
○ data can be input dynamically into db via Thrift API
● All nodes run Hadoop TaskTracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node

Cluster config

[Diagram: one node runs the Job Tracker; three nodes each run Cassandra and a Task Tracker]

All nodes m1.small for demo

Let's get the cluster up...
...over to Lamine!

Let's get Cassandra running...
...and show the basic cli...

Application data
● Used Gowalla data in this test application
● Gowalla provide anonymized data for test/research purposes:
○ Graph of UID connections
○ List of checkins - UID, LocID
● Size of data set:
○ 400MB checkins
■ 6.4m checkins
○ ~200k users
● Also generated simpler variant of this data for demonstration
○ more real user information
○ more real location information
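
A minimal sketch of reading the check-in list. It assumes tab-separated lines holding user ID, timestamp, latitude, longitude and location ID; the slides only mention UID and LocID, so the other fields and their order are an assumption. `Checkin` and `parse_checkin` are hypothetical names, and the sample line is illustrative only.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; the field layout is an assumption (see above).
Checkin = namedtuple("Checkin", ["uid", "when", "lat", "lon", "loc_id"])


def parse_checkin(line):
    """Parse one tab-separated check-in line into a Checkin."""
    uid, when, lat, lon, loc_id = line.rstrip("\n").split("\t")
    return Checkin(
        uid=int(uid),
        when=datetime.strptime(when, "%Y-%m-%dT%H:%M:%SZ"),
        lat=float(lat),
        lon=float(lon),
        loc_id=int(loc_id),
    )


# Illustrative sample line, not taken from the slides.
sample = "0\t2010-10-19T23:55:27Z\t30.2359\t-97.7951\t22847\n"
checkin = parse_checkin(sample)
print(checkin.uid, checkin.loc_id, checkin.when.isoformat())
```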

Application data - User Graph

Simple graph structure - unidirectional graph with UIDs as nodes
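
One way to hold such a graph in memory is a map from each UID to the set of UIDs it points to. The sketch below assumes the input is a list of (UID, UID) pairs and keeps each edge in the direction given; `build_graph` is a hypothetical helper and the pairs are made up.

```python
from collections import defaultdict


def build_graph(edges):
    """Map each UID to the set of UIDs it has an edge to."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    return graph


edges = [(1, 2), (1, 3), (2, 3)]  # made-up UID pairs
graph = build_graph(edges)
print(sorted(graph[1]))
```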

Application Data - Checkin info

How this data can be used
● Application interested in:
○ my checkins
○ list my friends
○ checkins at given location
○ my friends' checkins
● Analytics:
○ top ten most active users - most checkins
○ aggregate checkins per week
○ aggregate checkins per week per city
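
The first two analytics questions can be sketched in plain Python before moving to a cluster. This assumes check-ins are available as (uid, date) pairs; the function names and the sample data are made up for illustration, and "week" is taken to mean ISO week.

```python
from collections import Counter
from datetime import date

# Made-up (uid, check-in date) pairs standing in for the real data.
checkins = [
    (1, date(2010, 10, 18)),
    (1, date(2010, 10, 19)),
    (2, date(2010, 10, 19)),
    (1, date(2010, 10, 26)),
    (3, date(2010, 10, 27)),
]


def top_users(checkins, n=10):
    """Return the n users with the most check-ins as (uid, count) pairs."""
    return Counter(uid for uid, _ in checkins).most_common(n)


def checkins_per_week(checkins):
    """Count check-ins per ISO (year, week)."""
    weeks = Counter()
    for _, day in checkins:
        iso = day.isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)


print(top_users(checkins, n=1))
print(checkins_per_week(checkins))
```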

Cassandra data models
● The following data models were used:
○ User
○ Location
○ Checkin
○ FriendRels
■ graph of friend relationships
○ UserCheckins
■ checkins by user
○ LocationCheckins
■ checkins by location
○ FriendCheckins
■ checkins by friends

Cassandra data models
● Use of valueless columns
○ FriendRels, UserCheckins, LocationCheckins, FriendCheckins are just sets of valueless columns
● FriendRels:
○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
■ row_key is a uid
● UserCheckins:
○ row_key: {checkinid1: '', checkinid2: '', ...}
■ row_key is uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my friends' checkins
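
The valueless-column layout can be imitated with nested dictionaries, which may help when reasoning about it. The sketch below uses plain Python dicts as a stand-in for Cassandra; it is not the pycassa API. How FriendCheckins gets populated is not stated in the slides, so the fan-out on write shown here is an assumption, and all names and IDs are hypothetical.

```python
from collections import defaultdict

# Each "column family" maps row_key -> {column_name: ''}.
# The column name carries the information; the value is always empty.
friend_rels = defaultdict(dict)
user_checkins = defaultdict(dict)
location_checkins = defaultdict(dict)
friend_checkins = defaultdict(dict)


def add_friend(uid, friend_uid):
    """Add friend_uid as a valueless column in uid's FriendRels row."""
    friend_rels[uid][friend_uid] = ""


def add_checkin(checkin_id, uid, loc_id):
    """Record one check-in in every index row that should list it."""
    user_checkins[uid][checkin_id] = ""
    location_checkins[loc_id][checkin_id] = ""
    # Assumed fan-out on write: every user who lists uid as a friend
    # gets the check-in ID added to their FriendCheckins row.
    for other, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[other][checkin_id] = ""


add_friend("u1", "u2")
add_checkin("c100", "u2", "loc9")
print(sorted(friend_checkins["u1"]))
```

Reading "my friends' checkins" then becomes a single row lookup by my UID, at the cost of extra writes per check-in.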

Let's import the data into Cassandra...

You deserve a coffee...

Using Hadoop and Pig
...and we can do some analytics...
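
To give a feel for how such an analytics job is structured, here is a plain-Python imitation of the map, shuffle and reduce phases for "aggregate checkins per week per city". It uses neither Hadoop nor Pig; the `CITY_OF` lookup, the record layout and all values are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical lookup from location ID to city, with made-up values.
CITY_OF = {10: "Zurich", 11: "Zurich", 20: "Bern"}


def map_checkin(record):
    """Map phase: emit ((year, week, city), 1) for one check-in."""
    _uid, day, loc_id = record
    year, week, _weekday = day.isocalendar()
    yield (year, week, CITY_OF[loc_id]), 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, then group values by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_checkin(record):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


records = [
    (1, date(2010, 10, 18), 10),
    (2, date(2010, 10, 19), 11),
    (1, date(2010, 10, 19), 20),
    (3, date(2010, 10, 26), 10),
]
result = run_job(records)
print(result)
```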
