This document discusses duplicate and near-duplicate detection at scale. It begins with an outline of the topics to be covered: a search system assessment, a text extraction assessment, a case study on duplicates and near duplicates, and an exploration of near duplicates using minhash. It then describes experiments with detecting duplicates and near duplicates in a web index of over 12.5 million documents. The experiments found that over 2.8 million documents shared identical text digests, and techniques such as text profiles and minhash grouped documents further. However, minhash proved prohibitively slow at this scale. The document explores the reasons for this and concludes that more work is needed to detect near duplicates efficiently at large scale.
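For readers new to the technique, here is a minimal, self-contained Python sketch of minhash-based near-duplicate scoring; the shingle size and signature length are illustrative choices, not the parameters used in the experiments above.

    import hashlib

    NUM_HASHES = 64  # signature length: more hashes give better similarity estimates

    def shingles(text, k=5):
        # Overlapping k-word shingles represent the document as a set.
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def minhash_signature(text):
        # For each seeded hash function, keep the minimum hash over all shingles.
        sig = []
        for seed in range(NUM_HASHES):
            sig.append(min(
                int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles(text)))
        return sig

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of agreeing signature positions estimates Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

    a = minhash_signature("the quick brown fox jumps over the lazy dog today")
    b = minhash_signature("the quick brown fox jumped over the lazy dog today")
    print(estimated_jaccard(a, b))

Note that computing signatures this way requires NUM_HASHES hash evaluations per shingle per document, which hints at why a naive minhash implementation becomes slow across millions of documents.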
Change Data Capture - Scale by the Bay 2019 (Petr Zapletal)
The document discusses change data capture (CDC) in distributed systems. CDC allows capturing changes to data and notifying actors so they can react accordingly. Some considerations for CDC include reliability, scalability, performance, consistency, and fault tolerance. Common strategies involve using timestamps, triggers, diffs, or log scraping. CDC has various use cases such as caching, search indexing, offline processing, and simplifying monolithic applications. Popular frameworks for implementing CDC include pg2k4j, Maxwell, AWS Database Migration Service, SpinalTap, DataBus, and Debezium.
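As a rough illustration of the simplest of those strategies, the sketch below polls a table by timestamp using Python's built-in sqlite3; the table and column names are hypothetical.

    import sqlite3

    def poll_changes(conn, last_seen):
        # Timestamp-based CDC: fetch rows modified since the last poll.
        # Assumes a hypothetical 'orders' table with an 'updated_at' column.
        rows = conn.execute(
            "SELECT id, status, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at", (last_seen,)).fetchall()
        for order_id, status, updated_at in rows:
            print("change captured:", order_id, status)  # notify downstream actors
            last_seen = updated_at
        return last_seen

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 'shipped', 100.0), (2, 'paid', 200.0)")

    cursor = poll_changes(conn, 0)          # captures both existing rows
    conn.execute("INSERT INTO orders VALUES (3, 'new', 300.0)")
    cursor = poll_changes(conn, cursor)     # captures only the new row

Timestamp polling like this misses deletes and can race with in-flight transactions, which is one reason log-based capture (the approach behind tools such as Debezium) is generally preferred at scale.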
Case Study: Stream Processing on AWS using Kappa Architecture (Joey Bolduc-Gilbert)
In the summer of 2016, XpertSea decided to migrate its operations to AWS and build a data processing system able to scale to the extent of our ambitions. Come see how we built our platform, inspired by Kappa Architecture, able to support connected devices located all around the globe and state-of-the-art machine learning algorithms.
Streaming Data from Cassandra into Kafka (Abrar Sheikh)
Yelp has built a robust stream processing ecosystem called Data Pipeline. As part of this system we created a Cassandra Source Connector, which streams data updates made to Cassandra into Kafka in real time. We use Cassandra CDC and leverage the stateful stream processing of Apache Flink to produce a Kafka stream containing the full content of each modified row, as well as its previous value.
https://www.datastax.com/accelerate/agenda?session=Streaming-Cassandra-into-Kafka
Principles in Data Stream Processing | Matthias J Sax, Confluent (HostedbyConfluent)
Data stream processing is, for many of us, a new paradigm for processing data and building applications. In this talk, we will take you on a journey through the theoretical foundations of stream processing and discuss the underlying principles and unique problems that need to be addressed. What actually is a data stream anyway? And how do I use it? How do streams relate to application state, and when do I use one or the other?
ksqlDB and Kafka Streams are both, at their core, designed to help build stream processing applications, and we will explain how stream processing principles are reflected in the design of each system and what trade-offs were chosen (and - more importantly! - why). Finally, we take a look into the future at how the stream processing space, and in particular ksqlDB and Kafka Streams, may evolve over the next few years, as we outline extensions and improvements to the underlying conceptual model. So bring your thinking hats and notepads and prepare to learn WHY these systems are the way they are!
Scaling up Uber's real-time data analytics (Xiang Fu)
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, and learnings, and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies has helped Uber scale and enabled SQL to power real-time decision making for city ops, data scientists, data analysts and engineers.
Structured Streaming in Apache Spark 2.0 introduces a continuous data flow programming model. It processes live data streams using a streaming query that is expressed similarly to batch queries on static data. The streaming query continuously appends incoming data to an unbounded table and performs incremental aggregations. This allows for exactly-once processing semantics without users needing to handle micro-batching or fault tolerance. Structured Streaming queries can be written using the Spark SQL DataFrame/Dataset API and output to sinks like files, databases, and dashboards. It is still experimental but provides an alternative to the micro-batch model of earlier Spark Streaming.
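A minimal PySpark sketch of that model, the standard streaming word count over a socket source, looks like this (assuming a local Spark installation):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

    # Treat the live stream as an unbounded table of lines.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # Incremental aggregation over the unbounded table.
    counts = (lines.select(explode(split(lines.value, " ")).alias("word"))
              .groupBy("word").count())

    # Continuously emit updated results to a sink (the console here).
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()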
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
My slides from the Big Data Applications meetup on the 27th of July, talking about FlinkML. Also some things about open-source ML development and an illustration of interactive Flink machine learning with Apache Zeppelin.
At the beginning of 2021, Shopify Data Platform decided to adopt Apache Flink to enable modern stateful stream-processing. Shopify had a lot of experience with other streaming technologies, but Flink was a great fit due to its state management primitives.
After about six months, Shopify now has a flourishing ecosystem of tools, tens of prototypes from many teams across the company, and a few large use cases in production.
Yaroslav will share a story about not just building a single data pipeline but building a sustainable ecosystem. You can learn about how they planned their platform roadmap, the tools and libraries Shopify built, the decision to fork Flink, and how Shopify partnered with other teams and drove the adoption of streaming at the company.
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify (HostedbyConfluent)
Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing.
In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Kafka Streams and Apache Flink.
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized (HostedbyConfluent)
Enforcing formats, changing schemas, and introducing privacy filters have always been a challenge with the classical Kafka-API. In this talk we'll cover how to extend existing applications with WebAssembly, allowing developers to change the shape of data at runtime, per application, without creating additional topics. By leveraging WebAssembly, we can extend the capabilities of the Kafka-API beyond what was initially imagined. Come and learn about the future of the Kafka-API.
What Your Tech Lead Thinks You Know (But Didn't Teach You) (Chris Riccomini)
The document summarizes advice from two tech leads, Chris Riccomini and Dmitriy Ryaboy, on things they think engineers know but were not explicitly taught, including how to properly ask questions, handle on-call responsibilities, understand software dependencies, and discuss promotions with managers. Specific tips are provided in each area, such as doing research before asking questions, setting time limits, tracking work while on-call, pinning dependency versions to avoid conflicts, and assessing one's own skills and accomplishments when considering a promotion.
Storing State Forever: Why It Can Be Good For Your Analytics (Yaroslav Tkachenko)
State is an essential part of modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, and it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with the idea of never clearing state and supporting joins this way. We built a successful proof of concept, ingested all historical transactional Shopify data, and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
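The talk describes Flink keyed state, but the core idea of a join whose state is never cleared can be conveyed with a plain-Python stand-in (conceptual only, not Flink code):

    # Conceptual sketch: a streaming two-sided join whose keyed state is never
    # cleared, so an arbitrarily late record on either side still finds its match.
    left_state, right_state = {}, {}   # key -> latest record per side

    def on_left(key, record):
        left_state[key] = record                  # keep forever
        if key in right_state:
            emit(key, record, right_state[key])

    def on_right(key, record):
        right_state[key] = record                 # keep forever
        if key in left_state:
            emit(key, left_state[key], record)

    def emit(key, left, right):
        print("joined:", key, left, right)

    on_left("order-1", {"amount": 100})
    on_right("order-1", {"status": "shipped"})   # joins even if it arrives much later

The trade-off is exactly the one the talk highlights: the state only grows, so it must be treated like any other scalable data store.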
University program - writing an Apache Apex application (Akshay Gore)
This presentation was delivered to engineering students from Computer, IT, and Electronics backgrounds. It was a hands-on lab session on Apache Apex, conducted after an introductory lecture on Apex.
Advanced Apex users and experts may not find this relevant.
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017 (Till Rohrmann)
In our fast-moving world, it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights not only provide a competitive edge over one's rivals but also enable a company to create completely new services and products. Among others, predictive user interfaces and online recommendations can be implemented when one is able to process large amounts of data in real time.
Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with millisecond latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications.
In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. With a data warehouse at this scale, it is a constant challenge to keep improving performance. This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes to guarantee jobs always use consistent table snapshots.
In this session, you'll learn:
• Some background about big data at Netflix
• Why Iceberg is needed and the drawbacks of the current tables used by Spark and Hive
• How Iceberg maintains table metadata to make queries fast and reliable
• The benefits of Iceberg's design and how it is changing the way Netflix manages its data warehouse
• How you can get started using Iceberg
Speaker
Ryan Blue, Software Engineer, Netflix
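As a small illustration of what the snapshot model described in this session enables, a Spark job can read a consistent Iceberg snapshot and even time-travel to an older one; the table name and snapshot id below are hypothetical, and the Iceberg Spark runtime is assumed to be configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

    # Reads resolve against an immutable table snapshot, so concurrent
    # writers never affect a running job.
    df = spark.read.format("iceberg").load("db.events")

    # Time travel: pin the read to an earlier snapshot (hypothetical id).
    old = (spark.read.format("iceberg")
           .option("snapshot-id", 10963874102873)
           .load("db.events"))
    print(df.count(), old.count())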
Fabian Hueske - Stream Analytics with SQL on Apache Flink (Ververica)
Fabian Hueske presented on stream analytics using SQL on Apache Flink. Flink provides a scalable platform for stream processing that is fast, accurate, and reliable. Its relational APIs allow querying both batch and streaming data using standard SQL or a LINQ-style Table API. Queries on streaming data produce continuously updating results. Windows can be used to compute aggregates over tumbling time intervals. The dynamic tables representing streaming data can be converted to output streams encoding updates as insertions and deletions. While not all queries can be supported, techniques like limiting state size allow bounding computational resources. Use cases like continuous ETL, dashboards, and event-driven architectures were discussed.
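For example, the tumbling-window aggregates mentioned above can be expressed directly in Flink SQL; this PyFlink sketch uses the built-in datagen connector as a stand-in for a real stream (the table and column names are invented):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Hypothetical source: the datagen connector stands in for a real stream.
    t_env.execute_sql("""
        CREATE TABLE clicks (
            user_id INT,
            event_time AS PROCTIME()
        ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
    """)

    # Tumbling one-minute windows: each row belongs to exactly one window.
    result = t_env.execute_sql("""
        SELECT user_id,
               TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end,
               COUNT(*) AS clicks_per_minute
        FROM clicks
        GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
    """)
    result.print()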
This case study describes an approach to building quasi real-time OLAP cubes in Microsoft SQL Server Analysis Services to enable daily comparisons of production forecasts and outcomes. The cubes are partitioned by time to allow independent and frequent updates. Initial attempts failed due to deadlocks from simultaneous partition updates. The working solution takes advantage of a 6 day work week by switching partition dates on Saturdays only and reprocessing partitions then. This allows real-time and historical partition updates without gaps or overlaps in the data.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect... (HostedbyConfluent)
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights on their business but also to power smart apps that need to react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern, yet simple data architecture to analyze both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to process data streams coming from Kafka.
Measure your app internals with InfluxDB and Symfony2 (Corley S.r.l.)
This document discusses using InfluxDB, a time-series database, to measure application internals in Symfony. It describes sending data from a Symfony app to InfluxDB using its PHP client library, and visualizing the data with Grafana dashboards. Key steps include setting up the InfluxDB client via dependency injection, dispatching events from controllers, listening for them to send data to InfluxDB, and building Grafana dashboards to view measurements over time.
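The talk's examples are in PHP, but the same write path in Python, using the influxdb client library, looks roughly like this (host, database, and measurement names are placeholders):

    from influxdb import InfluxDBClient  # pip install influxdb
    import time

    client = InfluxDBClient(host="localhost", port=8086, database="app_metrics")

    def record_request(controller, duration_ms):
        # Write one timing measurement; Grafana can then chart it over time.
        client.write_points([{
            "measurement": "http_request",
            "tags": {"controller": controller},
            "fields": {"duration_ms": float(duration_ms)},
        }])

    start = time.time()
    # ... handle the request ...
    record_request("HomeController", (time.time() - start) * 1000)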
The document introduces the Kafka Streams Processor API. It provides more fine-grained control over event processing compared to the Kafka Streams DSL. The Processor API allows access to state stores, record metadata, and scheduled processing via punctuators. It can be used to augment applications built with the Kafka Streams DSL by providing capabilities like random access to state stores and time-based processing.
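The Processor API itself is Java, but its two key ingredients, a random-access state store and a scheduled punctuator, can be mimicked in plain Python to convey the idea (a conceptual stand-in, not the actual API):

    import time

    state_store = {}          # key -> running count (random access, like a KV store)
    PUNCTUATE_EVERY = 10.0    # seconds between scheduled flushes
    last_punctuate = time.monotonic()

    def process(key, value):
        # Called once per record; updates state instead of emitting directly.
        global last_punctuate
        state_store[key] = state_store.get(key, 0) + 1
        now = time.monotonic()
        if now - last_punctuate >= PUNCTUATE_EVERY:
            punctuate()
            last_punctuate = now

    def punctuate():
        # Scheduled, time-based processing: forward accumulated counts downstream.
        for key, count in state_store.items():
            print("forward:", key, count)

    for i in range(100):
        process(f"user-{i % 3}", "click")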
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cas... (DataStax Academy)
The document discusses how MetricsHub optimized its public cloud architecture using Cassandra to store massive amounts of telemetry data in a highly scalable and cost-effective way. It describes how MetricsHub grew to monitor over 8000 VMs and collect 200 million data points per hour. It explains how Cassandra provided a more scalable and reliable solution than Redis or SQL for MetricsHub's use case of aggregating and analyzing streaming telemetry data in real-time. The architecture diagram shows how MetricsHub uses a combination of Azure PaaS and IaaS resources, including a 32 node Cassandra cluster, to monitor customer resources.
Introduction to Data Engineer and Data Pipeline at Credit OK (Kriangkrai Chaonithi)
The document discusses the role of data engineers and data pipelines. It begins with an introduction to big data and why data volumes are increasing. It then covers what data engineers do, including building data architectures, working with cloud infrastructure, and programming for data ingestion, transformation, and loading. The document also explains data pipelines, describing extract, transform, load (ETL) processes and batch versus streaming data. It provides an example of Credit OK's data pipeline architecture on Google Cloud Platform that extracts raw data from various sources, cleanses and loads it into BigQuery, then distributes processed data to various applications. It emphasizes the importance of data engineers in processing and managing large, complex data sets.
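At its core, the batch ETL flow described above reduces to three small functions; here is a toy Python version with a file standing in for a warehouse such as BigQuery (paths and field names are invented):

    import csv, json

    def extract(path):
        # Extract: read raw rows from a source file.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: cleanse and normalize records before loading.
        return [{"customer_id": r["id"].strip(),
                 "amount": float(r["amount"] or 0)}
                for r in rows if r.get("id")]

    def load(rows, path):
        # Load: write cleansed records to the destination.
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    # load(transform(extract("raw_orders.csv")), "clean_orders.jsonl")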
Evaluating Text Extraction at Scale: A case study from Apache Tika (Tim Allison)
The document discusses Apache Tika, a framework for extracting text and metadata from various file formats. It describes some ways Tika parsing can fail, such as missing or garbled text. The author introduces the tika-eval module for comparing Tika extractions and identifying issues. Public corpora are being used to test Tika at scale and find regressions. The goal is improving Tika's robustness through large-scale evaluation.
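A drastically simplified Python analog of the kind of comparison tika-eval performs (not the module's actual API) might profile two extractions of the same file and flag suspicious differences:

    import re
    from collections import Counter

    def profile(text):
        # Crude extraction profile: token count plus token frequencies.
        tokens = re.findall(r"\w+", text.lower())
        return len(tokens), Counter(tokens)

    def compare(old_extract, new_extract):
        # Flag suspicious differences between two extractions of one file.
        n_old, c_old = profile(old_extract)
        n_new, c_new = profile(new_extract)
        common = sum((c_old & c_new).values())
        overlap = common / max(n_old, n_new, 1)
        return {"tokens_old": n_old, "tokens_new": n_new, "overlap": overlap}

    # A large token-count drop or low overlap suggests missing or garbled text.
    print(compare("the quick brown fox", "the quick brown fox jumps"))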
Australian Open government and research data pilot survey 2017 (Jonathan Yu)
Australian Open data pilot survey conducted October 2017, leveraging indexed datasets across government and research sources via the CSIRO Knowledge Network (http://kn.csiro.au). Please note, these are preliminary results using our prototype quantitative methodology to assess the volume, variety and velocity of open data initiatives across Australia. Lots of sources are missing (we'd love to hear feedback about which ones would be good to include in the future!). Future work includes addressing gaps in the sources list, de-duplication of cross-indexed datasets, quantifying web services data, and an online version of the analysis.
How to valuate and determine standard essential patents (MIPLM)
This document discusses methods for valuing and determining standard essential patents (SEPs). It notes that while many patents are declared as SEPs, studies show that only 20-28% are actually essential. Determining essentiality is an expensive and subjective process. The document proposes using a data-driven approach combining semantic analysis of patent claims and standards documents with characteristics like inventor participation to predict SEP essentiality in a more objective manner. This allows identifying potential SEP portfolios and their probabilistic essentiality likelihoods.
Visualising the Australian open data and research data landscape (Jonathan Yu)
"Visualising the Australian open data and research data landscape" at C3DIS May 2018 in Melbourne. In this talk, we presented work around the visualisation of an survey of open government and research data in Australia. This features a first attempt at formalising a quantitative based approach to measuring the data ecosystem in Australia.
Web services provide programmatic interfaces to online services and tools, allowing for machine-to-machine communication across the web. They have become widely used in the life sciences, with major providers like EMBL-EBI, DDBJ, and NCBI offering hundreds of services. However, ensuring the sustainability, usability, and reliability of web services remains an ongoing challenge. Catalogs aim to help users discover and understand available services, but require community involvement to maintain accurate and up-to-date information.
This document summarizes Kx Systems, a company that provides a high-performance time-series database called kdb+. Kdb+ can process and analyze large volumes of real-time and historical time-series data extremely fast with low latency. It is widely used in financial services and is now being applied to other industries like manufacturing, utilities, and life sciences. Kx Systems offers software, consulting services, and can help clients integrate kdb+ with their existing technologies and scale their deployments.
An overview of Text and Data Mining (ContentMining) including live demonstrations. The fundamentals (discover, scrape, normalize, facet/index, analyze, publish) are exemplified using the recent Zika outbreak. Mining covers textual and non-textual content, and examples from chemistry and phylogenetic trees are given.
Published on Feb 29, 2016 by PMR
MPLS/SDN 2013 Intercloud Standardization and Testbeds - Sill (Alan Sill)
This talk gives an overview of several multi-SDO and cross-SDO activities to promote and spur innovation in cloud computing. The focus is on API development and standardization, including testbeds, test use cases, and collaborative activities between organizations to create and carry out development and testing in this area. It centers on work being pursued through the Cloud and Autonomic Computing Center at Texas Tech University, which is part of the US National Science Foundation's Industry/University Cooperative Research Center program, and on work being done by standards organizations such as the Open Grid Forum, Distributed Management Task Force, and Telecommunications Management Forum, in which the CAC@TTU is involved. A summary is also given of work to produce a new round of more detailed use cases suitable for testing by the US National Institute of Standards and Technology's Standards Acceleration to Jumpstart Adoption of Cloud Computing (SAJACC) working group, with brief mention of other related work going on in this area in other parts of the world. Background and other standards work is also mentioned.
500 languages to English Machine Translation Model (Thamme Gowda)
While there are more than 7000 languages in the world, most translation research efforts have targeted a few high resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.
Revolutionizing Laboratory Instrument Data for the Pharmaceutical Industry:... (OSTHUS)
The Allotrope Foundation is a consortium of major pharmaceutical companies and a partner network whose goal is to address challenges in the pharmaceutical industry by providing a set of public, non-proprietary standards for using and integrating analytical laboratory data. Current challenges in data management within the pharmaceutical industry often center around inconsistent or incomplete data and metadata and proprietary data formats. Because of a lack of standardization, several operations (e.g. integration of instruments/applications, transfer of methods or results, archiving for regulatory purposes) require unnecessary effort. Further, higher-level aggregations of data, e.g. regulatory filings, that are derived from multiple sources of laboratory data are costly to create. These unnecessary costs impact operations within a company’s laboratories, between partnering companies, and between a company and contract research organizations (CROs). Finally, the accelerating transition of laboratories from hybrid (paper + electronic) to purely electronic data streams, coupled with ever-increasing regulatory scrutiny of electronic data management practices, further requires a comprehensive solution. This talk will discuss how the Allotrope Foundation is providing a new framework for data standards through collaboration between numerous stakeholders.
An Overview of the iMicrobe Project and available tools in the iPlant Cyberinfrastructure. This talk was given at a workshop at ASLO in Granada, Spain, focused on applications in Oceanography and Limnology.
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo... (WSO2)
In this webinar, Srinath Perera, director of research at WSO2, will discuss:
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
Getting Access to ALCF Resources and Services (davidemartin)
This document provides an overview of the resources and services available at the Argonne Leadership Computing Facility (ALCF). It summarizes the ALCF's computing systems, including the Intrepid and Eureka supercomputers. It also describes the various allocation programs that provide computing time, such as INCITE and ALCC. Finally, it outlines the services and support offered by the ALCF, including programming models, libraries, performance engineering, and user support.
There is a pressing need to tackle the usability challenges in querying massive, ultra-heterogeneous entity graphs which use thousands of node and edge types in recording millions to billions of entities (persons, products, organizations) and their relationships. Widely known instances of such graphs include Freebase, DBpedia and YAGO. Applications in a variety of domains are tapping into such graphs for richer semantics and better intelligence. Both data workers and application developers are often overwhelmed by the daunting task of understanding and querying these data, due to their sheer size and complexity. To retrieve data from graph databases, the norm is to use structured query languages such as SQL, SPARQL, and the like. However, writing structured queries requires extensive experience in query language and data model. The database community has long recognized the importance of graphical query interfaces to the usability of data management systems. Yet relatively little has been done. Existing visual query builders allow users to build queries by drawing query graphs, but do not offer suggestions to users regarding which nodes and edges to include. At every step of query formulation, a user could be inundated with hundreds of options or even more.
Towards improving the usability of graph query systems, Orion is a visual query interface that iteratively assists users in query graph construction by making suggestions using machine learning methods. In its active mode, Orion suggests top-k edges to be added to a query graph, without being triggered by any user action. In its passive mode, the user adds a new edge manually, and Orion suggests a ranked list of labels for the edge. Orion’s edge ranking algorithm, Random Correlation Paths (RCP), makes use of a query log to rank candidate edges by how likely they will match users’ query intent. Extensive user studies using Freebase demonstrated that Orion users have a 70% success rate in constructing complex query graphs, a significant improvement over the 58% success rate by users of a baseline system that resembles existing visual query builders. Furthermore, using active mode only, the RCP algorithm was compared with several methods adapting other machine learning algorithms such as random forests and naive Bayes classifier, as well as recommendation systems based on singular value decomposition. On average, RCP required only 40 suggestions to correctly reach a target query graph while other methods required 2-4 times as many suggestions.
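The RCP algorithm is more sophisticated, but the basic shape of log-driven suggestion, ranking candidate edge labels by co-occurrence with the partial query, can be sketched in a few lines of Python (the query log here is invented):

    from collections import Counter

    # Hypothetical query log: each past query is the set of edge labels it used.
    query_log = [
        {"directed_by", "starring"},
        {"directed_by", "release_year"},
        {"starring", "spouse_of"},
    ]

    def suggest_edges(partial_query, k=2):
        # Rank candidate edge labels by co-occurrence with the partial query.
        scores = Counter()
        for past in query_log:
            if partial_query & past:                 # the log entry is relevant
                for label in past - partial_query:   # candidate additions
                    scores[label] += 1
        return [label for label, _ in scores.most_common(k)]

    print(suggest_edges({"directed_by"}))  # e.g. ['starring', 'release_year']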
The document discusses several US grid projects including campus and regional grids like Purdue and UCLA that provide tens of thousands of CPUs and petabytes of storage. It describes national grids like TeraGrid and Open Science Grid that provide over a petaflop of computing power through resource sharing agreements. It outlines specific communities and projects using these grids for sciences like high energy physics, astronomy, biosciences, and earthquake modeling through the Southern California Earthquake Center. Software providers and toolkits that enable these grids are also mentioned like Globus, Virtual Data Toolkit, and services like Introduce.
The document discusses the Semantic Web and declarative knowledge representation in information technology. It provides an introduction to key concepts including semantics, ontologies, rules, and logic-based knowledge representation. It also outlines technologies that make up the Semantic Web such as RDF, RDF Schema, OWL, and SPARQL. The goal of these technologies is to represent information on the web in a structured, machine-readable format in order to enable automated processing of data.
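A tiny runnable example of those building blocks, using Python's rdflib with invented triples:

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()

    # RDF: statements are subject-predicate-object triples.
    g.add((EX.alice, RDF.type, EX.Person))
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.bob, RDF.type, EX.Person))

    # SPARQL: a declarative, machine-processable query over those triples.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?who WHERE { ex:alice ex:knows ?who . ?who a ex:Person . }
    """)
    for row in results:
        print(row.who)  # http://example.org/bob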
The title of this talk is a crass attempt to be catchy and topical, by referring to the recent victory of Watson in Jeopardy.
My point (perhaps confusingly) is not that new computer capabilities are a bad thing. On the contrary, these capabilities represent a tremendous opportunity for science. The challenge that I speak to is how we leverage these capabilities without computers and computation overwhelming the research community in terms of both human and financial resources. The solution, I suggest, is to get computation out of the lab—to outsource it to third party providers.
Abstract follows:
We have made much progress over the past decade toward effective distributed cyberinfrastructure. In big-science fields such as high energy physics, astronomy, and climate, thousands benefit daily from tools that enable the distributed management and analysis of vast quantities of data. But we now face a far greater challenge. Exploding data volumes and new research methodologies mean that many more--ultimately most?--researchers will soon require similar capabilities. How can we possibly supply information technology (IT) at this scale, given constrained budgets? Must every lab become filled with computers, and every researcher an IT specialist?
I propose that the answer is to take a leaf from industry, which is slashing both the costs and complexity of consumer and business IT by moving it out of homes and offices to so-called cloud providers. I suggest that by similarly moving research IT out of the lab, we can realize comparable economies of scale and reductions in complexity, empowering investigators with new capabilities and freeing them to focus on their research.
I describe work we are doing to realize this approach, focusing initially on research data lifecycle management. I present promising results obtained to date, and suggest a path towards large-scale delivery of these capabilities. I also suggest that these developments are part of a larger "revolution in scientific affairs," as profound in its implications as the much-discussed "revolution in military affairs" resulting from more capable, low-cost IT. I conclude with some thoughts on how researchers, educators, and institutions may want to prepare for this revolution.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
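The views ViewShift generates are far richer than this, but the essential move, routing a table read through a compliance-enforcing view, can be pictured with a hypothetical masking view, issued here through Python's sqlite3 so the sketch is runnable:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE members (id INTEGER, email TEXT, consented INTEGER);
        INSERT INTO members VALUES (1, 'a@example.com', 1), (2, 'b@example.com', 0);

        -- Hypothetical compliance-enforcing view: mask emails for users who
        -- have not consented; the engine would resolve 'members' to this view.
        CREATE VIEW members_compliant AS
        SELECT id,
               CASE WHEN consented = 1 THEN email ELSE '***redacted***' END AS email
        FROM members;
    """)
    for row in conn.execute("SELECT * FROM members_compliant"):
        print(row)  # (1, 'a@example.com') then (2, '***redacted***')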
Build applications with generative AI on Google Cloud (Márton Kodok)
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill models to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps that follow generative AI industry trends.
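A minimal sketch of calling a Gemini model through the Vertex AI Python SDK; the project id and model name are placeholders, and the SDK surface may differ between versions:

    import vertexai
    from vertexai.generative_models import GenerativeModel

    # Placeholders: substitute a real GCP project and region.
    vertexai.init(project="my-project", location="us-central1")

    model = GenerativeModel("gemini-1.5-flash")

    # Plain text prompt.
    response = model.generate_content("Summarize what a vector database is.")
    print(response.text)

    # Chat-style interaction, with history kept by the SDK.
    chat = model.start_chat()
    print(chat.send_message("And when would I use one?").text)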
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Slide 17 (jpl.nasa.gov, 10/22/20): How big of a problem are duplicates? This was the first “lesson learned” in Oleksiy Kovyrin’s recent talk “Sprinting to a crawl: Building an effective web crawler” at ElastiCON Global 2020.