Auditing data and answering the lifelong question: Is it the end of the day yet?
Simona Meriam, Aidoc
A true story based off of my endeavours @ Nielsen
Agenda
● Nielsen’s architecture
● Little fires everywhere
● Designing our metadata & data
● Storing data and querying for optimum
● Is it the end of the day yet?
● Alerts and add-ons
whoami
● Simona Meriam
● Big Data Engineer @ Aidoc
● Data lover
● Concert goer
● Japan enthusiast
Nielsen’s Architecture (AT THE TIME)
Little Fires Everywhere
● Data arrival pain points and recovering from failures
● When to process data?
● Is it the end of the day yet?
● Some more pain points
Data Arrival Pain Points
Recovering from failures
Is it the end of the day yet?
When do we process data?
● Let’s talk about this question, and possible answers
1. Data granularity
2. Time granularity
● So when do we process our data? When is it the end of the day?
● The implications of processing and reprocessing
Is it the end of the day yet?
Legacy answers to a legacy problem
● Fixed time
● “aws s3 ls”
More Pain Points
And some more
Little Fires Everywhere
Auditing window? Let’s design our metadata
What should we keep in mind?
● Several Kafka topics
● Data serving infrastructure
○ Our own “Nielsen Kafka Producer”
○ 2 JVMs on a single machine
○ Each JVM works against several topics
○ SLAs are very important!
● The use of Avro
And then finally, what is a window?
Auditing Window
Key
● Topic
● Server
● Process
● Audit time
Value
● Counter
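The key/value layout above can be sketched as a counter keyed by topic, server, process and audit time. This is only a minimal illustration, not the Nielsen Kafka Producer code: the 5-minute window size, the `WindowKey` and `AuditWindows` names, and the rounding helper are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import NamedTuple

WINDOW_SECONDS = 300  # assumption: 5-minute audit windows


class WindowKey(NamedTuple):
    """Hypothetical audit window key: topic, server, process, audit time."""
    topic: str
    server: str
    process: int
    audit_time: datetime  # start of the window, in UTC


def window_start(event_time: datetime, size: int = WINDOW_SECONDS) -> datetime:
    """Round an event time down to the start of its audit window."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size, tz=timezone.utc)


class AuditWindows:
    """Counts produced events per audit window key (the value is a counter)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, topic: str, server: str, process: int,
               event_time: datetime) -> None:
        key = WindowKey(topic, server, process, window_start(event_time))
        self.counts[key] += 1


audit = AuditWindows()
t = datetime(2021, 8, 2, 0, 3, 17, tzinfo=timezone.utc)
audit.record("TOPIC1", "server1.slj1.nielsen", 4075, t)
audit.record("TOPIC1", "server1.slj1.nielsen", 4075, t)

key = WindowKey("TOPIC1", "server1.slj1.nielsen", 4075,
                datetime(2021, 8, 2, 0, 0, tzinfo=timezone.utc))
print(audit.counts[key])  # both events fall into the 00:00 window
```

Keeping the process in the key matters in a setup with 2 JVMs per machine, since each JVM would then report its own counter for the same topic and server.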
Auditing Header
Auditing Header Injection
In context
Shipping Audit Window to Collection Point
Kafka topic VS Amazon API Gateway
In context
Consuming Audit Data
Audit Window
● Kafka Consumer
● Straight through API
Audit Header
● S3 Consumer
One single Elastic MapReduce (EMR)
In Context
Storing Data and Querying to Optimum
Designing Our Output Table
Questions we want answered
1. At what levels of granularity?
2. Arrival rates?
3. Arrival latency?
Designing Our Output Table
● Audit Timestamp
● Topic
● Server
● Process
● Location - Origin of data
● Event count
What about add-ons?
● Region
● Insert time
insert_time                 window_timestamp     topic_name  server_name           region  process_id  location       event_count
2021-08-02T00:05:00.079000  2021-08-02T00:00:00  TOPIC1      server1.ams1.nielsen  ams1    19862       kafka_windows  0
2021-08-02T00:05:00.082000  2021-08-02T00:00:00  TOPIC1      server1.slj1.nielsen  slj1    4075        kafka_windows  98396
2021-08-02T00:05:00.082000  2021-08-02T00:00:00  TOPIC2      server1.slj1.nielsen  slj1    4075        kafka_windows  31805
2021-08-02T00:05:00.082000  2021-08-02T00:00:00  TOPIC1      server1.slj1.nielsen  slj1    4075        kafka_windows  98396
2021-08-02T00:32:12.082000  2021-08-02T00:00:00  TOPIC2      server1.slj1.nielsen  slj1    4075        rdr_headers    12453
2021-08-02T00:05:00.132000  2021-08-02T00:00:00  TOPIC3      server2.ams1.nielsen  ams1    31573       kafka_windows  84924
2021-08-02T00:05:00.131000  2021-08-02T00:00:00  TOPIC1      server2.ams1.nielsen  ams1    31573       kafka_windows  0
2021-08-02T00:10:00.009700  2021-08-02T00:05:00  TOPIC2      server2.ams1.nielsen  ams1    31571       kafka_windows  3177
Q & A With Apache Superset
-- Per window and topic: producer-side counts vs. counts read from rdr_headers.
-- rdr_headers rows count only if inserted within 3 hours of the window timestamp.
SELECT window_timestamp AT TIME ZONE 'UTC' AS window_timestamp,
       topic_name,
       SUM(CASE WHEN location = 'kafka_windows' THEN event_count ELSE 0 END) AS producer_count,
       SUM(CASE WHEN location = 'rdr_headers'
                 AND (insert_time AT TIME ZONE 'UTC') - (window_timestamp AT TIME ZONE 'UTC') <= INTERVAL '3 HOURS'
                THEN event_count ELSE 0 END) AS rdr_count
FROM audit.audit_data
WHERE window_timestamp <= CURRENT_TIMESTAMP - INTERVAL '2 HOURS'
  AND window_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 MONTH'
  AND event_count > 0
GROUP BY window_timestamp, topic_name
-- Keep only windows for which the producers reported at least one event.
HAVING SUM(CASE WHEN location = 'kafka_windows' THEN event_count ELSE 0 END) > 0;
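To make the aggregation in the query easier to follow, here is a rough Python mirror of it applied to a few rows from the sample table. It is a sketch only: the `reconcile` function and the tuple layout are hypothetical, and the time-range filter on `CURRENT_TIMESTAMP` is left out so the result does not depend on when the code runs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def reconcile(rows, max_latency=timedelta(hours=3)):
    """Sum producer and rdr counts per (window_timestamp, topic_name).

    rows: iterable of (insert_time, window_timestamp, topic, location, count).
    """
    totals = defaultdict(lambda: [0, 0])  # [producer_count, rdr_count]
    for insert_time, window_ts, topic, location, count in rows:
        if count <= 0:  # WHERE event_count > 0
            continue
        bucket = totals[(window_ts, topic)]
        if location == "kafka_windows":
            bucket[0] += count
        elif location == "rdr_headers" and insert_time - window_ts <= max_latency:
            bucket[1] += count
    # HAVING producer_count > 0
    return {k: tuple(v) for k, v in totals.items() if v[0] > 0}


window = datetime.fromisoformat("2021-08-02T00:00:00")
rows = [
    (datetime.fromisoformat("2021-08-02T00:05:00.079000"), window,
     "TOPIC1", "kafka_windows", 0),
    (datetime.fromisoformat("2021-08-02T00:05:00.082000"), window,
     "TOPIC2", "kafka_windows", 31805),
    (datetime.fromisoformat("2021-08-02T00:32:12.082000"), window,
     "TOPIC2", "rdr_headers", 12453),
]
result = reconcile(rows)
print(result[(window, "TOPIC2")])  # (31805, 12453)
```

The TOPIC1 row drops out because its event count is zero, which matches the `event_count > 0` filter in the SQL.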
Shout out to my dad….
Grisha Meriam - Senior Software Developer - Aviv Advanced Solutions | LinkedIn
LinkedIn handle: grisha-meriam-4876b784
Optimizing PostgreSQL for Audit Queries
1. Weekly partitioning
2. Indexes - unique and complementary
3. Working in parallel
● max_worker_processes
● max_parallel_workers
● max_parallel_workers_per_gather
4. Tricking the optimizer, or how about some SQL hacks and using UNION instead of IN
Managing Partitions with Apache Airflow
Creating partitions automatically with PostgreSQL 11
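As an illustration of what a scheduled partition-creation task might execute, the sketch below builds the DDL for one weekly partition using PostgreSQL's declarative `PARTITION OF ... FOR VALUES FROM ... TO` syntax. It assumes that `audit.audit_data` is range-partitioned on `window_timestamp` and that weeks start on Monday; the `weekly_partition_ddl` helper and the partition naming scheme are hypothetical, not taken from the talk.

```python
from datetime import date, timedelta


def weekly_partition_ddl(day: date, parent: str = "audit.audit_data") -> str:
    """Return a CREATE TABLE statement for the weekly partition containing `day`.

    Assumes the parent table is declared with PARTITION BY RANGE on a
    timestamp column, and that a week runs from Monday to the next Monday.
    """
    start = day - timedelta(days=day.weekday())  # Monday of that week
    end = start + timedelta(days=7)              # upper bound is exclusive
    year, week, _ = start.isocalendar()
    name = f"{parent}_{year}w{week:02d}"         # hypothetical naming scheme
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


ddl = weekly_partition_ddl(date(2021, 8, 4))
print(ddl)
```

Because the statement uses `IF NOT EXISTS`, a scheduler can run it repeatedly (for example, once a day for next week's date) without failing when the partition is already there.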
Offloading Data to History
● Spark job
● Apache Airflow
● Monthly partitions - managed the same way
● Less granularity, more metrics
Is it the end of the day yet?
● A simple REST API to answer all your questions
● By topic, by region, by time spec
● What do we need to check?
Is it the end of the day yet?
1. Data arrival rate for the entire scope?
2. Number of audit windows for the entire scope?
3. Arrival rate for the last window?
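The three checks above could be combined into a single yes/no answer roughly as follows. This is a sketch under stated assumptions: the `ScopeStats` fields, the 99% threshold, and the rule that a scope with no produced events is "not done" are all illustrative choices, not the logic of the actual REST API.

```python
from dataclasses import dataclass


@dataclass
class ScopeStats:
    """Hypothetical aggregates for one scope (topic, region, time spec)."""
    produced: int              # events reported by producers for the whole scope
    arrived: int               # events that arrived for the whole scope
    windows_seen: int          # audit windows received for the scope
    windows_expected: int      # e.g. 288 five-minute windows per day
    last_window_produced: int  # produced events in the last window
    last_window_arrived: int   # arrived events in the last window


def rate(arrived: int, produced: int) -> float:
    """Arrival rate; treated as 0.0 when nothing was reported as produced."""
    return arrived / produced if produced else 0.0


def is_end_of_day(s: ScopeStats, min_rate: float = 0.99) -> bool:
    """True only if all three checks pass."""
    scope_ok = rate(s.arrived, s.produced) >= min_rate             # check 1
    windows_ok = s.windows_seen >= s.windows_expected              # check 2
    last_ok = rate(s.last_window_arrived,
                   s.last_window_produced) >= min_rate             # check 3
    return scope_ok and windows_ok and last_ok


complete = ScopeStats(1000, 995, 288, 288, 10, 10)
missing_window = ScopeStats(1000, 995, 287, 288, 10, 10)
print(is_end_of_day(complete), is_end_of_day(missing_window))
```

Checking the last window separately guards against the case where the overall rate looks healthy but the tail of the day has not landed yet.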
Alerts and add-ons
● Alert granularity for different types of failures
○ Region
○ Topic
○ Server
● Detecting duplications
● More locations!
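Duplicate detection can be illustrated with the sample table, where the same TOPIC1 row from server1.slj1.nielsen appears twice. The sketch below flags audit rows that share the same window, topic, server, process and location; the choice of those five columns as the uniqueness key and the `find_duplicates` name are assumptions for the example.

```python
from collections import Counter

KEY_COLUMNS = ("window_timestamp", "topic_name", "server_name",
               "process_id", "location")  # assumed uniqueness key


def find_duplicates(rows):
    """Return the keys that occur more than once among the audit rows."""
    seen = Counter(tuple(row[c] for c in KEY_COLUMNS) for row in rows)
    return [key for key, n in seen.items() if n > 1]


def make_row(topic, count):
    """Build a sample row for server1.slj1.nielsen, process 4075."""
    return {
        "window_timestamp": "2021-08-02T00:00:00",
        "topic_name": topic,
        "server_name": "server1.slj1.nielsen",
        "process_id": 4075,
        "location": "kafka_windows",
        "event_count": count,
    }


rows = [make_row("TOPIC1", 98396), make_row("TOPIC2", 31805),
        make_row("TOPIC1", 98396)]
duplicates = find_duplicates(rows)
print(duplicates)
```

A non-empty result could feed an alert at the same granularity as the other alerts (region, topic or server).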