At Nielsen, data is the core of our business: we love it, and there’s lots of it. We don’t want to lose it, and at the same time, we don’t want to duplicate it.
Our data flows through a robust Kafka architecture into several ETLs that receive, transform and store it.
While we clearly understood our ETLs’ workflow, we had no visibility into what parts of the data, if any, were lost or duplicated, and in which stage or stages of the workflow, from source to destination.
But how much do we know about how our data makes its way through our systems? And what about the lifelong question: is it the end of the day yet?
In this talk I’m going to present the design process behind our Data Auditing system, Life Line: from tracking and producing auditing information to analysing and storing it, using technologies such as Kafka, Avro, Spark, Lambda functions and complex SQL queries. We’re going to cover:
* AVRO Audit header
* Auditing heart beat - designing your metadata
* Designing and optimising your auditing table - what does this data look like anyway?
* Creating an alert based monitoring system
* Answering the most important question of all - is it the end of the day yet?
Auditing data and answering the lifelong question, is it the end of the day yet?
1. Auditing data and answering the lifelong question: Is it the end of the day yet?
Simona Meriam, Aidoc
A true story based off of my endeavours @ Nielsen
2. Agenda
● Nielsen’s architecture
● Little fires everywhere
● Designing our metadata & data
● Storing data and querying for optimum
● Is it the end of the day yet?
● Alerts and add-ons
3. whoami
● Simona Meriam
● Big Data Engineer @ Aidoc
● Data lover
● Concert goer
● Japan enthusiast
5. Little Fires Everywhere
● Data arrival pain points and recovering from failures
● When to process data?
● Is it the end of the day yet?
● Some more pain points
11. Is it the end of the day yet?
When do we process data?
● Let’s talk about this question, and possible answers
1. Data granularity
2. Time granularity
● So when do we process our data? When is it the end of the day?
● The implications of processing and reprocessing
12. Is it the end of the day yet?
Legacy answers to a legacy problem
● Fixed time
● “aws s3 ls”
16. Auditing window? Let’s design our metadata
What should we keep in mind?
● Several Kafka topics
● Data serving infrastructure
○ Our own “Nielsen Kafka Producer”
○ 2 JVMs on a single machine
○ Each JVM works against several topics
○ SLAs are very important!
● The use of AVRO
And then finally, what is a window?
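One way to picture an audit window is as a fixed time bucket in which each producer process counts the events it sent per topic. The sketch below is only an illustration under assumptions: the 60-second window length, the function names and the sample server and process names are hypothetical and do not come from the slides.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # assumed window length; the slides do not state one


def window_start(event_time, window_seconds=WINDOW_SECONDS):
    """Floor a timezone-aware timestamp to the start of its audit window."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)


def count_per_window(events):
    """Count events per (window start, topic, server, process).

    `events` is an iterable of (event_time, topic, server, process) tuples.
    """
    counts = Counter()
    for event_time, topic, server, process in events:
        counts[(window_start(event_time), topic, server, process)] += 1
    return counts


# Hypothetical sample: two JVMs on one server writing to the same topic.
events = [
    (datetime(2024, 1, 1, 10, 0, 5, tzinfo=timezone.utc), "topic_a", "srv1", "jvm1"),
    (datetime(2024, 1, 1, 10, 0, 59, tzinfo=timezone.utc), "topic_a", "srv1", "jvm1"),
    (datetime(2024, 1, 1, 10, 1, 0, tzinfo=timezone.utc), "topic_a", "srv1", "jvm2"),
]
heartbeats = count_per_window(events)
```

Each key of `heartbeats` would correspond to one audit record: a window, the topic, and the server and process that produced the events.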
27. Designing Our Output Table
Questions we want answered
1. At what levels of granularity?
2. Arrival rates?
3. Arrival latency?
28. Designing Our Output Table
● Audit Timestamp
● Topic
● Server
● Process
● Location - Origin of data
● Event count
What about add-ons?
● Region
● Insert time
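The columns listed above can be sketched as a single record type. This is a hypothetical shape, not the actual table definition: the field names, the types and the `arrival_latency_seconds` helper are assumptions. Only the column list and the location value `rdr_headers` (which appears in the query on slide 33) come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    audit_timestamp: datetime  # start of the audit window
    topic: str
    server: str
    process: str
    location: str  # origin of the data, e.g. 'rdr_headers'
    event_count: int
    region: Optional[str] = None  # add-on column
    insert_time: Optional[datetime] = None  # add-on column

    def arrival_latency_seconds(self) -> Optional[float]:
        """Seconds between the window start and the row's insert time."""
        if self.insert_time is None:
            return None
        return (self.insert_time - self.audit_timestamp).total_seconds()


# Hypothetical example row.
row = AuditRow(
    audit_timestamp=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    topic="topic_a",
    server="srv1",
    process="jvm1",
    location="rdr_headers",
    event_count=120,
    region="region_1",
    insert_time=datetime(2024, 1, 1, 10, 1, 30, tzinfo=timezone.utc),
)
latency = row.arrival_latency_seconds()
```

Keeping the insert time next to the audit timestamp is what makes the arrival-latency question answerable from the table alone.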
33.
-- Per audit window and topic: events produced vs. events that arrived within 3 hours.
SELECT window_timestamp AT TIME ZONE 'UTC' AS window_timestamp,
       topic_name,
       SUM(CASE WHEN location = 'kafka_windows' THEN event_count ELSE 0 END) AS producer_count,
       SUM(CASE WHEN location = 'rdr_headers' AND
                     (insert_time AT TIME ZONE 'UTC' - window_timestamp AT TIME ZONE 'UTC' <= INTERVAL '3 HOURS')
                THEN event_count ELSE 0 END) AS rdr_count
FROM audit.audit_data
-- Ignore the most recent 2 hours and look back at most one month.
WHERE window_timestamp <= CURRENT_TIMESTAMP - INTERVAL '2 HOURS' AND
      window_timestamp >= CURRENT_TIMESTAMP - INTERVAL '1 MONTH' AND
      event_count > 0
GROUP BY window_timestamp, topic_name
-- Keep only windows for which the producer side reported events.
HAVING SUM(CASE WHEN location = 'kafka_windows' THEN event_count ELSE 0 END) > 0;
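To make the aggregation in the query easier to follow, here is a small in-memory sketch of the same logic in Python. It is an illustration only: the `compare_counts` function and the sample rows are hypothetical, and the time-range filters on `window_timestamp` from the WHERE clause are left out for brevity.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

MAX_LATENCY = timedelta(hours=3)  # same cut-off as the query


def compare_counts(rows):
    """rows: (window_timestamp, topic_name, location, event_count, insert_time)."""
    totals = defaultdict(lambda: {"producer_count": 0, "rdr_count": 0})
    for window_ts, topic, location, event_count, insert_time in rows:
        if event_count <= 0:  # WHERE event_count > 0
            continue
        bucket = totals[(window_ts, topic)]  # GROUP BY window, topic
        if location == "kafka_windows":
            bucket["producer_count"] += event_count
        elif location == "rdr_headers" and insert_time - window_ts <= MAX_LATENCY:
            bucket["rdr_count"] += event_count
    # HAVING producer_count > 0
    return {k: v for k, v in totals.items() if v["producer_count"] > 0}


w = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
rows = [
    (w, "topic_a", "kafka_windows", 100, w + timedelta(minutes=1)),
    (w, "topic_a", "rdr_headers", 90, w + timedelta(minutes=30)),
    (w, "topic_a", "rdr_headers", 10, w + timedelta(hours=4)),  # too late, not counted
]
result = compare_counts(rows)
```

In this sample the producer reported 100 events for the window but only 90 arrived within three hours, which is the kind of gap the query is meant to surface.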
Q & A With Apache Superset
34.
35. Shout out to my dad...
Grisha Meriam - Senior Software Developer - Aviv Advanced Solutions | LinkedIn
LinkedIn handle: grisha-meriam-4876b784
36. Optimizing PostgreSQL for Audit Queries
1. Weekly partitioning
2. Indexes - unique and complementary
3. Working in parallel
● max_worker_processes
● max_parallel_workers
● max_parallel_workers_per_gather
4. Tricking the optimizer, or how about some SQL hacks, such as using UNION instead of IN
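The slides do not show the actual rewrite, so the following is only a sketch of what replacing IN with UNION can look like. It uses an in-memory SQLite database purely to show that both forms return the same rows. The table contents are hypothetical, and whether the UNION form is faster on PostgreSQL depends on the planner, the indexes and the partitioning, so it is worth checking with EXPLAIN ANALYZE. The sketch uses UNION ALL because the branches are disjoint; plain UNION would also remove duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_data (topic_name TEXT, location TEXT, event_count INTEGER)"
)
conn.executemany(
    "INSERT INTO audit_data VALUES (?, ?, ?)",
    [
        ("topic_a", "kafka_windows", 100),
        ("topic_a", "rdr_headers", 100),
        ("topic_b", "kafka_windows", 50),
        ("topic_c", "kafka_windows", 7),
    ],
)

# Original form: one scan with an IN list.
in_query = """
    SELECT topic_name, location, event_count
    FROM audit_data
    WHERE topic_name IN ('topic_a', 'topic_b')
"""

# Rewritten form: one equality lookup per value, glued together.
union_query = """
    SELECT topic_name, location, event_count
    FROM audit_data WHERE topic_name = 'topic_a'
    UNION ALL
    SELECT topic_name, location, event_count
    FROM audit_data WHERE topic_name = 'topic_b'
"""

in_rows = sorted(conn.execute(in_query).fetchall())
union_rows = sorted(conn.execute(union_query).fetchall())
```

Both queries return the same three rows here; the point of the rewrite is to give the optimizer simple equality predicates that it may be able to serve from an index.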
42. Is it the end of the day yet?
1. Data arrival rate for the entire scope?
2. Number of audit windows for the entire scope?
3. Arrival rate for the last window?
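The three questions above can be combined into a single check. The sketch below is one possible way to do it, not the actual rule used by the system: the function name, its parameters and the 99% threshold are all assumptions made for illustration.

```python
def is_end_of_day(
    produced_total,
    arrived_total,
    windows_seen,
    windows_expected,
    last_window_produced,
    last_window_arrived,
    min_arrival_rate=0.99,  # assumed threshold
):
    """Return True when the day's data looks complete enough to process."""
    # 2. Have all audit windows for the scope been reported?
    if windows_seen < windows_expected:
        return False
    # Nothing produced means there is no rate to evaluate yet.
    if produced_total == 0 or last_window_produced == 0:
        return False
    # 1. Arrival rate for the entire scope.
    overall_rate = arrived_total / produced_total
    # 3. Arrival rate for the last window.
    last_rate = last_window_arrived / last_window_produced
    return overall_rate >= min_arrival_rate and last_rate >= min_arrival_rate


# Hypothetical day with one-minute windows (1440 per day).
done = is_end_of_day(1000, 995, 1440, 1440, 10, 10)
missing_window = is_end_of_day(1000, 995, 1439, 1440, 10, 10)
```

Checking the last window separately guards against a day that looks complete overall while its final window is still arriving.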
43. Alerts and add-ons
● Alert granularity for different types of failures
○ Region
○ Topic
○ Server
● Detecting duplications
● More locations!
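Alerting at region, topic and server granularity, including duplication detection, can be sketched as a comparison of produced and arrived counts per key. This is a minimal illustration under assumptions: the `Alert` type, the `find_alerts` function and the `loss_tolerance` parameter are hypothetical names, and the rule "more arrived than produced means duplication" is one plausible reading of the slide.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    key: tuple  # (region, topic, server)
    kind: str  # "loss" or "duplication"
    produced: int
    arrived: int


def find_alerts(counts, loss_tolerance=0.0):
    """counts maps (region, topic, server) -> (produced, arrived)."""
    alerts = []
    for key, (produced, arrived) in sorted(counts.items()):
        if arrived > produced:
            # More events arrived than were produced: likely duplicates.
            alerts.append(Alert(key, "duplication", produced, arrived))
        elif produced - arrived > produced * loss_tolerance:
            # Fewer events arrived than the tolerance allows: data loss.
            alerts.append(Alert(key, "loss", produced, arrived))
    return alerts


# Hypothetical counts for three servers.
counts = {
    ("us", "topic_a", "srv1"): (100, 100),
    ("us", "topic_a", "srv2"): (100, 120),
    ("eu", "topic_b", "srv3"): (50, 40),
}
alerts = find_alerts(counts)
```

Because each alert carries its full key, the same result can be rolled up by region or by topic before deciding whom to notify.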