This document summarizes Nubank's data infrastructure and data-access practices. Nubank runs 55 squads processing 40 TB of data daily across 190 microservices and 29 models. The goal is to make data easy and safe to use, improving decision making across the company. The infrastructure combines platforms such as Datomic, Redshift, and Spark with services such as Metabase, Looker, and Jupyter notebooks; over 500 people use these data tools daily. Each month, more than 300 ETL job contributions extract, transform, and load data from various sources into the data warehouse. Metabase, Looker, and Jupyter notebooks serve day-to-day analysis, while Databricks handles dataset building and other computationally intensive work. Trainings, support channels, and recurring meetings help users get the most out of the stack.
9. [Architecture diagram: database log streams (DB1 Log S0, DB1 Log S1, DB2 Log S0) feed dataset series; contracts (contract 1, contract 2) define datasets (dataset 1, dataset 2), which feed models and policies, all materialized by ETL jobs]
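The diagram's flow, where per-database log streams are folded into queryable datasets, can be sketched roughly as follows. This is a minimal illustration only: the event shape and the `fold_log` / `merge_shards` helpers are assumptions, not Nubank's actual contracts or dataset-series format.

```python
# Minimal sketch: replaying append-only DB log shards into a current-state
# dataset, mirroring the "DB log -> dataset series -> dataset" flow above.
# Event shape and helper names are illustrative assumptions.

def merge_shards(*shards):
    """Combine several log shards (e.g. DB1 Log S0 and S1) in order."""
    merged = []
    for shard in shards:
        merged.extend(shard)
    return merged

def fold_log(events):
    """Replay a log of (entity_id, attribute, value) events into
    a dict of current entity states; later events win."""
    state = {}
    for entity_id, attribute, value in events:
        state.setdefault(entity_id, {})[attribute] = value
    return state

db1_s0 = [("acct-1", "status", "open"), ("acct-1", "limit", 1000)]
db1_s1 = [("acct-2", "status", "open"), ("acct-1", "limit", 1500)]

dataset = fold_log(merge_shards(db1_s0, db1_s1))
# acct-1 ends with the later value: limit 1500
```

The fold is order-sensitive, which is why shards are merged before replay; downstream datasets, models, and policies then read this materialized state rather than the raw logs.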
10. ETL Jobs
• Anyone in the company can contribute ETL jobs by opening a PR in our monorepo
• Teams are responsible for writing and maintaining their own jobs
• Jobs are written in Scala (Spark SQL); some DSLs are provided
• Databricks is used to iterate on job logic
• Peer review ensures quality and consistency
• ~100 contributors making 300+ contributions per month
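The shape of such a job is extract, transform with SQL, load into the warehouse. Nubank's real jobs are Scala/Spark SQL; the sketch below shows the same pattern in plain Python with sqlite3 standing in for the warehouse, and all table and column names are illustrative assumptions.

```python
import sqlite3

# Sketch of an ETL job's shape: extract from a source table, transform
# with SQL, load the result into a warehouse table. Nubank's jobs are
# Scala / Spark SQL; sqlite3 stands in here, and the `purchases` /
# `daily_spend` names are illustrative.

def run_etl(conn):
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS daily_spend")
    # Transform + load: aggregate daily spend per customer.
    cur.execute("""
        CREATE TABLE daily_spend AS
        SELECT customer_id, day, SUM(amount) AS total
        FROM purchases
        GROUP BY customer_id, day
    """)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("c1", "2024-01-01", 10.0),
     ("c1", "2024-01-01", 5.0),
     ("c2", "2024-01-01", 7.5)],
)
run_etl(conn)
rows = conn.execute(
    "SELECT customer_id, total FROM daily_spend ORDER BY customer_id"
).fetchall()
# rows -> [("c1", 15.0), ("c2", 7.5)]
```

Keeping the transform in SQL is what lets the same logic be iterated on interactively (e.g. in Databricks) before it is submitted as a PR.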
12. Data Tools
Metabase and Looker: simple queries and sharing plots
Jupyter: in-depth analysis, complex plots, training models
Databricks: dataset building and computationally intensive tasks
14. Data Services
Trainings: weekly trainings on SQL, Python, or Scala; new-employee onboarding; new tool rollouts
Support: dedicated Slack support channels, with a community of users supporting each other
Meetings: forums for sharing data scientist and analyst work; monthly meetings to discuss the state of Data
Data Analysts: a function focused on improving data usage across the company (not SQL slaves!)