SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
Neo4j Data Loading
with Kettle
Matt Casters
Chief Solutions Architect / Kettle Project Founder
Agenda
➢ What is Kettle?
➢ The Neo4j plugins
➢ Data loading performance tips
➢ Streaming data integration
➢ Metadata driven data possibilities
➢ Kettle Execution lineage in a graph
➢ Roadmap update
➢ Q&A
What is Kettle?
3
Kettle: Introduction
➢ Pentaho Data Integration from Hitachi Vantara
➢ One of the most widely used ETL tools
➢ Ready for the most demanding tasks
➢ Open source Apache Public License 2.0
➢ Well maintained
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download : Sourceforge / Pentaho / 8.2 / PDI-CE
Kettle: where is it used?
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pie sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, Serverless,
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
Kettle: Why is it used?
➢ Reduce costs!
➢ Answers the “build or buy?” question
build
buy
Time
Accum.
Cost
Kettle
Kettle: Architecture
➢ Metadata driven, engine based :
○ No code generation
○ Define what you need to happen
→ GUI, Web, code, rules, …
○ Clear and transparent, self documenting
➢ Types of work:
○ Jobs for workflows
○ Transformations for parallel data streaming
Kettle: Design
➢ 100% Exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java,
Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
Kettle: things of note
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Huge list of other plugins available, including from
Neo4j, on a marketplace, …
➢ Support for the latest technology stacks
➢ Project on github has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
Kettle: The Toolset
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
Neo4j Kettle Plugins
11
Neo4j Plugins: where to find?
➢ Started by the community, extended by Neo4j
➢ Releases/Download shortcut:
○ http://neo4j.kettle.be
➢ Project:
○ https://github.com/knowbi/knowbi-pentaho-
pdi-neo4j-output
Give us feedback!
Neo4j Cypher
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parameters
➢ Return values
➢ Helpers
Neo4j Output
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Dynamic labels
Neo4j Graph Output
➢ Update (parts of) a graph
➢ Using a logical model
➢ Using field mapping
➢ Auto-generate Cypher
Check Neo4j Connection
➢ Job Entry (workflow)
➢ Validate DBs are up
➢ Used in error diagnostic
➢ Defensive setup
➢ Pessimistic approach
Neo4j Cypher Script
➢ Job Entry (workflow)
➢ Executes series of Cypher statements
Neo4j Kettle Plugins v4
18
Plugins v4
➢ Bulk loading steps
➢ Performance options
➢ Encrypted/obfuscated password in variables
➢ Bug fixes & UI improvements
Neo4j Generate CSVs
➢ Generate CSV files for Neo4j Import
➢ Generates appropriate header
➢ Handles escaping, quoting, …
➢ Outputs file names
Neo4j Split Graph
➢ Splits a graph field into nodes and relationships
➢ Used for unique value calculation
Neo4j Importer
➢ Runs a neo4j-import command
➢ Accepts the filenames of CSV files
Data loading
Performance tips
23
Pre-processing in Kettle
➢ Do work in Kettle that can be avoided in Neo4j
➢ Calculate unique nodes
➢ Do required data conversions
➢ Data cleaning
Parallel loading & batching
➢ Parallel node creation
➢ Limit high parallelism in the general case
➢ UNWIND in Neo4j Cypher step
➢ Create option in Neo4j Output step
➢ Use larger batch sizes (>1000)
➢ Create indexes up-front or with the options
Importing data
➢ Bulk loading with import is much faster
➢ A few orders of magnitude faster
➢ Collect all the data in CSV files
➢ Use the new steps to load
➢ Seamless path to incremental loads
Streaming data loads
27
Streaming options
➢ Micro-batching (every X minutes)
➢ Kafka, Event Hubs, Queues,... (never ending)
Streaming options
➢ Transformations can be never ending
➢ Any operation is possible
➢ Can collect data in other data platforms
➢ Is transactionally safe if it is supported (Kafka, …)
➢ Can be parallelized & scaled out
Metadata driven
Data possibilities
30
➢ Kettle transformations & jobs are metadata
➢ ETL Metadata Injection: transformation templates
➢ Neo4j is a great metadata database
➢ Kettle can make use of this
Metadata FTW
Metadata driven loads
➢ Loading hundreds of types of files
➢ Processing data from hundreds of databases
➢ Automatic data standardization and normalisation
→ Massive time gains!
Metadata driven extracts
➢ Without hardcoded sources, selections and targets
➢ Sourcing selections from users, processes, ...
➢ Using the possibilities of the Kettle engine
→ Flexibility, performance, without coding
Kettle Execution Lineage
34
Kettle Logging Architecture
➢ Unique ID per execution
➢ Precise sourcing of logging records
➢ Very “graphy” data
Execution
Metadata
Impact
Parent /
child
relation
Parent /
child
relation
The Kettle Neo4j Logging plugin
➢ Stores operational metadata in a graph
➢ https://github.com/mattcasters/kettle-neo4j-logging
➢ Tools
○ View execution information: log, duration, errors
○ Find error paths
○ Jump to error location
○ Find execution path of a step
○ Get time window: “since last succesful execution”
Execution lineage in a graph
➢ Documents the exection process
○ Log text, metadata, times, ...
Roadmap update
38
Roadmap Neo4j plugin
➢ 25 releases in 2018
➢ Major 4.0 release next week
➢ Then:
○ New Neo4j Output step
○ More graph data type operations
○ <Insert YOUR suggestion!>
➢ Tuning options for Neo4j steps running in initial
Kettle Apache Beam implementation:
→ DataFlow, Spark, Flink, …
Roadmap Neo4j Logging plugin
➢ Generic impact information logging
➢ Store data lineage in Neo4j
➢ Git revision graph loading (new step)
➢ Storing and viewing unit testing results
➢ Operational “dashboard”
Q&A
41

Más contenido relacionado

La actualidad más candente

Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Trey Grainger
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
Neo4j
 

La actualidad más candente (20)

page ranking algorithm
page ranking algorithmpage ranking algorithm
page ranking algorithm
 
Graph convolutional networks in apache spark
Graph convolutional networks in apache sparkGraph convolutional networks in apache spark
Graph convolutional networks in apache spark
 
Intro to Neo4j
Intro to Neo4jIntro to Neo4j
Intro to Neo4j
 
The Customer Journey Is a Graph
The Customer Journey Is a GraphThe Customer Journey Is a Graph
The Customer Journey Is a Graph
 
Introduction to Neo4j and .Net
Introduction to Neo4j and .NetIntroduction to Neo4j and .Net
Introduction to Neo4j and .Net
 
Basics of MongoDB
Basics of MongoDB Basics of MongoDB
Basics of MongoDB
 
Demystifying Graph Neural Networks
Demystifying Graph Neural NetworksDemystifying Graph Neural Networks
Demystifying Graph Neural Networks
 
Intermediate Cypher.pdf
Intermediate Cypher.pdfIntermediate Cypher.pdf
Intermediate Cypher.pdf
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
 
GPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge GraphGPT and Graph Data Science to power your Knowledge Graph
GPT and Graph Data Science to power your Knowledge Graph
 
Neo4j: Import and Data Modelling
Neo4j: Import and Data ModellingNeo4j: Import and Data Modelling
Neo4j: Import and Data Modelling
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Workshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data ScienceWorkshop - Neo4j Graph Data Science
Workshop - Neo4j Graph Data Science
 
Optimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphOptimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j Graph
 
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disamb...
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Neo4j Fundamentals
Neo4j FundamentalsNeo4j Fundamentals
Neo4j Fundamentals
 
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine LearnGraphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
 

Similar a Neo4j Data Loading with Kettle

GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
Neo4j
 
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
PranavPatil822557
 

Similar a Neo4j Data Loading with Kettle (20)

GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
GraphDay Paris - Intégrer des flux de données dans Neo4j avec l'ETL Open Sour...
 
Road to NODES - Handling Neo4j Data with Apache Hop
Road to NODES - Handling Neo4j Data with Apache HopRoad to NODES - Handling Neo4j Data with Apache Hop
Road to NODES - Handling Neo4j Data with Apache Hop
 
PaaSTA: Running applications at Yelp
PaaSTA: Running applications at YelpPaaSTA: Running applications at Yelp
PaaSTA: Running applications at Yelp
 
Implementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdfImplementing Observability for Kubernetes.pdf
Implementing Observability for Kubernetes.pdf
 
DevSecOps - Security in DevOps
DevSecOps - Security in DevOpsDevSecOps - Security in DevOps
DevSecOps - Security in DevOps
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
ZaloPay Merchant Platform on K8S on-premise
ZaloPay Merchant Platform on K8S on-premiseZaloPay Merchant Platform on K8S on-premise
ZaloPay Merchant Platform on K8S on-premise
 
Architecting The Future - WeRise Women in Technology
Architecting The Future - WeRise Women in TechnologyArchitecting The Future - WeRise Women in Technology
Architecting The Future - WeRise Women in Technology
 
Fineo Technical Overview - NextSQL for IoT
Fineo Technical Overview - NextSQL for IoTFineo Technical Overview - NextSQL for IoT
Fineo Technical Overview - NextSQL for IoT
 
ODN - Technical introduction of the platform
ODN - Technical introduction of the platformODN - Technical introduction of the platform
ODN - Technical introduction of the platform
 
Designing for operability and managability
Designing for operability and managabilityDesigning for operability and managability
Designing for operability and managability
 
Containers for grownups migrating traditional &amp; existing applications[1...
Containers for grownups   migrating traditional &amp; existing applications[1...Containers for grownups   migrating traditional &amp; existing applications[1...
Containers for grownups migrating traditional &amp; existing applications[1...
 
Adding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded SystemAdding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded System
 
Architecting the Future: Abstractions and Metadata - GlueCon
Architecting the Future: Abstractions and Metadata - GlueConArchitecting the Future: Abstractions and Metadata - GlueCon
Architecting the Future: Abstractions and Metadata - GlueCon
 
Kubernetes meetup bangalore december 2017 - v02
Kubernetes meetup bangalore   december 2017 - v02Kubernetes meetup bangalore   december 2017 - v02
Kubernetes meetup bangalore december 2017 - v02
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
Pentaho ppt up
Pentaho ppt upPentaho ppt up
Pentaho ppt up
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
Machine Learning , Analytics & Cyber Security the Next Level Threat Analytics...
 
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
David Henthorn [Rose-Hulman Institute of Technology] | Illuminating the Dark ...
 

Más de Neo4j

Más de Neo4j (20)

LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
 
GraphSummit Milan - Visione e roadmap del prodotto Neo4j
GraphSummit Milan - Visione e roadmap del prodotto Neo4jGraphSummit Milan - Visione e roadmap del prodotto Neo4j
GraphSummit Milan - Visione e roadmap del prodotto Neo4j
 
GraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with GraphGraphSummit Milan - Neo4j: The Art of the Possible with Graph
GraphSummit Milan - Neo4j: The Art of the Possible with Graph
 
LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
LARUS - Galileo.XAI e Gen-AI: la nuova prospettiva di LARUS per il futuro del...
 
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale IbridaUNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
UNI DI NAPOLI FEDERICO II - Il ruolo dei grafi nell'AI Conversazionale Ibrida
 
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
CERVED e Neo4j su una nuvola, migrazione ed evoluzione di un grafo mission cr...
 
From Knowledge Graphs via Lego Bricks to scientific conversations.pptx
From Knowledge Graphs via Lego Bricks to scientific conversations.pptxFrom Knowledge Graphs via Lego Bricks to scientific conversations.pptx
From Knowledge Graphs via Lego Bricks to scientific conversations.pptx
 
Novo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMsNovo Nordisk: When Knowledge Graphs meet LLMs
Novo Nordisk: When Knowledge Graphs meet LLMs
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansQIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
QIAGEN: Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
ISDEFE - GraphSummit Madrid - ARETA: Aviation Real-Time Emissions Token Accre...
ISDEFE - GraphSummit Madrid - ARETA: Aviation Real-Time Emissions Token Accre...ISDEFE - GraphSummit Madrid - ARETA: Aviation Real-Time Emissions Token Accre...
ISDEFE - GraphSummit Madrid - ARETA: Aviation Real-Time Emissions Token Accre...
 
BBVA - GraphSummit Madrid - Caso de éxito en BBVA: Optimizando con grafos
BBVA - GraphSummit Madrid - Caso de éxito en BBVA: Optimizando con grafosBBVA - GraphSummit Madrid - Caso de éxito en BBVA: Optimizando con grafos
BBVA - GraphSummit Madrid - Caso de éxito en BBVA: Optimizando con grafos
 
Graph Everywhere - Josep Taruella - Por qué Graph Data Science en tus modelos...
Graph Everywhere - Josep Taruella - Por qué Graph Data Science en tus modelos...Graph Everywhere - Josep Taruella - Por qué Graph Data Science en tus modelos...
Graph Everywhere - Josep Taruella - Por qué Graph Data Science en tus modelos...
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdfNeo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
Neo4j_Exploring the Impact of Graph Technology on Financial Services.pdf
 

Último

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 

Último (20)

WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 

Neo4j Data Loading with Kettle

  • 1. Neo4j Data Loading with Kettle Matt Casters Chief Solutions Architect / Kettle Project Founder
  • 2. Agenda ➢ What is Kettle? ➢ The Neo4j plugins ➢ Data loading performance tips ➢ Streaming data integration ➢ Metadata driven data possibilities ➢ Kettle Execution lineage in a graph ➢ Roadmap update ➢ Q&A
  • 4. Kettle: Introduction ➢ Pentaho Data Integration from Hitachi Vantara ➢ One of the most widely used ETL tools ➢ Ready for the most demanding tasks ➢ Open source Apache Public License 2.0 ➢ Well maintained ➢ Large community, marketplace, ... ➢ Easy to embed, install, package, rebrand ➢ Download : Sourceforge / Pentaho / 8.2 / PDI-CE
  • 5. Kettle: where is it used? ➢ On tiny and enormous systems, real or virtual ➢ Very small computers, Raspberry Pie sized ➢ Your laptop or browser ➢ Locally or in the cloud ➢ On Hadoop clusters, VMs, Docker, Serverless, ➢ At large and small companies ➢ In government ➢ In education ➢ In the Neo4j Solutions Reference Architecture
  • 6. Kettle: Why is it used? ➢ Reduce costs! ➢ Answers the “build or buy?” question build buy Time Accum. Cost Kettle
  • 7. Kettle: Architecture ➢ Metadata driven, engine based : ○ No code generation ○ Define what you need to happen → GUI, Web, code, rules, … ○ Clear and transparent, self documenting ➢ Types of work: ○ Jobs for workflows ○ Transformations for parallel data streaming
  • 8. Kettle: Design ➢ 100% Exposure of our engine through UI elements ➢ Everyone should be able to play along: plugins! ➢ We built integration points for others: run everywhere! ➢ Allow the user to avoid programming anything ➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, … ➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging, data previewing, row sniff testing, …
  • 9. Kettle: things of note ➢ SpoonGit: UI integration with git ➢ WebSpoon: web interface to the full Spoon UI ➢ Data Sets: build transformation unit tests ➢ Huge list of other plugins available, including from Neo4j, on a marketplace, … ➢ Support for the latest technology stacks ➢ Project on github has over 1,000 forks https://github.com/pentaho/pentaho-kettle
  • 10. Kettle: The Toolset ➢ Spoon: GUI ➢ Scripts ➢ Server(s) ➢ Java API & SDK ➢ Standard file format ➢ Plugin ecosystem ➢ Docker image(s) ➢ Documentation, books, ...
  • 12. Neo4j Plugins: where to find? ➢ Started by the community, extended by Neo4j ➢ Releases/Download shortcut: ○ http://neo4j.kettle.be ➢ Project: ○ https://github.com/knowbi/knowbi-pentaho- pdi-neo4j-output Give us feedback!
  • 13. Neo4j Cypher ➢ For reading and writing ➢ Dynamic Cypher ➢ Batching and UNWIND ➢ Parameters ➢ Return values ➢ Helpers
  • 14. Neo4j Output ➢ Easy node creation ➢ Create/Merge of ()-[]-() ➢ Batching and UNWIND ➢ Dynamic labels
  • 15. Neo4j Graph Output ➢ Update (parts of) a graph ➢ Using a logical model ➢ Using field mapping ➢ Auto-generate Cypher
  • 16. Check Neo4j Connection ➢ Job Entry (workflow) ➢ Validate DBs are up ➢ Used in error diagnostic ➢ Defensive setup ➢ Pessimistic approach
  • 17. Neo4j Cypher Script ➢ Job Entry (workflow) ➢ Executes series of Cypher statements
  • 19. Plugins v4 ➢ Bulk loading steps ➢ Performance options ➢ Encrypted/obfuscated password in variables ➢ Bug fixes & UI improvements
  • 20. Neo4j Generate CSVs ➢ Generate CSV files for Neo4j Import ➢ Generates appropriate header ➢ Handles escaping, quoting, … ➢ Outputs file names
  • 21. Neo4j Split Graph ➢ Splits a graph field into nodes and relationships ➢ Used for unique value calculation
  • 22. Neo4j Importer ➢ Runs a neo4j-import command ➢ Accepts the filenames of CSV files
  • 24. Pre-processing in Kettle ➢ Do work in Kettle that can be avoided in Neo4j ➢ Calculate unique nodes ➢ Do required data conversions ➢ Data cleaning
  • 25. Parallel loading & batching ➢ Parallel node creation ➢ Limit high parallelism in the general case ➢ UNWIND in Neo4j Cypher step ➢ Create option in Neo4j Output step ➢ Use larger batch sizes (>1000) ➢ Create indexes up-front or with the options
  • 26. Importing data ➢ Bulk loading with import is much faster ➢ A few orders of magnitude faster ➢ Collect all the data in CSV files ➢ Use the new steps to load ➢ Seamless path to incremental loads
  • 28. Streaming options ➢ Micro-batching (every X minutes) ➢ Kafka, Event Hubs, Queues,... (never ending)
  • 29. Streaming options ➢ Transformations can be never ending ➢ Any operation is possible ➢ Can collect data in other data platforms ➢ Is transactionally safe if it is supported (Kafka, …) ➢ Can be parallelized & scaled out
  • 31. ➢ Kettle transformations & jobs are metadata ➢ ETL Metadata Injection: transformation templates ➢ Neo4j is a great metadata database ➢ Kettle can make use of this Metadata FTW
  • 32. Metadata driven loads ➢ Loading hundreds of types of files ➢ Processing data from hundreds of databases ➢ Automatic data standardization and normalisation → Massive time gains!
  • 33. Metadata driven extracts ➢ Without hardcoded sources, selections and targets ➢ Sourcing selections from users, processes, ... ➢ Using the possibilities of the Kettle engine → Flexibility, performance, without coding
  • 35. Kettle Logging Architecture ➢ Unique ID per execution ➢ Precise sourcing of logging records ➢ Very “graphy” data Execution Metadata Impact Parent / child relation Parent / child relation
  • 36. The Kettle Neo4j Logging plugin ➢ Stores operational metadata in a graph ➢ https://github.com/mattcasters/kettle-neo4j-logging ➢ Tools ○ View execution information: log, duration, errors ○ Find error paths ○ Jump to error location ○ Find execution path of a step ○ Get time window: “since last succesful execution”
  • 37. Execution lineage in a graph ➢ Documents the exection process ○ Log text, metadata, times, ...
  • 39. Roadmap Neo4j plugin ➢ 25 releases in 2018 ➢ Major 4.0 release next week ➢ Then: ○ New Neo4j Output step ○ More graph data type operations ○ <Insert YOUR suggestion!> ➢ Tuning options for Neo4j steps running in initial Kettle Apache Beam implementation: → DataFlow, Spark, Flink, …
  • 40. Roadmap Neo4j Logging plugin ➢ Generic impact information logging ➢ Store data lineage in Neo4j ➢ Git revision graph loading (new step) ➢ Storing and viewing unit testing results ➢ Operational “dashboard”