2. Agenda
• NiFi REST API
• NiFi In Depth
• NiFi developer Guide
• Custom Processor
• Contribution Sharing
3. NiFi REST API
• The Rest API provides programmatic access to command and control a
NiFi instance in real time.
• Start and stop processors, monitor queues, query provenance data, and
more.
7. NiFi REST API
• Every component in NiFi actually has a unique ID.
• Every operation to component is actually REST request to NiFi instance.
• Most of operation need to specify component ID
• https:// /nifi-api/process-groups
/015d1045-0b88-1db2-da38-cb71ac006792/process-groups
NiFi Instance URL
REST API Usage
REST Path
Unique Component ID
12. NiFi REST API
• RevisionDTO – indentify component version view to client
ProcessGroupDTO – Component body of ProcessGroup
PositionDTO – Position in canvas
• All DTO, Entity are provided.
<dependency>
<groupId>org.apache.nifi</groupId>
<artifactId>nifi-client-dto</artifactId>
<version>1.1.2</version>
</dependency>
13. REST API Recap
• Every component in NiFi actually has a unique ID.
• Every operation to component is actually REST request to NiFi instance.
• Most of operation need to specify component ID
14. NiFi in Depth
• Repositories
• Life of FlowFile
FlowFile Mechanism in Depth
17. NiFi in Depth
• FlowFile are the heart of NiFi and its flow-based design.
• A FlowFile is a data record, Consist of a pointer to its content, attributes
and associated with provenance events
• Attribute are key/value pairs act as metadata for the FlowFile
• Content is the actual data of the file
• Provenance is a record of what has happened to the FlowFile
18. NiFi in Depth
• Repository are immutable.
• The benefits of this are many, including: substantial reduction in storage
space required for the typical complex graphs of processing, natural
replay capability, takes advantage of OS caching, reduces random
read/write performance hits, and is easy to reason over.
• All three repositories actually directories on local storage to persist data.
19. NiFi in Depth
• The FlowFile repository contains metadata for all current FlowFiles in the
flow
• The Content Repository holds the content for current and past FlowFiles
• The Provenance Repository holds the history of FlowFiles
20. NiFi in Depth
• FlowFiles are held in Map in JVM memory
• FlowFile metadata include
- Attributes
- A pointer to the actual contet of FlowFile
- State (Which Connection/Queue belonged in)
• FlowFile Repository act as NiFi’s “Write-Ahead Log”
• Each change happens as a transactional unit of work
21. NiFi in Depth
• NiFi recover a FlowFile by restoring a snapshot of the FlowFile
• A snapshot is automatically taken periodically by the system
• Compute a new base checkpoint by serializing FlowFile map into disk
with filename ‘.partial’
• Step by Step WAL in NiFi
https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-
Ahead+Log+Implementation
22. Content Repository
• Largest Repositories, utilize immutability and copy-on-write to maximize
speed and thread-safety
• Resource Claims are Java objects that point to specific files on disk
• The FlowFile has a “Content Claim” object
- a reference to Resource Claims
- offset of content within the file
- length of the content
23. Provenance Repository
• History of each FlowFile, provide Data Lineage (Chain of Custody)
• When a provenance event is created, it copies all the FlowFile’s
attributes and content pointer and stat to one location in the
Provenance Repo
• Provenance Repository design decisions
https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance
+Repository+Design
25. Repositories Recap
• The FlowFile repository contains metadata for all current FlowFiles in the
flow
• The Content Repository holds the content for current and past FlowFiles
• The Provenance Repository holds the history of FlowFiles
• Best practice
- Analyze contents of FlowFile as few times as possible
- Extract key information into attributes
- Update FlowFile repository is much faster than content repository
26. Life of FlowFile
• Data Ingress → Pass by Reference → Copy-On-Write → Data Egress
• Important aspect of flow-based programming is the resource-
constrained relationships between the black boxes.
• Route from one processor to another simply by passing a reference to
FlowFile
31. Data Egress
• Eventually FlowFile will be “DROPPED”, no longer processing and is
available for deletion.
• Remains in the FlowFile repository until next repository checkpoint. (24
hours default) release all old content claims.
• Periodically, The Content Repo ask the Resource Claim Manager which
Resource Claims can be cleaned up.
36. Common Processor Patterns
• Data Ingress
• Data Egress
• Route Based on Content
• Route Based on Attribute
• Split Content
• Update Attributes Based on Content
• Enrich Modify Content
37. Error Handling
• ProcessException or other Exception means it is known failure
and roll back session
• Don’t catch general Exceptions, Throwable.
• Penalization vs Yielding
38. Session rollback
• ProcessSession provide transactionality
• Call commit() or rollback() to end session.
• Best practice is to keep simplicity
39. Testing
• NiFi provide mock framework for Processor testing.
Use TestRunner interface
• 1-AddControllerService if needed
runner.addControllerService()
• 2-Set Property Value
Map<String, String> attributes
attributes.put(‘property name’, ‘property value’);
• 3-Enqueue FlowFiles
runner.enqueuer(“Select ….”.getBytes(),attributes);
• 4-Run the processor
runner.run();
runner.assertAllFlowFilesTransferred(Success,1);
40. Recap Developer guide
• Understand life cycle of Processor
• Understand supporting component API
• Understand processor general pattern
• Understand how to handle process failure
• Understand how to test processor
41. Contribution preparation
• NiFi Contributor Guide
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide
• Git Feature Branch Workflow
https://www.atlassian.com/git/tutorials/comparing-workflows
• How to Write a Git Commit Message
https://chris.beams.io/posts/git-commit/
42. Contribution feedback
• Don’t produce trailing whitespace
• GitHub Pull request procedure
• Commit title start with NIFI-2829
• Open Source Ci fail all the time, Don’t panic.
• Keep patient and humble for reviewers feedback.
43. Contribution feedback
• While dealing with Time Zone problem.
We should consider building in different time zone.
• In java 1.8, there is standard library provide great support to dealing
with Time issue in Java.
https://docs.oracle.com/javase/8/docs/api/java/time/package-
summary.html
https://magiclen.org/java-8-date-time-api/
44. Reference
• Official Apache NiFi
https://nifi.apache.org/
• All Micron nifi instance
http://nifi.micron.com/
• Hortonworks forum