Nitendra Gautam

Introduction to Apache NiFi

Background

While working with Kafka, I did a lot of research on event-driven messaging and event-based architecture. During this research I stumbled upon Apache NiFi, which helps to create complex data flows for distributed or Internet of Things (IoT) based applications. I decided to write this introduction to NiFi, which I believe will be a key player in IoT-based applications in the future.

Introduction

Apache NiFi is an open-source tool for automating and managing the flow of data between systems (databases, sensors, Hadoop, data platforms, and other sources). It solves the problem of collecting and transporting data in real time from a multitude of data sources, and it provides an interactive user interface for controlling live flows with full, automated data provenance.

It is data-source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds, and sizes: machines, geolocation devices, click streams, files, social feeds, log files, videos, and more. It is configurable plumbing for moving data around, similar to how FedEx, UPS, or other courier/delivery services move parcels around. And just like those services, Apache NiFi allows you to trace your data in real time, just as you could trace a delivery.

The project is built around flow-based programming, is written in Java, and provides a web-based user interface to manage data flows in real time. NiFi provides the data acquisition, simple event processing, transport, and delivery mechanisms designed to accommodate the diverse dataflows generated by a world of connected people, systems, and things.

The project was developed as a classified project at the United States National Security Agency (NSA) for 8 years under the name Niagarafiles. The NSA open-sourced the application through the Apache Software Foundation in 2014 via its technology transfer program.

NiFi is helpful in creating dataflows: it lets you transfer data from one system to another as well as process the data in between.

Use Cases of NiFi

NiFi is used for data ingestion: it pulls data into NiFi from numerous different data sources and creates FlowFiles. It can process extremely large data sets, extremely small data at high rates, and variably sized data. It can be used for a wide variety of use cases, some of which are discussed below.

NiFi vs Kafka

Both Apache NiFi and Apache Kafka provide a broker to connect producers and consumers but they do so in a way that is quite different from one another and complementary when looking holistically at what it takes to connect the enterprise.

With Kafka, the logic of the dataflow lives in the systems that produce data and the systems that consume it. NiFi decouples the producer and consumer further, allowing as much of the dataflow logic as possible (or desired) to live in the broker itself. This is why NiFi has interactive command and control to effect immediate change, and why it offers the Processor API to operate on, alter, and route data streams as they flow. It is also why NiFi provides powerful back-pressure and congestion-control features. The model NiFi offers gives you a point of central control with distributed execution, where you can address cross-cutting concerns and tackle things like compliance checks and tracking, which you would not want on the producers and consumers.

There are of course many other aspects to compare; a few of the most common ones are discussed below.

Push vs Pull Data Ingestion Pattern

In terms of data ingestion patterns, Kafka producers push data to the Kafka broker and Kafka consumers pull data from it. Although this is a clean and scalable model, it requires every participating system to accept and follow that protocol. In contrast, NiFi is not tied to a specific protocol: it supports both push and pull ingestion patterns for getting data into and out of NiFi.

High Availability

On the data plane, NiFi does not offer distributed data durability today the way Kafka does. The NiFi community is adding distributed durability, but its value for NiFi's use cases will be less vital than it is for Kafka, since NiFi does not hold data for the arbitrary-consumer pattern that Kafka supports. If a NiFi node goes down, its data is delayed while the node is down; avoiding data loss, though, is easily solved with tried-and-true RAID or distributed block storage. NiFi's control plane does already provide high availability: the cluster coordinator and even multiple nodes in a cluster can be lost while the live flow continues operating normally.

Performance

Kafka offers an impressive balance of both high throughput and low latency. But comparing performance of Kafka and NiFi is not very meaningful given that they do very different things. It would be best to discuss performance tradeoffs in the context of a particular use case.


NiFi FlowFile

A FlowFile is a message, event, or piece of user data that is pushed into or created in NiFi. A FlowFile has two main parts: its content (the actual payload, a stream of bytes) and its attributes, which are key/value pairs attached to the content (in other words, metadata about the content).
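To make the content/attributes split concrete, here is a minimal, hypothetical Java sketch of a FlowFile as a byte payload plus a key/value attribute map. It only illustrates the concept and is not NiFi's actual org.apache.nifi.flowfile.FlowFile API; the filename and mime.type keys simply mirror the kind of core attributes NiFi attaches.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Conceptual model of a FlowFile: a byte-stream payload plus
// key/value attributes (metadata). Illustration only -- not the
// real org.apache.nifi.flowfile.FlowFile interface.
public class FlowFileSketch {
    static class FlowFile {
        final byte[] content;                 // actual payload bytes
        final Map<String, String> attributes; // metadata about the payload

        FlowFile(byte[] content, Map<String, String> attributes) {
            this.content = content;
            this.attributes = attributes;
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("filename", "orders.csv"); // hypothetical attribute values
        attrs.put("mime.type", "text/csv");

        FlowFile ff = new FlowFile(
                "id,amount\n1,9.99".getBytes(StandardCharsets.UTF_8), attrs);
        System.out.println(ff.attributes.get("filename")
                + " (" + ff.content.length + " bytes)");
    }
}
```

Running main prints `orders.csv (16 bytes)`: the attributes travel with the payload but can be read without touching the content bytes.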

Relationships in a NiFi Dataflow

When a processor finishes processing a FlowFile, the result is routed to a relationship such as success or failure. Based on this relationship, the data is sent to the downstream (next) processor or mediated accordingly.
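As a rough illustration, the following hypothetical Java sketch models a processor outcome being routed to a named relationship by placing the FlowFile's content on the matching downstream queue. In real NiFi this routing is done through the framework session, so everything here (the process method, the queue map) is illustrative only.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;

// Toy model of relationship-based routing: each relationship name
// maps to a downstream connection (queue). Illustration only.
public class RelationshipSketch {
    static final Map<String, Queue<String>> connections = Map.of(
            "success", new ArrayDeque<>(),
            "failure", new ArrayDeque<>());

    // A toy "processor": parsing succeeds -> success, otherwise failure.
    static String process(String flowFileContent) {
        try {
            Integer.parseInt(flowFileContent.trim());
            return "success";
        } catch (NumberFormatException e) {
            return "failure";
        }
    }

    public static void main(String[] args) {
        for (String content : new String[] {"42", "not-a-number"}) {
            String relationship = process(content);
            connections.get(relationship).add(content); // transfer downstream
        }
        System.out.println("success=" + connections.get("success"));
        System.out.println("failure=" + connections.get("failure"));
    }
}
```

Running main prints `success=[42]` and `failure=[not-a-number]`; in a real flow, each of those queues would feed a different downstream processor.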

Reporting Task

A Reporting Task is a NiFi extension point that is capable of reporting and analyzing NiFi’s internal metrics in order to provide the information to external resources or report status information as bulletins that appear directly in the NiFi User Interface.

NiFi Processor

The processor is the main component in NiFi: it is what actually works on the FlowFile content, and it handles creating, sending, receiving, transforming, routing, splitting, merging, and otherwise processing FlowFiles.

Programming Languages Supported by Apache NiFi

NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages.

Content Repository in Apache NiFi

A NiFi FlowFile does not store the contents itself; they are stored in the content repository and referenced by the FlowFile. This allows the contents of FlowFiles to be stored independently and efficiently, based on the underlying storage mechanism.

Back Pressure in a NiFi System

Sometimes the producing system is faster than the consuming system, so messages are consumed more slowly than they arrive, and all the FlowFiles that have not yet been processed remain in the connection's buffer. You can, however, limit the back-pressure threshold of a connection, either by number of FlowFiles or by total data size. When the defined limit is reached, the connection applies back pressure to the producing processor so that it is not scheduled to run, and no more FlowFiles are generated until the back pressure is relieved.
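The mechanism can be sketched with a bounded queue: once the queue reaches its threshold, new offers from the producer are refused until the consumer drains it. This is only an analogy in plain Java, not NiFi's implementation; the threshold of 3 is arbitrary (NiFi's default object threshold is 10,000 FlowFiles per connection).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Analogy for connection back pressure: a bounded queue between a
// producer and a consumer. Once full, the producer's offers are
// refused until the consumer drains the queue. Not the NiFi API.
public class BackPressureSketch {
    public static void main(String[] args) {
        BlockingQueue<String> connection = new ArrayBlockingQueue<>(3);

        // Producer side: offer() fails once the threshold is reached,
        // signalling back pressure instead of growing without bound.
        for (int i = 1; i <= 5; i++) {
            boolean accepted = connection.offer("flowfile-" + i);
            System.out.println("flowfile-" + i
                    + (accepted ? " queued" : " back-pressured"));
        }

        connection.poll(); // consumer drains one FlowFile...
        System.out.println("after drain, offer accepted: "
                + connection.offer("flowfile-6"));
    }
}
```

Running main shows the first three offers accepted, the next two refused, and a new offer accepted again once the consumer has drained one item, which is the essence of the throttling behaviour described above.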

Template In NiFi

A template is a reusable workflow that you can export from and import into the same or a different NiFi instance. It can save a lot of time compared with recreating a flow from scratch each time. A template is stored as an XML file.

Use of Bulletins in NiFi

If you want to know when problems occur in a dataflow, you could check the logs for anything interesting, but it is much more convenient to have notifications pop up on the screen. If a processor logs anything as a WARNING or ERROR, we will see a "Bulletin Indicator" show up in the top-right-hand corner of the processor.

This indicator looks like a sticky note and will be shown for five minutes after the event occurs. Hovering over the bulletin provides information about what happened so that the user does not have to sift through log messages to find it. In a cluster, the bulletin also indicates which node emitted it. We can also change the log level at which bulletins occur in the Settings tab of the Configure dialog for a processor.

Do Attributes Get Added to the Content (Actual Data) When Data Is Pulled by NiFi?

You can certainly add attributes to your FlowFiles at any time; that is the whole point of separating metadata from the actual data. Essentially, one FlowFile represents an object or a message moving through NiFi. Each FlowFile contains a piece of content, which is the actual bytes. You can extract attributes from the content and store them in memory, then operate against those attributes in memory without touching the content. Doing so saves a lot of I/O overhead, making the whole flow-management process extremely efficient.

Creating a Template when Password is Stored in the DataFlow

A password is a sensitive property, so it is dropped when the dataflow is exported as a template. As soon as you import the template into the same or a different NiFi system, you will need to re-enter it.

NiFi Support for Huge Volumes of Data in a Dataflow

Huge volumes of data can transit a dataflow. As data moves through NiFi, a pointer to the data, referred to as a FlowFile, is passed around; the content of the FlowFile is only accessed as needed.

NiFi Custom Properties Registry

To load custom key/value pairs, you can use the custom properties registry, which is configured in the nifi.properties file:

nifi.variable.registry.properties=/conf/nifi_registry 

You can then put key/value pairs in that file and reference those properties in your NiFi processors using the Expression Language, e.g. ${OS}, if you have configured that property in the registry file.
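As a toy illustration of how a ${key} reference might be resolved against such a registry, here is a hypothetical Java sketch. NiFi's Expression Language is far richer than this simple substitution, and the OS key with value linux is just an example registry entry.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy ${key} substitution against a custom properties registry.
// Illustration only -- NiFi's Expression Language supports functions,
// chaining, and more than plain key lookup.
public class RegistrySketch {
    static final Pattern REF = Pattern.compile("\\$\\{(\\w+)\\}");

    static String resolve(String expression, Map<String, String> registry) {
        Matcher m = REF.matcher(expression);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Replace each ${key} with its registry value (empty if absent).
            m.appendReplacement(out, Matcher.quoteReplacement(
                    registry.getOrDefault(m.group(1), "")));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> registry = Map.of("OS", "linux"); // example entry
        System.out.println(resolve("target-os=${OS}", registry));
    }
}
```

With the example registry entry, `resolve("target-os=${OS}", registry)` returns `target-os=linux`.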

Architecture in Apache NiFi

Since NiFi 1.0, a zero-master philosophy has been adopted: every node in a NiFi cluster performs the same tasks, and the cluster is coordinated through Apache ZooKeeper. ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator, which is responsible for disconnecting and connecting nodes. Additionally, every cluster has one Primary Node, also elected by ZooKeeper.

References:

Apache NiFi

Analyze Traffic Patterns with Apache NiFi - Hortonworks

Google Groups

Real World Use Cases of Real-Time DataFlows in Record Time - Hortonworks