Apache Flink is an open source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
- DataSet API for static data embedded in Java, Scala, and Python,
- DataStream API for unbounded streams embedded in Java and Scala, and
- Table API with a SQL-like expression language embedded in Java and Scala.
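To give a feel for the DataStream API listed above, here is a minimal streaming word count in Java. This is a sketch that assumes Flink's Java libraries are on the classpath (method names follow the Flink 1.x DataStream API); it is illustrative, not a complete application:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Obtain the streaming execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be", "or not to be")
           // Split each line into (word, 1) pairs.
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           // Group by the word and sum the running counts per key.
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("Streaming WordCount");
    }
}
```

The same program structure (environment, transformations, execute) applies to the DataSet and Table APIs, with the execution environment and operators swapped accordingly.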
Flink also bundles libraries for domain-specific use cases:
- Machine Learning library, and
- Gelly, a graph processing API and library.
You can integrate Flink easily with other well-known open source systems, both for data input and output and for deployment.
Flink's data streaming runtime achieves high throughput and low latency with little configuration. A representative benchmark is a distributed item counting task, which requires streaming data shuffles between workers.
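To make the item counting pattern concrete without a Flink installation, the shape of that dataflow (map each record to a key, group by key, aggregate) can be mimicked with plain Java streams. This is only a conceptual stand-in, not Flink's API, and it runs on a bounded in-memory collection rather than a distributed stream:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ItemCount {
    // Count occurrences of each item: the same map -> group-by-key -> aggregate
    // shape a distributed counting dataflow uses, here on in-memory data.
    static Map<String, Long> countItems(List<String> items) {
        return items.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countItems(List.of("a", "b", "a", "c", "a", "b"));
        System.out.println(counts); // each item mapped to its count
    }
}
```

In Flink, the group-by-key step additionally implies a network shuffle between parallel workers, which is what the benchmark above stresses.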
Flink programs can be written in Java or Scala and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment. Flink does not provide its own data storage system; input data must be stored in a distributed storage system like HDFS or HBase. For data stream processing, Flink consumes data from (reliable) message queues like Kafka.