7 The Kafka ecosystem and its future
Use cases and challenges
- Primary use cases
- Connecting disparate sources of data
- The client API makes it possible to write data connectors and sinks for virtually any data source (see the producer sketch after this list)
- Data supply chain pipelines, replacing old ETL environments.
- More generally, "Big data" integration (Hadoop, Spark)
- Remaining challenges
- Governing data evolution
- Intrinsic data (in)consistency
- Handling both big data (the focus of the five years up to 2016) and fast data (the expected focus of the five years after)
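The point above about writing connectors against the client API can be made concrete with a minimal Java producer sketch. Everything specific below (broker address, topic name, keys and values) is a placeholder, not something from these notes.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: push data read from some external source into a Kafka topic.
// "localhost:9092" and "file-lines" are illustrative placeholders.
public class FileLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A real connector would loop over its source here
            // (file system, RDBMS, NoSQL store, ...).
            producer.send(new ProducerRecord<>("file-lines", "key-1", "first line"));
            producer.send(new ProducerRecord<>("file-lines", "key-2", "second line"));
        }
    }
}
```

A sink works the same way in reverse: a KafkaConsumer subscribes to the topic and writes each record into the target system.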
Governance and evolution
- Each producer defines its message contract via the (Key|Value)-serializer properties
- The built-in serializers are not enough for most cases, hence the need for custom serializers
- These contracts have versions, all of which coexist in the cluster
- Consumers need to know which contract and version they are reading in order to deserialize the data
- A Kafka limitation: there is no built-in registry of message formats and their versions
- Confluent: Kafka Schema Registry
- Almost universal format: Apache Avro. Self-describing.
- First-class Avro (de)serializers (see the configuration sketch after this list)
- Schema registry and version management in the cluster
- RESTful interface for registering and retrieving schemas
- Enforces compatibility rules between schema versions
- Alternatives: Protobuf, Apache Thrift, MessagePack
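A hedged configuration sketch of the Avro plus Schema Registry combination listed above. The KafkaAvroSerializer class and the schema.registry.url property come from Confluent's Schema Registry client libraries rather than Apache Kafka itself, and the URLs, topic, and record schema are placeholders.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: produce Avro records whose schema is registered and versioned
// by the Schema Registry. Broker and registry URLs are placeholders.
public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer (from the kafka-avro-serializer artifact).
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Where the serializer registers and looks up schema versions.
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "user-1", user));
        }
    }
}
```

On send, the serializer registers the schema with the registry if it is new and embeds only a small schema id in each message; consumers use that id to fetch the right schema version for deserialization.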
Consistency and productivity
- Lots of duplicated effort writing producers and consumers, all of them mostly the same
- Lack of a common framework for integrating the sources and targets, although
they are not that numerous for each category:
- producers: file systems, NoSQL, RDBMS, ...
- consumers: Search Engines, HDFS, RDBMS, ...
- Confluent's answer (in Apache Kafka since 0.9) => Kafka Connect and the Connector Hub
- Common framework for integration
- Makes writing producers and consumers easier and more consistent (see the task skeleton after this list)
- Platform connectors:
- Oracle, HP, ...
- 50+ and growing
- Connector Hub: a catalog open to anyone who publishes such an integration
- Should make Kafka integration faster and cheaper
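To make the common-framework point concrete, here is a rough skeleton of a source task written against the Kafka Connect Java API. The class name, topic, and payload are invented for illustration, and the paired SourceConnector class (which declares the task class and its configuration) is omitted for brevity.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Skeleton of a Connect source task: the framework handles offsets, scaling,
// and delivery; the task only knows how to read from its source system.
// Names ("DemoSourceTask", "demo-topic") are illustrative, not from these notes.
public class DemoSourceTask extends SourceTask {

    private String topic;

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // Configuration arrives from the connector's properties or REST call.
        topic = props.getOrDefault("topic", "demo-topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real task would read from the external system here and
        // track its position in the source offset map.
        Map<String, ?> sourcePartition = Collections.singletonMap("source", "demo");
        Map<String, ?> sourceOffset = Collections.singletonMap("position", 0L);
        SourceRecord record = new SourceRecord(
            sourcePartition, sourceOffset, topic,
            Schema.STRING_SCHEMA, "hello from the demo source");
        return Collections.singletonList(record);
    }

    @Override
    public void stop() {
        // Release any resources held against the external system.
    }
}
```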
Fast data
- Needs: real-time processing, predictive analytics, machine learning
- Apache Platforms include: Apache Storm, Apache Spark, Apache Cassandra, Apache Hadoop, Apache Flink
- Problem: each of these includes its own cluster management, multiplying the
operational cost
- With Kafka in the middle, this means many producers and consumers to write and operate at scale
- Confluent's answer (in Apache Kafka since 0.10) => Kafka Streams
- Leverages the Kafka machinery instead of requiring all these separate integrations
- Single infrastructure solution
- At least for streaming-based processing
- Embeddable within existing applications
- Java library, just like KafkaConsumer and KafkaProducer (see the sketch after this list)
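A minimal sketch of that embeddable library in use, written against the current StreamsBuilder API (the 0.10-era release exposed a slightly different builder class). Application id, broker address, and topic names are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Sketch: a stream processor embedded in a plain Java application, coordinated
// by the Kafka cluster itself rather than a separate processing cluster.
// "orders", "orders-uppercased" and the broker address are placeholders.
public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("orders");
        // Per-record transformation; joins, windows and aggregations are also available.
        input.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The processing runs inside the host application's JVM; scaling out amounts to starting more instances with the same application id.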
Ecosystem
- Large-scale adopters: LinkedIn, Netflix, Twitter, Uber
- Commercial backer: Confluent (founded by Kafka's original authors)