7 The Kafka ecosystem and its future
Use cases and challenges
- Primary use cases
- Connecting disparate sources of data
- The client API makes it possible to write data connectors and sinks for virtually any data source (see the producer sketch after this list)
- Data supply chain pipelines, replacing old ETL environments.
- More generally, "Big data" integration (Hadoop, Spark)
- Remaining challenges
- Governing data evolution
- Intrinsic data (in)consistency
- Handling both big data (the focus of the five years up to 2016) and fast data (the expected focus of the five years after)
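The point above about writing connectors against the client API can be made concrete with a minimal Java producer sketch. Everything specific below (broker address, topic name, keys and values) is a placeholder, not something from these notes.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: push data read from some external source into a Kafka topic.
// "localhost:9092" and "file-lines" are illustrative placeholders.
public class FileLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A real connector would loop over its source here
            // (file system, RDBMS, NoSQL store, ...).
            producer.send(new ProducerRecord<>("file-lines", "key-1", "first line"));
            producer.send(new ProducerRecord<>("file-lines", "key-2", "second line"));
        }
    }
}
```

A sink works the same way in reverse: a KafkaConsumer subscribes to the topic and writes each record into the target system.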
Governance and evolution
- Each producer defines its message contract via the (Key|Value)-serializer properties
- The built-in serializers are not enough for most cases, hence the need for custom serializers
- These contracts have versions, all of which coexist in the cluster
- Consumers need to know which contract and version they are reading in order to deserialize the data
- A Kafka limitation: there is no built-in registry of message formats and their versions
- Confluent: Kafka Schema Registry
- Almost universal format: Apache Avro. Self-describing.
- First-class Avro (de)serializers (see the configuration sketch after this list)
- Schema registry and version management in the cluster
- RESTful interface for registering and retrieving schemas
- Enforces compatibility rules between schema versions
- Alternatives: Protobuf, Apache Thrift, MessagePack
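A hedged configuration sketch of the Avro plus Schema Registry combination listed above. The KafkaAvroSerializer class and the schema.registry.url property come from Confluent's Schema Registry client libraries rather than Apache Kafka itself, and the URLs, topic, and record schema are placeholders.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: produce Avro records whose schema is registered and versioned
// by the Schema Registry. Broker and registry URLs are placeholders.
public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer (from the kafka-avro-serializer artifact).
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Where the serializer registers and looks up schema versions.
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "user-1", user));
        }
    }
}
```

On send, the serializer registers the schema with the registry if it is new and embeds only a small schema id in each message; consumers use that id to fetch the right schema version for deserialization.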
Consistency and productivity
- Lots of duplicated effort writing producers and consumers, all of them mostly the same
- Lack of a common framework for integrating the sources and targets, although
they are not that numerous for each category:
- producers: file systems, NoSQL, RDBMS, ...
- consumers: Search Engines, HDFS, RDBMS, ...
- Confluent's answer (in Apache Kafka since 0.9) => Kafka Connect and the Connector Hub
- Common framework for integration
- Makes writing producers and consumers easier and more consistent (see the task skeleton after this list)
- Platform connectors:
- Oracle, HP, ...
- 50+ and growing
- Connector Hub: a catalog open to anyone who publishes such an integration
- Should make Kafka integration faster and cheaper
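To make the common-framework point concrete, here is a rough skeleton of a source task written against the Kafka Connect Java API. The class name, topic, and payload are invented for illustration, and the paired SourceConnector class (which declares the task class and its configuration) is omitted for brevity.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Skeleton of a Connect source task: the framework handles offsets, scaling,
// and delivery; the task only knows how to read from its source system.
// Names ("DemoSourceTask", "demo-topic") are illustrative, not from these notes.
public class DemoSourceTask extends SourceTask {

    private String topic;

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        // Configuration arrives from the connector's properties or REST call.
        topic = props.getOrDefault("topic", "demo-topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real task would read from the external system here and
        // track its position in the source offset map.
        Map<String, ?> sourcePartition = Collections.singletonMap("source", "demo");
        Map<String, ?> sourceOffset = Collections.singletonMap("position", 0L);
        SourceRecord record = new SourceRecord(
            sourcePartition, sourceOffset, topic,
            Schema.STRING_SCHEMA, "hello from the demo source");
        return Collections.singletonList(record);
    }

    @Override
    public void stop() {
        // Release any resources held against the external system.
    }
}
```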
Fast data
- Needs: real-time processing, predictive analytics, machine learning
- Apache Platforms include: Apache Storm, Apache Spark, Apache Cassandra, Apache Hadoop, Apache Flink
- Problem: each of these includes its own cluster management, multiplying the
operational cost
- With Kafka in the middle, this means many producers and consumers to write and operate at scale
- Confluent's answer (in Apache Kafka since 0.10) => Kafka Streams
- Leverages the Kafka machinery instead of requiring all these separate integrations
- Single infrastructure solution
- At least for streaming-based processing
- Embeddable within existing applications
- Java library, just like KafkaConsumer and KafkaProducer (see the sketch after this list)
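A minimal sketch of that embeddable library in use, written against the current StreamsBuilder API (the 0.10-era release exposed a slightly different builder class). Application id, broker address, and topic names are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Sketch: a stream processor embedded in a plain Java application, coordinated
// by the Kafka cluster itself rather than a separate processing cluster.
// "orders", "orders-uppercased" and the broker address are placeholders.
public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("orders");
        // Per-record transformation; joins, windows and aggregations are also available.
        input.mapValues(value -> value.toUpperCase()).to("orders-uppercased");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The processing runs inside the host application's JVM; scaling out amounts to starting more instances with the same application id.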
Ecosystem
- Large-scale adopters: LinkedIn, Netflix, Twitter, Uber
- Commercial backer: Confluent (founded by Kafka's original authors)