# 7 The Kafka ecosystem and its future

## Use cases and challenges

- Primary use cases
  - Connecting disparate sources of data
    - Given the client API, it is possible to write data connectors and sinks for any data source
  - Data supply chain pipelines, replacing old ETL environments
  - More generally, "Big Data" integration (Hadoop, Spark)
- Remaining challenges
  - Governing data evolution
  - Intrinsic data (in)consistency
  - Big data (the focus of the last 5 years, as of 2016) and fast data (the focus of the next 5 years, as of 2016)

## Governance and evolution

- Each producer defines its message contract via the `key.serializer` and `value.serializer` properties
  - The base serializers are not enough for most cases, hence custom serializers (see the serializer sketch at the end of this chapter)
- These contracts have versions, all cohabiting
  - Consumers need to be aware of them in order to deserialize the data
- A Kafka limitation is the lack of a registry of message formats and their versions
- Confluent's answer: the Kafka Schema Registry (see the Avro producer sketch at the end of this chapter)
  - Almost universal format: Apache Avro, which is self-describing
  - First-class Avro (de)serializers
  - Schema registration and version management in the cluster
  - RESTful interface for schema lookup and discovery
  - Compatibility checking across schema versions
  - Alternatives: Protobuf, Apache Thrift, MessagePack

## Consistency and productivity

- Lots of duplicated effort goes into writing producers and consumers, most of them very similar
- There is no common framework for integrating the sources and targets, although they are not _that_ numerous for each category:
  - producers: file systems, NoSQL stores, RDBMSs, ...
  - consumers: search engines, HDFS, RDBMSs, ...
- Confluent, after Kafka 0.10 => Kafka Connect and the Connector Hub (see the connector configuration sketch at the end of this chapter)
  - A common framework for integration
  - Makes writing consumers and producers easier and more consistent
  - Platform connectors:
    - Oracle, HP, ...
    - 50+ and growing
  - Connector Hub: open to anyone providing such an integration
  - Should make Kafka integration faster and cheaper

## Fast data

- Needs: real-time processing, predictive analytics, machine learning
- Apache platforms include: Apache Storm, Apache Spark, Apache Cassandra, Apache Hadoop, Apache Flink
- Problem: each of these brings its own cluster management, multiplying the operational cost
- With Kafka in the middle, that also means lots of producers and consumers to maintain at scale
- Confluent, after Kafka 0.10 => Kafka Streams (see the Kafka Streams sketch at the end of this chapter)
  - Leverages the Kafka machinery instead of requiring all these integrations to be written by hand
  - A single-infrastructure solution
    - At least for streaming-based processing
  - Embeddable within existing applications
    - A Java library, just like KafkaConsumer and KafkaProducer

## Ecosystem

- Scale-ups: LinkedIn, Netflix, Twitter, Uber
- Publisher: Confluent
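## Code sketches

The sketches below illustrate the ideas above; class names, topic names, file paths, and URLs are illustrative placeholders, not part of the course material.

First, a minimal custom value serializer, assuming a hypothetical `PageView` domain class. The point is that the wire "contract" (field order, separator) lives only in the code, and nothing in Kafka itself records or versions it; that is exactly the gap a schema registry fills. With recent `kafka-clients` versions, only `serialize()` needs to be overridden.

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.serialization.Serializer;

// Hypothetical domain class, used only for this sketch.
class PageView {
    final String userId;
    final String url;

    PageView(String userId, String url) {
        this.userId = userId;
        this.url = url;
    }
}

// Deliberately naive custom serializer: contract v1 is "userId|url".
// Nothing in Kafka records or versions this contract for consumers.
public class PageViewSerializer implements Serializer<PageView> {
    @Override
    public byte[] serialize(String topic, PageView view) {
        if (view == null) {
            return null;
        }
        return (view.userId + "|" + view.url).getBytes(StandardCharsets.UTF_8);
    }
}
```

A producer would reference it with `props.put("value.serializer", PageViewSerializer.class.getName())`, and every consumer would need a matching deserializer that knows about the same contract version.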
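Next, a sketch of how the Schema Registry changes that picture, assuming Confluent's `kafka-avro-serializer` dependency is on the classpath; the topic name, registry URL, and schema are placeholders. The Avro serializer registers the record's schema with the registry and embeds only a short schema id in each message, so consumers can always fetch the right schema version.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    public static void main(String[] args) {
        // Contract v1 for a hypothetical "page_views" topic, expressed as an Avro schema.
        String schemaJson = "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
                + "{\"name\":\"userId\",\"type\":\"string\"},"
                + "{\"name\":\"url\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer talks to the Schema Registry and puts
        // only a schema id, not the full schema, into each message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord view = new GenericData.Record(schema);
            view.put("userId", "u-42");
            view.put("url", "/index.html");
            producer.send(new ProducerRecord<>("page_views", "u-42", view));
        }
    }
}
```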
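For Kafka Connect, the point is that instead of writing yet another producer, the same job is expressed as configuration. Below is a sketch of a standalone file-source connector using the `FileStreamSource` connector that ships with Kafka; the file path and topic name are placeholders.

```properties
# file-source.properties (hypothetical): tail a log file into a Kafka topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/var/log/app.log
topic=app-log-lines
```

It would be launched with `bin/connect-standalone.sh config/connect-standalone.properties file-source.properties`; a sink connector towards HDFS or a search engine is configured the same way rather than coded as a bespoke consumer.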
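Finally, a minimal Kafka Streams sketch of an embeddable stream processor: a plain Java program that reads one topic, transforms each value, and writes to another, with Kafka itself providing the partitioning, fault tolerance, and scaling (by running more instances). Topic names and the application id are placeholders, and the example uses the current `StreamsBuilder` DSL rather than the original 0.10 API.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStreamApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // also the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from an input topic, transform each value, write to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("text-input");
        source.mapValues(value -> value.toUpperCase())
              .to("text-output");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the streams instance cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

No separate processing cluster is needed: the program above can be embedded in an existing application and started like any other Java process, just as the notes describe for KafkaConsumer and KafkaProducer.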