
7 The Kafka ecosystem and its future

Use cases and challenges

  • Primary use cases
    • Connecting disparate sources of data
    • Given the client API, it is possible to write connectors and sinks for any data source (see the producer sketch after this list)
    • Data supply chain pipelines, replacing old ETL environments
    • More generally, "Big data" integration (Hadoop, Spark)
  • Remaining challenges
    • Govern data evolution
    • Intrinsic data (in)consistency
    • Big data (the focus of the five years before 2016) and fast data (the focus of the five years after)
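
The plain client API mentioned above is enough to push data from any external source into a topic. A minimal producer sketch, assuming a local broker; the topic name and payload are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SourceIngestSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // A real connector would loop over the external source
            // (file system, RDBMS, NoSQL store, ...) instead of sending a literal.
            producer.send(new ProducerRecord<>("ingest-topic", "row-1", "some payload"));
        }
    }
}
```

A matching consumer for the sink side is symmetric, which is exactly the duplicated effort the later sections address.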

Governance and evolution

  • Each producer defines its message contract via the key.serializer and value.serializer properties
  • The built-in serializers are not enough for most cases, hence the need for custom serializers
  • These contracts have versions, all coexisting
  • Consumers need to be aware of these versions in order to deserialize the data
  • K limitation: no built-in registry of message formats and their versions
  • Confluent: Kafka Schema Registry
    • Almost universal format: Apache Avro. Self-describing.
    • First-class Avro (de)serializers (see the producer sketch below)
    • Schema registry and version management in the cluster
    • RESTful interface for registering and retrieving schemas
    • Compatibility checks between schema versions

Alternatives: Protobuf, Apache Thrift, MessagePack
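
A minimal sketch of such a versioned contract in code, assuming Confluent's Avro serializer (io.confluent.kafka.serializers.KafkaAvroSerializer) and a Schema Registry at a placeholder address; the topic, schema, and record values are illustrative only:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroRegistrySketch {
    // Illustrative message contract: a small Avro record schema.
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                // placeholder broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer"); // Confluent Avro serializer
        props.put("schema.registry.url", "http://localhost:8081");       // placeholder registry URL

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // The serializer registers (or looks up) the schema version in the registry
        // and embeds its id in each message, so consumers can deserialize it.
        try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "alice", user));
        }
    }
}
```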

Consistency and productivity

  • Lots of duplicated effort goes into writing producers and consumers that are all mostly the same
  • No common framework for integrating sources and targets, even though they are not that numerous in each category:
    • producers: file systems, NoSQL, RDBMS, ...
    • consumers: Search Engines, HDFS, RDBMS, ...
  • Confluent after Kafka 0.10 => Kafka Connect and Connector Hub
    • Common framework for integration
    • Makes writing consumers and producers easier and more consistent (see the sketch after this list)
    • Platform connectors:
      • Oracle, HP, ...
      • 50+ and growing
    • Connector Hub: a catalogue open to anyone providing such an integration
    • Should make K integration faster and cheaper
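
As a sketch of how the framework is driven in practice: connectors are submitted as configuration to a Connect worker over its REST interface. The example below assumes the FileStreamSource connector that ships with Kafka, the default REST port 8083, and placeholder file and topic names:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectRegistrationSketch {
    public static void main(String[] args) throws Exception {
        // Connector definition: read lines from a file and publish them to a topic.
        String connectorJson =
              "{ \"name\": \"local-file-source\","
            + "  \"config\": {"
            + "    \"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
            + "    \"tasks.max\": \"1\","
            + "    \"file\": \"/tmp/input.txt\","
            + "    \"topic\": \"file-lines\" } }";

        // Submit the connector to the Connect worker's REST endpoint.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors")) // default Connect REST port
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

No producer or consumer code is written at all; the worker runs the connector and handles offsets, scaling, and failures.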

Fast data

  • Needs: real-time processing, predictive analytics, machine learning
  • Apache Platforms include: Apache Storm, Apache Spark, Apache Cassandra, Apache Hadoop, Apache Flink
  • Problem: each of these includes its own cluster management, multiplying the operational cost
  • With K in the middle, this means lots of producers and consumers to maintain at scale
  • Confluent after Kafka 0.10 => Kafka Streams
    • Leverages K machinery instead of writing all these integrations
    • Single infrastructure solution
      • At least for streaming-based processing
    • Embeddable within existing applications
    • A Java library, just like KafkaConsumer and KafkaProducer (see the sketch below)
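
A minimal Streams sketch of that idea: read a topic, transform each record, write to another topic, all inside the application's own JVM. The application id, broker address, and topic names are placeholders, and the snippet uses the current StreamsBuilder API rather than the 0.10-era one:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-sketch");  // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Topology: read a topic, transform each value, write to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the topology runs inside the embedding application, no separate processing cluster has to be operated.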

Ecosystem

  • Large-scale adopters: LinkedIn, Netflix, Twitter, Uber
  • Publisher: Confluent