# 7 The Kafka ecosystem and its future

## Use cases and challenges

- Primary use cases
  - Connecting disparate sources of data
    - Given the client API, it is possible to write data connectors and sinks for any data source
  - Data supply chain pipelines, replacing legacy ETL environments
  - More generally, "big data" integration (Hadoop, Spark)
- Remaining challenges
  - Governing data evolution
  - Intrinsic data (in)consistency
  - Data that is both big (the last 5 years, as seen from 2016) and fast (the next 5 years, as seen from 2016)

## Governance and evolution

- Each producer defines its message contract via the key.serializer and value.serializer properties (see the sketch after this list)
- The base serializers are not enough for most cases, hence custom serializers
- These contracts have versions, all of which cohabit in the same cluster
- Consumers need to be aware of those versions in order to deserialize the data
- A Kafka limitation is the lack of a registry of message formats and their versions
- Confluent's answer: the Kafka Schema Registry
  - An almost universal, self-describing format: Apache Avro
  - First-class Avro (de)serializers
  - Schema registration and version management in the cluster
  - RESTful interface for schema discovery
  - Compatibility checks between schema versions
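
A minimal sketch of what such a contract looks like in code, assuming Confluent's Avro serializer and a Schema Registry at a hypothetical `localhost:8081`; the topic, schema, and broker address are illustrative. On first use the serializer registers the schema with the registry and embeds a short schema id in each message instead of the full schema:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {
    // The versioned contract: an Avro schema describing the message value.
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // The key/value serializer properties define the producer's contract.
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Where the Avro serializer registers and looks up schema versions.
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Jane");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "user-1", user));
        }
    }
}
```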

Alternatives to Avro: Protobuf, Apache Thrift, MessagePack

## Consistency and productivity

- Lots of duplicated effort writing producers and consumers, most of which are largely the same
- Lack of a common framework for integrating the sources and targets, although they are not _that_ numerous for each category:
  - producers: file systems, NoSQL stores, RDBMSs, ...
  - consumers: search engines, HDFS, RDBMSs, ...
- Confluent, with Kafka 0.10 => Kafka Connect and the Connector Hub (see the configuration sketch after this list)
  - A common framework for integration
  - Makes writing consumers and producers easier and more consistent
  - Platform connectors:
    - Oracle, HP, ...
    - 50+ and growing
  - Connector Hub: open to anyone providing such an integration
  - Should make Kafka integration faster and cheaper
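
As a sketch of what the framework looks like in practice: Kafka ships with a simple FileStreamSource connector, driven entirely by a properties file like the one below (names and paths are placeholders). The connector tails a file and publishes each line to a topic, with no producer code written at all:

```properties
# Standalone Kafka Connect source configuration (illustrative values)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```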

## Fast data

- Needs: real-time processing, predictive analytics, machine learning
- Apache platforms include Storm, Spark, Cassandra, Hadoop, and Flink
- Problem: each of these brings its own cluster management, multiplying the operational cost
- With Kafka in the middle, that still means lots of producers and consumers to keep up at scale
- Confluent, with Kafka 0.10 => Kafka Streams (see the sketch after this list)
  - Leverages the Kafka machinery instead of requiring all those integrations to be written by hand
  - A single-infrastructure solution
    - At least for streaming-based processing
  - Embeddable within existing applications
    - A plain Java library, just like KafkaConsumer and KafkaProducer
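
A minimal sketch of such an embedded stream processor, using the newer StreamsBuilder API; the application id, broker address, and topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");
        // Transform records in flight; Kafka itself provides the scaling,
        // partitioning, and fault tolerance, so no extra cluster is needed.
        lines.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the embedded processor cleanly on JVM shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The application runs as an ordinary JVM process; scaling out is simply starting another instance with the same application id.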

## Ecosystem

- Large-scale adopters: LinkedIn, Netflix, Twitter, Uber
- Commercial publisher: Confluent