
7: Ecosystem.

Frederic G. MARAND, 3 years ago
commit 04ce809da3
2 changed files with 66 additions and 1 deletion
  1. docs/07 Ecosystem.md  (+65 -0)
  2. pom.xml  (+1 -1)

+ 65 - 0
docs/07 Ecosystem.md

@@ -0,0 +1,65 @@
+# 7 The Kafka ecosystem and its future
+
+## Use cases and challenges
+
+- Primary use cases
+    - Connecting disparate sources of data
+    - Given the client API, it is possible to write data connectors and sinks for any data source
+    - Data supply chain pipelines, replacing old ETL environments.
+    - More generally, "Big data" integration (Hadoop, Spark)
+- Remaining challenges
+    - Governing data evolution
+    - Intrinsic data (in)consistency
+    - Handling both big data (the focus of the previous 5 years, as of 2016) and fast data (the focus of the next 5 years)
+
+## Governance and evolution
+
+- Each producer defines its message contract via the key.serializer / value.serializer properties
+  (a producer configuration sketch follows this section)
+- The built-in serializers are not enough for most cases, hence the need for custom serializers
+- These contracts have versions, which all coexist
+- Consumers need to be aware of these versions in order to deserialize the data
+- One Kafka limitation is the lack of a registry of message formats and their versions
+- Confluent's answer: the Kafka Schema Registry
+    - Almost universal format: Apache Avro, which is self-describing
+    - First-class Avro (de)serializers
+    - Schema registration and version management in the cluster
+    - RESTful service for schema discovery
+    - Acts as a compatibility broker between schema versions
+
+Alternative serialization formats: Protobuf, Apache Thrift, MessagePack
+
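+A minimal sketch of such a producer contract, assuming Confluent's KafkaAvroSerializer
+is on the classpath and a Schema Registry runs at http://localhost:8081 (the addresses,
+topic name, and schema below are illustrative assumptions, not part of these notes):
+
+```java
+import java.util.Properties;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.kafka.clients.producer.KafkaProducer;
+import org.apache.kafka.clients.producer.ProducerRecord;
+
+public class AvroProducerSketch {
+    public static void main(String[] args) {
+        Properties props = new Properties();
+        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
+        // The key/value serializer properties define the message contract.
+        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
+        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
+        // The Avro serializer registers and looks up schemas in the Schema Registry.
+        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address
+
+        // A self-describing Avro schema; several versions of this contract can coexist.
+        Schema schema = new Schema.Parser().parse(
+            "{\"type\":\"record\",\"name\":\"PageView\","
+            + "\"fields\":[{\"name\":\"url\",\"type\":\"string\"}]}");
+        GenericRecord record = new GenericData.Record(schema);
+        record.put("url", "/index.html");
+
+        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
+            producer.send(new ProducerRecord<>("page-views", "user-42", record));
+        }
+    }
+}
+```
+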
+## Consistency and productivity
+
+- Lots of duplicated effort goes into writing producers and consumers that are all mostly the same
+- Lack of a common framework for integrating the sources and targets, although
+  they are not _that_ numerous for each category:
+    - producers: file systems, NoSQL stores, RDBMS, ...
+    - consumers: search engines, HDFS, RDBMS, ...
+- Confluent after Kafka 0.10 => Kafka Connect and Connector Hub
+    - Common framework for integration
+    - Makes writing consumers and producers easier and more consistent
+    - Platform connectors:
+        - Oracle, HP, ...
+        - 50+ and growing
+    - Connector Hub: a catalog open to anyone providing such an integration
+      (a sample connector configuration follows this list)
+    - Should make K integration faster and cheaper
+
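+As an illustration, a minimal standalone source-connector configuration using the
+FileStreamSource connector that ships with Kafka; the file path and topic name are
+made-up placeholders:
+
+```properties
+# Connector name and implementation class.
+name=local-file-source
+connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
+tasks.max=1
+# Hypothetical input file and destination topic.
+file=/tmp/access.log
+topic=connect-file-input
+```
+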
+## Fast data
+
+- Needs: real-time processing, predictive analytics, machine learning
+- Apache Platforms include: Apache Storm, Apache Spark, Apache Cassandra, Apache Hadoop, Apache Flink
+- Problem: each of these includes its own cluster management, multiplying the
+  operational cost
+- With Kafka in the middle, that still means lots of producers and consumers to build
+  and keep running at scale
+- Confluent after Kafka 0.10 => Kafka Streams
+    - Leverages K machinery instead of writing all these integrations
+    - Single infrastructure solution
+        - At least for streaming-based processing
+    - Embeddable within existing applications
+    - Java library, used just like KafkaConsumer and KafkaProducer
+      (see the minimal topology sketch below)
+
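+A minimal Kafka Streams sketch using the current StreamsBuilder API; the application
+id, broker address, and topic names are illustrative assumptions:
+
+```java
+import java.util.Properties;
+
+import org.apache.kafka.common.serialization.Serdes;
+import org.apache.kafka.streams.KafkaStreams;
+import org.apache.kafka.streams.StreamsBuilder;
+import org.apache.kafka.streams.StreamsConfig;
+import org.apache.kafka.streams.kstream.KStream;
+
+public class StreamsSketch {
+    public static void main(String[] args) {
+        Properties props = new Properties();
+        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-filter");   // hypothetical app id
+        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
+        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
+        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
+
+        // Topology: read a topic, keep non-empty values, write them to another topic.
+        StreamsBuilder builder = new StreamsBuilder();
+        KStream<String, String> views = builder.stream("page-views");
+        views.filter((key, value) -> value != null && !value.isEmpty())
+             .to("page-views-clean");
+
+        // Runs embedded in the application process: no separate processing cluster to operate.
+        KafkaStreams streams = new KafkaStreams(builder.build(), props);
+        streams.start();
+        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
+    }
+}
+```
+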
+## Ecosystem
+
+- Scale-ups: LinkedIn, Netflix, Twitter, Uber
+- Publisher: Confluent

+ 1 - 1
pom.xml

@@ -6,7 +6,7 @@
 
     <groupId>fr.osinet.ps.kafka</groupId>
     <artifactId>samples</artifactId>
-    <version>1.0-SNAPSHOT</version>
+    <version>0.1-SNAPSHOT</version>
 
     <properties>
         <maven.compiler.source>8</maven.compiler.source>