


 



Kafka

2020-06-23 18:47:54






Kafka overview


Outline / Content

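Design features

High throughput: up to millions of message reads and writes per second on a suitably sized cluster

Persistent storage, with replication to guard against data loss

Distributed: producers, brokers, and consumers are all implemented as distributed components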

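Disk system

Kafka's high throughput comes largely from how it uses the disk. The broker appends messages sequentially to log files and writes them to the operating system's page cache first; the OS then flushes the page cache to the disk files in the background. Index files are accessed through memory-mapped files.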

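Architecture

Figure: architecture diagram (image not available)

Kafka depends heavily on ZooKeeper (in ZooKeeper-based deployments; newer releases can instead run in KRaft mode without ZooKeeper)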

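broker

A broker is a Kafka server; every Kafka server in the cluster is a broker

Election

The controller (the broker that acts as the cluster's leader) is elected by creating the ephemeral node /controller in ZooKeeper and writing the current broker's information into it. Because ZooKeeper is strongly consistent, a given node can be created successfully by only one client, so the broker whose create call succeeds becomes the controller: first come, first served. The other brokers set a watch on /controller so that they can run a new election if the node disappears.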
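The first-come-first-served election can be illustrated with a small simulation. This is only a sketch: `FakeZooKeeper` and `Broker` are hypothetical in-memory stand-ins invented for this example, not a real ZooKeeper or Kafka API, and the session id is simply assumed to be the broker id.

```python
class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper (hypothetical, not a real client)."""

    def __init__(self):
        self.nodes = {}      # path -> (data, owning session)
        self.watchers = {}   # path -> callbacks fired when the node is deleted

    def create_ephemeral(self, path, data, session):
        # Only the first create for a given path succeeds.
        if path in self.nodes:
            return False
        self.nodes[path] = (data, session)
        return True

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def close_session(self, session):
        # Ephemeral nodes disappear when the session that owns them ends.
        gone = [p for p, (_, s) in self.nodes.items() if s == session]
        for path in gone:
            del self.nodes[path]
            for callback in self.watchers.pop(path, []):
                callback(path)


class Broker:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk
        self.is_controller = False

    def try_become_controller(self, _path=None):
        self.is_controller = self.zk.create_ephemeral(
            "/controller", {"brokerid": self.broker_id}, session=self.broker_id)
        if not self.is_controller:
            # Lost the race: watch the node and retry when it goes away.
            self.zk.watch("/controller", self.try_become_controller)


zk = FakeZooKeeper()
brokers = [Broker(i, zk) for i in range(3)]
for broker in brokers:
    broker.try_become_controller()
first = [b.broker_id for b in brokers if b.is_controller]

zk.close_session(0)                 # the controller's session ends
brokers[0].is_controller = False
second = [b.broker_id for b in brokers if b.is_controller]
```

In this run broker 0 creates the node first and becomes controller; when its session closes, the watchers fire in registration order and broker 1 wins the re-election.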

producer

Mainly responsible for serializing messages and sending them

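consumer

consumerGroup

A consumer group has a group name. Within one group, each partition can be consumed by only one consumer. Broadcast can be implemented by having several consumer groups subscribe to the same topic at the same time, so that every group receives every message.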
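The rule "one consumer per partition within a group, every group sees everything" can be sketched as follows. The round-robin strategy and the group and consumer names are assumptions made for this illustration; Kafka's real assignors are configurable and more involved.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer of a single group."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
# Two hypothetical groups subscribed to the same topic.
groups = {"billing": ["b1", "b2"], "audit": ["a1"]}
plan = {name: assign_round_robin(partitions, members)
        for name, members in groups.items()}
```

Inside `billing` the four partitions are split between two consumers, while `audit` independently receives all four, which is the broadcast effect described above.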


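topic

A topic is the basic unit that messages are written to in Kafka. It is a logical concept: the data is actually written to partitions, and one topic contains one or more partitions

Every message belongs to exactly one topic

Both sending and subscribing require a topic to be specified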

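partition

The building block of a topic. Partitions can be scaled out horizontally and are what makes Kafka's high throughput possible

When a message is persisted, it is routed to a partition according to a partitioning rule and appended to the end of that partition's log file

Within one partition, messages are written sequentially and stay ordered; ordering across different partitions is not guaranteed

The number of partitions should preferably be comparable to the number of servers

A partition is made up of multiple segment files that share the same size limit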
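A minimal sketch of key-based routing and per-partition append order. The `Topic` class is hypothetical, and `zlib.crc32` is used only as a stand-in hash (the Java client's default partitioner hashes keys with murmur2).

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration; not the hash Kafka itself uses.
    return zlib.crc32(key) % num_partitions


class Topic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.logs = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: str):
        partition = choose_partition(key, len(self.logs))
        self.logs[partition].append((key, value))
        # The offset is the message's index within its own partition.
        return partition, len(self.logs[partition]) - 1


topic = Topic(num_partitions=3)
p1, o1 = topic.append(b"user-1", "created")
p2, o2 = topic.append(b"user-1", "updated")
```

Because both messages carry the same key, they land in the same partition with consecutive offsets, so their relative order is preserved. Messages with different keys may land in different partitions, where no cross-partition order exists.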

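Replication

A partition can have a specified number of replicas, organized as one leader and several followers. By default, producers and consumers interact only with the leader, and followers replicate from the leader.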

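ISR

In-sync replicas: the replicas that have caught up with the leader

Kafka maintains an ISR in ZooKeeper for each partition of every topic, recording which replicas of that partition are in sync. If a partition's leader becomes unavailable, Kafka picks a replica from the ISR set as the new leader.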
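The failover rule can be sketched as a small function. This is a simplified illustration: the function name and the "first eligible replica in assignment order" tie-break are assumptions for the example, not a statement of the controller's exact code path.

```python
def elect_leader(replicas, isr, alive):
    """Return the first assigned replica that is both in the ISR and alive."""
    for replica in replicas:
        if replica in isr and replica in alive:
            return replica
    return None   # no eligible replica: the partition stays offline


replicas = [1, 2, 3]                      # broker ids hosting the partition
new_leader = elect_leader(replicas, isr={1, 3}, alive={2, 3})
no_leader = elect_leader(replicas, isr={1}, alive={2, 3})
```

Broker 2 is alive but not in sync, so it is skipped; broker 3 is chosen. When the only in-sync replica is down, no leader can be chosen from the ISR.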

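segment file

Each partition is like one huge file made up of many smaller segment files. The segment files share the same size limit, but the number of messages in each segment file is not necessarily equal.

Components

.index: index file

Contains index entries; each entry maps a message's offset to the position of that message in the data file. The index is sparse, so not every message has an entry.

.log: data file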
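Locating a message by offset takes two binary searches: one over the segments and one over the chosen segment's sparse index. The sketch below uses made-up base offsets and index entries, and stores absolute offsets in the index for simplicity (an assumption of this example).

```python
from bisect import bisect_right


def segment_file_name(base_offset: int) -> str:
    # Segment files are named after the offset of their first message.
    return f"{base_offset:020d}.log"


def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that contains target_offset."""
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the first segment")
    return base_offsets[i]


def find_scan_start(index, target_offset):
    """index: sorted (offset, byte_position) pairs from a sparse .index file.

    Returns the byte position of the closest indexed message at or before
    target_offset; the .log file is then scanned forward from there.
    """
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    return index[i][1] if i >= 0 else 0


base_offsets = [0, 100, 200]                          # hypothetical segments
index_100 = [(100, 0), (140, 4096), (185, 8192)]      # hypothetical entries

segment = find_segment(base_offsets, 150)
position = find_scan_start(index_100, 150)
name = segment_file_name(segment)
```

Offset 150 falls in the segment starting at 100, and the nearest index entry at or before it is offset 140, so the scan of the data file starts at byte 4096.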

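offset

Position in the log

Every message in a partition has a consecutive sequence number, the offset, which uniquely identifies the message within that partition. A consumer's offset records the sequence number of the next message to be delivered to that consumer. Semantically there are two kinds of offset: the Current Offset and the Committed Offset.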

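current offset

The Current Offset is kept in the consumer and is the sequence number of the next message the consumer expects to receive. It is used only by poll(). For example, if the consumer's first call to poll() returns 20 messages (offsets 0 to 19), the Current Offset is set to 20. On the next call to poll(), Kafka knows to start reading at offset 20, that is, at the 21st message. This ensures that successive polls never return the same message twice.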

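Committed Offset

The Committed Offset is stored on the broker and marks how far the consumer has confirmed processing. For example, after poll() returns 20 messages, the Current Offset is 20. If the consumer then runs its processing logic without calling consumer.commitAsync() or consumer.commitSync() (and auto-commit is disabled), the Committed Offset does not advance and is still 0.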

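Summary

The Current Offset applies to the consumer's poll loop and guarantees that each poll returns messages that were not returned before. The Committed Offset lets a new or restarted consumer start consuming a partition from the correct position, which avoids reprocessing messages that were already handled.

Storage model

Committed offsets are stored as a mapping (group id, topic, partition) -> offset. They are actually kept in the internal topic __consumer_offsets.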
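The difference between the two offsets can be made concrete with a toy model. `OffsetStore` and `SimConsumer` are hypothetical classes written for this sketch; the store is a plain dictionary standing in for the offsets topic, and a consumer with no committed offset is assumed to start at 0.

```python
class OffsetStore:
    """Stand-in for the offsets topic: (group, topic, partition) -> offset."""

    def __init__(self):
        self.committed = {}


class SimConsumer:
    def __init__(self, group, topic, partition, log, store):
        self.key = (group, topic, partition)
        self.log = log
        self.store = store
        # Current offset: resume from the committed offset, else start at 0.
        self.position = store.committed.get(self.key, 0)

    def poll(self, max_records):
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)      # advances on every poll
        return batch

    def commit_sync(self):
        self.store.committed[self.key] = self.position


log = [f"m{i}" for i in range(50)]
store = OffsetStore()

c1 = SimConsumer("g1", "orders", 0, log, store)
first = c1.poll(20)       # offsets 0..19, current offset becomes 20
c1.commit_sync()          # committed offset becomes 20
second = c1.poll(20)      # offsets 20..39, never committed; c1 then "crashes"

c2 = SimConsumer("g1", "orders", 0, log, store)   # replacement consumer
replayed = c2.poll(20)
```

The replacement consumer resumes at the committed offset 20, so the uncommitted batch is delivered again while the committed one is not.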

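Parameter configuration

acks

Controls message durability: how many broker acknowledgements the producer waits for before it treats a send as successful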

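auto.offset.reset

Specifies where the consumer starts consuming when Kafka has no stored offset for it (for example, because the offset information was deleted). It has three possible values:

earliest: start from the earliest offset

latest: start from the latest offset

none: throw an exception to the consumer

Scenarios

A consumer consumes and commits 5 messages and then crashes. After restarting, it reads a Committed Offset of 5 for the partition and therefore resumes directly from the 6th message. This relies entirely on the Committed Offset mechanism and has nothing to do with auto.offset.reset.

A new group is created and a consumer is added to it, subscribing to an existing topic. Kafka has no offset information for this consumer yet, so it uses auto.offset.reset to decide where the consumer starts consuming.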
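Both scenarios reduce to one decision, sketched below. The function and exception names are made up for this example, and `log_start` and `log_end` are assumed to be the partition's earliest available offset and the offset of the next message to be written.

```python
class NoCommittedOffset(Exception):
    """Raised when the policy is 'none' and no committed offset exists."""


def starting_offset(committed, policy, log_start, log_end):
    # A committed offset always wins; the policy is only a fallback.
    if committed is not None:
        return committed
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    if policy == "none":
        raise NoCommittedOffset("no committed offset for this group")
    raise ValueError(f"unknown auto.offset.reset value: {policy}")


restarted = starting_offset(5, "latest", log_start=0, log_end=100)
new_group_earliest = starting_offset(None, "earliest", log_start=0, log_end=100)
new_group_latest = starting_offset(None, "latest", log_start=0, log_end=100)
```

The restarted consumer resumes at offset 5 regardless of the policy, while the new group starts at 0 or 100 depending on it.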

Producer configuration

<table>
<tbody>
<tr><th>Name</th><th>Description</th><th>Default</th><th>Valid values</th><th>Importance</th></tr>
<tr><td>bootstrap.servers</td><td>Host list in the form host1:port1,host2:port2,.... These hosts are used only for the initial connection that discovers the full cluster (whose membership can change dynamically), so the list does not need to contain every server. To avoid depending on a single node, it is still best to list several hosts.</td><td></td><td></td><td>high</td></tr>
<tr><td>key.serializer</td><td>Serializer class for keys; must implement the org.apache.kafka.common.serialization.Serializer interface.</td><td></td><td></td><td>high</td></tr>
<tr><td>value.serializer</td><td>Serializer class for values; must implement the org.apache.kafka.common.serialization.Serializer interface.</td><td></td><td></td><td>high</td></tr>
<tr><td>acks</td><td>Message durability level; see the detailed description of acks.</td><td>1 (all since Kafka 3.0)</td><td>[all, -1, 0, 1]</td><td>high</td></tr>
<tr><td>buffer.memory</td><td>Memory the producer uses to buffer records waiting to be sent to the server. If records are produced faster than they can be delivered to the server, the producer blocks; if it blocks for longer than max.block.ms, it throws an exception.</td><td>32MB</td><td>[0,...]</td><td>high</td></tr>
<tr><td>retries</td><td>If set to a value greater than 0, the client resends any record whose send fails with a potentially transient error. This retry is no different from the client resending the record itself after receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 can reorder messages: if two batches are sent to the same partition and the first fails and is retried while the second succeeds, the records of the second batch appear before those of the first.</td><td>0 (2147483647 since Kafka 2.1)</td><td>[0,...,2147483647]</td><td>high</td></tr>
<tr><td>batch.size</td><td>The producer groups multiple records bound for the same partition into a batch and sends it once the batch is full or linger.ms has elapsed. A small batch.size reduces throughput (batch.size = 0 disables batching entirely); a very large batch.size can waste memory.</td><td>16KB</td><td>[0,...]</td><td>medium</td></tr>
<tr><td>linger.ms</td><td>The producer merges records that arrive between request transmissions into a single batched request. Normally this happens only when records arrive faster than they can be sent. With linger.ms, the producer waits up to the given delay so that further records arriving in the meantime can join the batch. Once a batch reaches batch.size, the producer ignores this setting and sends immediately; if fewer bytes have accumulated, it lingers for the specified time waiting for more records. The default is 0 (no delay). For example, linger.ms=5 reduces the number of requests sent and lowers some load, but adds up to 5 ms of latency.</td><td>0</td><td>[0,...]</td><td>medium</td></tr>
<tr><td>max.request.size</td><td>Maximum size of a request in bytes. This limits the number of record batches the producer sends in a single request, to avoid huge requests. It is effectively also a cap on the maximum record batch size. Note that the server has its own limit on batch size, which may differ from this one.</td><td>1MB</td><td>[0,...]</td><td>medium</td></tr>
</tbody>
</table>
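How batch.size and linger.ms interact can be sketched with a toy accumulator for a single partition. `BatchAccumulator` is a hypothetical class for this example; time is passed in explicitly as `now_ms`, and unlike the real producer a batch here may slightly overshoot batch.size before it is sent.

```python
class BatchAccumulator:
    """Toy model of batch.size / linger.ms for one partition."""

    def __init__(self, batch_size, linger_ms):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.batch = []
        self.batch_bytes = 0
        self.first_append_ms = None
        self.sent = []                   # (send_time_ms, records) per batch

    def append(self, record: bytes, now_ms: int):
        self.maybe_flush(now_ms)         # the linger time may already be over
        if not self.batch:
            self.first_append_ms = now_ms
        self.batch.append(record)
        self.batch_bytes += len(record)
        self.maybe_flush(now_ms)

    def maybe_flush(self, now_ms: int):
        if not self.batch:
            return
        full = self.batch_bytes >= self.batch_size
        lingered = now_ms - self.first_append_ms >= self.linger_ms
        if full or lingered:
            self.sent.append((now_ms, self.batch))
            self.batch, self.batch_bytes, self.first_append_ms = [], 0, None


acc = BatchAccumulator(batch_size=10, linger_ms=5)
acc.append(b"aaaa", now_ms=0)
acc.append(b"bbbb", now_ms=2)
acc.append(b"cccc", now_ms=3)    # 12 bytes >= batch.size: sent immediately
acc.append(b"dddd", now_ms=10)
acc.maybe_flush(now_ms=15)       # linger.ms elapsed: sent although not full

eager = BatchAccumulator(batch_size=10, linger_ms=0)
eager.append(b"x", now_ms=0)
eager.append(b"y", now_ms=1)     # linger.ms=0: every record goes out at once
```

The first batch leaves at t=3 because it is full, the second at t=15 because it lingered for 5 ms, and with linger.ms=0 each record is sent on its own.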

Consumer configuration

<table>
<tbody>
<tr><th>Name</th><th>Description</th><th>Default</th><th>Valid values</th><th>Importance</th></tr>
<tr><td>bootstrap.servers</td><td>Host list in the form host1:port1,host2:port2,.... These hosts are used only for the initial connection that discovers the full cluster (whose membership can change dynamically), so the list does not need to contain every server. To avoid depending on a single node, it is still best to list several hosts.</td><td></td><td></td><td>high</td></tr>
<tr><td>key.deserializer</td><td>Deserializer class for keys; must implement the org.apache.kafka.common.serialization.Deserializer interface.</td><td></td><td></td><td>high</td></tr>
<tr><td>value.deserializer</td><td>Deserializer class for values; must implement the org.apache.kafka.common.serialization.Deserializer interface.</td><td></td><td></td><td>high</td></tr>
<tr><td>fetch.min.bytes</td><td>Minimum amount of data the server should return for a fetch request. If not enough data is available, the request waits for that much data to accumulate before it is answered. The default of 1 byte means a fetch request is answered as soon as a single byte of data is available, or when the request times out waiting for data. Setting it higher than 1 makes the server wait for more data to accumulate, which can improve server throughput a little at the cost of some extra latency.</td><td>1</td><td>[0,...]</td><td>high</td></tr>
<tr><td>group.id</td><td>ID of the consumer group this consumer belongs to. Required if the consumer uses group management through subscribe(topic) or uses Kafka-based offset management.</td><td>""</td><td></td><td>high</td></tr>
<tr><td>heartbeat.interval.ms</td><td>Heartbeats keep the consumer's session alive and facilitate rebalancing when consumers join or leave the group. The value must be lower than session.timeout.ms and should typically be no higher than 1/3 of it. It can be set even lower to control the expected time for normal rebalances.</td><td>3000</td><td></td><td>high</td></tr>
<tr><td>max.partition.fetch.bytes</td><td>Maximum amount of data per partition that the server returns to the consumer for each fetch request.</td><td>1MB</td><td>[0,...]</td><td>high</td></tr>
<tr><td>session.timeout.ms</td><td>Consumer session timeout. The value must lie between group.min.session.timeout.ms and group.max.session.timeout.ms in the broker configuration.</td><td>10s (45s since Kafka 3.0)</td><td></td><td>high</td></tr>
<tr><td>enable.auto.commit</td><td>Whether offsets are committed automatically.</td><td>true</td><td></td><td>medium</td></tr>
<tr><td>auto.commit.interval.ms</td><td>How often offsets are auto-committed to Kafka.</td><td>5s</td><td></td><td>low</td></tr>
</tbody>
</table>
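The constraints between heartbeat.interval.ms, session.timeout.ms, and the broker's session limits can be checked mechanically. The helper below is a hypothetical sketch, not part of any Kafka client, and the broker limits passed in are illustrative values rather than documented defaults.

```python
def check_consumer_timeouts(cfg, broker_min_ms, broker_max_ms):
    """Return a list of problems found in the timeout settings (empty if none)."""
    problems = []
    heartbeat = cfg["heartbeat.interval.ms"]
    session = cfg["session.timeout.ms"]
    if heartbeat >= session:
        problems.append(
            "heartbeat.interval.ms must be lower than session.timeout.ms")
    elif heartbeat * 3 > session:
        problems.append(
            "heartbeat.interval.ms is usually at most 1/3 of session.timeout.ms")
    if not broker_min_ms <= session <= broker_max_ms:
        problems.append(
            "session.timeout.ms is outside the broker's allowed range")
    return problems


# Illustrative broker limits, chosen for this example only.
limits = {"broker_min_ms": 6000, "broker_max_ms": 300000}

ok = check_consumer_timeouts(
    {"heartbeat.interval.ms": 3000, "session.timeout.ms": 10000}, **limits)
too_chatty = check_consumer_timeouts(
    {"heartbeat.interval.ms": 5000, "session.timeout.ms": 10000}, **limits)
broken = check_consumer_timeouts(
    {"heartbeat.interval.ms": 3000, "session.timeout.ms": 3000}, **limits)
```

The first configuration matches the table's defaults and passes; the second violates the 1/3 guideline; the third breaks both the heartbeat rule and the broker range.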

