Pseudo-Lab · ehddnr301 · Aug 25, 2024 · Aug 1, 2024 · Aug 4, 2024 · Aug 4, 2024
diff --git a/book/_toc.yml b/book/_toc.yml
@@ -77,6 +77,16 @@ chapters:
   - file: docs/8_cloud_computing_and_data_engineering/8.4_use_case_cloud_dataengineering.md
   - file: docs/8_cloud_computing_and_data_engineering/8.5_multi_cloud_dataengineering.md
   - file: docs/8_cloud_computing_and_data_engineering/8.6_reference.md
+- file: docs/9_kafka_zero_to_hero/main_page.md
+  sections:
+  - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro.md
+    sections:
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_1_architecture.md
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_2_topic_and_partition.md
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_3_partition_and_producer_consumer.md
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_4_kafka_performance.md
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_5_kafka_disaster_recovery.md
+    - file: docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_6_reference.md
 - file: docs/p_movieFlix/main_page.md
   sections:
   - file: docs/p_movieFlix/1_architecture.md

diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro.md
@@ -0,0 +1,29 @@
+# 9.1. Kafka Intro
+
+- date : 2023-07-14
+- author
+  * [이동욱](https://github.com/ehddnr301)
+
+- keyword
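+  * Kafka basic architecture
+  * Topics and partitions
+  * Partitions, offsets, and message ordering
+  * Multiple partitions and producers
+  * Multiple partitions and consumers
+  * Kafka performance
+  * Failure handling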
+
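+## Afterword
+
+- This is a write-up of the week 1 Kafka Intro study session. Topics that came up during the session, such as ZeroCopy and Replication, have been added to the relevant chapters as Additional sections.
+- If you spot mistakes or have questions, feel free to contribute through a comment or a PR. Thank you!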
+
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_1_architecture.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_1_architecture.md
@@ -0,0 +1,36 @@
+# Kafka Basic Architecture
+
+![alt text](./images/9_1_1.png)
+
+## What Is Kafka
+
+- Kafka consists of four components: Zookeeper, the Kafka Cluster, Producers, and Consumers.
+- Kafka is a distributed streaming platform used mainly to process high-volume, real-time data streams.
+
+## Zookeeper
+
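+- Zookeeper is needed to manage the Kafka cluster.
+- It takes on management duties such as maintaining broker metadata and electing partition leaders.
+- Alternatively, Kafka can run in **KRaft mode**, which removes the Zookeeper dependency.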
+
+## Kafka Cluster
+
+- A Kafka cluster is the store that holds messages.
+- A single Kafka cluster consists of multiple brokers (each one a server).
+
+## Producer
+
+- Puts messages into the Kafka cluster
+
+## Consumer
+
+- Reads messages from the Kafka cluster
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_2_topic_and_partition.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_2_topic_and_partition.md
@@ -0,0 +1,40 @@
+# Topics and Partitions
+
+![alt text](./images/9_1_2_0.png)
+
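+## Topic
+
+- A Topic is the unit used to categorize messages, and you can create differently named topics for different purposes.
+  - It is comparable to a folder in a file system.
+  - Naming a topic so that it states what data it holds makes maintenance easier.
+  - Topic1 (X) -> purchase_log, refund_log, ... (O)
+- Consumers and Producers exchange messages on a per-Topic basis.
+  - Producer: store this message in a given **Topic**
+  - Consumer: read messages from a given **Topic**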
+
+## Partition
+
+![alt text](./images/9_1_2_1.png)
+
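+- A partition is where messages are physically stored on disk.
+  - Each partition is made up of multiple Segments (files), each covering a specific offset range.
+- A topic consists of one or more partitions.
+  - A partition is a physical subdivision of a topic.
+  - Partition numbers start at 0.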
+
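+### Partitions, Offsets, and Message Ordering
+
+- A partition is essentially an append-only log.
+  - The position at which each message is stored is called its Offset.
+  - Messages written by a producer are appended to the end of the Partition.
+  - A Consumer reads messages in order, by Offset.
+  - Reading a message does not delete it. (Messages can still be removed once the time set by retention.ms has passed.)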
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
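
A minimal sketch of the append-only behaviour described in 9_1_2_topic_and_partition.md above: appends go to the end, each message gets the next offset, and reads walk forward by offset without removing anything. `PartitionLog`, its method names, and the sample messages are hypothetical; this is an in-memory toy, not Kafka's on-disk segment format.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy in-memory stand-in for one partition (hypothetical, not Kafka's storage format)."""

    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        """Add a message at the end and return the offset it was stored at."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset: int, max_messages: int = 10) -> list:
        """Return (offset, message) pairs in order, starting at `offset`; nothing is removed."""
        chunk = self.messages[offset:offset + max_messages]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
offsets = [log.append(m) for m in ("purchase-1", "purchase-2", "refund-1")]
first_read = log.read_from(0, max_messages=2)  # a consumer starting at offset 0
second_read = log.read_from(2)                 # the same consumer resuming at offset 2
```

Because reads never mutate the log, a second reader could start again from offset 0 and see the same messages, which is the property that lets several consumer groups read one partition independently.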
diff --git a/...s/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_3_partition_and_producer_consumer.md b/...s/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_3_partition_and_producer_consumer.md
@@ -0,0 +1,32 @@
+# Partitions and Producers & Consumers
+
+## Partitions and Producers
+
+![alt text](./images/9_1_3_0.png)
+
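+- A producer has to decide which Partition a message is stored in.
+  1. If no message key is provided, messages can be spread across partitions in RoundRobin fashion.
+  2. A Key can be used to select a specific Partition.
+    - When a Key is specified, Kafka uses the hash of the key to store the message in a specific partition.
+    - Messages with the same key always go to the same Partition (as long as the number of partitions does not change), so their order is preserved.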
+
+## Partitions and Consumers
+
+![alt text](./images/9_1_3_1.png)
+
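+- Every consumer belongs to a **consumer group**.
+- Within a consumer group, a partition can be connected to only one consumer.
+  - Consumers in the same group cannot share a partition.
+  - As a result, each consumer group processes a partition's messages in order.
+
+- In the figure's example, the consumers in Consumer Group A cannot share Partition0 or Partition1.
+- A different group, Consumer Group B, can still read Partition0 and Partition1 while Consumer Group A is reading them.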
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
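
A rough sketch of the two rules in 9_1_3_partition_and_producer_consumer.md above: key-based partition selection on the producer side, and one-consumer-per-partition assignment inside a consumer group. The function names and consumer ids are hypothetical, and `zlib.crc32` is only a deterministic stand-in (the Java client's default partitioner uses murmur2). The simple modulo assignment is also an assumption for illustration, not one of Kafka's actual assignor strategies.

```python
import zlib
from itertools import count

_round_robin = count()  # shared counter for keyless messages


def choose_partition(key, num_partitions):
    """Pick a partition: hash the key if present, otherwise rotate round-robin."""
    if key is None:
        return next(_round_robin) % num_partitions
    # crc32 is a stand-in hash chosen for determinism; it is NOT the hash Kafka clients use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


same_key = {choose_partition("user-42", 3) for _ in range(5)}  # one partition only
keyless = [choose_partition(None, 3) for _ in range(4)]        # rotates over 0, 1, 2
group_a = assign_partitions([0, 1], ["a-1", "a-2"])            # partitions are split
group_b = assign_partitions([0, 1], ["b-1"])                   # another group reads both
```

Each group gets its own assignment, so Group B reading both partitions does not interfere with how Group A splits them.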
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_4_kafka_performance.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_4_kafka_performance.md
@@ -0,0 +1,48 @@
+# Kafka Performance
+
+![alt text](./images/9_1_4.png)
+
+## Why Kafka Performs Well
+
+**Partition files use the OS page cache**
+- File I/O on a Partition is served from memory.
+- Performance is best when Kafka is the only user of the Page Cache on the server.
+
+**Zero Copy**
+- Data is copied directly from the disk buffer to the network buffer.
+
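+**The Broker's work for tracking Consumers is relatively simple**
+- The Broker does not do things like message filtering or message redelivery. (Producers and Consumers handle these themselves.)
+- The Broker manages the mapping between Consumers and Partitions.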
+
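+**Sending and receiving in batches (Batch)**
+- Producer: can accumulate messages up to a certain size and send them together.
+- Consumer: can wait until at least a minimum amount of data has accumulated before fetching it.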
+
+**Throughput is easy to scale out**
+- Capacity limit of a single machine -> add Brokers, add Partitions
+- Consumers are slow -> add consumers (+ add Partitions)
+
+## Additional - Zero Copy
+
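+**Conventional copy**
+1. Read the data and copy it into the Read buffer in the kernel's address space
+2. Copy the data from the kernel Read buffer to the Application buffer
+3. Copy the data from the Application buffer to the kernel Socket buffer
+4. Copy the data from the Socket buffer to the NIC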
+
+**Zero-Copy**
+1. Read the data and copy it into the Read buffer in the kernel's address space
+2. Copy the Read buffer data to the Socket buffer
+3. Copy to the NIC buffer
+
+- In short: fewer copies and fewer context switches cut out unnecessary data copying and save CPU.
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
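
To make the Zero-Copy steps in 9_1_4_kafka_performance.md above concrete, here is a small sketch using Python's `socket.sendfile()`, which hands the file-to-socket transfer to the kernel where `os.sendfile()` is available and otherwise falls back to ordinary `send()` calls. It illustrates the idea only: Kafka brokers run on the JVM, the temporary file and socket pair are stand-ins for a segment file and a consumer connection, and whether a true zero-copy path is taken depends on the OS.

```python
import os
import socket
import tempfile

payload = b"kafka-segment-bytes" * 100

# Write a small file that stands in for a partition segment (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as segment:
    segment.write(payload)
    segment_path = segment.name

# A connected socket pair stands in for the broker-to-consumer connection.
broker_side, consumer_side = socket.socketpair()
try:
    with open(segment_path, "rb") as f:
        # socket.sendfile() uses os.sendfile() where available, so the bytes
        # need not pass through a user-space buffer; elsewhere it falls back to send().
        sent = broker_side.sendfile(f)
    broker_side.close()  # signals end-of-stream to the reader

    received = bytearray()
    while chunk := consumer_side.recv(4096):
        received.extend(chunk)
finally:
    broker_side.close()
    consumer_side.close()
    os.unlink(segment_path)
```

The sending side never calls `f.read()`, which corresponds to skipping steps 2 and 3 of the conventional copy path.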
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_5_kafka_disaster_recovery.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_5_kafka_disaster_recovery.md
@@ -0,0 +1,31 @@
+# Kafka Failure Handling
+
+## Failure Handling with Replicas
+
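+- A replica is a copy of a partition.
+  - As many copies of a partition as the Replication Factor are created, each on a different Broker.
+  - With a Replication Factor of 2, two partitions holding the same data are created on two different Brokers. One becomes the Leader and the other the Follower.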
+
+- Leader and Follower roles
+  - Producers and Consumers handle messages only through the Leader.
+  - Followers replicate messages from the Leader.
+
+- Failure handling
+  - If the broker hosting the Leader fails, one of the Followers becomes the Leader.
+
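+## Additional - Why the Replication Factor Is Usually 3
+
+- The **quorum** used in consensus algorithms
+  - With 5 nodes, at least 3 nodes must agree to reach quorum, so up to 2 failures are tolerated.
+  - With 4 nodes, at least 3 nodes must agree to reach quorum, so up to 1 failure is tolerated.
+  - With 3 nodes, at least 2 nodes must agree to reach quorum, so up to 1 failure is tolerated.
+- A cluster with an even number of nodes gains little fault tolerance over the next-smaller odd-sized cluster, so clusters are usually built with an odd number of nodes. My understanding is that 3 was chosen as the smallest such size that still tolerates a failure.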
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
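
The quorum figures listed in 9_1_5_kafka_disaster_recovery.md above can be checked with a few lines of arithmetic. This is a sketch of majority-quorum math as used by consensus protocols; the function names are hypothetical, and it only verifies the numbers rather than modelling how Kafka replicates partition data.

```python
def quorum_size(nodes: int) -> int:
    """Smallest group that is a strict majority of `nodes`."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can fail while a majority can still agree."""
    return nodes - quorum_size(nodes)


# node count -> (quorum size, tolerated failures)
table = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why even sizes bring little benefit.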
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_6_reference.md b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/9_1_6_reference.md
@@ -0,0 +1,16 @@
+# `🗂️ Reference`
+
+- [최범균님 Kafka 조금 아는 척하기1](https://www.youtube.com/watch?v=0Ssx7jJJADI)
+- [devocean Apache Kafka의 새로운 협의 프로토콜인 KRaft에 대해(1)](https://devocean.sk.com/blog/techBoardDetail.do?ID=165711&boardType=techBlog)
+- [Apache Kafka Guide #19 Segment and Indexes](https://medium.com/apache-kafka-from-zero-to-hero/apache-kafka-guide-19-segment-and-indexes-7a428f089695)
+- [Kafka의 Zero copy](https://h-devnote.tistory.com/19)
+- [뗏목 타고 합의 알고리즘 이해하기: The Raft Consensus Algorithm](https://www.youtube.com/watch?v=aywjlaKxQp4)
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_1.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_1.png
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_2_0.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_2_0.png
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_2_1.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_2_1.png
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_3_0.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_3_0.png
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_3_1.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_3_1.png
diff --git a/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_4.png b/book/docs/9_kafka_zero_to_hero/9_1_kafka_intro/images/9_1_4.png
diff --git a/book/docs/9_kafka_zero_to_hero/main_page.md b/book/docs/9_kafka_zero_to_hero/main_page.md
@@ -0,0 +1,30 @@
+# 9. Kafka Zero-To-Hero
+
+- This document summarizes the [Kafka - Zero to Hero 실시간 성장하기](https://www.notion.so/chanrankim/Kafka-Zero-to-Hero-8637c3be78f145649f44fa990aeb9892) study group.
+
+- Study participants
+
+  - [김승규]()
+  - [김승태]()
+  - [김예진](https://github.com/Yejining)
+  - [신진수]()
+  - [이동욱](https://github.com/ehddnr301)
+  - [이영전]()
+  - [이호민]()
+  - [이힘찬](https://github.com/ssilb4)
+  - [장승호]()
+  - [홍지영]()
+
+- Study goals
+  1. Understand Kafka theory well enough to put it to use
+  2. Monitor Kafka's state
+  3. Troubleshoot Kafka errors that come up during real-time data processing
+
+<script src="https://utteranc.es/client.js"
+        repo="Pseudo-Lab/data-engineering-for-everybody"
+        issue-term="pathname"
+        label="comments"
+        theme="preferred-color-scheme"
+        crossorigin="anonymous"
+        async>
+</script>