<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Codroc Blog</title>
<link href="https://codroc.github.io/atom.xml" rel="self"/>
<link href="https://codroc.github.io/"/>
<updated>2022-09-07T11:57:16.000Z</updated>
<id>https://codroc.github.io/</id>
<author>
<name>Codroc</name>
</author>
<generator uri="https://hexo.io/">Hexo</generator>
<entry>
<title>enable_if</title>
<link href="https://codroc.github.io/2022/09/07/enable_if/"/>
<id>https://codroc.github.io/2022/09/07/enable_if/</id>
<published>2022-09-07T11:57:16.000Z</published>
<updated>2022-09-07T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="C-enable-if的使用(转载)"><a href="#C-enable-if的使用(转载)" class="headerlink" title="Using C++ enable_if (repost)"></a>Using C++ enable_if (repost)</h1><p>C++'s enable_if is typically used when different templates must be instantiated depending on a condition on the types involved. This article covers when and how to use enable_if.</p><h2 id="The-limits-of-function-overloading"><a href="#The-limits-of-function-overloading" class="headerlink" title="The limits of function overloading"></a>The limits of function overloading</h2><p>Function overloading lets functions that share a name behave differently depending on the types of the arguments passed in. A simple example:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> <span class="built_in">std</span>;</span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(<span class="keyword">int</span> a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"in int f()"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(<span class="keyword">double</span> a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"in double f()"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>{</span><br><span class="line"> f(<span class="number">1</span>);</span><br><span class="line"> f(<span class="number">1.0</span>);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Output:</p><figure class="highlight scss"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">in int f()</span><br><span class="line">in double 
f()</span><br></pre></td></tr></table></figure><p>As you can see, the choice here is driven by the parameter types. So the question is: when we write templates and want to pick an implementation based on a condition on the template parameter, what do we do? (For example, use one implementation when the argument is one of our own classes and another for every other type.) This is where enable_if comes in.</p><h2 id="SFINAE-原则与-enable-if-简介"><a href="#SFINAE-原则与-enable-if-简介" class="headerlink" title="The SFINAE rule and a brief introduction to enable_if"></a>The SFINAE rule and a brief introduction to enable_if</h2><p>Overloading of C++ function templates relies on the SFINAE (substitution failure is not an error) rule: when substituting template arguments fails, the compiler does not report an error but simply drops that candidate. Consider the following example:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> <span class="built_in">std</span>;</span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(<span class="keyword">double</span> a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"in double f()"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(<span class="keyword">typename</span> T::noexist a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"in T::noexist 
f()"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>{</span><br><span class="line"> f(<span class="number">1</span>);</span><br><span class="line"> f(<span class="number">1.0</span>);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>The program compiles cleanly and prints:</p><figure class="highlight scss"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">in double f()</span><br><span class="line">in double f()</span><br></pre></td></tr></table></figure><p>Neither double nor int has a nested type named noexist (and T cannot even be deduced from the argument, because it appears only in a non-deduced context), so the template cannot be used for either call. Instead of raising an error, the compiler simply skips it, and both calls resolve to f(double), with 1 converted to double. We can exploit this rule to build a switch: a class that exposes a member type when some condition holds and has no such type otherwise, so that substitution fails. That switch is enable_if. enable_if is a standard C++ template (std::enable_if, available since C++11) and its implementation is very simple; here is one way to write it:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span><<span class="keyword">bool</span> B, <span class="class"><span class="keyword">class</span> <span class="title">T</span> =</span> <span class="keyword">void</span>></span><br><span class="line"> <span class="class"><span class="keyword">struct</span> <span class="title">user_enable_if</span> {</span>};</span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">user_enable_if</span><</span><span class="literal">true</span>, T> { <span class="keyword">typedef</span> T type; 
};</span><br></pre></td></tr></table></figure><p>Here we partially specialize user_enable_if for the case where the condition B is true. The only difference from the primary template is that the specialization defines the member type named type. So when a user writes typename user_enable_if<cond, Type>::type, the expression names a type if cond is true and fails to resolve if cond is false.</p><h2 id="enable-if的使用场景"><a href="#enable-if的使用场景" class="headerlink" title="Where to use enable_if"></a>Where to use enable_if</h2><p>enable_if can be attached to a function either as a parameter or as the return type. Let us look at concrete examples.</p><p>1. As a parameter. We add an extra, defaulted parameter whose type depends on enable_if, so that substitution decides which overload survives.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> <span class="built_in">std</span>;</span><br><span class="line"><span class="keyword">template</span><<span class="keyword">bool</span> B, <span class="class"><span 
class="keyword">class</span> <span class="title">T</span> =</span> <span class="keyword">void</span>></span><br><span class="line"> <span class="class"><span class="keyword">struct</span> <span class="title">user_enable_if</span> {</span>};</span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">user_enable_if</span><</span><span class="literal">true</span>, T> { <span class="keyword">typedef</span> T type; };</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">A</span>{</span>};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Traits</span>{</span></span><br><span class="line"> <span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">bool</span> is_basic = <span class="literal">true</span>;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Traits</span><</span>A>{</span><br><span class="line"> <span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">bool</span> is_basic = <span class="literal">false</span>;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(T a, <span 
class="keyword">typename</span> user_enable_if<Traits<T>::is_basic, <span class="keyword">void</span>>::type* dump= <span class="number">0</span>)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"a basic type"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">f</span><span class="params">(T a, <span class="keyword">typename</span> user_enable_if<!Traits<T>::is_basic, <span class="keyword">void</span>>::type* dump= <span class="number">0</span>)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"a class type"</span><<<span class="built_in">endl</span>;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>{</span><br><span class="line"> A a;</span><br><span class="line"> f(<span class="number">1</span>);</span><br><span class="line"> f(a);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Output:</p><figure class="highlight haskell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="title">a</span> basic <span class="class"><span class="keyword">type</span></span></span><br><span class="line"><span class="title">a</span> <span class="keyword">class</span> <span class="class"><span class="keyword">type</span></span></span><br></pre></td></tr></table></figure>
<p>Here, when f is called with 1, Traits<int>::is_basic is true, so user_enable_if<Traits<int>::is_basic, void>::type names a type (void); the first template can therefore be instantiated while the second cannot. When f is called with a, the result is exactly the opposite. Sometimes, however, the number of parameters is fixed (for example when overloading an operator, whose parameter count is strictly limited); in that case we can put enable_if on the return type instead.</p><p>2. As the return type. The code is as follows:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> <span class="built_in">std</span>;</span><br><span class="line"><span class="keyword">template</span><<span class="keyword">bool</span> B, <span class="class"><span class="keyword">class</span> <span class="title">T</span> =</span> <span class="keyword">void</span>></span><br><span class="line"> <span class="class"><span 
class="keyword">struct</span> <span class="title">user_enable_if</span> {</span>};</span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">user_enable_if</span><</span><span class="literal">true</span>, T> { <span class="keyword">typedef</span> T type; };</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">A</span>{</span>};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Traits</span>{</span></span><br><span class="line"> <span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">bool</span> is_basic = <span class="literal">true</span>;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Traits</span><</span>A>{</span><br><span class="line"> <span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">bool</span> is_basic = <span class="literal">false</span>;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="keyword">typename</span> user_enable_if<Traits<T>::is_basic, T>::<span class="function">type <span class="title">f</span><span class="params">(T a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"a basic type"</span><<<span 
class="built_in">endl</span>;</span><br><span class="line"> <span class="keyword">return</span> a;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="keyword">typename</span> T></span><br><span class="line"><span class="keyword">typename</span> user_enable_if<!Traits<T>::is_basic, T>::<span class="function">type <span class="title">f</span><span class="params">(T a)</span></span>{</span><br><span class="line"> <span class="built_in">cout</span><<<span class="string">"a class type"</span><<<span class="built_in">endl</span>;</span><br><span class="line"> <span class="keyword">return</span> a;</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>{</span><br><span class="line"> A a;</span><br><span class="line"> f(<span class="number">1</span>);</span><br><span class="line"> f(a);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Here enable_if is applied to the return type. When the argument is 1, user_enable_if<Traits<T>::is_basic, T>::type is T (that is, int), whereas user_enable_if<!Traits<T>::is_basic, T>::type cannot be resolved, so the first template is the one instantiated. When the argument is a, the opposite happens.</p><hr><h5 id="转载自:enable-if"><a href="#转载自:enable-if" class="headerlink" title="Reposted from: enable_if"></a>Reposted from: <a 
href="https://blog.csdn.net/jeffasd/article/details/84667071#:~:text=C%2B%2B%E7%9A%84enable_if%E5%B8%B8%E7%94%A8%E4%BA%8E%E6%9E%84%E5%BB%BA%E9%9C%80%E8%A6%81%E6%A0%B9%E6%8D%AE%E4%B8%8D%E5%90%8C%E7%9A%84%E7%B1%BB%E5%9E%8B%E7%9A%84%E6%9D%A1%E4%BB%B6%E5%AE%9E%E4%BE%8B%E5%8C%96%E4%B8%8D%E5%90%8C%E6%A8%A1%E6%9D%BF%E7%9A%84%E6%97%B6%E5%80%99%E3%80%82%20%E6%9C%AC%E6%96%87%E4%B8%BB%E8%A6%81%E8%AE%B2%E4%BA%86enable_if%E7%9A%84%E4%BD%BF%E7%94%A8%E5%9C%BA%E6%99%AF%E5%92%8C%E4%BD%BF%E7%94%A8%E6%96%B9%E5%BC%8F%E3%80%82%20%23%23%20%E5%87%BD%E6%95%B0%E9%87%8D%E8%BD%BD%E7%9A%84%E7%BC%BA%E9%99%B7%20%E5%87%BD%E6%95%B0%E9%87%8D%E8%BD%BD%E8%83%BD%E8%A7%A3%E5%86%B3%E5%90%8C%E5%90%8D%E5%87%BD%E6%95%B0%E9%92%88%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BC%A0%E5%85%A5%E5%8F%82%E6%95%B0%E7%B1%BB%E5%9E%8B%E8%80%8C%E5%AE%9E%E7%8E%B0%E4%B8%8D%E5%90%8C%E7%9A%84%E5%8A%9F%E8%83%BD%E3%80%82%20%E4%B8%BE%E4%B8%80%E4%B8%AA%E7%AE%80%E5%8D%95%E7%9A%84%E4%BE%8B%E5%AD%90%EF%BC%9A%20%E5%8F%AF%E4%BB%A5%E7%9C%8B%E5%87%BA%EF%BC%8C%E8%BF%99%E9%87%8C%E7%9A%84%E9%80%89%E6%8B%A9%E6%96%B9%E5%BC%8F%E6%98%AF%E9%80%9A%E8%BF%87%E4%B8%8D%E5%90%8C%E7%9A%84%E5%8F%82%E6%95%B0%E7%B1%BB%E5%9E%8B%E5%AE%9E%E7%8E%B0%E7%9A%84%E3%80%82,%E9%82%A3%E4%B9%88%E9%97%AE%E9%A2%98%E6%9D%A5%E4%BA%86%EF%BC%8C%E5%A6%82%E6%9E%9C%E6%88%91%E4%BB%AC%E6%98%AF%E5%86%99%E7%9A%84%E6%A8%A1%E6%9D%BF%EF%BC%8C%E6%83%B3%E6%A0%B9%E6%8D%AE%E6%A8%A1%E6%9D%BF%E7%9A%84%E6%9D%A1%E4%BB%B6%E6%9D%A5%E9%80%89%E6%8B%A9%E5%AE%9E%E7%8E%B0%E8%AF%A5%E6%80%8E%E4%B9%88%E5%8A%9E%EF%BC%9F%20%EF%BC%88%E4%BE%8B%E5%A6%82%EF%BC%8C%E5%AF%B9%E4%BA%8E%E6%88%91%E4%BB%AC%E5%AE%9A%E4%B9%89%E7%9A%84%E4%B8%80%E4%BA%9Bclass%E5%81%9A%E8%BE%93%E5%85%A5%E6%97%B6%EF%BC%8C%E9%87%87%E7%94%A8%E4%B8%80%E7%A7%8D%E6%96%B9%E5%BC%8F%E5%AE%9E%E7%8E%B0%EF%BC%8C%E8%80%8C%E5%AF%B9%E4%BA%8E%E5%85%B6%E4%BB%96%E7%B1%BB%E5%9E%8B%E7%9A%84%E8%AF%9D%E9%87%87%E7%94%A8%E5%8F%A6%E4%B8%80%E7%A7%8D%E6%96%B9%E5%BC%8F%EF%BC%89%E3%80%82%20%E8%BF%99%E5%B0%B1%E9%9C%80%E8%A6%81%E7%94%A8%E5%88%B0enable_if%20C%2B%2B%E6%A8%A1%E6%9D
%BF%E5%87%BD%E6%95%B0%E9%87%8D%E8%BD%BD%E4%BE%9D%E8%B5%96%E4%BA%8E%20SFINAE%20%28substitution-failure-is-not-an-error%29%20%E5%8E%9F%E5%88%99%EF%BC%8C%E5%8D%B3%E6%9B%BF%E6%8D%A2%E5%A4%B1%E8%B4%A5%E4%B8%8D%E8%AE%A4%E4%B8%BA%E6%98%AF%E9%94%99%E8%AF%AF%EF%BC%8C%E8%80%8C%E5%8F%AA%E6%98%AF%E7%AE%80%E5%8D%95%E5%9C%B0pass%E6%8E%89%E3%80%82%20%E7%9C%8B%E4%B8%8B%E9%9D%A2%E4%B8%80%E4%B8%AA%E4%BE%8B%E5%AD%90%EF%BC%9A">enable_if</a></h5>]]></content>
<summary type="html"><h1 id="C-enable-if的使用(转载)"><a href="#C-enable-if的使用(转载)" class="headerlink" title="Using C++ enable_if (repost)"></a>Using C++ enable_if (repost)</h1><p>C+</summary>
</entry>
<entry>
<title>C++ std::function requires its target to have a copy constructor</title>
<link href="https://codroc.github.io/2022/08/10/std_function%E6%9E%84%E9%80%A0%E5%87%BD%E6%95%B0%E9%97%AE%E9%A2%98/"/>
<id>https://codroc.github.io/2022/08/10/std_function%E6%9E%84%E9%80%A0%E5%87%BD%E6%95%B0%E9%97%AE%E9%A2%98/</id>
<published>2022-08-09T16:00:00.000Z</published>
<updated>2022-08-09T16:00:00.000Z</updated>
<content type="html"><![CDATA[<h1 id="std-function-默认对参数进行-copy"><a href="#std-function-默认对参数进行-copy" class="headerlink" title="std::function copies its target by default"></a>std::function copies its target by default</h1><p>While writing a thread pool I ran into this error message:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">error: use of deleted function ‘std::packaged_task<_Res(_ArgTypes ...)>::packaged_task(const std::packaged_task<_Res(_ArgTypes ...)>&) [with _Res = void; _ArgTypes = {}]’</span><br><span class="line"></span><br><span class="line">packaged_task(const packaged_task&) = delete;</span><br></pre></td></tr></table></figure><p>In other words, my code calls the copy constructor of <strong>packaged_task</strong>, which is deleted, hence the compile error. So where does the copy construction of <strong>packaged_task</strong> happen?</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"> <span class="class"><span class="keyword">class</span> <span class="title">ThreadPool</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="keyword">using</span> Task = <span class="built_in">std</span>::function<<span 
class="keyword">void</span>()>;</span><br><span class="line">... ...</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line">... ...</span><br><span class="line"> <span class="comment">// task queue</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">queue</span><Task> _task_queue;</span><br><span class="line">};</span><br><span class="line"><span class="keyword">template</span> <<span class="keyword">typename</span> Function, class... Args></span><br><span class="line"><span class="built_in">std</span>::<span class="built_in">future</span><<span class="keyword">typename</span> <span class="built_in">std</span>::result_of<Function(Args...)>::type></span><br><span class="line">ThreadPool::add_task(<span class="keyword">const</span> Function& f, Args... args) {</span><br><span class="line"> <span class="keyword">using</span> Ret = <span class="keyword">typename</span> <span class="built_in">std</span>::result_of<Function(Args...)>::type;</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::packaged_task<<span class="title">void</span><span class="params">()</span>> <span class="title">task</span><span class="params">(<span class="built_in">std</span>::bind(f, args...))</span></span>;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">future</span><Ret> ret = task.get_future();</span><br><span class="line"> {</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::unique_lock<<span class="built_in">std</span>::mutex> <span class="title">lock</span><span class="params">(_mu)</span></span>;</span><br><span class="line"> _task_queue.push(<span class="built_in">std</span>::move(task));</span><br><span class="line"> _cv.notify_all();</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> ret;</span><br><span 
class="line">}</span><br></pre></td></tr></table></figure><p>As you can see, <strong>_task_queue</strong> has type <code>std::queue<std::function<void()>></code>, and <code>std::function</code> copy-constructs the callable it is given, so that callable must be copy constructible (see <a href="https://en.cppreference.com/w/cpp/utility/functional/function/function">cppreference</a>). <strong>However, packaged_task cannot be copied, only moved, hence the error.</strong></p><p>The fix is simply to change <code>std::queue<std::function<void()>></code> to <code>std::queue<std::packaged_task<void()>></code>.</p><h1 id="参考"><a href="#参考" class="headerlink" title="References"></a>References</h1><ol><li><a href="https://en.cppreference.com/w/cpp/utility/functional/function/function">cppreference: std::function</a></li><li><a href="https://zhuanlan.zhihu.com/p/410001289">c++11 threadpool</a></li><li><a href="https://en.cppreference.com/w/cpp/thread/packaged_task">cppreference: std::packaged_task</a></li></ol>]]></content>
<summary type="html"><h1 id="std-function-默认对参数进行-copy"><a href="#std-function-默认对参数进行-copy" class="headerlink" title="std::function copies its target by default"></a>std::funct</summary>
</entry>
<entry>
<title>Smart pointer summary</title>
<link href="https://codroc.github.io/2022/08/09/%E6%99%BA%E8%83%BD%E6%8C%87%E9%92%88%E6%80%BB%E7%BB%93/"/>
<id>https://codroc.github.io/2022/08/09/%E6%99%BA%E8%83%BD%E6%8C%87%E9%92%88%E6%80%BB%E7%BB%93/</id>
<published>2022-08-09T11:57:16.000Z</published>
<updated>2022-08-09T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="C-智能指针总结"><a href="#C-智能指针总结" class="headerlink" title="C++ smart pointer summary"></a>C++ smart pointer summary</h1><p>According to cppreference, C++ provides the following smart pointers and related facilities (everything is available since C++11 unless noted otherwise).</p><p>Smart pointers:</p><ul><li>unique_ptr</li><li>shared_ptr</li><li>weak_ptr</li></ul><p>Related facilities:</p><ul><li><p>owner_less</p></li><li><p>enable_shared_from_this</p></li><li><p>bad_weak_ptr</p></li><li><p>default_delete</p></li><li><p>out_ptr_t (C++23)</p></li><li><p>inout_ptr_t (C++23)</p></li></ul><h1 id="智能指针"><a href="#智能指针" class="headerlink" title="Smart pointers"></a>Smart pointers</h1><p><strong>No smart pointer is thread-safe by default</strong> (shared_ptr updates its reference count atomically, but concurrent reads and writes of the same smart pointer object still need external synchronization).</p><p>Typically unique_ptr is paired with raw pointers, and shared_ptr with weak_ptr.</p><p>unique_ptr has <strong>exclusive</strong> ownership of a resource. It can effectively replace a scoped pointer and is a faithful application of RAII, making memory problems such as dangling pointers, wild pointers, double free and memory leaks easy to avoid.</p><p>shared_ptr <strong>shares</strong> ownership of a resource and <strong>controls the object's lifetime</strong>: once no shared_ptr refers to the resource, it is released automatically. This gives us a form of GC, where garbage means not only memory but any system resource. In multithreaded code it also helps us destroy objects safely.</p><p>weak_ptr <strong>has no effect on the object's lifetime and does not own the resource</strong>, but it can tell whether an object is still alive, and when needed it can be promoted to a shared_ptr. It is usually combined with shared_ptr for two classes that refer to each other (see the muduo reading notes).</p><p>Based on what the three smart pointers do, I wrote simple implementations:</p><h3 id="unique-ptr"><a href="#unique-ptr" class="headerlink" title="unique_ptr"></a>unique_ptr</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">unique_ptr</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">unique_ptr</span><span class="params">(T* raw_ptr)</span></span></span><br><span class="line"><span class="function"> : <span class="title">raw_ptr_</span><span class="params">(raw_ptr)</span></span></span><br><span class="line"><span class="function"> </span>{}</span><br><span class="line"></span><br><span class="line"> <span class="comment">// move</span></span><br><span class="line"> <span class="built_in">unique_ptr</span>(<span class="built_in">unique_ptr</span><T>&& rhs) {</span><br><span class="line"> raw_ptr_ = rhs.raw_ptr_;</span><br><span 
class="line"> rhs.raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line"> }</span><br><span class="line"> <span class="built_in">unique_ptr</span>& <span class="keyword">operator</span>=(<span class="built_in">unique_ptr</span><T>&& rhs) {</span><br><span class="line"> <span class="keyword">if</span> (<span class="keyword">this</span> != &rhs) reset(rhs.release());</span><br><span class="line"> <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> ~<span class="built_in">unique_ptr</span>() {</span><br><span class="line"> <span class="keyword">if</span> (raw_ptr_) {</span><br><span class="line"> <span class="keyword">delete</span> raw_ptr_;</span><br><span class="line"> raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">swap</span><span class="params">(<span class="built_in">unique_ptr</span><T>& rhs)</span> </span>{</span><br><span class="line"> <span class="keyword">if</span> (<span class="keyword">this</span> == &rhs) <span class="keyword">return</span>;</span><br><span class="line"> T* tmp = raw_ptr_;</span><br><span class="line"> raw_ptr_ = rhs.raw_ptr_;</span><br><span class="line"> rhs.raw_ptr_ = tmp;</span><br><span class="line"> }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">reset</span><span class="params">(T* raw_ptr = <span class="literal">nullptr</span>)</span> </span>{</span><br><span class="line"> T* old = raw_ptr_;</span><br><span class="line"> raw_ptr_ = raw_ptr;</span><br><span class="line"> <span class="keyword">delete</span> old; <span class="comment">// deleting nullptr is a no-op</span></span><br><span class="line"> }</span><br>
<span class="line"></span><br><span class="line"> <span class="function">T* <span class="title">release</span><span class="params">()</span> </span>{</span><br><span class="line"> T* ret = raw_ptr_;</span><br><span class="line"> raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line"> <span class="keyword">return</span> ret;</span><br><span class="line"> }</span><br><span class="line"> <span class="function">T* <span class="title">get</span><span class="params">()</span> <span class="keyword">const</span> </span>{ <span class="keyword">return</span> raw_ptr_; }</span><br><span class="line"></span><br><span class="line"> T& <span class="keyword">operator</span>*() <span class="keyword">const</span> { <span class="keyword">return</span> *raw_ptr_; }</span><br><span class="line"> T* <span class="keyword">operator</span>->() <span class="keyword">const</span> { <span class="keyword">return</span> raw_ptr_; }</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="comment">// copying is not allowed</span></span><br><span class="line"> <span class="built_in">unique_ptr</span>(<span class="keyword">const</span> <span class="built_in">unique_ptr</span>&) = <span class="keyword">delete</span>;</span><br><span class="line"> <span class="built_in">unique_ptr</span>& <span class="keyword">operator</span>=(<span class="keyword">const</span> <span class="built_in">unique_ptr</span>&) = <span class="keyword">delete</span>;</span><br><span class="line"></span><br><span class="line"> T* raw_ptr_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><h3 id="shared-ptr"><a href="#shared-ptr" class="headerlink" title="shared_ptr"></a>shared_ptr</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span 
class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">weak_ptr</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span> <<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">shared_ptr</span> {</span></span><br><span class="line"><span 
class="keyword">public</span>:</span><br><span class="line">    <span class="built_in">shared_ptr</span>()</span><br><span class="line">        : raw_ptr_(<span class="literal">nullptr</span>), ref_(<span class="literal">nullptr</span>)</span><br><span class="line">    {}</span><br><span class="line">    <span class="built_in">shared_ptr</span>(T* raw_ptr)</span><br><span class="line">        : raw_ptr_(raw_ptr), ref_(raw_ptr ? <span class="keyword">new</span> <span class="built_in">std</span>::atomic<<span class="keyword">uint64_t</span>>(<span class="number">1</span>) : <span class="literal">nullptr</span>)</span><br><span class="line">    {}</span><br><span class="line">    <span class="comment">// copy</span></span><br><span class="line">    <span class="built_in">shared_ptr</span>(<span class="keyword">const</span> <span class="built_in">shared_ptr</span><T>& rhs)</span><br><span class="line">        : raw_ptr_(rhs.raw_ptr_), ref_(rhs.ref_)</span><br><span class="line">    {</span><br><span class="line">        <span class="keyword">if</span> (ref_) ref_->fetch_add(<span class="number">1</span>); <span class="comment">// an empty rhs has no counter to bump</span></span><br><span class="line">    }</span><br><span class="line">    <span class="built_in">shared_ptr</span><T>& <span class="keyword">operator</span>=(<span class="keyword">const</span> <span class="built_in">shared_ptr</span><T>& rhs) {</span><br><span class="line">        <span class="built_in">shared_ptr</span><T> tmp(rhs); <span class="comment">// copy-and-swap, also safe for self-assignment</span></span><br><span class="line">        swap(tmp); <span class="comment">// tmp's destructor releases the old resource</span></span><br><span class="line">        <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line">    }</span><br><span 
class="line">    <span class="comment">// move</span></span><br><span class="line">    <span class="built_in">shared_ptr</span>(<span class="built_in">shared_ptr</span><T>&& rhs)</span><br><span class="line">        : raw_ptr_(rhs.raw_ptr_), ref_(rhs.ref_)</span><br><span class="line">    {</span><br><span class="line">        rhs.raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line">        rhs.ref_ = <span class="literal">nullptr</span>;</span><br><span class="line">    }</span><br><span class="line">    <span class="built_in">shared_ptr</span><T>& <span class="keyword">operator</span>=(<span class="built_in">shared_ptr</span><T>&& rhs) {</span><br><span class="line">        <span class="built_in">shared_ptr</span><T> tmp(<span class="built_in">std</span>::move(rhs));</span><br><span class="line">        swap(tmp);</span><br><span class="line">        <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line">    }</span><br><span class="line">    ~<span class="built_in">shared_ptr</span>() {</span><br><span class="line">        <span class="comment">// Wrong version:</span></span><br><span class="line">        <span class="comment">// if (*ref_ == 1) { // with multiple threads, two shared_ptrs may enter the destructor at the same time, both read ref_ as 2, and neither deletes the object, which leaks memory</span></span><br><span class="line">        <span class="comment">//     delete raw_ptr_;</span></span><br><span class="line">        <span class="comment">// } else {</span></span><br><span class="line">        <span class="comment">//     ref_->fetch_sub(1);</span></span><br><span class="line">        <span class="comment">// }</span></span><br><span class="line">        <span class="comment">// raw_ptr_ = nullptr;</span></span><br><span class="line">        <span class="comment">// ref_ = nullptr;</span></span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> (ref_) {</span><br><span class="line">            <span class="keyword">uint64_t</span> ref = ref_->fetch_sub(<span class="number">1</span>); <span class="comment">// returns the value before the decrement; this step is easy to get wrong</span></span><br><span class="line">            <span class="keyword">if</span> (ref == <span class="number">1</span>) {</span><br><span class="line">                <span class="keyword">delete</span> raw_ptr_;</span><br><span class="line">            }</span><br><span class="line">            raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line">            ref_ = <span class="literal">nullptr</span>;</span><br><span class="line">        }</span><br><span class="line">    }</span><br><span class="line"></span><br><span 
class="line">    <span class="function"><span class="keyword">void</span> <span class="title">reset</span><span class="params">(T* raw_ptr = <span class="literal">nullptr</span>)</span> </span>{</span><br><span class="line">        <span class="built_in">shared_ptr</span><T> tmp(raw_ptr);</span><br><span class="line">        swap(tmp); <span class="comment">// tmp's destructor releases the old resource</span></span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">void</span> <span class="title">swap</span><span class="params">(<span class="built_in">shared_ptr</span><T>& rhs)</span> </span>{</span><br><span class="line">        T* tmp1 = raw_ptr_;</span><br><span class="line">        <span class="built_in">std</span>::atomic<<span class="keyword">uint64_t</span>>* tmp2 = ref_;</span><br><span class="line"></span><br><span class="line">        raw_ptr_ = rhs.raw_ptr_;</span><br><span class="line">        ref_ = rhs.ref_;</span><br><span class="line">        rhs.raw_ptr_ = tmp1;</span><br><span class="line">        rhs.ref_ = tmp2;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="function">T* <span class="title">get</span><span class="params">()</span> <span class="keyword">const</span> </span>{ <span class="keyword">return</span> raw_ptr_; }</span><br><span class="line">    <span class="function"><span class="keyword">uint64_t</span> <span class="title">use_count</span><span class="params">()</span> <span class="keyword">const</span> </span>{</span><br><span 
class="line">        <span class="keyword">if</span> (!ref_)</span><br><span class="line">            <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">        <span class="keyword">return</span> *ref_;</span><br><span class="line">    }</span><br><span class="line">    <span class="function"><span class="keyword">bool</span> <span class="title">unique</span><span class="params">()</span> <span class="keyword">const</span> </span>{</span><br><span class="line">        <span class="keyword">if</span> (!ref_)</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">        <span class="keyword">return</span> *ref_ == <span class="number">1</span>;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    T& <span class="keyword">operator</span>*() <span class="keyword">const</span> {</span><br><span class="line">        <span class="keyword">return</span> *raw_ptr_;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    T* <span class="keyword">operator</span>->() <span class="keyword">const</span> {</span><br><span class="line">        <span class="keyword">return</span> raw_ptr_;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="keyword">operator</span> T*() <span class="keyword">const</span> { <span class="keyword">return</span> raw_ptr_; }</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line">    <span class="keyword">friend</span> <span class="class"><span class="keyword">class</span> <span class="title">weak_ptr</span><</span>T>;</span><br><span class="line"></span><br><span class="line">    T* raw_ptr_;</span><br><span class="line">    <span class="comment">// Simplification: the counter is never freed, because weak_ptr still</span></span><br><span class="line">    <span class="comment">// reads it after the last owner is gone. A production implementation</span></span><br><span class="line">    <span class="comment">// keeps a separate weak count in the control block and frees the block</span></span><br><span class="line">    <span class="comment">// once both counts reach zero.</span></span><br><span class="line">    <span class="built_in">std</span>::atomic<<span class="keyword">uint64_t</span>>* ref_;</span><br><span class="line">};</span><br><span class="line"><span 
class="comment">// shared_ptr</span></span><br><span class="line"><span class="keyword">template</span> <<span class="class"><span class="keyword">class</span> <span class="title">T</span>, <span class="keyword">class</span> <span class="title">U</span>></span></span><br><span class="line"><span class="keyword">bool</span> <span class="keyword">operator</span>==(<span class="keyword">const</span> <span class="built_in">shared_ptr</span><T>& lhs, <span class="keyword">const</span> <span class="built_in">shared_ptr</span><U>& rhs) {</span><br><span class="line">    <span class="keyword">return</span> <span class="keyword">static_cast</span><<span class="keyword">const</span> <span class="keyword">void</span>*>(lhs.get()) == <span class="keyword">static_cast</span><<span class="keyword">const</span> <span class="keyword">void</span>*>(rhs.get());</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="weak-ptr"><a href="#weak-ptr" class="headerlink" title="weak_ptr"></a>weak_ptr</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">weak_ptr</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line">    weak_ptr() = <span class="keyword">default</span>;</span><br><span class="line">    weak_ptr(<span class="keyword">const</span> weak_ptr<T>& wp) {</span><br><span class="line">        *<span class="keyword">this</span> = wp;</span><br><span class="line">    }</span><br><span class="line">    weak_ptr(<span class="keyword">const</span> <span class="built_in">shared_ptr</span><T>& sp) {</span><br><span class="line">        *<span class="keyword">this</span> = sp;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    weak_ptr<T>& <span class="keyword">operator</span>=(<span 
class="keyword">const</span> weak_ptr<T>& rhs) {</span><br><span class="line">        raw_ptr_ = rhs.raw_ptr_;</span><br><span class="line">        ref_ = rhs.ref_;</span><br><span class="line">        <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line">    }</span><br><span class="line">    weak_ptr<T>& <span class="keyword">operator</span>=(<span class="keyword">const</span> <span class="built_in">shared_ptr</span><T>& rhs) {</span><br><span class="line">        raw_ptr_ = rhs.raw_ptr_;</span><br><span class="line">        ref_ = rhs.ref_;</span><br><span class="line">        <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="comment">// move</span></span><br><span class="line">    weak_ptr(weak_ptr<T>&& rhs) {</span><br><span class="line">        *<span class="keyword">this</span> = <span class="built_in">std</span>::move(rhs);</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    weak_ptr<T>& <span class="keyword">operator</span>=(weak_ptr<T>&& rhs) {</span><br><span class="line">        *<span class="keyword">this</span> = rhs; <span class="comment">// rhs is an lvalue here, so this calls the copy assignment</span></span><br><span class="line">        rhs.raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line">        rhs.ref_ = <span class="literal">nullptr</span>;</span><br><span class="line">        <span class="keyword">return</span> *<span class="keyword">this</span>;</span><br><span class="line">    }</span><br><span class="line">    <span class="comment">// modifiers</span></span><br><span class="line">    <span class="function"><span class="keyword">void</span> <span class="title">reset</span><span class="params">()</span> </span>{</span><br><span class="line">        raw_ptr_ = <span class="literal">nullptr</span>;</span><br><span class="line">        ref_ = <span class="literal">nullptr</span>;</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">void</span> <span class="title">swap</span><span class="params">(weak_ptr<T>& wp)</span> </span>{</span><br><span 
class="line">        <span class="built_in">std</span>::swap(raw_ptr_, wp.raw_ptr_); <span class="comment">// qualified: an unqualified swap would find this member function</span></span><br><span class="line">        <span class="built_in">std</span>::swap(ref_, wp.ref_);</span><br><span class="line">    }</span><br><span class="line"></span><br><span class="line">    <span class="comment">// observers</span></span><br><span class="line">    <span class="function"><span class="built_in">shared_ptr</span><T> <span class="title">lock</span><span class="params">()</span> <span class="keyword">const</span> </span>{</span><br><span class="line">        <span class="built_in">shared_ptr</span><T> ret;</span><br><span class="line">        <span class="keyword">if</span> (!expired()) { <span class="comment">// simplified: the check and the increment are not one atomic step</span></span><br><span class="line">            ret.raw_ptr_ = raw_ptr_;</span><br><span class="line">            ret.ref_ = ref_;</span><br><span class="line">            ref_->fetch_add(<span class="number">1</span>);</span><br><span class="line">        }</span><br><span class="line">        <span class="keyword">return</span> ret;</span><br><span class="line">    }</span><br><span class="line">    <span class="function"><span class="keyword">bool</span> <span class="title">expired</span><span class="params">()</span> <span class="keyword">const</span> </span>{ <span class="keyword">return</span> ref_ == <span class="literal">nullptr</span> <span class="keyword">or</span> *ref_ == <span class="number">0</span>; }</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line">    T* raw_ptr_{<span class="literal">nullptr</span>};</span><br><span class="line">    <span class="built_in">std</span>::atomic<<span class="keyword">uint64_t</span>>* ref_{<span class="literal">nullptr</span>};</span><br><span class="line">};</span><br></pre></td></tr></table></figure><h1 id="相关技术"><a href="#相关技术" class="headerlink" title="Related Techniques"></a>Related Techniques</h1><h3 id="owner-less"><a href="#owner-less" class="headerlink" title="owner_less"></a>owner_less</h3><p>See <a href="https://stackoverflow.com/questions/53217358/what-does-stdowner-less-do">What does std::owner_less do?</a></p><h3 id="enable-shared-from-this"><a href="#enable-shared-from-this" class="headerlink" 
title="enable_shared_from_this"></a>enable_shared_from_this</h3><p>Used to obtain a shared_ptr to an object that is already managed by a shared_ptr (a bit of a mouthful).</p><p>Three points to remember:</p><ol><li>enable_shared_from_this uses CRTP to achieve compile-time polymorphism;</li><li>An object of a class derived from enable_shared_from_this must already be owned by a shared_ptr when shared_from_this is called, which in practice means a heap object rather than a stack object; otherwise the call throws std::bad_weak_ptr (since C++17; before that the behavior is undefined);</li><li>Do not call shared_from_this in the derived class's constructor, because at that point the object has not yet been handed to a shared_ptr;</li></ol><h3 id="bad-weak-ptr"><a href="#bad-weak-ptr" class="headerlink" title="bad_weak_ptr"></a>bad_weak_ptr</h3><p>This is an exception class derived from std::exception.</p><blockquote><p><code>std::bad_weak_ptr</code> is the type of the object thrown as exceptions by the constructors of <a href="https://en.cppreference.com/w/cpp/memory/shared_ptr">std::shared_ptr</a> that take <a href="https://en.cppreference.com/w/cpp/memory/weak_ptr">std::weak_ptr</a> as the argument, when the <a href="https://en.cppreference.com/w/cpp/memory/weak_ptr">std::weak_ptr</a> refers to an already deleted object. (cppreference)</p></blockquote><p>In other words: <strong>std::bad_weak_ptr</strong> is the exception object thrown by the constructor <code>shared_ptr(const weak_ptr<T>& rhs)</code> when the object the <strong>std::weak_ptr</strong> points to has already been deleted;</p><p>Only one member function is worth remembering: <strong>std::bad_weak_ptr::what</strong> (a virtual function inherited from the base class);</p><p>Example:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><memory></span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="function"><span 
class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span></span><br><span class="line"><span class="function"></span>{</span><br><span class="line">    <span class="function"><span class="built_in">std</span>::<span class="built_in">shared_ptr</span><<span class="keyword">int</span>> <span class="title">p1</span><span class="params">(<span class="keyword">new</span> <span class="keyword">int</span>(<span class="number">42</span>))</span></span>;</span><br><span class="line">    <span class="function"><span class="built_in">std</span>::weak_ptr<<span class="keyword">int</span>> <span class="title">wp</span><span class="params">(p1)</span></span>;</span><br><span class="line">    p1.reset();</span><br><span class="line">    <span class="keyword">try</span> {</span><br><span class="line">        <span class="function"><span class="built_in">std</span>::<span class="built_in">shared_ptr</span><<span class="keyword">int</span>> <span class="title">p2</span><span class="params">(wp)</span></span>;</span><br><span class="line">    } <span class="keyword">catch</span>(<span class="keyword">const</span> <span class="built_in">std</span>::bad_weak_ptr& e) {</span><br><span class="line">        <span class="built_in">std</span>::<span class="built_in">cout</span> << e.what() << <span class="string">'\n'</span>;</span><br><span class="line">    }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="default-delete"><a href="#default-delete" class="headerlink" title="default_delete"></a>default_delete</h3><p>The deleter used by smart pointers, for unique_ptr and shared_ptr (only these two have ownership). By default it is <code>std::default_delete<T>()</code> or <code>std::default_delete<T[]>()</code>; the former calls delete, and the latter calls delete [];</p><p>A default_delete can be passed explicitly when constructing a unique_ptr:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span><</span><br><span class="line">    <span class="class"><span class="keyword">class</span> <span class="title">T</span>,</span></span><br><span class="line"><span class="class">    <span class="keyword">class</span> <span class="title">Deleter</span> =</span> <span class="built_in">std</span>::default_delete<T></span><br><span class="line">> <span class="class"><span class="keyword">class</span> <span class="title">unique_ptr</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="built_in">std</span>::<span class="built_in">unique_ptr</span><<span class="keyword">int</span>> <span class="title">up</span><span class="params">(<span class="keyword">new</span> <span class="keyword">int</span>(<span class="number">10</span>), <span class="built_in">std</span>::default_delete<<span class="keyword">int</span>>())</span></span>;</span><br><span class="line"><span class="function"><span class="built_in">std</span>::<span class="built_in">unique_ptr</span><<span class="keyword">int</span>[]> <span class="title">up_arr</span><span class="params">(<span class="keyword">new</span> <span class="keyword">int</span>[<span class="number">10</span>], <span class="built_in">std</span>::default_delete<<span class="keyword">int</span>[]>())</span></span>; <span class="comment">// the array form needs unique_ptr<int[]></span></span><br></pre></td></tr></table></figure><p>As you can see, the Deleter is one of the template parameters of the unique_ptr class template;</p><p>Now look at shared_ptr:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span>< <span class="class"><span class="keyword">class</span> <span class="title">T</span> ></span> <span class="class"><span class="keyword">class</span> <span class="title">shared_ptr</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span>< <span class="class"><span class="keyword">class</span> <span class="title">Y</span>, <span class="keyword">class</span> <span class="title">Deleter</span> ></span></span><br><span class="line"><span class="built_in">shared_ptr</span>( Y* ptr, Deleter d );</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="built_in">std</span>::<span class="built_in">shared_ptr</span><<span class="keyword">int</span>> <span class="title">sp</span><span class="params">(<span class="keyword">new</span> <span class="keyword">int</span>(<span class="number">10</span>), <span class="built_in">std</span>::default_delete<<span class="keyword">int</span>>())</span></span>;</span><br><span class="line"><span class="function"><span class="built_in">std</span>::<span class="built_in">shared_ptr</span><<span class="keyword">int</span>> <span class="title">sp_arr</span><span class="params">(<span class="keyword">new</span> <span class="keyword">int</span>[<span class="number">10</span>], <span class="built_in">std</span>::default_delete<<span class="keyword">int</span>[]>())</span></span>;</span><br><span class="line"></span><br><span class="line"><span class="built_in">std</span>::<span class="built_in">vector</span><<span class="keyword">int</span>*> v;</span><br><span class="line"><span class="keyword">for</span>(<span class="keyword">int</span> n = <span class="number">0</span>; n < <span class="number">100</span>; ++n)</span><br><span class="line">    v.push_back(<span class="keyword">new</span> <span class="keyword">int</span>(n));</span><br><span 
class="line"><span class="built_in">std</span>::for_each(v.begin(), v.end(), <span class="built_in">std</span>::default_delete<<span class="keyword">int</span>>());</span><br></pre></td></tr></table></figure><p>The shared_ptr class template has only one template parameter, the resource type, and no Deleter; instead, the Deleter is an extra template parameter of the constructor, which is itself a function template;</p><h3 id="out-ptr-t"><a href="#out-ptr-t" class="headerlink" title="out_ptr_t"></a>out_ptr_t</h3><h3 id="inout-ptr-t"><a href="#inout-ptr-t" class="headerlink" title="inout_ptr_t"></a>inout_ptr_t</h3><p>See <a href="https://stackoverflow.com/questions/68918312/understanding-stdinout-ptr-and-stdout-ptr-in-c23">Understanding std::inout_ptr and std::out_ptr in C++23</a></p>]]></content>
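<summary type="html"><h1 id="C-智能指针总结"><a href="#C-智能指针总结" class="headerlink" title="C++ Smart Pointers Summary"></a>C++ Smart Pointers Summary</h1><p>According to cppreference, C++ provides the following smart pointers, together with the techniques related to them (</summary>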
</entry>
<entry>
<title>leveldb Source Code Analysis [8]: Compaction</title>
<link href="https://codroc.github.io/2022/08/08/leveldb8_compaction/"/>
<id>https://codroc.github.io/2022/08/08/leveldb8_compaction/</id>
<published>2022-08-08T11:57:16.000Z</published>
<updated>2022-08-08T11:57:16.000Z</updated>
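<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【8】——-Compaction"><a href="#LevelDB-源码分析【8】——-Compaction" class="headerlink" title="LevelDB Source Code Analysis [8]: Compaction"></a>LevelDB Source Code Analysis [8]: Compaction</h1><p>Compaction types:</p><ul><li>minor compaction: flushing an immutable memtable to a level(0) sstable</li><li>major compaction: merging the sstables of level(i) and level(i + 1) into level(i + 1)</li></ul><p>The four purposes of compaction:</p><ul><li>Persisting data: in-memory data is written to disk through minor compaction</li><li>Improving read efficiency: the sstables in level(0) may have overlapping key ranges, so in the worst case reading one key means going through every sstable file in level(0). For example, take key = 50 and two level(0) files, sstable0 with keys in [0, 100] and sstable1 with keys in [30, 70]: in the worst case I check sstable1 and do not find the key, then check sstable0 and find it there, so every level(0) sstable file has been visited</li><li>Balancing reads and writes: if users keep writing faster than major compaction can keep up, the number of level(0) files keeps growing and read efficiency keeps dropping, so compaction has to hold that growth in check</li><li>Tidying up data: LevelDB is a typical LSM-tree implementation, so the same key may have several entries; compaction merges the different versions of a key to reduce space amplification</li></ul><h2 id="Compaction-的过程"><a href="#Compaction-的过程" class="headerlink" title="The Compaction Process"></a>The Compaction Process</h2><p>Compaction comes in the two types described above. How does each one run, and when is it triggered?</p><h3 id="Minor-Compaction"><a href="#Minor-Compaction" class="headerlink" title="Minor Compaction"></a>Minor Compaction</h3><p><strong>When it is triggered:</strong></p><p>Once the memtable's size reaches a threshold, it becomes an immutable memtable; when the background thread finds an immutable memtable, it runs a minor compaction</p><p><strong>Process:</strong></p><p>A minor compaction is very simple: in essence it persists all the data of one in-memory database to a single file on disk. Every minor compaction produces a new sstable file, which also means that <strong>LevelDB's version state has changed and a version switch takes place</strong></p><p><strong>Minor compaction has higher priority than major compaction</strong>: if a major compaction is in progress when a minor compaction needs to run, the major compaction is paused first</p><p><img src="https://s2.loli.net/2022/07/30/wprDEfxMTiFts4H.png" alt="compaction0.png"></p>
<h3 id="Major-Compaction"><a href="#Major-Compaction" class="headerlink" title="Major Compaction"></a>Major Compaction</h3><p><strong>When it is triggered:</strong></p><ul><li>The number of level-0 sstable files reaches a threshold (4 by default) (purpose: to improve level-0 read efficiency)</li><li>The total data size of all sstable files in level i (i > 0) exceeds 10^i MB (purpose: to reduce the IO cost of compaction)</li><li>A file has had too many invalid reads (purpose: to avoid a potentially "huge" merge cost, which I call the "carry" cost; for details see <a href="https://leveldb-handbook.readthedocs.io/zh/latest/compaction.html">compaction</a>)</li></ul><p>What is an invalid read? It means that an sstable file was read in search of a key but the key was not there (a miss), so that read was wasted;</p><p><strong>Process:</strong></p><p>A major compaction can be roughly divided into the following steps:</p><ol><li>Find suitable input files;</li><li>Expand the set of input files according to key overlap;</li><li>Multi-way merge;</li><li>Score calculation;</li></ol><h4 id="寻找合适的输入文件"><a href="#寻找合适的输入文件" class="headerlink" title="Finding Suitable Input Files"></a>Finding Suitable Input Files</h4><p>For compactions triggered by <em>the number of level-0 sstable files</em> reaching its threshold or by <em>the data size of level-i sstable files</em> reaching its threshold, the <strong>starting input file</strong> is chosen in a <strong>round-robin</strong> fashion. LevelDB remembers the largest key of the files output by the previous compaction, and this time picks the first sstable file after that key as the starting input file;</p><p>For <em>seek compaction</em> (triggered by too many invalid reads), the starting input file is the file with too many invalid lookups;</p><h4 id="扩大输入文件集合"><a href="#扩大输入文件集合" class="headerlink" title="Expanding the Set of Input Files"></a>Expanding the Set of Input Files</h4><p>The procedure is as follows:</p><ol><li>The file marked with a red star is the starting input file;</li><li>In level i, find the files whose keys overlap with the starting input file, marked with red lines in the figure; together they form the level-i input files;</li><li>Using the level-i input files, find the files in level i+1 whose keys overlap with them, marked with green lines; this gives the input files of levels i and i+1;</li><li>Finally, using the input files of both levels, look for further level-i files with overlapping keys, on the condition that the level i+1 input set does not grow; the result, marked with blue lines, forms the final set of input files;</li></ol><p><img src="https://s2.loli.net/2022/07/30/6dzhBLFVCxD8kj3.png" alt="compaction1.png"></p><h4 id="多路合并"><a href="#多路合并" class="headerlink" title="Multi-way Merge"></a>Multi-way Merge</h4><p>The multi-way merge is simply the merging of sorted arrays. One thing to note: after an sstable has been merged, it cannot be deleted right away if users still hold references to it; the deletion has to wait until its reference count drops to 0;</p><h4 id="积分计算"><a href="#积分计算" class="headerlink" title="Score Calculation"></a>Score Calculation</h4><p>For every level, leveldb maintains a piece of metadata (a scoreboard, stored in the version) that reflects the number of files or the total data size of that level and is used to pick the next level to compact;</p><p>Scoring rules:</p><ul><li>For level 0, the score is the number of files / 4;</li><li>For any other level, the score is the total data size of the level / the size limit of the level;</li></ul><p>The level with the highest score is recorded; if that score exceeds 1, it is the level on which the next compaction runs;</p>]]></content>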
<summary type="html"><h1 id="LevelDB-源码分析【8】——-Compaction"><a href="#LevelDB-源码分析【8】——-Compaction" class="headerlink" title="LevelDB 源码分析【8】—— Compaction"></a>Le</summary>
</entry>
<entry>
<title>leveldb Source Code Analysis [7]: Version Control</title>
<link href="https://codroc.github.io/2022/08/08/leveldb7_version_control/"/>
<id>https://codroc.github.io/2022/08/08/leveldb7_version_control/</id>
<published>2022-08-08T11:57:16.000Z</published>
<updated>2022-08-08T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【7】——-Version-Control"><a href="#LevelDB-源码分析【7】——-Version-Control" class="headerlink" title="LevelDB 源码分析【7】—— Version Control"></a>LevelDB Source Code Analysis [7]: Version Control</h1><p><strong>Source files involved:</strong></p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">db/version_set.h</span><br><span class="line">db/version_edit.h</span><br></pre></td></tr></table></figure><p>Every time LevelDB adds or removes an sstable it moves from one version to the next. Each such change to the set of sstable files is the smallest unit of operation for LevelDB and is atomic.</p><p>LevelDB uses Version to represent the metadata of one version. It is defined like this:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Version</span> {</span></span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="class"><span class="keyword">struct</span> <span class="title">GetStats</span> {</span></span><br><span class="line"> FileMetaData* seek_file;</span><br><span class="line"> <span
class="keyword">int</span> seek_file_level;</span><br><span class="line"> };</span><br><span class="line">... ...</span><br><span class="line"> <span class="comment">// DOC: next_ and prev_ link the Versions into a doubly-linked list</span></span><br><span class="line"> VersionSet* vset_; <span class="comment">// VersionSet to which this Version belongs</span></span><br><span class="line"> Version* next_; <span class="comment">// Next version in linked list</span></span><br><span class="line"> Version* prev_; <span class="comment">// Previous version in linked list</span></span><br><span class="line"> <span class="comment">// DOC: refs_ counts the live references to this Version, so that a Version still in use (and the files it lists) is not deleted</span></span><br><span class="line"> <span class="keyword">int</span> refs_; <span class="comment">// Number of live refs to this version</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// DOC: sstable file metadata for each level</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><FileMetaData*> files_[config::kNumLevels];</span><br><span class="line"></span><br><span class="line"> <span class="comment">// DOC: the next file to compact (chosen from seek stats) and its level</span></span><br><span class="line"> FileMetaData* file_to_compact_;</span><br><span class="line"> <span class="keyword">int</span> file_to_compact_level_;</span><br><span class="line"></span><br><span class="line"> <span class="comment">// DOC: state that triggers compaction; it is updated by read/write requests or during compaction</span></span><br><span class="line"> <span class="keyword">double</span> compaction_score_;</span><br><span class="line"> <span class="keyword">int</span> compaction_level_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>So a Version is itself a node in a doubly-linked list, and it holds the metadata of one version of LevelDB:</p><ul><li>files_ indexes the metadata of every sstable file in each level;</li><li>file_to_compact_ and file_to_compact_level_ record the next sstable to be compacted;</li><li>compaction_score_ and compaction_level_ record whether some level needs compaction; if compaction_score_ >= 1 that level needs a 
compaction;</li></ul><p>Nodes of this kind are then linked into a circular doubly-linked list, the VersionSet:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">VersionSet</span> {</span></span><br><span class="line"> ... 
...</span><br><span class="line"> Env* <span class="keyword">const</span> env_;</span><br><span class="line"> <span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span> dbname_;</span><br><span class="line"> <span class="keyword">const</span> Options* <span class="keyword">const</span> options_;</span><br><span class="line"> TableCache* <span class="keyword">const</span> table_cache_;</span><br><span class="line"> <span class="keyword">const</span> InternalKeyComparator icmp_;</span><br><span class="line"> <span class="keyword">uint64_t</span> next_file_number_;</span><br><span class="line"> <span class="keyword">uint64_t</span> manifest_file_number_;</span><br><span class="line"> <span class="keyword">uint64_t</span> last_sequence_;</span><br><span class="line"> <span class="keyword">uint64_t</span> log_number_;</span><br><span class="line"> <span class="keyword">uint64_t</span> prev_log_number_; <span class="comment">// 0 or backing store for memtable being compacted</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// Opened lazily</span></span><br><span class="line"> WritableFile* descriptor_file_;</span><br><span class="line"> <span class="built_in">log</span>::Writer* descriptor_log_;</span><br><span class="line"> Version dummy_versions_; <span class="comment">// Head of circular doubly-linked list of versions.</span></span><br><span class="line"> Version* current_; <span class="comment">// == dummy_versions_.prev_</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// Per-level key at which the next compaction at that level should start.</span></span><br><span class="line"> <span class="comment">// Either an empty string, or a valid InternalKey.</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> compact_pointer_[config::kNumLevels];</span><br><span
class="line">};</span><br></pre></td></tr></table></figure><p><strong>VersionSet</strong> is a circular doubly-linked list of Versions. The Versions are created one after another in time order, and each records the metadata of its moment; the tail of the list is the current, newest Version. Each Version keeps its own reference count: while it is referenced it is not deleted, and the sstables it refers to are kept as well. This lets LevelDB access files on <strong>any stable snapshot view</strong> (that is, on any Version that has not been deleted).</p><h3 id="如何从-Version-i-升级到-Version-i-1"><a href="#如何从-Version-i-升级到-Version-i-1" class="headerlink" title="如何从 Version_i 升级到 Version_i+1"></a>How to go from Version_i to Version_i+1</h3><p>Adjacent Versions differ only in that some files were created and others were deleted. In other words, applying the file changes to the old Version yields the new Version, and that is how Versions are produced. LevelDB uses <strong>VersionEdit</strong> to represent this difference between adjacent Versions;</p><p><img src="https://s2.loli.net/2022/07/31/aLUwk2ZRFA3EIoj.png" alt="version_control1.png"></p><p><strong>VersionEdit:</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">VersionEdit</span> {</span></span><br><span class="line"> ... 
...</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="keyword">friend</span> <span class="class"><span class="keyword">class</span> <span class="title">VersionSet</span>;</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">typedef</span> <span class="built_in">std</span>::<span class="built_in">set</span><<span class="built_in">std</span>::<span class="built_in">pair</span><<span class="keyword">int</span>, <span class="keyword">uint64_t</span>>> DeletedFileSet;</span><br><span class="line"></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> comparator_;</span><br><span class="line"> <span class="keyword">uint64_t</span> log_number_;</span><br><span class="line"> <span class="keyword">uint64_t</span> prev_log_number_;</span><br><span class="line"> <span class="keyword">uint64_t</span> next_file_number_;</span><br><span class="line"> SequenceNumber last_sequence_;</span><br><span class="line"> <span class="keyword">bool</span> has_comparator_;</span><br><span class="line"> <span class="keyword">bool</span> has_log_number_;</span><br><span class="line"> <span class="keyword">bool</span> has_prev_log_number_;</span><br><span class="line"> <span class="keyword">bool</span> has_next_file_number_;</span><br><span class="line"> <span class="keyword">bool</span> has_last_sequence_;</span><br><span class="line"></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="built_in">std</span>::<span class="built_in">pair</span><<span class="keyword">int</span>, InternalKey>> compact_pointers_;</span><br><span class="line"> DeletedFileSet deleted_files_;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="built_in">std</span>::<span class="built_in">pair</span><<span class="keyword">int</span>, FileMetaData>> 
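new_files_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><h3 id="Manifest"><a href="#Manifest" class="headerlink" title="Manifest"></a>Manifest</h3><p><strong>Persisting Versions and VersionEdits to disk gives the Manifest file.</strong></p><p>The manifest file is dedicated to recording version information. LevelDB stores it incrementally, recording how each version changed relative to the previous one.</p><p>In more detail, a Manifest file contains multiple Session Records. One Session Record describes the changes from the previous version to this one:</p><blockquote><p>(1) which sstable files were added;</p><p>(2) which sstable files were deleted (as a result of compaction);</p><p>(3) the number of the latest journal (log) file, and so on;</p></blockquote><p>With the Manifest file, LevelDB can start from an initial version state at startup and keep applying these version changes until the version information is restored to the state of the most recent run.</p><p>The format of a Manifest file is sketched below:</p><p><img src="https://s2.loli.net/2022/07/31/C3wkzITaNyqU1X6.png" alt="version_control0.png"></p><p>A Manifest contains a number of Session Records. <strong>The first Session Record</strong> holds the <em>full version information</em> of LevelDB at that time (a snapshot of the current Version, itself encoded as a VersionEdit), and the remaining Session Records only record the change made by each update (one VersionEdit each). So the first Session Record of every manifest file is a checkpoint (snapshot) holding the full version information, and it can serve as the initial state for recovery.</p><p>A Session Record may contain the following fields:</p><ul><li>the name of the comparator;</li><li>the number of the latest journal file;</li><li>the next available file number;</li><li>the largest sequence number among the entries the database has already persisted;</li><li>information about added files;</li><li>information about deleted files;</li><li>compaction records (compact pointers);</li></ul><p><strong>As you can see, these are in fact all member variables of VersionSet and VersionEdit;</strong></p><h3 id="如何从一个-Manifest-文件恢复数据库"><a href="#如何从一个-Manifest-文件恢复数据库" class="headerlink" title="如何从一个 Manifest 文件恢复数据库"></a>How to recover the database from a Manifest file</h3><p>When LevelDB recovers from the manifest, the naive approach is to read the earliest Version and keep applying VersionEdits until the latest state is reached. That would create a pile of intermediate Versions that nobody needs, since only the newest database state matters. LevelDB introduces <strong>VersionSet::Builder</strong> to avoid these intermediate objects: all VersionEdits are first accumulated in the Builder, and the final Version is then produced in a single step. The optimization is illustrated below:</p><p><img src="https://s2.loli.net/2022/07/31/34v86KjOhkLxaC9.png" alt="version_control2.png"></p><h5 id="VersionSet-Builder"><a href="#VersionSet-Builder" class="headerlink" title="VersionSet::Builder"></a>VersionSet::Builder</h5><figure class="highlight c++"><table><tr><td class="gutter"><pre><span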
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">VersionSet</span>:</span>:Builder {</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="class"><span class="keyword">struct</span> <span class="title">BySmallestKey</span> {</span></span><br><span class="line"> <span class="keyword">const</span> InternalKeyComparator* internal_comparator;</span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> FileMetaData*, <span class="keyword">const</span> FileMetaData*)</span> <span class="keyword">const</span></span>;</span><br><span class="line"> };</span><br><span class="line"> <span class="keyword">typedef</span> <span class="built_in">std</span>::<span class="built_in">set</span><FileMetaData*, BySmallestKey> FileSet;</span><br><span class="line"> <span class="class"><span class="keyword">struct</span> <span 
class="title">LevelState</span> {</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">set</span><<span class="keyword">uint64_t</span>> deleted_files;</span><br><span class="line"> FileSet* added_files;</span><br><span class="line"> };</span><br><span class="line"> VersionSet* vset_;</span><br><span class="line"> Version* base_;</span><br><span class="line"> LevelState levels_[config::kNumLevels]; <span class="comment">// DOC: per level, which old files were deleted and which new files were added</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> Builder(VersionSet* vset, Version* base) : vset_(vset), base_(base) {</span><br><span class="line"> base_->Ref();</span><br><span class="line"> BySmallestKey cmp;</span><br><span class="line"> cmp.internal_comparator = &vset_->icmp_;</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">int</span> level = <span class="number">0</span>; level < config::kNumLevels; level++) {</span><br><span class="line"> levels_[level].added_files = <span class="keyword">new</span> FileSet(cmp);</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// Apply all of the edits in *edit to the current state (represented by the current version)</span></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Apply</span><span class="params">(<span class="keyword">const</span> VersionEdit* edit)</span></span>;</span><br><span class="line"> <span class="comment">// Save the current state in Version v</span></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">SaveTo</span><span class="params">(Version* v)</span></span>;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>With the Builder, we first accumulate every VersionEdit into Builder::levels_ through Builder::Apply; once all VersionEdits have been added, a single call to Builder::SaveTo takes us from the base Version straight to the new Version;</p><h3
id="compaction-后对-VersionSet-的变更"><a href="#compaction-后对-VersionSet-的变更" class="headerlink" title="compaction 后对 VersionSet 的变更"></a>Changes to the VersionSet after a compaction</h3><p>Before reading this part, it helps to look at the compaction flow first: <a href>LevelDB Source Code Analysis [8]: Compaction</a></p><ul><li>The caller of a compaction is responsible for filling in a <strong>VersionEdit</strong> first</li><li><strong>VersionSet::LogAndApply</strong> uses VersionSet::Builder to apply the VersionEdit to the current state and build the newest version v, then serializes the VersionEdit and appends it to the manifest file as one session record</li><li>v is inserted at the tail of the VersionSet's circular doubly-linked list and becomes current_, and the VersionSet's state is updated (for example <strong>log_number_</strong> and <strong>prev_log_number_</strong>)</li></ul><blockquote><p>So throughout the whole flow, VersionEdit is the key class that ties compaction to VersionSet and Version</p></blockquote><p>To see how a compaction creates a VersionEdit, we first need to know the compaction call stack:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">DBImpl::Write->DBImpl::MakeRoomForWrite</span><br><span class="line">DBImpl::MaybeScheduleCompaction</span><br><span class="line">DBImpl::BGWork</span><br><span class="line">DBImpl::BackgroundCall</span><br><span class="line">DBImpl::BackgroundCompaction // the actual compaction work happens here</span><br><span class="line">DBImpl::CompactMemTable // minor compaction</span><br><span class="line">VersionSet::CompactRange // manual compaction</span><br><span class="line">VersionSet::PickCompaction // major compaction</span><br></pre></td></tr></table></figure><h4 id="对于-minor-compaction-来说是怎么创建-VersionEdit-的?"><a href="#对于-minor-compaction-来说是怎么创建-VersionEdit-的?"
class="headerlink" title="对于 minor compaction 来说是怎么创建 VersionEdit 的?"></a>How does a minor compaction create its VersionEdit?</h4><p>First, my guess:</p><p>Surely a VersionEdit object edit is created, and DBImpl::CompactMemTable is left to fill it in; the three most important members of edit are probably <strong>deleted_files_</strong>, <strong>new_files_</strong> and <strong>compact_pointers_</strong>.</p><p>A minor compaction only adds a new sstable file and never deletes old sstable files, and the new file normally goes to level 0;</p><p>so VersionEdit::deleted_files_ is naturally empty, and VersionEdit::new_files_ holds the metadata of the newly added sstable file;</p><p>Because the input files for compactions at levels 0~i are chosen round-robin, the largest key handled by a compaction has to be remembered and stored in VersionEdit::compact_pointers_ (judging from the code below, the minor compaction path never sets it, so this part of the guess only applies to major compaction);</p><p>Let's look at the code:</p><h4 id="DBImlp-CompactMemTable"><a href="#DBImlp-CompactMemTable" class="headerlink" title="DBImlp::CompactMemTable"></a>DBImpl::CompactMemTable</h4><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">DBImpl::CompactMemTable</span><span class="params">()</span> </span>{</span><br><span class="line"> 
mutex_.AssertHeld();</span><br><span class="line"> assert(imm_ != <span class="literal">nullptr</span>);</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Save the contents of the memtable as a new Table</span></span><br><span class="line"> VersionEdit edit;</span><br><span class="line"> Version* base = versions_->current();</span><br><span class="line"> base->Ref();</span><br><span class="line"> Status s = WriteLevel0Table(imm_, &edit, base);</span><br><span class="line"> base->Unref();</span><br><span class="line">... ...</span><br><span class="line"> <span class="comment">// Replace immutable memtable with the generated Table</span></span><br><span class="line"> <span class="keyword">if</span> (s.ok()) {</span><br><span class="line"> edit.SetPrevLogNumber(<span class="number">0</span>);</span><br><span class="line"> edit.SetLogNumber(logfile_number_); <span class="comment">// Earlier logs no longer needed</span></span><br><span class="line"> s = versions_->LogAndApply(&edit, &mutex_);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> (s.ok()) {</span><br><span class="line"> <span class="comment">// Commit to the new state</span></span><br><span class="line"> imm_->Unref();</span><br><span class="line"> imm_ = <span class="literal">nullptr</span>;</span><br><span class="line"> has_imm_.store(<span class="literal">false</span>, <span class="built_in">std</span>::memory_order_release);</span><br><span class="line"> RemoveObsoleteFiles();</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> RecordBackgroundError(s);</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>As I predicted, it first gets a VersionEdit object edit; next it lets <strong>WriteLevel0Table</strong> fill in edit from the immutable memtable; then it calls <strong>VersionSet::LogAndApply</strong> to apply the edit to the current state, producing a brand-new LevelDB version that is inserted at the tail of the circular doubly-linked list; finally the immutable memtable can be released;</p><p>So for a minor compaction, the VersionEdit is filled in by <strong>WriteLevel0Table</strong></p><h4 id="DBImlp-WriteLevel0Table"><a href="#DBImlp-WriteLevel0Table" class="headerlink" title="DBImlp::WriteLevel0Table"></a>DBImpl::WriteLevel0Table</h4><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Status <span class="title">DBImpl::WriteLevel0Table</span><span class="params">(MemTable* mem, VersionEdit* edit,</span></span></span><br><span class="line"><span class="function"><span class="params"> Version* base)</span> </span>{</span><br><span class="line"></span><br><span class="line"> FileMetaData meta;</span><br><span class="line"> meta.number = versions_->NewFileNumber();</span><br><span class="line"> ... ...</span><br><span class="line"> Iterator* iter = mem->NewIterator();</span><br><span class="line"></span><br><span class="line"> Status s;</span><br><span class="line"> {</span><br><span class="line"> mutex_.Unlock();</span><br><span class="line"> s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta); <span class="comment">// build the sstable and store its metadata in meta</span></span><br><span class="line"> mutex_.Lock();</span><br><span class="line"> }</span><br><span class="line">... 
...</span><br><span class="line"></span><br><span class="line"> edit->AddFile(level, meta.number, meta.file_size, meta.smallest,</span><br><span class="line"> meta.largest);</span><br><span class="line"> ... ...</span><br><span class="line"> <span class="keyword">return</span> s;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>A quick look at this code: it simply creates a FileMetaData object meta, which is the metadata of the sstable. How is it filled in? The job is handed to <strong>BuildTable</strong>: since BuildTable persists the data of the immutable memtable as an sstable, it knows best how to fill in meta.</p><p>Because this is a minor compaction, only one sstable is added, so edit only needs a single AddFile call. At that point edit is completely filled in, and the Version and VersionSet can be updated.</p><h4 id="对于-major-compaction-来说是怎么创建-VersionEdit-的?"><a href="#对于-major-compaction-来说是怎么创建-VersionEdit-的?" class="headerlink" title="对于 major compaction 来说是怎么创建 VersionEdit 的?"></a>How does a major compaction create its VersionEdit?</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">DBImpl::BackgroundCompaction</span><br><span class="line">VersionSet::PickCompaction // major compaction</span><br><span class="line">DBImpl::DoCompactionWork</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This flow was analyzed in <a href>LevelDB Source Code Analysis [8]: Compaction</a>:</p><ol><li>Pick the input files</li><li>Expand the set of input files</li><li>Multi-way merge</li><li>Score computation</li></ol><p><strong>VersionSet::PickCompaction</strong> mainly handles steps 1 and 2, and returns a Compaction object holding all the sstable files that need to be compacted;</p><p><strong>DBImpl::DoCompactionWork</strong> mainly handles steps 3 and 4;</p><h4 id="VersionSet-PickCompaction"><a href="#VersionSet-PickCompaction" class="headerlink" title="VersionSet::PickCompaction"></a>VersionSet::PickCompaction</h4><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Compaction* <span class="title">VersionSet::PickCompaction</span><span class="params">()</span> </span>{</span><br><span class="line"> Compaction* c;</span><br><span class="line"> <span class="keyword">int</span> level;</span><br><span class="line"></span><br><span 
class="line"> <span class="comment">// We prefer compactions triggered by too much data in a level over</span></span><br><span class="line"> <span class="comment">// the compactions triggered by seeks.</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">bool</span> size_compaction = (current_->compaction_score_ >= <span class="number">1</span>);</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">bool</span> seek_compaction = (current_->file_to_compact_ != <span class="literal">nullptr</span>);</span><br><span class="line"> <span class="keyword">if</span> (size_compaction) {</span><br><span class="line"> level = current_->compaction_level_;</span><br><span class="line"> assert(level >= <span class="number">0</span>);</span><br><span class="line"> assert(level + <span class="number">1</span> < config::kNumLevels);</span><br><span class="line"> c = <span class="keyword">new</span> Compaction(options_, level);</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Pick the first file that comes after compact_pointer_[level]</span></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">size_t</span> i = <span class="number">0</span>; i < current_->files_[level].size(); i++) {</span><br><span class="line"> FileMetaData* f = current_->files_[level][i];</span><br><span class="line"> <span class="comment">// DOC: pick the input file from level i</span></span><br><span class="line"> <span class="keyword">if</span> (compact_pointer_[level].empty() || <span class="comment">// DOC: level i has never been compacted before</span></span><br><span class="line"> icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > <span class="number">0</span>) { <span class="comment">// DOC: the sstable's largest key > compact_pointer_</span></span><br><span class="line"> c->inputs_[<span class="number">0</span>].push_back(f);</span><br><span class="line"> <span
class="keyword">break</span>;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">if</span> (c->inputs_[<span class="number">0</span>].empty()) {</span><br><span class="line"> <span class="comment">// Wrap-around to the beginning of the key space</span></span><br><span class="line"> c->inputs_[<span class="number">0</span>].push_back(current_->files_[level][<span class="number">0</span>]);</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> (seek_compaction) {</span><br><span class="line"> level = current_->file_to_compact_level_;</span><br><span class="line"> c = <span class="keyword">new</span> Compaction(options_, level);</span><br><span class="line"> c->inputs_[<span class="number">0</span>].push_back(current_->file_to_compact_);</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nullptr</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> c->input_version_ = current_;</span><br><span class="line"> c->input_version_->Ref();</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Files in level 0 may overlap each other, so pick up all overlapping ones</span></span><br><span class="line"> <span class="comment">// DOC: sstables in level 0 may overlap one another, so the input file set can be expanded within level 0</span></span><br><span class="line"> <span class="keyword">if</span> (level == <span class="number">0</span>) {</span><br><span class="line"> InternalKey smallest, largest;</span><br><span class="line"> GetRange(c->inputs_[<span class="number">0</span>], &smallest, &largest);</span><br><span class="line"> <span class="comment">// Note that the next call will discard the file we placed in</span></span><br><span class="line"> <span class="comment">// c->inputs_[0] earlier 
and replace it with an overlapping set</span></span><br><span class="line"> <span class="comment">// which will include the picked file.</span></span><br><span class="line"> current_->GetOverlappingInputs(<span class="number">0</span>, &smallest, &largest, &c->inputs_[<span class="number">0</span>]);</span><br><span class="line"> assert(!c->inputs_[<span class="number">0</span>].empty());</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// DOC: expand the input file set into level i + 1</span></span><br><span class="line"> SetupOtherInputs(c);</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> c;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><ol><li><a href="https://leveldb-handbook.readthedocs.io/zh/latest/version.html">版本控制</a> (the version control chapter of leveldb-handbook)</li><li><a href="http://catkang.github.io/2017/02/03/leveldb-version.html">庖丁解LevelDB之版本控制</a> (an article dissecting LevelDB's version control)</li></ol>]]></content>
<summary type="html"><h1 id="LevelDB-源码分析【7】——-Version-Control"><a href="#LevelDB-源码分析【7】——-Version-Control" class="headerlink" title="LevelDB 源码分析【7】—— Version </summary>
</entry>
<entry>
<title>leveldb Source Code Analysis [6]: Cache</title>
<link href="https://codroc.github.io/2022/08/08/leveldb6_cache/"/>
<id>https://codroc.github.io/2022/08/08/leveldb6_cache/</id>
<published>2022-08-08T11:57:16.000Z</published>
<updated>2022-08-08T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="leveldb-笔记一:缓存系统-Cache"><a href="#leveldb-笔记一:缓存系统-Cache" class="headerlink" title="leveldb 笔记一:缓存系统 Cache"></a>leveldb Notes, Part 1: The Cache System</h1><h3 id="LRUHandle"><a href="#LRUHandle" class="headerlink" title="LRUHandle"></a>LRUHandle</h3><blockquote><p>An entry is a variable length heap-allocated structure.</p></blockquote><p>LRUHandle is a node of a <strong>circular doubly linked list</strong> (used to implement the LRU replacement policy). Nodes in the list are ordered by access time.</p><p>Where does the variable length come from? Start with the struct definition:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">LRUHandle</span> {</span></span><br><span class="line"> <span class="keyword">void</span>* value;</span><br><span class="line"> <span class="keyword">void</span> (*deleter)(<span class="keyword">const</span> Slice&, <span class="keyword">void</span>* value);</span><br><span class="line"> LRUHandle* next_hash; <span class="comment">// Hash table chain pointer: links the handles that fall into the same bucket</span></span><br><span class="line"> LRUHandle* next;</span><br><span class="line"> LRUHandle* prev;</span><br><span class="line"> <span class="keyword">size_t</span> charge; <span class="comment">// TODO(opt): Only allow uint32_t?</span></span><br><span class="line"> <span
class="keyword">size_t</span> key_length;</span><br><span class="line"> <span class="keyword">bool</span> in_cache; <span class="comment">// Whether entry is in the cache.</span></span><br><span class="line"> <span class="keyword">uint32_t</span> refs; <span class="comment">// References, including cache reference, if present.</span></span><br><span class="line"> <span class="keyword">uint32_t</span> hash; <span class="comment">// Hash of key(); used for fast sharding and comparisons</span></span><br><span class="line"> <span class="keyword">char</span> key_data[<span class="number">1</span>]; <span class="comment">// Beginning of key</span></span><br><span class="line"></span><br><span class="line"> <span class="function">Slice <span class="title">key</span><span class="params">()</span> <span class="keyword">const</span> </span>{</span><br><span class="line"> <span class="comment">// next is only equal to this if the LRU handle is the list head of an</span></span><br><span class="line"> <span class="comment">// empty list. 
List heads never have meaningful keys.</span></span><br><span class="line"> assert(next != <span class="keyword">this</span>);</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> Slice(key_data, key_length);</span><br><span class="line"> }</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>Every field except key_data has a fixed size, so LRUHandle can be viewed as an object with a <strong>variable-length tail</strong>.</p><p><strong>Question 1:</strong> Why is the key stored directly inside LRUHandle instead of behind a char* key_data pointer? And why <code>char key_data[1]</code> rather than a flexible array member <code>char key_data[]</code>?</p><blockquote><p>Note: GCC, through its C99 support, accepts a flexible array member such as char key_data[]. The C++ standard, however, does not include flexible array members, so the field is declared as key_data[1], which is a common idiom in C++ code.</p></blockquote><p><strong>Answer 1:</strong> If a pointer were stored, the key it points to would need its own malloc, so creating an LRUHandle would take two allocations. Placing the key in the same block as the LRUHandle needs only one. malloc may trap into the kernel, so keeping the number of malloc calls low speeds things up.</p><h3 id="HandleTable"><a href="#HandleTable" class="headerlink" title="HandleTable"></a>HandleTable</h3><p>HandleTable is just a simple hash table. Start with its member variables:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">HandleTable</span> {</span></span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="keyword">uint32_t</span> length_;</span><br><span class="line"> <span class="keyword">uint32_t</span> elems_;</span><br><span class="line"> LRUHandle** list_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>It resolves hash collisions by separate chaining. There are length_ buckets, each bucket is a singly linked list, and the nodes of each list are LRUHandle objects linked through the next_hash field of the LRUHandle struct. elems_ is the total number of elements in the HandleTable, which is used later when the hash table needs a
Resize.</p><p>The hash table naturally starts out empty, so Resize is called right away to initialize it.</p><p>Resize has to handle two cases:</p><ol><li>Resize on an empty hash table</li><li>Resize on a non-empty hash table</li></ol><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">Resize</span><span class="params">()</span> </span>{</span><br><span class="line"> <span class="keyword">uint32_t</span> new_length = <span class="number">4</span>;</span><br><span class="line"> <span class="keyword">while</span> (new_length < elems_) {</span><br><span class="line"> new_length *= <span class="number">2</span>;</span><br><span class="line"> }</span><br><span class="line"> LRUHandle** new_list = <span class="keyword">new</span> LRUHandle*[new_length];</span><br><span class="line"> <span class="built_in">memset</span>(new_list, <span class="number">0</span>, <span class="keyword">sizeof</span>(new_list[<span class="number">0</span>]) * new_length);</span><br><span class="line"> <span class="keyword">uint32_t</span> count = <span class="number">0</span>;</span><br><span
class="line"> <span class="comment">// The following is needed only when the hash table is non-empty:</span></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">uint32_t</span> i = <span class="number">0</span>; i < length_; i++) {</span><br><span class="line"> LRUHandle* h = list_[i];</span><br><span class="line"> <span class="keyword">while</span> (h != <span class="literal">nullptr</span>) {</span><br><span class="line"> LRUHandle* next = h->next_hash; <span class="comment">// 1</span></span><br><span class="line"> <span class="keyword">uint32_t</span> hash = h->hash; <span class="comment">// 2</span></span><br><span class="line"> LRUHandle** ptr = &new_list[hash & (new_length - <span class="number">1</span>)]; <span class="comment">// 2</span></span><br><span class="line"> h->next_hash = *ptr;</span><br><span class="line"> *ptr = h;</span><br><span class="line"> h = next; <span class="comment">// 1</span></span><br><span class="line"> count++;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// End of the part needed only when the hash table is non-empty</span></span><br><span class="line"> assert(elems_ == count);</span><br><span class="line"> <span class="keyword">delete</span>[] list_;</span><br><span class="line"> list_ = new_list;</span><br><span class="line"> length_ = new_length;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>The only tricky part of this code is understanding what LRUHandle** actually is: it is an array in which every element is a <strong>pointer to the head node</strong> of an LRUHandle list. The while loop then re-links each node by <strong>inserting it at the head</strong> of the list in its new bucket.</p><p>Next come Insert, Remove, Lookup and FindPointer. Once FindPointer is understood, the others follow naturally.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span
class="function">LRUHandle** <span class="title">FindPointer</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">uint32_t</span> hash)</span> </span>{</span><br><span class="line"> LRUHandle** ptr = &list_[hash & (length_ - <span class="number">1</span>)];</span><br><span class="line"> <span class="keyword">while</span> (*ptr != <span class="literal">nullptr</span> && ((*ptr)->hash != hash || key != (*ptr)->key())) {</span><br><span class="line"> ptr = &(*ptr)->next_hash;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> ptr;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>As described earlier, each element of list_ is the head pointer of a singly linked list, so right after <code>LRUHandle** ptr = &list_[hash & (length_ - 1)];</code>, <code>ptr</code> is the address of a bucket's head pointer. Inside the while loop, however, <code>ptr</code> no longer means that: it becomes the address of a node's LRUHandle::next_hash field, while <code>*ptr</code> is still a pointer to an LRUHandle node (or nullptr at the end of the chain). The return value is therefore the address of the slot that points to the matching node, which is either a bucket's head pointer or some node's LRUHandle::next_hash, and that slot holds nullptr when nothing matches. <strong>So in the later Insert and Remove operations, we get the effect we need simply by modifying the value stored at the address <code>ptr</code> points to.</strong></p><h3 id="LRUCache"><a href="#LRUCache" class="headerlink" title="LRUCache"></a>LRUCache</h3><p>Logically, LRUCache is built from two lists and one hash table. The two lists store LRUHandle nodes and are circular doubly linked lists that implement the LRU replacement policy. The hash table speeds up node lookup to O(1) and resolves hash collisions by separate chaining.</p><p>Two lists, one hash table:</p><ul><li>LRUHandle <strong>lru_</strong> GUARDED_BY(mutex_); // dummy head node</li><li>LRUHandle <strong>in_use_</strong> GUARDED_BY(mutex_); // dummy head node</li><li>HandleTable <strong>table_</strong> GUARDED_BY(mutex_);</li></ul><p>The two lists relate as follows. A Handle inside LRUCache can be in one of four states:</p><p><img src="https://s4.ax1x.com/2022/01/13/7MccTI.png"></p><ol><li>*in use (ref=2)*: the Handle is in the HandleTable and linked into the <code>in_use_</code> list. It is referenced both by an external user and by the cache through the <code>in_use_</code> list, so ref=2.</li><li>*in lru (ref=1)*: the Handle is in the HandleTable and linked into the <code>lru_</code> list. Only <code>lru_</code> references it, so ref=1.</li><li>*not in lru, not in table (ref=1)*: the Handle is in neither list nor the HandleTable, but an external user still holds a reference that has not been released, so ref=1.</li><li>*not 
in lru, not in table (ref=0)*: the Handle is in neither list nor the HandleTable and is not used externally either, so ref=0.</li></ol><p>When an LRUCache is destroyed, the in_use_ list must be empty, that is, there must be <strong>no node that is still referenced externally and linked into a list</strong> (state 1). After that, Unref can be called on each node in the lru_ list in turn so that the node's deleter releases its resources.</p><p>One aspect of the LRUCache class follows good modern C++ practice and is well worth learning from:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LRUCache</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function">Cache::Handle* <span class="title">Insert</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">uint32_t</span> hash, <span class="keyword">void</span>* value,</span></span></span><br><span class="line"><span class="function"><span class="params"> <span class="keyword">size_t</span> charge,</span></span></span><br><span class="line"><span class="function"><span class="params"> <span class="keyword">void</span> (*deleter)(<span class="keyword">const</span> Slice& key, <span class="keyword">void</span>* value))</span></span>;</span><br><span class="line"> <span class="function">Cache::Handle* <span class="title">Lookup</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">uint32_t</span> hash)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span
class="title">Release</span><span class="params">(Cache::Handle* handle)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Erase</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">uint32_t</span> hash)</span></span>;</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">LRU_Remove</span><span class="params">(LRUHandle* e)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">LRU_Append</span><span class="params">(LRUHandle* <span class="built_in">list</span>, LRUHandle* e)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Ref</span><span class="params">(LRUHandle* e)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Unref</span><span class="params">(LRUHandle* e)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">FinishErase</span><span class="params">(LRUHandle* e)</span> <span class="title">EXCLUSIVE_LOCKS_REQUIRED</span><span class="params">(mutex_)</span></span>;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>In the <strong>public interface</strong>, <code>LRUHandle*</code> is exposed as <code>Cache::Handle*</code>, while the <strong>private interface</strong> keeps using <code>LRUHandle*</code>. In effect this hides <code>LRUHandle*</code> from callers.</p><h3 id="ShardedLRUCache"><a href="#ShardedLRUCache" class="headerlink" title="ShardedLRUCache"></a>ShardedLRUCache</h3><p>This class exists to reduce lock contention. The leveldb cache system supports concurrent access, so every LRUCache is protected by a mutex. With only one LRUCache, accesses look concurrent from the outside, but to keep things thread-safe they are all serialized inside the critical sections of its methods. If the cache is sharded instead, that is, if the number of LRUCache instances is increased (in practice, an array of LRUCache) and each access is routed by hash to one particular LRUCache,
then different LRUCache instances can be accessed in parallel while each remains thread-safe, which raises the concurrency of the whole cache system.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">int</span> kNumShardBits = <span class="number">4</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">int</span> kNumShards = <span class="number">1</span> << kNumShardBits;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">ShardedLRUCache</span> :</span> <span class="keyword">public</span> Cache {</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> LRUCache shard_[kNumShards]; <span class="comment">// Cache shards; the hash selects which LRUCache to use</span></span><br><span class="line"> port::Mutex id_mutex_;</span><br><span class="line"> <span class="keyword">uint64_t</span> last_id_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html"><h1 id="leveldb-笔记一:缓存系统-Cache"><a href="#leveldb-笔记一:缓存系统-Cache" class="headerlink" title="leveldb 笔记一:缓存系统 Cache"></a>leveldb 笔记一:缓存系统 Cac</summary>
</entry>
<entry>
<title>leveldb Source Code Analysis [5]: BloomFilter</title>
<link href="https://codroc.github.io/2022/08/06/leveldb5_bloom_filter/"/>
<id>https://codroc.github.io/2022/08/06/leveldb5_bloom_filter/</id>
<published>2022-08-06T05:14:16.000Z</published>
<updated>2022-08-06T05:14:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【5】——-BloomFilter"><a href="#LevelDB-源码分析【5】——-BloomFilter" class="headerlink" title="LevelDB 源码分析【5】—— BloomFilter"></a>LevelDB Source Code Analysis [5]: BloomFilter</h1><p>The logical layout of the SSTable described earlier includes a Filter Block, which reduces read amplification and makes reads more efficient. In the Leveldb source, the abstract filtering strategy is represented by the FilterPolicy class, an application of the dependency inversion principle: <strong>depend on abstractions, not on concretions</strong>.</p><p>The filter policy that Leveldb provides out of the box is the Bloom filter. leveldb uses it to test whether a given key exists in an sstable: if the filter says the key is absent, the key is definitely absent, which speeds up lookups.</p><p>For a detailed introduction to the data structure, see <a href="https://leveldb-handbook.readthedocs.io/zh/latest/bloomfilter.html">布隆过滤器</a> (the Bloom filter chapter of leveldb-handbook). This post focuses on how the Leveldb source implements the Bloom filter.</p><p>A few important points are still worth stating. The parameters that determine how well a Bloom filter performs are:</p><ul><li>the number of hash functions, k</li><li>the capacity of the filter's bit array, m</li><li>the number of keys inserted into the filter, n</li></ul><p>The main mathematical results are:</p><ol><li>The false positive rate is lowest when k = ln2 * (m/n).</li><li>With the optimal number of hash functions, keeping the false positive rate at or below ε requires m to be at least 1.44 times the theoretical minimum, which is n * log2(1/ε).</li></ol><h2 id="FilterPolicy"><a href="#FilterPolicy" class="headerlink" title="FilterPolicy"></a>FilterPolicy</h2><p>First, look at how the abstract filter policy interface is designed.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">FilterPolicy</span> {</span></span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">const</span> <span class="keyword">char</span>* <span class="title">Name</span><span class="params">()</span> <span class="keyword">const</span> </span>= <span class="number">0</span>;</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">void</span> <span class="title">CreateFilter</span><span class="params">(<span class="keyword">const</span> Slice* keys, <span
class="keyword">int</span> n, <span class="built_in">std</span>::<span class="built_in">string</span>* dst)</span> <span class="keyword">const</span> </span>= <span class="number">0</span>;</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">bool</span> <span class="title">KeyMayMatch</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& filter)</span> <span class="keyword">const</span> </span>= <span class="number">0</span>;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">const</span> FilterPolicy* <span class="title">NewBloomFilterPolicy</span><span class="params">(<span class="keyword">int</span> bits_per_key)</span></span>;</span><br></pre></td></tr></table></figure><p><strong>FilterPolicy::CreateFilter</strong> takes an array of Slice holding keys that are already sorted (by the user-supplied comparator). It computes a filter that summarizes every key in keys and appends the result to dst. For performance reasons Leveldb passes a std::string* here, and that string may already hold data, so the method must only append to it and must not write over the existing contents in place; otherwise things break.</p><p><strong>FilterPolicy::KeyMayMatch</strong> uses the filter to report whether the given key may be in keys: false means the key is definitely absent, while true means it may be present.</p><h2 id="Bloom-Filter"><a href="#Bloom-Filter" class="headerlink" title="Bloom Filter"></a>Bloom Filter</h2><p>Now look at how the bloom filter is actually implemented.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">uint32_t</span> <span 
class="title">BloomHash</span><span class="params">(<span class="keyword">const</span> Slice& key)</span> </span>{</span><br><span class="line"> <span class="keyword">return</span> Hash(key.data(), key.size(), <span class="number">0xbc9f1d34</span>);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">BloomFilterPolicy</span> :</span> <span class="keyword">public</span> FilterPolicy {</span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">BloomFilterPolicy</span><span class="params">(<span class="keyword">int</span> bits_per_key)</span> : <span class="title">bits_per_key_</span><span class="params">(bits_per_key)</span> </span>{</span><br><span class="line"> <span class="comment">// We intentionally round down to reduce probing cost a little bit</span></span><br><span class="line"> k_ = <span class="keyword">static_cast</span><<span class="keyword">size_t</span>>(bits_per_key * <span class="number">0.69</span>); <span class="comment">// 0.69 =~ ln(2)</span></span><br><span class="line"> <span class="keyword">if</span> (k_ < <span class="number">1</span>) k_ = <span class="number">1</span>;</span><br><span class="line"> <span class="keyword">if</span> (k_ > <span class="number">30</span>) k_ = <span class="number">30</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">const</span> <span class="keyword">char</span>* <span class="title">Name</span><span class="params">()</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{ <span class="keyword">return</span> <span class="string">"leveldb.BuiltinBloomFilter2"</span>; }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span 
class="keyword">void</span> <span class="title">CreateFilter</span><span class="params">(<span class="keyword">const</span> Slice* keys, <span class="keyword">int</span> n, <span class="built_in">std</span>::<span class="built_in">string</span>* dst)</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> <span class="comment">// Compute bloom filter size (in both bits and bytes)</span></span><br><span class="line"> <span class="keyword">size_t</span> bits = n * bits_per_key_;</span><br><span class="line"></span><br><span class="line"> <span class="comment">// For small n, we can see a very high false positive rate. Fix it</span></span><br><span class="line"> <span class="comment">// by enforcing a minimum bloom filter length.</span></span><br><span class="line"> <span class="keyword">if</span> (bits < <span class="number">64</span>) bits = <span class="number">64</span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">size_t</span> bytes = (bits + <span class="number">7</span>) / <span class="number">8</span>;</span><br><span class="line"> bits = bytes * <span class="number">8</span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> init_size = dst->size();</span><br><span class="line"> dst->resize(init_size + bytes, <span class="number">0</span>);</span><br><span class="line"> dst->push_back(<span class="keyword">static_cast</span><<span class="keyword">char</span>>(k_)); <span class="comment">// Remember # of probes in filter</span></span><br><span class="line"> <span class="keyword">char</span>* <span class="built_in">array</span> = &(*dst)[init_size];</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">int</span> i = <span class="number">0</span>; i < n; i++) {</span><br><span class="line"> <span class="comment">// Use double-hashing to generate a 
sequence of hash values.</span></span><br><span class="line"> <span class="comment">// See analysis in [Kirsch,Mitzenmacher 2006].</span></span><br><span class="line"> <span class="keyword">uint32_t</span> h = BloomHash(keys[i]);</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">uint32_t</span> delta = (h >> <span class="number">17</span>) | (h << <span class="number">15</span>); <span class="comment">// Rotate right 17 bits</span></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">size_t</span> j = <span class="number">0</span>; j < k_; j++) {</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">uint32_t</span> bitpos = h % bits;</span><br><span class="line"> <span class="built_in">array</span>[bitpos / <span class="number">8</span>] |= (<span class="number">1</span> << (bitpos % <span class="number">8</span>));</span><br><span class="line"> h += delta;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">KeyMayMatch</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& bloom_filter)</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> len = bloom_filter.size();</span><br><span class="line"> <span class="keyword">if</span> (len < <span class="number">2</span>) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">char</span>* <span class="built_in">array</span> = bloom_filter.data();</span><br><span class="line"> <span class="keyword">const</span> <span 
class="keyword">size_t</span> bits = (len - <span class="number">1</span>) * <span class="number">8</span>;</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Use the encoded k so that we can read filters generated by</span></span><br><span class="line"> <span class="comment">// bloom filters created using different parameters.</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> k = <span class="built_in">array</span>[len - <span class="number">1</span>];</span><br><span class="line"> <span class="keyword">if</span> (k > <span class="number">30</span>) {</span><br><span class="line"> <span class="comment">// Reserved for potentially new encodings for short bloom filters.</span></span><br><span class="line"> <span class="comment">// Consider it a match.</span></span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">uint32_t</span> h = BloomHash(key);</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">uint32_t</span> delta = (h >> <span class="number">17</span>) | (h << <span class="number">15</span>); <span class="comment">// Rotate right 17 bits</span></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">size_t</span> j = <span class="number">0</span>; j < k; j++) {</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">uint32_t</span> bitpos = h % bits;</span><br><span class="line"> <span class="keyword">if</span> ((<span class="built_in">array</span>[bitpos / <span class="number">8</span>] & (<span class="number">1</span> << (bitpos % <span class="number">8</span>))) == <span class="number">0</span>) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"> h += 
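delta;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">private</span>:</span><br><span class="line"> <span class="keyword">size_t</span> bits_per_key_;</span><br><span class="line"> <span class="keyword">size_t</span> k_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>First you need an ordinary hash function. std::hash would do, but LevelDB implements its own (BloomHash). CreateFilter and KeyMayMatch follow almost the same procedure. Both of them:</p><ol><li>Hash the key to get a 32-bit hash value h</li><li>Rotate h right by 17 bits (which swaps its upper 15 bits and lower 17 bits) to get delta</li><li>Loop k times (k_ in CreateFilter, the k stored in the last byte of the filter in KeyMayMatch); each iteration takes h modulo bits as a bit position and then adds delta to h</li></ol><p>The only difference is that CreateFilter sets the bit at each of those positions in the filter, whereas KeyMayMatch reads the filter and checks whether the bit at each position is set. If any bit is 0, the key is definitely not present.</p><p>In the constructor, following the formula above with bits_per_key = m/n, the optimal number of probes is k_ = ln 2 × bits_per_key. The code computes this as bits_per_key × 0.69 and clamps the result to the range [1, 30].</p>]]></content>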
<summary type="html"><h1 id="LevelDB-源码分析【5】——-BloomFilter"><a href="#LevelDB-源码分析【5】——-BloomFilter" class="headerlink" title="LevelDB 源码分析【5】—— BloomFilter"></a</summary>
</entry>
<entry>
<title>LevelDB Source Code Analysis [4]: SSTable</title>
<link href="https://codroc.github.io/2022/08/05/leveldb4_sstable/"/>
<id>https://codroc.github.io/2022/08/05/leveldb4_sstable/</id>
<published>2022-08-05T11:57:16.000Z</published>
<updated>2022-08-05T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【4】——-SSTable"><a href="#LevelDB-源码分析【4】——-SSTable" class="headerlink" title="LevelDB 源码分析【4】—— SSTable"></a>LevelDB Source Code Analysis [4]: SSTable</h1><p>SSTable is a file storage format. The data in the MemTable is eventually serialized, compressed, and persisted to stable storage (disk) in this format.</p><h3 id="SSTable-物理结构"><a href="#SSTable-物理结构" class="headerlink" title="SSTable 物理结构"></a>SSTable Physical Structure</h3><p>An SSTable file is divided into blocks of a fixed target size (typically 4 KB each). <strong>This raises a question for me: why should a file be divided into fixed-size blocks?</strong></p><p><strong>Each block consists of three parts:</strong></p><ul><li>Data: the serialized and compressed data</li><li>Compression Type: the compression algorithm; leveldb uses Snappy by default</li><li>CRC: a cyclic redundancy checksum that covers both Data and Compression Type</li></ul><p><img src="https://s2.loli.net/2022/07/26/eGcVozFlOIZ96XH.png" alt="图1"></p><h3 id="SSTable-逻辑结构"><a href="#SSTable-逻辑结构" class="headerlink" title="SSTable 逻辑结构"></a>SSTable Logical Structure</h3><p>Logically, leveldb further divides an sstable into the following parts according to their function:</p><ol><li><strong>data block</strong>: stores the key-value pairs;</li><li><strong>filter block</strong>: stores filter data (Bloom filters); if the user does not configure a filter policy, leveldb stores nothing in this block;</li><li><strong>meta index block</strong>: stores the index information of the filter block (its offset within the sstable file and its length);</li><li><strong>index block</strong>: stores the index information of each data block;</li><li><strong>footer</strong>: stores the index information of the meta index block and the index block;</li></ol><p><img src="https://s2.loli.net/2022/07/26/v6nlSXJqfu13ERM.png" alt="sstable1.PNG"></p><blockquote><p>Note that blocks of types 1-4 all use the physical layout described in the physical structure section above: each block carries its own compression type and CRC checksum.</p></blockquote><p>To see how each kind of block is built, look at <code>table/block_builder.h</code> and <code>table/block_builder.cc</code></p><h3 id="data-block"><a href="#data-block" class="headerlink" title="data block"></a>data block</h3><p>A data block stores leveldb key-value pairs. The data portion of a data block (excluding the compression type and CRC checksum) is logically laid out as shown below:</p><p><img src="https://s2.loli.net/2022/07/26/MgVXsplz3mt6Of7.png" alt="sstable2.PNG"></p><p>The first part stores the key-value data. Because all key-value pairs in an sstable are stored in strictly sorted order, leveldb saves space by not storing the full key of every pair. It stores only <strong>the part not shared with the previous key</strong>,
which avoids storing the repeated key prefix.</p><p>Every few key-value pairs, a record stores its complete key again. This repeats at a fixed interval (16 by default), and each point where a full key is stored is called a restart point.</p><blockquote><p>leveldb uses restart points to speed up lookups when reading an sstable.</p><p>Because every restart point stores a complete key, a lookup can first compare the target key against the keys at the restart points to quickly locate the region that may contain the target;</p><p>once the region is known, the entries in that region are compared key by key for a fine-grained search;</p><p>the idea is somewhat like a skip list, where the upper levels locate the target quickly and the bottom level does the detailed search, which lowers the lookup cost.</p></blockquote><p><img src="https://s2.loli.net/2022/07/26/VBtcb2R9SJY6Gd3.png" alt="sstable3.PNG"></p><p>An entry consists of 5 parts:</p><ol><li>the length of the key prefix shared with the previous record;</li><li>the length of the key portion not shared with the previous record;</li><li>the value length;</li><li>the non-shared key content;</li><li>the value content;</li></ol><p>So when a {key, value} pair arrives, how is it encoded into an Entry?</p><p>The code:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span 
class="keyword">void</span> <span class="title">BlockBuilder::Add</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> </span>{</span><br><span class="line"> <span class="function">Slice <span class="title">last_key_piece</span><span class="params">(last_key_)</span></span>;</span><br><span class="line"> assert(!finished_);</span><br><span class="line"> assert(counter_ <= options_->block_restart_interval);</span><br><span class="line"> assert(buffer_.empty() <span class="comment">// No values yet?</span></span><br><span class="line"> || options_->comparator->Compare(key, last_key_piece) > <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">size_t</span> shared = <span class="number">0</span>;</span><br><span class="line"> <span class="keyword">if</span> (counter_ < options_->block_restart_interval) {</span><br><span class="line"> <span class="comment">// See how much sharing to do with previous string</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> min_length = <span class="built_in">std</span>::min(last_key_piece.size(), key.size());</span><br><span class="line"> <span class="keyword">while</span> ((shared < min_length) && (last_key_piece[shared] == key[shared])) {</span><br><span class="line"> shared++;</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// Restart compression</span></span><br><span class="line"> restarts_.push_back(buffer_.size());</span><br><span class="line"> counter_ = <span class="number">0</span>;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> non_shared = key.size() - shared;</span><br><span class="line"></span><br><span class="line"> <span class="comment">// DOC: 制作 
Entry</span></span><br><span class="line"> <span class="comment">// Add "<shared><non_shared><value_size>" to buffer_</span></span><br><span class="line"> PutVarint32(&buffer_, shared);</span><br><span class="line"> PutVarint32(&buffer_, non_shared);</span><br><span class="line"> PutVarint32(&buffer_, value.size());</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Add string delta to buffer_ followed by value</span></span><br><span class="line"> buffer_.append(key.data() + shared, non_shared);</span><br><span class="line"> buffer_.append(value.data(), value.size());</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Update state</span></span><br><span class="line"> last_key_.resize(shared);</span><br><span class="line"> last_key_.append(key.data() + shared, non_shared);</span><br><span class="line"> assert(Slice(last_key_) == key);</span><br><span class="line"> counter_++;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>We can think of BlockBuilder as the builder of a Block: its job is to build one Block.</p><p>Now let's see how it builds a data block:</p><ul><li>First it decides whether a new restart point is needed. If so, the entry stores a complete key, so <code>shared key length = 0</code>; otherwise it computes the length of the prefix shared with the previous key;</li><li>It builds the Entry from the shared key length</li><li>It serializes the Entry into the buffer</li></ul><p>When a data block reaches the size threshold, the restart points and the number of restart points are serialized to the end of the block as well, as the following code shows:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Slice <span class="title">BlockBuilder::Finish</span><span class="params">()</span> </span>{</span><br><span class="line"> <span class="comment">// Append restart 
array</span></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">size_t</span> i = <span class="number">0</span>; i < restarts_.size(); i++) {</span><br><span class="line"> PutFixed32(&buffer_, restarts_[i]);</span><br><span class="line"> }</span><br><span class="line"> PutFixed32(&buffer_, restarts_.size());</span><br><span class="line"> finished_ = <span class="literal">true</span>;</span><br><span class="line"> <span class="keyword">return</span> Slice(buffer_);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>So when is <strong>BlockBuilder::Finish</strong> called? <strong>TableBuilder::Add</strong> gives a hint:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">TableBuilder::Add</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> </span>{</span><br><span class="line">... 
...</span><br><span class="line"> <span class="comment">// DOC: flush data_block once its estimated size reaches the configured block_size</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> estimated_block_size = r->data_block.CurrentSizeEstimate();</span><br><span class="line"> <span class="keyword">if</span> (estimated_block_size >= r->options.block_size) {</span><br><span class="line"> Flush();</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>When the space currently occupied by data_block is >= the configured threshold (options.block_size), <strong>Flush</strong> is invoked, and <strong>Flush</strong> in turn calls <strong>WriteBlock</strong>. Here is its implementation:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">TableBuilder::WriteBlock</span><span 
class="params">(BlockBuilder* block, BlockHandle* handle)</span> </span>{</span><br><span class="line"> <span class="comment">// File format contains a sequence of blocks where each block has:</span></span><br><span class="line"> <span class="comment">// block_data: uint8[n]</span></span><br><span class="line"> <span class="comment">// type: uint8</span></span><br><span class="line"> <span class="comment">// crc: uint32</span></span><br><span class="line"> assert(ok());</span><br><span class="line"> Rep* r = rep_;</span><br><span class="line"> Slice raw = block->Finish(); <span class="comment">// 1</span></span><br><span class="line"></span><br><span class="line"> Slice block_contents;</span><br><span class="line"> CompressionType type = r->options.compression;</span><br><span class="line"> <span class="comment">// TODO(postrelease): Support more compression options: zlib?</span></span><br><span class="line"> <span class="keyword">switch</span> (type) { <span class="comment">// 2</span></span><br><span class="line"> <span class="keyword">case</span> kNoCompression:</span><br><span class="line"> block_contents = raw;</span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">case</span> kSnappyCompression: {</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span>* compressed = &r->compressed_output;</span><br><span class="line"> <span class="keyword">if</span> (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&</span><br><span class="line"> compressed->size() < raw.size() - (raw.size() / <span class="number">8u</span>)) {</span><br><span class="line"> block_contents = *compressed;</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// Snappy not supported, or compressed less than 12.5%, so just</span></span><br><span class="line"> <span class="comment">// 
store uncompressed form</span></span><br><span class="line"> block_contents = raw;</span><br><span class="line"> type = kNoCompression;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> WriteRawBlock(block_contents, type, handle); <span class="comment">// 3</span></span><br><span class="line"> r->compressed_output.clear();</span><br><span class="line"> block->Reset();</span><br><span class="line">}</span><br></pre></td></tr></table></figure><ol><li>It calls <strong>BlockBuilder::Finish</strong> directly to obtain the serialized data block;</li><li>It then compresses the data block, keeping the compressed form only if Snappy is available and saves more than 12.5% (even the famous LevelDB picks the compression algorithm with a switch-case before compressing... <strong>which clearly violates the open-closed principle</strong>);</li><li>Finally it calls <strong>WriteRawBlock</strong></li></ol><p><strong>WriteRawBlock:</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">TableBuilder::WriteRawBlock</span><span class="params">(<span class="keyword">const</span> Slice& block_contents,</span></span></span><br><span class="line"><span class="function"><span class="params"> CompressionType type, BlockHandle* handle)</span> </span>{</span><br><span class="line"> Rep* r = rep_;</span><br><span class="line"> 
handle->set_offset(r->offset);</span><br><span class="line"> handle->set_size(block_contents.size());</span><br><span class="line"> r->status = r->file->Append(block_contents);</span><br><span class="line"> <span class="keyword">if</span> (r->status.ok()) {</span><br><span class="line"> <span class="keyword">char</span> trailer[kBlockTrailerSize];</span><br><span class="line"> trailer[<span class="number">0</span>] = type;</span><br><span class="line"> <span class="keyword">uint32_t</span> crc = crc32c::Value(block_contents.data(), block_contents.size());</span><br><span class="line"> crc = crc32c::Extend(crc, trailer, <span class="number">1</span>); <span class="comment">// Extend crc to cover block type</span></span><br><span class="line"> EncodeFixed32(trailer + <span class="number">1</span>, crc32c::Mask(crc));</span><br><span class="line"> r->status = r->file->Append(Slice(trailer, kBlockTrailerSize));</span><br><span class="line"> <span class="keyword">if</span> (r->status.ok()) {</span><br><span class="line"> r->offset += block_contents.size() + kBlockTrailerSize;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>As you can see, it simply appends a trailer after the block contents, a 1-byte compression type followed by a 4-byte masked CRC32C checksum, exactly as shown in Figure 1.</p><h3 id="Index-Block"><a href="#Index-Block" class="headerlink" title="Index Block"></a>Index Block</h3><p>The index block stores the index information of all data blocks.</p><p>It contains a number of Entries, each of which describes one data block.</p><p><strong>An index entry contains:</strong></p><ol><li><strong>the largest key in data block i (or a shorter separator key, as explained below);</strong></li><li><strong>the offset of the start of that data block within the sstable;</strong></li><li><strong>the size of that data block;</strong></li></ol><p>These three pieces are eventually turned into one Entry and appended to the Index Block.</p><p><img src="https://s2.loli.net/2022/07/27/BZslgiH2481RNuh.png" alt="sstable4.PNG"></p><blockquote><p>Here, the largest key of data block i also serves as the key of this record in the index block.</p><p>The point of this design is that comparing the keys of the index block records one after another is enough to quickly locate which data block holds the target data.</p></blockquote><p>In the source code, the variables related to the
Index Block are:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">BlockBuilder index_block; <span class="comment">// DOC: the Index Block from the sstable logical structure</span></span><br><span class="line"><span class="keyword">bool</span> pending_index_entry; <span class="comment">// DOC: true means a data block has just been flushed to the sstable and an index entry for it must be built next</span></span><br><span class="line">BlockHandle pending_handle; <span class="comment">// DOC: handle (offset and size) of that data block, used to build the index entry</span></span><br></pre></td></tr></table></figure><p>When exactly is an index entry appended to the Index Block? To use shorter keys in the entries, LevelDB postpones building the index entry for data block i until the first key of the next block, data block i + 1, is inserted. It can then pick a shorter key such that key >= the largest key in data block i and key < the smallest key in data block i + 1.</p><blockquote><p>The source comment: </p><p> // We do not emit the index entry for a block until we have seen the<br> // first key for the next data block. This allows us to use shorter<br> // keys in the index block. For example, consider a block boundary<br> // between the keys “the quick brown fox” and “the who”. 
We can use<br> // “the r” as the key for the index block entry since it is >= all<br> // entries in the first block and < all entries in subsequent<br> // blocks.</p></blockquote><p>The index block entry is therefore built in <strong>TableBuilder::Add</strong>:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">TableBuilder::Add</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> </span>{</span><br><span class="line"> ... 
...</span><br><span class="line"> <span class="comment">// DOC: if this is the first key of a new data_block, build the index block entry for the previous one</span></span><br><span class="line"> <span class="keyword">if</span> (r->pending_index_entry) {</span><br><span class="line"> assert(r->data_block.empty());</span><br><span class="line"> r->options.comparator->FindShortestSeparator(&r->last_key, key); <span class="comment">// Find a shorter key: last_key is the largest key in data block i, and key is the first (smallest) key in data block i + 1; a shortest key that lies between the two satisfies the requirement</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> handle_encoding;</span><br><span class="line"> r->pending_handle.EncodeTo(&handle_encoding); </span><br><span class="line"> r->index_block.Add(r->last_key, Slice(handle_encoding));</span><br><span class="line"> r->pending_index_entry = <span class="literal">false</span>;</span><br><span class="line"> }</span><br><span class="line">... ...</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Here <strong>pending_handle</strong> records the offset and size of each data block within the sstable when <strong>TableBuilder::WriteRawBlock</strong> is called (that is, when a data block is flushed to the sstable).</p><p>You can see that the index record (handle_encoding) is appended to the Index Block through <strong>BlockBuilder::Add</strong>, so it is stored in the Entry format as well.</p><h3 id="Filter-Block"><a href="#Filter-Block" class="headerlink" title="Filter Block"></a>Filter Block</h3><p>To speed up lookups in an sstable, before searching a data block directly, leveldb first consults the filter data in the filter block to decide whether the given data block could contain the requested key. If the filter says it does not, that data block need not be searched at all.</p><p>The filter block stores filtering information about the data blocks. This is normally Bloom filter data, used to speed up queries.</p><p><strong>The data stored in a filter block falls into two parts: (1) filter data and (2) index data.</strong></p><p>Within the index data, <code>filter i offset</code> is the starting offset of the i-th filter data within the filter block, and <code>filter offset's offset</code> is the offset of the index data itself within the filter block.</p><p>When reading a filter block, first read the value of <code>filter offset's offset</code>, then read each <code>filter i offset</code> in turn, and use these
offsets to read out each <code>filter data</code>.</p><p>Base Lg defaults to 11, meaning that a new filter is created for every 2 KB (2^11 bytes) of data.</p><p><strong>An sstable has only one filter block</strong>, which stores the filter data for all of its data blocks. Specifically, filter data k covers the keys of every block whose starting offset falls in [base*k, base*(k+1)). Partitioning by data size rather than per block keeps the filters reasonably even when some blocks hold many keys and others only a few.</p><p><img src="https://s2.loli.net/2022/07/27/g4BQufCRzc1lX3q.png" alt="sstable5.PNG"></p><p>How is a Filter Block built in leveldb? That is the job of the FilterBlockBuilder class:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">FilterBlockBuilder</span> {</span></span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">FilterBlockBuilder</span><span class="params">(<span class="keyword">const</span> FilterPolicy*)</span></span>;</span><br><span class="line"></span><br><span class="line"> FilterBlockBuilder(<span class="keyword">const</span> FilterBlockBuilder&) = <span class="keyword">delete</span>;</span><br><span class="line"> FilterBlockBuilder& <span class="keyword">operator</span>=(<span class="keyword">const</span> FilterBlockBuilder&) = <span 
class="keyword">delete</span>;</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">StartBlock</span><span class="params">(<span class="keyword">uint64_t</span> block_offset)</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">AddKey</span><span class="params">(<span class="keyword">const</span> Slice& key)</span></span>;</span><br><span class="line"> <span class="function">Slice <span class="title">Finish</span><span class="params">()</span></span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">private</span>:</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">GenerateFilter</span><span class="params">()</span></span>;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">const</span> FilterPolicy* policy_;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> keys_; <span class="comment">// Flattened key contents</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="keyword">size_t</span>> start_; <span class="comment">// Starting index in keys_ of each key</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> result_; <span class="comment">// Filter data computed so far</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><Slice> tmp_keys_; <span class="comment">// policy_->CreateFilter() argument</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="keyword">uint32_t</span>> filter_offsets_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>For extensibility (to support multiple filtering policies) it takes a
<strong>FilterPolicy</strong> pointer. The whole build process must call StartBlock, AddKey, and Finish <strong>in order</strong>.</p><p><strong>FilterBlockBuilder::StartBlock</strong> is called every time a new data block is started</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">size_t</span> kFilterBaseLg = <span class="number">11</span>;</span><br><span class="line"><span class="keyword">static</span> <span class="keyword">const</span> <span class="keyword">size_t</span> kFilterBase = <span class="number">1</span> << kFilterBaseLg;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">FilterBlockBuilder::StartBlock</span><span class="params">(<span class="keyword">uint64_t</span> block_offset)</span> </span>{</span><br><span class="line"> <span class="keyword">uint64_t</span> filter_index = (block_offset / kFilterBase);</span><br><span class="line"> assert(filter_index >= filter_offsets_.size());</span><br><span class="line"> <span class="keyword">while</span> (filter_index > filter_offsets_.size()) {</span><br><span class="line"> GenerateFilter();</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This code simply uses GenerateFilter to create a new filter for every 2 KB of file offset; a 2 KB range that received no keys gets an empty filter.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">FilterBlockBuilder::GenerateFilter</span><span class="params">()</span> </span>{</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> num_keys = start_.size(); <span class="comment">// DOC: start_[i] is the offset of the i-th key in keys_</span></span><br><span class="line"> <span class="keyword">if</span> (num_keys == <span class="number">0</span>) {</span><br><span class="line"> <span class="comment">// Fast path if there are no keys for this filter</span></span><br><span class="line"> filter_offsets_.push_back(result_.size());</span><br><span class="line"> <span class="keyword">return</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Make list of keys from flattened key structure</span></span><br><span class="line"> start_.push_back(keys_.size()); <span class="comment">// Simplify length computation</span></span><br><span class="line"> tmp_keys_.resize(num_keys);</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">size_t</span> i = <span class="number">0</span>; i < num_keys; i++) {</span><br><span class="line"> <span class="keyword">const</span> 
<span class="keyword">char</span>* base = keys_.data() + start_[i];</span><br><span class="line"> <span class="keyword">size_t</span> length = start_[i + <span class="number">1</span>] - start_[i];</span><br><span class="line"> tmp_keys_[i] = Slice(base, length);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Generate filter for current set of keys and append to result_.</span></span><br><span class="line"> filter_offsets_.push_back(result_.size());</span><br><span class="line"> <span class="comment">// DOC: presumably CreateFilter is what actually builds Filter Data i</span></span><br><span class="line"> <span class="comment">// and appends the resulting raw data to result_</span></span><br><span class="line"> policy_->CreateFilter(&tmp_keys_[<span class="number">0</span>], <span class="keyword">static_cast</span><<span class="keyword">int</span>>(num_keys), &result_);</span><br><span class="line"></span><br><span class="line"> tmp_keys_.clear();</span><br><span class="line"> keys_.clear();</span><br><span class="line"> start_.clear();</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><strong>FilterBlockBuilder::GenerateFilter</strong> rebuilds the list of keys from keys_ and start_, has the policy build a filter from them and append it to result_, and records that filter's starting offset in filter_offsets_. I have not fully worked through this part yet and will come back to it later.</p><h2 id="footer结构"><a href="#footer结构" class="headerlink" title="footer结构"></a>Footer structure</h2><p>The footer has a fixed size of 48 bytes. It stores the locations of the meta index block and the index block within the sstable, and its tail holds a magic word: the first 8 bytes of the SHA-1 hash of the string "<a href="http://code.google.com/p/leveldb/">http://code.google.com/p/leveldb/</a>".</p><h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><ol><li><p><a href="https://leveldb-handbook.readthedocs.io/zh/latest/sstable.html">sstable</a></p></li><li><p><a href="https://riverferry.site/2021-10-27-leveldb%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90(5)%20sstable%E6%96%87%E4%BB%B6/">leveldb源码分析(5) sstable文件</a></p></li><li><p><a 
href="http://catkang.github.io/2017/01/17/leveldb-data.html">庖丁解LevelDB之数据存储</a></p></li></ol>]]></content>
<summary type="html"><h1 id="LevelDB-源码分析【4】——-SSTable"><a href="#LevelDB-源码分析【4】——-SSTable" class="headerlink" title="LevelDB 源码分析【4】—— SSTable"></a>LevelDB 源码分</summary>
</entry>
<entry>
<title>leveldb Source Code Analysis [3]: MemTable, the In-Memory Table</title>
<link href="https://codroc.github.io/2022/08/04/leveldb3_memtable/"/>
<id>https://codroc.github.io/2022/08/04/leveldb3_memtable/</id>
<published>2022-08-04T11:57:16.000Z</published>
<updated>2022-08-04T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【3】——-内存表-MemTable"><a href="#LevelDB-源码分析【3】——-内存表-MemTable" class="headerlink" title="LevelDB 源码分析【3】—— 内存表 MemTable"></a>LevelDB Source Code Analysis [3]: MemTable, the In-Memory Table</h1><p>This post focuses on how leveldb's MemTable is implemented, and why it is implemented this way.</p><p>Techniques used by MemTable:</p><ul><li><p>Skip list</p></li><li><p>Reference counting</p></li><li><p>Iterators</p></li><li><p>Comparators</p></li><li><p>The Arena memory allocator (already covered in the first post)</p></li></ul><p>Let's go through them one by one.</p><h2 id="跳表"><a href="#跳表" class="headerlink" title="跳表"></a>Skip list</h2><p>The code lives in <code>db/skiplist.h</code>. SkipList is a class template:</p><ul><li>It takes two template parameters, Key and Comparator: the type of the keys and how keys are compared.</li><li>Node represents a single node in the skip list.</li><li>Iterator points at a skip list node.</li></ul><p><strong>Definition of the skip list Node:</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="keyword">typename</span> Key, 
<span class="class"><span class="keyword">class</span> <span class="title">Comparator</span>></span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">SkipList</span><</span>Key, Comparator>::Node {</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">Node</span><span class="params">(<span class="keyword">const</span> Key& k)</span> : <span class="title">key</span><span class="params">(k)</span> </span>{}</span><br><span class="line"></span><br><span class="line"> Key <span class="keyword">const</span> key;</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Accessors/mutators for links. Wrapped in methods so we can</span></span><br><span class="line"> <span class="comment">// add the appropriate barriers as necessary.</span></span><br><span class="line"> <span class="function">Node* <span class="title">Next</span><span class="params">(<span class="keyword">int</span> n)</span> </span>{</span><br><span class="line"> assert(n >= <span class="number">0</span>);</span><br><span class="line"> <span class="comment">// Use an 'acquire load' so that we observe a fully initialized</span></span><br><span class="line"> <span class="comment">// version of the returned Node.</span></span><br><span class="line"> <span class="keyword">return</span> next_[n].load(<span class="built_in">std</span>::memory_order_acquire);</span><br><span class="line"> }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">SetNext</span><span class="params">(<span class="keyword">int</span> n, Node* x)</span> </span>{</span><br><span class="line"> assert(n >= <span class="number">0</span>);</span><br><span class="line"> <span class="comment">// Use a 'release store' so that anybody who reads through this</span></span><br><span class="line"> <span class="comment">// pointer observes a fully initialized 
version of the inserted node.</span></span><br><span class="line"> next_[n].store(x, <span class="built_in">std</span>::memory_order_release);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// No-barrier variants that can be safely used in a few locations.</span></span><br><span class="line"> <span class="function">Node* <span class="title">NoBarrier_Next</span><span class="params">(<span class="keyword">int</span> n)</span> </span>{</span><br><span class="line"> assert(n >= <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">return</span> next_[n].load(<span class="built_in">std</span>::memory_order_relaxed);</span><br><span class="line"> }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">NoBarrier_SetNext</span><span class="params">(<span class="keyword">int</span> n, Node* x)</span> </span>{</span><br><span class="line"> assert(n >= <span class="number">0</span>);</span><br><span class="line"> next_[n].store(x, <span class="built_in">std</span>::memory_order_relaxed);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">private</span>:</span><br><span class="line"> <span class="comment">// Array of length equal to the node height. 
next_[0] is lowest level link.</span></span><br><span class="line"> <span class="built_in">std</span>::atomic<Node*> next_[<span class="number">1</span>];</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>A Node has only two members: key and the next_ array, where next_[i] is the pointer to the next node at level i. The accessors for next_ come in two flavors: Next and SetNext use acquire/release memory ordering (with barriers), while NoBarrier_Next and NoBarrier_SetNext use relaxed memory ordering (no barriers).</p><p><strong>The skip list Iterator:</strong></p><p>It is a bidirectional iterator that supports both Next and Prev, but the two do not cost the same. Next is O(1). Prev has to search again from the head (leveldb implements it with FindLessThan), because the skip list is built from several levels of singly linked lists with no back pointers; that search takes O(log n) on average.</p><p>It provides the following methods:</p><ul><li>Valid: reports whether the iterator is valid; an invalid iterator has a null Node</li><li>key: returns the key of the current Node</li><li>Next, Prev</li><li>Seek: positions the iterator at the first node whose key is >= the target key</li><li>SeekToFirst: positions the iterator at the first node</li><li>SeekToLast: positions the iterator at the last node</li></ul><p>The two most important operations on a skip list are writing and reading, implemented by <strong>SkipList::Insert</strong> and <strong>Iterator::Seek</strong> respectively.</p><p><strong>A caller can write to the skip list like this:</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">SkipList<Key, Comparator> skip(cmp, &arena);</span><br><span class="line">Key key;</span><br><span class="line"><span class="keyword">if</span> (!skip.Contains(key)) {</span><br><span class="line"> skip.Insert(key);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><strong>A caller can read from the skip list like this:</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">SkipList<Key, Comparator> skip(cmp, &arena);</span><br><span class="line">Key key;</span><br><span class="line"><span class="keyword">if</span> (skip.Contains(key)) {</span><br><span class="line"> SkipList<Key, Comparator>::<span class="function">Iterator <span 
class="title">iter</span><span class="params">(&skip)</span></span>;</span><br><span class="line"> iter.Seek(key); assert(iter.Valid() && iter.key() == key);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="SkipList-Insert"><a href="#SkipList-Insert" class="headerlink" title="SkipList::Insert"></a>SkipList::Insert</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="keyword">typename</span> Key, <span class="class"><span class="keyword">class</span> <span class="title">Comparator</span>></span></span><br><span class="line"><span class="keyword">void</span> SkipList<Key, Comparator>::Insert(<span class="keyword">const</span> Key& key) {</span><br><span class="line"> <span class="comment">// Find the predecessor nodes of the insertion point</span></span><br><span class="line"> Node* prev[kMaxHeight];</span><br><span class="line"> Node* x = FindGreaterOrEqual(key, prev);</span><br><span class="line"></span><br><span class="line"> assert(x == <span class="literal">nullptr</span> || !Equal(key, x->key));</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Pick a random height for the new node and fill in the rest of prev</span></span><br><span 
class="line"> <span class="keyword">int</span> height = RandomHeight();</span><br><span class="line"> <span class="keyword">if</span> (height > GetMaxHeight()) {</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">int</span> i = GetMaxHeight(); i < height; i++) {</span><br><span class="line"> prev[i] = head_;</span><br><span class="line"> }</span><br><span class="line"> max_height_.store(height, <span class="built_in">std</span>::memory_order_relaxed);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Create the new node and link it into the skip list</span></span><br><span class="line"> x = NewNode(key, height);</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">int</span> i = <span class="number">0</span>; i < height; i++) {</span><br><span class="line"> x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));</span><br><span class="line"> prev[i]->SetNext(i, x);</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Once we have the new node's predecessor at every level, we can link the new node into the skip list.</p><h3 id="Iterator-Seek"><a href="#Iterator-Seek" class="headerlink" title="Iterator::Seek"></a>Iterator::Seek</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="keyword">typename</span> Key, <span class="class"><span class="keyword">class</span> <span class="title">Comparator</span>></span></span><br><span class="line"><span class="keyword">inline</span> <span class="keyword">void</span> SkipList<Key, Comparator>::Iterator::Seek(<span class="keyword">const</span> Key& target) {</span><br><span class="line"> node_ = list_->FindGreaterOrEqual(target, <span class="literal">nullptr</span>);</span><br><span 
class="line">}</span><br></pre></td></tr></table></figure><p>Both reads and writes rely on <strong>SkipList::FindGreaterOrEqual</strong>:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span> <<span class="keyword">typename</span> Key, <span class="class"><span class="keyword">class</span> <span class="title">Comparator</span>></span></span><br><span class="line"><span class="keyword">typename</span> SkipList<Key, Comparator>::Node*</span><br><span class="line">SkipList<Key, Comparator>::FindGreaterOrEqual(<span class="keyword">const</span> Key& key,</span><br><span class="line"> Node** prev) <span class="keyword">const</span> {</span><br><span class="line"> Node* x = head_;</span><br><span class="line"> <span class="keyword">int</span> level = GetMaxHeight() - <span class="number">1</span>;</span><br><span class="line"> <span class="keyword">while</span> (<span class="literal">true</span>) {</span><br><span class="line"> Node* next = x->Next(level);</span><br><span class="line"> <span class="keyword">if</span> (KeyIsAfterNode(key, next)) {</span><br><span class="line"> <span class="comment">// Keep searching in this list</span></span><br><span class="line"> x = next;</span><br><span class="line"> } 
<span class="keyword">else</span> {</span><br><span class="line"> <span class="keyword">if</span> (prev != <span class="literal">nullptr</span>) prev[level] = x;</span><br><span class="line"> <span class="keyword">if</span> (level == <span class="number">0</span>) {</span><br><span class="line"> <span class="keyword">return</span> next;</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// Switch to next list</span></span><br><span class="line"> level--;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This method can be called in two ways:</p><ul><li>prev == nullptr: the call only returns the <strong>first node whose key is >= key</strong></li><li>prev != nullptr: in addition to the above, the call records <strong>the node at each level from which the search drops down a level</strong></li></ul><p> What is a node from which the search drops down? It is simply the predecessor, at that level, of the position where key belongs.</p><blockquote><p>Note: leveldb's skip list does not implement node deletion, because it never needs it; changes to a MemTable are only ever appended.</p></blockquote><h2 id="引用计数"><a href="#引用计数" class="headerlink" title="引用计数"></a>Reference counting</h2><p>Reference counting records how many users currently hold the MemTable:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MemTable</span> {</span></span><br><span class="line"> ... 
...</span><br><span class="line"> <span class="comment">// Increase reference count.</span></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Ref</span><span class="params">()</span> </span>{ ++refs_; }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Drop reference count. Delete if no more references exist.</span></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Unref</span><span class="params">()</span> </span>{</span><br><span class="line"> --refs_;</span><br><span class="line"> assert(refs_ >= <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">if</span> (refs_ <= <span class="number">0</span>) {</span><br><span class="line"> <span class="keyword">delete</span> <span class="keyword">this</span>;</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> ... ...</span><br><span class="line"> <span class="keyword">int</span> refs_;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>When a call to <strong>Unref</strong> brings <strong>refs_</strong> down to 0, the MemTable deletes itself. Why do it this way? A likely reason is that a MemTable can be in use by several parties at once (for example the write path and iterators held by readers), so it has to stay alive until the last user releases it.</p><h2 id="迭代器"><a href="#迭代器" class="headerlink" title="迭代器"></a>Iterator</h2><p>MemTable defines its own iterator for walking the skip list: it derives from the generic Iterator base class and wraps the functionality of the skip list's iterator.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MemTableIterator</span> :</span> <span class="keyword">public</span> Iterator {</span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">MemTableIterator</span><span class="params">(MemTable::Table* table)</span> : <span class="title">iter_</span><span class="params">(table)</span> </span>{}</span><br><span class="line"></span><br><span class="line"> MemTableIterator(<span class="keyword">const</span> MemTableIterator&) = <span class="keyword">delete</span>;</span><br><span class="line"> MemTableIterator& <span class="keyword">operator</span>=(<span class="keyword">const</span> MemTableIterator&) = <span class="keyword">delete</span>;</span><br><span class="line"></span><br><span class="line"> ~MemTableIterator() <span class="keyword">override</span> = <span class="keyword">default</span>;</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">Valid</span><span class="params">()</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{ <span class="keyword">return</span> iter_.Valid(); }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Seek</span><span class="params">(<span class="keyword">const</span> Slice& k)</span> <span class="keyword">override</span> </span>{ iter_.Seek(EncodeKey(&tmp_, k)); }</span><br><span class="line"> <span 
class="function"><span class="keyword">void</span> <span class="title">SeekToFirst</span><span class="params">()</span> <span class="keyword">override</span> </span>{ iter_.SeekToFirst(); }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">SeekToLast</span><span class="params">()</span> <span class="keyword">override</span> </span>{ iter_.SeekToLast(); }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Next</span><span class="params">()</span> <span class="keyword">override</span> </span>{ iter_.Next(); }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Prev</span><span class="params">()</span> <span class="keyword">override</span> </span>{ iter_.Prev(); }</span><br><span class="line"> <span class="function">Slice <span class="title">key</span><span class="params">()</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{ <span class="keyword">return</span> GetLengthPrefixedSlice(iter_.key()); }</span><br><span class="line"> <span class="function">Slice <span class="title">value</span><span class="params">()</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> Slice key_slice = GetLengthPrefixedSlice(iter_.key());</span><br><span class="line"> <span class="keyword">return</span> GetLengthPrefixedSlice(key_slice.data() + key_slice.size());</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function">Status <span class="title">status</span><span class="params">()</span> <span class="keyword">const</span> <span class="keyword">override</span> </span>{ <span class="keyword">return</span> Status::OK(); }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">private</span>:</span><br><span class="line"> 
MemTable::Table::Iterator iter_;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> tmp_; <span class="comment">// For passing to EncodeKey</span></span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>The Iterator base class it inherits from only declares pure virtual functions; MemTableIterator implements them by delegating to <strong>iter_</strong>.</p><h2 id="比较器"><a href="#比较器" class="headerlink" title="比较器"></a>Comparator</h2><p>The skip list inside MemTable is instantiated with these template arguments:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">KeyComparator</span> {</span></span><br><span class="line"> <span class="keyword">const</span> InternalKeyComparator comparator;</span><br><span class="line"> <span class="function"><span class="keyword">explicit</span> <span class="title">KeyComparator</span><span class="params">(<span class="keyword">const</span> InternalKeyComparator& c)</span> : <span class="title">comparator</span><span class="params">(c)</span> </span>{}</span><br><span class="line"> <span class="function"><span class="keyword">int</span> <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> <span class="keyword">char</span>* a, <span class="keyword">const</span> <span class="keyword">char</span>* b)</span> <span class="keyword">const</span></span>;</span><br><span class="line">};</span><br><span class="line"><span class="keyword">typedef</span> SkipList<<span class="keyword">const</span> <span class="keyword">char</span>*, KeyComparator> Table;</span><br></pre></td></tr></table></figure><p>Here <code>Key = const char*, Comparator = KeyComparator</code>. The key is just a pointer to the start of a buffer, and that buffer holds the serialized <strong>{key, value} 
pair</strong>. To compare the strings rather than the <strong>const char*</strong> pointer values, the comparator first decodes the length-prefixed key from each serialized <strong>{key, value} pair</strong> and then compares those keys.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">int</span> MemTable::KeyComparator::<span class="keyword">operator</span>()(<span class="keyword">const</span> <span class="keyword">char</span>* aptr,</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">char</span>* bptr) <span class="keyword">const</span> {</span><br><span class="line"> <span class="comment">// Internal keys are encoded as length-prefixed strings.</span></span><br><span class="line"> Slice a = GetLengthPrefixedSlice(aptr); <span class="comment">// decode the length-prefixed key of the first entry into a</span></span><br><span class="line"> Slice b = GetLengthPrefixedSlice(bptr); <span class="comment">// decode the length-prefixed key of the second entry into b</span></span><br><span class="line"> <span class="keyword">return</span> comparator.Compare(a, b); <span class="comment">// now a and b can be compared</span></span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Next, let's look at MemTable's Add and Get methods in detail.</p><h2 id="MemTable-Add"><a href="#MemTable-Add" class="headerlink" title="MemTable::Add"></a>MemTable::Add</h2><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">MemTable::Add</span><span class="params">(SequenceNumber s, ValueType type, <span class="keyword">const</span> Slice& key,</span></span></span><br><span class="line"><span class="function"><span class="params"> <span class="keyword">const</span> Slice& value)</span> </span>{</span><br><span class="line"> <span class="comment">// Format of an entry is concatenation of:</span></span><br><span class="line"> <span class="comment">// key_size : varint32 of internal_key.size()</span></span><br><span class="line"> <span class="comment">// key bytes : char[internal_key.size()]</span></span><br><span class="line"> <span class="comment">// tag : uint64((sequence << 8) | type)</span></span><br><span class="line"> <span class="comment">// value_size : varint32 of value.size()</span></span><br><span class="line"> <span class="comment">// value bytes : char[value.size()]</span></span><br><span class="line"> <span class="keyword">size_t</span> key_size = key.size();</span><br><span class="line"> <span class="keyword">size_t</span> val_size = value.size();</span><br><span class="line"> <span class="keyword">size_t</span> internal_key_size = key_size + <span class="number">8</span>;</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> encoded_len = VarintLength(internal_key_size) +</span><br><span class="line"> internal_key_size + VarintLength(val_size) +</span><br><span class="line"> val_size;</span><br><span class="line"> <span 
class="keyword">char</span>* buf = arena_.Allocate(encoded_len); <span class="comment">// note: this is not an aligned allocation</span></span><br><span class="line"> <span class="keyword">char</span>* p = EncodeVarint32(buf, internal_key_size);</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">memcpy</span>(p, key.data(), key_size);</span><br><span class="line"> p += key_size;</span><br><span class="line"> EncodeFixed64(p, (s << <span class="number">8</span>) | type);</span><br><span class="line"> p += <span class="number">8</span>;</span><br><span class="line"> p = EncodeVarint32(p, val_size);</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">memcpy</span>(p, value.data(), val_size);</span><br><span class="line"> assert(p + val_size == buf + encoded_len);</span><br><span class="line"> table_.Insert(buf);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>The last line, <code>table_.Insert(buf)</code>, inserts a key into the skip list. That key is a buffer holding the serialized <strong>{key,value} pair</strong>. Each record (Entry) has the format shown in the figure below:</p><p><img src="https://s2.loli.net/2022/07/20/HbpvSM4N3zZdhR9.png" alt="node_entry.PNG"></p><p>One point to note: when encoding, the tag is written immediately after the user key and always occupies a fixed 8 bytes, and the encoded key_size counts both. Get has to take this into account when decoding.</p><h2 id="MemTable-Get"><a href="#MemTable-Get" class="headerlink" title="MemTable::Get"></a>MemTable::Get</h2><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">bool</span> <span class="title">MemTable::Get</span><span class="params">(<span class="keyword">const</span> LookupKey& key, <span class="built_in">std</span>::<span class="built_in">string</span>* value, Status* s)</span> </span>{</span><br><span class="line"> Slice memkey = key.memtable_key();</span><br><span class="line"> <span class="function">Table::Iterator <span class="title">iter</span><span class="params">(&table_)</span></span>;</span><br><span class="line"> iter.Seek(memkey.data());</span><br><span class="line"> <span class="keyword">if</span> (iter.Valid()) {</span><br><span class="line"> <span class="comment">// entry format is:</span></span><br><span class="line"> <span class="comment">// klength varint32</span></span><br><span class="line"> <span class="comment">// userkey char[klength]</span></span><br><span class="line"> <span class="comment">// tag uint64</span></span><br><span class="line"> <span class="comment">// vlength varint32</span></span><br><span class="line"> <span class="comment">// value char[vlength]</span></span><br><span class="line"> <span class="comment">// Check that it belongs to same user key. 
We do not check the</span></span><br><span class="line"> <span class="comment">// sequence number since the Seek() call above should have skipped</span></span><br><span class="line"> <span class="comment">// all entries with overly large sequence numbers.</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">char</span>* entry = iter.key();</span><br><span class="line"> <span class="keyword">uint32_t</span> key_length;</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">char</span>* key_ptr = GetVarint32Ptr(entry, entry + <span class="number">5</span>, &key_length);</span><br><span class="line"> <span class="keyword">if</span> (comparator_.comparator.user_comparator()->Compare(</span><br><span class="line"> Slice(key_ptr, key_length - <span class="number">8</span>), key.user_key()) == <span class="number">0</span>) {</span><br><span class="line"> <span class="comment">// Correct user key</span></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">uint64_t</span> tag = DecodeFixed64(key_ptr + key_length - <span class="number">8</span>);</span><br><span class="line"> <span class="keyword">switch</span> (<span class="keyword">static_cast</span><ValueType>(tag & <span class="number">0xff</span>)) {</span><br><span class="line"> <span class="keyword">case</span> kTypeValue: {</span><br><span class="line"> Slice v = GetLengthPrefixedSlice(key_ptr + key_length);</span><br><span class="line"> value->assign(v.data(), v.size());</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">case</span> kTypeDeletion:</span><br><span class="line"> *s = Status::NotFound(Slice());</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line"> }</span><br><span class="line"> 
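}</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>The overall flow is:</p><ul><li>Seek to memtable_key (the length-prefixed key + tag), which locates the record (Entry)</li><li>Check whether the key in the entry matches user_key</li><li>Extract the tag: if it is kTypeValue, set the corresponding value; if it is kTypeDeletion, the key has been deleted and the status is set to NotFound</li></ul><p>One key piece is not yet analyzed: <strong>LookupKey</strong>. The MemTable may hold several redundant entries for the same user_key whose memtable_key (key + tag) differs only because the sequence differs. On lookup we want the newest memtable_key, so how is that achieved? My guess is that it relies on the sequence, which matches the comment in the code above saying that Seek() skips entries with overly large sequence numbers. The sequence part involves the snapshot and version code, which I will cover later.</p>]]></content>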
<summary type="html"><h1 id="LevelDB-源码分析【3】——-内存表-MemTable"><a href="#LevelDB-源码分析【3】——-内存表-MemTable" class="headerlink" title="LevelDB 源码分析【3】—— 内存表 MemTable"></summary>
</entry>
<entry>
<title>leveldb source code analysis [2]: data mutation with DBImpl::Write</title>
<link href="https://codroc.github.io/2022/08/02/leveldb2_data_mutation/"/>
<id>https://codroc.github.io/2022/08/02/leveldb2_data_mutation/</id>
<published>2022-08-02T11:57:16.000Z</published>
<updated>2022-08-02T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【2】——-数据变更-DBImpl-Write"><a href="#LevelDB-源码分析【2】——-数据变更-DBImpl-Write" class="headerlink" title="LevelDB 源码分析【2】—— 数据变更 DBImpl::Write"></a>LevelDB Source Code Analysis [2]: Data Mutation with DBImpl::Write</h1><p><strong>leveldb::Slice</strong> is essentially C++17's <strong>std::string_view</strong> with an extra <code>ToString</code> method.</p><p>Let's look at the execution flow of a <strong>leveldb::DB::Put</strong>:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">leveldb::DB::Put --> WriteBatch::Put --> DBImpl::Write (blocks until 1. the WAL write completes; 2. the memtable insert completes)</span><br><span class="line">Flow inside DBImpl::Write:</span><br><span class="line">--> DBImpl::MakeRoomForWrite --> DBImpl::BuildBatchGroup --> Writer::AddRecord (step 1: WAL) --> WriteBatchInternal::InsertInto (step 2: insert the data into the memtable)</span><br></pre></td></tr></table></figure><p><strong>DBImpl::Write</strong> applies a mutation to the database. It involves two steps, which are also the core of an LSM tree:</p><ul><li>Build the WAL record and write it to disk, handled by <strong>Writer::AddRecord</strong></li><li>Insert the mutation into the memtable, handled by <strong>WriteBatchInternal::InsertInto</strong></li></ul><p>Because Writer::AddRecord flushes data to disk, leveldb reduces the number of flushes, and thus latency, by using <strong>DBImpl::BuildBatchGroup</strong> to merge several WriteBatches into a single WriteBatch before calling Writer::AddRecord.</p><p>Let's look at the code in detail.</p><h2 id="源码分析"><a href="#源码分析" class="headerlink" title="源码分析"></a>Source Code Analysis</h2><h3 id="WriteBatch-Put"><a href="#WriteBatch-Put" class="headerlink" title="WriteBatch::Put"></a>WriteBatch::Put</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="function"><span class="keyword">void</span> <span class="title">WriteBatch::Put</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> </span>{</span><br><span class="line"> <span class="comment">// Starting at the 9th byte, set count (int32, little-endian)</span></span><br><span class="line"> WriteBatchInternal::SetCount(<span class="keyword">this</span>, WriteBatchInternal::Count(<span class="keyword">this</span>) + <span class="number">1</span>);</span><br><span class="line"> <span class="comment">// For the first record, the 13th byte holds the type: 0->delete 1->value</span></span><br><span class="line"> rep_.push_back(<span class="keyword">static_cast</span><<span class="keyword">char</span>>(kTypeValue)); </span><br><span class="line"> PutLengthPrefixedSlice(&rep_, key); <span class="comment">// from the 14th byte on: varint32 (key size) + char* (key data)</span></span><br><span class="line"> PutLengthPrefixedSlice(&rep_, value); <span class="comment">// then append varint32 (value size) + char* (value data)</span></span><br><span class="line"> <span class="comment">// At this point rep_ holds the serialized record; see the diagram below</span></span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>A WriteBatch is really just a buffer, or string, that stores serialized <strong>operations</strong>. A header in front (12 bytes, per the offsets in the comments above) records how many operations it contains.</p><p>An operation can be written as <strong>type key [value]</strong>, where key and value are both strings and type is one of two kinds:</p><ul><li>put: modify or add an item</li><li>delete: remove an item</li></ul><p>Whether value is present depends on the operation: a put must carry a value, while a delete needs none.</p><p>String lengths are encoded as varints, so underneath, the whole WriteBatch is simply a serialized byte stream, as shown in the figure below:</p><p><img src="https://s2.loli.net/2022/07/18/tD9J6ENCOsL14nj.png" alt="WriteBatch0.png"></p><h3 id="DBImpl-BuildBatchGroup"><a href="#DBImpl-BuildBatchGroup" class="headerlink" title="DBImpl::BuildBatchGroup"></a>DBImpl::BuildBatchGroup</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">WriteBatch* <span class="title">DBImpl::BuildBatchGroup</span><span class="params">(Writer** last_writer)</span> </span>{</span><br><span class="line">... 
...</span><br><span class="line"> Writer* first = writers_.front();</span><br><span class="line"> WriteBatch* result = first->batch;</span><br><span class="line"> <span class="comment">// Merge every WriteBatch that can be merged into a single batch</span></span><br><span class="line"> <span class="comment">// size / max_size: running size and size cap of the merged batch (setup elided here)</span></span><br><span class="line"> *last_writer = first;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">deque</span><Writer*>::iterator iter = writers_.begin();</span><br><span class="line"> ++iter; <span class="comment">// Advance past "first"</span></span><br><span class="line"> <span class="keyword">for</span> (; iter != writers_.end(); ++iter) {</span><br><span class="line"> Writer* w = *iter;</span><br><span class="line"> <span class="keyword">if</span> (w->sync && !first->sync) { <span class="comment">// stop here: a sync write must not be merged in</span></span><br><span class="line"> <span class="comment">// Do not include a sync write into a batch handled by a non-sync write.</span></span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> (w->batch != <span class="literal">nullptr</span>) {</span><br><span class="line"> size += WriteBatchInternal::ByteSize(w->batch);</span><br><span class="line"> <span class="keyword">if</span> (size > max_size) {</span><br><span class="line"> <span class="comment">// Do not make batch too big</span></span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"> }</span><br><span class="line"><span class="comment">// Start merging: append everything except the header of each other WriteBatch to one WriteBatch</span></span><br><span class="line"> <span class="comment">// Append to *result</span></span><br><span class="line"> <span class="keyword">if</span> (result == first->batch) {</span><br><span class="line"> <span 
class="comment">// Switch to temporary batch instead of disturbing caller's batch</span></span><br><span class="line"> result = tmp_batch_;</span><br><span class="line"> assert(WriteBatchInternal::Count(result) == <span class="number">0</span>);</span><br><span class="line"> WriteBatchInternal::Append(result, first->batch);</span><br><span class="line"> }</span><br><span class="line"> WriteBatchInternal::Append(result, w->batch);</span><br><span class="line"> }</span><br><span class="line"> *last_writer = w;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> result;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This merges as many WriteBatches as possible into one, presumably to reduce the number of flushes and lower latency.</p><h3 id="Writer-AddRecord"><a href="#Writer-AddRecord" class="headerlink" title="Writer::AddRecord"></a>Writer::AddRecord</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span 
class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Status <span class="title">Writer::AddRecord</span><span class="params">(<span class="keyword">const</span> Slice& slice)</span> </span>{</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">char</span>* ptr = slice.data();</span><br><span class="line"> <span class="keyword">size_t</span> left = slice.size();</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Fragment the record if necessary and emit it. 
Note that if slice</span></span><br><span class="line"> <span class="comment">// is empty, we still want to iterate once to emit a single</span></span><br><span class="line"> <span class="comment">// zero-length record</span></span><br><span class="line"> Status s;</span><br><span class="line"> <span class="keyword">bool</span> begin = <span class="literal">true</span>;</span><br><span class="line"> <span class="keyword">do</span> {</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">int</span> leftover = kBlockSize - block_offset_;</span><br><span class="line"> assert(leftover >= <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">if</span> (leftover < kHeaderSize) {</span><br><span class="line"> <span class="comment">// Switch to a new block</span></span><br><span class="line"> <span class="keyword">if</span> (leftover > <span class="number">0</span>) {</span><br><span class="line"> <span class="comment">// Fill the trailer (literal below relies on kHeaderSize being 7)</span></span><br><span class="line"> <span class="keyword">static_assert</span>(kHeaderSize == <span class="number">7</span>, <span class="string">""</span>);</span><br><span class="line"> dest_->Append(Slice(<span class="string">"\x00\x00\x00\x00\x00\x00"</span>, leftover));</span><br><span class="line"> }</span><br><span class="line"> block_offset_ = <span class="number">0</span>;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Invariant: we never leave < kHeaderSize bytes in a block.</span></span><br><span class="line"> assert(kBlockSize - block_offset_ - kHeaderSize >= <span class="number">0</span>);</span><br><span class="line"></span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">size_t</span> avail = kBlockSize - block_offset_ - kHeaderSize;</span><br><span class="line"> <span class="keyword">const</span> <span 
class="keyword">size_t</span> fragment_length = (left < avail) ? left : avail;</span><br><span class="line"></span><br><span class="line"> RecordType type;</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">bool</span> end = (left == fragment_length);</span><br><span class="line"> <span class="comment">// Use begin and end to classify the fragment as</span></span><br><span class="line"> <span class="comment">// 1. a full record</span></span><br><span class="line"> <span class="comment">// 2. the first fragment</span></span><br><span class="line"> <span class="comment">// 3. the last fragment</span></span><br><span class="line"> <span class="comment">// 4. a middle fragment</span></span><br><span class="line"> <span class="keyword">if</span> (begin && end) {</span><br><span class="line"> type = kFullType;</span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> (begin) {</span><br><span class="line"> type = kFirstType;</span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> (end) {</span><br><span class="line"> type = kLastType;</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> type = kMiddleType;</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// EmitPhysicalRecord flushes internally, so this call may be slow</span></span><br><span class="line"> s = EmitPhysicalRecord(type, ptr, fragment_length); </span><br><span class="line"> ptr += fragment_length;</span><br><span class="line"> left -= fragment_length;</span><br><span class="line"> begin = <span class="literal">false</span>;</span><br><span class="line"> } <span class="keyword">while</span> (s.ok() && left > <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">return</span> s;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This function turns a WriteBatch into a log record, places it into blocks, and finally flushes it to disk. The log is laid out in blocks of kBlockSize bytes, and a record that does not fit in the current block is split into fragments. The log record format is shown in the figure below:</p><p><img 
src="https://s2.loli.net/2022/07/19/J8wtLShUziYlMVv.png" alt="log_record.png"></p><p><strong>EmitPhysicalRecord</strong> is the function that actually builds the log record. It takes 3 arguments:</p><ul><li>type: how the record is stored in the block, either split into fragments or as a whole</li><li>ptr: points to the first byte of the current fragment of the WriteBatch byte stream</li><li>fragment_length: the length of the WriteBatch <strong>byte-stream fragment</strong> that fits into the current block</li></ul><p>Depending on the size of the WriteBatch, several storage forms are possible:</p><ul><li>WriteBatch larger than the space available in the block: the data of one log record may be stored across multiple blocks, in one of two ways:<ul><li>kMiddleType: a middle fragment fills an entire block;</li><li>kFirstType, kLastType: the first and last fragments of a record whose data is split across blocks;</li></ul></li><li>WriteBatch no larger than the space available in the block: the log record is appended to the end of the block as one full record (kFullType)</li></ul><p><strong>Once the WAL write is done, the next step is updating the memtable</strong></p><h3 id="WriteBatchInternal-InsertInto"><a href="#WriteBatchInternal-InsertInto" class="headerlink" title="WriteBatchInternal::InsertInto"></a>WriteBatchInternal::InsertInto</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Status <span class="title">WriteBatchInternal::InsertInto</span><span class="params">(<span class="keyword">const</span> WriteBatch* b, MemTable* memtable)</span> </span>{</span><br><span class="line"> MemTableInserter inserter;</span><br><span class="line"> inserter.sequence_ = WriteBatchInternal::Sequence(b);</span><br><span class="line"> inserter.mem_ = memtable;</span><br><span class="line"> <span class="keyword">return</span> b->Iterate(&inserter);</span><br><span 
class="line">}</span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LEVELDB_EXPORT</span> <span class="title">WriteBatch</span> {</span></span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="class"><span class="keyword">class</span> <span class="title">LEVELDB_EXPORT</span> <span class="title">Handler</span> {</span></span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> <span class="keyword">virtual</span> ~Handler();</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">void</span> <span class="title">Put</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> </span>= <span class="number">0</span>;</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">void</span> <span class="title">Delete</span><span class="params">(<span class="keyword">const</span> Slice& key)</span> </span>= <span class="number">0</span>;</span><br><span class="line"> };</span><br><span class="line"> ... 
...</span><br><span class="line"> <span class="function">Status <span class="title">Iterate</span><span class="params">(Handler* handler)</span> <span class="keyword">const</span></span>;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>It looks like leveldb uses the <a href="https://www.cnblogs.com/bytesfly/p/visitor-pattern.html">visitor pattern</a> here: <strong>WriteBatch::Iterate</strong> accepts a Handler, and that Handler can be any kind of visitor. For example, here is the implementation of <strong>MemTableInserter</strong>:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">namespace</span> {</span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MemTableInserter</span> :</span> <span class="keyword">public</span> WriteBatch::Handler {</span><br><span class="line"> <span class="keyword">public</span>:</span><br><span class="line"> SequenceNumber sequence_;</span><br><span class="line"> MemTable* mem_; <span class="comment">// MemTable has a table_ member, which is the skip list</span></span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Put</span><span class="params">(<span class="keyword">const</span> Slice& key, <span class="keyword">const</span> Slice& value)</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> mem_->Add(sequence_, kTypeValue, key, value);</span><br><span class="line"> 
sequence_++;</span><br><span class="line"> }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">Delete</span><span class="params">(<span class="keyword">const</span> Slice& key)</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> mem_->Add(sequence_, kTypeDeletion, key, Slice());</span><br><span class="line"> sequence_++;</span><br><span class="line"> }</span><br><span class="line">};</span><br><span class="line">} <span class="comment">// namespace</span></span><br></pre></td></tr></table></figure><h3 id="WriteBatch-Iterate"><a href="#WriteBatch-Iterate" class="headerlink" title="WriteBatch::Iterate"></a>WriteBatch::Iterate</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span 
class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">Status <span class="title">WriteBatch::Iterate</span><span class="params">(Handler* handler)</span> <span class="keyword">const</span> </span>{</span><br><span class="line"> <span class="function">Slice <span class="title">input</span><span class="params">(rep_)</span></span>;</span><br><span class="line"> <span class="keyword">if</span> (input.size() < kHeader) {</span><br><span class="line"> <span class="keyword">return</span> Status::Corruption(<span class="string">"malformed WriteBatch (too small)"</span>);</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> input.remove_prefix(kHeader);</span><br><span class="line"> Slice key, value;</span><br><span class="line"> <span class="keyword">int</span> found = <span class="number">0</span>;</span><br><span class="line"> <span class="keyword">while</span> (!input.empty()) {</span><br><span class="line"> found++;</span><br><span class="line"> <span class="keyword">char</span> tag = input[<span class="number">0</span>];</span><br><span class="line"> input.remove_prefix(<span class="number">1</span>);</span><br><span class="line"> <span class="keyword">switch</span> (tag) {</span><br><span class="line"> <span class="keyword">case</span> kTypeValue:</span><br><span class="line"> <span class="keyword">if</span> (GetLengthPrefixedSlice(&input, &key) &&</span><br><span class="line"> GetLengthPrefixedSlice(&input, &value)) {</span><br><span class="line"> handler->Put(key, value);</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="keyword">return</span> Status::Corruption(<span class="string">"bad WriteBatch Put"</span>);</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"> <span class="keyword">case</span> 
kTypeDeletion:</span><br><span class="line"> <span class="keyword">if</span> (GetLengthPrefixedSlice(&input, &key)) {</span><br><span class="line"> handler->Delete(key);</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="keyword">return</span> Status::Corruption(<span class="string">"bad WriteBatch Delete"</span>);</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">break</span>;</span><br><span class="line"> <span class="keyword">default</span>:</span><br><span class="line"> <span class="keyword">return</span> Status::Corruption(<span class="string">"unknown WriteBatch tag"</span>);</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">if</span> (found != WriteBatchInternal::Count(<span class="keyword">this</span>)) {</span><br><span class="line"> <span class="keyword">return</span> Status::Corruption(<span class="string">"WriteBatch has wrong count"</span>);</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="keyword">return</span> Status::OK();</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This code simply <strong>walks the merged WriteBatch byte stream, deserializing it record by record</strong>. Depending on whether the type is put or delete, it extracts the <strong>{key,value} pairs or the key</strong> and applies them to the memtable.</p><p>Next, let's see how the Memtable inserts a mutation.</p><p>The Memtable's member variables show that it is implemented with a skip list, so the focus is <code>db/skiplist.h</code>. That is too long to cover here, so I will leave it for the next blog post.</p><h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><ol><li><a href="https://riverferry.site/2021-10-13-leveldb%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90(4)%20memtable%20and%20log/">leveldb源码分析(4) memtable and log</a></li></ol>]]></content>
<summary type="html"><h1 id="LevelDB-源码分析【2】——-数据变更-DBImpl-Write"><a href="#LevelDB-源码分析【2】——-数据变更-DBImpl-Write" class="headerlink" title="LevelDB 源码分析【2】—— 数据变更</summary>
</entry>
<entry>
<title>leveldb source code analysis [1]: memory management with Arena</title>
<link href="https://codroc.github.io/2022/08/01/leveldb1_arena/"/>
<id>https://codroc.github.io/2022/08/01/leveldb1_arena/</id>
<published>2022-08-01T11:57:16.000Z</published>
<updated>2022-08-01T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h1 id="LevelDB-源码分析【1】——内存管理-Arena"><a href="#LevelDB-源码分析【1】——内存管理-Arena" class="headerlink" title="LevelDB 源码分析【1】——内存管理 Arena"></a>LevelDB Source Code Analysis [1]: Memory Management with Arena</h1><p>While reading material online I came across a very handy tool, <strong>gperftools</strong>. It implements a high-performance malloc and also ships profiling tools that can analyze heap and CPU usage.</p><p>That alone would not thrill me. The real value is that, since it can profile the heap, it can also show graphically the call paths the program goes through as it runs, which is a powerful aid when reading a large code base!</p><h2 id="安装-gperftools"><a href="#安装-gperftools" class="headerlink" title="安装 gperftools"></a>Installing gperftools</h2><p><strong>Architecture:</strong> x86</p><p><strong>Environment:</strong> any Linux system</p><p>See the installation steps on GitHub: <a href="https://github.com/gperftools/gperftools">https://github.com/gperftools/gperftools</a></p><p>The official documentation mentions that the system's bundled <strong>libunwind</strong> may cause some bugs, so install the latest <strong>libunwind</strong> directly from GitHub: <a href="https://github.com/libunwind/libunwind">https://github.com/libunwind/libunwind</a></p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> git <span class="built_in">clone</span> https://github.com/libunwind/libunwind</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> <span class="built_in">cd</span> libunwind/</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> autoreconf -i</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ./configure</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> make</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> make install prefix=/usr/<span class="built_in">local</span></span></span><br></pre></td></tr></table></figure><p>Now <strong>gperftools</strong> can be installed:</p><figure class="highlight 
shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> git <span class="built_in">clone</span> https://github.com/gperftools/gperftools</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> <span class="built_in">cd</span> gperftools/</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ./autogen.sh</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> ./configure</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> make</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> make install prefix=/usr/<span class="built_in">local</span></span></span><br></pre></td></tr></table></figure><p>On Ubuntu, install the graphing tools:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> sudo apt install graphviz ghostscript</span></span><br></pre></td></tr></table></figure><h2 id="测试-leveldb-的-heapprofile"><a href="#测试-leveldb-的-heapprofile" class="headerlink" title="测试 leveldb 的 heapprofile"></a>Testing a heap profile of leveldb</h2><p>Open a leveldb database and write a large amount of data into it. This shows which function allocates most of the memory, along with the full chain of function calls leading to each allocation.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// test_leveldb.cpp</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><leveldb/db.h></span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><iostream></span></span></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> <span class="built_in">std</span>;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span> </span>{</span><br><span class="line"> leveldb::DB *db;</span><br><span class="line"> leveldb::Options options;</span><br><span class="line"></span><br><span class="line"> options.create_if_missing = <span class="literal">true</span>;</span><br><span class="line"></span><br><span class="line"> leveldb::DB::Open(options, <span class="string">"/tmp/testdb"</span>, &db);</span><br><span class="line"></span><br><span class="line"> <span class="built_in">string</span> key = <span class="string">"MyKey29"</span>, value = <span class="string">"Hello World!"</span>, result;</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="built_in">string</span> <span class="title">prefix</span><span class="params">(<span class="number">100</span>, <span class="string">'c'</span>)</span></span>;</span><br><span class="line"></span><br><span 
class="line"> <span class="keyword">for</span> (<span class="keyword">int</span> i = <span class="number">0</span>;i < <span class="number">12000000</span>; ++i) {</span><br><span class="line"> key = prefix + to_string(i);</span><br><span class="line"> db->Put(leveldb::WriteOptions(), key, value + to_string(i));</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">delete</span> db;</span><br><span class="line"> <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>The official documentation describes several ways to use <strong>gperftools</strong>; see <a href="https://gperftools.github.io/gperftools/cpuprofile.html">https://gperftools.github.io/gperftools/cpuprofile.html</a></p><p>I simply link dynamically, linking the gperftools shared libraries (<strong>libtcmalloc.so</strong> and <strong>libprofiler.so</strong>) into my own executable:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> g++ -g test_leveldb.cpp -o test_leveldb -lleveldb -lpthread -ltcmalloc -lprofiler -Wl,-rpath=/usr/<span class="built_in">local</span>/lib</span></span><br></pre></td></tr></table></figure><p>Then enable heap profiling through an environment variable:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> mkdir -p /tmp/code/leveldb</span></span><br><span class="line"><span class="meta">$</span><span class="bash"> HEAPPROFILE=<span class="string">"/tmp/code/leveldb/run"</span> ./test_leveldb</span></span><br></pre></td></tr></table></figure><p>Finally, use the <strong>pprof</strong> tool to analyze the output, passing it the executable and one of the generated <code>.heap</code> files:</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> pprof --pdf ./test_leveldb /tmp/code/leveldb/run.0001.heap > run.pdf</span></span><br></pre></td></tr></table></figure><p>The result looks like this:</p><p><img src="https://s2.loli.net/2022/07/18/9uXPxLzypGg18jb.png" alt="0.PNG"></p><p>The graph is too large, so I only captured the important part. It shows that all memory is ultimately allocated by <strong>AllocateNewBlock</strong>, and that two paths lead to this function:</p><ul><li><strong>Allocate</strong></li><li><strong>AllocateAligned</strong></li></ul><p>Next, let's follow the flow in the graph and look at the individual functions.</p><h2 id="源码分析"><a href="#源码分析" class="headerlink" title="源码分析"></a>Source code analysis</h2><h3 id="Arena-Allocate"><a href="#Arena-Allocate" class="headerlink" title="Arena::Allocate"></a>Arena::Allocate</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">inline</span> <span class="keyword">char</span>* <span class="title">Arena::Allocate</span><span class="params">(<span class="keyword">size_t</span> bytes)</span> </span>{</span><br><span class="line"> <span class="comment">// The semantics of what to return are a bit messy if we allow</span></span><br><span class="line"> <span class="comment">// 0-byte allocations, so we disallow them here (we don't need</span></span><br><span class="line"> <span class="comment">// them for our internal use).</span></span><br><span class="line"> assert(bytes > <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">if</span> (bytes <= alloc_bytes_remaining_) {</span><br><span class="line"> <span class="keyword">char</span>* result = alloc_ptr_;</span><br><span class="line"> alloc_ptr_ += bytes;</span><br><span class="line"> 
alloc_bytes_remaining_ -= bytes;</span><br><span class="line"> <span class="keyword">return</span> result;</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> AllocateFallback(bytes);</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="Arena-AllocateAligned"><a href="#Arena-AllocateAligned" class="headerlink" title="Arena::AllocateAligned"></a>Arena::AllocateAligned</h3><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">char</span>* <span class="title">Arena::AllocateAligned</span><span class="params">(<span class="keyword">size_t</span> bytes)</span> </span>{</span><br><span class="line"> <span class="keyword">const</span> <span class="keyword">int</span> align = (<span class="keyword">sizeof</span>(<span class="keyword">void</span>*) > <span class="number">8</span>) ? 
<span class="keyword">sizeof</span>(<span class="keyword">void</span>*) : <span class="number">8</span>;</span><br><span class="line"> <span class="keyword">static_assert</span>((align & (align - <span class="number">1</span>)) == <span class="number">0</span>,</span><br><span class="line"> <span class="string">"Pointer size should be a power of 2"</span>);</span><br><span class="line"> <span class="keyword">size_t</span> current_mod = <span class="keyword">reinterpret_cast</span><<span class="keyword">uintptr_t</span>>(alloc_ptr_) & (align - <span class="number">1</span>);</span><br><span class="line"> <span class="keyword">size_t</span> slop = (current_mod == <span class="number">0</span> ? <span class="number">0</span> : align - current_mod);</span><br><span class="line"> <span class="keyword">size_t</span> needed = bytes + slop;</span><br><span class="line"> <span class="keyword">char</span>* result;</span><br><span class="line"> <span class="keyword">if</span> (needed <= alloc_bytes_remaining_) {</span><br><span class="line"> result = alloc_ptr_ + slop;</span><br><span class="line"> alloc_ptr_ += needed;</span><br><span class="line"> alloc_bytes_remaining_ -= needed;</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// AllocateFallback always returned aligned memory</span></span><br><span class="line"> result = AllocateFallback(bytes);</span><br><span class="line"> }</span><br><span class="line"> assert((<span class="keyword">reinterpret_cast</span><<span class="keyword">uintptr_t</span>>(result) & (align - <span class="number">1</span>)) == <span class="number">0</span>);</span><br><span class="line"> <span class="keyword">return</span> result;</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h3 id="Arena-AllocateFallback"><a href="#Arena-AllocateFallback" class="headerlink" title="Arena::AllocateFallback"></a>Arena::AllocateFallback</h3><figure class="highlight 
c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">char</span>* <span class="title">Arena::AllocateFallback</span><span class="params">(<span class="keyword">size_t</span> bytes)</span> </span>{</span><br><span class="line"> <span class="keyword">if</span> (bytes > kBlockSize / <span class="number">4</span>) {</span><br><span class="line"> <span class="comment">// Object is more than a quarter of our block size. 
Allocate it separately</span></span><br><span class="line"> <span class="comment">// to avoid wasting too much space in leftover bytes.</span></span><br><span class="line"> <span class="keyword">char</span>* result = AllocateNewBlock(bytes);</span><br><span class="line"> <span class="keyword">return</span> result;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// We waste the remaining space in the current block.</span></span><br><span class="line"> alloc_ptr_ = AllocateNewBlock(kBlockSize);</span><br><span class="line"> alloc_bytes_remaining_ = kBlockSize;</span><br><span class="line"></span><br><span class="line"> <span class="keyword">char</span>* result = alloc_ptr_;</span><br><span class="line"> alloc_ptr_ += bytes;</span><br><span class="line"> alloc_bytes_remaining_ -= bytes;</span><br><span class="line"> <span class="keyword">return</span> result;</span><br><span class="line">}</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html"><h1 id="LevelDB-源码分析【1】——内存管理-Arena"><a href="#LevelDB-源码分析【1】——内存管理-Arena" class="headerlink" title="LevelDB 源码分析【1】——内存管理 Arena"></a>Level</summary>
</entry>
<entry>
<title>Crash-Recoverable Transactions</title>
<link href="https://codroc.github.io/2022/07/16/%E6%95%85%E9%9A%9C%E5%8F%AF%E6%81%A2%E5%A4%8D%E4%BA%8B%E5%8A%A1/"/>
<id>https://codroc.github.io/2022/07/16/%E6%95%85%E9%9A%9C%E5%8F%AF%E6%81%A2%E5%A4%8D%E4%BA%8B%E5%8A%A1/</id>
<published>2022-07-16T11:57:16.000Z</published>
<updated>2022-07-16T11:57:16.000Z</updated>
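<content type="html"><![CDATA[<h1 id="故障可恢复事务"><a href="#故障可恢复事务" class="headerlink" title="故障可恢复事务"></a>Crash-Recoverable Transactions</h1><p>I have never formally studied how to use databases, but a database is itself a system, so it necessarily follows the basic concepts of system design, such as fault tolerance, automatic crash recovery, and persistence.</p><p>After watching the MIT course taught by Professor Morris, I am writing down the <strong>design ideas behind a simple transactional database</strong>.</p><h3 id="事务"><a href="#事务" class="headerlink" title="事务"></a>Transactions</h3><p>The ACID properties of transactions are documented all over the web:</p><ul><li>Atomicity</li><li>Consistency</li><li>Isolation</li><li>Durability</li></ul><p>Morris introduces transactions like this:</p><p>A transaction packs a series of operations into a single atomic operation and executes them in order.</p><p>Example: in a banking system, consider a transfer in which X sends 10 yuan to Y. Expressed as a transaction:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">BEGIN</span><br><span class="line">X = X - 10</span><br><span class="line">Y = Y + 10</span><br><span class="line">END</span><br></pre></td></tr></table></figure><p>We want the database to behave as follows:</p><ul><li>It executes these operations in order and <strong>never lets clients observe an intermediate state</strong>.</li><li>It <strong>tolerates system failures</strong>: after recovery, either all operations in the transaction have taken effect or none of them have.</li><li><strong>No data is lost</strong> when the database restarts.</li></ul><h3 id="怎么实现事务"><a href="#怎么实现事务" class="headerlink" title="怎么实现事务"></a>How transactions are implemented</h3><p>Conceptually, <strong>a transaction is implemented by locking every piece of data it touches</strong>. X and Y stay locked for the whole transaction, and the locks are released only after <strong>the transaction has finished, committed, and been persisted</strong>.</p><p>Concrete implementation:</p><p>Consider the simplest transactional database: <strong>a single machine with a local disk</strong> for storage. All records live on disk, perhaps indexed with a B-tree. The layout then looks roughly like this: X and Y each reside in some disk block; a disk block usually holds many records, and X and Y occupy only a few of its bits.</p><p><img src="https://s2.loli.net/2022/07/16/twK49TpZVm28GQC.png" alt="事务数据库设计0.PNG"></p><ol><li>The process starts a transaction and uses the index to locate the disk blocks. To read the blocks holding X and Y, the CPU sends a read command to the disk driver; the process then blocks, voluntarily giving up the CPU, and waits for the read to complete.</li><li>The disk driver loads the blocks holding X and Y into memory, where they are kept in an LRU buffer cache, and then raises an interrupt to tell the CPU that the read has finished.</li><li>The CPU marks the original process as ready; after some time it is scheduled again and operates on X and Y in memory. It first builds the log records. The transaction above produces three records: the first two hold the original values of X and Y together with their new values after the operation, and the last is a commit record marking that the whole transaction has ended and committed. Every record of a transaction carries the transaction ID, which uniquely identifies the transaction.</li><li>The process flushes the log records to disk (possibly lazily, accumulating enough transaction log records and flushing them in one batch), then updates the in-memory values of X and Y and replies to the client that the transaction succeeded.</li></ol>
<h3 id="故障分析"><a href="#故障分析" class="headerlink" title="故障分析"></a>Failure analysis</h3><p>There are then two cases:</p><ul><li><p>The database does not crash</p><p>In memory, X and Y now hold 290 and 410 (that is, they started at 300 and 400). Eventually the database writes these in-memory values back to their locations on disk.</p></li><li><p>The database crashes before the in-memory values are written to disk</p><p>The disk blocks still hold the old values. When the database restarts, the recovery software scans the WAL (write-ahead log), finds the transaction's log records together with its commit record, and writes the new values to disk. This is called redo/replay: it re-executes the writes of the transaction.</p></li></ul><h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><ol><li><a href="https://www.zhihu.com/column/c_1273718607160393728"><a href="https://zhuanlan.zhihu.com/p/232339119">故障可恢复事务(Crash Recoverable Transaction)</a></a></li></ol>]]></content>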
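<summary type="html"><h1 id="故障可恢复事务"><a href="#故障可恢复事务" class="headerlink" title="故障可恢复事务"></a>Crash-Recoverable Transactions</h1><p>I have never formally studied how to use databases, but a database is itself a system, so it necessarily follows the basic concepts of system design, such as fault tolerance, automatic crash</summary>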
</entry>
<entry>
<title>ZooKeeper Notes</title>
<link href="https://codroc.github.io/2022/07/15/zookeeper%E7%AC%94%E8%AE%B0/"/>
<id>https://codroc.github.io/2022/07/15/zookeeper%E7%AC%94%E8%AE%B0/</id>
<published>2022-07-15T11:57:16.000Z</published>
<updated>2022-07-15T11:57:16.000Z</updated>
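<content type="html"><![CDATA[<p><strong>Reposted from:</strong> <a href="https://juejin.cn/post/6844903891146915848">经典分布式论文阅读:Zookeeper</a></p><p>These are reading notes on the ZooKeeper paper. ZooKeeper coordinates processes in distributed systems and offers them centralized services such as group messaging, shared registers, and distributed locks.</p><p>The coordination services a distributed system needs include configuration, group membership, leader election, and locking. ZooKeeper does not provide these directly; because stronger primitives can be used to implement weaker ones, it instead exposes an API with which developers build their own primitives. The API manipulates wait-free data objects organized hierarchically, as in a file system, and guarantees <strong>FIFO client ordering</strong> of all operations and <strong>linearizable writes</strong>. ZooKeeper uses a pipelined architecture to achieve high throughput and low latency. Updates go through Zab to guarantee linearizability, while reads are served locally by each server and need no ordering. The watch mechanism notifies clients after data changes so that they can fetch the latest data quickly.</p><h2 id="ZooKeeper-服务"><a href="#ZooKeeper-服务" class="headerlink" title="ZooKeeper 服务"></a>The ZooKeeper service</h2><p>ZooKeeper exposes its API to clients through a library, which also manages the client's connection to the ZooKeeper servers. The data nodes in ZooKeeper are called <strong>znodes</strong> and are organized in a tree-shaped namespace. A client establishes a <strong>session</strong> when it connects to a server and issues requests through the session handle.</p><h3 id="服务总览"><a href="#服务总览" class="headerlink" title="服务总览"></a>Service overview</h3><p>ZooKeeper gives clients the abstraction of a set of data objects (znodes).</p><p><img src="https://s2.loli.net/2022/07/15/rJi5ZPxEtfLS3pe.png" alt="zookeeper0.PNG"></p><p>There are two types of znode:</p><ul><li><strong>Regular</strong>: the data object is created and deleted explicitly.</li><li><strong>Ephemeral</strong>: the object is deleted once the session that created it ends.</li></ul><p>If the <code>SEQUENTIAL</code> flag is set when a znode is created, a monotonically increasing counter is appended to its name. ZooKeeper implements a watch mechanism that notifies the client after a data object changes; a watch fires only once.</p><p><strong>Data model</strong>: ZooKeeper's data model is a file system that supports only full reads and writes of a znode's data. Znodes map to abstractions of the application and store information such as configuration and metadata.</p><p><strong>Sessions</strong>: a client establishes a session when it connects to ZooKeeper; the session identifies the client.</p><h3 id="客户端API"><a href="#客户端API" class="headerlink" title="客户端API"></a>Client API</h3><ul><li><code>create(path, data, flags)</code>: creates a znode at <code>path</code>, stores <code>data[]</code> in it, and returns the name of the new znode. <code>flags</code> selects the znode type (regular or ephemeral) and sets the <code>SEQUENTIAL</code> flag.</li><li><code>delete(path, version)</code>: deletes the znode at <code>path</code> if the version matches.</li><li><code>exists(path, watch)</code>: returns true if the znode at <code>path</code> exists and false otherwise. The <code>watch</code> flag lets the client set a watch on this znode.</li><li><code>getData(path, watch)</code>: returns the data and metadata of the znode; <code>watch</code> works the same way as in <code>exists</code>.</li><li><code>setData(path, data, 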
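version)</code>: writes <code>data[]</code> to the znode at <code>path</code> if the version matches.</li><li><code>getChildren(path, watch)</code>: returns the set of children of the znode.</li><li><code>sync(path)</code>: waits for all updates that are currently pending to propagate; the <code>path</code> argument is ignored.</li></ul><p>Every method above comes in a synchronous (blocking) and an asynchronous (non-blocking) version. Passing -1 as the version number skips the version check.</p><h3 id="ZooKeeper保证"><a href="#ZooKeeper保证" class="headerlink" title="ZooKeeper保证"></a>ZooKeeper guarantees</h3><p>ZooKeeper makes two basic ordering guarantees:</p><ul><li><strong>Linearizable writes</strong>: all updates that change ZooKeeper's state are serialized.</li><li><strong>FIFO client order</strong>: all requests from a given client are executed in the order in which they were sent.</li></ul><p>An example shows how these two guarantees keep a system correct. Suppose a system elects a master to manage the other nodes; the master then has to update some configuration and notify the others. The requirements are:</p><ul><li>While the master is changing the configuration, other nodes must not read the configuration being modified.</li><li>If the master crashes before the update completes, other nodes must not read the partial configuration.</li></ul><p>A <code>ready</code> znode solves this: the master deletes it before changing the configuration and recreates it when done. Other nodes do not read the configuration while <code>ready</code> is absent.</p><p>One problem remains: if another node sees <code>ready</code> and starts reading the configuration, but the master deletes <code>ready</code> right afterwards and starts modifying it, that node ends up with a stale configuration. The watch mechanism addresses this, because the other nodes are notified promptly once <code>ready</code> is deleted.</p><p>ZooKeeper also makes two liveness and durability guarantees:</p><ul><li>The service is available as long as a majority of the servers are active.</li><li>If ZooKeeper responds successfully to a change request, the change persists across any number of failures, as long as a majority of the servers can eventually recover.</li></ul><h3 id="原语例子"><a href="#原语例子" class="headerlink" title="原语例子"></a>Examples of primitives</h3><ul><li><p><strong>Configuration management</strong>: store the configuration in a znode; each process sets a watch to be notified of configuration updates.</p></li><li><p><strong>Rendezvous</strong>: many distributed systems consist of a master and workers, but where they run is decided by a scheduler. The master's details can be placed in a znode so that the workers can find the master.</p></li><li><p><strong>Group membership</strong>: when a member process comes online it creates an ephemeral child znode under the znode representing the group. When the process exits, its ephemeral znode is deleted, so the children of the group znode reflect the current membership.</p></li><li><p><strong>Simple lock</strong>: a lock can be implemented by creating a corresponding znode. If the creation succeeds, the client holds the lock. If the znode already exists, the client must wait for the lock to be released (the znode deleted) before it can acquire the lock (create the znode).</p></li><li><p><strong>Simple lock without herd effect</strong>: with the simple lock, many processes contend at once. Instead, the lock requests can be ordered and the lock granted in that order.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">CREATE("f", data, sequential=TRUE, ephemeral=TRUE)</span><br><span class="line">WHILE TRUE:</span><br><span class="line"> 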
LIST("f*")</span><br><span class="line"> IF NO LOWER #FILE: RETURN</span><br><span class="line"> IF EXIST(NEXT LOWER #FILE, watch=TRUE):</span><br><span class="line"> WAIT</span><br></pre></td></tr></table></figure></li><li><p><strong>Read/write locks</strong>: a write lock works like the simple lock and excludes every other lock.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">Write Lock</span><br><span class="line">1 n = create(l + "/write-", EPHEMERAL|SEQUENTIAL)</span><br><span class="line">2 C = getChildren(l, false)</span><br><span class="line">3 if n is lowest znode in C, exit</span><br><span class="line">4 p = znode in C ordered just before n</span><br><span class="line">5 if exists(p, true) wait for event</span><br><span class="line">6 goto 2</span><br></pre></td></tr></table></figure></li><li><p>Read locks are compatible with one another and exclude only write locks.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">Read Lock</span><br><span class="line">1 n = create(l + "/read-", EPHEMERAL|SEQUENTIAL)</span><br><span class="line">2 C = getChildren(l, false)</span><br><span class="line">3 if no write znodes lower than n in C, exit</span><br><span class="line">4 p = write znode in C ordered just before n</span><br><span class="line">5 if exists(p, true) wait for event</span><br><span class="line">6 goto 
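3</span><br></pre></td></tr></table></figure></li><li><p><strong>Double barrier</strong>: a double barrier lets multiple clients start a computation together and finish it together. Before starting, each client adds a znode under the znode that represents the barrier; after finishing, it deletes that znode. Clients may start computing only once the number of children of the barrier znode reaches a threshold. They can wait for a special <code>ready</code> znode, which is created when the threshold is reached. To leave, clients wait until all child znodes have been deleted, which can likewise be signalled by deleting <code>ready</code>.</p></li></ul><h2 id="ZooKeeper应用"><a href="#ZooKeeper应用" class="headerlink" title="ZooKeeper应用"></a>ZooKeeper applications</h2><ul><li><strong>Fetching Service</strong>: in the fetching service of Yahoo!'s crawler, the master tells the fetcher nodes their configuration and the fetchers report their status back. The service therefore uses ZooKeeper for <strong>configuration management</strong> and <strong>leader election</strong>. The paper's plot of this system's read/write traffic shows that reads dominate.</li><li><strong><a href="https://link.juejin.cn/?target=http://katta.sourceforge.net/">Katta</a></strong>: Katta is a distributed indexer in which the master assigns shards to slaves and tracks their progress. It uses ZooKeeper mainly for <strong>group membership management</strong>, <strong>leader election</strong>, and <strong>configuration management</strong>.</li><li><strong>Yahoo! Message Broker</strong>: Yahoo! Message Broker publishes and delivers messages for a huge number of topics, which are spread over many servers, each replicated with a primary-backup scheme. In the system's znode layout (shown in a figure in the paper), <code>shutdown</code> and <code>migration_prohibited</code> are configuration, <code>nodes</code> records the servers that are group members, and <code>topics</code> records the primary and backup servers responsible for each topic. In addition, <strong>leader election</strong> is needed after a primary crashes.</li></ul><h2 id="ZooKeeper实现"><a href="#ZooKeeper实现" class="headerlink" title="ZooKeeper实现"></a>ZooKeeper implementation</h2><p>The components of ZooKeeper are shown below. Every server keeps a replica of the data. Writes are committed to the database through an agreement protocol, while reads are served directly from the server's local database. ZooKeeper writes an update to disk before applying it to the database, and recovers after a failure from a snapshot plus the log. Following the agreement protocol, write requests are forwarded to the leader.</p><p><img src="https://s2.loli.net/2022/07/15/23BXaARJCq6YTzi.png" alt="zookeeper1.PNG"></p><h3 id="请求处理器"><a href="#请求处理器" class="headerlink" title="请求处理器"></a>Request processor</h3><p>When the request processor receives a write request, it converts it into an idempotent transaction: from the request it computes the new data, version number, and timestamp, which then wait to be applied to the database.</p><h3 id="原子广播"><a href="#原子广播" class="headerlink" title="原子广播"></a>Atomic broadcast</h3><p>ZooKeeper uses Zab as its atomic broadcast protocol and reaches agreement with simple majority quorums. Zab guarantees that broadcasts are delivered in the order in which they were sent, and a leader must make sure it has received the broadcasts of the previous leader before it broadcasts its own.</p><h3 id="多副本数据库"><a href="#多副本数据库" class="headerlink" title="多副本数据库"></a>Replicated database</h3><p>After a server failure, the state is recovered from a periodic snapshot plus the log written since that snapshot. Taking a snapshot does not require locking: because transactions are idempotent, re-applying a change that has already been applied has no effect.</p><h3 id="客户端-服务器交互"><a href="#客户端-服务器交互" class="headerlink" 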
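title="客户端-服务器交互"></a>Client-server interaction</h3><p>When a server processes a write, it notifies the clients watching the affected data and clears those watches; each server notifies only the clients connected to it. Every read is tagged with a <code>zxid</code>, the ID of the last write transaction seen by the server. Because reads are served locally, some writes issued before a read may not yet have reached the server the client is connected to. ZooKeeper therefore provides <code>sync</code>, which guarantees that reads issued after a <code>sync</code> see every write that happened before it. Clients obtain the latest <code>zxid</code> from the server. The <code>zxid</code> also guarantees that when a client switches servers, the new server's view is not behind what the client has already seen; that is, the server's <code>zxid</code> must not be older than the client's. To detect client failures, sessions have a timeout, so clients send heartbeats during idle periods to keep the session from expiring.</p>]]></content>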
<summary type="html"><p><strong>Reposted from:</strong> <a href="https://juejin.cn/post/6844903891146915848">经典分布式论文阅读:Zookeeper</a></p>
<p>These are reading notes on the ZooKeeper paper. ZooKeeper</summary>
</entry>
<entry>
<title>Serialization</title>
<link href="https://codroc.github.io/2022/07/12/Serialize/"/>
<id>https://codroc.github.io/2022/07/12/Serialize/</id>
<published>2022-07-12T11:57:16.000Z</published>
<updated>2022-07-12T11:57:16.000Z</updated>
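<content type="html"><![CDATA[<h1 id="Serialize"><a href="#Serialize" class="headerlink" title="Serialize"></a>Serialize</h1><p>Why does data need to be serialized?</p><ul><li>It makes data easier to store</li><li>It makes data easier to transmit</li></ul><p>For storage: a process normally reads and writes data by accessing memory directly. Initializing an object, for example, means writing each of its member variables. To persist an object we obviously write it to disk, but how do we write an in-memory object to disk? We need to know both the type and the value of every variable.</p><p>For transmission: data often has to travel over the network to another computer, and at the link layer everything is transmitted as a stream of bytes. So how do we turn an in-memory object into a byte stream that can be sent to another machine?</p><p>Serialization and deserialization answer all of these questions.</p><h1 id="Google-ProtoBuf"><a href="#Google-ProtoBuf" class="headerlink" title="Google ProtoBuf"></a>Google ProtoBuf</h1><p>Protocol Buffers is an open-source, cross-platform protocol for serializing structured data. It is useful for programs that store data or communicate over a network.</p><p>The part worth understanding is how ProtoBuf encodes and serializes data; see the article <a href="https://www.jianshu.com/p/73c9ed3a4877">深入 ProtoBuf - 编码</a>.</p><p>Using ProtoBuf as a reference, I designed a simple serialization tool. It supports the LV (length-value) format and varint encoding. I have not yet figured out what the tag is for, and the LV format is enough for my current needs.</p><p>The Serialize library supports four types in total: <code>string</code>, <code>varint</code>, <code>fixed32</code>, and <code>fixed64</code></p><ul><li>string: any string</li><li>varint: int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t, uint32_t, uint64_t</li><li>fixed32: int32_t, uint32_t, float</li><li>fixed64: int64_t, uint64_t, double</li></ul><h1 id="Interface"><a href="#Interface" class="headerlink" title="Interface"></a>Interface</h1><p>Following ProtoBuf, the user first defines a class and then writes its interfaces by applying a few simple rules to the member variables.</p><ul><li>object -> serialized string, serialized string -> object</li><li>object -> file, file -> object</li></ul><p>The rules are:</p><ol><li>Only basic types are supported: the signed and unsigned integers from <code>int8_t</code> to <code>int64_t</code>, <code>string</code>, <code>float</code>, <code>double</code>, and <code>bool</code></li><li>Members are serialized and deserialized in the order in which they are declared</li></ol><h4 id="对象-lt-——-gt-字符串"><a href="#对象-lt-——-gt-字符串" class="headerlink" title="对象<——>字符串"></a>Object <-> string</h4><p>The <code>serializeToString</code> and <code>deserializeToPerson</code> interfaces convert between an object and a string:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 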
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// example.hpp</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">Person</span> {</span></span><br><span class="line"> Person(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& n, <span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& s, <span class="keyword">uint8_t</span> a, <span class="keyword">uint32_t</span> p)</span><br><span class="line"> : name(n),</span><br><span class="line"> sex(s),</span><br><span class="line"> age(a),</span><br><span class="line"> property(p)</span><br><span class="line"> {}</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> name;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> sex;</span><br><span class="line"> <span class="keyword">uint8_t</span> age;</span><br><span class="line"> <span class="keyword">uint32_t</span> property;</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">serializeToString</span><span class="params">()</span></span>;</span><br><span class="line"> <span class="function"><span class="keyword">static</span> Person <span class="title">deserializeToPerson</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& 
serialized_string)</span></span>;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p><strong><code>serializeToString</code> serializes the object into a string <em>serialized_string</em>, and <code>deserializeToPerson</code> deserializes such a string <em>serialized_string</em> back into an object.</strong></p><p>Both interfaces are simple to implement.</p><p><code>serializeToString</code> just serializes the member variables one after another:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">Person::serializeToString</span><span class="params">()</span> </span>{</span><br><span class="line"> <span class="function">Serialize <span class="title">se</span><span class="params">(Serialize::SERIALIZER)</span></span>;</span><br><span class="line"> <span class="comment">// Serialize each member in declaration order</span></span><br><span class="line"> se.writeString(name);</span><br><span class="line"> se.writeString(sex);</span><br><span class="line"> se.writeVarUint8(age);</span><br><span class="line"> se.writeVarUint32(property);</span><br><span class="line"> <span class="keyword">return</span> se.toString();</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>deserializeToPerson</code> just deserializes the members in the same order:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="function">Person <span class="title">Person::deserializeToPerson</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& str)</span> </span>{</span><br><span class="line"> Serialize de{Serialize::DESERIALIZER, str};</span><br><span class="line"> <span class="keyword">return</span> {</span><br><span class="line"> de.readString(),</span><br><span class="line"> de.readString(),</span><br><span class="line"> de.readVarUint8(),</span><br><span class="line"> de.readVarUint32()</span><br><span class="line"> };</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h4 id="对象-lt-——-gt-文件"><a href="#对象-lt-——-gt-文件" class="headerlink" title="对象<——>文件"></a>Object <-> file</h4><p>There are two fixed interfaces: <code>serializeToFile</code> and <code>deserializeFromFile</code>. Both must also be declared in <code>Person</code>, with <code>deserializeFromFile</code> as a <code>static</code> member. For a different type you usually only need to change the return type of <code>deserializeFromFile</code>.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">bool</span> <span class="title">Person::serializeToFile</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& filepath)</span> </span>{</span><br><span class="line"> <span class="keyword">if</span> (!Serialize::toFile(filepath, serializeToString())) <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">}</span><br><span class="line"><span class="function">Person <span class="title">Person::deserializeFromFile</span><span class="params">(<span class="keyword">const</span> 
<span class="built_in">std</span>::<span class="built_in">string</span>& filepath)</span> </span>{</span><br><span class="line"> <span class="keyword">return</span> deserializeToPerson(Serialize::fromFile(filepath));</span><br><span class="line">}</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html"><h1 id="Serialize"><a href="#Serialize" class="headerlink" title="Serialize"></a>Serialize</h1><p>Why does data need to be serialized?</p>
<ul>
<li>It makes data easier to store</li>
<li></summary>
</entry>
<entry>
<title>GFS Notes</title>
<link href="https://codroc.github.io/2022/07/08/gfs%E7%AC%94%E8%AE%B0/"/>
<id>https://codroc.github.io/2022/07/08/gfs%E7%AC%94%E8%AE%B0/</id>
<published>2022-07-08T11:57:16.000Z</published>
<updated>2022-07-08T11:57:16.000Z</updated>
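<content type="html"><![CDATA[<p><strong>What GFS is: a large-scale, scalable, fault-tolerant distributed file system</strong></p><h3 id="具有的特性:"><a href="#具有的特性:" class="headerlink" title="Key properties:"></a>Key properties:</h3><ul><li>Fault tolerance: <strong>component failures are treated as the norm</strong> rather than the exception</li><li>Runs on commodity machines: a GFS deployment consists of hundreds or even thousands of storage machines built from inexpensive commodity parts, and is accessed by a comparable number of client machines</li><li>Large files: <strong>files are huge by traditional standards</strong>. Multi-GB files are common.</li><li>Appends dominate: <strong>most files are mutated by appending new data</strong> rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, files are only read, and usually sequentially.</li><li>The applications and the file system API are co-designed</li><li>Multiple clients can append concurrently to the same file efficiently and with well-defined semantics</li></ul><h4 id="接口"><a href="#接口" class="headerlink" title="Interface"></a>Interface</h4><p>The interface resembles a traditional file system API. Files are organized hierarchically in directories and identified by path names. The usual operations are supported: create, delete, open, close, read, and write.</p><p>The two notable additions are <strong>snapshot and record append</strong>.</p><hr><h3 id="架构"><a href="#架构" class="headerlink" title="Architecture"></a>Architecture</h3><p>A <strong>GFS cluster</strong> consists of a single master, multiple chunkservers, and multiple clients. Each of these runs as a <strong>user-level process</strong> on a commodity Linux machine. A chunkserver process and a client process may run on the same machine.</p><p><strong>Files</strong> are divided into fixed-size chunks. Each chunk is identified by an immutable, globally unique 64-bit chunk handle that the master assigns when the chunk is created. Chunkservers store chunks on local disks as Linux files, and chunk data is read or written by specifying a chunk handle and a byte range. For reliability, each chunk is replicated on several chunkservers, three by default.</p><p>The master maintains all file system <strong>metadata</strong>. It communicates with each chunkserver periodically through heartbeat messages to give it instructions and collect its state.</p><p><strong>Data flow</strong>. Clients talk to the master only for metadata; the file data itself is transferred directly between clients and chunkservers.</p><p>Neither clients nor chunkservers cache file data. GFS workloads are mostly large files read as streams, so the data either does not fit in a cache or is unlikely to be read again. Clients do cache metadata, however.</p><hr><h3 id="client-的简单读的流程"><a href="#client-的简单读的流程" class="headerlink" title="A simple client read"></a>A simple client read</h3><p><img src="https://s2.loli.net/2022/06/06/KpelqfMPQ2RoAIH.png" alt="GFS0.PNG"></p><ul><li>The application gives the client the file name and the byte offset it wants to read</li><li>Using the fixed chunk size (64 MB), the client converts the offset into a chunk index and sends (file name, chunk index) to the master</li><li>The master looks up its metadata and replies with (chunk handle, locations of replicas)</li><li>The client caches (chunk handle, locations of replicas) under the (file name, chunk index) 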
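key; until the cached entry expires it does not need to ask the master again</li><li>The client picks one replica from the list (most likely the closest one) and sends it (chunk handle, byte range)</li></ul><hr><h3 id="Chunk-Size-固定成多大比较好?"><a href="#Chunk-Size-固定成多大比较好?" class="headerlink" title="How large should the fixed chunk size be?"></a>How large should the fixed chunk size be?</h3><p>The paper fixes the chunk size at 64 MB.</p><p>The chunk size is a key design parameter. Why 64 MB? It is far larger than an OS page frame.</p><p><strong>A large chunk size has advantages</strong>:</p><ol><li><p>A larger chunk size means fewer chunks per file, so a client can comfortably cache the chunk handles and locations for several TB of file data. That reduces client interaction with the master and lowers the master's load (as long as the cached entry has not expired, the master does not need to be asked). In short: <em>the larger the chunk size, the less metadata</em>;</p></li><li><p>This follows directly from the first point, <em>the larger the chunk size, the less metadata</em>: with less metadata, the master can keep all of it in memory, which speeds up access;</p></li><li><p>With a large chunk, a client is likely to perform many operations on the same chunk, so it can keep a persistent TCP connection to the chunkserver open and reduce network overhead</p></li></ol><p><strong>A large chunk size also has a drawback</strong>:</p><p>Small files can become hot spots. A file smaller than 64 MB occupies a single chunk, which with replication lives on just three chunkservers. If many clients access that file at once, those three servers become hot spots: hundreds or thousands of concurrent requests arrive and the servers are overloaded immediately.</p><p>How can hot spots caused by small files be mitigated?</p><ol><li>Raise the replication factor (for example, from 3 replicas to 10)</li><li>Keep many clients from hitting the file in the same time window by staggering their accesses</li><li>Let clients read the data from one another, peer-to-peer style</li></ol><hr><h3 id="元数据"><a href="#元数据" class="headerlink" title="Metadata"></a>Metadata</h3><p>The master maintains three kinds of metadata:</p><ul><li>The file and chunk namespaces</li><li>The mapping from files to chunks</li><li>The locations of each chunk's replicas</li></ul><p>All metadata is kept in the master's memory. The first two kinds are also persisted through a write-ahead log (WAL) and replicated to remote machines. The third kind is not persisted; the master obtains it by asking the chunkservers. <strong>The WAL protects against inconsistencies caused by a master crash.</strong></p><p>Benefits of keeping metadata in memory:</p><ul><li>Memory access is far faster than disk access</li><li>The master's background threads can periodically scan the entire state efficiently</li><li>The periodic scan drives chunk garbage collection, re-replication, and chunk migration between chunkservers for <strong>load balancing</strong> and <strong>balanced disk space usage</strong></li></ul><p><strong>The one potential concern</strong>: the memory of a single machine is limited</p><p><em>Why doesn't the master persist the replica locations of each chunk?</em></p><h4 id="操作日志-Operation-Log"><a href="#操作日志-Operation-Log" class="headerlink" title="Operation Log"></a>Operation Log</h4><p>The operation log is the historical record of critical <strong>metadata</strong> changes. <strong>It is the GFS 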
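core</strong>.</p><ul><li><p>The operation log serves as the logical timeline of the whole system and defines the order of concurrent operations.</p></li><li><p>A change is not visible to clients until its log record has been persisted, because until then the change is not durable and could still be lost.</p></li><li><p>Log compaction: asynchronous checkpoints, which are in effect snapshots of the master's state</p></li></ul><p>Q: What if a failure occurs while a checkpoint is being written? That is not a problem. Every log update is persisted, so an incomplete checkpoint is simply ignored, and replaying the log records restores the state from before the failure.</p><hr><h3 id="GFS-的一致性是怎么实现的?"><a href="#GFS-的一致性是怎么实现的?" class="headerlink" title="How does GFS achieve consistency?"></a>How does GFS achieve consistency?</h3><p>First, what consistency does GFS provide?</p><p>GFS provides a <strong>relaxed consistency model</strong>, as the table in the figure below shows.</p><p><strong>Notes on reading the table:</strong> Write means an in-place write; Record Append means an append; <em>defined</em> and <em>consistent</em> describe a particular <strong>region</strong> of a file, not the file as a whole.</p><p><em>consistent</em>: a file region is consistent if all clients see the same data in it, no matter which replica they read from.</p><p><em>defined</em>: a file region is defined if it meets two conditions: 1. it is consistent; 2. all clients see the <strong>mutation in its entirety</strong>.</p><p>What does seeing the <strong>mutation in its entirety</strong> mean? A file can be mutated in only three ways: a serial in-place write, a concurrent in-place write, or a record append, and every mutation carries the <strong>data to be written</strong>. Seeing the mutation in its entirety means that if you read the region right after the mutation completes, you get back exactly the data you wrote, not a mix of it with something else.</p><p><img src="https://s2.loli.net/2022/06/06/2oqGktIrcKexzMV.png" alt="GFS1.PNG"></p><ul><li>Namespace mutations are atomic; the WAL defines a global order for them</li><li>After a successful serial in-place write, the affected file region is defined</li><li>After successful concurrent in-place writes, the affected file region is consistent but undefined</li><li>After a successful record append, the region holding the record is defined, but the regions before it may be inconsistent</li></ul><h3 id="租约与变更顺序"><a href="#租约与变更顺序" class="headerlink" title="Leases and mutation order"></a>Leases and mutation order</h3><p>GFS uses <strong>leases</strong> to maintain a <strong>consistent mutation order across replicas</strong>. <strong>Leases also greatly reduce the load on the master</strong>, because write requests go directly to the primary instead of through the master.</p><ul><li>Through leases, the master guarantees that at any moment at most one replica of a chunk is the primary</li><li>The primary assigns serial numbers to all mutations of the chunk, which yields a single mutation order, and the secondaries apply the mutations in that order</li></ul><p><strong>Difference from Raft</strong>: Raft uses leader election to choose a single leader for the cluster, and the leader replicates every read and write request to the other replicas as log entries to keep the replicas consistent. In GFS, the master appoints a primary by granting it a lease; the primary picks one order for all mutations, and the secondaries apply them in that order to keep the replicas consistent. <strong>So the difference is that Raft gets its single decision-maker through an election, while GFS 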
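gets its single decision-maker through a lease</strong></p><blockquote><p>The analysis above also shows what Raft and GFS have in common: both need a single decision-maker to establish one agreed order</p></blockquote><h3 id="client-写的流程"><a href="#client-写的流程" class="headerlink" title="The client write flow"></a>The client write flow</h3><p>A client write consists of 7 steps and 2 flows (control flow and data flow)</p><p><img src="https://s2.loli.net/2022/06/06/l4VGaILz81vjSF6.png" alt="GFS2.PNG"></p><ol><li>The client asks the master which chunkserver holds the primary (the lease holder) for the chunk it wants to write. If no replica currently holds a lease, the master picks an up-to-date replica (it keeps a version number per chunk to tell which replicas are current) and grants it the lease</li><li>The master replies with the locations of the primary and the secondaries. The client caches this information until the lease expires or the primary becomes unreachable</li><li>The client pushes the file data <strong>in a pipelined fashion</strong> to the closest replica, which forwards it to the next replica in the same pipelined way. Each replica keeps the received data in an LRU buffer cache until the data is used or aged out</li><li>Only after <strong>all replicas</strong> have acknowledged receiving the data does the client send the write request to the primary. The primary assigns an order to all the mutations it receives (possibly from multiple clients) and applies them to its own local state in that order</li><li>The primary forwards the write request to all secondaries, and each secondary applies the mutations in the same order</li><li>The secondaries reply to the primary to indicate that they have applied the mutation</li><li>The primary replies to the client. Errors encountered at any replica are reported to the client, which handles them simply by retrying: it makes a few attempts at steps 3 through 7 before falling back to retrying the write from the beginning</li></ol><blockquote><p>Points to note about the whole process:</p><ol><li><p>If a chunk has no primary at some moment, how does the master find the most up-to-date replica to grant the lease to?</p><ul><li>The master keeps a version number for every chunk; a replica whose version number matches the master's is up to date</li></ul></li><li><p>Why is the data pushed in a pipeline?</p><ul><li>Pipelining pushes the data along a chain of chunkservers, so each machine sends the data out only once, which keeps the bandwidth load on any single machine low. And because a chunkserver starts forwarding as soon as it receives some data, without waiting for all of it to arrive, latency drops considerably.</li></ul></li><li><p>If a replica suddenly goes offline, it cannot receive the data or acknowledge the client, so the client never hears back from all replicas. What should happen then?</p></li><li><p>The primary tells all replicas to perform an append; some succeed and some fail, so some replicas of the chunk contain the appended data and others do not. What does a read return at that point?</p><ul><li><p>The read may return the new data or the old data, depending on which replica it hits. GFS considers this state acceptable, and nothing needs to be repaired.</p><p>So what should I do if I want to be sure I read the new data?</p></li></ul></li><li><p>What if a write straddles a chunk boundary?</p><ul><li>The GFS client library splits it into multiple write operations. The downside is that atomicity is lost: writes from other clients may be interleaved between the pieces, which leaves the file region <strong>consistent but undefined</strong> (the same outcome as concurrent writes)</li></ul></li></ol></blockquote><hr><h3 id="原子记录追加"><a href="#原子记录追加" class="headerlink" title="Atomic record append"></a>Atomic record append</h3><p>GFS 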
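supports two kinds of writes, in-place write and append. An in-place write must specify an offset, whereas an append only supplies the data. Concurrent in-place writes to the same region are not serialized, so the region may end up containing fragments of data from several clients. Concurrent in-place writes therefore preserve consistency, but they leave the file region consistent but undefined.</p><p><strong>For an append, GFS guarantees that the record is appended to the end of the file atomically at least once!</strong></p><p>The append flow is almost the same as the client write flow described above, except that the primary performs one extra check. If appending the record would exceed the chunk's maximum size, the primary pads the rest of the chunk and then tells the client to retry the append on the next chunk. Once all replicas have received the data and the primary has confirmed that the append fits in the chunk, the primary applies the append locally and has the other replicas apply it at the same offset (the offset the primary used). When every replica has succeeded, the primary reports success to the client; if any replica fails, the primary has the client retry.</p><p><strong>The data in a single append must be at most 1/4 of the chunk size</strong>, that is, 16 MB.</p><p>Because of the atomically-at-least-once guarantee, the region that holds a successfully appended record is defined (and hence consistent). But a record may land on some replicas only after the client has retried the append several times, so <strong>the regions before it may be inconsistent.</strong> For example:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">S0: a b b c d c</span><br><span class="line">S1: a * b c d c</span><br><span class="line">S2: a * b * d c</span><br></pre></td></tr></table></figure><h3 id="snapshot"><a href="#snapshot" class="headerlink" title="snapshot"></a>snapshot</h3><hr><p>Open questions about snapshots:</p><ul><li>Is a snapshot taken of the master's metadata or of the chunks on the chunkservers?<ul><li>I think it is a snapshot of the master's metadata, because only the metadata uses a WAL for consistency and crash recovery.</li></ul></li><li>What technique keeps snapshots lightweight?<ul><li>Copy-on-write (COW)? How exactly is it used?</li></ul></li></ul><hr><h3 id="Master-的工作"><a href="#Master-的工作" class="headerlink" title="What the master does"></a>What the master does</h3><p>The master is responsible for:</p><ul><li>All namespace operations</li><li>Replica placement decisions</li><li>Creating new chunks and replicas</li><li>Keeping chunks at their replication level through re-replication</li><li>Balancing load across chunkservers (network bandwidth and disk usage)</li><li>Garbage-collecting unused storage</li></ul><h3 id="如何保证-namespace-的变更是原子的?"><a href="#如何保证-namespace-的变更是原子的?" 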
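class="headerlink" title="How are namespace mutations kept atomic?"></a>How are namespace mutations kept atomic?</h3><p>Namespace mutations must be serialized. For flexibility this is done with read-write locks on path names: read locks on the ancestor directories and a write lock on the file being changed.</p><p>Each master operation acquires a set of locks before it runs. For example, to operate on <code>/d1/d2…/dn/leaf</code>, it acquires read locks on the directories <code>/d1, /d1/d2, …, /d1/d2…/dn</code>, and then a read lock or a write lock on the full path <code>/d1/d2…/dn/leaf</code>.</p><p><strong>One benefit of this locking scheme is that it allows concurrent mutations in the same directory</strong>. For example, several files can be created in the same directory at once: each creation takes a read lock on the directory name and a write lock on its own file name.</p><ul><li>The read lock on the directory name is enough to keep the directory from being deleted, renamed, or snapshotted.</li><li>The write lock on the file name serializes attempts to create a file with the same name, so the file is created only once.</li></ul><p><strong>Lock ordering</strong> matters because it prevents deadlock: locks are acquired in a consistent total order, first by level in the namespace tree and then lexicographically within a level.</p><h3 id="副本放置的位置"><a href="#副本放置的位置" class="headerlink" title="Replica placement"></a>Replica placement</h3><p>The chunk <strong>replica placement policy</strong> serves two purposes: maximizing data reliability and availability, and maximizing network bandwidth utilization.</p><p>The paper spreads replicas across machines in different racks. This tolerates the failure of an entire rack, and reads can use the aggregate bandwidth of multiple racks.</p><h3 id="chunk-的创建,重复制,重平衡"><a href="#chunk-的创建,重复制,重平衡" class="headerlink" title="Chunk creation, re-replication, and rebalancing"></a>Chunk creation, re-replication, and rebalancing</h3><p>Chunk replicas are created in three situations:</p><ul><li>A write needs a new chunk</li><li>The number of available replicas of a chunk falls below the user-specified goal, so the master re-replicates it</li><li>Periodic rebalancing</li></ul><p>When choosing chunkservers for a chunk, the master considers:</p><ul><li><p>Disk utilization, preferring chunkservers with below-average disk space utilization</p></li><li><p>The number of recent chunk creations or clones on each chunkserver</p></li><li><p>Spreading replicas across racks</p></li></ul><h3 id="垃圾回收"><a href="#垃圾回收" class="headerlink" title="Garbage collection"></a>Garbage collection</h3><p><strong>Lazy reclamation</strong></p><ol><li>When an application deletes a file, the master first renames it to a hidden name that includes the deletion timestamp</li><li>During its regular scan of the file system namespace, the master removes hidden files that are more than 3 days old</li><li>In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (those not reachable from any file) and erases their metadata</li><li>In the periodic heartbeat exchange, each chunkserver reports the chunk handles it holds. The master compares them against its own metadata and replies with the handles of chunks that no longer exist there, and the chunkserver is then free to delete its local copies of those chunks</li></ol><h3 id="识别陈旧副本"><a href="#识别陈旧副本" class="headerlink" title="Detecting stale replicas"></a>Detecting stale replicas</h3><p>For each chunk, the master maintains a version number to distinguish up-to-date replicas from stale ones.</p><p>Whenever the master grants a new lease on a chunk, it increments the chunk's version number.</p><p>The master and all up-to-date replicas record the new version number in their persistent state.</p><p>If a replica is unavailable at that moment, its chunk version number is not advanced. When a 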
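chunkserver restarts and reports its set of chunks and their version numbers, the master detects any stale replicas it holds.</p><p>Stale replicas are removed by the master's regular garbage collection.</p><p>When a client asks the master where a chunk lives, the master returns only the chunkservers that hold up-to-date replicas. As an extra safeguard, the version number is checked again whenever a client talks to a chunkserver, to confirm that the chunkserver's copy of the chunk is up to date.</p>]]></content>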
<summary type="html"><p><strong>What GFS is: a large-scale, scalable, fault-tolerant distributed file system</strong></p>
<h3 id="具有的特性:"><a href="#具有的特性:" class="headerlink" title="Key properties:"></a>Key properties:</h3><ul>
<li</summary>
</entry>
<entry>
<title>Config System</title>
<link href="https://codroc.github.io/2022/07/05/%E9%85%8D%E7%BD%AE%E7%B3%BB%E7%BB%9F/"/>
<id>https://codroc.github.io/2022/07/05/%E9%85%8D%E7%BD%AE%E7%B3%BB%E7%BB%9F/</id>
<published>2022-07-05T11:57:21.573Z</published>
<updated>2022-07-05T11:57:21.573Z</updated>
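<content type="html"><![CDATA[<h2 id="配置系统"><a href="#配置系统" class="headerlink" title="Config System"></a>Config System</h2><p><strong>What is a config system for?</strong></p><p>As I see it, a config system makes a program easier to run and to ship. Configuration variables are pulled out of the code into a config file; to change one, you edit the file and restart the program. Without a config system, changing a configuration variable usually means editing the source and then recompiling and relinking, which costs time and effort (finding the matching versions of the libraries and the compiler, then sitting through a long build). For closed-source software, editing the source is not even an option.</p><p>A config system solves these problems neatly.</p><h3 id="YAML"><a href="#YAML" class="headerlink" title="YAML"></a>YAML</h3><p>A config file needs a language, and I chose YAML. It is widely used for config files, is concise yet expressive, and is more convenient to write by hand than JSON.</p><p>YAML is essentially a general-purpose data serialization format. Its basic syntax rules are:</p><ul><li>It is case sensitive</li><li>Indentation expresses nesting</li><li>Indentation must use spaces; tabs are not allowed</li><li>The number of spaces does not matter, as long as elements at the same level are left-aligned</li></ul><p><code>#</code> starts a comment; the parser ignores everything from that character to the end of the line.</p><p>YAML supports three kinds of data structures:</p><ul><li>Objects: collections of key-value pairs, also called mappings, hashes, or dictionaries</li><li>Arrays: ordered lists of values, also called sequences or lists</li><li>Scalars: single, indivisible values</li></ul><p>Compared with JSON, its set of data types is simpler (JSON has 6 types).</p><p>Downloading and installing yaml-cpp:</p><p><code>yaml-cpp: github repo</code></p><p><code>mkdir build && cd build && cmake .. 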
&& make install</code></p><h3 id="基于-YAML-实现-配置系统"><a href="#基于-YAML-实现-配置系统" class="headerlink" title="Building the config system on YAML"></a>Building the config system on YAML</h3><p>Principles of the config system:</p><ul><li><strong>Convention over configuration:</strong> the convention is the value hard-coded in the source, and the configuration is the value given in the config file (.yaml).</li><li><strong>Nothing out of thin air:</strong> a configuration variable that is not defined in the source has no effect, even if the config file (.yaml) defines it.</li></ul><p>The overall structure is as follows:</p><p>A configuration variable generally consists of a name, a value, and a description, so the shared attributes can be factored out into a base class. Sometimes a variable has to be printed to the console for the user, or reset from a string, so the base class also needs fromString and toString methods:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">ConfigVarBase</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="keyword">using</span> ptr = <span class="built_in">std</span>::<span class="built_in">shared_ptr</span><ConfigVarBase>;</span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="keyword">bool</span> <span class="title">fromString</span><span class="params">(<span class="built_in">std</span>::<span class="built_in">string</span> str)</span></span> = <span class="number">0</span>; <span class="comment">// set the variable's value from str</span></span><br><span class="line"> <span class="function"><span class="keyword">virtual</span> <span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">toString</span><span class="params">()</span></span> = <span class="number">0</span>; <span class="comment">// convert the value to a string for output</span></span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> _name;</span><br><span class="line"> <span 
class="built_in">std</span>::<span class="built_in">string</span> _description;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>The name and the description have fixed types, so they live in the base class. The value can be of any type, so a class template derived from the base class represents a concrete configuration variable:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">ConfigVar</span> :</span> <span class="keyword">public</span> ConfigVarBase {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">fromString</span><span class="params">(<span class="built_in">std</span>::<span class="built_in">string</span> str)</span> <span class="keyword">override</span></span>;</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">toString</span><span class="params">()</span> <span class="keyword">override</span></span>;</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> T _val;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>With the configuration variables in place, a class is needed to manage them. I use a map:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Config</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> ...</span><br><span class="line"> static std::map<std::string, ConfigVarBase::ptr>& GetConfigVars() {</span><br><span class="line"> <span class="keyword">static</span> <span class="built_in">std</span>::<span class="built_in">map</span><<span class="built_in">std</span>::<span class="built_in">string</span>, ConfigVarBase::ptr> g_configVars;</span><br><span class="line"> <span class="keyword">return</span> g_configVars;</span><br><span class="line"> }</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>Why return a function-local static map from a static function? The config system may be used by objects in other translation units. If one of them touches <strong>g_configVars</strong> during its own static initialization, before the map has been constructed, the program has undefined behavior and typically fails at runtime. This is the initialization-order problem of non-local statics: the order in which they are initialized across translation units is unspecified. A local static solves it: other translation units obtain <strong>g_configVars</strong> through a function call, and the map is guaranteed to be initialized before its first use.</p><p>With that, the <strong>variables defined by convention</strong> are in place!</p><p>The next step is reading configuration variables from the config file (.yaml). yaml-cpp provides the LoadFile function, which reads a .yaml file into a YAML::Node.</p><p>The names in the .yaml file are laid out differently from the variable names in my source code:</p><p>In YAML:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">A:</span></span><br><span class="line">  <span class="attr">B:</span> <span class="number">10</span></span><br><span class="line">  <span class="attr">C:</span> <span class="number">20</span></span><br></pre></td></tr></table></figure><p>while the variable names in the source are: A.B = 10, A.C = 20</p><p>So a conversion from the YAML layout to the source-code naming scheme is needed. yaml-cpp's <code>IsNull, IsScalar, IsSequence, IsMap</code> can be used to walk the node recursively and build the converted names. Only map (object) nodes need to be descended into.</p><figure class="highlight c++"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">listAllNodes</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& name, <span class="keyword">const</span> YAML::Node& node, <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="built_in">std</span>::<span class="built_in">pair</span><<span class="built_in">std</span>::<span class="built_in">string</span>, YAML::Node>>& allNodes)</span> </span>{</span><br><span class="line"> allNodes.push_back(<span class="built_in">std</span>::<span class="built_in">make_pair</span>(name, node));</span><br><span class="line"> <span class="keyword">if</span> (node.IsNull()) {</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">else</span> <span class="keyword">if</span> (node.IsScalar()) 
{</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">else</span> <span class="keyword">if</span> (node.IsSequence()) {</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">else</span> <span class="keyword">if</span> (node.IsMap()) {</span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">auto</span> it = node.begin(); it != node.end(); ++it) {</span><br><span class="line"> listAllNodes(name.empty() ? it->first.as<<span class="built_in">std</span>::<span class="built_in">string</span>>() :</span><br><span class="line"> name + <span class="string">"."</span> + it->first.Scalar(), it->second, allNodes); <span class="comment">// the name conversion happens here: parent and child keys are joined with "."</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">void</span> <span class="title">Config::loadFromYaml</span><span class="params">(<span class="keyword">const</span> <span class="keyword">char</span>* filename)</span> </span>{</span><br><span class="line"> YAML::Node node = YAML::LoadFile(filename);</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">vector</span><<span class="built_in">std</span>::<span class="built_in">pair</span><<span class="built_in">std</span>::<span class="built_in">string</span>, YAML::Node>> allNodes;</span><br><span class="line"> listAllNodes(<span class="string">""</span>, node, allNodes);</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> (<span class="keyword">const</span> <span class="keyword">auto</span>& i : allNodes) { <span class="comment">// iterate over every (name, node) pair</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> name = i.first;</span><br><span class="line"> <span class="keyword">if</span> (name.empty())</span><br><span class="line"> <span 
class="keyword">continue</span>;</span><br><span class="line"> ConfigVarBase::ptr p = Config::find(name);</span><br><span class="line"> <span class="keyword">if</span> (p) { <span class="comment">// enforces the "nothing out of thin air" principle</span></span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">stringstream</span> ss;</span><br><span class="line"> ss << i.second;</span><br><span class="line"> p->fromString(ss.str());</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>As we all know, a YAML file is just text. We need to build in-memory objects from YAML text, and produce YAML text from in-memory objects; in essence this is serialization and deserialization, much like with JSON.<br>My summary:</p><ul><li>YAML::Node -> std::string -> Type<ul><li>std::stringstream handles YAML::Node -> std::string. The YAML library already provides this part</li><li>My own <code>LexicalCast<F, T></code> handles the std::string -> Type conversion</li></ul></li><li>Type -> std::string -> YAML::Node<ul><li>YAML::Load handles std::string -> YAML::Node. The YAML library already provides this part</li><li>My own <code>LexicalCast<F, T></code> handles the Type -> std::string conversion</li></ul></li></ul><h3 id="fromStr-和-toStr-的实现"><a href="#fromStr-和-toStr-的实现" class="headerlink" title="Implementing fromStr and toStr"></a>Implementing fromStr and toStr</h3><p>For plain built-in types, boost::lexical_cast does the job. For more complex types such as vector, list, set, map, unordered_set, unordered_map, and user-defined types, the conversion has to be written by hand.</p><p><strong>Supporting STL types:</strong></p><p>Define a LexicalCast class template, then partially specialize it for each STL container.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// generic case: plain built-in types</span></span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">F</span>, <span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LexicalCast</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function">T <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> F& val)</span> </span>{</span><br><span class="line"> <span class="keyword">return</span> boost::lexical_cast<T>(val);</span><br><span class="line"> }</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="comment">// cast from std::string to std::vector<T></span></span><br><span class="line"><span class="keyword">template</span><<span class="class"><span 
class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LexicalCast</span><</span><span class="built_in">std</span>::<span class="built_in">string</span>, <span class="built_in">std</span>::<span class="built_in">vector</span><T>> {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">vector</span><T> <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& str)</span> </span>{</span><br><span class="line"> ...<span class="comment">// parse str with yaml-cpp's Load, then iterate over the node and format each element through a stringstream</span></span><br><span class="line"> }</span><br><span class="line">};</span><br><span class="line"><span class="comment">// cast from std::vector<T> to std::string</span></span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>></span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LexicalCast</span><</span><span class="built_in">std</span>::<span class="built_in">vector</span><T>, <span class="built_in">std</span>::<span class="built_in">string</span>> {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">vector</span><T>& v)</span> </span>{</span><br><span class="line"> ...</span><br><span class="line"> }</span><br><span 
class="line">};</span><br><span class="line"></span><br><span class="line">....</span><br><span class="line"></span><br><span class="line"><span class="keyword">template</span><<span class="class"><span class="keyword">class</span> <span class="title">T</span>, <span class="keyword">class</span> <span class="title">FromStr</span> =</span> LexicalCast<<span class="built_in">std</span>::<span class="built_in">string</span>, T>, </span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">ToStr</span> =</span> LexicalCast<T, <span class="built_in">std</span>::<span class="built_in">string</span>>></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">ConfigVar</span> :</span> <span class="keyword">public</span> ConfigVarBase {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="keyword">bool</span> <span class="title">fromString</span><span class="params">(<span class="built_in">std</span>::<span class="built_in">string</span> str)</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> _val = FromStr()(str);</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line"> }</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">toString</span><span class="params">()</span> <span class="keyword">override</span> </span>{</span><br><span class="line"> <span class="keyword">return</span> ToStr()(_val);</span><br><span class="line"> }</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> T _val;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p><strong>Supporting user-defined types:</strong></p><p>A user-defined type needs its own LexicalCast specializations, one for each direction. Once they exist, Config can parse the type, and the type can be combined with the regular STL 
containers.</p><p>For example, adding a Person class:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Person</span> {</span></span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">string</span> name;</span><br><span class="line"> <span class="keyword">int</span> age;</span><br><span class="line"> <span class="keyword">bool</span> sex;</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="comment">// from std::string to Person</span></span><br><span class="line"><span class="keyword">template</span><></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span 
class="title">LexicalCast</span><</span><span class="built_in">std</span>::<span class="built_in">string</span>, Person> {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function">Person <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> <span class="built_in">std</span>::<span class="built_in">string</span>& str)</span> </span>{</span><br><span class="line"> YAML::Node node = YAML::Load(str);</span><br><span class="line"> Person p;</span><br><span class="line"> p.name = node[<span class="string">"name"</span>].as<<span class="built_in">std</span>::<span class="built_in">string</span>>();</span><br><span class="line"> p.age = node[<span class="string">"age"</span>].as<<span class="keyword">int</span>>();</span><br><span class="line"> p.sex = node[<span class="string">"sex"</span>].as<<span class="keyword">bool</span>>();</span><br><span class="line"> <span class="keyword">return</span> p;</span><br><span class="line"> }</span><br><span class="line">};</span><br><span class="line"></span><br><span class="line"><span class="comment">// from Person to std::string</span></span><br><span class="line"><span class="keyword">template</span><></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LexicalCast</span><</span>Person, <span class="built_in">std</span>::<span class="built_in">string</span>> {</span><br><span class="line"><span class="keyword">public</span>:</span><br><span class="line"> <span class="function"><span class="built_in">std</span>::<span class="built_in">string</span> <span class="title">operator</span><span class="params">()</span><span class="params">(<span class="keyword">const</span> Person& p)</span> </span>{</span><br><span class="line"> YAML::Node node;</span><br><span class="line"> node[<span class="string">"name"</span>] = p.name;</span><br><span class="line"> 
node[<span class="string">"age"</span>] = p.age;</span><br><span class="line"> node[<span class="string">"sex"</span>] = p.sex;</span><br><span class="line"> <span class="built_in">std</span>::<span class="built_in">stringstream</span> ss;</span><br><span class="line"> ss << node;</span><br><span class="line"> <span class="keyword">return</span> ss.str();</span><br><span class="line"> }</span><br><span class="line">};</span><br></pre></td></tr></table></figure><h3 id="配置的事件机制"><a href="#配置的事件机制" class="headerlink" title="配置的事件机制"></a>Configuration change events</h3><p>When a configuration item is modified, the code that depends on it can be notified.</p><p>This is fairly easy to implement: add an OnChangeCallBack _cb callback to the ConfigVar class template. Its type is </p><p>std::function<void(const T& oldVal, const T& newVal)>. Whenever ConfigVar::_val is about to change, first check whether the new value differs from the old one; if it does, invoke _cb with the old and new values before storing the new one.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"> <span class="class"><span class="keyword">class</span> <span class="title">ConfigVar</span> {</span></span><br><span class="line">... </span><br><span class="line"><span class="function">T <span class="title">getValue</span><span class="params">()</span> <span class="keyword">const</span> </span>{ <span class="keyword">return</span> _val; }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">setValue</span><span class="params">(<span class="keyword">const</span> T& v)</span> </span>{</span><br><span class="line"> <span class="keyword">if</span> (v == _val)</span><br><span class="line"> <span class="keyword">return</span>;</span><br><span class="line"> <span class="keyword">if</span> (_cb)</span><br><span class="line"> _cb(_val, v);</span><br><span class="line"> _val = v;</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">setOnChangeCallBack</span><span class="params">(OnChangeCallBack cb)</span> </span>{ _cb = cb; }</span><br><span class="line"> <span class="function"><span class="keyword">void</span> <span class="title">delOnChangeCallBack</span><span class="params">()</span> </span>{ _cb = <span class="literal">nullptr</span>; }</span><br><span class="line"><span class="keyword">private</span>:</span><br><span class="line"> T _val;</span><br><span class="line"> OnChangeCallBack _cb;</span><br><span class="line">};</span><br></pre></td></tr></table></figure><p>With that, the configuration system is basically complete!</p>]]></content>
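<summary type="html"><h2 id="配置系统"><a href="#配置系统" class="headerlink" title="配置系统"></a>Configuration System</h2><p><strong>What is a configuration system for?</strong></p>
<p>As I understand it, it makes a program easier to run and release. All configuration variables are pulled out and placed in</summary>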
</entry>
<entry>
<title>rapidjson: dump/load json data to/from file</title>
<link href="https://codroc.github.io/2022/06/14/rw_file_by_json/"/>
<id>https://codroc.github.io/2022/06/14/rw_file_by_json/</id>
<published>2022-06-14T11:57:16.000Z</published>
<updated>2022-06-14T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h2 id="rapidjson-dump-load-json-data-to-from-file"><a href="#rapidjson-dump-load-json-data-to-from-file" class="headerlink" title="rapidjson: dump/load json data to/from file"></a>rapidjson: dump/load json data to/from file</h2><p>This post does not discuss why to use json. It only records how to store an in-memory structure as a json file,
and how to read the contents of a json file back into memory.</p><h4 id="从内存中的字符串得到-json-对象"><a href="#从内存中的字符串得到-json-对象" class="headerlink" title="从内存中的字符串得到 json 对象"></a>Getting a json object from an in-memory string</h4><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* Suppose a JSON document is stored in a C string (const char* json):</span></span><br><span class="line"><span class="comment">{</span></span><br><span class="line"><span class="comment"> "hello": "world",</span></span><br><span class="line"><span class="comment"> "t": true ,</span></span><br><span class="line"><span class="comment"> "f": false,</span></span><br><span class="line"><span class="comment"> "n": null,</span></span><br><span class="line"><span class="comment"> "i": 123,</span></span><br><span class="line"><span class="comment"> "pi": 3.1416,</span></span><br><span class="line"><span class="comment"> "a": [1, 2, 3, 4]</span></span><br><span class="line"><span class="comment">}</span></span><br><span class="line"><span class="comment">*/</span></span><br><span class="line"><span class="comment">// Parse it into a Document:</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"rapidjson/document.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> rapidjson;</span><br><span class="line"></span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line">Document document;</span><br><span class="line">document.Parse(json);</span><br></pre></td></tr></table></figure><h4 id="通过自定义从-0-构造一个-json-对象"><a href="#通过自定义从-0-构造一个-json-对象" class="headerlink" title="通过自定义从 0 构造一个 json 对象"></a>Building a json object from scratch</h4><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"rapidjson/document.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> rapidjson;</span><br><span class="line"></span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line">Document d;</span><br><span class="line">Document::AllocatorType& allocator = d.GetAllocator();</span><br><span class="line"><span class="comment">// Create the block object at root of DOM</span></span><br><span class="line">d.SetObject();</span><br><span class="line">d.AddMember(<span class="string">"ID"</span>, <span class="number">8086</span>, allocator);</span><br></pre></td></tr></table></figure><h4 id="从文件解析一个-json"><a href="#从文件解析一个-json" class="headerlink" 
title="从文件解析一个 json"></a>Parsing a json from a file</h4><ul><li>If the file is small enough to be read into memory in one go, use a <strong>memory stream</strong>: keep the json in memory and parse it from there.</li><li>If the file is too large to be read into memory at once, use a <strong>file stream</strong> and parse it one chunk at a time.</li></ul><p><strong>Memory stream input:</strong></p><p>Use <strong>StringStream</strong></p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"rapidjson/document.h"</span> <span class="comment">// also pulls in "rapidjson/rapidjson.h"</span></span></span><br><span class="line"> </span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> rapidjson;</span><br><span class="line"> </span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line"><span class="keyword">const</span> <span class="keyword">char</span> json[] = <span class="string">"[1, 2, 3, 4]"</span>; <span class="comment">// think of json as content that was read from a file</span></span><br><span class="line"><span class="function">StringStream <span class="title">s</span><span class="params">(json)</span></span>;</span><br><span class="line"> </span><br><span class="line">Document d;</span><br><span class="line">d.ParseStream(s);</span><br></pre></td></tr></table></figure><p><strong>Memory stream output:</strong></p><p>Use <strong>StringBuffer</strong></p><p><strong>StringBuffer</strong> is a simple output stream. It allocates a memory buffer into which the whole json is written. Use <code>GetString()</code> to obtain that buffer.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"rapidjson/stringbuffer.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><rapidjson/writer.h></span></span></span><br><span class="line"></span><br><span class="line">StringBuffer buffer;</span><br><span class="line"><span class="function">Writer<StringBuffer> <span class="title">writer</span><span class="params">(buffer)</span></span>;</span><br><span class="line">d.Accept(writer);</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> <span class="keyword">char</span>* output = buffer.GetString();</span><br></pre></td></tr></table></figure><ul><li><code>Writer<StringBuffer> writer(buffer);</code> tells the writer to write the final json string into buffer.</li><li><code>d.Accept(writer);</code> makes the Document hand the content to be written to the writer.</li><li><code>GetString()</code> then returns the contents of that buffer.</li></ul><p><strong>File stream input:</strong></p><p>Use <strong>FileReadStream</strong></p><p><strong>FileReadStream</strong> reads a file through a <strong>FILE</strong> pointer. The caller has to supply a buffer.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span 
class="meta-string">"rapidjson/filereadstream.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><cstdio></span></span></span><br><span class="line"> </span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> rapidjson;</span><br><span class="line"> </span><br><span class="line">FILE* fp = fopen(<span class="string">"big.json"</span>, <span class="string">"rb"</span>); <span class="comment">// use "r" on non-Windows platforms</span></span><br><span class="line"> </span><br><span class="line"><span class="keyword">char</span> readBuffer[<span class="number">65536</span>];</span><br><span class="line"><span class="function">FileReadStream <span class="title">is</span><span class="params">(fp, readBuffer, <span class="keyword">sizeof</span>(readBuffer))</span></span>; <span class="comment">// is reads from fp through readBuffer, at most sizeof(readBuffer) bytes at a time</span></span><br><span class="line"> </span><br><span class="line">Document d;</span><br><span class="line">d.ParseStream(is);</span><br><span class="line"> </span><br><span class="line">fclose(fp);</span><br></pre></td></tr></table></figure><p><strong>File stream output:</strong></p><p>Use <strong>FileWriteStream</strong></p><p><strong>FileWriteStream</strong> is a buffered output stream. It is used in much the same way as <strong>FileReadStream</strong>.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"rapidjson/filewritestream.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><rapidjson/writer.h></span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string"><cstdio></span></span></span><br><span class="line"> </span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> rapidjson;</span><br><span class="line"> </span><br><span class="line">Document d;</span><br><span class="line">d.Parse(json);</span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line"> </span><br><span class="line">FILE* fp = fopen(<span class="string">"output.json"</span>, <span class="string">"wb"</span>); <span class="comment">// use "w" on non-Windows platforms</span></span><br><span class="line"> </span><br><span class="line"><span class="keyword">char</span> writeBuffer[<span class="number">65536</span>];</span><br><span class="line"><span class="function">FileWriteStream <span class="title">os</span><span class="params">(fp, writeBuffer, <span class="keyword">sizeof</span>(writeBuffer))</span></span>;</span><br><span class="line"> </span><br><span class="line"><span class="function">Writer<FileWriteStream> <span class="title">writer</span><span class="params">(os)</span></span>;</span><br><span class="line">d.Accept(writer);</span><br><span class="line"> </span><br><span class="line">fclose(fp);</span><br></pre></td></tr></table></figure><blockquote><p>As the examples show, every input stream is consumed with <code>Document::ParseStream(is)</code>, and every output stream is written by wrapping it in a <code>Writer</code> and passing that to <code>Document::Accept(writer)</code>.</p></blockquote><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><ol><li><a 
href="http://rapidjson.org/zh-cn/md_doc_stream_8zh-cn.html">Streams</a></li></ol>]]></content>
<summary type="html"><h2 id="rapidjson-dump-load-json-data-to-from-file"><a href="#rapidjson-dump-load-json-data-to-from-file" class="headerlink" title="rapidjso</summary>
</entry>
<entry>
<title>MIT 6.824 Lab1 MapReduce</title>
<link href="https://codroc.github.io/2022/06/10/6.824Lab1MapReduce/"/>
<id>https://codroc.github.io/2022/06/10/6.824Lab1MapReduce/</id>
<published>2022-06-10T11:57:16.000Z</published>
<updated>2022-06-10T11:57:16.000Z</updated>
<content type="html"><![CDATA[<h2 id="MIT-6-824-Lab1-MapReduce"><a href="#MIT-6-824-Lab1-MapReduce" class="headerlink" title="MIT 6.824 Lab1 MapReduce"></a>MIT 6.824 Lab1 MapReduce</h2><blockquote><p>MapReduce is a programming model and an associated implementation for processing and generating large data sets.</p></blockquote><p>The paper "MapReduce: Simplified Data Processing on Large Clusters" makes the nature of <strong>MapReduce</strong> clear in its very first sentence: <strong>a programming model and an associated implementation for processing and generating large data sets.</strong></p><p>For a translation of the paper, see the <a href="https://codroc.github.io/2022/06/06/MapReduce/">Chinese translation of MapReduce</a>.</p><h3 id="编程模型"><a href="#编程模型" class="headerlink" title="编程模型"></a>Programming model</h3><p>The whole computation takes a set of key/value pairs as input and produces a set of key/value pairs as output. The computation is called map and reduce, which really means <strong>split and merge</strong>, and is very close to the idea of <strong>divide and conquer</strong>.</p><p><strong>For example: take a car apart into its components, then assemble those components into a Transformer.</strong></p><p>The user only has to specify a <em>map</em> function and a <em>reduce</em> function; the <strong>MapReduce library</strong> takes care of everything else. The two functions should be designed as follows:</p><ul><li>The <em>map</em> function takes input key/value pairs and turns them into <em>intermediate</em> key/value pairs. The MapReduce library groups all pairs with <em>intermediate</em> key = X (the same key) and passes them to the <em>reduce</em> function.</li><li>The <em>reduce</em> function accepts the pairs with <em>intermediate</em> key = X (the same key) and merges them by some rule into a smaller output; typically reduce produces just 0 or 1 output for a given key. The intermediate values are supplied through an iterator, so reduce can handle value lists that are too large to fit in memory.</li></ul><p>For example, the map and reduce functions for word count can be implemented in Go like this.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// The map function is called once for each file of input. The first</span></span><br><span class="line"><span class="comment">// argument is the name of the input file, and the second is the</span></span><br><span class="line"><span class="comment">// file's complete contents. You should ignore the input file name,</span></span><br><span class="line"><span class="comment">// and look only at the contents argument. The return value is a slice</span></span><br><span class="line"><span class="comment">// of key/value pairs.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">Map</span><span class="params">(filename <span class="keyword">string</span>, contents <span class="keyword">string</span>)</span> []<span class="title">mr</span>.<span class="title">KeyValue</span></span> {</span><br><span class="line"> <span class="comment">// function to detect word separators.</span></span><br><span class="line"> ff := <span class="function"><span class="keyword">func</span><span class="params">(r <span class="keyword">rune</span>)</span> <span class="title">bool</span></span> { <span class="keyword">return</span> !unicode.IsLetter(r) }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// split contents into an array of words.</span></span><br><span class="line"> words := strings.FieldsFunc(contents, ff)</span><br><span class="line"></span><br><span 
class="line"> kva := []mr.KeyValue{}</span><br><span class="line"> <span class="keyword">for</span> _, w := <span class="keyword">range</span> words {</span><br><span class="line"> kv := mr.KeyValue{w, <span class="string">"1"</span>}</span><br><span class="line"> kva = <span class="built_in">append</span>(kva, kv)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> kva</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// The reduce function is called once for each key generated by the</span></span><br><span class="line"><span class="comment">// map tasks, with a list of all the values created for that key by</span></span><br><span class="line"><span class="comment">// any map task.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">Reduce</span><span class="params">(key <span class="keyword">string</span>, values []<span class="keyword">string</span>)</span> <span class="title">string</span></span> {</span><br><span class="line"> <span class="comment">// return the number of occurrences of this word.</span></span><br><span class="line"> <span class="keyword">return</span> strconv.Itoa(<span class="built_in">len</span>(values))</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>Let's look at what this map and reduce actually do. The map function takes one key/value pair, where the key is filename (the file name) and the value is contents (the file contents), and outputs an array of {word, 1}. The reduce function takes all values for a given key and outputs a single value; since every value is "1", the number of values is the word's count.</p><p><strong>So how does the MapReduce library get multiple machines to cooperate on the map tasks and reduce tasks until all of them are done? To answer that, look at the library's workflow.</strong></p><p><img src="https://s2.loli.net/2022/06/10/bqKmukjXd8YveT3.png" alt="MapReduce0.PNG"></p><p>The figure shows a one-master, many-worker design. The master does not process tasks itself; it only coordinates the workers, and the workers execute the actual map tasks and reduce tasks. A worker
obtains work by requesting a task from the master. This design has the following strengths and weaknesses:</p><ul><li><p>Splitting the library into two parts, master and worker, keeps the overall workflow clear and easy to understand.</p></li><li><p>Having one master rather than several makes the programming much simpler, because there is no communication or consistency between masters to worry about. The downside is lower availability and reliability: if the master node goes down, the whole service becomes unavailable, so the master <strong>has no fault tolerance</strong>.</p></li><li><p>Multiple workers greatly improve the performance of the MapReduce library. When processing TB-scale data with no dependencies between pieces, processing time drops sharply, and the design also scales well.</p></li></ul><p>Can the idea behind MapReduce be summed up in one sentence?</p><hr><h3 id="用-Go-具体实现-MapReduce-库"><a href="#用-Go-具体实现-MapReduce-库" class="headerlink" title="用 Go 具体实现 MapReduce 库"></a>Implementing the MapReduce library in Go</h3><ul><li><p>One machine runs a Coordinator, which coordinates the work of the workers and answers their task requests.</p></li><li><p>Several machines run workers, which ask the Coordinator for tasks (either a map task or a reduce task; a worker does whatever the Coordinator hands it).</p></li><li><p>A worker may be too slow or may suddenly go down, in which case the Coordinator has to reassign its task.</p></li><li><p>A Coordinator crash is not handled. The Coordinator exits only after all tasks are done; from then on a Worker's RPC to the Coordinator times out and returns an error, which tells the Worker that all tasks have finished (or that its own network is broken), and the Worker simply exits.</p></li></ul><p>Which data structures do the Coordinator and the Worker each need?</p><p><strong>Coordinator:</strong></p><ul><li>It needs to know how many reduce tasks there are in total</li><li>It needs to assign an id to each Worker</li><li>It needs to know whether the current phase is map or reduce</li><li>It needs to know whether all tasks are finished</li><li>Because a Worker may go down, it has to record which Worker is doing which task, along with each Worker's status</li><li>It uses a timeout timer to decide whether a Worker is down</li><li>It needs to record all intermediate results produced when map tasks finish</li></ul><p>So the final data structure looks like this:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> WorkerStatus <span class="keyword">int</span></span><br><span class="line"><span class="keyword">const</span> (</span><br><span class="line"> Free WorkerStatus = <span class="literal">iota</span></span><br><span class="line"> Busy</span><br><span class="line"> Timeout</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> Coordinator <span class="keyword">struct</span> {</span><br><span class="line"> <span class="comment">// Your definitions here.</span></span><br><span class="line"> mu sync.Mutex</span><br><span class="line"></span><br><span class="line"> MapTaskFinished <span class="keyword">bool</span></span><br><span class="line"> MapTaskRemain <span class="keyword">int</span> <span class="comment">// number of map tasks still available for assignment</span></span><br><span class="line"> ReduceTaskFinished <span class="keyword">bool</span></span><br><span class="line"> ReduceTaskRemain <span class="keyword">int</span> <span class="comment">// number of reduce tasks still available for assignment</span></span><br><span class="line"></span><br><span class="line"> Workers <span class="keyword">int</span></span><br><span class="line"> WS <span class="keyword">map</span>[<span class="keyword">int</span>] WorkerStatus <span class="comment">// current status of each worker: 0 (Free) = idle, 1 (Busy) = working on a task, 2 (Timeout) = the coordinator can no longer reach it</span></span><br><span 
class="line"></span><br><span class="line"> <span class="comment">// map task</span></span><br><span class="line"> WorkerToMapTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">string</span><span class="comment">// worker i is running the map task for file filename</span></span><br><span class="line"> IntermediateFiles []<span class="keyword">string</span></span><br><span class="line"> RecordFiles <span class="keyword">map</span>[<span class="keyword">string</span>] <span class="keyword">bool</span> <span class="comment">// records which intermediate files have already been seen</span></span><br><span class="line"> MapTask <span class="keyword">map</span>[<span class="keyword">string</span>] <span class="keyword">int</span> <span class="comment">// state of each input file's map task: 2 = finished, 1 = assigned but not yet finished, 0 = not yet assigned</span></span><br><span class="line"> MapTaskBaseFilename <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// reduce task</span></span><br><span class="line"> NReduce <span class="keyword">int</span></span><br><span class="line"> WorkerToReduceTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span><span class="comment">// worker i is running the j-th reduce task</span></span><br><span class="line"> ReduceTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span> <span class="comment">// state of each reduce task: 2 = finished, 1 = assigned but not yet finished, 0 = not yet assigned</span></span><br><span class="line"> ReduceTaskBaseFilename <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// crash</span></span><br><span class="line"> Timer <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span></span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><strong>Worker:</strong></p><p>Most of a Worker's data is handed to it by the Coordinator,
so it needs no dedicated data structure.</p><p>The main structure of its processing loop is:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> !IsDone() { <span class="comment">// as long as the Coordinator has not finished</span></span><br><span class="line"> AskTask() <span class="comment">// ask the Coordinator for a task</span></span><br><span class="line"> <span class="keyword">if</span> is_map_task() { <span class="comment">// it is a map task</span></span><br><span class="line"> <span class="keyword">if</span> has_task_to_do() {</span><br><span class="line"> do_map_task()</span><br><span class="line"> reply := ReportTask() <span class="comment">// report the result to the Coordinator once the task is done</span></span><br><span class="line"> <span class="keyword">if</span> is_good_job(reply) {</span><br><span class="line"> <span class="comment">// the Coordinator accepted my work</span></span><br><span class="line"> <span class="comment">// it might not: the Coordinator may have assumed I was down and reassigned my task to another worker, in which case my work was wasted</span></span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// nothing to do for now; this happens when every map task has been assigned but not all of them have finished</span></span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> { <span class="comment">// it is a reduce task</span></span><br><span class="line"> <span class="keyword">if</span> has_task_to_do() {</span><br><span class="line"> do_reduce_task()</span><br><span class="line"> reply := ReportTask() <span class="comment">// report the result to the Coordinator once the task is done</span></span><br><span class="line"> <span class="keyword">if</span> is_good_job(reply) {</span><br><span class="line"> <span class="comment">// the Coordinator accepted my work</span></span><br><span class="line"> <span class="comment">// it might not: the Coordinator may have assumed I was down and reassigned my task to another worker, in which case my work was wasted</span></span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// nothing to do for now; this happens when every reduce task has been assigned but not all of them have finished</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>This main loop shows that three RPCs to the Coordinator are needed:</p><ul><li><p>IsDone: has the coordinator finished all tasks?</p></li><li><p>AskTask: ask the coordinator to give me a task</p></li><li><p>ReportTask: report a task I have finished to the coordinator</p></li></ul><hr><p>The final code is as follows:</p><p><code>mr/rpc.go:</code></p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="keyword">package</span> mr</span><br><span class="line"></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// RPC definitions.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// remember to capitalize all names.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> <span class="string">"os"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"strconv"</span></span><br><span class="line"></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// example to show how to declare the arguments</span></span><br><span class="line"><span class="comment">// and reply for an RPC.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> ExampleArgs <span class="keyword">struct</span> {</span><br><span class="line"> X <span class="keyword">int</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> ExampleReply <span class="keyword">struct</span> {</span><br><span class="line"> Y <span class="keyword">int</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// Add your RPC definitions here.</span></span><br><span class="line"></span><br><span class="line"><span class="comment">// AskTask: ask the coordinator for a task</span></span><br><span class="line"><span class="keyword">type</span> AskTaskArgs <span class="keyword">struct</span> {</span><br><span class="line"> WorkerId <span class="keyword">int</span> <span class="comment">// id of the current worker; -1 until the coordinator assigns one</span></span><br><span
class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> AskTaskReply <span class="keyword">struct</span> {</span><br><span class="line"> IsMapTask <span class="keyword">bool</span></span><br><span class="line"> IsReduceTask <span class="keyword">bool</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// map task</span></span><br><span class="line"> Filename <span class="keyword">string</span> <span class="comment">// name of the input file to map</span></span><br><span class="line"> MapTaskBaseFilename <span class="keyword">string</span> <span class="comment">// intermediate keys are written to the files MapTaskBaseFilename-WorkerId-X</span></span><br><span class="line"> WorkerId <span class="keyword">int</span> <span class="comment">// id the coordinator assigned to this worker; while the worker is alive, nobody else uses this id</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// reduce task</span></span><br><span class="line"> NReduce <span class="keyword">int</span> <span class="comment">// total number of reduce tasks</span></span><br><span class="line"> ReduceTaskBaseFilename <span class="keyword">string</span> <span class="comment">// base filename for reduce output</span></span><br><span class="line"> XReduce <span class="keyword">int</span> <span class="comment">// the worker handles reduce task X and writes its output to ReduceTaskBaseFilename-X</span></span><br><span class="line"> AllFiles []<span class="keyword">string</span> <span class="comment">// names of all intermediate files</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// AskStatus: ask the coordinator for its current status</span></span><br><span class="line"><span class="keyword">type</span> AskStatusArgs <span class="keyword">struct</span> {</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> AskStatusReply <span class="keyword">struct</span> {</span><br><span class="line"> IsDone
<span class="keyword">bool</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// ReportTask: the worker finished a task and reports its completion to the coordinator</span></span><br><span class="line"><span class="keyword">type</span> ReportTaskArgs <span class="keyword">struct</span> {</span><br><span class="line"> WorkerId <span class="keyword">int</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// map task</span></span><br><span class="line"> MapTaskFilename <span class="keyword">string</span></span><br><span class="line"> IntermediateFile []<span class="keyword">string</span> <span class="comment">// intermediate files produced by the map task</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// reduce task</span></span><br><span class="line"> XReduce <span class="keyword">int</span> <span class="comment">// the worker did reduce task number XReduce</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> ReportTaskReply <span class="keyword">struct</span> {</span><br><span class="line"> GoodJob <span class="keyword">bool</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// Cook up a unique-ish UNIX-domain socket name</span></span><br><span class="line"><span class="comment">// in /var/tmp, for the coordinator.</span></span><br><span class="line"><span class="comment">// Can't use the current directory since</span></span><br><span class="line"><span class="comment">// Athena AFS doesn't support UNIX-domain sockets.</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">coordinatorSock</span><span class="params">()</span> <span class="title">string</span></span> {</span><br><span class="line"> s := <span class="string">"/var/tmp/824-mr-"</span></span><br><span class="line"> s +=
strconv.Itoa(os.Getuid())</span><br><span class="line"> <span class="keyword">return</span> s</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>mr/coordinator.go:</code></p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span 
class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span 
class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span 
class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span class="line">202</span><br><span class="line">203</span><br><span class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br><span class="line">210</span><br><span class="line">211</span><br><span class="line">212</span><br><span class="line">213</span><br><span class="line">214</span><br><span class="line">215</span><br><span class="line">216</span><br><span class="line">217</span><br><span class="line">218</span><br><span class="line">219</span><br><span class="line">220</span><br><span class="line">221</span><br><span class="line">222</span><br><span class="line">223</span><br><span class="line">224</span><br><span class="line">225</span><br><span class="line">226</span><br><span class="line">227</span><br><span class="line">228</span><br><span class="line">229</span><br><span class="line">230</span><br><span class="line">231</span><br><span class="line">232</span><br><span class="line">233</span><br><span class="line">234</span><br><span class="line">235</span><br><span 
class="line">236</span><br><span class="line">237</span><br><span class="line">238</span><br><span class="line">239</span><br><span class="line">240</span><br><span class="line">241</span><br><span class="line">242</span><br><span class="line">243</span><br><span class="line">244</span><br><span class="line">245</span><br><span class="line">246</span><br><span class="line">247</span><br><span class="line">248</span><br><span class="line">249</span><br><span class="line">250</span><br><span class="line">251</span><br><span class="line">252</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">package</span> mr</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> <span class="string">"log"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"net"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"os"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"net/rpc"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"net/http"</span></span><br><span class="line"><span class="keyword">import</span> <span class="string">"sync"</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> WorkerStatus <span class="keyword">int</span></span><br><span class="line"><span class="keyword">const</span> (</span><br><span class="line"> Free WorkerStatus = <span class="literal">iota</span></span><br><span class="line"> Busy</span><br><span class="line"> Timeout</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="keyword">type</span> Coordinator <span class="keyword">struct</span> {</span><br><span class="line"> <span class="comment">// Your definitions here.</span></span><br><span class="line"> mu sync.Mutex</span><br><span class="line"></span><br><span 
class="line"> MapTaskFinished <span class="keyword">bool</span></span><br><span class="line"> MapTaskRemain <span class="keyword">int</span> <span class="comment">// number of map tasks not yet completed</span></span><br><span class="line"> ReduceTaskFinished <span class="keyword">bool</span></span><br><span class="line"> ReduceTaskRemain <span class="keyword">int</span> <span class="comment">// number of reduce tasks not yet completed</span></span><br><span class="line"></span><br><span class="line"> Workers <span class="keyword">int</span></span><br><span class="line"> WS <span class="keyword">map</span>[<span class="keyword">int</span>] WorkerStatus <span class="comment">// current status of each worker: 0 = free, 1 = working on a task, 2 = the coordinator can no longer reach the worker</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// map task</span></span><br><span class="line"> WorkerToMapTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">string</span> <span class="comment">// worker i is doing the map task for file filename</span></span><br><span class="line"> IntermediateFiles []<span class="keyword">string</span></span><br><span class="line"> RecordFiles <span class="keyword">map</span>[<span class="keyword">string</span>] <span class="keyword">bool</span> <span class="comment">// records which intermediate files have already been seen</span></span><br><span class="line"> MapTask <span class="keyword">map</span>[<span class="keyword">string</span>] <span class="keyword">int</span> <span class="comment">// state of each map input file: 2 = completed, 1 = assigned but not completed, 0 = not yet assigned</span></span><br><span class="line"> MapTaskBaseFilename <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// reduce task</span></span><br><span class="line"> NReduce <span class="keyword">int</span></span><br><span class="line"> WorkerToReduceTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span> <span class="comment">// worker i is doing reduce task j</span></span><br><span class="line"> ReduceTask <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span> <span class="comment">// state of each reduce task: 2 = completed, 1 = assigned but not completed, 0 = not yet assigned</span></span><br><span class="line"> ReduceTaskBaseFilename <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// crash</span></span><br><span class="line"> Timer <span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">// Your code here -- RPC handlers for the worker to call.</span></span><br><span class="line"></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// an example RPC handler.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// the RPC argument and reply types are defined in rpc.go.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">Example</span><span class="params">(args *ExampleArgs, reply *ExampleReply)</span> <span class="title">error</span></span> {</span><br><span class="line"> reply.Y = args.X + <span class="number">1</span></span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">IsDone</span><span class="params">(args *AskStatusArgs, reply *AskStatusReply)</span> <span class="title">error</span></span> {</span><br><span class="line"> <span
class="comment">// Done takes the lock itself, so IsDone is thread-safe too; note that each call also advances the timeout counters inside Done</span></span><br><span class="line"> reply.IsDone = c.Done()</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">AskTask</span><span class="params">(args *AskTaskArgs, reply *AskTaskReply)</span> <span class="title">error</span></span> {</span><br><span class="line"> c.mu.Lock()</span><br><span class="line"> <span class="keyword">defer</span> c.mu.Unlock()</span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> args.WorkerId == <span class="number">-1</span> {</span><br><span class="line"> args.WorkerId = c.Workers</span><br><span class="line"> c.Workers++</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// TODO</span></span><br><span class="line"> <span class="comment">// assign a task</span></span><br><span class="line"> worker_id := args.WorkerId</span><br><span class="line"> reply.WorkerId = worker_id</span><br><span class="line"> reply.NReduce = c.NReduce</span><br><span class="line"> reply.XReduce = <span class="number">-1</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> !c.MapTaskFinished {</span><br><span class="line"> reply.IsMapTask = <span class="literal">true</span></span><br><span class="line"> <span class="keyword">for</span> filename, val := <span class="keyword">range</span> c.MapTask {</span><br><span class="line"> <span class="keyword">if</span> val == <span class="number">0</span> {</span><br><span class="line"> reply.Filename = filename</span><br><span class="line"> reply.MapTaskBaseFilename = c.MapTaskBaseFilename</span><br><span class="line"> c.MapTask[filename] = <span class="number">1</span></span><br><span class="line"> c.WorkerToMapTask[worker_id] =
filename</span><br><span class="line"> c.WS[worker_id] = Busy</span><br><span class="line"> c.Timer[worker_id] = <span class="number">0</span></span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> !c.ReduceTaskFinished {</span><br><span class="line"> reply.IsReduceTask = <span class="literal">true</span></span><br><span class="line"> <span class="keyword">for</span> xreduce, val := <span class="keyword">range</span> c.ReduceTask {</span><br><span class="line"> <span class="keyword">if</span> val == <span class="number">0</span> {</span><br><span class="line"> reply.XReduce = xreduce</span><br><span class="line"> reply.ReduceTaskBaseFilename = c.ReduceTaskBaseFilename</span><br><span class="line"> reply.AllFiles = c.IntermediateFiles</span><br><span class="line"> c.ReduceTask[xreduce] = <span class="number">1</span></span><br><span class="line"> c.WorkerToReduceTask[worker_id] = xreduce</span><br><span class="line"> c.WS[worker_id] = Busy</span><br><span class="line"> c.Timer[worker_id] = <span class="number">0</span></span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">is_timeout</span><span class="params">(worker_id <span class="keyword">int</span>)</span> <span class="title">bool</span></span> {</span><br><span class="line"> c.mu.Lock()</span><br><span class="line"> <span class="keyword">defer</span> c.mu.Unlock()</span><br><span class="line"> <span class="keyword">return</span> 
c.WS[worker_id] == Timeout</span><br><span class="line">}</span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">ReportTask</span><span class="params">(args *ReportTaskArgs, reply *ReportTaskReply)</span> <span class="title">error</span></span> {</span><br><span class="line"> c.mu.Lock()</span><br><span class="line"> <span class="keyword">defer</span> c.mu.Unlock()</span><br><span class="line"></span><br><span class="line"> worker_id := args.WorkerId</span><br><span class="line"> <span class="comment">// a worker that has timed out is ignored; the check is inlined because the lock is already held and WS must not be written without it</span></span><br><span class="line"> <span class="keyword">if</span> c.WS[worker_id] == Timeout {</span><br><span class="line"> reply.GoodJob = <span class="literal">false</span></span><br><span class="line"> c.WS[worker_id] = Free</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> !c.MapTaskFinished {</span><br><span class="line"> <span class="keyword">if</span> c.WorkerToMapTask[worker_id] == args.MapTaskFilename && c.WS[worker_id] == Busy {</span><br><span class="line"> reply.GoodJob = <span class="literal">true</span></span><br><span class="line"> c.WS[worker_id] = Free</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> _, intermediate_file := <span class="keyword">range</span> args.IntermediateFile {</span><br><span class="line"> <span class="comment">// if this intermediate file has not been seen before, append it to IntermediateFiles and record it for deduplication</span></span><br><span class="line"> _, ok := c.RecordFiles[intermediate_file]</span><br><span class="line"> <span class="keyword">if</span> !ok {</span><br><span class="line"> c.IntermediateFiles = <span class="built_in">append</span>(c.IntermediateFiles, intermediate_file)</span><br><span class="line"> c.RecordFiles[intermediate_file] = <span
class="literal">true</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> c.MapTask[args.MapTaskFilename] = <span class="number">2</span></span><br><span class="line"></span><br><span class="line"> c.MapTaskRemain--</span><br><span class="line"> <span class="keyword">if</span> c.MapTaskRemain == <span class="number">0</span> {</span><br><span class="line"> c.MapTaskFinished = <span class="literal">true</span></span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> !c.ReduceTaskFinished {</span><br><span class="line"> <span class="keyword">if</span> c.WorkerToReduceTask[worker_id] == args.XReduce && c.WS[worker_id] == Busy {</span><br><span class="line"> reply.GoodJob = <span class="literal">true</span></span><br><span class="line"> c.WS[worker_id] = Free</span><br><span class="line"></span><br><span class="line"> c.ReduceTask[args.XReduce] = <span class="number">2</span></span><br><span class="line"></span><br><span class="line"> c.ReduceTaskRemain--</span><br><span class="line"> <span class="keyword">if</span> c.ReduceTaskRemain == <span class="number">0</span> {</span><br><span class="line"> c.ReduceTaskFinished = <span class="literal">true</span></span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// all tasks have already been completed</span></span><br><span class="line"> reply.GoodJob = <span class="literal">false</span></span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// the worker reported, but the task it reported is not the one I assigned, or it was in the Free or Timeout state</span></span><br><span class="line"> <span
class="comment">// since it did report to me, it must be free now</span></span><br><span class="line"> c.WS[worker_id] = Free</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">}</span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// start a thread that listens for RPCs from worker.go</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">server</span><span class="params">()</span></span> {</span><br><span class="line"> rpc.Register(c)</span><br><span class="line"> rpc.HandleHTTP()</span><br><span class="line"> <span class="comment">//l, e := net.Listen("tcp", ":1234")</span></span><br><span class="line"> sockname := coordinatorSock()</span><br><span class="line"> os.Remove(sockname)</span><br><span class="line"> l, e := net.Listen(<span class="string">"unix"</span>, sockname)</span><br><span class="line"> <span class="keyword">if</span> e != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatal(<span class="string">"listen error:"</span>, e)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">go</span> http.Serve(l, <span class="literal">nil</span>)</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// main/mrcoordinator.go calls Done() periodically to find out</span></span><br><span class="line"><span class="comment">// if the entire job has finished.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(c *Coordinator)</span> <span class="title">Done</span><span
class="params">()</span> <span class="title">bool</span></span> {</span><br><span class="line"> ret := <span class="literal">false</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// Your code here.</span></span><br><span class="line"> c.mu.Lock()</span><br><span class="line"> <span class="keyword">defer</span> c.mu.Unlock()</span><br><span class="line"> <span class="keyword">if</span> c.MapTaskFinished && c.ReduceTaskFinished {</span><br><span class="line"> ret = <span class="literal">true</span></span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">for</span> worker_id, _ := <span class="keyword">range</span> c.Timer {</span><br><span class="line"> c.Timer[worker_id]++</span><br><span class="line"> <span class="keyword">if</span> c.Timer[worker_id] >= <span class="number">10</span> && c.WS[worker_id] == Busy {</span><br><span class="line"> c.WS[worker_id] = Timeout</span><br><span class="line"> <span class="keyword">if</span> !c.MapTaskFinished {</span><br><span class="line"> map_task := c.WorkerToMapTask[worker_id]</span><br><span class="line"> c.WorkerToMapTask[worker_id] = <span class="string">""</span></span><br><span class="line"> c.MapTask[map_task] = <span class="number">0</span></span><br><span class="line"> } <span class="keyword">else</span> <span class="keyword">if</span> !c.ReduceTaskFinished {</span><br><span class="line"> reduce_task := c.WorkerToReduceTask[worker_id]</span><br><span class="line"> c.WorkerToReduceTask[worker_id] = <span class="number">-1</span></span><br><span class="line"> c.ReduceTask[reduce_task] = <span class="number">0</span></span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> ret</span><br><span class="line">}</span><br><span class="line"><span class="comment">// create a Coordinator.</span></span><br><span 
class="line"><span class="comment">// main/mrcoordinator.go calls this function.</span></span><br><span class="line"><span class="comment">// nReduce is the number of reduce tasks to use.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">MakeCoordinator</span><span class="params">(files []<span class="keyword">string</span>, nReduce <span class="keyword">int</span>)</span> *<span class="title">Coordinator</span></span> {</span><br><span class="line"> c := Coordinator{}</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Your code here.</span></span><br><span class="line"></span><br><span class="line"> c.MapTaskRemain = <span class="built_in">len</span>(files)</span><br><span class="line"> c.ReduceTaskRemain = nReduce</span><br><span class="line"> c.NReduce = nReduce</span><br><span class="line"></span><br><span class="line"> c.MapTask = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">int</span>)</span><br><span class="line"> c.ReduceTask = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">int</span>)</span><br><span class="line"> c.WS = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">int</span>] WorkerStatus)</span><br><span class="line"> c.WorkerToMapTask = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">string</span>)</span><br><span class="line"> c.IntermediateFiles = []<span class="keyword">string</span>{}</span><br><span class="line"> c.RecordFiles = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>] <span class="keyword">bool</span>)</span><br><span class="line"> c.WorkerToReduceTask = 
<span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span>)</span><br><span class="line"> c.Timer = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">int</span>] <span class="keyword">int</span>)</span><br><span class="line"></span><br><span class="line"> c.MapTaskBaseFilename = <span class="string">"mr"</span></span><br><span class="line"> c.ReduceTaskBaseFilename = <span class="string">"mr-out"</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> _, file := <span class="keyword">range</span> files {</span><br><span class="line"> c.MapTask[file] = <span class="number">0</span></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> idx := <span class="number">0</span>; idx < nReduce; idx++ {</span><br><span class="line"> c.ReduceTask[idx] = <span class="number">0</span></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> c.server()</span><br><span class="line"> <span class="keyword">return</span> &c</span><br><span class="line">}</span><br></pre></td></tr></table></figure><p><code>mr/worker.go:</code></p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span 
class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span 
class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span class="line">202</span><br><span class="line">203</span><br><span 
class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br><span class="line">210</span><br><span class="line">211</span><br><span class="line">212</span><br><span class="line">213</span><br><span class="line">214</span><br><span class="line">215</span><br><span class="line">216</span><br><span class="line">217</span><br><span class="line">218</span><br><span class="line">219</span><br><span class="line">220</span><br><span class="line">221</span><br><span class="line">222</span><br><span class="line">223</span><br><span class="line">224</span><br><span class="line">225</span><br><span class="line">226</span><br><span class="line">227</span><br><span class="line">228</span><br><span class="line">229</span><br><span class="line">230</span><br><span class="line">231</span><br><span class="line">232</span><br><span class="line">233</span><br><span class="line">234</span><br><span class="line">235</span><br><span class="line">236</span><br><span class="line">237</span><br><span class="line">238</span><br><span class="line">239</span><br><span class="line">240</span><br><span class="line">241</span><br><span class="line">242</span><br><span class="line">243</span><br><span class="line">244</span><br><span class="line">245</span><br><span class="line">246</span><br><span class="line">247</span><br><span class="line">248</span><br><span class="line">249</span><br><span class="line">250</span><br><span class="line">251</span><br><span class="line">252</span><br><span class="line">253</span><br><span class="line">254</span><br><span class="line">255</span><br><span class="line">256</span><br><span class="line">257</span><br><span class="line">258</span><br><span class="line">259</span><br><span class="line">260</span><br><span class="line">261</span><br><span class="line">262</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="keyword">package</span> mr</span><br><span class="line"></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"fmt"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"log"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"net/rpc"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"hash/fnv"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"os"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"io/ioutil"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"strconv"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"strings"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"sort"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"encoding/json"</span></span><br><span class="line"> <span class="keyword">import</span> <span class="string">"time"</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="comment">// Map functions return a slice of KeyValue.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="keyword">type</span> KeyValue <span class="keyword">struct</span> {</span><br><span class="line"> Key <span class="keyword">string</span></span><br><span class="line"> Value <span class="keyword">string</span></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// for sorting by key.</span></span><br><span class="line"> <span class="keyword">type</span> ByKey []KeyValue</span><br><span 
class="line"></span><br><span class="line"> <span class="comment">// for sorting by key.</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="params">(a ByKey)</span> <span class="title">Len</span><span class="params">()</span> <span class="title">int</span></span> { <span class="keyword">return</span> <span class="built_in">len</span>(a) }</span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="params">(a ByKey)</span> <span class="title">Swap</span><span class="params">(i, j <span class="keyword">int</span>)</span></span> { a[i], a[j] = a[j], a[i] }</span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="params">(a ByKey)</span> <span class="title">Less</span><span class="params">(i, j <span class="keyword">int</span>)</span> <span class="title">bool</span></span> { <span class="keyword">return</span> a[i].Key < a[j].Key }</span><br><span class="line"></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="comment">// use ihash(key) % NReduce to choose the reduce</span></span><br><span class="line"> <span class="comment">// task number for each KeyValue emitted by Map.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">ihash</span><span class="params">(key <span class="keyword">string</span>)</span> <span class="title">int</span></span> {</span><br><span class="line"> h := fnv.New32a()</span><br><span class="line"> h.Write([]<span class="keyword">byte</span>(key))</span><br><span class="line"> <span class="keyword">return</span> <span class="keyword">int</span>(h.Sum32() & <span class="number">0x7fffffff</span>)</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="function"><span 
class="keyword">func</span> <span class="title">is_map_task</span><span class="params">(task AskTaskReply)</span> <span class="title">bool</span></span> {</span><br><span class="line"> <span class="keyword">return</span> task.IsMapTask</span><br><span class="line"> }</span><br><span class="line"><span class="comment">//</span></span><br><span class="line"> <span class="comment">// main/mrworker.go calls this function.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">Worker</span><span class="params">(mapf <span class="keyword">func</span>(<span class="keyword">string</span>, <span class="keyword">string</span>)</span> []<span class="title">KeyValue</span>,</span></span><br><span class="line"> reducef <span class="function"><span class="keyword">func</span><span class="params">(<span class="keyword">string</span>, []<span class="keyword">string</span>)</span> <span class="title">string</span>)</span> {</span><br><span class="line"></span><br><span class="line"> <span class="comment">// Your worker implementation here.</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// uncomment to send the Example RPC to the coordinator.</span></span><br><span class="line"> <span class="comment">// CallExample()</span></span><br><span class="line"> <span class="comment">// keep looping while the MapReduce job has not finished</span></span><br><span class="line"> <span class="keyword">var</span> nreduce <span class="keyword">int</span></span><br><span class="line"> worker_id := <span class="number">-1</span></span><br><span class="line"> total_map := <span class="number">0</span></span><br><span class="line"> total_reduce := <span class="number">0</span></span><br><span class="line"> <span class="keyword">for</span> !IsDone() {</span><br><span class="line"> <span class="comment">// ask the coordinator for a task</span></span><br><span class="line"> task := 
AskTask(worker_id)</span><br><span class="line"> worker_id = task.WorkerId</span><br><span class="line"> nreduce = task.NReduce</span><br><span class="line"> buckets := <span class="built_in">make</span>([][]KeyValue, nreduce) <span class="comment">// one KeyValue slice per reduce task (nreduce in total)</span></span><br><span class="line"> <span class="keyword">if</span> is_map_task(task) {</span><br><span class="line"> filename := task.Filename</span><br><span class="line"> <span class="keyword">if</span> filename != <span class="string">""</span> {</span><br><span class="line"> file, err := os.Open(filename)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatalf(<span class="string">"Can not open file %v"</span>, filename)</span><br><span class="line"> }</span><br><span class="line"> content, err := ioutil.ReadAll(file)</span><br><span class="line"> file.Close()</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatalf(<span class="string">"Can not read file %v"</span>, filename)</span><br><span class="line"> }</span><br><span class="line"> kva := mapf(filename, <span class="keyword">string</span>(content))</span><br><span class="line"> <span class="keyword">for</span> _, item := <span class="keyword">range</span> kva {</span><br><span class="line"> bucket_number := ihash(item.Key) % nreduce</span><br><span class="line"> buckets[bucket_number] = <span class="built_in">append</span>(buckets[bucket_number], item)</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// sort the items in each bucket by key</span></span><br><span class="line"> <span class="keyword">for</span> _, bucket := <span class="keyword">range</span> buckets {</span><br><span class="line"> sort.Sort(ByKey(bucket))</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> intermediate_files := []<span 
class="keyword">string</span>{}</span><br><span class="line"> basename := task.MapTaskBaseFilename + <span class="string">"-"</span> + strconv.Itoa(worker_id)</span><br><span class="line"> <span class="keyword">for</span> index, bucket := <span class="keyword">range</span> buckets {</span><br><span class="line"> <span class="comment">// TODO</span></span><br><span class="line"> <span class="comment">// create a temporary file, write the bucket's contents into it, and report the task to the coordinator via ReportTask once it is done;</span></span><br><span class="line"> <span class="comment">// only after the coordinator acknowledges it, promote the temporary file to its final name</span></span><br><span class="line"> oname := basename + <span class="string">"-"</span> + strconv.Itoa(total_map) + <span class="string">"-"</span> + strconv.Itoa(index)</span><br><span class="line"> intermediate_files = <span class="built_in">append</span>(intermediate_files, oname)</span><br><span class="line"> ofile, _ := os.OpenFile(oname, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, <span class="number">0666</span>)</span><br><span class="line"> enc := json.NewEncoder(ofile)</span><br><span class="line"> <span class="keyword">for</span> _, item := <span class="keyword">range</span> bucket {</span><br><span class="line"> err := enc.Encode(&item)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatalf(<span class="string">"json encoding error!\n"</span>)</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> ofile.Close()</span><br><span class="line"> }</span><br><span class="line"> reply := ReportTask(worker_id, task.Filename, intermediate_files, task.XReduce)</span><br><span class="line"> <span class="keyword">if</span> reply.GoodJob {</span><br><span class="line"> total_map++</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// no task available for now; other workers are still running map tasks</span></span><br><span 
class="line"> time.Sleep(time.Second)</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// reduce task</span></span><br><span class="line"> <span class="keyword">if</span> task.XReduce != <span class="number">-1</span> {</span><br><span class="line"> intermediate := []KeyValue{}</span><br><span class="line"> <span class="keyword">for</span> _, file := <span class="keyword">range</span> task.AllFiles {</span><br><span class="line"> ss := strings.Split(file, <span class="string">"-"</span>)</span><br><span class="line"> sxreduce := ss[<span class="built_in">len</span>(ss) - <span class="number">1</span>]</span><br><span class="line"> xreduce, _ := strconv.Atoi(sxreduce)</span><br><span class="line"> <span class="keyword">if</span> xreduce == task.XReduce {</span><br><span class="line"> f, _ := os.Open(file)</span><br><span class="line"> dec := json.NewDecoder(f)</span><br><span class="line"> <span class="keyword">for</span> {</span><br><span class="line"> <span class="keyword">var</span> kv KeyValue</span><br><span class="line"> <span class="keyword">if</span> err := dec.Decode(&kv); err != <span class="literal">nil</span> {</span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"> }</span><br><span class="line"> intermediate = <span class="built_in">append</span>(intermediate, kv)</span><br><span class="line"> }</span><br><span class="line"> f.Close()</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> sort.Sort(ByKey(intermediate))</span><br><span class="line"></span><br><span class="line"> <span class="comment">// write the output to ReduceTaskBaseFilename-X</span></span><br><span class="line"> oname := task.ReduceTaskBaseFilename + <span class="string">"-"</span> + strconv.Itoa(task.XReduce)</span><br><span class="line"> ofile, err := os.OpenFile(<span class="string">"tmp"</span> + oname + <span 
class="string">".tmp"</span>, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, <span class="number">0666</span>)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatalf(<span class="string">"Can not open %v\n"</span>, oname)</span><br><span class="line"> }</span><br><span class="line"> i := <span class="number">0</span></span><br><span class="line"> <span class="keyword">for</span> i < <span class="built_in">len</span>(intermediate) {</span><br><span class="line"> j := i + <span class="number">1</span></span><br><span class="line"> values := []<span class="keyword">string</span>{intermediate[i].Value}</span><br><span class="line"> <span class="keyword">for</span> j < <span class="built_in">len</span>(intermediate) && intermediate[j].Key == intermediate[i].Key {</span><br><span class="line"> values = <span class="built_in">append</span>(values, intermediate[j].Value)</span><br><span class="line"> j++</span><br><span class="line"> }</span><br><span class="line"> output := reducef(intermediate[i].Key, values)</span><br><span class="line"> fmt.Fprintf(ofile, <span class="string">"%v %v\n"</span>, intermediate[i].Key, output)</span><br><span class="line"> i = j</span><br><span class="line"> }</span><br><span class="line"> ofile.Close()</span><br><span class="line"> reply := ReportTask(worker_id, task.Filename, <span class="literal">nil</span>, task.XReduce)</span><br><span class="line"> <span class="keyword">if</span> reply.GoodJob {</span><br><span class="line"> total_reduce++</span><br><span class="line"> tmpfile, _ := os.OpenFile(<span class="string">"tmp"</span> + oname + <span class="string">".tmp"</span>, os.O_RDONLY|os.O_CREATE|os.O_APPEND, <span class="number">0666</span>)</span><br><span class="line"> realfile, _ := os.OpenFile(oname, os.O_WRONLY|os.O_CREATE|os.O_APPEND, <span class="number">0666</span>)</span><br><span class="line"> content, _ := ioutil.ReadAll(tmpfile)</span><br><span 
class="line"> realfile.Write(content)</span><br><span class="line"> realfile.Close()</span><br><span class="line"> tmpfile.Close()</span><br><span class="line"> }</span><br><span class="line"> } <span class="keyword">else</span> {</span><br><span class="line"> <span class="comment">// no reduce task available for now; other workers are still running them</span></span><br><span class="line"> time.Sleep(time.Second)</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="comment">// example function to show how to make an RPC call to the coordinator.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="comment">// the RPC argument and reply types are defined in rpc.go.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">CallExample</span><span class="params">()</span></span> {</span><br><span class="line"></span><br><span class="line"> <span class="comment">// declare an argument structure.</span></span><br><span class="line"> args := ExampleArgs{}</span><br><span class="line"></span><br><span class="line"> <span class="comment">// fill in the argument(s).</span></span><br><span class="line"> args.X = <span class="number">99</span></span><br><span class="line"></span><br><span class="line"> <span class="comment">// declare a reply structure.</span></span><br><span class="line"> reply := ExampleReply{}</span><br><span class="line"></span><br><span class="line"> <span class="comment">// send the RPC request, wait for the reply.</span></span><br><span class="line"> call(<span class="string">"Coordinator.Example"</span>, &args, &reply)</span><br><span class="line"></span><br><span class="line"> <span class="comment">// reply.Y should be 
100.</span></span><br><span class="line"> fmt.Printf(<span class="string">"reply.Y %v\n"</span>, reply.Y)</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// has the coordinator finished all tasks?</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">IsDone</span><span class="params">()</span> <span class="title">bool</span></span> {</span><br><span class="line"> args := AskStatusArgs{}</span><br><span class="line"> reply := AskStatusReply{}</span><br><span class="line"> connect := call(<span class="string">"Coordinator.IsDone"</span>, &args, &reply)</span><br><span class="line"> <span class="keyword">if</span> !connect {</span><br><span class="line"> <span class="comment">// the coordinator has exited because all tasks are finished</span></span><br><span class="line"> <span class="comment">// fmt.Printf("Coordinator is down!\n")</span></span><br><span class="line"> os.Exit(<span class="number">0</span>)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> reply.IsDone</span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> <span class="comment">// ask the coordinator to give me a task</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">AskTask</span><span class="params">(worker_id <span class="keyword">int</span>)</span> <span class="title">AskTaskReply</span></span> {</span><br><span class="line"> args := AskTaskArgs{}</span><br><span class="line"> args.WorkerId = worker_id</span><br><span class="line"> reply := AskTaskReply{}</span><br><span class="line"> <span class="comment">// why? 
Uncommenting the line below causes problems, which puzzled me (probably because encoding/gob does not transmit zero-valued fields, so a reply with XReduce == 0 leaves the preset -1 in place)</span></span><br><span class="line"> <span class="comment">// reply.XReduce = -1 // means there is no reduce task to do</span></span><br><span class="line"> connect := call(<span class="string">"Coordinator.AskTask"</span>, &args, &reply)</span><br><span class="line"> <span class="keyword">if</span> !connect {</span><br><span class="line"> <span class="comment">// the coordinator has exited because all tasks are finished</span></span><br><span class="line"> os.Exit(<span class="number">0</span>)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> reply</span><br><span class="line"> }</span><br><span class="line"> <span class="comment">// report the task I finished to the coordinator</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">ReportTask</span><span class="params">(worker_id <span class="keyword">int</span>, filename <span class="keyword">string</span>, intermediate_files []<span class="keyword">string</span>, xreduce <span class="keyword">int</span>)</span> <span class="title">ReportTaskReply</span></span> {</span><br><span class="line"> args := ReportTaskArgs{}</span><br><span class="line"> args.WorkerId = worker_id</span><br><span class="line"> args.MapTaskFilename = filename</span><br><span class="line"> args.IntermediateFile = intermediate_files</span><br><span class="line"> args.XReduce = xreduce</span><br><span class="line"></span><br><span class="line"> reply := ReportTaskReply{}</span><br><span class="line"></span><br><span class="line"> connect := call(<span class="string">"Coordinator.ReportTask"</span>, &args, &reply)</span><br><span class="line"> <span class="keyword">if</span> !connect {</span><br><span class="line"> <span class="comment">// the coordinator has exited because all tasks are finished</span></span><br><span class="line"> os.Exit(<span class="number">0</span>)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">return</span> reply</span><br><span class="line"> 
}</span><br><span class="line"></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="comment">// send an RPC request to the coordinator, wait for the response.</span></span><br><span class="line"> <span class="comment">// usually returns true.</span></span><br><span class="line"> <span class="comment">// returns false if something goes wrong.</span></span><br><span class="line"> <span class="comment">//</span></span><br><span class="line"> <span class="function"><span class="keyword">func</span> <span class="title">call</span><span class="params">(rpcname <span class="keyword">string</span>, args <span class="keyword">interface</span>{}, reply <span class="keyword">interface</span>{})</span> <span class="title">bool</span></span> {</span><br><span class="line"> <span class="comment">// c, err := rpc.DialHTTP("tcp", "127.0.0.1"+":1234")</span></span><br><span class="line"> sockname := coordinatorSock()</span><br><span class="line"> c, err := rpc.DialHTTP(<span class="string">"unix"</span>, sockname)</span><br><span class="line"> <span class="keyword">if</span> err != <span class="literal">nil</span> {</span><br><span class="line"> log.Fatal(<span class="string">"dialing:"</span>, err)</span><br><span class="line"> }</span><br><span class="line"> <span class="keyword">defer</span> c.Close()</span><br><span class="line"></span><br><span class="line"> err = c.Call(rpcname, args, reply)</span><br><span class="line"> <span class="keyword">if</span> err == <span class="literal">nil</span> {</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">true</span></span><br><span class="line"> }</span><br><span class="line"></span><br><span class="line"> fmt.Println(err)</span><br><span class="line"> <span class="keyword">return</span> <span class="literal">false</span></span><br><span class="line"> }</span><br></pre></td></tr></table></figure>]]></content>
<summary type="html"><h2 id="MIT-6-824-Lab1-MapReduce"><a href="#MIT-6-824-Lab1-MapReduce" class="headerlink" title="MIT 6.824 Lab1 MapReduce"></a>MIT 6.824 Lab1</summary>
</entry>
<entry>
<title>MapReduce</title>
<link href="https://codroc.github.io/2022/06/06/MapReduce/"/>
<id>https://codroc.github.io/2022/06/06/MapReduce/</id>
<published>2022-06-06T11:57:16.000Z</published>
<updated>2022-06-06T11:57:16.000Z</updated>
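<content type="html"><![CDATA[<p><strong>MapReduce: a translation of the paper</strong></p><h2 id="Abstract"><a href="#Abstract" class="headerlink" title="Abstract"></a>Abstract</h2><p>MapReduce is a programming model and an associated implementation for processing and generating very large data sets. Users specify a Map function that processes a key/value pair to produce a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks fit this model, as this paper shows in detail.</p><p>Programs written in this style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of partitioning the input data, scheduling the program across the machines of the cluster, handling machine failures, and managing the required communication between machines. This allows programmers with no experience in parallel or distributed systems to make good use of the resources of a large distributed system.</p><p>Our implementation of MapReduce runs on a large cluster of commodity machines and scales well: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and more than a thousand MapReduce jobs run on Google's clusters every day.</p><h2 id="1-Introduction"><a href="#1-Introduction" class="headerlink" title="1 Introduction"></a>1 Introduction</h2><p>Over the past five years, the authors and many other programmers at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web request logs, in order to compute various kinds of derived data: inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, and so on. Most of these computations are conceptually simple. The input, however, is usually large, so the computation has to be spread across hundreds or thousands of machines to finish in a reasonable amount of time. The questions of how to parallelize the computation, how to distribute the data, and how to handle failures then bury the originally simple computation under a large amount of complex code.</p><p>To deal with this complexity, we designed a new abstraction that lets us express the simple computation we want to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing inside a library. The abstraction is inspired by the <em>map</em> and <em>reduce</em> primitives found in Lisp and many other functional languages. We realized that most of our computations apply a map operation to each logical record of the input to compute a set of intermediate key/value pairs, and then apply a reduce operation to all values that share the same key in order to combine the derived data appropriately. With user-specified Map and Reduce functions, large computations become easy to parallelize, and re-execution can serve as the primary mechanism for fault tolerance.</p><p>The main contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, together with an implementation of that interface that achieves high performance on large clusters of commodity PCs.</p><p>Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 reports performance measurements of our implementation on a variety of tasks. Section 6 explores the use of MapReduce within Google, including our experience using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.</p><h2 id="2-Programming-Model"><a href="#2-Programming-Model" class="headerlink" title="2 Programming Model"></a>2 Programming 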
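Model</h2><p>The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: <strong>Map and Reduce</strong>.</p><p><em>Map</em>, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key <em>I</em> and passes them to the <em>Reduce</em> function.</p><p>The Reduce function, also written by the user, accepts an intermediate key <em>I</em> and a set of values for that key. It merges these values to form a possibly smaller set of values. Typically each Reduce invocation produces zero or one output value. The intermediate values are supplied to the Reduce function through an iterator, which makes it possible to handle lists of values that are too large to fit in memory (the iterator yields the values one at a time, so Reduce never needs the whole list in memory at once).</p><h2 id="2-1-Example"><a href="#2-1-Example" class="headerlink" title="2.1 Example"></a>2.1 Example</h2><p>Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">map(String key, String value):</span><br><span class="line">  // key: document name</span><br><span class="line">  // value: document contents</span><br><span class="line">  for each word w in value:</span><br><span class="line">    EmitIntermediate(w, "1");</span><br><span class="line">reduce(String key, Iterator values):</span><br><span class="line">  // key: a word</span><br><span class="line">  // values: a list of counts</span><br><span class="line">  int result = 0;</span><br><span class="line">  for each v in values:</span><br><span class="line">    result += ParseInt(v);</span><br><span class="line">  Emit(AsString(result));</span><br></pre></td></tr></table></figure><p>The map function emits each word together with a count of its occurrences (just "1" in this simple example). The reduce function sums all the counts emitted for a particular word.</p><p>In addition, the user writes code to fill in a MapReduce specification object with the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A of the paper contains the full program text for this example.</p><h2 id="2-2-Type"><a href="#2-2-Type" class="headerlink" title="2.2 Type"></a>2.2 Type</h2><p>Even though the pseudo-code above is written in terms of string inputs and outputs, conceptually the Map and Reduce functions supplied by the user have associated types:</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">map (k1,v1) -> list(k2,v2)</span><br><span class="line">reduce (k2,list(v2)) -> list(v2)</span><br></pre></td></tr></table></figure><p>That is, the input keys and values are drawn from a different domain than the output keys and values, while the intermediate keys and values are from the same domain as the output keys and values. (Translator's note: the meaning of "domain" in the original is not entirely clear to me. Since implementations such as Hadoop and KFS make map and reduce generic, I read it as the domain of the types.)</p><p>Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and the appropriate types.</p><h2 id="2-3-More-Examples"><a href="#2-3-More-Examples" class="headerlink" title="2.3 More Examples"></a>2.3 More Examples</h2><p>Here are a few simple, interesting programs that can easily be expressed as MapReduce computations:</p><p><strong>Distributed Grep</strong>: The Map function emits a line if it matches a given pattern. The Reduce function is an identity function that just copies the intermediate data to the output.</p><p><strong>Count of URL Access Frequency</strong>: The Map function processes logs of web page requests and outputs (URL, 1). The Reduce function adds together all values for the same URL and emits a (URL, total count) pair.</p><p><strong>Reverse Web-Link Graph</strong>: The Map function outputs a (target, source) pair for each link to a target URL found in a page named source. The Reduce function concatenates the list of all source URLs associated with a given target URL and emits (target, list(source)).</p><p><strong>Term-Vector per Host</strong>: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The Map function emits a (hostname, term vector) pair for each input document, where the hostname is extracted from the URL of the document. The Reduce function receives all per-document term vectors for a given host, adds them together, throws away infrequent terms, and emits a final (hostname, term vector) pair.</p><p><strong>Inverted Index</strong>: The Map function parses each document and emits a sequence of (word, document ID) pairs. The Reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index. This computation is easy to extend so that it also keeps track of word positions.</p><p><strong>Distributed Sort</strong>: The Map function extracts the key from each record and emits a (key, record) pair. The Reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.</p><h2 id="3-Implementation"><a href="#3-Implementation" class="headerlink" 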
title="3 Implementation"></a>3 Implementation</h2><p>MapReduce模型可以有多种不同的实现方式。如何正确选择取决于具体的环境。例如,一种实现方式适用于小型的共享内存方式的机器,另外一种实现方式则适用于大型NUMA架构的多处理器的主机,而有的实现方式更适合大型的网络连接集群。</p><p>本章节描述一个适用于Google内部广泛使用的运算环境的实现:用以太网交换机连接、由普通PC机组成的大型集群。在我们的环境里包括:</p><ol><li>x86架构、运行Linux操作系统、双处理器、2-4GB内存的机器。</li><li>普通的网络硬件设备,每个机器的带宽为百兆或者千兆,但是远小于网络的平均带宽的一半。</li><li>集群中包含成百上千的机器,因此,机器故障是常态。</li><li>存储为廉价的内置IDE硬盘。一个内部分布式文件系统用来管理存储在这些磁盘上的数据。文件系统通过数据复制来在不可靠的硬件上保证数据的可靠性和有效性。</li><li>用户提交工作(job)给调度系统。每个工作(job)都包含一系列的任务(task),调度系统将这些任务调度到集群中多台可用的机器上。</li></ol><h2 id="3-1-Execution-Overview"><a href="#3-1-Execution-Overview" class="headerlink" title="3.1 Execution Overview"></a>3.1 Execution Overview</h2><p>通过将Map调用的输入数据自动分割为M个数据片段的集合,Map调用被分布到多台机器上执行。输入的数据片段能够在不同的机器上并行处理。使用分区函数将Map调用产生的中间key值分成R个不同分区(例如,hash(key) mod R),Reduce调用也被分布到多台机器上执行。分区数量(R)和分区函数由用户来指定。</p><p><img src="https://s2.loli.net/2022/06/06/rLXqMlZ5TawdYKm.png" alt="MapReduce0.PNG"></p><p>图1展示了我们的MapReduce实现中操作的全部流程。当用户调用MapReduce函数时,将发生下面的一系列动作(下面的序号和图1中的序号一一对应):</p><p>1.用户程序首先调用的MapReduce库将输入文件分成M个数据片度,每个数据片段的大小一般从 16MB到64MB(可以通过可选的参数来控制每个数据片段的大小)。然后用户程序在机群中创建大量的程序副本。</p><p>2.这些程序副本中的有一个特殊的程序–master。副本中其它的程序都是worker程序,由master分配任务。有M个Map任务和R个Reduce任务将被分配,master将一个Map任务或Reduce任务分配给一个空闲的worker。</p><p>3.被分配了map任务的worker程序读取相关的输入数据片段,从输入的数据片段中解析出key/value pair,然后把key/value pair传递给用户自定义的Map函数,由Map函数生成并输出的中间key/value pair,并缓存在内存中。</p><p>4.缓存中的key/value pair通过分区函数分成R个区域,之后周期性的写入到本地磁盘上。缓存的key/value pair在本地磁盘上的存储位置将被回传给master,由master负责把这些存储位置再传送给Reduce worker。</p><p>5.当Reduce worker程序接收到master程序发来的数据存储位置信息后,使用RPC从Map worker所在主机的磁盘上读取这些缓存数据。当Reduce worker读取了所有的中间数据后,通过对key进行排序后使得具有相同key值的数据聚合在一起。由于许多不同的key值会映射到相同的Reduce任务上,因此必须进行排序。如果中间数据太大无法在内存中完成排序,那么就要在外部进行排序。</p><p>6.Reduce worker程序遍历排序后的中间数据,对于每一个唯一的中间key值,Reduce 
worker程序将这个key值和它相关的中间value值的集合传递给用户自定义的Reduce函数。Reduce函数的输出被追加到所属分区的输出文件。</p><p>7.当所有的Map和Reduce任务都完成之后,master唤醒用户程序。在这个时候,在用户程序里的对MapReduce调用才返回。</p><p>在成功完成任务之后,MapReduce的输出存放在R个输出文件中(对应每个Reduce任务产生一个输出文件,文件名由用户指定)。一般情况下,用户不需要将这R个输出文件合并成一个文件–他们经常把这些文件作为另外一个MapReduce的输入,或者在另外一个可以处理多个分割文件的分布式应用中使用。</p><h2 id="3-2-Master-Data-Structures"><a href="#3-2-Master-Data-Structures" class="headerlink" title="3.2 Master Data Structures"></a>3.2 Master Data Structures</h2><p>Master持有一些数据结构,它存储每一个Map和Reduce任务的状态(空闲、工作中或完成),以及Worker机器(非空闲任务的机器)的标识。</p><p>Master就像一个数据管道,中间文件存储区域的位置信息通过这个管道从Map传递到Reduce。因此,对于每个已经完成的Map任务,master存储了Map任务产生的R个中间文件存储区域的大小和位置。当Map任务完成时,Master接收到位置和大小的更新信息,这些信息被逐步递增的推送给那些正在工作的Reduce任务。</p><h2 id="3-3-Fault-Tolerance"><a href="#3-3-Fault-Tolerance" class="headerlink" title="3.3 Fault Tolerance"></a>3.3 Fault Tolerance</h2><p>因为MapReduce库的设计初衷是使用由成百上千的机器组成的集群来处理超大规模的数据,所以,这个库必须要能很好的处理机器故障。</p><h3 id="Worker-Failure"><a href="#Worker-Failure" class="headerlink" title="Worker Failure"></a>Worker Failure</h3><p>master周期性的ping每个worker。如果在一个约定的时间范围内没有收到worker返回的信息,master将把这个worker标记为失效。所有由这个失效的worker完成的Map任务被重设为初始的空闲状态,之后这些任务就可以被安排给其他的worker。同样的,worker失效时正在运行的Map或Reduce任务也将被重新置为空闲状态,等待重新调度。</p><p>当worker故障时,由于已经完成的Map任务的输出存储在这台机器上,Map任务的输出已不可访问了,因此必须重新执行。而已经完成的Reduce任务的输出存储在全局文件系统上,因此不需要再次执行。</p><p>当一个Map任务首先被worker A执行,之后由于worker A失效了又被调度到worker B执行,这个“重新执行”的动作会被通知给所有执行Reduce任务的worker。任何还没有从worker A读取数据的Reduce任务将从worker B读取数据。</p><p>MapReduce可以处理大规模worker失效的情况。比如,在一个MapReduce操作执行期间,在正在运行的集群上进行网络维护引起80台机器在几分钟内不可访问了,MapReduce master只需要简单的再次执行那些不可访问的worker完成的工作,之后继续执行未完成的任务,直到最终完成这个MapReduce操作。</p><h3 id="Master-Failure"><a href="#Master-Failure" class="headerlink" title="Master Failure"></a>Master Failure</h3><p>一个简单的解决办法是让master周期性的将上面描述的数据结构(指3.2节)写入磁盘,即检查点(checkpoint)。如果这个master任务失效了,可以从最后一个检查点(checkpoint)开始启动另一个master进程。然而,由于只有一个master进程,master失效后再恢复是比较麻烦的,因此我们现在的实现是如果master失效,就中止MapReduce运算。客户可以检查到这个状态,并且可以根据需要重新执行MapReduce操作。</p><h3 
id="Semantics-in-the-Presence-of-Failures"><a href="#Semantics-in-the-Presence-of-Failures" class="headerlink" title="Semantics in the Presence of Failures"></a>Semantics in the Presence of Failures</h3><p>出现故障时的语义?故障时处理的机制?</p><p>当用户提供的Map和Reduce操作是输入确定性函数(即相同的输入产生相同的输出)时,我们的分布式实现在任何情况下的输出都和所有程序没有出现任何错误、顺序的执行产生的输出是一样的。</p><p>我们依赖对Map和Reduce任务的输出是原子提交的来完成这个特性。每个工作中的任务把它的输出写到私有的临时文件中。每个Reduce任务生成一个这样的文件,而每个Map任务则生成R个这样的文件(一个Reduce任务对应一个文件)。当一个Map任务完成的时,worker发送一个包含R个临时文件名的完成消息给master。如果master从一个已经完成的Map任务再次接收到到一个完成消息,master将忽略这个消息;否则,master将这R个文件的名字记录在数据结构里。</p><p>当Reduce任务完成时,Reduce worker进程以原子的方式把临时文件重命名为最终的输出文件。如果同一个Reduce任务在多台机器上执行,针对同一个最终的输出文件将有多个重命名操作执行。我们依赖底层文件系统提供的重命名操作的原子性来保证最终的文件系统状态仅仅包含一个Reduce任务产生的数据。</p><p>使用MapReduce模型的程序员可以很容易的理解他们程序的行为,因为我们绝大多数的Map和Reduce操作是确定性的,而且存在这样的一个事实:我们的失效处理机制等价于一个顺序的执行的操作。当Map或/和Reduce操作是不确定性的时候,我们提供虽然较弱但是依然合理的处理机制。当使用非确定操作的时候,一个Reduce任务R1的输出等价于一个非确定性程序顺序执行产生时的输出。但是,另一个Reduce任务R2的输出也许符合一个不同的非确定顺序程序执行产生的R2的输出。</p><p>考虑Map任务M和Reduce任务R1、R2的情况。我们设定e(Ri)是Ri已经提交的执行过程(有且仅有一个这样的执行过程)。当e(R1)读取了由M一次执行产生的输出,而e(R2)读取了由M的另一次执行产生的输出,导致了较弱的失效处理。</p><h2 id="3-4-Locality-存储位置"><a href="#3-4-Locality-存储位置" class="headerlink" title="3.4 Locality(存储位置)"></a>3.4 Locality(存储位置)</h2><p>在我们的计算运行环境中,网络带宽是一个相当匮乏的资源。我们通过尽量把输入数据(由GFS管理)存储在集群中机器的本地磁盘上来节省网络带宽。GFS把每个文件按64MB一个Block分隔,每个Block保存在多台机器上,环境中就存放了多份拷贝(一般是3个拷贝)。MapReduce的master在调度Map任务时会考虑输入文件的位置信息,尽量将一个Map任务调度在包含相关输入数据拷贝的机器上执行;如果上述努力失败了,master将尝试在保存有输入数据拷贝的机器附近的机器上执行Map任务(例如,分配到一个和包含输入数据的机器在一个switch里的worker机器上执行)。当在一个足够大的cluster集群上运行大型MapReduce操作的时候,大部分的输入数据都能从本地机器读取,因此消耗非常少的网络带宽。</p><h2 id="3-5-Task-Granularity"><a href="#3-5-Task-Granularity" class="headerlink" title="3.5 Task Granularity"></a>3.5 Task 
Granularity</h2><p>任务粒度,如前所述,我们把Map拆分成了M个片段、把Reduce拆分成R个片段执行。理想情况下,M和R应当比集群中worker的机器数量要多得多。在每台worker机器都执行大量的不同任务能够提高集群的动态的负载均衡能力,并且能够加快故障恢复的速度:失效机器上执行的大量Map任务都可以分布到所有其他的worker机器上去执行。</p><p>但是实际上,在我们的具体实现中对M和R的取值都有一定的客观限制,因为master必须执行O(M+R)次调度,并且在内存中保存O(M<em>R)个状态(对影响内存使用的因素还是比较小的:O(M</em>R)块状态,大概每对Map任务/Reduce任务1个字节就可以了)。</p><p>更进一步,R值通常是由用户指定的,因为每个Reduce任务最终都会生成一个独立的输出文件。实际使用时我们也倾向于选择合适的M值,以使得每一个独立任务都是处理大约16M到64M的输入数据(这样,上面描写的输入数据本地存储优化策略才最有效),另外,我们把R值设置为我们想使用的worker机器数量的小的倍数。我们通常会用这样的比例来执行MapReduce:M=200000,R=5000,使用2000台worker机器。</p><h2 id="3-6-Backup-Tasks"><a href="#3-6-Backup-Tasks" class="headerlink" title="3.6 Backup Tasks"></a>3.6 Backup Tasks</h2><p>任务备份,影响一个MapReduce的总执行时间最通常的因素是“落伍者”:在运算过程中,如果有一台机器花了很长的时间才完成最后几个Map或Reduce任务,导致MapReduce操作总的执行时间超过预期。出现“落伍者”的原因非常多。比如:如果一个机器的硬盘出了问题,在读取的时候要经常的进行读取纠错操作,导致读取数据的速度从30M/s降低到1M/s。如果cluster的调度系统在这台机器上又调度了其他的任务,由于CPU、内存、本地硬盘和网络带宽等竞争因素的存在,导致执行MapReduce代码的执行效率更加缓慢。我们最近遇到的一个问题是由于机器的初始化代码有bug,导致关闭了的处理器的缓存:在这些机器上执行任务的性能和正常情况相差上百倍。</p><p>我们有一个通用的机制来减少“落伍者”出现的情况。当一个MapReduce操作接近完成的时候,master调度备用(backup)任务进程来执行剩下的、处于处理中状态(in-progress)的任务。无论是最初的执行进程、还是备用(backup)任务进程完成了任务,我们都把这个任务标记成为已经完成。我们调优了这个机制,通常只会占用比正常操作多几个百分点的计算资源。我们发现采用这样的机制对于减少超大MapReduce操作的总处理时间效果显著。例如,在5.3节描述的排序任务,在关闭掉备用任务的情况下要多花44%的时间完成排序任务。</p><h2 id="4-Refinements"><a href="#4-Refinements" class="headerlink" title="4 Refinements"></a>4 Refinements</h2><p>虽然简单的Map和Reduce函数提供的基本功能已经能够满足大部分的计算需要,我们还是发掘出了一些有价值的扩展功能。本节将描述这些扩展功能。</p><h2 id="4-1-Partitioning-Function"><a href="#4-1-Partitioning-Function" class="headerlink" title="4.1 Partitioning Function"></a>4.1 Partitioning Function</h2><p>MapReduce的使用者通常会指定Reduce任务和Reduce任务输出文件的数量(R)。我们在中间key上使用分区函数来对数据进行分区,之后再输入到后续任务执行进程。一个默认的分区函数是使用hash方法(比如,hash(key) mod R)进行分区。hash方法能产生非常平衡的分区。然而,有的时候,其它的一些分区函数对key值进行的分区将非常有用。比如,输出的key值是URLs,我们希望每个主机的所有条目保持在同一个输出文件中。为了支持类似的情况,MapReduce库的用户需要提供专门的分区函数。例如,使用“hash(Hostname(urlkey)) mod R”作为分区函数就可以把所有来自同一个主机的URLs保存在同一个输出文件中。</p><h2 id="4-2-Ordering-Guarantees"><a 
href="#4-2-Ordering-Guarantees" class="headerlink" title="4.2 Ordering Guarantees"></a>4.2 Ordering Guarantees</h2><p>我们确保在给定的分区中,中间key/value pair数据的处理顺序是按照key值增量顺序处理的。这样的顺序保证对每个分成生成一个有序的输出文件,这对于需要对输出文件按key值随机存取的应用非常有意义,对在排序输出的数据集也很有帮助。</p><h2 id="4-3-Combiner-Function"><a href="#4-3-Combiner-Function" class="headerlink" title="4.3 Combiner Function"></a>4.3 Combiner Function</h2><p>在某些情况下,Map函数产生的中间key值的重复数据会占很大的比重,并且,用户自定义的Reduce函数满足结合律和交换律。在2.1节的词数统计程序是个很好的例子。由于词频率倾向于一个zipf分布(齐夫分布),每个Map任务将产生成千上万个这样的记录<the,1>.所有的这些记录将通过网络被发送到一个单独的Reduce任务,然后由这个Reduce任务把所有这些记录累加起来产生一个数字。我们允许用户指定一个可选的combiner函数,combiner函数首先在本地将这些记录进行一次合并,然后将合并的结果再通过网络发送出去。</p><p>Combiner函数在每台执行Map任务的机器上都会被执行一次。一般情况下,Combiner和Reduce函数是一样的。Combiner函数和Reduce函数之间唯一的区别是MapReduce库怎样控制函数的输出。Reduce函数的输出被保存在最终的输出文件里,而Combiner函数的输出被写到中间文件里,然后被发送给Reduce任务。</p><p>部分的合并中间结果可以显著的提高一些MapReduce操作的速度。附录A包含一个使用combiner函数的例子。</p><h2 id="4-4-Input-and-Output-Types"><a href="#4-4-Input-and-Output-Types" class="headerlink" title="4.4 Input and Output Types"></a>4.4 Input and Output Types</h2><p>MapReduce库支持几种不同的格式的输入数据。比如,文本模式的输入数据的每一行被视为是一个key/value pair。key是文件的偏移量,value是那一行的内容。另外一种常见的格式是以key进行排序来存储的key/value pair的序列。每种输入类型的实现都必须能够把输入数据分割成数据片段,该数据片段能够由单独的Map任务来进行后续处理(例如,文本模式的范围分割必须确保仅仅在每行的边界进行范围分割)。虽然大多数MapReduce的使用者仅仅使用很少的预定义输入类型就满足要求了,但是使用者依然可以通过提供一个简单的<em>Reader</em>接口实现就能够支持一个新的输入类型。</p><p><em>Reader</em>并非一定要从文件中读取数据,比如,我们可以很容易的实现一个从数据库里读记录的Reader,或者从内存中的数据结构读取数据的Reader。</p><p>类似的,我们提供了一些预定义的输出数据的类型,通过这些预定义类型能够产生不同格式的数据。用户采用类似添加新的输入数据类型的方式增加新的输出类型。</p><h2 id="4-5-Side-effects"><a href="#4-5-Side-effects" class="headerlink" title="4.5 Side-effects"></a>4.5 Side-effects</h2><p>在某些情况下,MapReduce的使用者发现,如果在Map和/或Reduce操作过程中增加辅助的输出文件会比较省事。我们依靠程序writer把这种“副作用”变成原子的和幂等的(幂等的指一个总是产生相同结果的数学运算)。通常应用程序首先把输出结果写到一个临时文件中,在输出全部数据之后,在使用系统级的原子操作rename重新命名这个临时文件。</p><p>如果一个任务产生了多个输出文件,我们没有提供类似两阶段提交的原子操作支持这种情况。因此,对于会产生多个输出文件、并且对于跨文件有一致性要求的任务,都必须是确定性的任务。但是在实际应用过程中,这个限制还没有给我们带来过麻烦。</p><h2 id="4-6-Skipping-Bad-Records"><a 
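href="#4-6-Skipping-Bad-Records" class="headerlink" title="4.6 Skipping Bad Records"></a>4.6 Skipping Bad Records</h2><p>Sometimes bugs in user code cause the Map or Reduce function to crash on certain records, which prevents a MapReduce operation from completing. The usual course of action is to fix the bug and run the operation again, but that is not always feasible: the bug may be in a third-party library whose source code is unavailable. It is also sometimes acceptable to ignore a few records, for example when doing statistical analysis on a very large data set. We provide an optional mode of execution in which the MapReduce library detects which records cause deterministic crashes and skips them so that the computation can make forward progress.</p><p>Each worker process installs a signal handler that catches segmentation violations and bus errors. Before invoking a user Map or Reduce operation, the MapReduce library stores the sequence number of the record in a global variable. If the user code generates a signal, the signal handler sends a "last gasp" UDP packet containing the sequence number of the last record processed to the master. When the master has seen more than one failure on a particular record, it marks that record to be skipped, and the record is skipped the next time the corresponding Map or Reduce task is re-executed.</p><h2 id="4-7-Local-Execution"><a href="#4-7-Local-Execution" class="headerlink" title="4.7 Local Execution"></a>4.7 Local Execution</h2><p>Debugging problems in Map or Reduce functions can be tricky, because the actual computation happens in a distributed system, often on several thousand machines, with work assignments made dynamically by the master. To help with debugging, profiling, and small-scale testing, we developed an alternative implementation of the MapReduce library that executes all of the work for a MapReduce operation sequentially on the local machine. Controls are provided so that the computation can be limited to particular Map tasks. Users invoke their program with a special flag and can then easily use any local debugging or testing tools (e.g. gdb).</p><h2 id="4-8-Stauts-Information"><a href="#4-8-Stauts-Information" class="headerlink" title="4.8 Status Information"></a>4.8 Status Information</h2><p>The master runs an internal HTTP server and exports a set of status pages through which users can monitor execution. The status pages show the progress of the computation, such as how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, and so on. The pages also contain links to the stderr and stdout files generated by each task. Users can use this data to predict how long the computation will take and whether more resources should be added to it. The pages can also be used to figure out when the computation is running slower than expected.</p><p>In addition, the top-level status page shows which workers have failed, and which Map and Reduce tasks they were processing when they failed. This information is useful when trying to diagnose bugs in the user code.</p><h2 id="4-9-Counters"><a href="#4-9-Counters" class="headerlink" title="4.9 Counters"></a>4.9 Counters</h2><p>The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count the total number of words processed or the number of German documents indexed.</p><p>To use this facility, user code creates a named counter object and then increments the counter appropriately in the Map and/or Reduce function. For example:</p><figure class="highlight text"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 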
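class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">Counter* uppercase;</span><br><span class="line">uppercase = GetCounter("uppercase");</span><br><span class="line">map(String name, String contents):</span><br><span class="line">  for each word w in contents:</span><br><span class="line">    if (IsCapitalized(w)):</span><br><span class="line">      uppercase->Increment();</span><br><span class="line">    EmitIntermediate(w, "1");</span><br></pre></td></tr></table></figure><p>The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful Map and Reduce tasks and returns them to the user code when the MapReduce operation completes. The current counter values are also displayed on the master status page so that users can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same Map or Reduce task to avoid double counting (duplicate executions arise from backup tasks and from re-execution of tasks after failures).</p><p>Some counter values are maintained automatically by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.</p><p>Users have found the counter facility useful for sanity-checking the behavior of MapReduce operations. For example, in some MapReduce operations the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within a reasonable range of the total number of documents processed.</p><h2 id="5-Performance"><a href="#5-Performance" class="headerlink" title="5 Performance"></a>5 Performance</h2><p>In this section we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately 1 TB of data for a particular pattern. The other sorts approximately 1 TB of data.</p><p>These two programs are representative of a large subset of the real programs written by users of MapReduce: one class of programs converts data from one representation to another, and another class extracts a small amount of interesting data from a large data set.</p><h2 id="5-1-Cluster-Configuration"><a href="#5-1-Cluster-Configuration" class="headerlink" title="5.1 Cluster Configuration"></a>5.1 Cluster Configuration</h2><p>All of the programs were executed on a cluster of approximately 1800 machines. Each machine had two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility, so the round-trip time between any pair of machines was less than a millisecond.</p><p>Out of the 4 GB of memory, approximately 1-1.5 GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.</p><h2 id="5-2-Grep"><a href="#5-2-Grep" class="headerlink" title="5.2 Grep"></a>5.2 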
Grep</h2><p>这个分布式的grep程序需要扫描大概10的10次方个由100个字节组成的记录,查找出现概率较小的3个字符的模式(这个模式在92337个记录中出现)。输入数据被拆分成大约64M的Block(M=15000),整个输出数据存放在一个文件中(R=1)。</p><p>figure2显示了这个运算随时间的处理过程。其中Y轴表示输入数据的处理速度。处理速度随着参与MapReduce计算的机器数量的增加而增加,当1764台worker参与计算的时,处理速度达到了30GB/s。当Map任务结束的时候,即在计算开始后80秒,输入的处理速度降到0。整个计算过程从开始到结束一共花了大概150秒。这包括了大约一分钟的初始启动阶段。初始启动阶段消耗的时间包括了是把这个程序传送到各个worker机器上的时间、等待GFS文件系统打开1000个输入文件集合的时间、获取相关的文件本地位置优化信息的时间。</p><p><img src="https://s2.loli.net/2022/06/06/RSOAWZH8ULpYfiB.png" alt="MapReduce1.PNG"></p><h2 id="5-3-Sort"><a href="#5-3-Sort" class="headerlink" title="5.3 Sort"></a>5.3 Sort</h2><p>排序程序处理10的10次方个100个字节组成的记录(大概1TB的数据)。这个程序模仿TeraSort benchmark[10]。</p><p>排序程序由不到50行代码组成。只有三行的Map函数从文本行中解析出10个字节的key值作为排序的key,并且把这个key和原始文本行作为中间的key/value pair值输出。我们使用了一个内置的恒等函数作为Reduce操作函数。这个函数把中间的key/value pair值不作任何改变输出。最终排序结果输出到两路复制的GFS文件系统(也就是说,程序输出2TB的数据)。</p><p>如前所述,输入数据被分成64MB的Block(M=15000)。我们把排序后的输出结果分区后存储到4000个文件(R=4000)。分区函数使用key的原始字节来把数据分区到R个片段中。</p><p>在这个benchmark测试中,我们使用的分区函数知道key的分区情况。通常对于排序程序来说,我们会增加一个预处理的MapReduce操作用于采样key值的分布情况,通过采样的数据来计算对最终排序处理的分区点。</p><p><img src="https://s2.loli.net/2022/06/06/Q9kjHszPTMl5gYI.png" alt="MapReduce2.PNG"></p><p>图三(a)显示了这个排序程序的正常执行过程。左上的图显示了输入数据读取的速度。数据读取速度峰值会达到13GB/s,并且所有Map任务完成之后,即大约200秒之后迅速滑落到0。值得注意的是,排序程序输入数据读取速度小于分布式grep程序。这是因为排序程序的Map任务花了大约一半的处理时间和I/O带宽把中间输出结果写到本地硬盘。相应的分布式grep程序的中间结果输出几乎可以忽略不计。</p><p>左边中间的图显示了中间数据从Map任务发送到Reduce任务的网络速度。这个过程从第一个Map任务完成之后就开始缓慢启动了。图示的第一个高峰是启动了第一批大概1700个Reduce任务(整个MapReduce分布到大概1700台机器上,每台机器1次最多执行1个Reduce任务)。排序程序运行大约300秒后,第一批启动的Reduce任务有些完成了,我们开始执行剩下的Reduce任务。所有的处理在大约600秒后结束。</p><p>左下图表示Reduce任务把排序后的数据写到最终的输出文件的速度。在第一个排序阶段结束和数据开始写入磁盘之间有一个小的延时,这是因为worker机器正在忙于排序中间数据。磁盘写入速度在2-4GB/s持续一段时间。输出数据写入磁盘大约持续850秒。计入初始启动部分的时间,整个运算消耗了891秒。这个速度和TeraSort benchmark[18]的最高纪录1057秒相差不多。</p><p>还有一些值得注意的现象:输入数据的读取速度比排序速度和输出数据写入磁盘速度要高不少,这是因为我们的输入数据本地化优化策略起了作用 — 
绝大部分数据都是从本地硬盘读取的,从而节省了网络带宽。排序速度比输出数据写入到磁盘的速度快,这是因为输出数据写了两份(我们使用了2路的GFS文件系统,写入复制节点的原因是为了保证数据可靠性和可用性)。我们把输出数据写入到两个复制节点的原因是因为这是底层文件系统的保证数据可靠性和可用性的实现机制。如果底层文件系统使用类似容错编码[14](erasure coding)的方式而不是复制的方式保证数据的可靠性和可用性,那么在输出数据写入磁盘的时候,就可以降低网络带宽的使用。</p><h2 id="5-4-Effect-of-Backup-Task"><a href="#5-4-Effect-of-Backup-Task" class="headerlink" title="5.4 Effect of Backup Task"></a>5.4 Effect of Backup Task</h2><p>图三(b)显示了关闭了备用任务后排序程序执行情况。执行的过程和图3(a)很相似,除了输出数据写磁盘的动作在时间上拖了一个很长的尾巴,而且在这段时间里,几乎没有什么写入动作。在960秒后,只有5个Reduce任务没有完成。这些拖后腿的任务又执行了300秒才完成。整个计算消耗了1283秒,多了44%的执行时间。</p><h2 id="5-5-Machine-Failures"><a href="#5-5-Machine-Failures" class="headerlink" title="5.5 Machine Failures"></a>5.5 Machine Failures</h2><p>在图三(c)中演示的排序程序执行的过程中,我们在程序开始后几分钟有意的kill了1746个worker中的200个。集群底层的调度立刻在这些机器上重新开始新的worker处理进程(因为只是worker机器上的处理进程被kill了,机器本身还在工作)。</p><p>图三(c)显示出了一个“负”的输入数据读取速度,这是因为一些已经完成的Map任务丢失了(由于相应的执行Map任务的worker进程被kill了),需要重新执行这些任务。相关Map任务很快就被重新执行了。整个运算在933秒内完成,包括了初始启动时间(只比正常执行多消耗了5%的时间)。</p><h2 id="6-Experience"><a href="#6-Experience" class="headerlink" title="6 Experience"></a>6 Experience</h2><p>我们在2003年1月完成了第一个版本的MapReduce库,在2003年8月的版本有了显著的增强,这包括了输入数据本地优化、worker机器之间的动态负载均衡等等。从那以后,我们惊喜的发现,MapReduce库能广泛应用于我们日常工作中遇到的各类问题。它现在在Google内部各个领域得到广泛应用,包括:</p><p>1.大规模机器学习问题</p><p>2.Google News和Froogle产品的集群问题</p><p>3.从公众查询产品(比如Google的Zeitgeist)的报告中抽取数据。</p><p>4.从大量的新应用和新产品的网页中提取有用信息(比如,从大量的位置搜索网页中抽取地理位置信息)。</p><p>5.大规模的图形计算。</p><p><img src="https://s2.loli.net/2022/06/06/PYsAhDtLWGZaMkw.png" alt="MapReduce3.PNG"></p><p>图四显示了在我们的源代码管理系统中,随着时间推移,独立的MapReduce程序数量的显著增加。从2003年早些时候的0个增长到2004年9月份的差不多900个不同的程序。MapReduce的成功取决于采用MapReduce库能够在不到半个小时时间内写出一个简单的程序,这个简单的程序能够在上千台机器的组成的集群上做大规模并发处理,这极大的加快了开发和原形设计的周期。另外,采用MapReduce库,可以让完全没有分布式和/或并行系统开发经验的程序员很容易的利用大量的资源,开发出分布式和/或并行处理的应用。</p><p>在每个任务结束的时候,MapReduce库统计计算资源的使用状况。在表1,我们列出了2004年8月份MapReduce运行的任务所占用的相关资源。</p><h2 id="6-1-large-scaling-indexing"><a href="#6-1-large-scaling-indexing" class="headerlink" title="6.1 large-scaling indexing"></a>6.1 large-scaling 
indexing</h2><p>到目前为止,MapReduce最成功的应用就是重写了Google网络搜索服务所使用到的index系统。索引系统的输入数据是网络爬虫抓取回来的海量的文档,这些文档数据都保存在GFS文件系统里。这些文档原始内容(raw contents,我认为就是网页中的剔除html标记后的内容、pdf和word等有格式文档中提取的文本内容等)的大小超过了20TB。索引程序是通过一系列的MapReduce操作(大约5到10次)来建立索引。使用MapReduce(替换上一个特别设计的、分布式处理的索引程序)带来这些好处:</p><p>1.实现索引部分的代码简单、小巧、容易理解,因为对于容错、分布式以及并行计算的处理都是MapReduce库提供的。比如,使用MapReduce库,计算的代码行数从原来的3800行C++代码减少到大概700行代码。</p><p>2.MapReduce库的性能已经足够好了,因此我们可以把在概念上不相关的计算步骤分开处理,而不是混在一起以期减少数据传递的额外消耗。概念上不相关的计算步骤的隔离也使得我们可以很容易改变索引处理方式。比如,对之前的索引系统的一个小更改可能要耗费好几个月的时间,但是在使用MapReduce的新系统上,这样的更改只需要花几天时间就可以了。</p><p>3.索引系统的操作管理更容易了。因为由机器失效、机器处理速度缓慢、以及网络的瞬间阻塞等引起的绝大部分问题都已经由MapReduce库解决了,不再需要操作人员的介入了。另外,我们可以通过在索引系统集群中增加机器的简单方法提高整体处理性能。</p><h2 id="7-Relate-work"><a href="#7-Relate-work" class="headerlink" title="7 Relate work"></a>7 Relate work</h2><p>很多系统都提供了严格的编程模式,并且通过对编程的严格限制来实现并行计算。例如,一个结合函数可以通过把N个元素的数组的前缀在N个处理器上使用并行前缀算法,在log N的时间内计算完[6,9,13](???)。MapReduce可以看作是我们结合在真实环境下处理海量数据的经验,对这些经典模型进行简化和萃取的成果。更加值得骄傲的是,我们还实现了基于上千台处理器的集群的容错处理。相比而言,大部分并发处理系统都只在小规模的集群上实现,并且把容错处理交给了程序员。</p><p>Bulk Synchronous Programming[17]和一些MPI原语[11]提供了更高级别的并行处理抽象,可以更容易写出并行处理的程序。MapReduce和这些系统的关键不同之处在于,MapReduce利用限制性编程模式实现了用户程序的自动并发处理,并且提供了透明的容错处理。</p><p>我们数据本地优化策略的灵感来源于active disks[12,15]等技术,在active disks中,计算任务是尽量推送到数据存储的节点处理,这样就减少了网络和IO子系统的吞吐量。我们在挂载几个硬盘的普通机器上执行我们的运算,而不是在磁盘处理器上执行我们的工作,但是达到的目的一样的。</p><p>我们的备用任务机制和Charlotte System[3]提出的eager调度机制比较类似。Eager调度机制的一个缺点是如果一个任务反复失效,那么整个计算就不能完成。我们通过忽略引起故障的记录的方式在某种程度上解决了这个问题。</p><p>MapReduce的实现依赖于一个内部的集群管理系统,这个集群管理系统负责在一个超大的、共享机器的集群上分布和运行用户任务。虽然这个不是本论文的重点,但是有必要提一下,这个集群管理系统在理念上和其它系统,如Condor[16]是一样。</p><p>MapReduce库的排序机制和NOW-Sort[1]的操作上很类似。读取输入源的机器(map workers)把待排序的数据进行分区后,发送到R个Reduce worker中的一个进行处理。每个Reduce 
worker在本地对数据进行排序(尽可能在内存中排序)。当然,NOW-Sort没有给用户自定义的Map和Reduce函数的机会,因此不具备MapReduce库广泛的实用性。</p><p>River[2]提供了一个编程模型:处理进程通过分布式队列传送数据的方式进行互相通讯。和MapReduce类似,River系统尝试在不对等的硬件环境下,或者在系统颠簸的情况下也能提供近似平均的性能。River是通过精心调度硬盘和网络的通讯来平衡任务的完成时间。MapReduce库采用了其它的方法。通过对编程模型进行限制,MapReduce框架把问题分解成为大量的“小”任务。这些任务在可用的worker集群上动态的调度,这样快速的worker就可以执行更多的任务。通过对编程模型进行限制,我们可用在工作接近完成的时候调度备用任务,缩短在硬件配置不均衡的情况下缩小整个操作完成的时间(比如有的机器性能差、或者机器被某些操作阻塞了)。</p><p>BAD-FS[5]采用了和MapReduce完全不同的编程模式,它是面向广域网的。不过,这两个系统有两个基础功能很类似。(1)两个系统采用重新执行的方式来防止由于失效导致的数据丢失。(2)两个都使用数据本地化调度策略,减少网络通讯的数据量。</p><p>TACC[7]是一个用于简化构造高可用性网络服务的系统。和MapReduce一样,它也依靠重新执行机制来实现的容错处理。</p><h2 id="8-Conclusions"><a href="#8-Conclusions" class="headerlink" title="8 Conclusions"></a>8 Conclusions</h2><p>MapReduce编程模型在Google内部成功应用于多个领域。我们把这种成功归结为几个方面:首先,由于MapReduce封装了并行处理、容错处理、数据本地化优化、负载均衡等等技术难点的细节,这使得MapReduce库易于使用。即便对于完全没有并行或者分布式系统开发经验的程序员而言;其次,大量不同类型的问题都可以通过MapReduce简单的解决。比如,MapReduce用于生成Google的网络搜索服务所需要的数据、用来排序、用来数据挖掘、用于机器学习,以及很多其它的系统;第三,我们实现了一个在数千台计算机组成的大型集群上灵活部署运行的MapReduce。这个实现使得有效利用这些丰富的计算资源变得非常简单,因此也适合用来解决Google遇到的其他很多需要大量计算的问题。</p><p>我们也从MapReduce开发过程中学到了不少东西。首先,约束编程模式使得并行和分布式计算非常容易,也易于构造容错的计算环境;其次,网络带宽是稀有资源。大量的系统优化是针对减少网络传输量为目的的:本地优化策略使大量的数据从本地磁盘读取,中间文件写入本地磁盘、并且只写一份中间文件也节约了网络带宽;第三,备份服务器执行相同的任务可以减少性能缓慢的机器带来的负面影响(硬件配置的不平衡),同时解决了由于机器失效导致的数据丢失问题。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><p>[1] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau,David E. Culler, Joseph M. Hellerstein, and David A. Patterson.High-performance sorting on networks of workstations.In Proceedings of the 1997 ACM SIGMOD InternationalConference on Management of Data, Tucson,Arizona, May 1997.<br>[2] Remzi H. Arpaci-Dusseau, Eric Anderson, NoahTreuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River:Making the fast case common. 
In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS ’99), pages 10-22, Atlanta, Georgia, May 1999.<br>[3] Arash Baratloo, Mehmet Karaul, Zvi Kedem, and Peter Wyckoff. Charlotte: Metacomputing on the web. In Proceedings of the 9th International Conference on Parallel and Distributed Computing Systems, 1996.<br>[4] Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google cluster architecture. IEEE Micro, 23(2):22-28, April 2003.<br>[5] John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation NSDI, March 2004.<br>[6] Guy E. Blelloch. Scans as primitive parallel operations. IEEE Transactions on Computers, C-38(11), November 1989.<br>[7] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 78-91, Saint-Malo, France, 1997.<br>[8] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29-43, Lake George, New York, 2003.<br>[9] S. Gorlatch. Systematic efficient parallelization of scan and other list homomorphisms. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par’96. Parallel Processing, Lecture Notes in Computer Science 1124, pages 401-408. Springer-Verlag, 1996.<br>[10] Jim Gray. Sort benchmark home page. <a href="http://research.microsoft.com/barc/SortBenchmark/">http://research.microsoft.com/barc/SortBenchmark/</a>.<br>[11] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.<br>[12] L. Huston, R. Sukthankar, R. Wickremesinghe, M. 
Satyanarayanan, G. R. Ganger, E. Riedel, and A. Ailamaki. Diamond: A storage architecture for early discard in interactive search. In Proceedings of the 2004 USENIX File and Storage Technologies FAST Conference, April 2004.<br>[13] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831-838, 1980.<br>[14] Michael O. Rabin. Efficient dispersal of information for security, load balancing and fault tolerance. Journal of the ACM, 36(2):335-348, 1989.<br>[15] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. IEEE Computer, pages 68-74, June 2001.<br>[16] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 2004.<br>[17] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1997.<br>[18] Jim Wyllie. Spsort: How to sort a terabyte quickly. <a href="http://alme1.almaden.ibm.com/cs/spsort.pdf">http://alme1.almaden.ibm.com/cs/spsort.pdf</a>.</p>]]></content>
<summary type="html"><p><strong>MapReduce 中文翻译</strong></p>
<h2 id="Abstract"><a href="#Abstract" class="headerlink" title="Abstract"></a>Abstract</h2><p>MapRedu</summary>
</entry>
<entry>
<title>GFS 中文翻译</title>
<link href="https://codroc.github.io/2022/06/06/GFS/"/>
<id>https://codroc.github.io/2022/06/06/GFS/</id>
<published>2022-06-06T11:57:16.000Z</published>
<updated>2022-06-06T11:57:16.000Z</updated>
<content type="html"><![CDATA[<p><strong>GFS 中文翻译</strong></p><h2 id="ABSTRACT"><a href="#ABSTRACT" class="headerlink" title="ABSTRACT"></a>ABSTRACT</h2><p>我们已经设计和实现了Google File System,一个适用于大规模分布式数据处理相关应用的,可扩展的分布式文件系统。它运行在廉价且普通的硬件设备上,并提供了容错的设计,并且为大量的客户端提供极高的聚合处理性能。尽管我们的设计目标和上一个版本的分布式文件系统有很多相同的地方,我们的设计是依据我们应用的工作量以及技术环境来设计的,包括现在和预期的,都有一部分和早先的文件系统的约定有所不同。这就要求我们重新审视传统的设计选择,以及探索一些在根本上不同的设计要点。这个文件系统成功的满足了我们的存储需求。这个文件系统作为那些需要大数据集服务的数据生成处理的基础存储平台而广泛部署在谷歌内部。最大的集群通过上千个计算机的数千个硬盘,提供了数百TB的存储,并且这些数据被数百个客户端并行同时操作。在这个论文里,我们展示了用于支持分布式应用的扩展文件系统接口设计,讨论了许多我们设计的方面,并且列出了我们的micro-benchmarks以及真实应用性能指标。</p><h2 id="1-INTRODUCTION"><a href="#1-INTRODUCTION" class="headerlink" title="1.INTRODUCTION"></a>1.INTRODUCTION</h2><p>为了满足google快速增长的数据处理需求,我们设计实现了google文件系统(GFS)。GFS和上一个分布式文件系统有着很多相同的设计目标,比如性能,扩展性,可靠性,以及可用性。然而,它的设计是由我们的具体应用的负载类型以及当前甚至未来技术环境的观察驱动的,所以与早期文件系统的设计假设具有明显的区别。这就要求我们重新审视传统的设计选择,探索出一些在根本上不同的设计观点。 首先,组件失败成为一种常态而不是一种错误(或者说异常)。整个文件系统是由成百上千台廉价的普通机器组成的存储机器,可以被大量的客户端访问。组件的数量和质量在本质上决定了在某一时间有一些是不可用的,并且某些机器无法从当前失败的状态中恢复。我们观察到,应用程序的bug,操作系统bug,人为的错误,硬盘的失败,内存,连接器,网络,电力供应都可以引起这样的问题。因此经常性的监控,错误检测,容错和自动恢复必须集成到系统中。 第二,与传统的标准相比,文件是巨大的。在这里,好几个G的文件是很普通的。每个文件通常包含很多的应用程序处理的对象比如网页文档。当我们日常处理的快速增长的数据集合总是达到好几个TB的大小(包含数十亿的数据),即使文件系统能够支持,我们也不希望去管理数十亿个KB级别的文件。这样设计中的一些假设和参数,比如IO操作和块大小就必须重新定义。 第三,大部分文件都是只会在文件尾新增加数据,而少见修改已有数据的。对一个文件的随机写操作在实际上几乎是不存在的。一旦写完,文件就是只读的,并且一般都是顺序读取的。大量的数据都具有这样的特点。有些数据可能组成很大的数据仓库,并且数据分析程序从头扫描到尾。有些可能是运行应用而不断的产生的数据流。有些是归档的数据。有些是一个机器为另一个机器产生的中间结果,另一个机器及时或者随后处理这些中间结果。假设在大文件上数据访问具有这样的模式,那么当缓存数据在客户端失效后,append操作就成为性能优化和原子性的关键。 第四,应用程序和文件系统api的协同设计,增加了整个系统的灵活性。比如我们通过放松了GFS的一致性模型大大简化了文件系统,同时也没有给应用程序带来繁重的负担。我们也提供了一个原子性的append操作,这样多个客户端就可以对同一个文件并行的进行append操作而不需要彼此间进行额外的同步操作。这些都会在后面进行详细的讨论。</p><h2 id="2-DESIGN-OVERVIEW"><a href="#2-DESIGN-OVERVIEW" class="headerlink" title="2.DESIGN OVERVIEW"></a>2.DESIGN OVERVIEW</h2><h3 id="2-1-Assumptions"><a href="#2-1-Assumptions" class="headerlink" title="2.1 Assumptions"></a>2.1 
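Assumptions</h3><p>In designing a file system for our needs, we were guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out the assumptions in more detail.</p><ul><li>The system is built from many inexpensive commodity machines that often fail. It must constantly monitor itself and, while running, detect, tolerate, and recover promptly from component failures.</li><li>The system stores a modest number of very large files. We expect a few million files, each typically 100 MB or larger. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.</li><li>The workloads consist primarily of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads so that they advance steadily through the file rather than going back and forth.</li><li>The workloads also contain many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but need not be efficient.</li><li>The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running on different machines, will append concurrently to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may read through it at the same time.</li><li>High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read or write.</li></ul><h3 id="2-2-Interface"><a href="#2-2-Interface" class="headerlink" title="2.2 Interface"></a>2.2 Interface</h3><p>GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations: create, delete, open, close, read, and write.</p><p>In addition, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows many clients to append data to the same file concurrently while guaranteeing the atomicity of each client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can append to simultaneously without additional locking. We have found files of this kind to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.</p><h3 id="2-3-Architecture"><a href="#2-3-Architecture" class="headerlink" title="2.3 Architecture"></a>2.3 Architecture</h3><p><strong>A GFS cluster consists of a single master and multiple chunkservers</strong> and is accessed by many clients (Figure 1). Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable. Files are divided into fixed-size chunks. <strong>Each chunk is identified by an immutable, globally unique 64-bit chunk handle assigned by the master at the time of chunk creation</strong>. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default we store three replicas, though users can designate different replication levels for different regions of the file namespace. 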
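The master maintains all file system metadata (information that describes the data, such as where it is stored and its history). This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. The master also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state. GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer. Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files, so Linux's buffer cache already keeps frequently accessed data in memory.</p><p><img src="https://s2.loli.net/2022/06/06/KpelqfMPQ2RoAIH.png" alt="GFS0.PNG"></p><h3 id="2-4-Single-Master"><a href="#2-4-Single-Master" class="headerlink" title="2.4 Single Master"></a>2.4 Single Master</h3><p>Having <strong>a single master vastly simplifies our design</strong> and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read or write file data through the master. Instead, a client asks the master which chunkservers it should contact. <strong>It caches this information for a limited time</strong> and interacts with the chunkservers directly for subsequent operations.</p><p>Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request, and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.</p><h3 id="2-5-Chunk-Size"><a href="#2-5-Chunk-Size" class="headerlink" title="2.5 Chunk Size"></a>2.5 Chunk Size</h3><p>Chunk size is one of the key design parameters. <strong>We have chosen 64 MB</strong>, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size. A large chunk 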
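size offers several important advantages. <strong>First</strong>, it reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write huge files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. <strong>Second</strong>, since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver. <strong>Third</strong>, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages discussed in Section 2.6.1.</p><p>On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients access the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.</p><p>However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The two or three chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor (that is, keeping more copies) and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations (much like a peer-to-peer download).</p><h3 id="2-6-Metadata"><a href="#2-6-Metadata" class="headerlink" title="2.6 Metadata"></a>2.6 Metadata</h3><p>The master stores three major types of metadata: (1) the file and chunk namespaces, (2) the mapping from files to chunks, and (3) the locations of each chunk's replicas. <strong>All metadata is kept in the master's memory.</strong> The first two types (namespaces and the file-to-chunk mapping) are also kept persistent: mutations are logged to an operation log that is stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply and reliably, without risking inconsistencies if the master crashes. <strong>The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.</strong></p><h3 id="2-6-1-In-Memory-Data-Structures"><a href="#2-6-1-In-Memory-Data-Structures" class="headerlink" title="2.6.1 In-Memory Data Structures"></a>2.6.1 In-Memory Data Structures</h3><p>Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to scan its entire state periodically in the background. This periodic scanning is used to implement chunk garbage collection, re-replication when chunkservers fail, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 discuss these activities further.</p><p>One potential concern with this memory-only approach is that the number of chunks, and hence the capacity of the whole system, is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because file names are stored compactly using prefix compression.</p><p>If it becomes necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.</p><h3 id="2-6-2-Chunk-Locations"><a 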
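href="#2-6-2-Chunk-Locations" class="headerlink" title="2.6.2 Chunk Locations"></a>2.6.2 Chunk Locations</h3><p>The master does not keep a persistent record of which chunkservers hold a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up to date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.</p><p>We initially tried to keep chunk location information persistently at the master, but we found that it was much simpler to request the data from chunkservers at startup and periodically thereafter. This eliminates the problem of keeping the master and chunkservers in sync: <strong>if the master had to store chunk information, it would have to be kept in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on.</strong> In a cluster with hundreds of servers, such events happen all too often.</p><p>Another important reason for this design is that <strong>a chunkserver has the final word over which chunks it does or does not have on its own disks.</strong> There is no point in trying to maintain a consistent view of this information on the master, because errors on a chunkserver may cause chunks to vanish spontaneously (for example, a disk may go bad) or an operator may rename a chunkserver, at which point the master's record would be useless.</p><h3 id="2-6-3-Operation-Log"><a href="#2-6-3-Operation-Log" class="headerlink" title="2.6.3 Operation Log"></a>2.6.3 Operation Log</h3><p>The operation log contains a historical record of critical metadata changes. <strong>It is central to GFS</strong>. Not only is it the only persistent record of metadata, it also serves as a logical timeline that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.</p><p>Since the operation log is critical, we must store it reliably and not make changes visible to clients until the metadata changes are made persistent (which is what guarantees atomicity). Otherwise, we could effectively lose the whole file system or recent client operations even if the chunkservers themselves survive. We therefore replicate the log on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.</p><p>The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size, so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after it. The checkpoint is in a compact B-tree-like form that can be mapped directly into memory without extra parsing. This further speeds up recovery and improves availability.</p><p>Because building a checkpoint can take a while, the master's internal state is structured so that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations made before the switch. It can be created in a few minutes for a cluster with a few million files. When completed, it is written to disk both locally and remotely.</p><p>Recovery needs only the latest complete checkpoint and the subsequent log files. Older checkpoints and log files can be deleted freely, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.</p><h3 id="2-7-Consistency-Model"><a href="#2-7-Consistency-Model" class="headerlink" title="2.7 Consistency Model"></a>2.7 Consistency 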
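Model</h3><p><strong>GFS has a relaxed consistency model</strong> that supports our highly distributed applications well while remaining relatively simple and efficient to implement. We now discuss the guarantees GFS makes and what they mean to applications. We also highlight how GFS maintains these guarantees, but leave the details to other parts of the paper.</p><h3 id="2-7-1-Guarantees-by-GFS"><a href="#2-7-1-Guarantees-by-GFS" class="headerlink" title="2.7.1 Guarantees by GFS"></a>2.7.1 Guarantees by GFS</h3><p>File namespace mutations (for example, file creation) are atomic. They are handled exclusively by the master. Namespace locking guarantees atomicity and correctness (Section 4.1); <strong>the master's operation log defines a global total order of these operations</strong> (Section 2.6.3). The state of a file region (a small range within a file) after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result.</p><p><img src="https://s2.loli.net/2022/06/06/2oqGktIrcKexzMV.png" alt="GFS1.PNG"></p><p>A file region is consistent if all clients always see the same data, regardless of which replica they read from. <strong>A region is defined after a data mutation if it is consistent and clients see what the mutation wrote in its entirety.</strong> When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written (one cannot tell which changes happened: if mutations target the same data, some are overwritten by later ones and the earlier writes are no longer visible; likewise, a write that crosses a chunk boundary is split into two operations, so part of it may be overwritten by another client while the rest survives, as described at the end of Section 3.1). Typically the region consists of mingled fragments from multiple mutations. <strong>A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.</strong> We describe below how our applications can distinguish defined regions from undefined ones. Applications do not need to distinguish further between different kinds of undefined regions.</p><p>Data mutations may be <em>writes</em> or <em>record appends</em>. A write causes data to be written at an application-specified file offset. A record append causes data to be appended atomically at least once, even in the presence of concurrent mutations, but at an offset of GFS's choosing. (In contrast, a regular append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. These occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.</p><p>After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it missed mutations while its chunkserver was down. Stale replicas are never involved in a mutation and are never given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.</p><p>Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and by the next open of the file, which purges all chunk information for that file from the cache. Moreover, as most of our operations are appends, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it immediately gets the current chunk locations.</p><p>Long after a successful mutation, component failures can still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between the master and all chunkservers, and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, and GFS 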
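typically reacts within minutes. Even in this case, the chunk becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.</p><h3 id="2-7-2-Implications-for-Application"><a href="#2-7-2-Implications-for-Application" class="headerlink" title="2.7.2 Implications for Application"></a>2.7.2 Implications for Application</h3><p>GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.</p><p>Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. <strong>It atomically renames the file to a permanent name after writing all the data</strong>, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing lets writers restart incrementally and keeps readers from processing file data that has been written successfully but is still incomplete from the application's perspective.</p><p>In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append's append-at-least-once semantics preserves each writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information such as checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (for example, if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name the corresponding application entities such as web documents. These functions for record I/O (except duplicate removal) are in library code shared by our applications and are also applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.</p><h2 id="3-System-interactions"><a href="#3-System-interactions" class="headerlink" title="3.System interactions"></a>3.System interactions</h2><p>We designed the system to minimize the master's involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.</p><h3 id="3-1-Leases-and-Mutation-Order"><a href="#3-1-Leases-and-Mutation-Order" class="headerlink" title="3.1 Leases and Mutation Order"></a>3.1 Leases and Mutation Order</h3><p>(Translator's note: "lease" could also be rendered as "token", and "mutation order" as "change order".) 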
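A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append. Each mutation is performed at all of the chunk's replicas. We use leases to maintain <strong>a consistent mutation order across replicas.</strong> The master grants a chunk lease to one of the replicas, which we call the <em>primary</em>. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus the global mutation order is defined first by the lease grant order chosen by the master (several chunks may need to be modified), and within a lease by the serial numbers assigned by the primary.</p><p>The lease mechanism is designed to minimize management overhead at the master. <strong>A lease has an initial timeout of 60 seconds.</strong> However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (for example, when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.</p><p>In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.</p><p><img src="https://s2.loli.net/2022/06/06/l4VGaILz81vjSF6.png" alt="GFS2.PNG"></p><ol><li>The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are. If no one has a lease, the master grants one to a replica it chooses.</li><li>The master replies with the identity of the primary and the locations of the other replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.</li><li>The client pushes the data to all the replicas, in any order. Each chunkserver stores the data in an internal LRU buffer cache until the data is used or aged out. By <strong>decoupling the data flow from the control flow</strong>, we can improve performance by scheduling the expensive data flow based on the network topology, regardless of which chunkserver is the primary. Section 3.2 discusses this further.</li><li>Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutations to its own local state in serial number order.</li><li>The primary forwards the write request to all secondary replicas. Each secondary applies mutations in the same serial number order assigned by the primary.</li><li>The secondaries all reply to the primary, indicating that they have completed the operation.</li><li>The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. <strong>It makes a few attempts at steps 3 through 7 before falling back to a retry from the beginning of the write.</strong></li></ol><p>If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. The shared file region may therefore end up containing fragments from different clients, although the replicas will be identical because the individual operations complete successfully in the same order on all replicas. This leaves the file region in a consistent but undefined state, as noted in Section 2.7.</p><h3 id="3-2-Data-Flow"><a href="#3-2-Data-Flow" class="headerlink" title="3.2 Data Flow"></a>3.2 Data Flow</h3><p><strong>We decouple the flow of data from the flow of control to use the network efficiently.</strong> While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data. To fully utilize each machine's network bandwidth, the data is pushed linearly along a chunkserver 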
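chain rather than distributed in some other topology (for example, a tree). Each machine's full outbound bandwidth is thus used to transfer the data as fast as possible rather than being divided among multiple recipients.</p><p>To avoid network bottlenecks and high-latency links as much as possible (inter-switch links are often both), each machine forwards the data to the closest machine in the network topology that has not yet received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to the chunkserver closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2. Our network topology is simple enough that distances can be estimated accurately from IP addresses.</p><p>Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.</p><h3 id="3-3-Atomic-Record-Appends"><a href="#3-3-Atomic-Record-Appends" class="headerlink" title="3.3 Atomic Record Appends"></a>3.3 Atomic Record Appends</h3><p>GFS provides an atomic append operation called <em>record append</em> (note that this differs from a traditional append). In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically at an offset of GFS's choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in Unix, but without the race conditions when multiple writers do so concurrently.</p><p>Record append is heavily used by our distributed applications, in which many clients on different machines append to the same file concurrently. <strong>With traditional writes, clients would need additional complicated and expensive synchronization, for example through a distributed lock manager.</strong> In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.</p><p>Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. The client pushes the data to all replicas of the last chunk of the file and then sends its request to the primary. The primary checks whether appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells the secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (<strong>Record append is restricted to at most one-fourth of the maximum chunk size</strong> to keep worst-case fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.</p><p>If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data, possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. This property follows readily from the simple observation that, for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. Furthermore, after this, all replicas are at least as long as the end of the record, and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications can deal with inconsistent regions as discussed in Section 2.7.2.</p><h3 id="3-4-Snapshot"><a href="#3-4-Snapshot" 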
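class="headerlink" title="3.4 Snapshot"></a>3.4 Snapshot</h3><p>The snapshot operation makes a copy of a file or a directory tree (the "source") <strong>almost instantaneously</strong>, while minimizing any interruption of ongoing mutations. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.</p><p>Like AFS, we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent write to these chunks will require an interaction with the master to find the lease holder. This gives the master an opportunity to create a new copy of the chunk first.</p><p>After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. <strong>The newly created snapshot files point to the same chunks as the source files.</strong></p><p>The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C'. It then asks each chunkserver that has a current replica of C to create a new chunk called C'. By creating the new chunk on the same chunkservers as the original, <strong>we ensure that the data can be copied locally, not over the network</strong> (our disks are about three times as fast as our 100 Mbps network links). From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C' and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.</p><h2 id="4-Master-Operation"><a href="#4-Master-Operation" class="headerlink" title="4.Master Operation"></a>4.Master Operation</h2><p>The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. We now discuss each of these topics.</p><h3 id="4-1-Namespace-Management-and-Locking"><a href="#4-1-Namespace-Management-and-Locking" class="headerlink" title="4.1 Namespace Management and Locking"></a>4.1 Namespace Management and Locking</h3><p>Many master operations can take a long time: for example, a snapshot operation has to revoke the leases on all the chunks covered by the snapshot. We do not want to delay other master operations while they are running. Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.</p><p>Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (that is, hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be represented efficiently in memory. Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.</p><p>Each master operation acquires a set of locks before it runs. For example, if it involves /d1/d2/.../dn/leaf, it acquires read locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that leaf may be a file or a directory depending on the operation. 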
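</p><p>The locking rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>lock_set</code> and <code>conflicts</code> are hypothetical helper names, locks are modelled as plain (name, mode) pairs, and nothing here reflects how the real master stores or acquires its locks.</p>
<pre>
```python
def lock_set(path, leaf_mode):
    """Return the (name, mode) locks one operation needs on `path`.

    Every proper ancestor directory gets a read lock; the full pathname
    gets `leaf_mode`, which is either "read" or "write".
    """
    parts = [p for p in path.split("/") if p]
    locks = []
    for depth in range(1, len(parts)):
        locks.append(("/" + "/".join(parts[:depth]), "read"))
    locks.append(("/" + "/".join(parts), leaf_mode))
    return locks


def conflicts(locks_a, locks_b):
    """Two lock sets conflict if they share a name and either side writes it."""
    modes_b = dict(locks_b)
    return any(
        name in modes_b and "write" in (mode, modes_b[name])
        for name, mode in locks_a
    )


if __name__ == "__main__":
    print(lock_set("/d1/d2/leaf", "write"))
    # Two creations in the same directory only share read locks on /d1.
    print(conflicts(lock_set("/d1/a", "write"), lock_set("/d1/b", "write")))
```
</pre>
<p>In this sketch two operations can run concurrently exactly when their lock sets do not conflict, so operations that only read-lock a common parent directory never block each other.</p><p>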
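We now illustrate how this locking mechanism prevents a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations are serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no "directory", or inode-like data structure, to be protected from modification. The read lock on the name is sufficient to protect the parent directory from deletion.</p><p>One nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.</p><p>Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are no longer in use. To prevent deadlock, locks are acquired in a consistent total order: they are first ordered by level in the namespace tree and lexicographically within the same level.</p><h3 id="4-2-Replica-Placement"><a href="#4-2-Replica-Placement" class="headerlink" title="4.2 Replica Placement"></a>4.2 Replica Placement</h3><p>Replica placement. A GFS cluster is highly distributed at more levels than one. It typically has hundreds of chunkservers spread across many machine racks. These chunkservers in turn may be accessed by clients from the same or different racks. Communication between two machines on different racks may cross one or more network switches. Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack. Multi-level distribution presents a challenge for distributing data for scalability, reliability, and availability.</p><p>The chunk replica placement policy serves two purposes: maximize data reliability and availability, and maximize network bandwidth utilization. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine's network bandwidth. We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, because of a failed network switch or power circuit). It also means that traffic for a chunk, especially reads, can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.</p><h3 id="4-3-Creation-Re-replication-Rebalancing"><a href="#4-3-Creation-Re-replication-Rebalancing" class="headerlink" title="4.3 Creation,Re-replication,Rebalancing"></a>4.3 Creation,Re-replication,Rebalancing</h3><p>Chunk replicas are created for three reasons: chunk creation, re-replication, and rebalancing.</p><p>When the master creates a chunk, it chooses where to place the initially empty replicas. It considers several factors. (1) We want to place new replicas on chunkservers with below-average disk space utilization. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of recent creations on each chunkserver. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, we want to spread replicas of a chunk across racks.</p><p>The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. This can happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated is prioritized based on several factors. One is how far it is from its replication goal. For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. In addition, we prefer to re-replicate chunks of live files first, as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress. 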
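</p><p>One way to picture this prioritization is as a sort key over the under-replicated chunks. The text names the three factors but not how they are combined, so the ordering below (client-blocking chunks first, then live files, then distance from the goal) is an assumption, and <code>ChunkState</code>, <code>priority_key</code>, and <code>replication_queue</code> are hypothetical names used only for this sketch.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass
class ChunkState:
    handle: int
    live_replicas: int
    goal: int = 3                  # default replication level from the text
    file_deleted: bool = False     # chunk belongs to a recently deleted file
    blocking_client: bool = False  # a client is currently waiting on this chunk


def priority_key(chunk):
    """Sort key: smaller tuples are re-replicated first (assumed ordering)."""
    missing = chunk.goal - chunk.live_replicas
    return (
        not chunk.blocking_client,  # chunks blocking a client come first
        chunk.file_deleted,         # live files before recently deleted ones
        -missing,                   # then the chunks furthest from their goal
    )


def replication_queue(chunks):
    """Return the under-replicated chunks, most urgent first."""
    needy = [c for c in chunks if c.goal > c.live_replicas]
    return sorted(needy, key=priority_key)


if __name__ == "__main__":
    sample = [ChunkState(1, 2), ChunkState(2, 1), ChunkState(3, 2, file_deleted=True)]
    print([c.handle for c in replication_queue(sample)])
```
</pre>
<p>A real scheduler would probably weight these factors rather than order them strictly; the strict ordering is used here only to keep the sketch short.</p><p>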
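The master picks the highest-priority chunk and clones it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. The new replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks. To keep cloning traffic from overwhelming client traffic, the master limits the number of active clone operations both for the cluster and for each chunkserver. Additionally, each chunkserver limits the bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.</p><p>Finally, the master rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing. Through this process, the master also gradually fills up a new chunkserver rather than instantly swamping it with new chunks and the heavy write traffic that comes with them. The placement criteria for the new replica are similar to those discussed above. In addition, the master must choose which existing replica to remove. In general, it prefers to remove replicas on chunkservers with below-average free space so as to equalize disk space usage.</p><h3 id="4-4-Garbage-Collection"><a href="#4-4-Garbage-Collection" class="headerlink" title="4.4 Garbage Collection"></a>4.4 Garbage Collection</h3><p>After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels. We find that this approach makes the system much simpler and more reliable.</p><h3 id="4-4-1-Mechanism"><a href="#4-4-1-Mechanism" class="headerlink" title="4.4.1 Mechanism"></a>4.4.1 Mechanism</h3><p>When a file is deleted by the application, the master logs the deletion immediately, just like other changes. However, instead of reclaiming resources immediately, the file is renamed to a hidden name that includes the deletion timestamp. During the master's regular scan of the file system namespace, it removes any such hidden files that have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal. When the hidden file is removed from the namespace, its in-memory metadata is erased. This effectively severs its links to all its chunks.</p><p>In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (those not reachable from any file) and erases the metadata for those chunks. In the HeartBeat messages regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.</p><h3 id="4-4-2-Discussion"><a href="#4-4-2-Discussion" class="headerlink" title="4.4.2 Discussion"></a>4.4.2 Discussion</h3><p>Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master. We can also easily identify all the chunk replicas: they are Linux files under designated directories on each chunkserver. Any replica not known to the master is "garbage".</p><p>The garbage collection approach to storage reclamation offers several advantages over eager deletion. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. Replica deletion messages may be lost, and the master has to remember to resend them across failures. Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. It is therefore done in batches and the cost is amortized. Moreover, it is done only when the master is relatively free, so the master can respond more promptly to client requests that demand timely attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion. 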
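</p><p>The two scans described in Section 4.4.1 reduce to a timestamp filter and a set difference. The sketch below is illustrative only: the function names and the plain dict and list inputs are assumptions, and the three-day grace period is the configurable default mentioned in the text.</p>
<pre>
```python
import time

GRACE_SECONDS = 3 * 24 * 3600  # the configurable three-day interval from the text


def expired_hidden_files(hidden_files, now=None):
    """Return hidden file names whose deletion is older than the grace period.

    `hidden_files` maps a hidden name to its deletion timestamp in seconds.
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, deleted_at in hidden_files.items()
        if now - deleted_at > GRACE_SECONDS
    )


def orphaned_replicas(reported_handles, master_known_handles):
    """Chunk handles a chunkserver reported that the master no longer tracks."""
    return sorted(set(reported_handles) - set(master_known_handles))


if __name__ == "__main__":
    # The chunkserver reports four chunks; the master only knows two of them.
    print(orphaned_replicas([1, 2, 3, 4], [2, 4, 9]))
```
</pre>
<p>The reply to the chunkserver is simply the result of <code>orphaned_replicas</code>; the chunkserver can then delete those replicas whenever it likes.</p><p>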
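In our experience, the main disadvantage is that the delay sometimes hinders users' efforts to fine-tune disk usage when storage is tight. Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. We address this by expediting storage reclamation if a deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and that any deleted files are immediately and irrevocably removed from the file system state.</p><h3 id="4-5-Stale-Replica-Detection"><a href="#4-5-Stale-Replica-Detection" class="headerlink" title="4.5 Stale Replica Detection"></a>4.5 Stale Replica Detection</h3><p>Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.</p><p>Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before the client can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that a chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. If the master sees a version number greater than the one in its records, it assumes that it failed when granting the lease and so takes the higher version to be up to date.</p><p>The master removes stale replicas in its regular garbage collection. Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.</p><h2 id="5-Fault-Tolerance-And-Diagnosis"><a href="#5-Fault-Tolerance-And-Diagnosis" class="headerlink" title="5.Fault Tolerance And Diagnosis"></a>5.Fault Tolerance And Diagnosis</h2><p>Fault tolerance and diagnosis. One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems the norm rather than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.</p><h3 id="5-1-High-Availability"><a href="#5-1-High-Availability" class="headerlink" title="5.1 High Availability"></a>5.1 High Availability</h3><p>Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.</p><h3 id="5-1-1-Fast-Recovery"><a href="#5-1-1-Fast-Recovery" class="headerlink" title="5.1.1 Fast Recovery"></a>5.1.1 Fast Recovery</h3><p><strong>Fast recovery.</strong> Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination. Servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. Section 6.2.2 reports observed startup times.</p><h3 id="5-1-2-Chunk-Replication"><a href="#5-1-2-Chunk-Replication" class="headerlink" title="5.1.2 Chunk 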
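Replication</h3><p><strong>Chunk replication</strong>. As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy, such as parity or erasure codes, for our growing read-only storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system, because our traffic is dominated by appends and reads rather than small random writes.</p><h3 id="5-1-3-Master-Replication"><a href="#5-1-3-Master-Replication" class="headerlink" title="5.1.3 Master Replication"></a>5.1.3 Master Replication</h3><p>The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection. When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (for example, gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.</p><p>Moreover, shadow masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically by fractions of a second to a few seconds. They enhance read availability for files that are not being actively mutated or for applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be stale within short windows is file metadata, such as directory contents or access control information.</p><h3 id="5-2-Data-Integrity"><a href="#5-2-Data-Integrity" class="headerlink" title="5.2 Data Integrity"></a>5.2 Data Integrity</h3><p><strong>Data integrity.</strong> Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss. We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, do not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.</p><p>A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.</p><p>For reads, the chunkserver verifies the checksum of the data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Chunkservers therefore do not propagate corruption to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requester and reports the mismatch to the master. In response, the requester reads from other replicas, while the master clones the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.</p><p>Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparisons on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/O. 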
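</p><p>The read-path check described above can be sketched as follows. The text only says that each 64 KB block has a 32-bit checksum; using CRC-32 via <code>zlib.crc32</code> is an assumption made for illustration, as are the function names, the in-memory list of checksums, and the requirement that the requested range lies inside the chunk.</p>
<pre>
```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as in the text


def block_checksums(chunk_data):
    """Compute one 32-bit checksum per 64 KB block of a chunk."""
    return [
        zlib.crc32(chunk_data[start:start + BLOCK_SIZE])
        for start in range(0, len(chunk_data), BLOCK_SIZE)
    ]


def verified_read(chunk_data, checksums, offset, length):
    """Return the requested bytes after verifying every block they overlap."""
    if length == 0:
        return b""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for index in range(first, last + 1):
        start = index * BLOCK_SIZE
        block = chunk_data[start:start + BLOCK_SIZE]
        if zlib.crc32(block) != checksums[index]:
            # A real chunkserver would also report the mismatch to the master.
            raise IOError(f"checksum mismatch in block {index}")
    return chunk_data[offset:offset + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 1024   # 256 KB of sample data, i.e. four blocks
    sums = block_checksums(data)
    print(len(sums), len(verified_read(data, sums, 70000, 10)))
```
</pre>
<p>Only the blocks that overlap the requested range are verified, which is why aligning reads to block boundaries keeps the extra work small.</p><p>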
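Checksum computation is heavily optimized for writes that append to the end of a chunk, because they dominate our workloads. We just incrementally update the checksum for the last partial checksum block and compute new checksums for any brand-new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read. (In other words, the last block's checksum is updated without being verified first. Because the computation is incremental, the existing data in that block is not checksummed again, so any existing corruption stays reflected in the checksum; the key point is the incremental style of computation.)</p><p>In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten. (Incremental computation is not possible here: this is an overwrite rather than an append, the existing checksum covers the whole block, the checksum of just part of the block cannot be extracted from it, and so it must be recomputed.)</p><p>During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted one. This prevents an inactive but corrupted replica from fooling the master into thinking that it has enough valid replicas of a chunk.</p><h3 id="5-3-Diagnostic-Tools"><a href="#5-3-Diagnostic-Tools" class="headerlink" title="5.3 Diagnostic Tools"></a>5.3 Diagnostic Tools</h3><p><strong>Diagnostic tools</strong>. Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep them around as far as space permits.</p><p>The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis.</p><p>The performance impact of logging is minimal, and far outweighed by the benefits, because the logs are written sequentially and asynchronously. The most recent events are also kept in memory and are available for continuous online monitoring.</p><h2 id="6-Measurements"><a href="#6-Measurements" class="headerlink" title="6.Measurements"></a>6.Measurements</h2><p>In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, along with some numbers from real clusters in use at Google.</p><h3 id="6-1-Mirco-benchmarks"><a href="#6-1-Mirco-benchmarks" class="headerlink" title="6.1 Mirco-benchmarks"></a>6.1 Micro-benchmarks</h3><p>We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. This configuration was set up for ease of testing; typical clusters have hundreds of chunkservers and hundreds of clients.</p><p>All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.</p><h3 id="6-1-1-Reads"><a href="#6-1-1-Reads" class="headerlink" title="6.1.1 Reads"></a>6.1.1 Reads</h3><p>N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer 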
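cache. Our results should be close to cold-cache results.</p><p>Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface is saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because, as the number of readers increases, so does the probability that multiple readers read from the same chunkserver simultaneously.</p><p><img src="https://s2.loli.net/2022/06/06/lAnxpw1kJWTG8fS.png" alt="GFS3.PNG"></p><h3 id="6-1-2-Writes"><a href="#6-1-2-Writes" class="headerlink" title="6.1.2 Writes"></a>6.1.2 Writes</h3><p>N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection.</p><p>The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate.</p><p>The aggregate write rate reaches 35 MB/s for 16 clients (about 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.</p><p>Writes are slower than we would like. In practice this has not been a major problem: even though it increases the latencies seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.</p><h3 id="6-1-3-Record-Appends"><a href="#6-1-3-Record-Appends" class="headerlink" title="6.1.3 Record Appends"></a>6.1.3 Record Appends</h3><p>Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in the network transfer rates seen by different clients.</p><p>Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously, where both N and M are in the dozens or hundreds. Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice, because a client can make progress on writing one file while the chunkservers for another file are busy.</p><h3 id="6-2-Real-World-Clusters"><a href="#6-2-Real-World-Clusters" class="headerlink" title="6.2 Real World Clusters"></a>6.2 Real World Clusters</h3><p>We now examine two clusters in use within Google that are representative of several others like them. Cluster A is used regularly for research and development by over a hundred engineers. A typical task reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. Cluster B is primarily used for production data processing. Its tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. In both cases, a single task consists of many processes on many machines reading and writing many files simultaneously.</p><h3 id="6-2-1-Storage"><a href="#6-2-1-Storage" class="headerlink" title="6.2.1 Storage"></a>6.2.1 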
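Storage</h3><p>As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. Used space includes all chunk replicas. Virtually all files are replicated three times, so the clusters store 18 TB and 52 TB of file data respectively.</p><p>The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files that were deleted or replaced by a new version but whose storage has not yet been reclaimed. It also has more chunks because its files tend to be larger.</p><p><img src="https://s2.loli.net/2022/06/06/n9thwjE5cGNMykq.png" alt="GFS4.PNG"></p><h3 id="6-2-2-Metadata"><a href="#6-2-2-Metadata" class="headerlink" title="6.2.2 Metadata"></a>6.2.2 Metadata</h3><p>The chunkservers in aggregate store tens of GBs of metadata, mostly the checksums for 64 KB blocks of user data. The only other metadata kept at the chunkservers is the chunk version number discussed in Section 4.5.</p><p>The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file on average. This agrees with our assumption that the size of the master's memory does not limit the system's capacity in practice. Most of the per-file metadata is the file names stored in a prefix-compressed form. Other metadata includes file ownership and permissions, the mapping from files to chunks, and each chunk's current version. In addition, for each chunk we store the current replica locations and a reference count for implementing copy-on-write.</p><p>Each individual server, both chunkservers and the master, has only 50 to 100 MB of metadata. Therefore recovery is fast: it takes only a few seconds to read this metadata from disk before the server is able to answer queries. However, the master is somewhat hobbled for a period, typically 30 to 60 seconds, until it has fetched chunk location information from all chunkservers.</p><h3 id="6-2-3-Read-and-Write-Rates"><a href="#6-2-3-Read-and-Write-Rates" class="headerlink" title="6.2.3 Read and Write Rates"></a>6.2.3 Read and Write Rates</h3><p>Table 3 shows read and write rates for various time periods. Both clusters had been up for about one week when these measurements were taken. (The clusters had been restarted recently to upgrade to a new version of GFS.)</p><p>The average write rate was less than 30 MB/s since the restart. When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.</p><p><img src="https://s2.loli.net/2022/06/06/1Ub28ilwzyvDKBH.png" alt="GFS5.PNG"></p><p><img src="https://s2.loli.net/2022/06/06/lAnxpw1kJWTG8fS.png" alt="GFS3.PNG"></p><p>The read rates are much higher than the write rates. The total workload consists of more reads than writes, as we have assumed. Both clusters were in the middle of heavy read activity. In particular, A had been sustaining a read rate of 580 MB/s for the preceding week. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.</p><h3 id="6-2-4-Master-Load"><a href="#6-2-4-Master-Load" class="headerlink" title="6.2.4 Master Load"></a>6.2.4 Master Load</h3><p>Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. The master can easily keep up with this rate and therefore is not a bottleneck for these workloads.</p><p>In an earlier version of GFS, the master was occasionally a bottleneck for some workloads. It spent most of its time sequentially scanning through large directories (which contained hundreds of thousands of files) looking for particular files. We have since changed the master data structures to allow efficient binary searches through the namespace. It can now easily support many thousands of file accesses per second. If necessary, we could speed it up further (by placing name lookup caches in front of the namespace 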
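data structures).</p><h3 id="6-2-5-Recovery-Time"><a href="#6-2-5-Recovery-Time" class="headerlink" title="6.2.5 Recovery Time"></a>6.2.5 Recovery Time</h3><p>After a chunkserver fails, some chunks become under-replicated and must be cloned to restore their replication levels. The time it takes to restore all such chunks depends on the amount of resources. In one experiment, we killed a single chunkserver in cluster B. The chunkserver had about 15,000 chunks containing 600 GB of data. To limit the impact on running applications and to provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers), where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.</p><p>In another experiment, we killed two chunkservers, each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to a single replica. Within 2 minutes they were restored to at least two replicas, putting the cluster in a state where it could tolerate another chunkserver failure without data loss.</p><h3 id="6-3-Workload-Breakdown"><a href="#6-3-Workload-Breakdown" class="headerlink" title="6.3 Workload Breakdown"></a>6.3 Workload Breakdown</h3><p><strong>Workload breakdown</strong>. In this section, we present a detailed breakdown of the workloads on two further GFS clusters. Cluster X is for research and development, while cluster Y is for production data processing.</p><h3 id="6-3-1-Methodology-and-Caveats"><a href="#6-3-1-Methodology-and-Caveats" class="headerlink" title="6.3.1 Methodology and Caveats"></a>6.3.1 Methodology and Caveats</h3><p>These results include only client-originated requests, so they reflect the workload generated by our applications for the file system as a whole. They do not include inter-server requests made to carry out client requests, or internal background activities such as forwarded writes or rebalancing.</p><p>Statistics on I/O operations are based on information heuristically reconstructed from the actual RPC requests logged by GFS servers. For example, GFS client code may break a read into multiple RPCs to increase parallelism, from which we infer the original read. Since our access patterns are highly stylized, we expect any error to be in the noise. Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart the running clients to do so, and it is cumbersome to collect the results from so many machines.</p><p>One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.</p><h3 id="6-3-2-Chunkserver-Workload"><a href="#6-3-2-Chunkserver-Workload" class="headerlink" title="6.3.2 Chunkserver Workload"></a>6.3.2 Chunkserver Workload</h3><p>Table 4 shows the distribution of operations by size. Read sizes exhibit a bimodal distribution. The small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. The large reads (over 512 KB) come from long sequential reads through entire files.</p><p><img src="https://s2.loli.net/2022/06/06/LVK7RkFCmxiBut3.png" alt="GFS6.PNG"></p><p>A significant number of reads in cluster Y return no data at all. Our applications, especially those in the production systems, often use files as producer-consumer queues. Producers append concurrently to a file while a consumer reads from the end of the file. Occasionally, no data is returned when the consumer outpaces the producers. Cluster X shows this less often because it is usually used for short-lived data analysis tasks rather than long-lived distributed applications. 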
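Write sizes also exhibit a bimodal distribution. Large writes (over 256 KB) typically result from significant buffering within the writers. Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the small writes (under 64 KB). As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does, because the production systems that use cluster Y are more aggressively tuned for GFS. Table 5 shows the total amount of data transferred in operations of various sizes. For all kinds of operations, large operations (over 256 KB) account for most of the bytes transferred. Small reads (under 64 KB) nevertheless transfer a small but significant portion of the read data, because of the random seek workload.</p><p><img src="https://s2.loli.net/2022/06/06/M6ix1GB8FrlbPCN.png" alt="GFS7.PNG"></p><h3 id="6-3-3-Appends-versus-Writes"><a href="#6-3-3-Appends-versus-Writes" class="headerlink" title="6.3.3 Appends versus Writes"></a>6.3.3 Appends versus Writes</h3><p>Record appends are heavily used, especially in our production systems. For cluster X, the ratio of writes to record appends is 108:1 by bytes transferred and 8:1 by operation count. For cluster Y, the ratios are 3.7:1 and 2.5:1 respectively. These ratios suggest that for both clusters record appends tend to be larger than writes. For cluster X, however, overall usage of record append during the measured period was fairly low, so the results are probably skewed by one or two applications with particular buffer size settings. As expected, our data mutation workload is dominated by appending rather than overwriting (a write may itself be an append). We measured the amount of data overwritten on primary replicas. For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% of mutation operations. For cluster Y, both figures are 0.05%. Although this is small, it is still higher than we expected. It turns out that most of these overwrites came from client retries caused by errors or timeouts. They are not part of the workload itself but a consequence of the retry mechanism.</p><h3 id="6-3-4-Master-Workload"><a href="#6-3-4-Master-Workload" class="headerlink" title="6.3.4 Master Workload"></a>6.3.4 Master Workload</h3><p>Table 6 shows the breakdown of requests to the master by type. Most requests ask for chunk locations or for the lease holder information needed for data mutations.</p><p><img src="https://s2.loli.net/2022/06/06/udnqBh3bHeGojQU.png" alt="GFS8.PNG"></p><p>Clusters X and Y see significantly different numbers of Delete requests, because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. Part of this difference is further hidden in the Open requests, because an old version of a file may be implicitly deleted when the file is opened for writing from scratch (like mode "w" in Unix open). FindMatchingFiles is a pattern-matching request that supports "ls" and similar operations. Unlike other requests, it may have to process a large part of the namespace and so can be expensive. Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file system to understand global application state. By contrast, applications on cluster X are under more explicit user control and usually know the names of all needed files in advance.</p><h2 id="7-Experiences"><a href="#7-Experiences" class="headerlink" title="7.Experiences"></a>7.Experiences</h2><p>In the process of building and deploying GFS, we ran into a variety of issues, some operational and some technical. Initially, GFS was conceived only as the backend file system for our production systems. Over time, it came to be used for research and development as well. At first it had little support for things like permissions and quotas, but it now includes rudimentary forms of these. Production systems are well disciplined and controlled, but users sometimes are not, so more infrastructure is needed to keep users from interfering with one another. 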
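Some of our biggest problems were related to disks and Linux. Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions, but in fact they responded reliably only to the more recent ones. Because the protocol versions are very similar, these drives mostly worked, but occasionally the mismatch caused the drive and the kernel to disagree about the drive's state. Because of problems in the kernel, this silently corrupted data. This problem motivated our use of checksums to detect data corruption, and at the same time we modified the kernel to handle these protocol mismatches. Earlier, we had some problems with Linux 2.2 kernels because of the cost of fsync(). Its cost is proportional to the size of the file rather than the size of the modified portion. This was a problem for our large operation logs, especially before we implemented checkpointing. We worked around it for a time by using synchronous writes, and eventually migrated to Linux 2.4. Another Linux problem involved a single reader-writer lock that any thread in an address space must hold when it pages in from disk (reader lock) or modifies the address space in an mmap() call (writer lock). We saw transient timeouts in our system under light load and looked hard for resource bottlenecks or sporadic hardware failures. Eventually, we found that this single lock blocked the primary network thread from mapping new data into memory while the disk threads were paging in previously mapped data. Since we are limited mainly by network bandwidth rather than memory copy bandwidth, we worked around this by replacing mmap() with pread(), at the cost of an extra copy. Despite occasional problems, the availability of the Linux source code has helped us explore and understand system behavior. When appropriate, we improve the kernel and share the changes with the open source community.</p><h2 id="8-Related-work"><a href="#8-Related-work" class="headerlink" title="8.Related work"></a>8.Related work</h2><p>Like other large distributed file systems such as AFS, GFS provides a location-independent namespace, which allows data to be moved transparently for load balancing or fault tolerance. Unlike AFS, GFS spreads a file's data across multiple storage servers, in a way more akin to xFS and Swift, in order to deliver aggregate performance and increased fault tolerance. Disks are relatively cheap, and replication is simpler than more sophisticated RAID approaches. Because GFS currently uses only replication for redundancy, it consumes more raw storage than xFS or Swift. In contrast to systems like AFS, xFS, Frangipani, and Intermezzo, GFS does not provide any caching below the file system interface. Our target workloads have little reuse within a single application run, because they either stream through a large data set or seek randomly within it and read small amounts of data each time. Some distributed file systems, such as xFS, Frangipani, Minnesota’s GFS, and GPFS, remove the centralized server and rely on distributed algorithms for consistency and management. We opted for the centralized approach in order to simplify the design, increase its reliability, and gain flexibility. In particular, a centralized master makes it much easier to implement sophisticated chunk placement and replication policies, since the master already has most of the relevant information and controls how it changes. We address fault tolerance by keeping the master state small and fully replicated on other machines. Scalability and high availability (for reads) are currently provided by our shadow master mechanism. Updates to the master state are made persistent by appending to a write-ahead log. We could therefore adapt a primary-copy scheme like the one in Harp to provide high availability with stronger consistency guarantees than our current scheme. We are addressing a problem similar to that of Lustre: delivering aggregate performance to a large number of clients. However, we have simplified the problem significantly by focusing on the needs of our own applications rather than building a POSIX-compliant file system. In addition, GFS assumes a large number of unreliable components, so fault tolerance is central to our design. GFS most closely resembles the NASD architecture. While NASD is based on network-attached disk drives, GFS uses commodity machines as chunkservers. Unlike NASD, our chunkservers use lazily allocated fixed-size chunks rather than variable-length objects. GFS also implements features such as rebalancing, replication, and fast recovery that are required in a production environment. Unlike Minnesota’s GFS and NASD, we do not seek to alter the model of the storage device. We focus instead on addressing the day-to-day data processing needs of complicated distributed systems built from existing commodity components. The producer-consumer queues enabled by atomic record appends address a problem similar to that of the distributed queues in River. River uses memory-based queues distributed across machines together with careful data flow control, whereas GFS uses a file that many producers can append to concurrently. The River model supports m-to-n distributed queues but lacks fault tolerance, while GFS currently supports only m-to-1 queues. Multiple consumers can read the same file, but they must coordinate to partition the incoming load (each handling a disjoint part).</p><h2 id="9-Conclusions"><a 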
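href="#9-Conclusions" class="headerlink" title="9.Conclusions"></a>9.Conclusions</h2><p>GFS demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. Although some design decisions are specific to our particular setting, many may apply to data processing tasks with similar requirements and characteristics. We reexamined traditional file system assumptions in light of our current application workloads. These observations led us to radically different points in the design space. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system. Our system provides fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery. Chunk replication lets us tolerate chunkserver failures. The frequency of these failures motivated an online repair mechanism that regularly and transparently repairs the damage and restores lost replicas as soon as possible. In addition, we use checksums to detect data corruption, which becomes all too common given the number of disks in the system. Our design delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks. We achieve this by separating file system control, which passes through the master, from data transfer, which passes directly between chunkservers and clients. A large chunk size and chunk leases minimize the master's involvement in common operations, so the centralized master does not become a bottleneck. We believe that improvements in our networking stack will lift the current limit on the write throughput seen by an individual client. GFS has successfully met our storage needs and is widely used as the storage platform for research and development as well as for production data processing. It is an important tool that enables us to keep innovating and to attack problems on the scale of the entire web.</p>]]></content>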
<summary type="html"><p><strong>GFS 中文翻译</strong></p>
<h2 id="ABSTRACT"><a href="#ABSTRACT" class="headerlink" title="ABSTRACT"></a>ABSTRACT</h2><p>我们已经设计和实现了Goo</summary>
</entry>
</feed>