-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy paththesis.tex
2154 lines (1798 loc) · 185 KB
/
thesis.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass{article}
\usepackage{textcomp}
\usepackage{tikz}
\usepackage{amsmath}
\usepackage{pdfpages}
\usepackage[hyphens]{url}
\usepackage{hyperref}
\usepackage[utf8]{inputenc}
\usepackage{relsize}
\usepackage{ifthen}
\usepackage{xstring}
\usepackage{cite}
\usepackage{placeins}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{bold-extra}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{rotating}
\usepackage{pdflscape}
\usepackage[inline]{enumitem}
\usepackage{mdframed}
\usepackage{mathtools}
\usepackage{bm}
\usepackage{newfloat}
\usepackage{xspace}
\include{dcr-preamble}
\include{msg-preamble}
\newcommand\cpp{C\texttt{++}\xspace}
\DeclareFloatingEnvironment[
fileext=los,
listname=List of code snippets,
name=Listing,
placement=tbhp,
within=none,
]{snippet}
\lstset{
captionpos=b
}
\definecolor{OliveGreen}{HTML}{3C8031}
\definecolor{Papyrus}{HTML}{FFFAE6}
\lstdefinestyle{pseudo}{
mathescape=true,
escapeinside={<|}{|>},
morekeywords={function, send, to, receive, from, broadcast, timeout, match, with, when, and, or, if, else},
morecomment=[l]{//},
keepspaces=true,
basicstyle=\small\ttfamily,
commentstyle=\color{OliveGreen},
keywordstyle=\color{blue}\bfseries,
numberstyle=\small\ttfamily,
numbers=left,
xleftmargin=-.1cm
}
% - forklar låsning af egen event og hvornår
% - at vi ikke låser på conditions
% - manglende states i netværks tilstandsgrafen
% - adversary model mangler udbygning (meget stærkere)
% - løsnings table
% - konklusion
\begin{document}
\sloppy
\begin{titlepage}
% \includepdf{itu-frontpage.pdf}
% \pagebreak
\centering {\huge Trusted DCR: Decentralised workflow management in a byzantine setting}
\vspace{1.5cm}
{\large Mikkel Gaub \\ Malthe Ettrup Kirkbro \\ Mads Frederik Madsen}
\vspace{0.5cm}
\today
\pagenumbering{gobble}
\vspace{\fill}
{\large Supervisor: Søren Debois}
\vspace{0.5cm}
\begin{abstract}
In this thesis we present a solution to the problem of decentralized distributed workflow execution.
The presented solution utilises partial state replication of dynamic condition response (DCR) graphs to achieve execution of events with complexity less than the number of peers in the system.
We also explore the security and optimisation options provided by recent advances in trusted execution environments (TEEs), specifically Intel Secure Guard Extensions (SGX), in order to achieve byzantine fault tolerance in the context of this problem.
The design and implementation of this system contains several new contributions: a general transformation of crash fault tolerant distributed protocols to byzantine fault tolerant protocols using SGX, an SGX implementation of the Raft consensus algorithm, an efficient method of collecting the state of a DCR graph called \textit{CheapShot}, and an analysis of the synchronisation of executions in DCR graphs using a minimal locking scheme.
Lastly we describe a generalisation of the implemented DCR graph system to a structure supporting arbitrary smart contracts.
\end{abstract}
\end{titlepage}
\clearpage
\pagenumbering{arabic}
\tableofcontents
\newpage
\section{Introduction}
Collaboration between companies, small and large, occurs daily across vast physical distances.
As part of this collaboration a trusted means of coordination is key.
One way to coordinate in such a setting is by using Dynamic Condition Response (DCR) graphs~\cite{hildebrandt_declarative_2011}, which are workflows declaratively formulated as directed graphs.
DCR graphs formulate processes in terms of events and relations between those events.
Relations specify the order in which events can transpire, and what paths can be taken in a workflow.
DCR graphs provide a concise and simple way of describing workflows, as the effects of the execution of an event is declaratively defined by its relations.
This means that an event execution cannot have cascading effects.
This property of limited effects on execution means that DCR graphs allow for highly localized states and, depending on how a specific DCR graph is constructed, it can therefore allow for a very high degree of concurrency in a distributed setting.
Central servers would be an easy way to enable a distribution of a workflow execution, but would introduce a single point of failure as well as require trust and most likely payment, due to a third party being involved in managing the process.
The alternative to a solution using a central server is a decentralized solution, in which participants act as peers in a peer-to-peer network.
Decentralisation brings with it a completely different set of challenges, all springing from the core necessity of distributed collaboration requiring agreement between all parties on what has transpired in the workflow.
This set of challenges is the problem we will present and solve during this thesis.
The issue of making DCR graphs distributed and maintaining a consistent state across peers is closely related to that of distributed replications of state machines, especially when peers can exhibit faulty behaviour.
Existing solutions to the distributed state machine replication (SMR) problem provide a low degree of concurrency, however, and require a significant number of messages, which means that applying such a solution to the problem of distributed DCR graphs would impede the conccurency and efficiency of DCR.
One way to improve on the SMR solution in this instance could be to distribute the graph among the peers, thus allowing for synchronisation only between the parts of the state which affect each other.
This is refereed to as the \textit{partial state machine replication} (PSMR) problem.
Solutions to the PSMR problem are widely sought, as there are a number of applications of SMR solutions where the scalability of state is problematic.
An example of this is the blockchain based distributed smart contract engine, Ethereum~\cite{_ethereum_2018}.
We will therefore analyse the effects of applying our solution to the PSMR problem to an algorithm for running distributed smart contracts.
Recent technological advances in the field of \textit{Trusted Execution Environments} (TEEs), specifically \textit{Intel Software Guard Extensions} (SGX)~\cite{costan_intel_2016}, have already made waves within state replication solutions~\cite{kapitza_cheapbft_2012,veronese_efficient_2013,liu_scalable_2016} and could potentially be leveraged to provide a solution to the described problem, efficiently in terms of required messages and a high degree of concurrency.
Intel SGX provides a TEE in which arbitrary code can be run and the correct execution can be attested by Intel to other users also running Intel SGX.
In a network of untrusting peers, trust can therefore be guaranteed by Intel, which is a very strong base to build a consensus algorithm on.
\noindent The contributions of this thesis are as follows:
\begin{itemize}
\item An implementation of the Raft consensus algorithm inside of a TEE.
\item A thorough design of a DCR engine, with a potentially totally distributed graph state\footnote{By event.}, supporting concurrent executions only limited by the semantics of DCR.
Furthermore, the design accomplishes consensus on executions in less than $O(n)$ messages, where $n$ is the number of peers in the network.
\item A proof-of-concept implementation of the TEE Raft algorithm inside of the designed DCR engine
\item A transformation of any crash-stop tolerant distributed protocol to a byzantine fault tolerant one, using Intel SGX.
\item A description of a generalised version of the DCR engine design supporting smart contracts.
\end{itemize}
\subsection{Problem}
We formulate the problem of byzantine fault tolerant PSMR with a DCR graph as the \textit{state}, and allow for peers to collect the state and change the state through the execution of DCR events.
In order to achieve fault tolerance, the system must distribute the DCR graphs on several peers, so that if one peer fails, other peers can take over running the DCR graph.
The trivial method of ensuring fault tolerance, is to make all peers responsible for the entire state of the system, but means that changes in the graph would then have to be synchronized with all peers.
For a solution to this problem to be scalable in large networks, a message complexity below $O(n)$ is desirable, where $n$ is the number of peers.
Alternatively, the DCR graph would have to be split into parts distributed among peers, but this introduces further issues, as a partial state solution would need intermittent synchronisation, when several actions might affect the same parts of the state.
As DCR graphs are primarily used in industry where the documentation of workflows is important, it is essential that any user is able to collect the entire state of a workflow at any given time.
Assuming a partial state solution, collecting this state is further complicated in that a coherent state would have to be pieced together from each part of the graph.
In summary, the system should:
\begin{description}
\item[Concurrent] Allow for a high degree of concurrency as permitted by DCR graphs, shown in~\cite{debois_concurrency_2015}.
\item[Scalable] Achieve consensus on a state change in less than $O(n)$ messages, where $n$ is the number of peers participating in the graph, while supporting the distribution of the state of the DCR graph among those peers.
\item[Byzantine fault tolerant] Function in the presence of malicious workflow participants and network nodes, where the only guarantees are those based on strong cryptographic security.
\end{description}
\section{Background}
In order to attain a complete understanding of the problem at hand, three subject areas need further explanation:
\begin{description}
\item[DCR] which must be defined and the requirements of which must be explored, as it forms the basic limitations of a potential solution.
\item[Consensus] which defines the fundamental challenge of any decentralized system, in which agreement on state is needed.
\item[Intel SGX] which has recently shown promise in improving the fault tolerance of consensus solutions, notably in~\cite{kapitza_cheapbft_2012,veronese_efficient_2013,liu_scalable_2016}.
\end{description}
\subsection{DCR}
\label{subsec:dcr}
A DCR graph $G$, is a graph representation of a workflow, made up of $v$ stateful nodes, $E=\{e_1, e_2, \dots, e_v\}, v \geq 1$, called \textit{events}, with a number of directed edges called \textit{relations} between them.
The cumulated state of all the events in a graph $G$ is called the state of $G$.
Depending on the relations, an event $e$, in a graph $G$, can be \textit{executed}, which will change the state of events in $G$ according to the rules defined by $e$'s outgoing relations.
We denote this execution $E(G,e)=G'$.
\subsubsection*{Event state}
The events in a DCR graph have three binary attributes: \textit{executed}, \textit{included} and \textit{pending}.
This is an event: \ev{A}
\begin{description}
\item[Executed attribute] If the \textit{executed} attribute of an event is false, executing the event will set its \textit{executed} attribute to true.
Executing an already executed event will have no effect on the \textit{executed} attribute.
The \textit{executed} attribute is shown as a tick mark: \ev[101]{A}
\item[Included attribute] If the \textit{included} attribute is true, the event is said to be \textit{included}.
If the \textit{included} attribute of an event is false, the event is said to be \textit{excluded} and it cannot be executed.
The \textit{included} attribute is denoted by the outline of the event: \ev{A} when included and \ev[000]{A} when excluded.
\item[Pending attribute] If any event in a workflow has a \textit{pending} attribute which is true and that event is included, we say that the workflow is in an unfinished state.
Every time an event is executed, its \textit{pending} attribute is set to false.
This means that setting the \textit{pending} attribute of an included event to true is specifying that this event must be executed or excluded at some point for the workflow to be in a finished state.
The \textit{pending} attribute is shown as an exclamation point: \ev[011]{A}
\end{description}
\subsubsection*{Relations}
Relations are directed edges between events in the workflow.
Each relation is directed from an origin event, toward a target event.
There are five types of relations which each specify different behaviour between events in a workflow.
The origin and the target of a relation can be the same event.
The relations can be divided into two categories:
\begin{description}
\item[Effects] are relations that change the state of the targeted event when the originating event is executed.
If an effect originating from an event \texttt{A} is targeting an event \texttt{B}, then \texttt{A} is said to be the effecting event on the effected event \texttt{B}.
Effects are the \emph{response}, \emph{include} and \emph{exclude} relations.
\begin{description}
\item[Response relation] If there is a response relation from event \texttt{A} to event \texttt{B}, then the \textit{pending} attribute of \texttt{B} will be set to true every time \texttt{A} is executed.
The response relation is represented as: $\ev{A} \rrel \ev{B}$
\item[Include relation] If there is an include relation from event \texttt{A} to event \texttt{B}, then the \textit{included} attribute of \texttt{B} will be set to true every time \texttt{A} is executed.
The include relation is represented as: $\ev{A} \irel \ev{B}$
\item[Exclude relation] If there is an exclude relation from event \texttt{A} to event \texttt{B}, then the \textit{included} attribute of \texttt{B} will be set to false every time \texttt{A} is executed.
The exclude relation is represented as: $\ev{A} \erel \ev{B}$
\end{description}
\item[Constraints] are relations that exclusively affect whether or not the targeted event can be executed.
If a constraint originating from an event \texttt{A}, is targeting an event \texttt{B}, then \texttt{A} is said to be the constraining event on the constrained event \texttt{B}.
Constraints are the \emph{condition} and \emph{milestone} relations.
\begin{description}
\item[Condition relation] If there is a condition relation from event \texttt{A} to event \texttt{B}, then \texttt{B} can only be executed if the \textit{executed} attribute of \texttt{A} is true or the \textit{included} attribute of \texttt{A} is false.
The condition relation is represented as: $\ev{A} \crel \ev{B}$
\item[Milestone relation] If there is a milestone relation from event \texttt{A} to event \texttt{B}, then \texttt{B} can only be executed if the \textit{pending} attribute of \texttt{A} is false or the \textit{included} attribute of \texttt{A} is false.
The milestone relation is represented as: $\ev{A} \mrel \ev{B}$
\end{description}
\end{description}
\subsubsection*{Enabledness}
We use the notion of \textit{enabledness} to describe whether an event is executable in the state of a graph or not.
An event is \textit{enabled} if it can be executed, and \textit{disabled} if it cannot.
An event's enabledness is strictly controlled by the state of its constraining events, and its included variable:
\begin{itemize}
\item If an event is excluded it is disabled.
\item If an event is the target of a milestone relation from an event that is included and pending, it is disabled.
\item If an event is the target of a conditions relation from an event that is included and not executed, it is disabled.
\end{itemize}
In all other cases, the event is enabled.
\subsubsection*{Execution rights}
For DCR graphs to have a meaningful application, each event is typically assigned a number of actors who are allowed to execute that event.
Rights to execute can also be assigned based on role.
\subsubsection*{Formalising \texorpdfstring{$G$}{}}
After the previous definitions, we can now formalise the concept of a DCR graph $G$:
$G = (E,R)$, where
\begin{itemize}
\item $E$ is the set of all events in the workflow,
\item $R = \{r_1, r_2, \dots\}$ is the set of relations in the workflow, where each relation, $r$, has the form $r=(e_{from}, e_{to}, Type)$ and $e_{from}$ is the event $r$ is originating from, $e_{to}$ is the event $r$ is targeting, and $Type$ is one of the relation types.
\end{itemize}
If, in a given state of $G$, an event $e'$ changes state or enabledness when event $e$ is executed, we say that $e'$ is in $e$'s \textit{effecting event set}, denoted $E_{aff}(G,e)$.
\subsubsection{Concurrency}
\label{subsubsec:concurrency}
The notion of concurrency in DCR graphs, described in~\cite{debois_concurrency_2015}, is the property a pair of events have when the graph allows those events to be executed in either order with the same resulting graph state.
More formally, given a graph $G$ and two events $e_1$ and $e_2$ in $G$, we say that $e_1$ and $e_2$ are concurrent if $E(E(G, e_1),e_2)=E(E(G, e_2),e_1)$.
We define a \textit{run} as a series of consecutive executions.
For a run to be valid, each execution must be enabled at the point of execution in the sequence.
We say that two runs $R_1$ and $R_2$, containing the same set of executions, are \textit{equivalent up to concurrency} on a graph $G$ when the results of performing $R_1$ on $G$ leads to $G'$ and performing $R_2$ on $G$ also leads to $G'$.
If there exist a run $R$ which applied to a graph state $G$ results in graph state $G'$, we say that $G'$ is \textit{reachable} from $G$ via $R$.
We identify three concurrency relationships between events, and name these relationships \textit{independence relationships}, and say that the events in an independence relationship are \textit{independent}:
\begin{description}
\item[Dynamic independence] Two events are \textit{dynamically independent} if they are both enabled and concurrent in the current state of the graph.
That is, $e_1$ and $e_2$ are dynamically independent if $E(E(G, e_1),e_2)=E(E(G, e_2),e_1)$ for the current $G$.
Dynamic independence is essentially what we have defined as concurrency.
We say that an event $e$ has a \textit{dynamic independence set} $I_{dyn}(G,e)$, such that $e' \in I_{dyn}(G,e)$ iff. $e$ and $e'$ are dynamically independent in graph state $G$.
Likewise an event $e$ has a \textit{dynamic dependence set} $D_{dyn}(G,e) = E \setminus I_{dyn}(G,e)$.
\item[Static independence] Two events are \textit{statically independent} if they are dynamically independent in every \textit{reachable} state of the graph from the initial graph state.
That is, $e_1$ and $e_2$ are statically independent in $G$, if they are dynamically independent in all states of $G$ that can be reached by a run.
We say that an event $e$ has a \textit{static independence set} $I_{stat}(e)$, such that $e' \in I_{stat}(e)$ iff. $e$ and $e'$ are statically independent.
Likewise an event $e$ has a \textit{static dependence set} $D_{stat}(e) = E \setminus I_{stat}(e)$.
Notice that $e$'s static dependence set is the set of events that are dependent with $e$ in \textit{some} state of $G$, not all.
The problem of finding all static independence sets is NP-hard in~\cite{debois_replication_2017-1} and an approximation will have to be used instead.
\end{description}
The approximation we use of static event independence is from~\cite{debois_concurrency_2015}, and boils down to the following principles, where we say that $e_1$ and $e_2$ are approximately statically dependent if, in any event-state permutation of $G$:
\begin{itemize}
\item $e_1$ can enable $e_2$, or vice versa.
\item $e_1$ can disable $e_2$, or vice versa.
\item $e_1$ can disable some $e$, if $e_2$ can enable it, or vice versa.
\item $e_1$ can exclude some $e$, if $e_2$ can include it, or vice versa.
\item $e_1$ can set $e_2$ to pending, unless $e_2$ sets itself to pending, or vice versa.
\end{itemize}
This approximation is complete, but not sound, meaning it contains all statically dependent event pairs, but does not guarantee that an event pair in the approximation is not statically independent.
We say that an event $e$ has an \textit{approximate static dependence set} $D_{appr}(e)$, such that $e' \in D_{appr}(e)$ iff. $e$ and $e'$ are approximately statically dependent, according to the above approximation.
Likewise an event $e$ has an \textit{approximate static independence set} $I_{appr}(e) = E \setminus D_{appr}(e)$.
Notice that the an approximate static independence set is sound but incomplete.
All the independence- and dependence sets are symmetric: $e' \in S(e) \iff e \in S(e')$, where $S$ is any of the described (in-)dependency sets.
Notice that in any DCR graph it must hold that $\forall G. \forall e\in G. |I_{dyn}(G,e)| \geq |I_{stat}(e)| \geq |I_{appr}(e)|$, which speaks to the desirability of the different concurrency levels; a system that supports concurrent execution of dynamically independent events allows for more concurrent executions than a one that supports execution of statically independent events, which in turn allows for more concurrent executions than one that supports execution of approximate statically independent events.
Figure~\ref{fig:concurrency-all} shows examples of independence in DCR graphs.
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/dcr-graphs/concurrency-all.pdf}
\caption{From top to bottom:
(1) An example of a DCR graph where \ev{A} and \ev{B} are not concurrent.
(2) An example of a DCR graph where \ev{A} and \ev{B} are statically independent.
(3) An example of a DCR graph where \ev{A} and \ev{B} are dynamically, but not statically independent (Note that this is only applicable if from a starting state where \ev{A} is not executed).
(4) An example of a DCR graph where \ev{A} and \ev{B} are statically independent, but the approximation will contain them as a dependent pair\label{fig:concurrency-all}.}
\end{figure}
\subsection{Consensus}
\label{subsec:consensus}
Consensus is the problem of achieving agreement between the processes in a distributed system.
It is central to the problem of decentralised DCR, as agreement on the states of events must be achieved to ensure that all correct peers adhere to the semantics of the DCR graph.
Consider a simple graph: $\ev{A} \crel \ev{B}$, and let a subset of peers, $ps_1$, store the state of $\ev{A}$, and a subset of peers, $ps_2$, store the state of $\ev{B}$.
If agreement on the state of $\ev{A}$ is not guaranteed, then $ps_2$ might, in the belief that $\ev{A}$ is executed, execute $\ev{B}$ even though $ps_1$ has never executed $\ev{A}$.
A consensus protocol is formally a protocol with $n$ processes ($p_1, p_2, \dots, p_{n}$), where each process $p_i$ begins in an undecided state, and proposes a single value $v_i$ from a set of values $D$. The processes then communicate their values to each other, and each process sets a decision value $d_i$ and enters a decided state.
For the protocol to solve the consensus problem, the following requirements should hold for every execution:
\begin{description}
\item[Termination] every process eventually sets its decision value.
\item[Agreement] the decision value of all correct processes are the same.
\item[Validity] if a correct process decides $d$, then some correct process has proposed a value $v$, where $v = d$.
\end{description}
We show that distributed DCR reduces to consensus, under the assumptions that the correct graph has been distributed to all peers, and that the initial state is agreed upon:
\begin{itemize}
\item Let $D$ be the set of enabled events.
\item Let $v_i$ uniquely identify an event.
\item A correct process will only propose $v_i$ if the event identified by $v_i$ is enabled.
\item We can now use a consensus sub-process to decide on a $v$, which will then be executed.
\item Each process recalculates $D$ to reflect the new global state, and repeats the protocol until the workflow has been completed.
\end{itemize}
In the famous FLP impossibility result from 1985~\cite{fischer_impossibility_1985}, it was shown that no consensus protocol could have termination, agreement and validity.
The proof essentially boils down to showing that that non-termination is possible in all consensus protocol systems with asynchronous communication channels and at least one faulty process.
In order to circumvent this impossibility, the guarantees of a fault-tolerant system that utilizes a consensus protocol must be relaxed, for instance by guaranteeing termination only when no faults are present, as is the case in the Paxos protocol~\cite{lamport_part-time_1998}.
\subsubsection{State machine replication}
One problem that relates directly to consensus, is \textit{State Machine Replication}~\cite{schneider_implementing_1990} (SMR).
SMR is the problem of replicating the state of a system on several nodes, called \textit{replicas}.
On each replica, an instance of the same state machine is running, and must take requests and pass them on to the state machine.
In SMR systems, \textit{clients} send requests to the replicas who must then guarantee:
\begin{description}
\item[Safety] all non-faulty replicas execute the requests in the same order.
\item[Liveness] clients eventually receive replies to their requests.
\end{description}
The state machine replication problem is equivalent to the consensus problem and provides the same guarantees~\cite{schneider_implementing_1990}, since a solution to SMR is consensus on the state of a request log.
This is evident in the fact that the safety requirement is essentially an aggregation of the agreement and validity requirements in consensus, while the liveness property is equivalent to termination.
As such, FLP impossibility entails that one of the requirements must be relaxed in fault tolerant SMR systems.
This is often done by only guaranteeing liveness when no failures are present~\cite{chandra_unreliable_1996,lamport_part-time_1998,castro_practical_1999,kotla_zyzzyva_2007}.
A version of SMR deals with byzantine faults, rather than crash failures.
These systems are knowns as \textit{byzantine fault tolerant} (BFT) systems~\cite{castro_practical_1999,correia_byzantine_2011,veronese_efficient_2013,liu_scalable_2016}.
These systems deal with arbitrary faults, meaning arbitrary behaviour of faulty processes and channels, including modified messages, new messages and changes in local state.
It has been shown that $2f+1$\footnote{Reads as $n=2f+1$, where $n$ is the number of processes, and $f$ is the number of recoverable faults.} processes are necessary and sufficient to solve SMR with relaxed liveness if faulty processes only exhibit crashes~\cite{bracha_asynchronous_1985}.
Similarly it has been shown that $3f+1$ processes are necessary and sufficient to solve SMR with relaxed liveness if faulty processes can exhibit byzantine faults~\cite{bracha_asynchronous_1985,pease_reaching_1980}.
However, recent advances have been made in BFT-algorithms, which increase fault tolerance from $3f+1$ to $2f+1$ by utilizing TEE features~\cite{liu_scalable_2016,kapitza_cheapbft_2012,veronese_efficient_2013}.
SMR brings with it the taxing job of pushing each update to all nodes in the network, which leads to obvious scalability issues in large networks.
One solution to that problem is delegating the responsibility of segments of data to each node in the network.
This is called \textit{Partial State Machine Replication} (PSMR)~\cite{sousa_partial_2001}.
Since the state in PSMR is divided into parts, it is possible for multiple leaders to be active concurrently, as opposed to traditional SMR implementations, that have a single leader at any given time.
Having multiple leaders means that the availability of the system is much greater, as a single node does not have to process all requests.
High concurrency and performance also follows from the partial states reducing the number of nodes which have to be notified of a state change leading to both fewer messages and less chance of two state changes overlapping, requiring ordering.
Relatively few solutions exists for the PSMR problem and the ones that do require a large amount of messages~\cite{sousa_partial_2001}.
To the best of our knowledge, no solution to the PSMR problem using a trusted execution environment exists.
\subsection{Intel Software Guard Extensions}
\label{subsec:intel-sgx}
Intel SGX is a TEE technology which includes a set of CPU instructions, that allow user-level code to run in protected areas of memory, so called \textit{enclaves}.
Only processes running in enclaves are allowed to use SGX instructions.
Intel guarantees that any code run within an enclave, and any data loaded into an enclave, is protected from access by any process running outside of that enclave~\cite{intel_sgx}.
More specifically, the guarantees encompass \textit{confidentiality}, as only the process running in the enclave can read the contents of the enclave, and \textit{integrity}, as only the process running in the enclave can modify the contents of the enclave.
These guarantees of the enclave comes at the expense of direct access to system calls, and by extension drives, network cards and all hardware components other than the processor it runs on.
In order to access any of these, an enclave therefore needs a so-called \textit{wrapper component} which is a non-enclave block of memory that the process can make calls to, and receive calls from.
This is all made possible by a set of unique keys generated during manufacturing and permanently stored inside the so-called \textit{fuse array} of the processor | an array of fuses that can be programmed once and then protected from further manipulation~\cite{robson_electrically_2007}.
Using these keys, an enclave process is able to identify if another process is an enclave process, using the process of \textit{local attestation}.
Local attestation is intuitively possible as two enclaves running on the same processor both have access to the same secret key, provisioned by Intel.
This process can be extended to identify if a process running on another processor is an enclave process, by contacting Intel's servers for verification of the other processor's key, called \textit{remote attestation}.
Remote attestation can be performed with all other SGX enabled processes, as Intel has already provisioned keys to these.
In short, the major innovation in Intel SGX is the option of running hardware secured software, which enables tamper-proof messages, where the sender can be verified as being a correct process.
As an example of SGX's usefulness in relation to consensus algorithms, SGX has been utilized to implement a new Byzantine fault tolerance (BFT) consensus algorithm, called \textit{FastBFT}~\cite{liu_scalable_2016}.
This is accomplished by using the \textit{strawman design} where a request is sent to a node, the \textit{primary}, who prepares a vote by distributing parts of a secret to all nodes.
The secret can be reconstructed given enough parts and then compared to the hash of the secret which is common knowledge.
The reason that Intel SGX is needed for this algorithm is that the primary, who distributes the secrets, can fake being any node as it has all parts of the secret.
A Trusted Execution Environment (TEE) such as Intel SGX, means that these secrets can be computed and distributed without the primary ever having access to them.
A further issue with the strawman design, is that the primary can change the orders of requests, thereby equivocating which request is being voted on.
This is once again solved by Intel SGX, by numbering requests with a trusted counter that can only be incremented.
This application of Intel SGX means that the byzantine faults in the algorithm, $f$, can now be tolerated up to $f = \lfloor\frac{n-1}{2}\rfloor$, given that the SGX component does not exhibit byzantine faults, but at most crash faults.
\subsubsection{Enclave}
\label{subsec:enclave}
An enclave is a software module which, as its name suggests, is isolated completely from the rest of the system.
The strong guarantees of enclaves, confidentiality and integrity, build on the following two hardware details:
\begin{description}
\item [Processor Reserved Memory (PRM)] is a sequential block of memory reserved for SGX, only accessible through enclave CPU instructions.
\item [Enclave mode] is a mode under which a process gains access to the PRM.
\end{description}
An enclave's memory is located solely in PRM, preventing access by other processes and enclaves running on the system.
The PRM is protected from non-enclave processes by enclave mode.
When a process is not in enclave mode and tries to access the PRM, the memory access is denied by the processor~\cite{costan_intel_2016}.
The PRM of an enclave process is protected from accesses by other processes in enclave mode by the SGX Enclave Control Structure (SECS).
The SECS holds meta-data about the enclave processes, and among this is a virtual-to-physical memory mapping called the Enclave Page Cache (EPC).
Through the EPC an enclave's access to physical PRM is restricted to what has been allocated for the enclave-process~\cite{costan_intel_2016}.
All enclave data is encrypted while stored in memory, and is decrypted when loaded out of memory, and as such prevents anyone from reading the enclave data by bypassing the processor.
While SGX provides powerful guarantees through its enclave concept, it does not guarantee software correctness and will not protect against flawed software.
Instead SGX encourages developers to isolate a minimal piece of their software, the Trusted Computing Base (TCB), in a trusted enclave environment, and keep the remainder outside the enclave~\cite{intel_sgx_guide}.
By minimizing the size of the TCB, and thus the amount of code one must trust, common security principles dictate that the chance of security flaws decreases~\cite{intel_sgx_guide}.
In order for an enclave to be useful, communication with untrusted code is typically required.
Calls to trusted enclave code from untrusted code, or other enclaves, is enabled through \textit{enclave calls} (ECALLs) and calls to untrusted code from enclave code is enabled through \textit{out calls} (OCALLs).
This interface must be defined at compile-time, specifying an API of ECALLs for the enclave as well as any untrusted services needed as OCALLs, in an \textit{EDL}-file~\cite{intel_sgx_guide}.
When built, an enclave module is a plain binary on the untrusted file system.
As an enclave under such circumstances would be vulnerable to tampering before initialisation, SGX enforces a strict signature policy.
An enclave must include an \textit{Enclave Signature} containing \textbf{(1)} a hash of the code and initialisation data of the enclave, \textbf{(2)} the author's public key and \textbf{(3)} an enclave version/product number.
During enclave initialisation a hardware check is performed, ensuring that the Enclave Signature matches the enclave binary loaded from the file system.
This signature identifies the enclave during the attestation process as well, so if an adversary was to tamper with an existing enclave binary, the contents would no longer match, and would not be allowed to be loaded by SGX.
If the adversary was to create their own enclave, replacing the existing, the signature would no longer match and the replacement would be rejected during the attestation process with other, correct, enclaves.
Another powerful tool of an SGX enclave is the ability to read from, and write data to, an untrusted storage medium while ensuring confidentiality of its contents.
Such capabilities are needed as the PRM is volatile and will not persist after shutdown.
In SGX this process is known as \textit{sealing}.
Sealing allows encryption and decryption using a \textit{Seal Key}, which is confidential to the sealing enclave.
However, these mechanics alone do not guarantee integrity, as integrity is not only violated by other processes.
For instance, the infamous Single Upset Event (also knowns as cosmic ray bit flips)~\cite{normand_single_1996}, is an integrity violation where bits in memory are flipped due to background radiation.
Intel SGX protects against such hardware errors by using cryptographic integrity checks~\cite{gueron_memory_2016}.
When loading enclave memory, Intel SGX's Memory Encryption Engine (MEE) decrypts the memory, and then performs an integrity check on the loaded memory.
If the integrity check does not pass, the processor drops the memory, and eventually locks itself for further operations (the \textit{drop-and-lock policy}), requiring a physical reset before new operations can be completed~\cite{jang_sgx-bomb_2017}.
The integrity checks are implemented as a Merkle-Tree of Message Authentication Codes (MACs), with the root node of the tree stored in processor-internal memory~\cite{jang_sgx-bomb_2017}.
Under the assumption that AES128 is a random permutation, this method provides the guarantee that an active adversary with the capability of observing the MACs has a negligible probability of forgery~\cite{gueron_memory_2016}.
A practical example of the effectiveness of this integrity check is the \textit{SGX bomb} attack~\cite{jang_sgx-bomb_2017}.
The SGX bomb, is a denial-of-service attack on Intel SGX.
It utilises a Row hammer attack to flip arbitrary bits in the EPC, thus triggering the drop-and-lock policy whenever memory is loaded.
\subsubsection{Attestation}
\label{subsec:attestation}
Attestation, within the Intel SGX platform, is the action of verifying that some other process is running inside a specific enclave.
The applications of this are two-fold, in the case where there are multiple enclaves running locally on the same CPU and in the case where enclaves need to communicate to enclaves running on external CPUs.
These two situations are handled by two different processes, aptly named local and remote attestation.
Local attestation builds on a secret fused into the CPU.
An enclave can use an SGX instruction to generate a \textit{report} uniquely identifying the enclave, as it also contain the enclave signature.
This report is MACed with a key derived from the fused secret and enclave identifier.
It is not possible to use the create such a report for another enclave, as the report instruction is only available to software components that ensure that the report corresponds to the caller enclave.
The fused secret is not directly accessible by software components, but can only be used by certain hardware instructions that protects against malicious use~\cite{costan_intel_2016}.
An attestation challenger will ask for a MAC'ed report from the client enclave, and after receiving it derive the same key to verify the MAC.
Because a report can only be created by the enclave it describes, the challenger can be certain of which enclave the client is running if the MAC is valid.
To prevent replays the MAC'ed report is allowed to contain a block of arbitrary data, in which the challenger can require a nonce~\cite{costan_intel_2016}.
Remote attestation builds on the same concept of a report, but as challenger and client are now on different CPUs, they no longer hold the same fused secret.
Instead remote attestation relies on the group signature scheme Enhanced Privacy ID (EPID) and a third party issuer.
Each SGX enabled CPU is granted an EPID Member Private Key by the third party issuer some time after manufacturing.
Like the fused secret, this private key is not directly accessible to an enclave~\cite{costan_intel_2016}.
Under the EPID scheme, all issuer generated private keys share the same public key (EPID Group Public Key).
When remote attestation takes place, the client enclave generates a report and attests locally with the Quoting Enclave (QE).
The QE, now convinced of the client enclave's identity, strips the MAC off the report and instead signs it with the EPID Member Private Key, which the QE has special privileges to access\footnote{Enclaves signed by Intel have special privileges throughout SGX. This allows SGX to implement complicated SGX instructions in software.}.
The remote challenger can verify the signed report with the EPID Group Public Key and be sure that \textbf{(1)} the report is signed by an EPID Member Private Key, \textbf{(2)} EPID Member Private Keys are granted secret to QEs, and \textbf{(3)} QEs will not sign false reports~\cite{costan_intel_2016}.
\section{System model}
We consider a system of $n$ processes $P=\{p_1, p_2, \dots, p_n\}$, that communicate through message-passing channels.
The system is asynchronous, in the sense that we make no assumptions about the execution speed of processes, or delivery time of messages.
We assume a byzantine fault model, in that processes may fail by exhibiting arbitrary behaviour, except that byzantine faults cannot guess or simulate our cryptographic secrets.
Byzantine faults include, but are not limited to, crashes and arbitrary changes of local state.
Processes that never fail are \textit{correct}, and processes that fail are \textit{faulty}.
\section{Transforming byzantine faults to crashes}
\label{sec:transforming-byzantine-faults}
We conjecture that the correct use of Intel SGX can transform all byzantine faults in a distributed system with unreliable channels, such that correct processes in the protocol treat faulty processes as if they were crashed or the messages from them lost.
Thereby a crash-fault tolerant distributed protocol with unreliable channels, which exhibits undocumented behaviour under byzantine-faults (from here on in \textit{crash-resistant protocol}), can be transformed into a byzantine-fault tolerant protocol using Intel SGX and cryptography.
We assume that processes exhibiting byzantine faults will not be able to utilize the cryptographic secrets of a correct process.
\begin{figure}
\center
\includegraphics[scale=0.6]{figures/state-machines/simple-NFA.pdf}
\caption{A simple example of our finite state machine model\label{fig:simple-nfa}}
\end{figure}
\FloatBarrier
To show the transformation, we will model an arbitrary distributed protocol as consisting of \textit{finite state machines}, \textit{channels} and \textit{processes}.
A finite state machine consists of a current state, $s_{current}$, a finite set of $n$ distinct states, $S=\{s_1, s_2, \dots, s_n\}$, and a finite set of $m$ distinct state-transitions, $T$.
A finite state machine can only be in one state at any time, and transitions to another state by accepting a bit-vector as input, and optionally outputs another bit-vector.
A transition has the form $(s_{from}, i, o, s_{to})$, where $i$ is an input consisting of a pair of a bit-vector and the channel the bit-vector was received on, and $o$ is the output consisting of a set of bit-vector/channel pairs, each representing that bit-vector being sent on the appertaining channel.
For example, the finite state machine in Figure~\ref{fig:simple-nfa} will transition from the starting state $s_0$ to $s_2$ on input $"010"$ from channel $c_0$, and will during the transition output $"101"$ to channel $c_1$ and $"100"$ to $c_2$.
A channel is bi-directional and can either be reliable or unreliable.
A reliable channel guarantees that messages sent on it will be delivered in the order the messages were sent, and will not be changed during transmission.
An unreliable channel gives no such guarantees, and can thus experience loss, reordering and corruption of messages.
The last component in our model is the process.
A process is a set of running state machines.
The state machines inside the process can be connected with reliable channels.
State machines can be connected to state machines in other processes only with unreliable channels.
\begin{figure}
\center
\includegraphics[scale=0.6]{figures/state-machines/Distributed-protocol-model.pdf}
\caption{A simple example of distributed protocol model.\label{fig:simple-model}}
\end{figure}
\FloatBarrier
Figure~\ref{fig:simple-model} is a model of a simple distributed protocol, consisting of the processes $P_0$ and $P_1$, each of which have three state machines $SM_0 - SM_5$.
Each state machine is connected to other state machines on the same process with reliable channels $c_0$--$c_3$, and $SM_1$ on $P_0$ is connected to $SM_3$ on $P_1$ by the unreliable channel $c_4$.
A distributed protocol can experience a number of different errors.
Apart from the channel faults already described, we will model crashes and byzantine faults, as these are the ones we are concerned about in this transformation.
We model a crash fault in the state machine by adding an empty input transition ($\varepsilon$) from every state to an error-state, which cannot transition to any state (see $s_{error}$ in Figure~\ref{fig:simple-nfa}).
As the state machine will not be able to transition to another state, and thus cannot output anything from this state, this serves to model a component crashing in a distributed protocol.
We model a byzantine fault by exchanging a process with a new process.
This potentially includes the removal of state machines, the addition of state machines or the exchange of old state machine with new state machines with new current states and new channels (we will refer to this last case as \textit{corrupted} state machines).
This models a process exhibiting arbitrary behaviour.
To transform a crash-resistant distributed protocol into a byzantine-fault-tolerant protocol, we utilize the verification in Remote Attestation (see Section~\ref{subsec:attestation}) and the integrity guarantees provided by Intel SGX.
To model the integrity and confidentiality guarantees, we introduce the notion of a cryptographic secret state machine $SM_{s}$, and an integrity protected area of the process.
We assume that any byzantine fault can replicate $SM_{s}$, but cannot replicate the secrets contained in $SM_s$, under a corruption of $SM_s$.
This models that faulty processes cannot access the cryptographic secrets of a correct process, which is a reasonable assumption by the integrity and confidentiality guarantees given by Intel SGX, if the secrets are stored in an enclave (see Section~\ref{subsec:enclave}).
State machines running in the integrity protected area of the process can be uniquely identified by $SM_{s}$, but cannot have channels to state machines running on other processes.
This models the integrity protection in Intel SGX and abilities of the Quoting Enclave (QE), as well as an enclave's inability to access peripheral hardware such as network cards.
To model the Remote Attestation verification, we introduce a Remote Attestation process, $P_{RA}$ and an Attestation state machine, $P_A$, which will communicate with $P_{RA}$ to verify and provision $SM_s$ with a symmetric secret key.
We will not go into detail with, or model, the Remote Attestation protocol in this section, as the specifics of it are inconsequential with regards to this transformation.
For more information on the Remote Attestation protocol, see Section~\ref{subsec:attestation},~\cite{costan_intel_2016}, and~\cite{intel_sgx_developer_reference}.
We will also assume that each process has a unique identifier $id_p$, that $SM_s$ knows, and that each untransformed state machine includes both this $id_p$ and the $id_p$ of the intended receiver in its outgoing messages.
In the case of broadcast messages, the $id_p$ of the receiving processes can be omitted.
In practise, this could be implemented as a process's external IP-address that is provisioned to $SM_s$ by in a verifiable or trusted manner (for instance by $P_{RA}$ during secret provisioning), or by a sufficiently large number that is randomly selected on process start up.
Both of these solutions require a handshake protocol during channel establishment, in which $SM_s$ on each process must MAC their $id_p$ with the Remote Attestation provisioned secret for verification by the other process.
We will not describe this handshake protocol.
\FloatBarrier
\begin{figure}
\center
\includegraphics[scale=0.40]{figures/state-machines/distributed-protocol-model-transformation0.pdf}
\includegraphics[scale=0.40]{figures/state-machines/distributed-protocol-model-transformation1.pdf}
\includegraphics[scale=0.40]{figures/state-machines/distributed-protocol-model-transformation2.pdf}
\includegraphics[scale=0.40]{figures/state-machines/distributed-protocol-model-transformation3.pdf}
\includegraphics[scale=0.40]{figures/state-machines/distributed-protocol-model-transformation4.pdf}
\caption{Transformation of Figure~\ref{fig:simple-model} from a crash-resistant protocol to a byzantine fault-tolerant protocol.}
\label{fig:tranformation-model}
\end{figure}
\FloatBarrier
The transformation of a crash-resistant protocol to a byzantine-fault-tolerant protocol consists of four steps on the process-level, as shown in Figure~\ref{fig:tranformation-model}.
The steps are as follows:
\begin{enumerate}
\item Add all the state machines to the integrity protected area of the process.
This is equivalent to making the distributed components in the crash-resistant protocol into enclaves.
Please note that it is rarely necessary to guarantee integrity for all components.
Instead the minimal necessary number of components, the TCB, should be identified and protected.
For the general case, however, every component is part of the TCB.
\item For each endpoint state machine in unreliable channels between processes, add a new state machine in the unprotected area of each of the processes (called a wrapper state machine), connect a new unreliable channel between these and add a reliable channel from the old endpoint state machines to the new.
This is equivalent to adding unprotected wrapper components for the enclaves, which will have access to the network, and thus can act as a middle-man for communicating with other processes.
\item Add reliable channels from all state machines in the integrity protected area of the process to the cryptographic secret state machine $SM_s$.
\item Add an Attestation state machine $SM_A$ to each process and a Remote Attestation process $P_{RA}$ to the protocol.
\end{enumerate}
\FloatBarrier
\begin{figure}[ht]
\center
\makebox[\textwidth][c]{\includegraphics[scale=0.6]{figures/state-machines/NFA-transformation.pdf}}
\caption{Transformation of Figure~\ref{fig:simple-nfa} from state machine in a crash-resistant protocol to integrity protected state machine in a byzantine fault-tolerant protocol, if all channels are to other processes.}
\label{fig:nfa-transformation}
\end{figure}
\FloatBarrier
The transformation of the state machines consists of five steps.
An application of this, on the example from Figure~\ref{fig:simple-nfa}, can be seen in Figure~\ref{fig:nfa-transformation}.
The five steps are as follows:
\begin{enumerate}
\item Before a process runs the protocol, a pre-compute step must be added where each process's $SM_s$ is provisioned with the same cryptographic secret.
This is handled by the $SM_s$ and $SM_A$, which will contact $P_{RA}$ and be attested and securely provisioned.
In our example, we have only added a single $P_{RA}$, which of course means that a \textit{k}-resilient protocol suddenly cannot handle a single $P_{RA}$ crash.
This can be solved by adding \textit{k+1} identical $P_{RA}$s, as this is a pre-compute step.
This means that the $P_{RA}$s play no further role after this step in the protocol, and can crash without affecting the running processes.
\item All messages that were to be sent in the original protocol, must be sent from a state machine in the integrity protected area though a state machine in the unprotected area of the process.
This is due to the constraint that the state machines in the integrity protected area of the process can no longer have channels to other processes.
\item Before any message intended for another process is sent from the integrity protected area, it must have a cryptographic MAC appended to the message.
The MAC is an output of the message and the key provisioned with Remote Attestation.
$SM_s$ is the only state machine with the Remote Attestation provisioned key, and thus only $SM_s$ can MAC a message correctly.
$SM_s$ will refuse to sign any message if any state machine in the integrity protected area is not the original.
This translates to a transformation of the state machine, such that each transition $(s_{from}, i, \{(m_o, c_o)\}, s_{to})$, where $c_o$ is a channel that was previously to another process, must be transformed to the transitions $(s_{from}, i, \{(m_o, c_{SM_s})\}, s_{wait})$ and $(s_{wait}, (m_o+MAC, c_{SM_s}), \{(m_o+MAC, c_o)\}, s_{to})$ with the intermediate state $s_{wait}$, which acts as a state in which the state machine waits for $SM_s$ to MAC the output message.
\item Whenever a message that supposedly originated from another process is received from the unprotected area by a state machine in the integrity protected area, the MAC of that message must be verified by $SM_s$.
In case this verification fails, the message must be discarded.
This translates to a transformation of the state machine, such that each transition $(s_{from}, (m_i, c_i), o, s_{to})$, where $c_i$ is a channel that was previously to another process, must be transformed to the transitions $(s_{from}, (m_i+MAC, c_i), \{(m_i+MAC, c_{SM_s})\}, s_{wait})$, $(s_{wait}, ("0", c_{SM_s}), \{\}, s_{from})$ and $(s_{wait}, ("1", c_{SM_s}), o, s_{to})$ with the intermediate state $s_{wait}$, which acts as a state in which the state machine waits for $SM_s$ to verify the MAC of the input, and the messages $"0"$ and $"1"$ represents the failure and success of this verification, respectively.
\item In the event that $SM_s$ is asked to sign a message from a state machine that it identifies is not the original state machines in the integrity protected area, it must discard the message.
\end{enumerate}
Notice that this transformation is linear in the number of messages, under the assumption that the MAC, and MAC-verification sub-routines in $SM_s$ do not add any messages.
Each input message in the transitions is transformed to exactly one new state, three new state transitions and one MAC-verification sub-routine on $SM_s$.
Each output-set is transformed to exactly one new state, two new state transitions and $n$ MAC sub-routines on $SM_s$ where $n$ is the size of the output set.
\begin{table}[ht]
\centering
\makebox[\textwidth][c]{\begin{tabular}{r|l}
\textbf{Msgs.} & \textbf{Origin} \\ \hline
$m$ & MAC-request outputs to $SM_s$ \\ \hline
$m$ & MAC-response inputs from $SM_s$ \\ \hline
$m$ & sending to the wrapper state machine on the sending process \\ \hline
$m$ & sending between processes \\ \hline
$m$ & sending from the wrapper state machine on the receiving process \\ \hline
$m$ & input MAC verification requests to $SM_s$ \\ \hline
$m$ & verification responses from $SM_s$ \\ \hline\hline
$7\cdot m$ & total messages
\end{tabular}}
\caption{Number of messages after the transformation.}
\label{tab:number-of-messages-transformed}
\end{table}
\FloatBarrier
Under the assumption that $SM_s$ sends no additional messages during the MAC and MAC-verification sub-routines, and no failures occur, $m$ messages in the original state machine will become $7m$ messages after the transformation (see table~\ref{tab:number-of-messages-transformed}).
If we only count the messages sent between processes, there is no change in the number of messages sent.
These transformations gives us the following guarantees, relevant if a process experiences a byzantine fault.
Recall that we model a byzantine fault as a process being exchanged with a new process, potentially with removed, added and/or corrupted state machines.
\begin{itemize}
\item If a process has its transformed state machine removed, the process will exhibit exactly the same behaviour as a crash (the state machine can take no further inputs, and produce no further outputs), and so it is exactly equivalent to a crash of the state machine.
\item If the wrapper state machines are removed, the transformed state machine can no longer send or receive messages.
This is again equivalent to a crash, as a token can no longer be sent to or from that state machine.
\item If the $SM_A$ is removed, it can either happen before or after the Remote Attestation process has completed:
\begin{itemize}
\item If it happens before, the process's $SM_S$ will never be provisioned with the Remotely Attestation secret, and no messages will be verified.
Therefore, the transformed state will never have its outgoing messages MAC'ed nor will it have any incoming messages verified.
As such, it will either get stuck in a state where it never gets a response from $SM_s$ of a message MAC request, or it will get stuck in a state where a specific message cannot be verified.
\item If it happens after the provisioning it will have no effect, as $SM_s$ is already provisioned, and no further communication with $SM_A$ is required.
\end{itemize}
\item If a process gets an additional state machine in the integrity protected area, it will have no consequence, as $SM_s$ will refuse MAC'ing and verification of messages from this state machine, and it cannot communicate with other state machines, unless these are corrupted.
\item If a process gets an additional state machine outside the integrity protected area, it cannot communicate with anything inside the integrity protected area, without corrupting these (see below).
It can communicate with other non-protected state machines, but this is equivalent to the other state machines being corrupted (see below).
\item If a process has a corrupted transformed state machine (in the integrity protected area), $SM_s$ will refuse to sign any messages.
So any messages sent from this corrupted state machine, will not be verified on correct processes, meaning that the transformed state machines receiving this message will not transition any further than the verification state.
\item If a process has a corrupted $SM_s$, the new $SM_s$ will not have access to the attestation secret.
Therefore, it cannot MAC or verify messages.
Nor can it be provisioned with the key again, as the Remote Attestation procedure ensures that $SM_s$ is the correct state machine.
\item If a process has a corrupted wrapper state machine, any messages from the integrity protected area can either be dropped, changed/corrupted, redirected, or duplicated:
\begin{itemize}
\item A dropped message is equivalent to a dropped message on an unreliable channel, which the protocol already handles.
\item A corrupted message will not be verified by a correct receiving process's $SM_s$, and the message will be discarded.
\item A redirected message will be not by verified by the receiving process, as its $id_p$ will not match the intended recipient, and the message will be discarded.
\item A duplicated message is equivalent to a duplicated message on the unreliable channel, which the protocol already handles.
\end{itemize}
\end{itemize}
This concludes the argument that Intel SGX and cryptography can change the behaviour of correct process under byzantine faults to behaviour equivalent to that of crash faults, and thereby, using the above transformation, make a crash-resistant protocol into a byzantine fault tolerant protocol.
As we will apply this transformation to our solution, we will from here on in only consider a fault model of crashing processes.
\subsection{Example: central server mutual exclusion}
We will now show a practical example of how the transformation in Section~\ref{sec:transforming-byzantine-faults} can be applied.
The protocol that we will transform is the central server protocol for mutual exclusion (CSME).
The central server protocol for mutual exclusion is a protocol which partly solves the \textit{distributed mutual exclusion} problem.
In the distributed mutual exclusion (ME) problem, a collection of processes share one or more resources (referred to as the \textit{critical section}), and need to perform reads/writes on these resources.
To prevent race-conditions, a mutual exclusion algorithm ensures that only a single process has access to the critical section at any given time.
More formally, in any solution to the distributed mutual exclusion problem, the following requirements must be upheld:
\begin{description}
\item[ME1: Safety] At most one process may execute in the critical section (CS) at a time.
\item[ME2: Liveness] Requests to enter and exit the critical section eventually succeed.
\end{description}
Optionally, some distributed mutual exclusion protocols also solves the additional fairness requirement of
\begin{description}
\item[ME3: Happened-before ordering] If one request to enter the CS happened-before another, then entry to the CS is granted in that order.
\end{description}
CSME solves ME1 and ME2 under the conditions of no process failures.
However, CSME does not fulfil the requirements of ME3 in asynchronous systems, and thus only partly solves the distributed mutual exclusion problem.
In broad strokes, the protocol works by deploying a central server process, which will process a request for access to the critical section in the received order, send an access token to the appropriate client process, wait for an acknowledgement from the process that access to the critical section is no longer needed, and then process the next request.
It is trivial that the protocol does not provide liveness under process crashes | if the server crashes, no process can gain access tokens, and thus no requests for access will succeed.
The same scenario is true if a process crashes while having access to the critical section, as the server will never release the token of the crashed process.
However, safety is still guaranteed, as a process cannot access the critical section without a token.
Under byzantine faults, neither safety nor liveness can be guaranteed.
For instance, the server could serve the tokens to all requests without waiting for acknowledgements from the clients.
We will now show how the transformation from Section~\ref{sec:transforming-byzantine-faults}, can transform CSME to give the same guarantees in a byzantine system as the untransformed protocol gives in a crash-fault system.
Note that there is no formal definition of how the critical section access and token is implemented in CSME | it can be implemented with cryptography, as partial states, with append-only memory, as another process, etc.
We will keep this abstraction in the following transformation, but the implementation of this is also required to undergo the same transformation.
We will omit the handling of unreliable channels on the state machines for readability, but assume that it is handled.
First we need to model the protocol as finite state machines, channels and processes.
We model two different state machines: a client state machine, and a server state machine.
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/state-machines/CSME-client-NFA.pdf}
\caption{Client state machine from the central server protocol for mutual exclusion. Note that the channel $c_s$ represents the unreliable channel to the server, and is different across client instances.}
\label{fig:CSME-client-NFA}
\end{figure}
\FloatBarrier
The client state machine is modelled in Figure~\ref{fig:CSME-client-NFA}.
It can either not have requested access to the critical section ($s_0$), wait for the token from the server ($s_1$), or have the token and be in the critical section ($s_2$).
Naturally it can also crash ($s_{error}$).
\begin{figure}[ht]
\center
\includegraphics[width=\textwidth]{figures/state-machines/CSME-server-NFA.pdf}
\caption{Server state machine from the central server protocol for mutual exclusion. Note that this server state machine can only handle two client processes.}
\label{fig:CSME-server-NFA}
\end{figure}
\FloatBarrier
The server state machine is a bit more complex as it must be modelled to account for the number of client processes.
Figure~\ref{fig:CSME-server-NFA} is a model of the server state machine in a protocol with two client processes.
It encompasses 6 states:
\begin{description}
\item[$s_0$] where no client has access to the critical section, and no requests have been sent.
\item[$s_1$] where no client has access to the critical section and client $1$ has requested the token and the token has been sent.
\item[$s_2$] where no client has access to the critical section and client $2$ has requested the token and the token has been sent.
\item[$s_3$] where client $1$ has the token and client $2$ has requested the token.
\item[$s_4$] where client $2$ has the token and client $1$ has requested the token.
\item[$s_{error}$] where the state machine has crashed.
\end{description}
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/state-machines/CSME-protocol.pdf}
\caption{Processes and channels in the central server protocol for mutual exclusion, with one server and two clients.}
\label{fig:CSME-protocol}
\end{figure}
\FloatBarrier
Figure~\ref{fig:CSME-protocol} shows the processes and channels in CSME with two client processes.
$P_1$ and $P_2$ are the clients, each running an instance of the state machine from Figure~\ref{fig:CSME-client-NFA}, with an unreliable channel to the server process $P_0$, which runs an instance of the state machine from Figure~\ref{fig:CSME-server-NFA}.
We start the transformation by transforming the state machines to have their messages MAC'ed by $SM_s$, and to have $SM_s$ verify the messages they receive.
This is done by applying the state machine replication transformation described in Section~\ref{sec:transforming-byzantine-faults}.
\begin{figure}[ht]
\center
\includegraphics[width=\textwidth]{figures/state-machines/CSME-client-NFA-transformed.pdf}
\caption{Client state machine in CSME, after it has been transformed to handle byzantine faults.}
\label{fig:CSME-client-NFA-transformed}
\end{figure}
\FloatBarrier
Figure~\ref{fig:CSME-client-NFA-transformed} shows the client state machine after the transformation.
Whenever the client wants to enter the critical section, the state machine sends the request to $SM_s$ for MAC'ing.
When the MAC'ed message is received from $SM_s$, this message is sent to the server process.
The state machine is now in $s_1$, which is equivalent to $s_1$ in the untransformed state machine (Figure~\ref{fig:CSME-client-NFA}).
$c_s$ is now a channel from the wrapper state machine.
When a token is received from $c_s$, the message's MAC is sent to $SM_s$ for verification.
If the MAC is correct, the state machine will transition to $s_2$, which is in the critical section, and is equivalent to $s_2$ in the untransformed state machine.
Before exiting the critical section, the exit message is sent to $SM_s$ for MAC'ing, and when the appropriate MAC is received from $SM_s$, the exit message is sent to the wrapper state machine for redirection to the server process.
\begin{figure}[ht]
\center
\makebox[\textwidth][c]{\includegraphics[width=\textwidth]{figures/state-machines/CSME-server-NFA-transformed.pdf}}
\caption{Server state machine in the central server protocol for mutual exclusion, after it has been transformed to handle byzantine errors. Note that $s_{error}$ has been omitted for improved readability.}
\label{fig:CSME-server-NFA-transformed}
\end{figure}
\FloatBarrier
Figure~\ref{fig:CSME-server-NFA-transformed} shows the transformed server state machine.
We have omitted $s_{error}$ for readability.
$s_0$, $s_1$, $s_2$, $s_3$ and $s_4$ are equivalent to the states by the same names in the untransformed state machine (Figure~\ref{fig:CSME-server-NFA}).
\begin{description}
\item[$s_0$] the token has been released, and no client has requested the token.
\item[$s_1$] client $1$ has requested the token, $SM_s$ has verified the MAC of the request and MAC'ed the token which has then been sent to the servers wrapper state machine for redirection to the client $1$ process.
If the token is released by client $1$, and this message's MAC is verified, the state machine will transition to $s_0$.
\item[$s_2$] is equivalent to $s_1$, except with client $2$ instead of client $1$.
\item[$s_3$] client $1$ has the token, and client $2$ has requested the token.
So when the server state machine receives an "exit" message verifiably from client $1$, a token is MAC'ed and sent to client $2$, so the state machine can transition to $s_2$.
\item[$s_4$] is equivalent to $s_3$, except with the clients reversed (client $2$ has the token, client $1$ has requested it, when client $2$ releases the token).
\end{description}
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/state-machines/CSME-protocol-transformed.pdf}
\caption{Processes and channels in the central server protocol for mutual exclusion, after they have been transformed to handle byzantine errors.}
\label{fig:CSME-protocol-transformed}
\end{figure}
\FloatBarrier
Figure~\ref{fig:CSME-protocol-transformed} show the process's transformation.
Each of the client processes ($P_1$ and $P_2$), is now running the transformed client state machines in the integrity protected area.
The transformed client state machines have a reliable channel to a $SM_s$, and another to a wrapper state machine outside the integrity protected area, which is responsible for passing on the messages to the server process.
Each of the processes have a $SM_A$ responsible for passing Remote Attestation messages from the $SM_s$ to the new Remote Attestation process.
The server process is running a transformed server state machine, connected to two wrapper state machines and an $SM_s$.
The server $SM_s$ is also connected to a $SM_A$ for attestation.
Having transformed the CSME protocol, the correct processes will exhibit the same behaviour when other processes suffer byzantine faults, as the correct processes do in the untransformed protocol when other processes suffer crash faults.
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/state-machines/CSME-protocol-transformed-byzantine.pdf}
\caption{Processes and channels in the transformed central server protocol for mutual exclusion, with a byzantine fault on the server process.}
\label{fig:CSME-protocol-transformed-byzantine}
\end{figure}
\FloatBarrier
As an example, consider a situation where the server state machines tries to serve tokens on any request, without getting acknowledgement that the token has been released by the client who last held the token.
There are several ways this could be modelled, one of which is presented here:
We model this byzantine fault as $P_0$ being exchanged with $P_{byz}$.
$P_{byz}$ (see Figure~\ref{fig:CSME-protocol-transformed-byzantine}) is exactly equal to $P_0$, except that $SM_0$ has been exchanged with $SM_{byz}$ (see Figure~\ref{fig:CSME-server-NFA-transformed-byzantine}), which serves a token to any request by the clients.
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/state-machines/CSME-server-NFA-transformed-byzantine.pdf}
\caption{The byzantine fault state machine $SM_{byz}$ on the server process.\label{fig:CSME-server-NFA-transformed-byzantine}}
\end{figure}
\FloatBarrier
If this byzantine fault happens, none of the tokens are MAC'ed.
Thereby, any correct client process (Figure~\ref{fig:CSME-client-NFA-transformed}) will get stuck in a loop between $s_1$ and $s_1'$, when $SM_s$ rejects the token as it has not been MAC'ed.
This is equivalent behaviour to the server process crashing, as a token will never be accepted, and the client is unable to enter the critical section.
\section{Analysis}
\label{sec:analysis}
With a better understanding of the prerequisite information, we now present a more thorough definition of the problem.
We define our system as having $n$ peers $\{p_1, p_2 \dots p_n\}$ and a DCR graph $G$, which consists of $v$ events $E = \{e_1, e_2, \dots, e_v\}$ and a set of relations between events, $R$.
Each peer is said to be \textit{responsible} for one or more events, where being responsible for an event $e$ means that requests for executions of $e$ and requests for the state of $e$ can be handled by that peer.
Furthermore the system must have the following properties:
\begin{description}
\item[Safety] all peers responsible for $e$ execute the requests on $e$ in the same order.
\item[Liveness] The client of a request eventually receives a reply to that request.
\item[DCR linearisability] At any time before and after event executions, the states of $e_1, e_2, \dots, e_v$ must be the result of at least one run $R_G$ of $G$. Recall that $R_G$ only contains all successful executions.
\end{description}
Recall that we wish to create a system which is:
\begin{description}
\item[Concurrent] Simultaneous execution of concurrent events, as defined in Section~\ref{subsubsec:concurrency}, is performed without requiring that those executions are synchronised in the system.
\item[Scalable] Achieving consensus is done in less than $O(n)$ messages, where $n$ is the number of peers in the graph, while supporting the distribution of the state of the DCR graph among the participants.
\item[Byzantine fault tolerant] Byzantine behaviour is detectable, through the guarantees provided by Intel SGX, and treated as crashes.
\end{description}
Additionally, any peer should be able to collect the run, $R$, which is compatible with all prior collected runs, meaning that a run collected at logical time $t$ must be prefixed by a run which is equivalent up to concurrency to a run collected at time $t-1$.
Note that from this, it follows that two runs collected at time $t$, are not required to be identical, but are required to be equivalent up to concurrency, as defined in~\ref{subsubsec:concurrency}.
Since the peers responsible for each event can always be required to store executions which it has \textit{observed} and include it alongside the state of the event, we assume that collecting a run will always be possible, as long as maintaining the state of the graph is and therefore disregard it at this point in the analysis, returning to it in Section~\ref{subsec:global-history-collection}.
\begin{landscape}
\begin{table}
\centering
\makebox[\textwidth][c]{\begin{tabular}{ p{4cm} p{1.5cm} p{1.5cm} p{1.5cm} p{1.5cm} p{2cm} p{2.3cm} p{2cm} }
\toprule
\textbf{Approach} & \textbf{FT: Availability} & \textbf{FT: Safety} & \textbf{FT: Data integrity} & \textbf{Crash recovery} & \textbf{Minimum complexity for executing $e$} & \textbf{Concurrency} & \textbf{Request Bottleneck}\\
\midrule
Central Server & $0$ & $n$ & $0$ & n/a & $2$ & $\leq$ Dynamic independence & $1$ \\
State Machine Replication & $\lfloor \frac{(n-1)}{2} \rfloor$ & $n$ & $n-1$ & $O(n)$ & $2\cdot\lfloor\frac{(n-1)}{2}\rfloor$ & $\leq$ Dynamic independence & $1$ \\
Partial state, single-leader & $\lfloor \frac{(m-1)}{2} \rfloor^*$ & $n$ & $m-1$ & $O(n^2)$ & $2\cdot\lfloor\frac{(m-1)}{2}\rfloor$ & Dynamic independence & $1$ \\
Complete distribution & $0^*$ & $n$ & $0$ & n/a & $2\cdot D_{dyn}(G,e)$ & Dynamic independence & $n$ \\
Partial state, multi-leader & $\lfloor \frac{(m-1)}{2} \rfloor^*$ & $n$ & $m-1$ & $O(m)$ & $2\cdot\lfloor\frac{(m-1)}{2}\rfloor$ & Dynamic independence & $\frac{n}{m}$ \\
\bottomrule
\end{tabular}}
\caption[caption]{Comparison of different approaches to solving the problem. $^*$ denotes partial availability loss. \textbf{FT: Availability} is the fault tolerance of the availability guarantee. \textbf{FT: Safety} is the fault tolerance of the safety guarantee. \textbf{FT: Data integrity} is the fault tolerance of data integrity. \textbf{Crash recovery} is the number of messages required to recover from a crashing process. \textbf{Minimum complexity for executing $e$} is the minimum number of messages needed for the execution of an event. \textbf{Concurrency} is the supported level of DCR concurrency. \textbf{Request Bottleneck} is how many processes' can handle execution requests at one time.}
\label{tab:approach-comparison}
\end{table}
\end{landscape}
We consider several possible approaches to solving this problem, the performance of which can be seen in Table~\ref{tab:approach-comparison}.
The following is a short summary of each approach starting at the most intuitive, followed by iterative improvements of it.
\begin{figure}[ht]
\center
\includegraphics[scale=0.7]{figures/dcr-graphs/central-server-approach}
\caption{The central server approach.
The messages are ordered by the leader, $L$, such that the execution of $A$ is applied before the execution of $B$.
Note that the executions are numbered in the logical order which they are sent.
Peer $p_1$ is responsible for event $A$ along with $p_2$.
Peer $p_3$ is responsible for event $B$ along with $p_3$.}
\label{fig:central-server-approach}
\end{figure}
\FloatBarrier
\paragraph{Central server approach}
The first approach is a simple central server.
An example of the simultaneous execution of two events, $A$ and $B$, is shown in Figure~\ref{fig:central-server-approach}.
There exists a single leader $L$. Each of the other peers replicate some subset of $E$ and can redirect requests to the leader.
When the leader receives a request, it is queued in the order they are received and applied to $G$ in that order.
After the execution of an event $e$, each of the peers replicating $e$ or an event in $E_{aff}(G,e)$ are informed by $L$ of that execution, with a counter value that enables ordering of the execution of the events.
This ensures safety, since each of the peers will apply the executions in the same order as $L$.
Since every request is managed by a single leader, we are automatically given DCR linearisability, since the executions are totally ordered.
However, this solution does not ensure concurrency, since $L$ needs to apply each previous execution before the next.
If concurrency is desired, $L$ can execute requests in parallel, while keeping track of current executions.
When an execution request is received of an event that is dynamically dependent on one of the currently executing events, it is queued, and so is all other requests that are not in the approximate static independence set of all the currently executing and queued events.
Furthermore, there is no fault tolerance because under a leader crash, there is no guarantee that the remaining peers can recreate the most recent state.
\begin{figure}[ht]
\center
\includegraphics[scale=0.7]{figures/dcr-graphs/smr-approach.pdf}
\caption{The SMR approach.
Following the same semantics as in Figure~\ref{fig:central-server-approach}, with the exception that each peer is responsible for the entire graph.}
\label{fig:smr-approach}
\end{figure}
\FloatBarrier
\paragraph{SMR approach}
The next approach tries to improve on the fault tolerance of the central server approach.
An example of the simultaneous execution of two events, $A$ and $B$, is shown in Figure~\ref{fig:smr-approach}.
This is a classic SMR/consensus solution with crash fault tolerance.
There are several protocols that solve the issue of SMR, for instance in~\cite{bracha_asynchronous_1985,lamport_part-time_1998,lamport_lower_2006,ongaro_search_2014}, but as they are largely interchangeable with regards to performance\footnote{In the categories specified in the Table~\ref{tab:approach-comparison}.}, any solution that utilizes a leader, $L$, to achieve an optimal crash fault tolerance of $f = \lfloor \frac{(n-1)}{2} \rfloor$ is enough.
In this approach, every peer must replicate all events, as all executions must be committed to at least $f+1$ replicas.
This ensures that if a leader fails, a new leader can be chosen (using $O(n)$ messages), which gives us the aforementioned fault tolerance.
However, now $L$ must synchronise all requests, to ensure that they are applied in the same order by all replicas, which translates to no concurrency in executions.
Some SMR protocols (e.g.~\cite{castro_practical_1999}) allow for concurrent request commits, under the constraint that requests that are committed must not be applied to the underlying state machine before any uncommitted request that happened-before~\cite{lamport_time_1978} on $L$.
In order to allow for the concurrent execution of dynamically independent events, this constraint could be relaxed to allow for applying the execution of $e$ when the request has committed, if the peer is aware of which events is in the previous uncommitted execution requests, and none of these events are in $D_{dyn}(e)$.
Because the peer needs to know the previous uncommitted requests, this results in worse concurrency than dynamic independence.
Furthermore, the solution still needs $O(n)$ messages to execute a single event, no matter how the graph is constructed.
\begin{figure}[ht]
\center
\includegraphics[scale=0.7]{figures/dcr-graphs/partial-state-single-leader-approach.pdf}
\caption{The partial state, single leader approach.
There are two clusters, one for event $A$ containing $p_1$ and $p_2$, and one for event $B$ containing $p_3$ and $p_4$.
The leader is currently $p_2$.}
\label{fig:partial-state-single-leader-approach}
\end{figure}
\FloatBarrier
\paragraph{Partial state, single leader approach}
The next approach is an attempt to reduce the amount of messages needed for executing an event.
An example of the simultaneous execution of two events, $A$ and $B$, is shown in Figure~\ref{fig:partial-state-single-leader-approach}.
The basic idea is to distribute the event responsibility out to a subset of peers, which will then only be responsible for that event.
We name these subsets \textit{clusters} and denote the size of these clusters $m$.
The system still utilizes a single leader, which synchronises the requests.
When event $e$ is requested executed, the leader runs an SMR algorithm similar to the one described in the SMR approach, but only with the cluster for $e$, and the clusters of $E_{aff}(G,e)$ | to ensure that the clusters of $E_{aff}(G,e)$ has updated the state/enabledness of the events they are responsible for.
This ensures a lower message complexity than the SMR approach, given $n > m \cdot (|E_{aff}(G,e)|+1)$ where $n$ is the number of peers in the system.
Since $n = m \cdot v$ and $\forall e \in G. v \geq (|E_{aff}(G,e)|+1))$ for any graph, $G$, it must hold that $m\cdot v \geq m \cdot (|E_{aff}(G,e)|+1))$.
For a sufficiently loosely connected graph, $G$, where no event's execution affects all other events in the graph, it must hold $\forall e \in G. n > m \cdot (|E_{aff}(G,e)|+1))$.
Another improvement is the fact that if more than half of a cluster crashes, only part of the availability is affected, since the statically independent clusters can continue operating normally.
However, $L$ must still synchronise all requests, and worse, cannot allow for concurrent request commits, since the peers would not necessarily be able to recreate the synchronisation between these.
The one exception is dynamically independent events, since the ordering of the execution of these does not matter.
Under leader crashes, all clusters need to synchronise their logs, which would require $O(n^2)$ messages.
\begin{figure}[ht]
\center
\includegraphics[scale=0.7]{figures/dcr-graphs/complete-distribution-approach.pdf}
\caption{The complete distribution approach.
Note that the executions are no longer numbered, as synchronisation is done using locks.}
\label{fig:complete-distribution-approach}
\end{figure}
\FloatBarrier
\paragraph{Complete distribution approach}
The next approach is heavily inspired by~\cite{hildebrandt_safe_2011}.
An example showing the execution of an event is shown in Figure~\ref{fig:complete-distribution-approach}.
Here $n = v$ and each peer is only responsible for a single event.
There is no leader, and an execution is done through synchronisation between peers.
If event $e$ is to be executed, the responsible peer will use a synchronisation technique (e.g. two phase locking~\cite{skeen_nonblocking_1981}) with all the peers responsible for the events in $E_{aff}(G,e)$, who will update their state accordingly.
This guarantees DCR linearisability.
The fault tolerance of this approach is quite poor, as any peer crashing will result in a data loss, and effect availability, even though it has the same property as the partial state, single leader approach of not effecting the events in $I_{stat}(e)$.
\begin{figure}[ht]
\center
\includegraphics[scale=0.6]{figures/dcr-graphs/partial-state-multi-leader-approach.pdf}
\caption{The partial state, multi leader approach.
Note that this example is in a graph where events $A$ and $B$ are not dynamically independent, meaning that synchronisation is required.
Furthermore, locking is not the only way to achieve synchronisation and it is only used as an example here. }
\label{fig:partial-state-multi-leader-approach}
\end{figure}
\FloatBarrier
\paragraph{Partial state, multi-leader approach}
The last is our chosen approach, and we will go into greater detail from Section~\ref{subsec:algorithm} onwards.
Figure~\ref{fig:partial-state-multi-leader-approach} shows an example of event execution.
It builds upon the partial state, single leader approach, except that the common leader is replaced with inter-cluster synchronisation using locks.
This has the benefit that even if a leader dies, the recovery has message complexity $O(m)$ (instead of $O(n^2)$ in the partial state, single leader approach and $O(n)$ in the SMR approach), and the number of concurrent client requests is not bounded by the processing speed of a single process.
From here on in, we will primarily concern ourselves with the details of the partial state, multi-leader approach.
\subsection{Algorithm}
\label{subsec:algorithm}
To avoid the high number of messages associated with a fully replicated system and the message bottleneck associated with a single-leader system, we propose a multi-leader, partial state system.
Each event is represented by an individual leader, who can initiate execution of that event at any given time.
Since events can affect each other and this system is in an asynchronous setting, where messages can be delayed arbitrarily, having multiple leaders can potentially lead to situations where leaders disagree on which order events have been executed in, possibly resulting in disagreement on the state of the workflow.
The semantics for discerning when ordering is needed and how that ordering is applied is described in Section~\ref{subsec:execution-ordering}.
With a method of ordering executions between clusters peers can progress the graph, in the sense that events can be executed safely.
But since each peer only knows the state of the event which it is responsible for, what has transpired in the graph is implicitly distributed.
Since the algorithm does at this point guarantees that any event can only execute, when the semantics and the state of the graph permits it, collecting the global history is only interesting from a documentation perspective.
Suppose that a peer is responsible for an event which is not affected by any successful executions.
Such a peer will have no way of knowing if anything in the graph has been executed at all.