Distributed System:
- A collection of independent computers that appears to its users as
  one computer, with 3 characteristics:
  - the computers operate concurrently.
  - the computers fail independently.
  - the computers don't share a global clock:
    - the computation that each of these machines does is
      asynchronous with respect to the other machines in the system.
- Storage: Relational/Mongo, Cassandra, HDFS
- Computation: Hadoop, Spark, Storm
- Synchronization: NTP, vector clocks
- Consensus: Paxos, Zookeeper
- Messaging: Kafka
-----
Single-Master Storage -- to -> Read Replication
- Complexity
- Consistency [it takes some time for all replicas to converge to the same value]
- appropriate for read-heavy workloads.
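a minimal sketch of the read-replication idea (the class and names are hypothetical):
all writes go to the single master and are copied to the replicas, reads are spread
across replicas; in a real system the copy step is asynchronous, which is why reads
can briefly be stale.

    import random

    class ReadReplicatedStore:
        def __init__(self, num_replicas=2):
            self.master = {}
            self.replicas = [{} for _ in range(num_replicas)]

        def write(self, key, value):
            self.master[key] = value          # all writes go to the master
            for replica in self.replicas:     # ...and are copied to each replica
                replica[key] = value          # (synchronous here only for simplicity)

        def read(self, key):
            # reads are spread across replicas to offload the master
            return random.choice(self.replicas).get(key)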
---
Sharding:
- More complexity.
- Limited data model.
  - we need a common key, present in all the queries we issue, that we can shard on.
    if you cannot find one, then sharding is not a great choice.
- Limited data access patterns.
  - if you need to query without using the sharding key, you must go to all shards:
    scatter the query out to all of them, get the results, and gather them back together
    (scatter/gather - sketched below).
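a small sketch (hypothetical names) of both access patterns: a query that carries the
shard key touches exactly one shard, a query without it is scattered to every shard and
the partial results are gathered back. Python's built-in hash() stands in for a stable
hash function here.

    def shard_for(key, shards):
        # route by shard key: hash it and pick exactly one shard
        return shards[hash(key) % len(shards)]

    def query_by_shard_key(key, shards):
        # cheap path: only one shard is touched
        return shard_for(key, shards).get(key)

    def scatter_gather(predicate, shards):
        # expensive path: no shard key, so ask every shard and merge the answers
        results = []
        for shard in shards:                      # scatter
            results.extend(v for v in shard.values() if predicate(v))
        return results                            # gather

    # usage sketch: 4 shards as plain dicts
    shards = [{} for _ in range(4)]
    shard_for("user:42", shards)["user:42"] = {"name": "Ada", "country": "UK"}
    print(query_by_shard_key("user:42", shards))
    print(scatter_gather(lambda v: v.get("country") == "UK", shards))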
---
Consistent Hashing:
Amazon Dynamo
we're building our project for scale from the ground up <- consistent hashing is about that.
make a ring of nodes with a known order and a label for each one; when you insert or
retrieve data -> hash the key, and given the ring order you'll know which node to insert
into / retrieve from.
[the selected node is the one where the hash of the record is greater than the previous
node's label and not greater than its own, so each node only needs to know the label of the
previous one].
uniformly hashing the key guarantees that the data is going to be uniformly distributed
around the ring.
for replication we can write the data to the next nodes in the ring.
--> for scalability we can add more nodes to the ring (a ring sketch follows below).
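a minimal consistent-hashing ring sketch (hypothetical class, no virtual nodes): node
labels are positions on the ring, a key belongs to the first node clockwise from its hash,
and the next nodes clockwise hold the replicas; adding a node only moves the keys between
it and its predecessor.

    import bisect
    import hashlib

    class Ring:
        def __init__(self, nodes):
            # label each node with a position on the ring and keep the labels sorted
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def nodes_for(self, key, replicas=3):
            # the first node clockwise from the key's hash owns the key;
            # the next (replicas - 1) nodes clockwise hold the copies
            start = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1] for i in range(replicas)]

        def add_node(self, node):
            # scaling out is just one more label on the ring; only the keys between
            # the new node and its predecessor have to move
            bisect.insort(self.ring, (self._hash(node), node))

    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.nodes_for("user:42"))   # owner plus two replicas
    ring.add_node("node-e")            # add capacity without resharding everything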
how to deal with consistency?
R + W > N
R -> the number of nodes that need to agree on the data when we ask for it.
N -> the number of replicas we want to keep.
W -> the number of nodes that need to acknowledge the update when we update the data.
we make a trade-off between consistency and latency (see the sketch below).
entropy problem: when nodes don't agree on the data, when one replica is out of date with
respect to another, we say that the db has increased entropy. [solution: implementation specific]
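a tiny simulation (hypothetical helpers) of the R + W > N rule: with N=3, choosing W=2 and
R=2 means every read quorum intersects the last write quorum, so a read always sees at
least one copy of the newest acknowledged value; lowering R or W trades that guarantee
for latency.

    import random

    def write(replicas, w, key, value, version):
        # the write is acknowledged once W replicas have stored it
        for replica in random.sample(replicas, w):
            replica[key] = (version, value)

    def read(replicas, r, key):
        # ask R replicas and keep the newest version seen
        answers = [rep.get(key, (0, None)) for rep in random.sample(replicas, r)]
        return max(answers)   # (version, value) with the highest version wins

    # N=3 replicas; R=2, W=2 satisfies R + W > N, so the read quorum always
    # overlaps the last write quorum and returns the newest acknowledged value
    replicas = [{} for _ in range(3)]
    write(replicas, w=2, key="k", value="v1", version=1)
    print(read(replicas, r=2, key="k"))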
Traditionally in sharding, you have to decide early in the system's life, making a permanent
decision about how many shards you will ever have, and if you need to grow beyond that you've
got a lot of downtime; a major migration needs to be considered. in an architecture like
consistent hashing, the ability to scale is kind of baked in and really is designed in from
the ground up.
when to use:
- scale
- transactional data [heavy write workload across networks]
- always on
-----
CAP:
- Consistency: when you request to read data, you will get the latest write.
- Availability: no error returns.
- Partition Tolerance: when a partition occurs and is then healed, the system is able to
  carry on once the partition is healed.
  P is not negotiable in a distributed system. nodes will break, so we need to be
  prepared for it.
C in ACID: after commit, data will be in a valid state [referential integrity].
- most don't go with distributed transactions. they prefer to deal with failure with
  [write-off, retry, compensating actions] rather than rolling back the whole transaction,
  because this increases throughput and decreases latency (a retry/compensation sketch follows).
- splitting the work allows us to parallelize sub-tasks; with an uneven workload
  we can increase the number of nodes for the heavy tasks.
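a sketch of the "retry + compensating action" idea (the function names and the IOError
stand-in are hypothetical): each step commits locally, a failed later step is retried, and
if it keeps failing the earlier step is undone by a compensating action instead of a global
rollback.

    def transfer(debit, credit, compensate, retries=3):
        debit()                      # first step commits locally right away
        for _ in range(retries):
            try:
                credit()             # second step may hit a transient failure
                return "ok"
            except IOError:          # stand-in for whatever transient error the step raises
                continue             # retry the failed step
        compensate()                 # give up: compensate the debit instead of rolling back
        return "compensated"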
-----
Distributed computing:
- Scatter/Gather
  - the general paradigm we saw in distributed computation,
    where we scatter a computation out to lots of individual
    nodes, where it usually processes local data, and gather
    those results back together.
  - data being local is the key.
  - move computation close to the data.
- MapReduce
  - the key paradigm that Hadoop implements.
  - a computation strategy, not a piece of code.
- Hadoop
- Spark
- Storm
  - another distributed computation solution, but it's oriented
    around event processing instead of working on data persisted
    in some storage: realtime, very small computations based on
    incoming events.
- MapReduce
  what happens is map / shuffle / reduce,
  but we just give the system the mapper and the reducer.
  - word count example (sketched in code below):
    the Mapper takes a text file and
    produces a list of key/value pairs, then passes them to the shuffling stage,
    where Hadoop, for example, brings all occurrences of a common key from different nodes
    to one place for good data locality [one key, list of values].
    the Reducer takes the output of shuffling and does some kind of aggregation,
    probably producing a scalar.
  - shuffling can use hash partitioning, i.e. to get the values of a common
    key together:
    a key is assigned to reducer # = hash(key) % number of reducers
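a word-count sketch in plain Python (hypothetical function names, not Hadoop's API) of the
three stages described above: map emits (word, 1) pairs, shuffle groups values by key using
hash partitioning (reducer # = hash(key) % number of reducers), and reduce aggregates each
group into a scalar.

    from collections import defaultdict

    def map_phase(text):
        # mapper: text in, list of (key, value) pairs out
        return [(word, 1) for word in text.split()]

    def shuffle(pairs, num_reducers):
        # shuffle: hash-partition so every occurrence of a key lands on the same reducer
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
        return partitions

    def reduce_phase(partitions):
        # reducer: aggregate each key's list of values into a scalar
        return {key: sum(values) for part in partitions for key, values in part.items()}

    pairs = map_phase("to be or not to be")
    print(reduce_phase(shuffle(pairs, num_reducers=2)))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}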
---------
negative acknowledgments (NAKs)
The recipient sends it to the sender when the sent data packet isn’t
delivered properly. This can be due to different reasons like incomplete
or corrupted data, checksum failure, invalid headers, or wrong data formats.
multicast vs broadcast
- broadcast sends to all nodes in the network.
- multicast sends to a specific group of nodes.
- a group of nodes wants to inform all the other nodes.
- most of these are application-level protocols.
- simplest implementation (sketched below):
  loop over the list of nodes and send to each of them using TCP/UDP.
  - the overhead on the sender grows with the number of receivers.
  - if the sender fails, the remaining receivers will not receive the message.
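a sketch of that simplest implementation (the addresses are hypothetical): the sender just
loops over the member list and unicasts the same payload to each one over UDP, which is why
the sender's cost is proportional to the number of receivers and why its failure stops the
remaining deliveries.

    import socket

    def naive_multicast(members, payload):
        # members: list of (host, port) pairs; payload: bytes
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for addr in members:          # sender cost is O(number of receivers)
            sock.sendto(payload, addr)
        sock.close()

    # hypothetical group
    naive_multicast([("10.0.0.2", 9000), ("10.0.0.3", 9000)], b"hello group")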
- Tree-Based -> build a spanning tree.
  - IP multicast [network-level protocol]; application-level protocols: SRM, RMTP, TRAM, TMTP.
  - low latency, less overhead.
  - problems? how to set up and maintain the tree, and how to detect failures?
--
- build a spanning tree among the processes of the multicast group.
- use the spanning tree to disseminate ACKs or NAKs to repair multicasts that were not received.
- SRM (Scalable Reliable Multicast)
  - uses NAKs.
  - if a receiver does not get a message, it sends a repair NAK up toward the root,
  - but it adds random delays and uses exponential backoff to avoid a NAK storm
    (see the sketch below).
- RMTP (Reliable Multicast Transport Protocol)
  - uses ACKs - post-ACK: a receiver sends a collection of ACKs for all multicast msgs,
  - but ACKs are only sent to designated receivers, which then re-transmit missing multicasts.
- these protocols still cause an O(N) ACK/NAK overhead.
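a sketch of the SRM back-off idea (the callables send_nak and repaired are hypothetical
hooks, not part of any real SRM library): wait a random delay before NAK-ing, since someone
else's NAK or repair may arrive first, and back off exponentially between attempts so the
group doesn't generate a NAK storm.

    import random
    import time

    def nak_with_backoff(send_nak, repaired, max_attempts=5, base_delay=0.05):
        for attempt in range(max_attempts):
            # random delay plus exponential backoff before complaining
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
            if repaired():      # another member's repair already filled the gap
                return True
            send_nak()          # otherwise ask (up toward the root) for a repair
        return False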
---
Gossip protocol [Push]
- periodically, transmit the gossip message [UDP] to b [fanout] random targets.
- other nodes do the same after receiving the multicast.
- nodes are not in sync, but in the analysis we assume they are.
- a node can receive a message more than once.
- protocol rounds (local clock), b random targets per round.
- ...
- once you have a multicast msg, you start gossiping about it.
- multiple msgs -> gossip a random subset of them, or recently-received ones,
  or higher-priority ones.
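a push-gossip sketch (hypothetical names; rounds are simulated synchronously purely as an
analysis convenience, as noted above): each round, every node that has the message picks b
random targets and forwards it; duplicates are expected and harmless.

    import random

    def gossip_round(has_message, all_nodes, b=2):
        infected_next = set(has_message)
        for node in has_message:
            for target in random.sample(all_nodes, b):   # b = fanout, chosen at random
                infected_next.add(target)                # receiving twice is harmless
        return infected_next

    nodes = list(range(100))
    infected = {0}                      # node 0 receives the original multicast
    rounds = 0
    while len(infected) < len(nodes):
        infected = gossip_round(infected, nodes, b=2)
        rounds += 1
    print(rounds)                       # typically O(log N) rounds to reach everyone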
------------
Peer-to-Peer Systems:
- the first distributed systems that seriously focused on scalability
  with respect to the number of nodes.
- P2P techniques in cloud computing systems
  - key-value stores (e.g., Cassandra, Riak, Voldemort, DynamoDB) use Chord-style p2p hashing.
- widely-deployed P2P systems
  1. Napster
  2. Gnutella
  3. FastTrack (Kazaa, KazaaLite, Grokster)
  4. BitTorrent
- P2P systems with provable properties
  1. Chord
  2. Pastry
  3. Kelips
----
Napster
Server:
- stores a directory, i.e., filenames with peer pointers.
- maintains directory information using (ternary tree algorithm).
Client:
- connects to a Napster server.
- uploads the list of music files that you want to share.
- the server maintains a list of <filename, ip_address, portnum> tuples.
  the server stores no files.
- Search
  - send the server keywords to search with.
  - (the server searches its list with the keywords)
  - the server returns a list of hosts - <ip_address, portnum> tuples - to the client.
  - the client pings each host in the list to find transfer rates.
  - the client fetches the file from the best host.
- all communication uses TCP (a directory/search sketch follows below).
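a sketch (hypothetical class, not Napster's actual protocol) of the split described above:
the server stores only the <filename, ip_address, portnum> directory and answers keyword
searches; clients then ping the returned hosts and fetch the file directly from the best one.

    from collections import defaultdict

    class NapsterStyleServer:
        def __init__(self):
            self.directory = defaultdict(list)   # filename -> [(ip, port), ...]; no files stored

        def share(self, filenames, ip, port):
            # a client uploads the list of files it is willing to share
            for name in filenames:
                self.directory[name].append((ip, port))

        def search(self, keyword):
            # return <ip, port> tuples of peers whose filenames match the keyword
            return [peer
                    for name, peers in self.directory.items() if keyword in name
                    for peer in peers]

    server = NapsterStyleServer()
    server.share(["song-one.mp3", "song-two.mp3"], "10.0.0.5", 6699)
    print(server.search("song"))   # the client would ping these hosts and fetch from the fastest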
- Joining a P2P system
  - can be used for any p2p system.
  - send an HTTP request to a well-known URL for that
    P2P service - http://www.myp2pservice.com
  - the message is routed (after lookup in DNS = Domain Name System) to the
    introducer, a well-known server that keeps track of some recently
    joined nodes in the p2p system.
  - the introducer initializes the new peer's neighbor table.
- Problems:
  - the centralized server is a source of congestion: it can be overwhelmed with search queries.
  - the centralized server is a single point of failure.
  - no security: plaintext messages and passwords.
  - napster.com was declared responsible for its users'
    copyright violations => "indirect infringement"
-----
Gnutella
- arguably the first fully distributed p2p system.
  it eliminates the servers [used in Napster];
  client machines search and retrieve amongst themselves [call them servents = server + client].
- peers are
  connected in an overlay graph [because it's overlaid on top of the Internet]:
  peers store their own files and are connected to neighbouring peers.
Gnutella has 5 main message types:
- Query (search)
- QueryHit (response to a query)
- Ping (to update the peer list)
- Pong (reply to a ping)
- Push (initiate file transfer)
all fields in a msg except the IP address are in little-endian format.
message fields:
Descriptor ID: unique for each msg.
Payload descriptor: defines the msg type.
TTL
Hops -> some don't use it; incremented on each hop.
Payload Length
Payload
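a packing sketch of the header fields listed above (a plausible layout for illustration, not
the exact wire format; the payload-descriptor codes and payload contents are assumptions):
a 16-byte descriptor id, a payload-descriptor byte for the message type, TTL, hops, and a
little-endian payload length, followed by the payload itself.

    import os
    import struct

    # payload-descriptor codes as commonly documented for Gnutella 0.4 (assumption)
    PING, PONG, PUSH, QUERY, QUERYHIT = 0x00, 0x01, 0x40, 0x80, 0x81

    def pack_descriptor(payload_type, ttl, hops, payload):
        descriptor_id = os.urandom(16)                  # unique per message
        header = struct.pack("<16sBBBI",                # "<" = little-endian fields
                             descriptor_id, payload_type, ttl, hops, len(payload))
        return header + payload

    msg = pack_descriptor(QUERY, ttl=7, hops=0, payload=b"search keywords")
    print(len(msg))   # 23-byte header + payload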
-----
Queries are forwarded only once; each peer keeps track of recent descriptor IDs.