Distributed System:
- A collection of independent computers that appears to its users as
  one computer, with 3 characteristics:
  - the computers operate concurrently.
  - the computers fail independently.
  - the computers don't share a global clock:
    - the computation that each of these machines does is
      asynchronous with respect to the other machines in the system.
- Storage: Relational/Mongo, Cassandra, HDFS
- Computation: Hadoop, Spark, Storm
- Synchronization: NTP, vector clocks
- Consensus: Paxos, Zookeeper
- Messaging: Kafka
-----
Single-Master Storage -- to -> Read Replication
- Complexity
- Consistency [it takes some time for all replicas to converge to the same value]
- appropriate for read-heavy workloads.
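a minimal sketch of the read-replication idea (the class and names are hypothetical):
all writes go to the single master and are copied to the replicas, reads are spread
across replicas; in a real system the copy step is asynchronous, which is why reads
can briefly be stale.

    import random

    class ReadReplicatedStore:
        def __init__(self, num_replicas=2):
            self.master = {}
            self.replicas = [{} for _ in range(num_replicas)]

        def write(self, key, value):
            self.master[key] = value          # all writes go to the master
            for replica in self.replicas:     # ...and are copied to each replica
                replica[key] = value          # (synchronous here only for simplicity)

        def read(self, key):
            # reads are spread across replicas to offload the master
            return random.choice(self.replicas).get(key)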
---
Sharding:
- More complexity.
- Limited data model.
  - we need a common key, present in all the queries we issue, that we can shard on.
    if you cannot find one, then sharding is not a great choice.
- Limited data access patterns.
  - if you need to query without using the sharding key, you must go to all shards:
    scatter the query out to all of them, get the results, and gather them back together
    (scatter/gather - sketched below).
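a small sketch (hypothetical names) of both access patterns: a query that carries the
shard key touches exactly one shard, a query without it is scattered to every shard and
the partial results are gathered back. Python's built-in hash() stands in for a stable
hash function here.

    def shard_for(key, shards):
        # route by shard key: hash it and pick exactly one shard
        return shards[hash(key) % len(shards)]

    def query_by_shard_key(key, shards):
        # cheap path: only one shard is touched
        return shard_for(key, shards).get(key)

    def scatter_gather(predicate, shards):
        # expensive path: no shard key, so ask every shard and merge the answers
        results = []
        for shard in shards:                      # scatter
            results.extend(v for v in shard.values() if predicate(v))
        return results                            # gather

    # usage sketch: 4 shards as plain dicts
    shards = [{} for _ in range(4)]
    shard_for("user:42", shards)["user:42"] = {"name": "Ada", "country": "UK"}
    print(query_by_shard_key("user:42", shards))
    print(scatter_gather(lambda v: v.get("country") == "UK", shards))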
---
Consistent Hashing:
Amazon Dynamo
we're building our project for scale from the ground up <- consistent hashing is about that.
make a ring of nodes with a known order and a label for each one; when you insert or
retrieve data -> hash the key, and given the ring order you'll know which node to insert
into / retrieve from.
[the selected node is the one where the hash of the record is greater than the previous
node's label and not greater than its own, so each node only needs to know the label of the
previous one].
uniformly hashing the key guarantees that the data is going to be uniformly distributed
around the ring.
for replication we can write the data to the next nodes in the ring.
--> for scalability we can add more nodes to the ring (a ring sketch follows below).
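a minimal consistent-hashing ring sketch (hypothetical class, no virtual nodes): node
labels are positions on the ring, a key belongs to the first node clockwise from its hash,
and the next nodes clockwise hold the replicas; adding a node only moves the keys between
it and its predecessor.

    import bisect
    import hashlib

    class Ring:
        def __init__(self, nodes):
            # label each node with a position on the ring and keep the labels sorted
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def nodes_for(self, key, replicas=3):
            # the first node clockwise from the key's hash owns the key;
            # the next (replicas - 1) nodes clockwise hold the copies
            start = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
            return [self.ring[(start + i) % len(self.ring)][1] for i in range(replicas)]

        def add_node(self, node):
            # scaling out is just one more label on the ring; only the keys between
            # the new node and its predecessor have to move
            bisect.insort(self.ring, (self._hash(node), node))

    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.nodes_for("user:42"))   # owner plus two replicas
    ring.add_node("node-e")            # add capacity without resharding everything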
how to deal with consistency?
R + W > N
R -> the number of nodes that need to agree on the data when we ask for it.
N -> the number of replicas we want to keep.
W -> the number of nodes that need to acknowledge the update when we update the data.
we make a trade-off between consistency and latency (see the sketch below).
entropy problem: when nodes don't agree on the data, when one replica is out of date with
respect to another, we say that the db has increased entropy. [solution: implementation specific]
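a tiny simulation (hypothetical helpers) of the R + W > N rule: with N=3, choosing W=2 and
R=2 means every read quorum intersects the last write quorum, so a read always sees at
least one copy of the newest acknowledged value; lowering R or W trades that guarantee
for latency.

    import random

    def write(replicas, w, key, value, version):
        # the write is acknowledged once W replicas have stored it
        for replica in random.sample(replicas, w):
            replica[key] = (version, value)

    def read(replicas, r, key):
        # ask R replicas and keep the newest version seen
        answers = [rep.get(key, (0, None)) for rep in random.sample(replicas, r)]
        return max(answers)   # (version, value) with the highest version wins

    # N=3 replicas; R=2, W=2 satisfies R + W > N, so the read quorum always
    # overlaps the last write quorum and returns the newest acknowledged value
    replicas = [{} for _ in range(3)]
    write(replicas, w=2, key="k", value="v1", version=1)
    print(read(replicas, r=2, key="k"))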
Traditionally in sharding, you have to decide early in the system's life, making a permanent
decision about how many shards you will ever have, and if you need to grow beyond that you've
got a lot of downtime; a major migration needs to be considered. in an architecture like
consistent hashing, the ability to scale is kind of baked in and really is designed in from
the ground up.
when to use:
- scale
- transactional data [heavy write workload across networks]
- always on
-----
CAP:
- Consistency: when you request to read data, you will get the latest write.
- Availability: no error returns.
- Partition Tolerance: when a partition occurs and is then healed, the system is able to
  carry on once the partition is healed.
  P is not negotiable in a distributed system. nodes will break, so we need to be
  prepared for it.
C in ACID: after commit, data will be in a valid state [referential integrity].
- most don't go with distributed transactions. they prefer to deal with failure with
  [write-off, retry, compensating actions] rather than rolling back the whole transaction,
  because this increases throughput and decreases latency (a retry/compensation sketch follows).
- splitting the work allows us to parallelize sub-tasks; with an uneven workload
  we can increase the number of nodes for the heavy tasks.
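a sketch of the "retry + compensating action" idea (the function names and the IOError
stand-in are hypothetical): each step commits locally, a failed later step is retried, and
if it keeps failing the earlier step is undone by a compensating action instead of a global
rollback.

    def transfer(debit, credit, compensate, retries=3):
        debit()                      # first step commits locally right away
        for _ in range(retries):
            try:
                credit()             # second step may hit a transient failure
                return "ok"
            except IOError:          # stand-in for whatever transient error the step raises
                continue             # retry the failed step
        compensate()                 # give up: compensate the debit instead of rolling back
        return "compensated"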
-----
Distributed computing:
- Scatter/Gather
  - the general paradigm we saw in distributed computation,
    where we scatter a computation out to lots of individual
    nodes, where it usually processes local data, and gather
    those results back together.
  - data being local is the key.
  - move computation close to the data.
- MapReduce
  - the key paradigm that Hadoop implements.
  - a computation strategy, not a piece of code.
- Hadoop
- Spark
- Storm
  - another distributed computation solution, but it's oriented
    around event processing instead of working on data persisted
    in some storage: realtime, very small computations based on
    incoming events.
- MapReduce
  what happens is map / shuffle / reduce,
  but we just give the system the mapper and the reducer.
  - word count example (sketched in code below):
    the Mapper takes a text file and
    produces a list of key/value pairs, then passes them to the shuffling stage,
    where Hadoop, for example, brings all occurrences of a common key from different nodes
    to one place for good data locality [one key, list of values].
    the Reducer takes the output of shuffling and does some kind of aggregation,
    probably producing a scalar.
  - shuffling can use hash partitioning, i.e. to get the values of a common
    key together:
    a key is assigned to reducer # = hash(key) % number of reducers
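a word-count sketch in plain Python (hypothetical function names, not Hadoop's API) of the
three stages described above: map emits (word, 1) pairs, shuffle groups values by key using
hash partitioning (reducer # = hash(key) % number of reducers), and reduce aggregates each
group into a scalar.

    from collections import defaultdict

    def map_phase(text):
        # mapper: text in, list of (key, value) pairs out
        return [(word, 1) for word in text.split()]

    def shuffle(pairs, num_reducers):
        # shuffle: hash-partition so every occurrence of a key lands on the same reducer
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
        return partitions

    def reduce_phase(partitions):
        # reducer: aggregate each key's list of values into a scalar
        return {key: sum(values) for part in partitions for key, values in part.items()}

    pairs = map_phase("to be or not to be")
    print(reduce_phase(shuffle(pairs, num_reducers=2)))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}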
---------
negative acknowledgments (NAKs)
The recipient sends it to the sender when the sent data packet isn’t
delivered properly. This can be due to different reasons like incomplete
or corrupted data, checksum failure, invalid headers, or wrong data formats.
multicast vs broadcast
- broadcast sends to all nodes in the network.
- multicast sends to a specific group of nodes.
- a group of nodes wants to inform all the other nodes.
- most of these are application-level protocols.
- simplest implementation (sketched below):
  loop over the list of nodes and send to each of them using TCP/UDP.
  - the overhead on the sender grows with the number of receivers.
  - if the sender fails, the remaining receivers will not receive the message.
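a sketch of that simplest implementation (the addresses are hypothetical): the sender just
loops over the member list and unicasts the same payload to each one over UDP, which is why
the sender's cost is proportional to the number of receivers and why its failure stops the
remaining deliveries.

    import socket

    def naive_multicast(members, payload):
        # members: list of (host, port) pairs; payload: bytes
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for addr in members:          # sender cost is O(number of receivers)
            sock.sendto(payload, addr)
        sock.close()

    # hypothetical group
    naive_multicast([("10.0.0.2", 9000), ("10.0.0.3", 9000)], b"hello group")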
- Tree-Based -> build a spanning tree.
  - IP multicast [network-level protocol]; application-level protocols: SRM, RMTP, TRAM, TMTP.
  - low latency, less overhead.
  - problems? how to set up and maintain the tree, and how to detect failures?
--
- build a spanning tree among the processes of the multicast group.
- use the spanning tree to disseminate ACKs or NAKs to repair multicasts that were not received.
- SRM (Scalable Reliable Multicast)
  - uses NAKs.
  - if a receiver does not get a message, it sends a repair NAK up toward the root,
  - but it adds random delays and uses exponential backoff to avoid a NAK storm
    (see the sketch below).
- RMTP (Reliable Multicast Transport Protocol)
  - uses ACKs - post-ACK: a receiver sends a collection of ACKs for all multicast msgs,
  - but ACKs are only sent to designated receivers, which then re-transmit missing multicasts.
- these protocols still cause an O(N) ACK/NAK overhead.
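a sketch of the SRM back-off idea (the callables send_nak and repaired are hypothetical
hooks, not part of any real SRM library): wait a random delay before NAK-ing, since someone
else's NAK or repair may arrive first, and back off exponentially between attempts so the
group doesn't generate a NAK storm.

    import random
    import time

    def nak_with_backoff(send_nak, repaired, max_attempts=5, base_delay=0.05):
        for attempt in range(max_attempts):
            # random delay plus exponential backoff before complaining
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
            if repaired():      # another member's repair already filled the gap
                return True
            send_nak()          # otherwise ask (up toward the root) for a repair
        return False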
---
Gossip protocol [Push]
- periodically, transmit the gossip message [UDP] to b [fanout] random targets.
- other nodes do the same after receiving the multicast.
- nodes are not in sync, but in the analysis we assume they are.
- a node can receive a message more than once.
- protocol rounds (local clock), b random targets per round.
- ...
- once you have a multicast msg, you start gossiping about it.
- multiple msgs -> gossip a random subset of them, or recently-received ones,
  or higher-priority ones.
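a push-gossip sketch (hypothetical names; rounds are simulated synchronously purely as an
analysis convenience, as noted above): each round, every node that has the message picks b
random targets and forwards it; duplicates are expected and harmless.

    import random

    def gossip_round(has_message, all_nodes, b=2):
        infected_next = set(has_message)
        for node in has_message:
            for target in random.sample(all_nodes, b):   # b = fanout, chosen at random
                infected_next.add(target)                # receiving twice is harmless
        return infected_next

    nodes = list(range(100))
    infected = {0}                      # node 0 receives the original multicast
    rounds = 0
    while len(infected) < len(nodes):
        infected = gossip_round(infected, nodes, b=2)
        rounds += 1
    print(rounds)                       # typically O(log N) rounds to reach everyone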
------------
Peer-to-Peer Systems:
- the first distributed systems that seriously focused on scalability
  with respect to the number of nodes.
- P2P techniques in cloud computing systems
  - key-value stores (e.g., Cassandra, Riak, Voldemort, DynamoDB) use Chord-style p2p hashing.
- widely-deployed P2P systems
  1. Napster
  2. Gnutella
  3. FastTrack (Kazaa, KazaaLite, Grokster)
  4. BitTorrent
- P2P systems with provable properties
  1. Chord
  2. Pastry
  3. Kelips
----
Napster
Server:
- stores a directory, i.e., filenames with peer pointers.
- maintains directory information using (ternary tree algorithm).
Client:
- connects to a Napster server.
- uploads the list of music files that you want to share.
- the server maintains a list of <filename, ip_address, portnum> tuples.
  the server stores no files.
- Search
  - send the server keywords to search with.
  - (the server searches its list with the keywords)
  - the server returns a list of hosts - <ip_address, portnum> tuples - to the client.
  - the client pings each host in the list to find transfer rates.
  - the client fetches the file from the best host.
- all communication uses TCP (a directory/search sketch follows below).
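a sketch (hypothetical class, not Napster's actual protocol) of the split described above:
the server stores only the <filename, ip_address, portnum> directory and answers keyword
searches; clients then ping the returned hosts and fetch the file directly from the best one.

    from collections import defaultdict

    class NapsterStyleServer:
        def __init__(self):
            self.directory = defaultdict(list)   # filename -> [(ip, port), ...]; no files stored

        def share(self, filenames, ip, port):
            # a client uploads the list of files it is willing to share
            for name in filenames:
                self.directory[name].append((ip, port))

        def search(self, keyword):
            # return <ip, port> tuples of peers whose filenames match the keyword
            return [peer
                    for name, peers in self.directory.items() if keyword in name
                    for peer in peers]

    server = NapsterStyleServer()
    server.share(["song-one.mp3", "song-two.mp3"], "10.0.0.5", 6699)
    print(server.search("song"))   # the client would ping these hosts and fetch from the fastest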
- Joining a P2P system
  - can be used for any p2p system.
  - send an HTTP request to a well-known URL for that
    P2P service - http://www.myp2pservice.com
  - the message is routed (after lookup in DNS = Domain Name System) to the
    introducer, a well-known server that keeps track of some recently
    joined nodes in the p2p system.
  - the introducer initializes the new peer's neighbor table.
- Problems:
  - the centralized server is a source of congestion: it can be overwhelmed with search queries.
  - the centralized server is a single point of failure.
  - no security: plaintext messages and passwords.
  - napster.com was declared responsible for its users'
    copyright violations => "indirect infringement"
-----
Gnutella
- arguably the first fully distributed p2p system.
  it eliminates the servers [used in Napster];
  client machines search and retrieve amongst themselves [call them servents = server + client].
- peers are
  connected in an overlay graph [because it's overlaid on top of the Internet]:
  peers store their own files and are connected to neighbouring peers.
Gnutella has 5 main message types:
- Query (search)
- QueryHit (response to a query)
- Ping (to update the peer list)
- Pong (reply to a ping)
- Push (initiate file transfer)
all fields in a msg except the IP address are in little-endian format.
message fields:
Descriptor ID: unique for each msg.
Payload descriptor: defines the msg type.
TTL
Hops -> some don't use it; incremented on each hop.
Payload Length
Payload
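a packing sketch of the header fields listed above (a plausible layout for illustration, not
the exact wire format; the payload-descriptor codes and payload contents are assumptions):
a 16-byte descriptor id, a payload-descriptor byte for the message type, TTL, hops, and a
little-endian payload length, followed by the payload itself.

    import os
    import struct

    # payload-descriptor codes as commonly documented for Gnutella 0.4 (assumption)
    PING, PONG, PUSH, QUERY, QUERYHIT = 0x00, 0x01, 0x40, 0x80, 0x81

    def pack_descriptor(payload_type, ttl, hops, payload):
        descriptor_id = os.urandom(16)                  # unique per message
        header = struct.pack("<16sBBBI",                # "<" = little-endian fields
                             descriptor_id, payload_type, ttl, hops, len(payload))
        return header + payload

    msg = pack_descriptor(QUERY, ttl=7, hops=0, payload=b"search keywords")
    print(len(msg))   # 23-byte header + payload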
-----
Queries are forwarded only once; each peer keeps track of recent descriptor IDs.