Design Notification Service - System Design Interview

TO DO: gossip protocol

Problem Statement

In the publisher/subscriber model, a publisher publishes messages that need to be delivered to a group of subscribers.

The straightforward approach is synchronous publishing: the publisher calls each subscriber in some order and waits for a response.

./synchronous.png
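
To make the drawback of synchronous publishing concrete, here is a minimal sketch. The function and subscriber names are hypothetical, and the 0.2 second sleep only simulates a slow endpoint. The fast subscriber cannot be called until the slow one has answered.

```python
import time

def publish_synchronously(message, subscribers):
    """Call each subscriber in order and wait for each response."""
    responses = []
    for deliver in subscribers:
        responses.append(deliver(message))  # blocks until this subscriber answers
    return responses

def slow_subscriber(message):
    time.sleep(0.2)  # simulated slow endpoint (assumption for illustration)
    return "slow got " + message

def fast_subscriber(message):
    return "fast got " + message

start = time.monotonic()
responses = publish_synchronously("hello", [slow_subscriber, fast_subscriber])
elapsed = time.monotonic() - start  # total time includes the slow subscriber's delay
```

The publisher's latency is the sum of all subscriber latencies, which is what the notification service is meant to avoid.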

We need to build a notification service that can register an arbitrary number of subscribers and coordinate message delivery.

./notificationService.png

Functional Requirements

The set of functions the system will support, expressed as APIs:

  • createTopic(topicName)
  • publish(topicName, message)
  • subscribe(topicName, endpoint)
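
As a rough illustration of how the three APIs fit together, here is an in-memory sketch. It is a single-process stand-in, not the distributed design described later. The class name, the snake_case method names, the returned message ID, and the error types are all assumptions made for the example.

```python
import uuid

class NotificationService:
    """In-memory stand-in for createTopic, publish and subscribe."""

    def __init__(self):
        self._subscribers = {}  # topic name -> set of endpoints
        self._pending = {}      # topic name -> list of (message_id, message)

    def create_topic(self, topic_name):
        if topic_name in self._subscribers:
            raise ValueError(f"topic already exists: {topic_name}")
        self._subscribers[topic_name] = set()
        self._pending[topic_name] = []

    def subscribe(self, topic_name, endpoint):
        self._require_topic(topic_name)
        self._subscribers[topic_name].add(endpoint)

    def publish(self, topic_name, message):
        self._require_topic(topic_name)
        message_id = str(uuid.uuid4())  # assumption: an ID is assigned on publish
        self._pending[topic_name].append((message_id, message))
        return message_id

    def _require_topic(self, topic_name):
        if topic_name not in self._subscribers:
            raise KeyError(f"unknown topic: {topic_name}")

service = NotificationService()
service.create_topic("orders")
service.subscribe("orders", "https://example.com/hook")
message_id = service.publish("orders", "order 1 created")
```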

Non-Functional Requirements

  • Scalable: supports an arbitrarily large number of topics, publishers and subscribers
  • Highly available: survives hardware/software failures, no single point of failure
  • Highly performant: keeps end-to-end latency as low as possible
  • Durable: messages must not be lost, each subscriber must receive every message at least once

High level architecture

./architecture.png

Metadata service

  • provides access to the database through a well-defined interface, which simplifies maintenance and leaves room to make changes in the future
  • serves as a cache layer

Temporary Storage

Stores messages for a short period of time, until they are successfully delivered to all subscribers.

Frontend Service

  • a lightweight web service
  • stateless service deployed across several data centers

Frontend Service Actions

  • Request validation
  • Authentication/authorization
  • TLS (SSL) termination
  • Server-side encryption
  • Caching
  • Rate limiting (throttling)
  • Request dispatching
  • Request deduplication
  • Usage data collection

Frontend Service Host

./frontendServiceHost.png

The reverse proxy performs SSL termination and encrypts responses to the client.

The frontend service calls the metadata service to get topic information. To reduce the number of requests to the metadata service, the frontend service also keeps a local cache. It also writes logs and metrics to local disk.
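
One possible shape for the local cache is a small time-to-live (TTL) cache in front of the metadata service call. This is only a sketch: the class name, the 60 second TTL, the injectable clock, and the fake metadata service are assumptions, and the notes do not say which eviction or expiry policy is used.

```python
import time

class TopicMetadataCache:
    """Caches topic metadata locally and refetches it after the TTL expires."""

    def __init__(self, fetch_topic, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch_topic = fetch_topic  # callable standing in for the metadata service
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # topic name -> (expires_at, metadata)

    def get(self, topic_name):
        now = self._clock()
        entry = self._entries.get(topic_name)
        if entry is not None and entry[0] > now:
            return entry[1]              # cache hit, no remote call
        metadata = self._fetch_topic(topic_name)
        self._entries[topic_name] = (now + self._ttl, metadata)
        return metadata

calls = []

def fake_metadata_service(topic_name):
    calls.append(topic_name)  # record each remote call
    return {"topic": topic_name, "subscribers": ["https://example.com/hook"]}

fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = TopicMetadataCache(fake_metadata_service, ttl_seconds=60.0,
                           clock=lambda: fake_now[0])
first = cache.get("orders")   # miss: calls the metadata service
second = cache.get("orders")  # hit: served from the local cache
fake_now[0] = 61.0
third = cache.get("orders")   # entry expired: calls the metadata service again
```

Three lookups result in two calls to the metadata service. The trade-off is that cached topic information can be stale for up to one TTL.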

Metadata Service Host

./metadataServiceHost.png

There are two options for letting the frontend service know which metadata service host to call: use a configuration service, or make each metadata service host aware of the other hosts.

Temporary storage service

  • must be a fast, highly available and scalable web service
  • has to store messages for several days to handle unavailability of subscribers

./temporaryStorage.png

Different storage options

  • Database
    • no need for ACID transactions, no need to run complicated queries, and the storage won’t be used for analytics or data warehousing
    • the database needs to support scalable reads and writes
    • it needs to be highly available and tolerate network partitions
    • NoSQL wins for these use cases
    • to decide which type of NoSQL: messages have a limited size, e.g. 1 MB, so we don’t need document-based storage
    • there are no complex relationships, so we don’t need a graph-type NoSQL database
    • so either a column or a key-value NoSQL database works for us

For the benefits of column-oriented storage, refer to https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html. In short, a column-oriented store writes data to disk column after column, so its performance for analytics work, e.g. calculating the average value of a column, is much better. A row-based database writes records to disk row after row, so calculating the average value of one column has to scan many more disk blocks.
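
The difference can be illustrated with a toy model in which the number of stored values read stands in for the number of disk blocks scanned. The table contents and function names are made up for the example, and real databases add compression and block-level details that are not modeled here.

```python
rows = [
    {"id": 1, "name": "a", "price": 10.0},
    {"id": 2, "name": "b", "price": 20.0},
    {"id": 3, "name": "c", "price": 30.0},
]

# Row layout: all fields of record 1, then all fields of record 2, and so on.
row_layout = [value for row in rows for value in row.values()]

# Column layout: all values of one column are stored next to each other.
column_layout = {key: [row[key] for row in rows] for key in rows[0]}

def average_price_row_layout(flat, fields_per_row, price_index):
    values_read = len(flat)  # every field of every row is scanned
    prices = flat[price_index::fields_per_row]
    return sum(prices) / len(prices), values_read

def average_price_column_layout(columns):
    prices = columns["price"]  # only the one column is scanned
    return sum(prices) / len(prices), len(prices)

row_average, row_values_read = average_price_row_layout(row_layout, 3, 2)
column_average, column_values_read = average_price_column_layout(column_layout)
```

Both layouts give the same average, but in this model the row layout reads 9 values and the column layout reads 3.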

Sender service

./senderService.png

  • The sender service uses multiple threads to retrieve messages from temporary storage. It should use a counting semaphore to dynamically adjust the number of threads and avoid overwhelming the temporary storage.
  • The sender service calls the metadata service to retrieve the list of subscribers. The list is not passed along with each message because it might be large, which would put a heavy load on temporary storage and might force us to use a document-based NoSQL store because of the bigger message size.
  • The sender service uses multiple threads to send a message to each subscriber, and it can delegate the actual sending to other microservices, e.g. one that sends email or one that calls HTTP endpoints. A simple for loop that sends the message to subscribers one by one is avoided because error handling becomes a problem: some subscribers might be slow, or a send might fail, and the loop would block delivery to the remaining subscribers.
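
The sender's two concurrency ideas can be sketched as follows: a counting semaphore caps concurrent reads from temporary storage, and a thread pool delivers to subscribers in parallel so that one failing subscriber does not block the others. All names, the limit of 2 concurrent reads, and the dict-based stand-ins for temporary storage and the metadata service are assumptions. The permit count is fixed here, whereas the notes call for adjusting it dynamically.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_READS = 2  # hypothetical limit that protects temporary storage
read_permits = threading.BoundedSemaphore(MAX_CONCURRENT_READS)

def read_message(storage, message_id):
    with read_permits:  # at most MAX_CONCURRENT_READS readers at a time
        return storage[message_id]

def send_to_all(message, subscribers, deliver, max_workers=8):
    """Deliver to every subscriber in parallel; return the endpoints that failed."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {endpoint: pool.submit(deliver, endpoint, message)
                   for endpoint in subscribers}
        for endpoint, future in futures.items():
            try:
                future.result()
            except Exception:
                failed.append(endpoint)  # isolate the failure, keep going
    return failed

storage = {"m1": ("orders", "order 1 created")}                # stand-in for temporary storage
subscribers_by_topic = {"orders": ["ok-1", "broken", "ok-2"]}  # stand-in for metadata service
delivered = []
delivered_lock = threading.Lock()

def deliver(endpoint, message):
    if endpoint == "broken":
        raise ConnectionError("subscriber unavailable")
    with delivered_lock:
        delivered.append(endpoint)

topic, body = read_message(storage, "m1")
failed = send_to_all(body, subscribers_by_topic[topic], deliver)
```

The failed endpoints are returned rather than raised, so a caller can hand them to whatever retry mechanism is in place.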

What else is important

  • avoid spammers
    • subscriber registration needs to be confirmed
  • duplicate messages
    • the subscriber is responsible for eliminating duplicates
  • retry of delivery attempts
    • retries guarantee at-least-once message delivery. Ideally, subscribers can choose the retry policy applied when a message fails to deliver
  • message order
    • message order is not guaranteed. Delivery might be out of order because of slow hosts, slow subscribers, and retries of failed deliveries
  • security
  • monitoring
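
The interaction between retries and duplicates can be sketched as below. The example assumes every message carries a unique ID, which the notes do not state. The exponential backoff policy, the parameter names, and the simulated lost acknowledgement are likewise made up for illustration.

```python
import time

def deliver_with_retry(deliver, message, max_attempts=3, base_delay=0.01,
                       sleep=time.sleep):
    """Retry with exponential backoff; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            deliver(message)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))

class DeduplicatingSubscriber:
    """Processes each message ID once, even if it is delivered several times."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []
        self._fail_next_ack = True  # simulate one lost acknowledgement

    def receive(self, message):
        message_id, body = message
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.processed.append(body)
        if self._fail_next_ack:
            self._fail_next_ack = False
            raise TimeoutError("acknowledgement lost")

subscriber = DeduplicatingSubscriber()
attempts = deliver_with_retry(subscriber.receive, ("m1", "order 1 created"),
                              sleep=lambda seconds: None)  # skip real waiting
```

The first attempt is processed but its acknowledgement is lost, so the sender retries. The second attempt is recognized as a duplicate and the message is processed only once.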