Jul 25, 2024
The essentials for designing a full-featured Notification System to send product-to-user notifications at scale for less. Compatible with AWS, Azure, and GCP.
In our Ultimate Guide on Notification Services, we discussed if and when you should build a Notification Service. This article proposes a notification system architecture that you can use as your in-house notification system.
If you use Serverless, go to the Serverless Notification Service Design article.
Motivation
Product-to-user notifications are a common culprit for code smells, technical debt, and endless HTML/CSS/UI templates around your code base. They usually end up with no tests, no oversight or ownership, and cause pain to both product teams and end-users.
We previously wrote about common technical mistakes and non-technical mistakes teams make when building their notification service from scratch.
An Easier Solution
There is so much work that goes into building a scalable notification service. That is precisely why we built NotificationAPI: a plug-and-play notification-as-a-service solution. It takes 5 minutes to set up. It is scalable and has every feature you can imagine for your notifications, such as a dashboard for your team to design and configure notifications visually without developer involvement.
All engineering comes down to making tradeoffs between the perfect and the workable.
-- Katie Hafner
Objective
To design a notification service that can send product-to-user notifications across many channels at scale while addressing the mistakes and challenges discussed above.
Requirements:
Send API: An authenticated end-point that we can trigger to send notifications from any back-end
Supported Channels: Support sending notifications to any channel that exposes an API, e.g., Email, SMS, Push
User Preferences: Allow users to set their preferences for each notification and channel
Respecting downstream service limits: Avoid getting throttled or suspended by your email or SMS service
Scalable: Support horizontal scaling so that capacity is (theoretically) unlimited
High-Level Architecture
A Quick Overview
Let's imagine that your code should send a notification. The numbers below correspond with the numbers you see on the diagram.
1. Your code calls the POST /send endpoint. The request contains the recipient's userId, the type of notification, and the notification contents for every supported channel.
2. The /send end-point authenticates the request using OAuth2's Client Credentials Flow.
3. It then requests the user's notification preferences from the database. The preferences indicate whether the user is subscribed to a particular notification and channel or not.
4. It will read the user attributes such as email address or phone number from the database.
5. This end-point will form a message object containing user attributes from step (4), along with the channels and content for each channel. However, it will exclude disabled channels based on step (3). Finally, the message is sent to a fan out service.
6. The fanout service is configured to broadcast incoming messages to job queues. However, there is filtering in place to ignore job queues related to channels that are not listed inside the message.
7. There is a job queue and processor per channel. The processor picks up the job and requests the appropriate service, e.g., a transactional email or SMS service.
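Steps (3) to (5) can be sketched as a small pure function. This is a minimal illustration, not a definitive implementation: the function name build_message, the message field names, and the sample values are assumptions made for this example, as is the choice to treat a channel with no stored preference as enabled.

```python
from typing import Any


def build_message(
    user_id: str,
    notification_type: str,
    content_by_channel: dict[str, Any],
    preferences: dict[str, bool],
    user_attributes: dict[str, str],
) -> dict[str, Any]:
    """Form the message that the /send end-point hands to the fanout."""
    channels = {
        channel: content
        for channel, content in content_by_channel.items()
        # Assumption: a channel with no stored preference counts as enabled.
        if preferences.get(channel, True)
    }
    return {
        "userId": user_id,
        "notificationType": notification_type,
        "user": user_attributes,  # step (4): email address, phone number, ...
        "channels": channels,     # step (5): disabled channels already removed
    }


# Hypothetical request data, for illustration only.
message = build_message(
    user_id="user_123",
    notification_type="order_shipped",
    content_by_channel={
        "email": {"subject": "Your order shipped", "html": "<p>On its way!</p>"},
        "sms": {"text": "Your order shipped."},
    },
    preferences={"email": True, "sms": False},
    user_attributes={"email": "jane@example.com", "phone": "+15555550100"},
)
print(sorted(message["channels"]))  # ['email']
```

Because the user disabled SMS, only the email content survives, so the SMS job queue never sees this message.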
Important Architecture Decisions:
POST /send
Notice that the request to this end-point only contains the userId and not the email or phone number. This way, the services that trigger notifications need no knowledge of your users' contact details.
The end-point is behind a load balancer to ensure scalability.
The end-point is not protected with your regular user-facing authentication. Since the service that makes the request is a "program" itself, you need a different authentication mechanism: the OAuth2 Client Credentials Flow, which is designed for server-to-server communication. Here are links to how to do this in Auth0 and Cognito.
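For reference, a Client Credentials token request is a form-encoded POST to the authorization server's token end-point. The sketch below only builds the request, following the generic shape in the OAuth2 specification (RFC 6749, section 4.4) with HTTP Basic client authentication. The token URL, client id, secret, and scope are placeholders, and providers such as Auth0 or Cognito may expect slightly different parameters, so check their documentation.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(
    token_url: str, client_id: str, client_secret: str, scope: str
) -> urllib.request.Request:
    """Build (but do not send) a Client Credentials token request."""
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    # The client authenticates with HTTP Basic: base64("client_id:client_secret").
    # Assumption: the id and secret contain no characters that need escaping.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8"))
    return urllib.request.Request(
        token_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic.decode('ascii')}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# All values below are hypothetical placeholders.
request = build_token_request(
    "https://auth.example.com/oauth/token",
    "my-client-id",
    "my-client-secret",
    "notifications:send",
)
print(request.get_method())  # POST
```

Sending the request (for example with urllib.request.urlopen) would return a JSON body containing an access token, which the calling service then passes to POST /send in an Authorization: Bearer header.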
Do we need a whole end-point for this?
There will be many parts of your application that will be issuing notifications. Implementing the send function as an end-point behind a load-balancer ensures that it is independently scalable and allows you to use it from virtually anywhere, e.g., from a new codebase or your build pipeline.
Alternatively, if all of your back-end services are encapsulated in a single cloud environment - e.g. one AWS account in one region - you can use an event-driven service to receive and process these requests.
User Preferences
Use a highly scalable NoSQL or key/value pair database. Structure the records as: KEY: sample_user_id:sample_notification_id, VALUE: [{channel: "email", state: true}, {channel: "sms", state: false}]
When the send end-point sees "false" values in the records, it will remove the related channel from the message sent to the fanout. If a record for a channel does not exist, it means the user has not explicitly set their preferences. In this case, you need to agree on a default.
The information in the user_preferences table is updated by the user through your UI, through a normal end-point protected by your common authentication mechanisms.
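The record structure and the default rule described above can be sketched as follows. A plain dictionary stands in for the key/value database, and the helper names, the DEFAULT_STATE constant, and the "push" channel are assumptions made for this example. Whether the default should be opted in or opted out is a product decision; here it is assumed to be opted in.

```python
from typing import Optional

DEFAULT_STATE = True  # assumption: users are opted in until they say otherwise


def preference_key(user_id: str, notification_id: str) -> str:
    """Build the record key in the user_id:notification_id format."""
    return f"{user_id}:{notification_id}"


def resolve_channels(
    record: Optional[list[dict]],
    channels: list[str],
    default: bool = DEFAULT_STATE,
) -> dict[str, bool]:
    """Map each channel to its explicit state, or to the agreed default."""
    explicit = {entry["channel"]: entry["state"] for entry in (record or [])}
    return {channel: explicit.get(channel, default) for channel in channels}


# A dict standing in for the NoSQL / key-value store.
store = {
    "sample_user_id:sample_notification_id": [
        {"channel": "email", "state": True},
        {"channel": "sms", "state": False},
    ]
}

record = store.get(preference_key("sample_user_id", "sample_notification_id"))
resolved = resolve_channels(record, ["email", "sms", "push"])
print(resolved)  # {'email': True, 'sms': False, 'push': True}
```

The "push" channel has no entry in the record, so it falls back to the default.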
Why do we need user preferences?
Not allowing users to control their notification preferences will only frustrate them and force them to mark your notifications as spam or mute your notifications. This will further damage your user experience and mark your account for suspension by email or SMS delivery services.
Fan Out
A fanout service takes a message and duplicates it to multiple destinations. Fanout services are cheap and highly scalable. In AWS, use SNS. In Google Cloud Platform, use Pub/Sub, and in Azure, use Service Bus topics and subscriptions.
You can configure filtering between the fanout and job queues to avoid sending unnecessary messages to job queues of channels that have been excluded. For example, in AWS SNS, you can specify that the email job queue should only receive the fanout message if the message contains the "email" property inside the "channels" property.
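To make the filtering behavior concrete, here is a toy in-memory fanout. It is a simulation for illustration only: in a managed service such as SNS, the filter is subscription configuration rather than code you write, and the Fanout class and its method names are invented for this sketch.

```python
from collections import deque


class Fanout:
    """Toy fanout: copies a message to every queue whose channel it lists."""

    def __init__(self) -> None:
        self._subscriptions: list[tuple[str, deque]] = []

    def subscribe(self, channel: str, queue: deque) -> None:
        self._subscriptions.append((channel, queue))

    def publish(self, message: dict) -> int:
        """Deliver the message and return how many queues received it."""
        delivered = 0
        for channel, queue in self._subscriptions:
            # The filter: skip queues for channels absent from the message.
            if channel in message["channels"]:
                queue.append(message)
                delivered += 1
        return delivered


email_queue: deque = deque()
sms_queue: deque = deque()

fanout = Fanout()
fanout.subscribe("email", email_queue)
fanout.subscribe("sms", sms_queue)

# Hypothetical message in which only the email channel is enabled.
delivered = fanout.publish(
    {"userId": "user_123", "channels": {"email": {"subject": "Hi"}}}
)
print(delivered, len(email_queue), len(sms_queue))  # 1 1 0
```

Adding a new channel later means adding one more subscription; the publishing side does not change.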
Why do we need a fanout?
You could write code that puts the same message into the necessary job queues, but fanout is cheaper, and writing less code is good. Another benefit of fanout is being able to easily add/remove queues, thus allowing you to refactor and extend your channels.
Job Processing
Queues hold on to messages until your job processors process them. They are also cheap and highly scalable. Job processors are code that takes messages from the job queues and processes them. They can scale based on the number of messages in the queue.
In our case, the job processor should make an API call to the appropriate delivery service, e.g., a transactional email service, to send out the notification.
Most email, SMS, or similar delivery services have strict guidelines on the amount and quality of messages you send. You should also carefully review these and put proper systems in place. Here is our guide on how to prevent getting suspended on AWS SES.
You can configure a max number of job processors to avoid hitting the rate limits of the delivery services.
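A rough way to pick that maximum is to divide the provider's rate limit by the throughput of a single processor. The numbers below are hypothetical, and the calculation assumes each processor sends one request at a time and spends all of its time waiting on the delivery service.

```python
import math


def max_processors(provider_limit_per_second: float, seconds_per_request: float) -> int:
    """Largest processor count whose combined rate stays within the limit.

    One processor completes 1 / seconds_per_request requests per second, so
    n processors complete n / seconds_per_request. Solving for n gives
    limit * seconds_per_request, rounded down.
    """
    # At least one processor is needed; if even one would exceed the limit,
    # that processor has to pause between calls (not modeled here).
    return max(1, math.floor(provider_limit_per_second * seconds_per_request))


# Hypothetical: the provider allows 14 messages/second, each API call takes 250 ms.
print(max_processors(14, 0.25))  # 3
```

With these numbers, three processors send about 12 messages per second, and a fourth would push the total to 16, above the limit.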
Why do we need job queues and processors?
Multiple reasons: A) External delivery services are generally slow. The queue mechanism allows you to process these slow jobs asynchronously from the rest of your code. B) The queue mechanism allows you to control the rate of your jobs, thus avoiding throttling. C) These external services could face outages. A job queue mechanism lets you decide what to do in cases of job failure without a single line of code, e.g., retry the job every 30 minutes for a maximum of 3 times.
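Managed queues usually express the retry policy in point C as configuration. The simulation below only makes the behavior concrete: a job is retried until it succeeds or reaches a maximum number of attempts, after which it is set aside for inspection. The drain function, the job structure, and the reading of "a maximum of 3 times" as three attempts in total are assumptions for this sketch, and the delay between retries is omitted.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: three attempts in total


def drain(queue: deque, send: Callable[[dict], None],
          max_attempts: int = MAX_ATTEMPTS) -> list[dict]:
    """Process every job; return the jobs that failed on all attempts."""
    failed_jobs = []
    while queue:
        job = queue.popleft()
        job["attempts"] = job.get("attempts", 0) + 1
        try:
            send(job["payload"])
        except Exception:
            if job["attempts"] < max_attempts:
                queue.append(job)  # a real queue would also wait before retrying
            else:
                failed_jobs.append(job)  # comparable to a dead-letter queue
    return failed_jobs


calls = {"count": 0}


def flaky_send(payload: dict) -> None:
    """Hypothetical delivery call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("provider unavailable")


queue = deque([{"payload": {"to": "jane@example.com"}}])
failed = drain(queue, flaky_send)
print(calls["count"], len(failed))  # 3 0
```

The job succeeds on its third attempt, so nothing is set aside. A sender that always fails would leave the job in the returned list with attempts equal to 3.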
In-App Notifications
If your service is a web app or uses an app to communicate with users, you should consider adding in-app notifications to your notification system. Using your app is a commitment that comes with the expectation of messages on important topics (order confirmation if retail-based, comments or approvals if collaborative), and in-app fulfills that in less invasive ways than filling email inboxes or sending SMS. While both email and SMS messages can be disregarded as spam or scams, in-app creates a more trustworthy and direct channel for notifications.
Websockets & Webhooks
Websockets are similar to APIs, with one key difference: the connection stays open, so servers can push information through it at any time instead of waiting for clients to send requests. This makes websockets a go-to solution for in-app notifications and other features that require live updates. Effective in-app notifications require a websocket connection to ensure new notifications are delivered to users without the need to refresh the web page.
Webhooks are also similar to APIs, but the direction is reversed: instead of you calling a service, the service calls your end-point when an event happens. Your email provider can call a webhook to inform you that an email was delivered or bounced, and the contents of that request can be logged for deliverability records.
Logging
Logging gives valuable insight into how effective your notification system currently is. Tracking deliverability rates allows you to identify failures or bounces (incorrect email addresses or phone numbers) and follow up with the necessary changes, such as removing incorrect contact information or troubleshooting your delivery process. Logging open and click rates shows how well your notifications work for your clients and which kinds of notifications they are most receptive to.
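As a small illustration of what such logs make possible, the sketch below computes the share of each delivery status from a list of logged events. The event shape and the status names are assumptions for this example; real delivery services report their own event formats.

```python
from collections import Counter


def delivery_rates(events: list[dict]) -> dict[str, float]:
    """Return the fraction of logged events per delivery status."""
    counts = Counter(event["status"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: count / total for status, count in counts.items()}


# Hypothetical log entries.
events = [
    {"channel": "email", "status": "delivered"},
    {"channel": "email", "status": "delivered"},
    {"channel": "email", "status": "bounced"},
    {"channel": "sms", "status": "delivered"},
]

rates = delivery_rates(events)
print(rates)  # {'delivered': 0.75, 'bounced': 0.25}
```

A rising bounce share is a signal to clean up contact information before the delivery service flags your account.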
Further Improvements
Here are a few things that are possible but that we haven't covered. If you need any of these capabilities, read the next section.
The architecture of a scalable in-app notification service - they require their own APIs, tables, etc., so they deserve a separate article.
Removing notification contents from the code and allowing your product and design team to edit the notifications visually without code change.
Dashboard for your team to enable/disable notifications or specific channels without code changes
Conclusion
In this article, we learned about the architecture of a scalable notification service. We used tools available in all the major cloud providers so that you can build your notifications based on this. Alternatively, you also learned of a notification-as-a-service product that could save you all the trouble. Software engineering, like any other engineering discipline, comes down to trade-offs.
Further reading:
Decision tree in our ultimate guide on notification services that can help you figure out what trade-offs you make
Avoid these common technical mistakes with your notification service
Avoid these non-technical mistakes too