Authors:
Jeff Inman, Holger Lierse, Movyn John
Changed on:
15 Mar 2024
Fluent Commerce's cloud-native, multi-tenant Software as a Service (SaaS) platform is engineered to optimize order management processes. It uses multi-availability-zone microservices and a pooled-resources multi-tenancy model for performance and scalability. The purpose of this document is to explain the high-level process for migrating a client from a multi-tenant environment to a single-tenant environment.
Multi-Availability Zone: The architecture ensures high availability across multiple distributed availability zones, guaranteeing uninterrupted service even when infrastructure failures occur.
Microservices: Modular components, deployable independently to handle specific order management functions. Each microservice prioritizes agility, fault tolerance, and scalability.
Containerization: AWS ECS or EC2 orchestrates containerized microservices, enabling efficient resource allocation, rapid scaling, and self-healing capabilities.
Multi-Tenant Resource Pooling: In a multi-tenant environment, certain compute resources are pooled, with multiple tenants sharing the same compute, storage, or database resources.
Logical Tenant Isolation: Robust isolation mechanisms ensure that each tenant's data and configurations remain completely separated and secure. Separate message queues, API endpoints, and database partitioning techniques are employed to achieve this.
Multi-tenancy is managed at two tiers: Application Services and Databases, with deployment across multiple availability zones for reliability and scalability. A configuration registry manages independent database connection pools, one per tenant, mapped to the database clusters within each multi-tenant environment.
The application services use a microservices architecture. Each microservice focuses on specific business functions, such as user management, order processing, or inventory tracking. AWS Elastic Load Balancing (ELB) evenly distributes incoming traffic across instances of the microservices deployed in various AZs. Auto Scaling groups dynamically adjust the number of microservice instances based on traffic and resource utilization.
Every request contains a tenant identifier to map each tenant to its underlying Database and other resources. All microservices perform this mapping from the tenant identifier to the underlying isolated compute and database resources. Every API call to the Fluent platform maps the tenant identifier to a dedicated database connection pool that routes to the appropriate database cluster.
This database connection mapping allows tenants to be routed and allocated dynamically at runtime: the underlying infrastructure can be mapped or changed at the application level with no deployment or downtime.
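The per-request mapping described above can be illustrated with a minimal sketch. It assumes an in-memory registry; `ConfigRegistry`, `ConnectionPool`, `handle_request`, and the tenant and cluster names are hypothetical and are not part of the Fluent platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class ConnectionPool:
    """Stand-in for a real database connection pool (hypothetical)."""
    cluster: str  # name of the database cluster this pool connects to


@dataclass
class ConfigRegistry:
    """Hypothetical registry holding one connection pool per tenant."""
    pools: dict[str, ConnectionPool] = field(default_factory=dict)

    def register(self, tenant_id: str, cluster: str) -> None:
        # Creating or replacing a pool is a runtime change; no redeploy.
        self.pools[tenant_id] = ConnectionPool(cluster)

    def pool_for(self, tenant_id: str) -> ConnectionPool:
        # Unknown tenants are rejected rather than routed to a default.
        try:
            return self.pools[tenant_id]
        except KeyError:
            raise LookupError(f"unknown tenant: {tenant_id}") from None


def handle_request(registry: ConfigRegistry, tenant_id: str) -> str:
    """Resolve the tenant on every request and report the target cluster."""
    return registry.pool_for(tenant_id).cluster


registry = ConfigRegistry()
registry.register("tenant-a", "shared-cluster-1")
registry.register("tenant-b", "shared-cluster-1")

before = handle_request(registry, "tenant-a")
# Move tenant-a to a dedicated cluster while the service keeps running.
registry.register("tenant-a", "dedicated-cluster-a")
after = handle_request(registry, "tenant-a")
```

Because the lookup happens on every request, changing the registry entry is enough to reroute a tenant; the other tenant in the sketch stays on the shared cluster.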
Like the application services, databases are deployed across multiple availability zones. Each relational database has a hot standby, which serves the following scenarios:
All databases use an “alias” as a reference with a primary and secondary hot standby. Each tenant is routed to a named DB alias and onto the underlying database cluster. This setup allows us to independently expand and contract the number of clients, database clusters, and infrastructure capacity.
Example of a microservice and its corresponding database, both Multi-AZ
We standardize the software platform, infrastructure configuration, and deployment processes. A summary of the differences between single and multi-tenant is below.
Fluent extensively uses a “Blue-Green” deployment methodology where individual services or the entire application stack is provisioned and gradually “cut over” from old to new versions. Service health metrics undergo automatic checks during a deployment "grace period," triggering an automatic rollback if metrics fall outside normal bounds.
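As a rough illustration of the grace-period check, the sketch below switches traffic to the new stack and rolls back on the first unhealthy reading. It simplifies the gradual cutover to a single switch, and the function names, the error-rate metric, and the 1% threshold are all assumptions rather than Fluent's actual tooling.

```python
from typing import Callable, Iterable


def cut_over(
    health_samples: Iterable[float],
    is_healthy: Callable[[float], bool],
) -> str:
    """Return the stack that ends up serving traffic: "green" or "blue".

    health_samples yields one metric reading per check during the grace
    period; the first unhealthy reading triggers a rollback.
    """
    live = "green"  # traffic has been switched to the new stack
    for sample in health_samples:
        if not is_healthy(sample):
            live = "blue"  # automatic rollback to the old stack
            break
    return live


# Hypothetical metric: error rate, considered normal below 1%.
def error_rate_ok(rate: float) -> bool:
    return rate < 0.01


good_deploy = cut_over([0.001, 0.002, 0.001], error_rate_ok)
bad_deploy = cut_over([0.001, 0.050, 0.001], error_rate_ok)
```

In the second call, the 5% error-rate reading falls outside the assumed bound, so traffic returns to the old stack.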
Database upgrades follow a similar process. Updates are applied to the hot standby and then promoted to the primary. Afterward, the new standby undergoes updates.
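The standby-first order of a database upgrade can be sketched as follows. `Node`, `DbAlias`, and the version strings are hypothetical; the sketch only models the ordering of the steps, not a real database engine.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str


@dataclass
class DbAlias:
    """Hypothetical alias pointing at a primary and its hot standby."""
    primary: Node
    standby: Node

    def promote_standby(self) -> None:
        # Swap roles: the standby becomes primary and vice versa.
        self.primary, self.standby = self.standby, self.primary


def upgrade(alias: DbAlias, new_version: str) -> None:
    alias.standby.version = new_version  # 1. update the hot standby
    alias.promote_standby()              # 2. promote it to primary
    alias.standby.version = new_version  # 3. update the new standby


alias = DbAlias(primary=Node("node-1", "v1"), standby=Node("node-2", "v1"))
upgrade(alias, "v2")
```

After the call, `node-2` is the primary and both nodes run the new version; the primary role is never held by a node that is mid-update.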
These processes are performed many times every month across dozens of production environments.
This section gives an overview of the phases of moving a client from a multi-tenant to a single-tenant environment. The areas covered are:
It’s important to note that Fluent Commerce performs the procedures below during the customer's lowest activity hours.
During an Application Deployment, a new application stack launches side-by-side with the existing environment. It contains all the required stand-alone services for the Fluent OMS. Once the stack is provisioned, any customer-specific plugins are synchronized to the new location. At this point, the new application stack connects to the database in the existing location; however, no external traffic reaches the new application stack. Upon completion, the new application stack undergoes testing to ensure all services, functionality, and connections work as expected.
Note: The application deployment is completed entirely on the Sandbox environment and then repeated on Production.
Once testing of the new environment concludes, a blue/green cutover executes to direct traffic to the new application stack (the green environment). At this point, the new (green) stack takes over all incoming requests. This cutover type eliminates the need for customer-side updates since DNS and other configurations remain unchanged.
Note: The application cutover is first completed entirely on the Sandbox environment and then repeated on Production after post-deployment verification testing and monitoring have passed on Sandbox.
An additional database is replicated from the primary writer node and automatically transitions into hot standby mode, achieving data parity with the live database. Verification checks then ensure data integrity.
Once the hot standby node is synchronized, the sequence numbers are extended and the account configuration is repointed to use the hot standby node as the primary data source.
Once we’re satisfied that the promotion was successful, the link between the primary node and the hot standby node will be severed, and a new read replica will be added to the promoted database (for redundancy).
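The promotion steps above can be sketched with in-memory stand-ins. The `Database` class, the `promote` function, and the gap of 1000 are hypothetical; the sketch also assumes that the purpose of extending the sequence is to keep newly issued IDs clear of any IDs already issued by the original primary.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    name: str
    rows: dict[int, str] = field(default_factory=dict)
    next_id: int = 1

    def insert(self, value: str) -> int:
        new_id = self.next_id
        self.rows[new_id] = value
        self.next_id += 1
        return new_id


def promote(primary: Database, standby: Database, account: dict, gap: int) -> None:
    # 1. Verify data parity before anything changes.
    if standby.rows != primary.rows:
        raise RuntimeError("standby is not in sync; promotion aborted")
    # 2. Extend the sequence so new IDs cannot collide with IDs the old
    #    primary has issued. The gap size is an assumed value.
    standby.next_id = primary.next_id + gap
    # 3. Repoint the account configuration at the promoted node.
    account["data_source"] = standby.name


primary = Database("shared-primary")
for item in ("order-1", "order-2", "order-3"):
    primary.insert(item)

# The standby has replicated every row from the primary.
standby = Database("dedicated-standby", rows=dict(primary.rows),
                   next_id=primary.next_id)
account = {"data_source": primary.name}

promote(primary, standby, account, gap=1000)
first_new_id = standby.insert("order-4")
```

The first record written after promotion receives an ID well above the last one issued by the original primary, which is the kind of jump in sequence IDs a customer may observe.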
This section covers other aspects of the end-to-end process.
All security aspects remain the same as the new environment is identical and managed via Infrastructure-as-code in version control. All certificates, webhook keys and authentication to the platform remain unchanged.
We have performed rigorous testing of the promotion process end-to-end to achieve a high confidence level leading up to the rollout. Additionally, our team goes through multiple test cases on the day of the promotion to the new database before proceeding.
At each application or database upgrade phase, we have an automated suite of tests and verifications to confirm success before progressing to the next stage.
If issues arise during the database promotion, the cutover will not proceed. Before the upgrade, we perform a full system backup that can be restored in the unlikely event of an issue post-cutover.
For Sandbox
For Production
We monitor the entire process of upgrading to a single tenant, specifically the application cutover and the database promotion. Cloud Engineering and Site Reliability Engineering will rigorously monitor all aspects of the environment 24/7 for 3 days. After this hypercare period, the environment continues to be monitored as per BAU monitoring and alerting procedures.
We keep the old environment provisioned for 3 days after the cutover. If an issue arises during the application cutover, we cut traffic back to the old environment. The cutback generally takes a few seconds.
Should an issue arise after we’ve cut over to the new database, we will cut back to the original data source and then copy any new or in-flight data from the new (rolled-back) database to the original database.
You may notice a jump in the database sequence IDs as a result of promoting the hot standby. The jump ensures that entity IDs remain unique identifiers in the database. Your regular operations remain unaffected, and you do not need to take any action.
What do I do in the event of any post-cutover issues?
If there is any issue, please log a ticket with the Service Desk via Jira in the first instance, ensuring you raise it with the corresponding severity level and add detailed information.
Application:
During this process, the new application stack takes over through a blue/green cutover, while the old application stack completes any requests already directed to it, resulting in zero outage.
Database:
During the database promotion, expect some instability in the environment: within the designated window, service may be degraded for up to 30 minutes, during which the platform and APIs might be unavailable.
The platform will generate 500 error responses and timeouts for incoming requests in such a scenario. These errors occur because the API cannot communicate with the database during the promotion. Typically, service degradation lasts no more than 1 minute, but preparing for a 30-minute maintenance period is recommended.
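Integrations that call the API during the promotion window may want to retry failed requests rather than fail outright. The sketch below shows one possible client-side approach, retrying with capped exponential backoff; `call_with_retry`, `send`, and all delay values are assumptions and not a Fluent-provided client. Retrying in this way is only safe for requests that can be repeated without side effects.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_total_wait: float = 30 * 60,  # prepare for the 30-minute window
    first_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or the budget runs out.

    send is a hypothetical callable that performs one API request and
    returns the HTTP status code, raising TimeoutError on a timeout.
    """
    waited = 0.0
    delay = first_delay
    while True:
        try:
            status = send()
        except TimeoutError:
            status = 504  # treat a timeout like a server-side failure
        if status < 500:
            return status
        if waited + delay > max_total_wait:
            return status  # give up and surface the last failure
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # exponential backoff, capped


# Simulated promotion: two failures, then the API is reachable again.
responses = iter([500, TimeoutError(), 200])
slept: list[float] = []


def fake_send() -> int:
    result = next(responses)
    if isinstance(result, Exception):
        raise result
    return result


final_status = call_with_retry(fake_send, sleep=slept.append)
```

In the simulated run, the client waits 1 second and then 2 seconds before the third attempt succeeds. The `sleep` parameter is injected so the example runs without real delays.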
Copyright © 2024 Fluent Retail Pty Ltd (trading as Fluent Commerce). All rights reserved. No materials on this docs.fluentcommerce.com site may be used in any way and/or for any purpose without prior written authorisation from Fluent Commerce. Current customers and partners shall use these materials strictly in accordance with the terms and conditions of their written agreements with Fluent Commerce or its affiliates.