Post 8 of 15 | Phase 4: Fault Tolerance
Fault Tolerance — Keeping Your App Alive When Things Break
In every post so far we have assumed that services are always available and always respond correctly. In reality, services crash, slow down, run out of memory, and fail in unexpected ways. This is not a rare edge case. In a distributed system with many services, something is always failing somewhere.
Fault tolerance is the set of techniques that keep your overall application working even when individual parts of it are broken. Moleculer has four built-in fault tolerance mechanisms: Timeout, Retry, Circuit Breaker, and Bulkhead. This post covers all four, along with fallback responses.
The Restaurant Analogy Revisited
Before writing any code, understand the problem through a real-world scenario.
You own a restaurant. One day your kitchen equipment breaks down and every order takes 45 minutes instead of 15. Customers are waiting, getting frustrated, and new customers are still walking in and placing orders. Soon the entire restaurant is backed up. Nobody is getting served. The restaurant collapses under the load of waiting orders.
What should you have done?
- After a customer has waited 20 minutes with no food, tell them "we cannot serve you right now." That is a Timeout.
- If the kitchen fails on the first attempt, try again once or twice before giving up. That is a Retry.
- If the kitchen has failed 10 times in a row, stop sending orders there and tell customers immediately instead of making them wait. That is a Circuit Breaker.
- Only allow 5 orders in the kitchen at once. If more come in, queue them or reject them. That is a Bulkhead.
These four concepts apply directly to your microservices.
Fault Tolerance Mechanism 1: Timeout
A timeout says: if this action does not respond within a certain time, stop waiting and throw an error.
Without a timeout, a slow service can block your entire application. Your calling service waits forever, occupying resources, and eventually your whole system grinds to a halt.
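To make the idea concrete, here is a minimal sketch of a timeout in plain JavaScript. It is not Moleculer's internal code. The names withTimeout and TimeoutError are made up for this example. The sketch races the real work against a timer and rejects if the timer wins.

```javascript
// Sketch only: a generic timeout wrapper, not Moleculer's implementation.
class TimeoutError extends Error {
    constructor(ms) {
        super(`Operation timed out after ${ms} ms`);
        this.name = "TimeoutError"; // hypothetical name used only in this sketch
    }
}

function withTimeout(promise, ms) {
    let timer;
    const deadline = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    });
    // Whichever settles first wins. Always clear the timer afterwards
    // so it does not keep the process alive.
    return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage: the task needs 200 ms, but the caller only waits 50 ms.
const slowTask = new Promise(resolve => setTimeout(() => resolve("done"), 200));
withTimeout(slowTask, 50).catch(err => console.log(err.name)); // prints "TimeoutError"
```

Note that the slow task itself keeps running in the background. A timeout only stops the caller from waiting.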
Global timeout in moleculer.config.js:
module.exports = {
    requestTimeout: 10 * 1000 // 10 seconds for every action call
};
Per-call timeout override:
actions: {
    async createOrder(ctx) {
        // This specific call has a 3 second timeout,
        // which overrides the global 10 second timeout
        const user = await ctx.call("user.getById", { id: ctx.params.userId }, {
            timeout: 3000
        });
        return user;
    }
}
Per-action timeout on the action definition itself:
module.exports = {
    name: "report",
    actions: {
        generate: {
            // This action allows up to 30 seconds
            // because generating reports is slow
            timeout: 30000,
            handler(ctx) {
                // slow report generation
            }
        }
    }
};
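With three places to set a timeout, it helps to see which one wins. The sketch below assumes the most specific setting takes priority: per-call first, then per-action, then the global value. The function resolveTimeout is a made-up helper for illustration, not a Moleculer API.

```javascript
// Sketch of the assumed precedence: per-call > per-action > global.
// resolveTimeout is a hypothetical helper, not part of Moleculer.
function resolveTimeout(callOptions, actionDef, brokerOptions) {
    if (callOptions.timeout != null) return callOptions.timeout; // per-call
    if (actionDef.timeout != null) return actionDef.timeout;     // per-action
    return brokerOptions.requestTimeout;                         // global
}

const brokerOptions = { requestTimeout: 10 * 1000 };

console.log(resolveTimeout({ timeout: 3000 }, {}, brokerOptions));  // 3000
console.log(resolveTimeout({}, { timeout: 30000 }, brokerOptions)); // 30000
console.log(resolveTimeout({}, {}, brokerOptions));                 // 10000
```

The check uses != null on purpose, so an explicit timeout of 0 is kept as a real value instead of falling through to the next level.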
What happens when a timeout occurs:
Moleculer throws a RequestTimeoutError. The calling service receives this error and can handle it:
actions: {
    async getDashboard(ctx) {
        try {
            const data = await ctx.call("slow.service", {}, { timeout: 3000 });
            return data;
        } catch (err) {
            if (err.name === "RequestTimeoutError") {
                // Return a fallback response instead of crashing
                return { message: "Service is slow right now. Please try again." };
            }
            throw err;
        }
    }
}
Setting timeout to zero disables it for that call:
// No timeout for this call — wait forever
const result = await ctx.call("long.running.job", {}, { timeout: 0 });
Fault Tolerance Mechanism 2: Retry
A retry says: if this action call fails, automatically try again a few times before giving up.
Some failures are temporary. A service might be restarting, a database connection might be briefly lost, a network blip might have occurred. Retrying after a short delay often resolves these temporary failures without any user-visible error.
Global retry policy in moleculer.config.js:
module.exports = {
    retryPolicy: {
        enabled: true,
        retries: 3,     // Try up to 3 times after the first failure
        delay: 100,     // Wait 100ms before first retry
        maxDelay: 2000, // Never wait more than 2 seconds between retries
        factor: 2,      // Double the delay each time (exponential backoff)
        check: err => err && !!err.retryable // Only retry if error is marked retryable
    }
};
With factor: 2 and delay: 100, the retry timing looks like this:
First attempt → fails
Wait 100ms
Second attempt → fails
Wait 200ms
Third attempt → fails
Wait 400ms
Fourth attempt → fails or succeeds
Give up if still failing
This is called exponential backoff. Waiting longer between each retry gives the failing service more time to recover.
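The timing above can be calculated with a few lines of code. This sketch assumes the delay before retry n is delay × factor^(n − 1), capped at maxDelay. The helper retryDelays is made up for this example.

```javascript
// Sketch: compute the wait before each retry (hypothetical helper).
function retryDelays({ retries, delay, maxDelay, factor }) {
    const delays = [];
    for (let attempt = 1; attempt <= retries; attempt++) {
        // Exponential growth, but never more than maxDelay
        delays.push(Math.min(delay * Math.pow(factor, attempt - 1), maxDelay));
    }
    return delays;
}

console.log(retryDelays({ retries: 3, delay: 100, maxDelay: 2000, factor: 2 }));
// [ 100, 200, 400 ]

console.log(retryDelays({ retries: 6, delay: 100, maxDelay: 2000, factor: 2 }));
// [ 100, 200, 400, 800, 1600, 2000 ]
```

The second call shows maxDelay at work. Without the cap the sixth wait would be 3200ms.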
Per-call retry override:
actions: {
    async processPayment(ctx) {
        // Payment is critical: retry up to 5 times
        const result = await ctx.call("payment.charge", {
            amount: ctx.params.amount
        }, {
            retries: 5
        });
        return result;
    }
}
Making your errors retryable:
By default, Moleculer only retries errors marked as retryable. You control this when throwing errors:
const { MoleculerRetryableError } = require("moleculer").Errors;
actions: {
    getFromExternalAPI(ctx) {
        try {
            // Call an external API
        } catch (err) {
            if (err.message.includes("ECONNRESET")) {
                // Network error: worth retrying
                throw new MoleculerRetryableError("External API unavailable", 503);
            }
            // Logic error: do not retry
            throw err;
        }
    }
}
Important: Do not retry non-idempotent operations blindly
An idempotent operation is one that produces the same result no matter how many times you run it. Reading data is idempotent. Charging a credit card is not — you do not want to charge three times just because the response was slow.
Be careful enabling global retries. It is safer to enable retries per-call for operations you know are safe to retry.
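If you do need to retry an operation like a payment, one common approach is to make it idempotent first. The sketch below is one possible way, using an idempotency key. Everything in it is hypothetical: the charge function, the key format, and the in-memory Map. A real service would store the keys in a database.

```javascript
// Sketch: a charge handler that is safe to retry (all names are hypothetical).
const processed = new Map(); // idempotencyKey -> receipt
let chargeCount = 0;         // how many times money was actually taken

function charge({ idempotencyKey, amount }) {
    // A retry with the same key returns the stored receipt
    // instead of charging again.
    if (processed.has(idempotencyKey)) return processed.get(idempotencyKey);

    chargeCount++;
    const receipt = { chargeId: `ch_${chargeCount}`, amount };
    processed.set(idempotencyKey, receipt);
    return receipt;
}

const first = charge({ idempotencyKey: "order-42", amount: 100 });
const retry = charge({ idempotencyKey: "order-42", amount: 100 });

console.log(first === retry, chargeCount); // true 1
```

The caller sends the same key on every retry of the same order, so the customer is charged once no matter how many attempts are made.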
Fault Tolerance Mechanism 3: Circuit Breaker
This is the most important fault tolerance pattern. Understand it well.
The Problem Without Circuit Breaker
Imagine your user service is down. Every time the order service calls user.getById, it waits 10 seconds for the timeout, then fails. At 100 incoming requests per second, up to 1,000 requests are waiting at any moment (100 per second × 10 seconds), each one occupying memory and connections. Your order service slows to a crawl because of one broken downstream service.
What Circuit Breaker Does
The Circuit Breaker monitors calls to each service. If too many calls fail within a time window, it opens the circuit. While the circuit is open, calls to that service are not attempted at all: the caller gets an error right away instead of waiting for a timeout. This protects the calling service from being dragged down by a broken dependency.
There are three states:
Closed — Normal operation. Calls go through. Failures are counted.
Open — Too many failures detected. Calls are blocked immediately. No actual call is made. Error is thrown instantly.
Half-Open — After a cooldown period, one test call is allowed through. If it succeeds, circuit closes again. If it fails, circuit stays open.
CLOSED → (too many failures) → OPEN → (cooldown passes) → HALF-OPEN → (test succeeds) → CLOSED
HALF-OPEN → (test fails) → OPEN
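The three states can be sketched as a small state machine. This is a simplified illustration, not Moleculer's implementation. The class and method names are made up, the option names follow the Moleculer config, and the time window is left out to keep the sketch short. The clock is passed in so the example can run without real waiting.

```javascript
// Sketch of the three states (hypothetical class, no time window).
class CircuitBreaker {
    constructor({ threshold, minRequestCount, halfOpenTime, now = Date.now }) {
        Object.assign(this, { threshold, minRequestCount, halfOpenTime, now });
        this.close();
    }
    // Returns true if a call may go through right now.
    allowRequest() {
        if (this.state === "open" && this.now() - this.openedAt >= this.halfOpenTime) {
            this.state = "half-open"; // cooldown is over
        }
        if (this.state === "half-open") {
            if (this.testInFlight) return false; // only one test call at a time
            this.testInFlight = true;
            return true;
        }
        return this.state === "closed";
    }
    recordSuccess() {
        if (this.state === "half-open") return this.close(); // test call succeeded
        this.count++;
    }
    recordFailure() {
        if (this.state === "half-open") return this.open(); // test call failed
        this.count++;
        this.failures++;
        const rate = this.failures / this.count;
        if (this.count >= this.minRequestCount && rate >= this.threshold) this.open();
    }
    open() { this.state = "open"; this.openedAt = this.now(); this.testInFlight = false; }
    close() { this.state = "closed"; this.count = 0; this.failures = 0; this.testInFlight = false; }
}

let clock = 0; // fake clock in milliseconds
const cb = new CircuitBreaker({ threshold: 0.5, minRequestCount: 3, halfOpenTime: 5000, now: () => clock });

for (let i = 0; i < 3; i++) { cb.allowRequest(); cb.recordFailure(); }
console.log(cb.state);          // open
console.log(cb.allowRequest()); // false
clock += 5000;                  // cooldown passes
console.log(cb.allowRequest()); // true (the single test call)
cb.recordSuccess();
console.log(cb.state);          // closed
```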
Enabling Circuit Breaker in moleculer.config.js:
module.exports = {
    circuitBreaker: {
        enabled: true,
        threshold: 0.5,       // Open if 50% of calls fail
        minRequestCount: 20,  // Need at least 20 requests before evaluating
        windowTime: 60,       // Look at failures in the last 60 seconds
        halfOpenTime: 10000,  // Wait 10 seconds before allowing a test call
        check: err => err && err.code >= 500 // Only count 5xx errors as failures
    }
};
Let us go through each option:
- threshold: 0.5 means if 50 percent or more of calls fail, open the circuit
- minRequestCount: 20 means do not open the circuit until at least 20 calls have been made. Prevents opening on just one or two failures during startup.
- windowTime: 60 means count failures that happened in the last 60 seconds only
- halfOpenTime: 10000 means after the circuit opens, wait 10 seconds before trying one test call
- check defines what counts as a failure. Here only server errors (500+) count. A 404 or validation error does not trip the circuit breaker.
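The sketch below shows how these options combine. The check function is the one from the config above. The shouldOpen helper and the sample errors are made up for illustration and only approximate the decision the circuit breaker makes.

```javascript
// The check from the config: only errors with code >= 500 count.
const check = err => Boolean(err) && err.code >= 500;

const serverError = Object.assign(new Error("Database down"), { code: 503 });
const notFound = Object.assign(new Error("User not found"), { code: 404 });
const plainError = new Error("No code set");

console.log(check(serverError)); // true
console.log(check(notFound));    // false
console.log(check(plainError));  // false, because undefined >= 500 is false

// Hypothetical helper: should the circuit open for these counters?
function shouldOpen({ count, failures }, { threshold, minRequestCount }) {
    return count >= minRequestCount && failures / count >= threshold;
}

const options = { threshold: 0.5, minRequestCount: 20 };
console.log(shouldOpen({ count: 10, failures: 10 }, options)); // false: fewer than 20 calls
console.log(shouldOpen({ count: 20, failures: 9 }, options));  // false: 45% is below 50%
console.log(shouldOpen({ count: 20, failures: 10 }, options)); // true: 50% reaches the threshold
```

Notice the plain Error case. An error thrown with new Error(...) has no code, so this check does not count it as a failure.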
What the caller sees:
actions: {
    async createOrder(ctx) {
        try {
            const user = await ctx.call("user.getById", { id: ctx.params.userId });
            return user;
        } catch (err) {
            // When the circuit is open for every endpoint of the action, Moleculer
            // has nowhere to route the call and throws a ServiceNotAvailableError.
            if (err.name === "ServiceNotAvailableError") {
                // User service is known to be broken.
                // Return a graceful response instead of making the user wait.
                return { error: "User service is temporarily unavailable" };
            }
            throw err;
        }
    }
}
Without the circuit breaker, every call waits 10 seconds before failing. With the circuit breaker, once it opens, every call fails in milliseconds. Your order service stays responsive even though user service is broken.
Testing Circuit Breaker
Create a file named circuit-test.js to see circuit breaker behavior:
"use strict";
const { ServiceBroker } = require("moleculer");
const broker = new ServiceBroker({
    logLevel: "info",
    circuitBreaker: {
        enabled: true,
        threshold: 0.5,
        minRequestCount: 3, // Low number for testing purposes
        windowTime: 60,
        halfOpenTime: 5000,
        // Count every error as a failure. The default check only counts
        // errors with code >= 500, and a plain Error has no code.
        check: err => !!err
    }
});
// A service that always fails
broker.createService({
    name: "broken",
    actions: {
        doSomething(ctx) {
            throw new Error("I am always broken");
        }
    }
});
// A service that calls the broken service
broker.createService({
    name: "caller",
    actions: {
        async test(ctx) {
            try {
                await ctx.call("broken.doSomething", {});
            } catch (err) {
                return `Error type: ${err.name} — ${err.message}`;
            }
        }
    }
});
broker.start()
    .then(async () => {
        // Make several calls and watch the error type change
        for (let i = 1; i <= 8; i++) {
            const result = await broker.call("caller.test", {});
            console.log(`Call ${i}: ${result}`);
            await new Promise(r => setTimeout(r, 200));
        }
        await broker.stop();
    });
Run this file:
node circuit-test.js
Output (the exact message text may differ between Moleculer versions):
Call 1: Error type: Error — I am always broken
Call 2: Error type: Error — I am always broken
Call 3: Error type: Error — I am always broken
Call 4: Error type: ServiceNotAvailableError — Service 'broken.doSomething' is not available.
Call 5: Error type: ServiceNotAvailableError — Service 'broken.doSomething' is not available.
Call 6: Error type: ServiceNotAvailableError — Service 'broken.doSomething' is not available.
Call 7: Error type: ServiceNotAvailableError — Service 'broken.doSomething' is not available.
Call 8: Error type: ServiceNotAvailableError — Service 'broken.doSomething' is not available.
After 3 failures the circuit opens. Subsequent calls fail instantly with a ServiceNotAvailableError, without actually calling the broken service.
Fault Tolerance Mechanism 4: Bulkhead
A bulkhead limits how many calls to an action can execute at the same time. If the limit is reached, additional calls are queued or rejected.
The name comes from ship design. A bulkhead is a wall that divides a ship into sections. If one section floods, the bulkhead prevents the entire ship from sinking. In software, if one service is overwhelmed, the bulkhead prevents it from taking down everything else.
Enabling Bulkhead in moleculer.config.js:
module.exports = {
    bulkhead: {
        enabled: true,
        concurrency: 10,   // Only 10 calls active at the same time
        maxQueueSize: 100  // Queue up to 100 additional calls
    }
};
With these settings:
- First 10 calls execute immediately
- Calls 11 to 110 wait in a queue
- Call 111 and beyond are rejected with a QueueIsFullError
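The three outcomes above can be simulated with a small sketch. This is a simplified model, not Moleculer's code. The Bulkhead class and its submit and finish methods are made up for this example, and no real work is executed.

```javascript
// Sketch: admission logic of a bulkhead (hypothetical class).
class Bulkhead {
    constructor({ concurrency, maxQueueSize }) {
        this.concurrency = concurrency;
        this.maxQueueSize = maxQueueSize;
        this.inFlight = 0;
        this.queue = [];
    }
    // Decide what happens to a newly arriving call.
    submit(task) {
        if (this.inFlight < this.concurrency) {
            this.inFlight++;
            return "running";
        }
        if (this.queue.length < this.maxQueueSize) {
            this.queue.push(task);
            return "queued";
        }
        return "rejected"; // Moleculer would throw a QueueIsFullError here
    }
    // Called when a running call finishes.
    finish() {
        if (this.queue.length > 0) {
            this.queue.shift(); // the freed slot goes to the oldest queued call
        } else {
            this.inFlight--;
        }
    }
}

const bulkhead = new Bulkhead({ concurrency: 10, maxQueueSize: 100 });
const tally = { running: 0, queued: 0, rejected: 0 };

// 111 calls arrive at once and none has finished yet
for (let i = 1; i <= 111; i++) {
    tally[bulkhead.submit(`call-${i}`)]++;
}
console.log(tally); // { running: 10, queued: 100, rejected: 1 }
```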
Per-action bulkhead:
You can also set bulkhead limits on individual actions:
module.exports = {
    name: "report",
    actions: {
        generate: {
            // Report generation is heavy: only 3 at a time
            bulkhead: {
                enabled: true,
                concurrency: 3,
                maxQueueSize: 10
            },
            async handler(ctx) {
                // Heavy report generation
                // (generateHeavyReport is a placeholder for your own function)
                await generateHeavyReport();
                return { status: "done" };
            }
        },
        // Other actions in this service are not limited
        list: {
            handler(ctx) {
                return [];
            }
        }
    }
};
This is useful when one action is resource-heavy and you do not want it to consume all available resources and starve other actions.
Fallback — The Safety Net
A fallback is a function that runs when an action call fails for any reason — timeout, circuit open, service not found, any error. Instead of propagating the error to the user, you return a default response.
Fallback can be defined at the call level:
actions: {
    async getDashboard(ctx) {
        const result = await ctx.call("recommendations.get", {
            userId: ctx.params.userId
        }, {
            // If recommendations service fails for any reason,
            // return this instead of throwing an error
            fallbackResponse: {
                recommendations: [],
                message: "Recommendations unavailable right now"
            }
        });
        return result;
    }
}
Fallback can also be a function:
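const result = await ctx.call("recommendations.get", {
    userId: ctx.params.userId
}, {
    // The function receives the context of the failed call and the error.
    // Log through ctx.broker, because "this" is not the service in here.
    fallbackResponse(ctx, err) {
        ctx.broker.logger.warn(`Recommendations failed: ${err.message}`);
        return {
            recommendations: [],
            message: "Showing default recommendations"
        };
    }
});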
Use fallback for non-critical features. Recommendations, personalization, analytics — these are nice to have but your app should work without them.
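The idea behind a fallback is small enough to sketch in plain JavaScript. This is an illustration of the pattern, not Moleculer's code. The helper callWithFallback is made up, and it accepts either a fixed value or a function, like the two examples above.

```javascript
// Sketch: run an action, return the fallback if it fails (hypothetical helper).
async function callWithFallback(action, fallbackResponse) {
    try {
        return await action();
    } catch (err) {
        // A function fallback gets the error, a plain value is returned as is
        return typeof fallbackResponse === "function"
            ? fallbackResponse(err)
            : fallbackResponse;
    }
}

// A stand-in for a call to a broken service
const failing = async () => {
    throw new Error("recommendations service is down");
};

callWithFallback(failing, { recommendations: [] })
    .then(result => console.log(result));
// { recommendations: [] }

callWithFallback(failing, err => ({ recommendations: [], reason: err.message }))
    .then(result => console.log(result));
// { recommendations: [], reason: 'recommendations service is down' }
```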
Putting It All Together — Production-Ready Config
Here is a realistic moleculer.config.js for a production application with all fault tolerance features enabled:
"use strict";
module.exports = {
    namespace: "ecommerce",
    nodeID: null,
    logLevel: "warn",
    transporter: "nats://localhost:4222",
    // Global timeout: 10 seconds
    requestTimeout: 10 * 1000,
    // Retry policy: retry up to 3 times with exponential backoff
    retryPolicy: {
        enabled: true,
        retries: 3,
        delay: 100,
        maxDelay: 2000,
        factor: 2,
        check: err => err && !!err.retryable
    },
    // Circuit breaker
    circuitBreaker: {
        enabled: true,
        threshold: 0.5,
        minRequestCount: 20,
        windowTime: 60,
        halfOpenTime: 10 * 1000,
        check: err => err && err.code >= 500
    },
    // Bulkhead: limit concurrent executions of each action
    bulkhead: {
        enabled: true,
        concurrency: 10,
        maxQueueSize: 100
    },
    // Load balancing
    registry: {
        strategy: "RoundRobin",
        preferLocal: true
    }
};
When to Use Each Mechanism
- Timeout: Always. Set a global timeout. Every call should have a limit.
- Retry: For network calls, external APIs, temporary failures. Not for payment processing or any non-idempotent operation.
- Circuit Breaker: Always in production. Protects healthy services from being dragged down by broken ones.
- Bulkhead: For resource-heavy operations like report generation, file processing, or calls to slow external services.
- Fallback: For non-critical features. Recommendations, analytics, personalization. Your app should work without them.
Summary
- Fault tolerance is not optional in production microservices. Things will break.
- Timeout prevents your app from waiting forever. Set a global timeout always.
- Retry automatically retries failed calls. Use exponential backoff. Be careful with non-idempotent operations.
- Circuit Breaker monitors failure rates. When too many fail, it opens and rejects calls instantly. This protects healthy services from broken ones.
- Circuit has three states: Closed (normal), Open (blocking), Half-Open (testing recovery).
- Bulkhead limits concurrent calls to prevent resource exhaustion.
- Fallback provides a default response when everything else fails.
- Configure all four in moleculer.config.js for global behavior.
- Override per-call or per-action when specific operations need different limits.
Up Next
Post 9 covers Caching — one of the easiest wins for performance in Moleculer. Built-in caching with zero extra code on most actions. We will cover memory caching, Redis caching, cache keys, TTL, and how to invalidate cache when data changes.
Course Progress: 8 of 15 posts complete.