Skip to main content

Command Palette

Search for a command to run...

Designing NodeSync: A Distributed Key-Value Store with Replication, Consensus, and Failure Handling

Exploring Heartbeats, leader election, replication, and consistency models

Updated
5 min read
Designing NodeSync: A Distributed Key-Value Store with Replication, Consensus, and Failure Handling
M

Aspiring MS/MEng in Computer Science (Fall 2026) specializing in systems, software engineering, and scalable distributed technologies. Proven track record as an open-source contributor (100+ merged PRs in Checkstyle) and IEEE-published researcher in blockchain smart-contract systems. Skilled in building high-performance developer platforms, optimizing code quality pipelines, and delivering research-driven, large-scale engineering solutions.

Introduction

Distributed systems are defined less by what they do and more by what can go wrong. NodeSync is a distributed key-value store built to explore this reality firsthand. Instead of abstract algorithms, NodeSync focuses on concrete behaviors: node crashes, network delays, and leadership changes.

This article focuses on design decisions, trade-offs, and implementation insights gained while building NodeSync.

🔗 Project: https://github.com/MANISH-K-07/NodeSync

📘 Docs: https://manish-k-07.github.io/NodeSync/


Design Goals

NodeSync was designed with the following principles:

  • Minimal external dependencies

  • Explicit network communication

  • Clear separation of responsibilities

  • Emphasis on correctness over performance

The project intentionally avoids:

  • Frameworks that hide complexity

  • External coordination services

  • Advanced optimizations

This ensures that every behavior is visible and understandable.


Node Responsibilities

Each node in NodeSync acts independently.

Responsibilities include:

  • Managing local key-value state

  • Handling client commands

  • Monitoring peer health

  • Participating in leader election

  • Replicating writes when required

There is no special bootstrap node or controller.

3 nodes (example)

python nodes/node.py 5000 127.0.0.1:5001 127.0.0.1:5002
python nodes/node.py 5001 127.0.0.1:5000 127.0.0.1:5002
python nodes/node.py 5002 127.0.0.1:5000 127.0.0.1:5001

The node with the highest port becomes leader.

16:55:49 [NodeSync  ] Node 5000 running on 127.0.0.1:5000
16:55:49 [NodeSync  ] Peers: [('127.0.0.1', 5001), ('127.0.0.1', 5002)]
16:55:49 [ELECTION  ] New leader elected: 5002

Each node runs as an independent process.

  • The first argument is the node's port (used as its ID)

  • Remaining arguments are peer addresses


Client Interaction Model

Clients communicate with nodes over TCP using simple text-based commands.

Supported commands include:

  • SET key value

  • GET key

  • LEADER

  • CONSISTENCY strong | eventual

Clients are unaware of which node is leader. This transparency simplifies client logic but increases system responsibility.

Client Interaction (PowerShell Example)

$client = New-Object System.Net.Sockets.TcpClient("127.0.0.1", 5002)
$stream = $client.GetStream()
$writer = New-Object System.IO.StreamWriter($stream)
$reader = New-Object System.IO.StreamReader($stream)
$writer.AutoFlush = $true

$writer.WriteLine("SET hello world")
$reader.ReadLine()

$writer.WriteLine("GET hello")
$reader.ReadLine()

Forwarding Writes to the Leader

When a client sends a SET command to a follower:

  • The follower forwards the command to the leader

  • The leader executes the write

  • The response is relayed back to the client

This forwarding mechanism ensures:

  • Centralized write ordering

  • Reduced conflict risk

  • Predictable behavior under concurrency


Replication and Quorum

Replication is explicit and acknowledgment-based.

Key concepts:

  • Leader counts itself as one acknowledgment

  • Followers respond with ACK after applying updates

  • Strong consistency requires quorum

  • Eventual consistency does not

This design makes quorum mechanics observable rather than implicit.


Handling Failures

Failures are treated as first-class events.

NodeSync handles:

  • Peer crashes

  • Temporary unavailability

  • Recovery after failure

Behavior includes:

  • Marking peers DOWN on heartbeat failure

  • Logging recovery events

  • Re-electing leaders dynamically

These behaviors expose the instability that real systems must tolerate.

Example Failure Scenario

Below is a typical sequence when a leader node fails:

16:55:32 [NodeSync  ] Node 5001 running on 127.0.0.1:5001
16:55:32 [NodeSync  ] Peers: [('127.0.0.1', 5000), ('127.0.0.1', 5002)]
16:55:34 [ELECTION  ] New leader elected: 5002
16:56:03 [FAILURE   ] Peer ('127.0.0.1', 5002) is DOWN
16:56:03 [ELECTION  ] New leader elected: 5001

This sequence demonstrates:

  1. Failure detection via heartbeat timeout

  2. Automatic leader re-election

  3. Continued system operation without manual intervention


Consistency as a Runtime Choice

Allowing consistency to be changed at runtime provides a powerful learning tool.

It demonstrates:

  • How availability changes under failures

  • How write guarantees affect success rates

  • Why system designers must choose carefully

Few features made the trade-offs as tangible as this one.

Available Modes

  • Eventual consistency (default)

    • Writes are replicated asynchronously

    • Lower latency

    • Temporary inconsistencies may occur

  • Strong consistency (quorum-based)

    • Leader waits for acknowledgements from a majority of nodes

    • Higher latency

    • Linearizable writes as long as quorum is available

Switching Consistency at Runtime

Consistency can be changed dynamically using a client command:

CONSISTENCY eventual
CONSISTENCY strong

The consistency mode applies to the node receiving the command (typically the leader).

If quorum cannot be reached in strong mode, the write fails with:

FAIL: quorum not reached

This enables direct comparison of consistency-performance trade-offs.


Observability as a Design Requirement

Logging was not added as an afterthought.

NodeSync logs are:

  • Structured

  • Timestamped

  • Human-readable

  • Event-driven

This made debugging distributed behavior feasible.

Log Format

Each log entry follows the format:

  • HH:MM:SS [TAG ] message

Example:

17:09:47 [NodeSync  ] Node 5000 running on 127.0.0.1:5000

What This Project Taught Me

Key takeaways include:

  • Distributed systems are dominated by edge cases

  • Failure handling outweighs happy-path logic

  • Simplicity improves correctness

  • Design trade-offs are unavoidable

Most importantly, building NodeSync changed how I read research papers and system designs. Abstract ideas now map to real behaviors.


Closing Thoughts

NodeSync is a learning system, not a production system. But its value lies precisely in that limitation. By stripping away abstraction layers, it exposes the true nature of distributed systems.

For anyone studying systems engineering or distributed computing, building a project like NodeSync is an invaluable experience.

👉https://github.com/MANISH-K-07/NodeSync


👨‍💻About the Author

I’m Manish Krishna Kandrakota — a computer science undergraduate exploring distributed systems, backend infrastructure, and systems-level engineering. I build projects from scratch to understand how real-world systems handle failures, coordination, and consistency. I plan to pursue graduate studies in computer science with a focus on research-driven software engineering and security.

My recent work includes distributed storage systems, static analysis tools, and security-focused software projects. I write to document my learning and share practical insights from building complex systems hands-on. I have built multiple static analysis and security-focused tools, including CodeChecker, PyScope, and SecureFlow, with an emphasis on clarity of design, explicit trade-offs, and academic relevance.

🔗 GitHub: MANISH-K-07
🔗 LinkedIn: manish-k-kandrakota