Designing NodeSync: A Distributed Key-Value Store with Replication, Consensus, and Failure Handling
Exploring Heartbeats, leader election, replication, and consistency models

Aspiring MS/MEng in Computer Science (Fall 2026) specializing in systems, software engineering, and scalable distributed technologies. Proven track record as an open-source contributor (100+ merged PRs in Checkstyle) and IEEE-published researcher in blockchain smart-contract systems. Skilled in building high-performance developer platforms, optimizing code quality pipelines, and delivering research-driven, large-scale engineering solutions.
Introduction
Distributed systems are defined less by what they do and more by what can go wrong. NodeSync is a distributed key-value store built to explore this reality firsthand. Instead of abstract algorithms, NodeSync focuses on concrete behaviors: node crashes, network delays, and leadership changes.
This article focuses on design decisions, trade-offs, and implementation insights gained while building NodeSync.
🔗 Project: https://github.com/MANISH-K-07/NodeSync
📘 Docs: https://manish-k-07.github.io/NodeSync/
Design Goals
NodeSync was designed with the following principles:
Minimal external dependencies
Explicit network communication
Clear separation of responsibilities
Emphasis on correctness over performance
The project intentionally avoids:
Frameworks that hide complexity
External coordination services
Advanced optimizations
This ensures that every behavior is visible and understandable.
Node Responsibilities
Each node in NodeSync acts independently.
Responsibilities include:
Managing local key-value state
Handling client commands
Monitoring peer health
Participating in leader election
Replicating writes when required
There is no special bootstrap node or controller.
3 nodes (example)
python nodes/node.py 5000 127.0.0.1:5001 127.0.0.1:5002
python nodes/node.py 5001 127.0.0.1:5000 127.0.0.1:5002
python nodes/node.py 5002 127.0.0.1:5000 127.0.0.1:5001
The node with the highest port becomes leader.
16:55:49 [NodeSync ] Node 5000 running on 127.0.0.1:5000
16:55:49 [NodeSync ] Peers: [('127.0.0.1', 5001), ('127.0.0.1', 5002)]
16:55:49 [ELECTION ] New leader elected: 5002
Each node runs as an independent process.
The first argument is the node's port (used as its ID)
Remaining arguments are peer addresses
Client Interaction Model
Clients communicate with nodes over TCP using simple text-based commands.
Supported commands include:
SET key value
GET key
LEADER
CONSISTENCY strong | eventual
Clients are unaware of which node is leader. This transparency simplifies client logic but increases system responsibility.
Client Interaction (PowerShell Example)
$client = New-Object System.Net.Sockets.TcpClient("127.0.0.1", 5002)
$stream = $client.GetStream()
$writer = New-Object System.IO.StreamWriter($stream)
$reader = New-Object System.IO.StreamReader($stream)
$writer.AutoFlush = $true
$writer.WriteLine("SET hello world")
$reader.ReadLine()
$writer.WriteLine("GET hello")
$reader.ReadLine()
Forwarding Writes to the Leader
When a client sends a SET command to a follower:
The follower forwards the command to the leader
The leader executes the write
The response is relayed back to the client
This forwarding mechanism ensures:
Centralized write ordering
Reduced conflict risk
Predictable behavior under concurrency
Replication and Quorum
Replication is explicit and acknowledgment-based.
Key concepts:
Leader counts itself as one acknowledgment
Followers respond with ACK after applying updates
Strong consistency requires quorum
Eventual consistency does not
This design makes quorum mechanics observable rather than implicit.
Handling Failures
Failures are treated as first-class events.
NodeSync handles:
Peer crashes
Temporary unavailability
Recovery after failure
Behavior includes:
Marking peers DOWN on heartbeat failure
Logging recovery events
Re-electing leaders dynamically
These behaviors expose the instability that real systems must tolerate.
Example Failure Scenario
Below is a typical sequence when a leader node fails:
16:55:32 [NodeSync ] Node 5001 running on 127.0.0.1:5001
16:55:32 [NodeSync ] Peers: [('127.0.0.1', 5000), ('127.0.0.1', 5002)]
16:55:34 [ELECTION ] New leader elected: 5002
16:56:03 [FAILURE ] Peer ('127.0.0.1', 5002) is DOWN
16:56:03 [ELECTION ] New leader elected: 5001
This sequence demonstrates:
Failure detection via heartbeat timeout
Automatic leader re-election
Continued system operation without manual intervention
Consistency as a Runtime Choice
Allowing consistency to be changed at runtime provides a powerful learning tool.
It demonstrates:
How availability changes under failures
How write guarantees affect success rates
Why system designers must choose carefully
Few features made the trade-offs as tangible as this one.
Available Modes
Eventual consistency (default)
Writes are replicated asynchronously
Lower latency
Temporary inconsistencies may occur
Strong consistency (quorum-based)
Leader waits for acknowledgements from a majority of nodes
Higher latency
Linearizable writes as long as quorum is available
Switching Consistency at Runtime
Consistency can be changed dynamically using a client command:
CONSISTENCY eventual
CONSISTENCY strong
The consistency mode applies to the node receiving the command (typically the leader).
If quorum cannot be reached in strong mode, the write fails with:
FAIL: quorum not reached
This enables direct comparison of consistency-performance trade-offs.
Observability as a Design Requirement
Logging was not added as an afterthought.
NodeSync logs are:
Structured
Timestamped
Human-readable
Event-driven
This made debugging distributed behavior feasible.
Log Format
Each log entry follows the format:
HH:MM:SS [TAG ] message
Example:
17:09:47 [NodeSync ] Node 5000 running on 127.0.0.1:5000
What This Project Taught Me
Key takeaways include:
Distributed systems are dominated by edge cases
Failure handling outweighs happy-path logic
Simplicity improves correctness
Design trade-offs are unavoidable
Most importantly, building NodeSync changed how I read research papers and system designs. Abstract ideas now map to real behaviors.
Closing Thoughts
NodeSync is a learning system, not a production system. But its value lies precisely in that limitation. By stripping away abstraction layers, it exposes the true nature of distributed systems.
For anyone studying systems engineering or distributed computing, building a project like NodeSync is an invaluable experience.
👉https://github.com/MANISH-K-07/NodeSync
👨💻About the Author
I’m Manish Krishna Kandrakota — a computer science undergraduate exploring distributed systems, backend infrastructure, and systems-level engineering. I build projects from scratch to understand how real-world systems handle failures, coordination, and consistency. I plan to pursue graduate studies in computer science with a focus on research-driven software engineering and security.
My recent work includes distributed storage systems, static analysis tools, and security-focused software projects. I write to document my learning and share practical insights from building complex systems hands-on. I have built multiple static analysis and security-focused tools, including CodeChecker, PyScope, and SecureFlow, with an emphasis on clarity of design, explicit trade-offs, and academic relevance.
🔗 GitHub: MANISH-K-07
🔗 LinkedIn: manish-k-kandrakota



