Avatolabs GraghDB Course Part I

Marlon ShelbyMarlon Shelby
13 min read

Preface

This course aims to provide comprehensive graph database resources for our team, supporting our development work across multiple innovative areas.

Leveraging Neo4j's graph database capabilities to support our three technology domains:

  1. Avato Chat Social Network Features - Our instant messaging platform implements efficient social relationship management through graph databases, enabling users to establish more meaningful connections and interactions.

  2. RGB Protocol Bitcoin Ecosystem - We utilize Neo4j to map and optimize Bitcoin extension layer solutions based on the RGB protocol, providing more secure and transparent infrastructure for digital asset management.

  3. Dual-Track Marketing System (For legal reasons, only applicable to overseas markets) - Our MLM system tracks complex distribution network relationships through graph database technology, ensuring transaction transparency and efficiency.

Neo4j Resources

Download:
https://neo4j.com/download/

Learning Resources:

  • Bilibili: Analysis of Graph Databases (using Neo4j as an example)

  • Bilibili: Neo4j Tutorial - Latest version of high-performance graph database from beginner to advanced

  • Neo4j Official Video Tutorials (requires Google account OAuth):
    https://graphacademy.neo4j.com/courses/neo4j-fundamentals/

Part 1: Graph Databases and Neo4j Fundamentals

1.1 Essential Differences Between Graph Databases and Relational Databases

1.1.1 Fundamental Differences in Data Models

Relational Databases (e.g., MySQL) use a tabular data storage model where data is organized into predefined tables, rows, and columns. This structure enforces a strict data schema that requires data to conform to predefined schema definitions before insertion. In MySQL, relationships are implemented through foreign keys and require JOIN operations to query related data.

Neo4j Graph Database takes a completely different approach:

  • Nodes: Represent entities and can have any number of properties

  • Relationships: Directly connect nodes and can also have properties

  • Properties: Key-value pairs stored on nodes and relationships

This native graph structure allows Neo4j to more naturally express complex relationship networks found in the real world, such as social networks, recommendation systems, or supply chain management.

1.1.2 Storage Architecture Comparison

Relational databases like MySQL:

  • Data stored in tables with fixed structures

  • Relationships between tables are implicit and require JOIN operations to reconstruct

  • Use indexes (B+ tree structures, etc.) to accelerate queries

  • Separate storage, high disk I/O cost

Neo4j:

  • Uses native graph storage rather than mapping graphs to tables

  • Physical storage structure reflects logical relationship structure

  • Relationships are first-class citizens in physical storage, stored directly as pointers on disk

  • Uses linked list structures to directly connect related nodes

1.2 Neo4j's Core Technical Principles

1.2.1 Native Graph Processing Engine

Neo4j employs a storage and processing engine optimized specifically for graph operations, with these key features:

Index-free Adjacency: This is the core of Neo4j's performance advantage. In Neo4j, each node directly maintains physical pointers to its neighboring nodes, eliminating the need for index lookups during traversal. When accessing a node's neighbors, the system directly follows these physical pointers rather than performing expensive JOIN operations and index lookups like relational databases.

Practical Impact: When processing highly connected data, Neo4j can find all direct relationships of a node in constant time, regardless of the overall database size. This prevents performance degradation of deep relationship queries as data volume grows.

Locality Principle: Neo4j emphasizes the physical storage location of data, storing related nodes and relationships in physically adjacent locations whenever possible to reduce disk seek time. This is particularly important for graph traversal operations, as graph algorithms typically access data that's adjacent to each other.

1.2.2 Storage Architecture Details

Neo4j's storage architecture includes several key components:

Storage File Structure:

  • Node storage: Contains node IDs, labels, and property pointers

  • Relationship storage: Contains relationship types, directions, start and end node pointers, and property pointers

  • Property storage: Stores property data as key-value pairs

  • String storage: Dedicated to efficiently storing and reusing string values

  • Label indexes: For quickly finding nodes with specific labels

  • Relationship type indexes: For finding specific types of relationships

ACID Transaction Support: Neo4j uses Write-Ahead Logging (WAL) to ensure transaction atomicity and durability, while implementing Multi-Version Concurrency Control (MVCC) to support high-concurrency read operations without blocking writes.

1.3 Query Execution Principles and Performance Advantages

1.3.1 Cypher Query Language

Cypher is Neo4j's declarative query language, designed specifically for graph data operations:

MATCH (person:Person)-[:KNOWS]->(friend:Person)
WHERE person.name = 'Alice'
RETURN friend.name, friend.age

Unlike SQL, Cypher's syntax more intuitively expresses pattern matching in graphs, making complex relationship queries easier to write and understand.

1.3.2 Query Planning and Execution

Neo4j's query execution process:

  1. Parsing: Converts Cypher queries into abstract syntax trees

  2. Planning: Generates multiple possible execution plans based on cost models

  3. Optimization: Selects the lowest-cost execution plan

  4. Execution: Executes the query plan using an iterator pattern

Neo4j's query optimizer particularly focuses on:

  • Executing the most selective filtering conditions first

  • Using available indexes to accelerate pattern matching

  • Optimizing path finding and graph algorithm execution

1.3.3 Why Neo4j's Relationship Queries Are Faster

Compared to relational databases like MySQL, Neo4j has significant advantages in relationship queries:

JOIN Operation Comparison:

  • MySQL: Executing JOINs requires building, comparing, and merging temporary tables. Computational complexity increases rapidly with data volume. Multi-table JOINs have exponential cost growth.

  • Neo4j: Traversing relationships is a native operation, accessing related nodes through direct pointers. Complexity is proportional to the result set size, not the overall database size.

Deep Relationship Queries:

  • MySQL: Requires self-joins or recursive CTEs (in supported versions), with performance degrading significantly with each additional relationship depth.

  • Neo4j: Can efficiently execute relationship traversals of any depth, such as "friends of friends of friends," with performance only related to the actual result set size.

Path Finding Algorithms: Neo4j has built-in implementations of efficient graph algorithms:

  • Dijkstra's algorithm (shortest path)

  • A* algorithm (heuristic path search)

  • All-paths algorithm (finding all possible paths)

These algorithms run directly on the graph storage structure, avoiding the complex queries and temporary table operations required in relational databases to implement the same functionality.

1.4 Neo4j's Performance Optimization Techniques

1.4.1 Indexing Strategies

Although index-free adjacency is Neo4j's core advantage, indexes are still crucial for initial node lookups:

  • B+ tree indexes: For quick property value lookups

  • Full-text indexes: Support text searches

  • Spatial indexes: Optimize geospatial queries

  • Composite indexes: Joint indexes based on multiple properties

1.4.2 Caching Mechanisms

Neo4j leverages multi-level caching to optimize performance:

  • Page cache: Keeps the most frequently used parts of the graph structure in memory

  • Object cache: Caches parsed node, relationship, and property objects

  • Query plan cache: Reuses compiled query execution plans

1.4.3 Memory Management

Neo4j exercises fine-grained control over memory usage:

  • Off-heap memory management reduces GC pressure

  • Memory-mapped files improve I/O performance

  • Precise memory budget control avoids OOM errors

1.5 Performance Comparison in Real Scenarios

1.5.1 Social Network Analysis

Scenario: Finding second-degree connections (friends of friends) for a specific user

MySQL Implementation:

SELECT DISTINCT u3.*
FROM users u1
JOIN friendships f1 ON u1.id = f1.user1_id
JOIN users u2 ON f2.user2_id = u2.id
JOIN friendships f2 ON u2.id = f2.user1_id
JOIN users u3 ON f2.user2_id = u3.id
WHERE u1.name = 'Alice'
AND u3.id != u1.id
AND NOT EXISTS (
  SELECT 1 FROM friendships 
  WHERE (user1_id = u1.id AND user2_id = u3.id)
  OR (user1_id = u3.id AND user2_id = u1.id)
);

Neo4j Implementation:

MATCH (u1:User {name: 'Alice'})-[:FRIENDS]-(u2)-[:FRIENDS]-(u3)
WHERE u1 <> u3 
AND NOT (u1)-[:FRIENDS]-(u3)
RETURN DISTINCT u3

Performance Comparison:

  • For a dataset with 1,000 users and an average of 50 connections per user, the MySQL query might take several seconds to complete

  • For the same scale, the Neo4j query typically completes in milliseconds

  • As the dataset grows, the gap widens further: with 1 million users, the MySQL query might take minutes, while Neo4j still maintains sub-second performance

1.5.2 Complex Relationship Queries

Scenario: Finding supply chain paths for a specific product in an e-commerce system

In a system with tens of thousands of suppliers and hundreds of thousands of products:

  • MySQL might need to create temporary tables and perform multiple JOINs, with query time growing exponentially with chain depth

  • Neo4j leverages native graph traversal to return results in milliseconds, with performance almost unaffected by the total database size

1.6 Strategic Value of Neo4j in AvatoLabs Technology Stack

1.6.1 Avato Chat Social Network Features

Neo4j's graph data model is ideal for expressing and querying social relationships between users:

  • Quickly identifying mutual friends

  • Efficiently calculating community and group member relationships

  • Real-time recommendation system support

  • Personalization and prioritization of information feeds

1.6.2 RGB Protocol Bitcoin Ecosystem

Using Neo4j to track and analyze complex asset transfer networks:

  • Visualizing and analyzing transaction flows

  • Identifying asset ownership paths

  • Optimizing graph computations in consensus algorithms

  • Implementing efficient on-chain data analysis

Important Limitation Note: It's particularly important to emphasize that if the RGB community ultimately chooses to implement zero-knowledge proof (ZK) technology at the transaction level and stash storage, graph databases may not be able to build complete transaction chains. In this case, Neo4j's application scope will likely be limited to supporting only fully public assets exclusively owned by our platform. This technical decision will directly impact the scope and depth of our data analysis capabilities, and the team should plan response strategies in advance, including developing hybrid analysis models or enhancing local data capture mechanisms.

1.6.3 Dual-Track MLM System

Multi-level marketing systems are essentially network structures, where Neo4j can provide significant advantages:

  • Instant calculation of user positions and relationships in the network

  • Real-time commission calculation and distribution

  • Effective detection of anomalous patterns and potential fraud

  • Support for complex incentive rules and dynamic reward structures

1.7 Neo4j's System Architecture

1.7.1 Index-free Adjacency

Index-free adjacency is the essence of Neo4j's core architectural design and the source of its performance advantage:

Principle Explanation:

  • Each node and relationship has a unique fixed-length ID at the storage level

  • Relationship entities directly store the physical addresses (pointers) of their start and target nodes

  • Node entities maintain pointers to their first relationship

  • Relationship entities form a doubly-linked list structure, allowing nodes to efficiently traverse all related relationships

Performance Impact:

  • The time complexity of finding all adjacent nodes of a node is O(k), where k is the number of adjacent nodes, regardless of the total database size

  • Avoids the extensive index lookups and table scans required by JOIN operations in traditional relational databases

  • Allows graph algorithms to execute with near-theoretically optimal performance

Limitations and Trade-offs:

  • Requires more storage space to maintain these direct references

  • Modification operations (especially deleting relationships) may require more steps to maintain these linked list structures

  • Initial node lookups still rely on indexes

1.7.2 Neo4j's Underlying Storage Structure

Neo4j's storage architecture consists of multiple dedicated files, each responsible for a specific type of data:

Main Storage Files:

  1. Node store (nodestore):

    • Fixed-length record structure, typically 15 bytes

    • Contains: flags, label pointers, property pointers, relationship pointers

    • Uses compact notation to maximize storage efficiency

  2. Relationship store (relationshipstore):

    • Fixed-length records, typically 34 bytes

    • Contains: relationship type ID, start node ID, end node ID, property pointers, relationship chain pointers

    • Maintains four linked list pointers for efficient traversal of relationships in all directions

  3. Property store (propertystore):

    • Used to store properties of nodes and relationships

    • Contains property type, property key, and property value

    • Short property values are stored inline directly, longer values reference dynamic storage areas

  4. Label store (labelstore):

    • Stores label information and their associations with nodes
  5. Dynamic record storage:

    • Used to store variable-length data such as strings, arrays, etc.

    • Supports arbitrary length values through record linking

Physical Storage Layout:

  • All records use fixed-length design for fast random access

  • Uses record linking techniques to support variable-length data

  • Employs page caching mechanisms to reduce disk I/O

1.7.3 Neo4j's Traversal Methods

Neo4j supports multiple efficient graph traversal strategies, each with advantages in different scenarios:

Depth-First Search (DFS):

  • Implemented using stack data structures

  • Suitable for finding deep paths and connectivity analysis

  • Used in Neo4j for path finding and reachability checks

Breadth-First Search (BFS):

  • Implemented using queue data structures

  • Suitable for finding shortest paths

  • Used in Neo4j for shortest path and nearest neighbor analysis

Bidirectional Search:

  • Searches simultaneously from both start and target nodes

  • Significantly reduces the search space

  • Used for efficient implementation of short path queries

Weighted Path Algorithms:

  • Dijkstra's algorithm: Finds shortest paths based on weights

  • A* algorithm: Uses heuristic methods to optimize path search

  • Supports defining custom weight properties on relationships

Neo4j's Practical Implementation Optimizations:

  • Uses bitmaps to mark visited nodes, avoiding revisits

  • Uses iterators instead of recursion, reducing stack overflow risk

  • Supports dynamic pruning during traversal

  • Allows setting depth and expansion limits to avoid traversal explosion

1.7.4 Neo4j's Storage Optimizations

Neo4j provides multiple mechanisms to optimize storage space and query performance:

Data Compression Techniques:

  • Compact representation of property values

  • String reuse and reference counting

  • Inline storage of small property values

Memory-Mapped Files:

  • Leverages the operating system's virtual memory management

  • Reduces JVM heap memory pressure

  • Provides a more direct I/O path

Cache Hierarchy:

  1. Page Cache:

    • Caches raw storage pages

    • Directly maps to operating system memory

    • Configurable size, typically occupying most available memory

  2. Object Cache:

    • Caches high-level data structures (nodes, relationships, etc.)

    • Reduces deserialization overhead

    • Size can be adjusted as needed

  3. Query Cache:

    • Caches compiled query plans

    • Reuses execution paths for common queries

    • Adaptively optimizes popular queries

Storage Partitioning Strategies:

  • Supports physical grouping by label or relationship type

  • Allows highly related data to be stored physically adjacent

  • Optimizes locality for specific query patterns

1.8 Neo4j Version Overview

1.8.1 Neo4j 1.x - Foundation Era

Neo4j 1.x series (2010-2013) laid the foundation for graph databases:

  • Introduced core graph data model and traversal API

  • Provided basic transaction support

  • Included early versions of the Cypher query language

  • Mainly optimized for single-machine deployment

  • Lacked enterprise-grade high availability features

1.8.2 Neo4j 2.x - Labels and Clustering

Neo4j 2.x series (2013-2015) brought important feature upgrades:

  • Introduced the concept of node labels, greatly enhancing data modeling capabilities

  • Improved the Cypher query language, adding more graph operation functionality

  • Enhanced indexing and constraint capabilities

  • Enterprise version provided initial high availability clustering functionality

  • Improved locking mechanisms and concurrent performance

1.8.3 Neo4j 3.x - Maturity and Expansion

Neo4j 3.x series (2016-2019) marked the product's maturation:

  • Introduced causal clustering architecture, providing enterprise-grade scalability

  • Significantly optimized the Cypher execution engine

  • Added procedural API (stored procedures)

  • Provided multi-database support

  • Improved security and authentication mechanisms

  • Introduced APOC (Awesome Procedures On Cypher) extension library

1.8.4 Neo4j 4.x - Multi-database and Federation

Neo4j 4.x series (2020-2022) brought revolutionary architectural changes:

  • Full multi-database support, allowing isolation of multiple graph databases in a single instance

  • Introduced reactive architecture

  • Provided heterogeneous clustering and horizontal scaling capabilities

  • Improved full-text indexing and geospatial functionality

  • GQL standard support (Graph Query Language)

  • Optimized memory management and storage engine

1.8.5 Neo4j 5.x - Next Generation Performance

Neo4j 5.x series (2022 to present) focuses on performance and scalability:

  • New Cypher planner and runtime

  • Improved cluster topology and replication mechanisms

  • Support for ultra-large-scale graphs (trillion-level edges)

  • Significantly improved data import and batch processing performance

  • Enhanced monitoring and observability functionality

  • Tight integration with the Graph Data Science library

1.8.6 Neo4j Community Edition

Neo4j Community Edition is a free, open-source version suitable for developers and small projects:

Core Features:

  • Complete ACID transaction support

  • Cypher query language

  • Built-in web interface (Neo4j Browser)

  • Single-instance deployment

  • Basic data import/export tools

  • Complete CRUD operation support

  • Developer APIs and drivers

Limitations:

  • No support for clustering and high availability

  • No fine-grained security controls

  • No hot backup functionality

  • Cannot perform online backups

  • No commercial support

1.8.7 Neo4j Enterprise Edition

Neo4j Enterprise Edition is aimed at commercial applications and large-scale deployment scenarios:

Enterprise Features:

  • High availability clustering architecture

  • Causal clustering and read-write separation

  • Online backup and recovery

  • Fine-grained role-based access control

  • Multi-data center support

  • Monitoring and alerting framework

  • ETL tool integration

  • Advanced indexing functionality

  • Enterprise-grade technical support

  • Full functionality of the Graph Data Science library

Licensing Model:

  • Core-based licensing

  • Subscription-based annual licenses

  • Support for cloud deployment options (AWS, Azure, GCP)

  • Includes professional support and consulting services

Deployment Options:

  • Self-managed deployment

  • Neo4j Aura (fully managed cloud service)

  • Neo4j Desktop (developer tool)

We are currently using the open-source version to validate our dual-track strategy. At this stage, we are using the Neo4j Community Edition for proof of concept and initial development. Once the business model is fully validated, we plan to migrate to Neo4j Enterprise Edition to take advantage of its high availability, security features, and performance optimization support.

0
Subscribe to my newsletter

Read articles from Marlon Shelby directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Marlon Shelby
Marlon Shelby