Add Neo4j schema initialization and validation scripts

- Introduced `neo4j-schema-init.py`, which creates the foundational schema for the personal knowledge graph used by multiple AI assistants.
- Implemented creation of constraints, indexes, and sample nodes, along with comprehensive testing of the schema.
- Added `neo4j-validate.py` to run validation checks on the Neo4j knowledge graph, covering constraints, indexes, sample nodes, relationships, and junk data detection.
- Enhanced logging for better traceability and debugging during schema initialization and validation.
2026-03-06 14:11:52 +00:00
parent b654a04185
commit 7859264359
46 changed files with 11679 additions and 2 deletions
--- a/docs/neo4j-utils.md
+++ b/docs/neo4j-utils.md
@@ -0,0 +1,301 @@
+# Neo4j Utility Scripts
+
+> Documentation for the database management scripts in `utils/`
+
+---
+
+## Scripts Overview
+
+| Script | Purpose | Destructive? |
+|--------|---------|:------------:|
+| `neo4j-schema-init.py` | Create constraints, indexes, and sample data | No (idempotent) |
+| `neo4j-reset.py` | Wipe all data, constraints, and indexes | **Yes** |
+| `neo4j-validate.py` | Comprehensive validation report | No (read-only) |
+
+---
+
+## neo4j-schema-init.py
+
+Creates the foundational schema for the unified knowledge graph: 74 uniqueness constraints, ~94 performance indexes, and 12 sample nodes with 5 cross-domain relationships.
+
+### Usage
+
+```bash
+# Interactive — prompts for URI, user, password
+python utils/neo4j-schema-init.py
+
+# Specify URI (will prompt for user/password)
+python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687
+
+# Skip sample data creation
+python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --skip-samples
+
+# Test-only mode (no schema changes)
+python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --test-only
+
+# Quiet mode
+python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --quiet
+```
+
+### What It Creates
+
+1. **74 uniqueness constraints** — one per node type, on the `id` property
+2. **~94 performance indexes** — on name/title, date, type/status/category, and domain fields
+3. **12 sample nodes** — spanning all three teams (Personal, Work, Engineering)
+4. **5 sample relationships** — demonstrating cross-domain connections
+
+### Idempotent
+
+Safe to run multiple times. Uses `IF NOT EXISTS` for constraints/indexes and `MERGE` for sample data.
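+
+A minimal sketch of how idempotent schema statements like these can be generated, assuming Neo4j 5 Cypher syntax. The helper names and the constraint/index naming convention are hypothetical and are not taken from `neo4j-schema-init.py`:
+
+```python
+def constraint_statement(label: str) -> str:
+    """Build an idempotent uniqueness constraint on `id` for one node label."""
+    # The constraint name is a made-up convention for this sketch.
+    name = f"{label.lower()}_id_unique"
+    return (
+        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
+        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
+    )
+
+
+def index_statement(label: str, prop: str) -> str:
+    """Build an idempotent index on one property of a label."""
+    name = f"{label.lower()}_{prop}_idx"
+    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"
+
+
+# Two example labels only, not the full list of 74 node types.
+for label in ["Person", "Book"]:
+    print(constraint_statement(label))
+print(index_statement("Person", "name"))
+```
+
+Because every statement carries `IF NOT EXISTS`, running the generated statements a second time should leave the schema unchanged.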
+
+---
+
+## neo4j-reset.py
+
+Wipes the database clean. Drops all constraints, indexes, nodes, and relationships.
+
+### Usage
+
+```bash
+# Interactive — will prompt for confirmation
+python utils/neo4j-reset.py --uri bolt://ariel.incus:7687
+
+# Skip confirmation prompt
+python utils/neo4j-reset.py --uri bolt://ariel.incus:7687 --force
+```
+
+### What It Does
+
+1. Reports current database contents (node/relationship/constraint/index counts)
+2. Drops all constraints
+3. Drops all non-lookup indexes
+4. Deletes all nodes and relationships (batched for large databases)
+5. Verifies the database is clean
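+
+The batched deletion in step 4 could look roughly like the sketch below. It is an assumption about the approach, not the actual code in `neo4j-reset.py`; `run_query` is a hypothetical callable that runs one query in its own committed transaction and returns the `deleted` count:
+
+```python
+DELETE_BATCH = (
+    "MATCH (n) WITH n LIMIT $batch_size "
+    "DETACH DELETE n RETURN count(*) AS deleted"
+)
+
+
+def delete_all_in_batches(run_query, batch_size=10_000):
+    """Delete nodes and their relationships in batches until none remain.
+
+    `run_query(cypher, **params)` is a hypothetical callable supplied by the
+    caller; it must commit each batch and return the number of deleted nodes.
+    """
+    total = 0
+    while True:
+        deleted = run_query(DELETE_BATCH, batch_size=batch_size)
+        if deleted == 0:
+            return total
+        total += deleted
+```
+
+With the real driver, `run_query` could wrap `session.execute_write` so that each batch commits before the next one starts, which keeps any single transaction small on large databases.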
+
+### Safety
+
+- Requires typing `yes` to confirm (unless `--force`)
+- Shows before/after counts so you know exactly what was removed
+
+---
+
+## neo4j-validate.py
+
+Generates a comprehensive validation report. Share the output to verify the graph is correctly built.
+
+### Usage
+
+```bash
+python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
+```
+
+### What It Checks
+
+| Section | What's Validated |
+|---------|-----------------|
+| **Connection** | Database reachable, APOC plugin available |
+| **Constraints** | All 74 uniqueness constraints present, no extras |
+| **Indexes** | Total count, spot-check of 11 key indexes |
+| **Node Labels** | No unexpected labels (detects junk from Memory server, etc.) |
+| **Sample Nodes** | All 12 sample nodes exist with correct properties |
+| **Sample Relationships** | All 5 cross-domain relationships exist |
+| **Relationship Summary** | Total count and breakdown by type |
+| **Node Summary** | Total count and breakdown by label |
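+
+As an illustration of the **Node Labels** check, the sketch below compares the labels found in the database against the expected schema labels. The function name and the sample inputs are hypothetical; in practice `found` could come from a query such as `CALL db.labels()`, and `expected` would hold all 74 schema labels:
+
+```python
+def find_unexpected_labels(found, expected):
+    """Return labels present in the database but absent from the schema."""
+    return sorted(set(found) - set(expected))
+
+
+# Hypothetical inputs for illustration only.
+expected = {"Person", "Book"}
+found = ["Person", "Book", "Memory", "reference"]
+print(find_unexpected_labels(found, expected))  # ['Memory', 'reference']
+```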
+
+### Expected Clean Output
+
+```
+═════════════════════════════════════════════════════════════════
+  VALIDATION REPORT — Koios Unified Knowledge Graph
+═════════════════════════════════════════════════════════════════
+  Schema Version: 2.1.0
+  ...
+  RESULT: ALL 23 CHECKS PASSED ✓
+═════════════════════════════════════════════════════════════════
+```
+
+---
+
+## Standard Workflow
+
+### Fresh Setup / Clean Slate
+
+```bash
+# 1. Wipe everything
+python utils/neo4j-reset.py --uri bolt://ariel.incus:7687
+
+# 2. Build schema and sample data
+python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687
+
+# 3. Validate
+python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
+```
+
+### Routine Validation
+
+```bash
+python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
+```
+
+### Environment Variables
+
+All three scripts support environment variables to avoid repeated prompts:
+
+```bash
+export NEO4J_URI="bolt://ariel.incus:7687"
+export NEO4J_USER="neo4j"
+export NEO4J_PASSWORD="your-password"
+
+# Then just:
+python utils/neo4j-reset.py --force
+python utils/neo4j-schema-init.py --skip-samples
+python utils/neo4j-validate.py
+```
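+
+A minimal sketch of how a script might resolve these settings, falling back to interactive prompts when a variable is unset. The function name and prompt texts are assumptions, and the sketch ignores command-line flags such as `--uri`:
+
+```python
+import os
+from getpass import getpass
+
+
+def resolve_settings(env=os.environ, ask=input, ask_secret=getpass):
+    """Read connection settings from the environment, prompting for gaps."""
+    uri = env.get("NEO4J_URI") or ask("Neo4j URI: ")
+    user = env.get("NEO4J_USER") or ask("Neo4j user: ")
+    # getpass hides the password while it is typed.
+    password = env.get("NEO4J_PASSWORD") or ask_secret("Neo4j password: ")
+    return uri, user, password
+```
+
+Passing `env`, `ask`, and `ask_secret` as parameters keeps the function testable without touching the real environment or terminal.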
+
+---
+
+## Neo4j Python Driver — Lessons Learned
+
+These patterns were discovered during development and are critical for anyone writing Cypher through the Neo4j Python driver (v5.x / v6.x).
+
+### 1. Use Explicit Transactions for Writes
+
+**Problem:** `session.run()` uses auto-commit transactions, and during development these did not reliably commit writes with the Neo4j Python driver 5.x+. Results must be fully consumed, or the transaction may not commit.
+
+**Bad — silently fails to persist:**
+```python
+with driver.session() as session:
+    session.run("CREATE (n:Person {id: 'test'})")
+    # Transaction may not commit!
+```
+
+**Good — explicit transaction with context manager:**
+```python
+with driver.session() as session:
+    with session.begin_transaction() as tx:
+        tx.run("CREATE (n:Person {id: 'test'})")
+        # Auto-commits when context exits normally
+        # Auto-rolls back on exception
+```
+
+**Also good — managed write transaction:**
+```python
+def create_person_tx(tx, name):
+    # Set `id` explicitly; otherwise `a.id` is null and nothing useful is returned
+    result = tx.run("CREATE (a:Person {id: randomUUID(), name: $name}) RETURN a.id AS id", name=name)
+    record = result.single()
+    return record["id"]
+
+with driver.session() as session:
+    node_id = session.execute_write(create_person_tx, "Alice")
+```
+
+### 2. Cypher MERGE Clause Ordering
+
+**Problem:** `ON CREATE SET` must come immediately after `MERGE`, before any general `SET` clause. Placing `SET` before `ON CREATE SET` causes a syntax error.
+
+**Bad — syntax error:**
+```cypher
+MERGE (p:Person {id: 'user_main'})
+SET p.name = 'Main User',
+    p.updated_at = datetime()
+ON CREATE SET p.created_at = datetime()  // ERROR: Invalid input 'ON'
+```
+
+**Good — correct clause order:**
+```cypher
+MERGE (p:Person {id: 'user_main'})
+ON CREATE SET p.created_at = datetime()
+SET p.name = 'Main User',
+    p.updated_at = datetime()
+```
+
+The full MERGE clause order is:
+```
+MERGE (pattern)
+ON CREATE SET ...   ← only runs when node is first created
+ON MATCH SET ...    ← only runs when node already exists (optional)
+SET ...             ← always runs
+```
+
+### 3. Consume Results in Transactions
+
+**Problem:** In managed transactions (`execute_write`), results must be consumed within the transaction function. Unconsumed results can cause issues.
+
+**Good pattern:**
+```python
+def create_node_tx(tx, node_id):
+    result = tx.run("MERGE (n:Person {id: $id}) RETURN n.id AS id", id=node_id)
+    record = result.single()  # Consumes the result
+    return record["id"]
+```
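+
+When a write query has no `RETURN` clause, `result.consume()` is another way to exhaust the result inside the transaction function. The sketch below assumes the driver's `Result.consume()` and the `counters.nodes_created` field of the returned summary; the function name is illustrative:
+
+```python
+def merge_person_tx(tx, person_id):
+    """Run a write without RETURN and report how many nodes were created."""
+    result = tx.run("MERGE (n:Person {id: $id})", id=person_id)
+    summary = result.consume()  # exhausts the result and returns its summary
+    return summary.counters.nodes_created
+
+
+# Intended usage with a real session (not executed here):
+#   created = session.execute_write(merge_person_tx, "user_main")
+```
+
+A return value of `0` means the node already existed, which is a cheap way to confirm what a `MERGE` actually did.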
+
+### 4. MATCH Returns No Rows ≠ Error
+
+**Problem:** If a `MATCH` clause finds nothing, the query succeeds with zero rows — it does **not** raise an error. This means `MERGE` on a relationship after a failed `MATCH` silently does nothing.
+
+```cypher
+// If person_xyz doesn't exist, this returns 0 rows (no error)
+MATCH (p:Person {id: 'person_xyz'})
+MATCH (b:Book {id: 'book_abc'})
+MERGE (p)-[:COMPLETED]->(b)
+// Zero rows processed, zero relationships created, zero errors
+```
+
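+**Mitigation:** Add a `RETURN` clause to the query (for example `RETURN p.id AS person_id`); without one, `result.single()` is always `None`. Then check `result.single()` for `None` to detect this case: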
+```python
+record = result.single()
+if record is None:
+    logger.error("Endpoints not found — no relationship created")
+```
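+
+Putting the pieces together, a transaction function for this check might look like the sketch below. The function name, the query constant, and the `RETURN` aliases are illustrative assumptions, not code from the scripts:
+
+```python
+import logging
+
+logger = logging.getLogger(__name__)
+
+LINK_COMPLETED = """
+MATCH (p:Person {id: $person_id})
+MATCH (b:Book {id: $book_id})
+MERGE (p)-[:COMPLETED]->(b)
+RETURN p.id AS person_id, b.id AS book_id
+"""
+
+
+def link_completed_tx(tx, person_id, book_id):
+    """Create the COMPLETED relationship; return False if an endpoint is missing."""
+    result = tx.run(LINK_COMPLETED, person_id=person_id, book_id=book_id)
+    record = result.single()  # None when either MATCH found nothing
+    if record is None:
+        logger.error("Endpoints not found: %s -> %s", person_id, book_id)
+        return False
+    return True
+```
+
+Returning a boolean lets the caller decide whether a missing endpoint should abort the run or only be logged.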
+
+### 5. Separate Node and Relationship Transactions
+
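+**Problem:** During development, creating nodes and then matching them for relationships through auto-commit `session.run()` calls sometimes failed, apparently because the nodes were not yet visible to the later queries. As described in lesson 4, the failed `MATCH` raises no error, so the relationships are silently skipped.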
+
+**Good pattern:** Create all nodes in one explicit transaction (commit), then create relationships in a separate explicit transaction:
+```python
+# Transaction 1: Create nodes
+with session.begin_transaction() as tx:
+    for query in node_queries:
+        tx.run(query)
+    # Auto-commits on exit
+
+# Transaction 2: Create relationships (nodes now visible)
+with session.begin_transaction() as tx:
+    for query in relationship_queries:
+        tx.run(query)
+    # Auto-commits on exit
+```
+
+### 6. MCP Memory Server vs Neo4j Cypher Server
+
+**Problem:** The MCP Memory server (`@modelcontextprotocol/server-memory`) and Neo4j Cypher MCP server can both connect to the same Neo4j instance, but they use completely different data models.
+
+| | Memory Server | Cypher Server |
+|---|---|---|
+| **Schema** | Fixed: `name`, `type`, `observations` | Your full custom schema |
+| **Node labels** | `Memory`, `reference` | Your 74 defined types |
+| **Relationships** | Simple string pairs | Rich typed relationships |
+| **Query language** | API calls (`search_nodes`) | Full Cypher |
+
+**Resolution:** If you have a custom Neo4j schema, use **only** the Cypher MCP server. Remove the Memory server to prevent it from polluting your graph with its own primitive node types.
+
+---
+
+## Dependencies
+
+```bash
+pip install neo4j
+```
+
+All three scripts require the `neo4j` Python package. APOC is optional but recommended (the init script's test suite checks for it).
+
+---
+
+## Version History
+
+| Date | Change |
+|------|--------|
+| 2025-01-07 | Initial `neo4j-schema-init.py` |
+| 2026-02-17 | Added `neo4j-reset.py` and `neo4j-validate.py` |
+| 2026-02-17 | Fixed init script: explicit transactions, correct MERGE clause ordering |