koios/docs/neo4j-utils.md
Robert Helewka 7859264359 Add Neo4j schema initialization and validation scripts
- Introduced `neo4j-schema-init.py` for creating the foundational schema for the personal knowledge graph used by multiple AI assistants.
- Implemented functionality for creating constraints, indexes, and sample nodes, along with comprehensive testing of the schema.
- Added `neo4j-validate.py` to perform validation checks on the Neo4j knowledge graph, including constraints, indexes, sample nodes, relationships, and junk data detection.
- Enhanced logging for better traceability and debugging during schema initialization and validation processes.
2026-03-06 14:11:52 +00:00


# Neo4j Utility Scripts
> Documentation for the database management scripts in `utils/`
---
## Scripts Overview
| Script | Purpose | Destructive? |
|--------|---------|:------------:|
| `neo4j-schema-init.py` | Create constraints, indexes, and sample data | No (idempotent) |
| `neo4j-reset.py` | Wipe all data, constraints, and indexes | **Yes** |
| `neo4j-validate.py` | Comprehensive validation report | No (read-only) |
---
## neo4j-schema-init.py
Creates the foundational schema for the unified knowledge graph: 74 uniqueness constraints, ~94 performance indexes, and 12 sample nodes with 5 cross-domain relationships.
### Usage
```bash
# Interactive — prompts for URI, user, password
python utils/neo4j-schema-init.py
# Specify URI (will prompt for user/password)
python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687
# Skip sample data creation
python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --skip-samples
# Test-only mode (no schema changes)
python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --test-only
# Quiet mode
python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687 --quiet
```
### What It Creates
1. **74 uniqueness constraints** — one per node type, on the `id` property
2. **~94 performance indexes** — on name/title, date, type/status/category, and domain fields
3. **12 sample nodes** — spanning all three teams (Personal, Work, Engineering)
4. **5 sample relationships** — demonstrating cross-domain connections
### Idempotent
Safe to run multiple times. Uses `IF NOT EXISTS` for constraints/indexes and `MERGE` for sample data.
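
As a rough illustration of why re-running is safe, the sketch below builds the kind of statements described above. It is not the script's actual code: the helper `constraint_statement`, the constraint naming scheme, and the three labels are assumptions made for the example, and the constraint syntax shown is the Neo4j 5 form.

```python
def constraint_statement(label: str) -> str:
    """Build an idempotent uniqueness constraint on the `id` property."""
    # Hypothetical naming scheme: <label>_id_unique, lower-cased.
    name = f"{label.lower()}_id_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.id IS UNIQUE"
    )


# Illustrative subset of labels only; the real schema defines 74 node types.
LABELS = ["Person", "Book", "Project"]
STATEMENTS = [constraint_statement(label) for label in LABELS]

# MERGE matches on the key and creates the node only when it is missing,
# so running this query twice does not duplicate sample data.
SAMPLE_NODE_QUERY = (
    "MERGE (p:Person {id: $id}) "
    "ON CREATE SET p.created_at = datetime() "
    "SET p.name = $name"
)

if __name__ == "__main__":
    for statement in STATEMENTS:
        print(statement)
```

Because `IF NOT EXISTS` turns a repeated `CREATE CONSTRAINT` into a no-op and `MERGE` reuses an existing node, neither statement fails or duplicates anything on a second run.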
---
## neo4j-reset.py
Wipes the database clean. Drops all constraints, indexes, nodes, and relationships.
### Usage
```bash
# Interactive — will prompt for confirmation
python utils/neo4j-reset.py --uri bolt://ariel.incus:7687
# Skip confirmation prompt
python utils/neo4j-reset.py --uri bolt://ariel.incus:7687 --force
```
### What It Does
1. Reports current database contents (node/relationship/constraint/index counts)
2. Drops all constraints
3. Drops all non-lookup indexes
4. Deletes all nodes and relationships (batched for large databases)
5. Verifies the database is clean
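
The batched deletion in step 4 could look roughly like the sketch below. This is an assumption about the approach, not the script's code: the function names and the default batch size are hypothetical, and `session` is expected to be a `neo4j` driver session.

```python
DELETE_BATCH_QUERY = (
    "MATCH (n) WITH n LIMIT $batch_size "
    "DETACH DELETE n "
    "RETURN count(*) AS deleted"
)


def _delete_batch_tx(tx, batch_size):
    """Delete up to `batch_size` nodes and return how many were removed."""
    result = tx.run(DELETE_BATCH_QUERY, batch_size=batch_size)
    return result.single()["deleted"]


def delete_all_in_batches(session, batch_size=10_000):
    """Repeat small delete transactions until the database is empty.

    Each batch commits on its own, which keeps transaction memory
    bounded on large graphs.
    """
    total = 0
    while True:
        deleted = session.execute_write(_delete_batch_tx, batch_size)
        if deleted == 0:
            return total
        total += deleted
```

`DETACH DELETE` removes each node together with its relationships, so a separate relationship pass is not needed in this sketch.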
### Safety
- Requires typing `yes` to confirm (unless `--force`)
- Shows before/after counts so you know exactly what was removed
---
## neo4j-validate.py
Generates a comprehensive validation report. Share the output to verify the graph is correctly built.
### Usage
```bash
python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
```
### What It Checks
| Section | What's Validated |
|---------|-----------------|
| **Connection** | Database reachable, APOC plugin available |
| **Constraints** | All 74 uniqueness constraints present, no extras |
| **Indexes** | Total count, spot-check of 11 key indexes |
| **Node Labels** | No unexpected labels (detects junk from Memory server, etc.) |
| **Sample Nodes** | All 12 sample nodes exist with correct properties |
| **Sample Relationships** | All 5 cross-domain relationships exist |
| **Relationship Summary** | Total count and breakdown by type |
| **Node Summary** | Total count and breakdown by label |
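
The **Node Labels** check can be pictured as a set difference between the labels found in the database and the labels the schema defines. The sketch below is illustrative only: `EXPECTED_LABELS` is a hypothetical three-label subset and both function names are assumptions, not the validator's real internals.

```python
# Hypothetical subset of the schema's node types; the real schema defines 74.
EXPECTED_LABELS = {"Person", "Book", "Project"}


def fetch_labels_tx(tx):
    """Read the labels known to the database (run via session.execute_read)."""
    # db.labels() is a built-in procedure that yields one `label` per row.
    result = tx.run("CALL db.labels() YIELD label RETURN label")
    return [record["label"] for record in result]


def unexpected_labels(actual, expected=EXPECTED_LABELS):
    """Return labels present in the database but absent from the schema."""
    return sorted(set(actual) - set(expected))


if __name__ == "__main__":
    # Labels such as `Memory` would be reported as junk.
    print(unexpected_labels(["Person", "Memory", "Book"]))
```

With a live connection, the two pieces would be combined as `unexpected_labels(session.execute_read(fetch_labels_tx))`.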
### Expected Clean Output
```
═════════════════════════════════════════════════════════════════
VALIDATION REPORT — Koios Unified Knowledge Graph
═════════════════════════════════════════════════════════════════
Schema Version: 2.1.0
...
RESULT: ALL 23 CHECKS PASSED ✓
═════════════════════════════════════════════════════════════════
```
---
## Standard Workflow
### Fresh Setup / Clean Slate
```bash
# 1. Wipe everything
python utils/neo4j-reset.py --uri bolt://ariel.incus:7687
# 2. Build schema and sample data
python utils/neo4j-schema-init.py --uri bolt://ariel.incus:7687
# 3. Validate
python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
```
### Routine Validation
```bash
python utils/neo4j-validate.py --uri bolt://ariel.incus:7687
```
### Environment Variables
All three scripts support environment variables to avoid repeated prompts:
```bash
export NEO4J_URI="bolt://ariel.incus:7687"
export NEO4J_USER="neo4j"
export NEO4J_PASSWORD="your-password"
# Then just:
python utils/neo4j-reset.py --force
python utils/neo4j-schema-init.py --skip-samples
python utils/neo4j-validate.py
```
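
One plausible way for a script to combine flags, environment variables, and prompts is sketched below. The precedence (command-line value, then environment variable, then interactive prompt) and the helper name `resolve_setting` are assumptions for illustration; the scripts themselves may resolve settings differently.

```python
import os
from getpass import getpass


def resolve_setting(cli_value, env_name, prompt, secret=False, env=os.environ):
    """Return the CLI value, else the environment variable, else prompt."""
    if cli_value:
        return cli_value
    if env.get(env_name):
        return env[env_name]
    # Fall back to an interactive prompt; hide input for secrets.
    return getpass(prompt) if secret else input(prompt)


if __name__ == "__main__":
    uri = resolve_setting(None, "NEO4J_URI", "Neo4j URI: ")
    user = resolve_setting(None, "NEO4J_USER", "User: ")
    password = resolve_setting(None, "NEO4J_PASSWORD", "Password: ", secret=True)
    print(f"Connecting to {uri} as {user}")
```

Reading the password with `getpass` keeps it out of the terminal scrollback when no `NEO4J_PASSWORD` is set.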
---
## Neo4j Python Driver — Lessons Learned
These patterns were discovered during development and are critical for anyone writing Cypher through the Neo4j Python driver (v5.x / v6.x).
### 1. Use Explicit Transactions for Writes
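**Problem:** `session.run()` runs the query in an auto-commit transaction and returns a lazily fetched result. With the Neo4j Python driver 5.x+, writes issued this way did not reliably persist for us: unless the result is fully consumed, the transaction may not commit and any error can go unnoticed.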
**Bad — silently fails to persist:**
```python
with driver.session() as session:
    session.run("CREATE (n:Person {id: 'test'})")
    # Transaction may not commit!
```
**Good — explicit transaction with context manager:**
```python
with driver.session() as session:
    with session.begin_transaction() as tx:
        tx.run("CREATE (n:Person {id: 'test'})")
        # Auto-commits when context exits normally
        # Auto-rolls back on exception
```
**Also good — managed write transaction:**
```python
def create_person_tx(tx, name):
    # randomUUID() fills the `id` property so that `RETURN a.id` is not null
    result = tx.run(
        "CREATE (a:Person {id: randomUUID(), name: $name}) RETURN a.id AS id",
        name=name,
    )
    record = result.single()
    return record["id"]

with driver.session() as session:
    node_id = session.execute_write(create_person_tx, "Alice")
```
### 2. Cypher MERGE Clause Ordering
**Problem:** `ON CREATE SET` must come immediately after `MERGE`, before any general `SET` clause. Placing `SET` before `ON CREATE SET` causes a syntax error.
**Bad — syntax error:**
```cypher
MERGE (p:Person {id: 'user_main'})
SET p.name = 'Main User',
    p.updated_at = datetime()
ON CREATE SET p.created_at = datetime()  // ERROR: Invalid input 'ON'
```
**Good — correct clause order:**
```cypher
MERGE (p:Person {id: 'user_main'})
ON CREATE SET p.created_at = datetime()
SET p.name = 'Main User',
    p.updated_at = datetime()
```
The full MERGE clause order is:
```
MERGE (pattern)
ON CREATE SET ... ← only runs when node is first created
ON MATCH SET ... ← only runs when node already exists (optional)
SET ... ← always runs
```
### 3. Consume Results in Transactions
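**Problem:** In managed transactions (`execute_write`), a result is only usable while the transaction is open. Consume it inside the transaction function and return plain values; reading the result after the function has returned raises `ResultConsumedError` in driver 5.x+.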
**Good pattern:**
```python
def create_node_tx(tx, node_id):
    result = tx.run("MERGE (n:Person {id: $id}) RETURN n.id AS id", id=node_id)
    record = result.single()  # Consumes the result
    return record["id"]
```
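
The same rule applies to queries that return many rows. The sketch below (function name hypothetical) copies the rows into a plain list before the transaction function returns, so nothing outside the transaction ever touches the driver's result object.

```python
def list_person_ids_tx(tx):
    """Return all Person ids as a plain list of values."""
    result = tx.run("MATCH (p:Person) RETURN p.id AS id ORDER BY id")
    # Iterating the result consumes it while the transaction is still open.
    return [record["id"] for record in result]
```

It would be called as `session.execute_read(list_person_ids_tx)`; the returned list stays valid after the transaction has closed.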
### 4. MATCH Returns No Rows ≠ Error
**Problem:** If a `MATCH` clause finds nothing, the query succeeds with zero rows — it does **not** raise an error. This means `MERGE` on a relationship after a failed `MATCH` silently does nothing.
```cypher
// If person_xyz doesn't exist, this returns 0 rows (no error)
MATCH (p:Person {id: 'person_xyz'})
MATCH (b:Book {id: 'book_abc'})
MERGE (p)-[:COMPLETED]->(b)
// Zero rows processed, zero relationships created, zero errors
```
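**Mitigation:** Give the query a `RETURN` clause (for example `MERGE (p)-[r:COMPLETED]->(b) RETURN type(r) AS rel`) and check `result.single()` for `None`; without a `RETURN`, `single()` is `None` even when the relationship was created: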
```python
record = result.single()
if record is None:
    logger.error("Endpoints not found; no relationship created")
```
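
Put together, a transaction function that detects missing endpoints might look like the following sketch. The function name, the query constant, and the boolean return convention are assumptions for illustration, not code taken from the scripts.

```python
import logging

logger = logging.getLogger(__name__)

# The RETURN clause is what makes an empty MATCH detectable from Python.
LINK_QUERY = """
MATCH (p:Person {id: $person_id})
MATCH (b:Book {id: $book_id})
MERGE (p)-[r:COMPLETED]->(b)
RETURN type(r) AS rel_type
"""


def link_completed_tx(tx, person_id, book_id):
    """Create the COMPLETED relationship; return False if an endpoint is missing."""
    result = tx.run(LINK_QUERY, person_id=person_id, book_id=book_id)
    record = result.single()
    if record is None:
        # Zero rows: at least one MATCH found nothing, so MERGE never ran.
        logger.error(
            "Endpoints not found (%s, %s); no relationship created",
            person_id,
            book_id,
        )
        return False
    return True
```

A caller could run it with `session.execute_write(link_completed_tx, "user_main", "book_abc")` and treat `False` as a validation failure.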
### 5. Separate Node and Relationship Transactions
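**Problem:** A relationship query that uses `MATCH` to find its endpoints only works if the node writes are already visible to it. Queries inside one explicit transaction do see that transaction's earlier writes, but when the nodes were created through auto-commit `session.run()` calls that had not persisted (see lesson 1), the later `MATCH` found nothing and no relationships were created.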
**Good pattern:** Create all nodes in one explicit transaction (commit), then create relationships in a separate explicit transaction:
```python
# Transaction 1: Create nodes
with session.begin_transaction() as tx:
    for query in node_queries:
        tx.run(query)
    # Auto-commits on exit

# Transaction 2: Create relationships (nodes now visible)
with session.begin_transaction() as tx:
    for query in relationship_queries:
        tx.run(query)
    # Auto-commits on exit
```
### 6. MCP Memory Server vs Neo4j Cypher Server
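**Problem:** A Neo4j-backed MCP Memory server (for example `mcp-neo4j-memory`) and the Neo4j Cypher MCP server can both connect to the same Neo4j instance, but they use completely different data models. Note that the reference `@modelcontextprotocol/server-memory` package persists its graph to a local JSON file rather than to Neo4j.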
| | Memory Server | Cypher Server |
|---|---|---|
| **Schema** | Fixed: `name`, `type`, `observations` | Your full custom schema |
| **Node labels** | `Memory`, `reference` | Your 74 defined types |
| **Relationships** | Simple string pairs | Rich typed relationships |
| **Query language** | API calls (`search_nodes`) | Full Cypher |
**Resolution:** If you have a custom Neo4j schema, use **only** the Cypher MCP server. Remove the Memory server to prevent it from polluting your graph with its own primitive node types.
---
## Dependencies
```bash
pip install neo4j
```
All three scripts require the `neo4j` Python package. APOC is optional but recommended (the init script's test suite checks for it).
---
## Version History
| Date | Change |
|------|--------|
| 2025-01-07 | Initial `neo4j-schema-init.py` |
| 2026-02-17 | Added `neo4j-reset.py` and `neo4j-validate.py` |
| 2026-02-17 | Fixed init script: explicit transactions, correct MERGE clause ordering |