Vector indexing

The vector data type comes with its own index type. A vector index is a dedicated index for storing vectors and searching through them.

To create this type of index use the following syntax:

CREATE VECTOR INDEX FOR <entity_pattern> ON <entity_attribute> OPTIONS <options>

The options are:

{
dimension: INT, // Required, length of the vector to be indexed
similarityFunction: STRING, // Required, currently only euclidean or cosine are allowed
M: INT, // Optional, maximum number of outgoing edges per node. default 16
efConstruction: INT, // Optional, number of candidates during construction. default 200
efRuntime: INT // Optional, number of candidates during search. default 10
}

For example, to create a vector index over the description attribute of all Product nodes, use the following syntax:

graph.query("CREATE VECTOR INDEX FOR (p:Product) ON (p.description) OPTIONS {dimension:128, similarityFunction:'euclidean'}")

Similarly, to create a vector index over the summary attribute of all Call relationships, use the following syntax:

graph.query("CREATE VECTOR INDEX FOR ()-[e:Call]->() ON (e.summary) OPTIONS {dimension:128, similarityFunction:'euclidean'}")

Important: When creating a vector index, both the vector dimension and the similarity function must be provided. Currently, the only supported similarity functions are 'euclidean' and 'cosine'.
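
As a minimal sketch, the statement can be assembled in Python from its options and checked before it is sent to the server. The helper name build_vector_index_query and its validation rules are illustrative assumptions, not part of the client library; only the option names and the two similarity functions come from the syntax above.

```python
ALLOWED_SIMILARITY = {"euclidean", "cosine"}
OPTIONAL_KEYS = {"M", "efConstruction", "efRuntime"}

def build_vector_index_query(pattern, attribute, dimension, similarity, **optional):
    """Return a CREATE VECTOR INDEX statement (hypothetical helper)."""
    if not isinstance(dimension, int) or dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"similarityFunction must be one of {sorted(ALLOWED_SIMILARITY)}")
    unknown = set(optional) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    # Required options first, then any optional HNSW settings that were passed.
    parts = [f"dimension:{dimension}", f"similarityFunction:'{similarity}'"]
    parts += [f"{name}:{int(value)}" for name, value in optional.items()]
    return f"CREATE VECTOR INDEX FOR {pattern} ON ({attribute}) OPTIONS {{{', '.join(parts)}}}"

print(build_vector_index_query("(p:Product)", "p.description", 128, "euclidean"))
```

The returned string can then be passed to graph.query(...) as in the examples above.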

Understanding Vector Index Parameters

Required Parameters

  • dimension: The length of the vectors to be indexed. Must match the dimensionality of your embeddings (e.g., 128, 384, 768, 1536).
  • similarityFunction: The distance metric used for similarity search:
    • euclidean: Euclidean distance (L2 norm). Best for embeddings where magnitude matters.
    • cosine: Cosine similarity. Best for normalized embeddings where direction matters more than magnitude.
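
To make the difference between the two metrics concrete, the sketch below computes both by their textbook definitions in plain Python. It illustrates the mathematics only; it is not a description of how the index computes or reports its score.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(euclidean_distance(a, b))  # non-zero: the magnitudes differ
print(cosine_similarity(a, b))   # approximately 1: the directions match
```

The two vectors point the same way, so cosine treats them as near-identical, while euclidean still separates them by their difference in length.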

Optional Parameters

These parameters control the HNSW (Hierarchical Navigable Small World) index structure:

  • M (default: 16): Maximum number of connections per node in the graph

    • Higher values improve recall but increase memory usage and build time
    • Recommended range: 12-48
    • Use 16-32 for most applications
  • efConstruction (default: 200): Number of candidates evaluated during index construction

    • Higher values improve index quality but slow down indexing
    • Recommended range: 100-400
    • Use 200-300 for balanced quality/speed
  • efRuntime (default: 10): Number of candidates evaluated during search

    • Higher values improve recall but slow down queries
    • Can be adjusted per-query for speed/accuracy tradeoffs
    • Recommended: Start with 10, increase if recall is insufficient

Inserting vectors

To create a new vector use the vecf32 function as follows:

graph.query("CREATE (p: Product {description: vecf32([2.1, 0.82, 1.3])})")

The above query creates a new Product node with a description attribute containing a vector.
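
When vectors come from application code, it can help to check their length before building the query, since every indexed vector must match the index dimension. The following is a small sketch; vecf32_literal is a hypothetical helper, not a client function.

```python
def vecf32_literal(values, dimension):
    """Format numbers as a vecf32([...]) expression after a length check (hypothetical helper)."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    return "vecf32([" + ", ".join(repr(float(v)) for v in values) + "])"

# Literal Cypher braces are doubled inside the f-string.
query = f"CREATE (p:Product {{description: {vecf32_literal([2.1, 0.82, 1.3], 3)}}})"
print(query)
# prints: CREATE (p:Product {description: vecf32([2.1, 0.82, 1.3])})
```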

Query vector index

Vector indexes are used to find vectors similar to a given query vector, with the index's similarity function serving as the measure of "distance".

To query the index use either db.idx.vector.queryNodes for node retrieval or db.idx.vector.queryRelationships for relationships.

CALL db.idx.vector.queryNodes(
label: STRING,
attribute: STRING,
k: INTEGER,
query: VECTOR
) YIELD node, score

CALL db.idx.vector.queryRelationships(
relationshipType: STRING,
attribute: STRING,
k: INTEGER,
query: VECTOR
) YIELD relationship, score

To retrieve up to 10 Product nodes whose description vectors are most similar to a given query vector, issue the following procedure call:

result = graph.query("CALL db.idx.vector.queryNodes('Product', 'description', 10, vecf32(<array_of_vector_elements>)) YIELD node")

The procedure can yield both the indexed entity whose vector matched (node or relationship) and a score measuring how close that vector is to the query vector.
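
As a sketch of a call that requests both outputs, the helper below picks the procedure that matches the entity kind and yields the entity together with its score. The function name build_vector_search_query is an assumption made for illustration; the procedure names and argument order follow the signatures above.

```python
def build_vector_search_query(kind, name, attribute, k, vector):
    """Build a vector search call; kind is 'node' or 'relationship' (hypothetical helper)."""
    procedures = {
        "node": ("db.idx.vector.queryNodes", "node"),
        "relationship": ("db.idx.vector.queryRelationships", "relationship"),
    }
    procedure, yielded = procedures[kind]
    literal = ", ".join(str(float(v)) for v in vector)
    return (
        f"CALL {procedure}('{name}', '{attribute}', {int(k)}, vecf32([{literal}])) "
        f"YIELD {yielded}, score RETURN {yielded}, score"
    )

print(build_vector_search_query("node", "Product", "description", 10, [2.1, 0.82, 1.3]))
```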

Deleting a vector index

To remove a vector index, simply issue the drop index command as follows:

DROP VECTOR INDEX FOR <entity_pattern> ON <entity_attribute>

For example, to drop the vector index over Product description, invoke:

graph.query("DROP VECTOR INDEX FOR (p:Product) ON (p.description)")

Index Management

Listing Vector Indexes

To view all indexes (including vector) in your graph, use:

CALL db.indexes()

Vector indexes are marked with type VECTOR and show the dimension and similarity function in the options field.

Verifying Vector Index Usage

To verify that a vector index is being used, examine the query execution plan:

# Query using vector index
query_vector = [2.1, 0.82, 1.3]
result = graph.explain(f"CALL db.idx.vector.queryNodes('Product', 'description', 10, vecf32({query_vector})) YIELD node RETURN node")
print(result)
# Output shows: ProcedureCall | db.idx.vector.queryNodes

Performance Tradeoffs and Best Practices

When to Use Vector Indexes

Vector indexes are essential for:

  • Semantic search: Finding similar items based on meaning, not just keywords
  • Recommendation systems: Discovering similar products, content, or users
  • RAG (Retrieval Augmented Generation): Retrieving relevant context for LLMs
  • Duplicate detection: Finding near-duplicate items based on embeddings
  • Image/audio similarity: When using vision or audio embedding models

Performance Considerations

Benefits:

  • Enables efficient approximate nearest neighbor (ANN) search
  • Scales to millions of vectors with sub-linear query time
  • Supports both node and relationship vectors

Costs:

  • Memory usage: Vector indexes are memory-intensive
    • A 1M vector index with 768 dimensions (float32) requires ~3GB of memory for the raw vectors alone, or roughly 3.7GB once the HNSW overhead is included
    • Formula: vectors × dimensions × 4 bytes + HNSW overhead (~20%)
  • Build time: Index construction can be slow for large datasets
  • Approximate results: Returns approximate (not exact) nearest neighbors
  • No support for filtering: Vector queries don't combine well with property filters
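
The memory formula above can be turned into a quick estimate. This is a back-of-the-envelope sketch; the function name is hypothetical, and the 20% overhead is the approximate figure quoted above rather than a measured value.

```python
def estimate_index_memory_bytes(num_vectors, dimensions, overhead=0.20):
    """Return (raw_bytes, total_bytes) for float32 vectors plus HNSW overhead."""
    raw = num_vectors * dimensions * 4  # float32 takes 4 bytes per element
    return raw, raw * (1 + overhead)

raw, total = estimate_index_memory_bytes(1_000_000, 768)
print(f"raw vectors: {raw / 1e9:.2f} GB, with overhead: {total / 1e9:.2f} GB")
# prints: raw vectors: 3.07 GB, with overhead: 3.69 GB
```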

Recommendations:

  • Choose appropriate vector dimensions (balance between quality and cost)
  • Use cosine similarity for normalized embeddings (e.g., from OpenAI, Sentence Transformers)
  • Use euclidean distance for unnormalized data
  • Tune M and efConstruction based on your accuracy requirements
  • Consider batch indexing for large datasets
  • Monitor memory usage carefully

Similarity Function Tradeoffs

Cosine Similarity:

  • Best for: Text embeddings, normalized vectors
  • Measures: Angular distance between vectors
  • Range: -1 to 1 (1 = identical direction)
  • Use when: Vector magnitude is not meaningful

Euclidean Distance:

  • Best for: Unnormalized data, physical measurements
  • Measures: Straight-line distance between vectors
  • Range: 0 to ∞ (0 = identical)
  • Use when: Both direction and magnitude matter

The following example creates an index, inserts an embedding, and runs a similarity search. Here model stands for your own embedding model and is not defined by this snippet:

# Create vector index for product embeddings
graph.query("CREATE VECTOR INDEX FOR (p:Product) ON (p.embedding) OPTIONS {dimension:768, similarityFunction:'cosine', M:32, efConstruction:200}")

# Insert products with embeddings (embeddings would come from your model)
embedding = model.encode("laptop computer") # Your embedding model
# Literal Cypher braces are doubled inside the f-string so Python does not treat them as placeholders
graph.query(f"CREATE (p:Product {{name: 'Laptop', embedding: vecf32({embedding.tolist()})}})")

# Search for similar products
query_embedding = model.encode("notebook pc")
result = graph.query(f"CALL db.idx.vector.queryNodes('Product', 'embedding', 5, vecf32({query_embedding.tolist()})) YIELD node, score RETURN node.name, score ORDER BY score DESC")
for record in result.result_set:
    print(f"Product: {record[0]}, Similarity: {record[1]}")

Troubleshooting

Common Issues:

  1. Dimension mismatch: Ensure all vectors have the same dimension as specified in the index
  2. Wrong similarity function: Use cosine for normalized vectors, euclidean for unnormalized
  3. Poor recall: Increase efRuntime or efConstruction parameters
  4. Slow queries: Decrease efRuntime or reduce k (number of results)
  5. High memory usage: Reduce M parameter or use lower-dimensional embeddings
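
To decide whether recall is actually the problem, one option is to compare the index results against an exact brute-force search on a small sample. The sketch below uses Euclidean distance and hypothetical helper names; the approx list stands in for the ids returned by a real index query.

```python
import math

def exact_top_k(vectors, query, k):
    """Brute-force k nearest neighbours by Euclidean distance; vectors maps id -> list."""
    ranked = sorted(vectors, key=lambda vid: math.dist(vectors[vid], query))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbours that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)

vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0], "d": [5.0, 5.0]}
exact = exact_top_k(vectors, [0.1, 0.1], k=3)
approx = ["a", "b", "d"]  # stand-in for ids returned by the vector index
print(recall_at_k(approx, exact))
```

If recall on the sample is low, raising efRuntime or efConstruction is the first thing to try; if it is already high, slow queries are better addressed by lowering efRuntime or k.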