Sub Clustering

Sub-clustering means running a clustering algorithm within clusters that already exist. It is useful when you want to dive deeper into a single cluster or cluster within a specific subset of your data.

With Relevance AI, sub-cluster values are formed by appending “-{cluster_id}” to the parent cluster value, where cluster_id is usually a number. For example, if you set the parent_field to cars, which has the values mercedes, tesla, and honda, the sub-cluster values could be mercedes-1, mercedes-2, tesla-1, and honda-6.
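
The naming convention described above can be illustrated with a short sketch. The helpers `subcluster_label` and `split_label` are hypothetical and not part of the Relevance AI SDK; they only show how a parent value and a sub-cluster ID might combine into a label, and how the two parts could be separated again.

```python
def subcluster_label(parent_value: str, cluster_id: int) -> str:
    """Append "-{cluster_id}" to the parent cluster value."""
    return f"{parent_value}-{cluster_id}"


def split_label(label: str) -> tuple:
    """Split on the last hyphen so parent values containing hyphens survive."""
    parent_value, _, cluster_id = label.rpartition("-")
    return parent_value, cluster_id


labels = [subcluster_label("mercedes", 1), subcluster_label("honda", 6)]
print(labels)
print(split_label("tesla-1"))
```

Splitting on the last hyphen is a design choice made for this sketch; it assumes the sub-cluster ID itself never contains a hyphen.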

Basic

The easiest way to run subclustering is:

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")

# Insert a dummy dataset for now
from relevanceai.utils.datasets import mock_documents
ds.upsert_documents(mock_documents(100))

from sklearn.cluster import KMeans
model = KMeans(n_clusters=5)

cluster_ops = ds.cluster(
   model,
   vector_fields=["sample_1_vector_"],
   alias="sample_1")

# You can find the parent field in the schema or alternatively provide a field
parent_field = "_cluster_.sample_1_vector_.sample_1"

# Given the parent field - we now run subclustering
ds.subcluster(
   model=model,
   parent_field=parent_field,
   vector_fields=["sample_2_vector_"],
   alias="subcluster-kmeans-2",
   min_parent_cluster_size=2 # Minimum number of points a parent cluster needs before it is subclustered
)

# You should also be able to track your subclusters using
ds.metadata

# Should output something like this:
# {'_subcluster_': [{'parent_field': '_cluster_.sample_1_vector_.sample_1', 'cluster_field': '_cluster_.sample_2_vector_.subcluster-kmeans-2'}]}


# You can also view your subcluster results using
ds['_cluster_.sample_2_vector_.subcluster-kmeans-2']

[Figure: Subclustering Output]
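
To make the idea behind the example above more concrete, here is a minimal pure-Python sketch of subclustering: documents are grouped by their parent cluster value, groups smaller than `min_parent_cluster_size` are skipped, and a clustering function runs inside each remaining group. This is not the library's implementation. The function names, the document layout, and the toy split around the mean (a stand-in for a real model such as KMeans) are all assumptions made for illustration, as is the choice to skip small parent clusters entirely.

```python
from collections import defaultdict


def toy_two_way_split(values):
    """Hypothetical stand-in for a real model: 0 below the mean, 1 otherwise."""
    mean = sum(values) / len(values)
    return [0 if v < mean else 1 for v in values]


def subcluster(docs, parent_field, value_field, min_parent_cluster_size=2):
    """Cluster within each parent cluster; return {doc_id: sub-cluster label}."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[parent_field]].append(doc)

    labels = {}
    for parent_value, members in groups.items():
        # Assumption: parent clusters that are too small get no sub-cluster label.
        if len(members) < min_parent_cluster_size:
            continue
        sub_ids = toy_two_way_split([m[value_field] for m in members])
        for member, sub_id in zip(members, sub_ids):
            labels[member["_id"]] = f"{parent_value}-{sub_id}"
    return labels


docs = [
    {"_id": "a", "parent": "tesla", "x": 1.0},
    {"_id": "b", "parent": "tesla", "x": 9.0},
    {"_id": "c", "parent": "honda", "x": 4.0},
]
result = subcluster(docs, "parent", "x", min_parent_cluster_size=2)
print(result)
```

In this sketch the single honda document is left unlabeled because its parent cluster has fewer than two members.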

API Reference

The API reference below describes the subclustering operations in more detail:

class relevanceai.operations.cluster.sub.SubClusterOps
fit_predict(dataset, vector_fields, parent_field=None, filters=None, verbose=False, min_parent_cluster_size=None, cluster_ids=None)

Run subclustering on your dataset using an in-memory clustering algorithm.

Parameters
  • dataset (Dataset) – The dataset to run subclustering on

  • vector_fields (List) – The list of vector fields to run fitting, prediction and updating on

  • filters (Optional[List]) – The list of filters to run clustering on

  • verbose (bool) – If True, print progress information while running

Example

from relevanceai import Client
client = Client()

from relevanceai.package_utils.datasets import mock_documents
ds = client.Dataset("sample")

# Creates 100 sample documents
documents = mock_documents(100)
ds.upsert_documents(documents)

from sklearn.cluster import KMeans
model = KMeans(n_clusters=10)
clusterer = client.ClusterOps(alias="minibatchkmeans-10", model=model)
clusterer.subcluster_predict_update(
    dataset=ds,
)

list_unique(field=None, minimum_amount=3, dataset_id=None, num_clusters=1000)

List the unique cluster IDs.

Example

from relevanceai import Client
client = Client()
cluster_ops = client.ClusterOps(
    alias="kmeans_8", vector_fields=["sample_vector_"]
)
cluster_ops.list_unique()

Parameters
  • alias (str) – The alias to use for clustering

  • minimum_cluster_size (int) – The minimum size of the clusters

  • dataset_id (str) – The dataset ID

  • num_clusters (int) – The number of clusters
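
As a rough illustration of what listing unique cluster IDs with a minimum size involves, the sketch below counts cluster values and keeps those that occur often enough. It is a local stand-in, not the library's implementation: `list_unique_clusters` is a hypothetical helper that works on an in-memory list, and reading `minimum_amount` as "minimum number of documents per cluster" is an assumption.

```python
from collections import Counter


def list_unique_clusters(cluster_values, minimum_amount=3):
    """Return cluster IDs occurring at least `minimum_amount` times, sorted."""
    counts = Counter(cluster_values)
    return sorted(cid for cid, n in counts.items() if n >= minimum_amount)


values = ["tesla-1"] * 4 + ["tesla-2"] * 3 + ["honda-6"] * 2
unique_ids = list_unique_clusters(values, minimum_amount=3)
print(unique_ids)
```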

store_subcluster_metadata(parent_field, cluster_field)

Store subcluster metadata
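
The metadata printed in the Basic example suggests that each subclustering run is recorded as a `parent_field`/`cluster_field` pair under a `_subcluster_` key. The sketch below mimics that structure with a plain dictionary. The helpers `record_subcluster` and `subcluster_fields_for` are hypothetical, and skipping duplicate entries is an assumption rather than documented behavior.

```python
def record_subcluster(metadata, parent_field, cluster_field):
    """Append a parent/sub-cluster pair under the "_subcluster_" key."""
    entry = {"parent_field": parent_field, "cluster_field": cluster_field}
    entries = metadata.setdefault("_subcluster_", [])
    if entry not in entries:  # assumption: the same pair is not stored twice
        entries.append(entry)
    return metadata


def subcluster_fields_for(metadata, parent_field):
    """Look up every sub-cluster field recorded for a given parent field."""
    return [
        e["cluster_field"]
        for e in metadata.get("_subcluster_", [])
        if e["parent_field"] == parent_field
    ]


metadata = {}
parent = "_cluster_.sample_1_vector_.sample_1"
child = "_cluster_.sample_2_vector_.subcluster-kmeans-2"
record_subcluster(metadata, parent, child)
record_subcluster(metadata, parent, child)  # second call is a no-op
print(subcluster_fields_for(metadata, parent))
```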

subcluster_predict_documents(vector_fields=None, filters=None, min_parent_cluster_size=None, cluster_ids=None, verbose=True)

Subclustering using fit predict update. This will loop through all of the different clusters and then run subclustering on them. For this, you need to

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")

# Creating 100 sample documents
from relevanceai.package_utils.datasets import mock_documents
documents = mock_documents(100)
ds.upsert_documents(documents)

# Run simple clustering first
ds.auto_cluster("kmeans-3", vector_fields=["sample_1_vector_"])

# Start KMeans
from sklearn.cluster import KMeans
model = KMeans(n_clusters=20)

# Run subclustering.
cluster_ops = client.ClusterOps(
    alias="subclusteringkmeans",
    model=model,
    parent_alias="kmeans-3")

subpartialfit_predict_update(dataset, vector_fields, filters=None, cluster_ids=None, verbose=True)

Run partial fit subclustering on your dataset.

Parameters
  • dataset (Dataset) – The dataset to call fit predict update on

  • vector_fields (list) – The list of vector fields

  • filters (list) – The list of filters
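
Partial fitting generally means updating a model one batch at a time instead of loading all data at once. The sketch below shows that pattern with the simplest possible model, a single running centroid over one-dimensional values. It is not MiniBatchKMeans and not what this method does internally; `partial_fit_mean` and `batches` are hypothetical helpers used only to illustrate batch-wise updates.

```python
def partial_fit_mean(state, batch):
    """Update a running (count, mean) pair with one batch of numbers."""
    count, mean = state
    for value in batch:
        count += 1
        mean += (value - mean) / count  # incremental mean update
    return count, mean


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
state = (0, 0.0)
for batch in batches(data, size=2):
    state = partial_fit_mean(state, batch)
print(state)
```

After all batches are processed, the running centroid matches the mean computed over the full data, which is the property that makes batch-wise fitting attractive for datasets that do not fit in memory.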

Example

from relevanceai import Client
client = Client()

from relevanceai.package_utils.datasets import mock_documents
ds = client.Dataset("sample")
# Creates 100 sample documents
documents = mock_documents(100)
ds.upsert_documents(documents)

from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=10)
clusterer = client.ClusterOps(alias="minibatchkmeans-10", model=model)
clusterer.subpartialfit_predict_update(
    dataset=ds,
)