relevanceai.operations.cluster.base#

The ClusterBase class is intended to be inherited so that users can add their own clustering algorithms and models. A cluster base has the following abstract methods (methods to be overwritten):

  • fit_predict

  • metadata (optional, if you want to store cluster metadata)

  • get_centers (optional, if you want to store cluster centroid documents)

ClusterBase is the most basic class to inherit. Use this class if you have an in-memory fitting algorithm.

If your clusters return centroids, you will want to inherit CentroidClusterBase.

If your clusters can fit on batches, you will want to inherit BatchClusterBase.

If you have both batches and centroids, you will want to inherit both.

import numpy as np
from faiss import Kmeans
from relevanceai import Client, CentroidClusterBase

client = Client()
df = client.Dataset("_github_repo_vectorai")

class FaissKMeans(CentroidClusterBase):
    def __init__(self, model):
        self.model = model

    def fit_predict(self, vectors):
        # Faiss expects float32 arrays.
        vectors = np.array(vectors).astype("float32")
        self.model.train(vectors)
        # assign() returns (distances, labels); keep the labels.
        cluster_labels = self.model.assign(vectors)[1]
        return cluster_labels

    def metadata(self):
        return self.model.__dict__

    def get_centers(self):
        return self.model.centroids

n_clusters = 10
d = 512
alias = f"faiss-kmeans-{n_clusters}"
vector_fields = ["documentation_vector_"]

model = FaissKMeans(model=Kmeans(d=d, k=n_clusters))
clusterer = client.ClusterOps(model=model, alias=alias)
clusterer.fit_predict_update(dataset=df, vector_fields=vector_fields)

Module Contents#

class relevanceai.operations.cluster.base.ClusterBase#

A ClusterBase for models to inherit from. This is the most basic class to inherit. Use this class if you have an in-memory fitting algorithm.

If your clusters return centroids, you will want to inherit CentroidClusterBase.

If your clusters can fit on batches, you will want to inherit BatchClusterBase.

If you have both batches and centroids, you will want to inherit both.

abstract fit_predict(self, vectors: list) → List[Union[str, float, int]]#

Edit this method to implement a ClusterBase.

Parameters

vectors (list) – The vectors that are going to be clustered

Example

from typing import List, Union
import numpy as np

class KMeansModel(ClusterBase):
    def __init__(self, k=10, init="k-means++", n_init=10,
                 max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm="auto"):
        self.init = init
        self.n_init = n_init
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.random_state = random_state
        self.copy_x = copy_x
        self.algorithm = algorithm
        self.n_clusters = k

    def _init_model(self):
        from sklearn.cluster import KMeans
        self.km = KMeans(
            n_clusters=self.n_clusters,
            init=self.init,
            verbose=self.verbose,
            max_iter=self.max_iter,
            tol=self.tol,
            random_state=self.random_state,
            copy_x=self.copy_x,
            algorithm=self.algorithm,
        )
        return

    def fit_predict(self, vectors: Union[np.ndarray, List]):
        if not hasattr(self, "km"):
            self._init_model()
        self.km.fit(vectors)
        cluster_labels = self.km.labels_.tolist()
        # cluster_centroids = self.km.cluster_centers_
        return cluster_labels

fit_documents(self, vector_fields: list, documents: List[dict], alias: str = 'default', cluster_field: str = '_cluster_', return_only_clusters: bool = True, inplace: bool = True)#

Train the clustering algorithm on documents and then store the resulting labels inside those documents (a usage sketch follows the parameter list).

Parameters
  • vector_fields (list) – The vector fields of the documents to cluster on

  • documents (list) – List of documents to run clustering on

  • alias (str) – What the clusters can be called

  • cluster_field (str) – What the cluster fields should be called

  • return_only_clusters (bool) – If True, return only the new cluster information for each document; otherwise, return the original documents with the cluster field added

  • inplace (bool) – If True, the documents are edited in place; otherwise, a copy is made first

  • kwargs (dict) – Any other keyword argument will go directly into the clustering algorithm
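
As a minimal sketch, reusing the KMeansModel class from the example above (the document contents, vector field name, and alias here are illustrative, and the exact nesting of the stored cluster field may differ):

documents = [
    {"_id": "doc-1", "documentation_vector_": [0.1, 0.2, 0.3]},
    {"_id": "doc-2", "documentation_vector_": [0.9, 0.8, 0.7]},
    {"_id": "doc-3", "documentation_vector_": [0.12, 0.19, 0.31]},
]

model = KMeansModel(k=2)
clustered = model.fit_documents(
    vector_fields=["documentation_vector_"],
    documents=documents,
    alias="kmeans-2",
)
# With return_only_clusters=True (the default), each returned document keeps its
# _id and gains the cluster label under cluster_field, along the lines of
# {"_id": "doc-1", "_cluster_": {"documentation_vector_": {"kmeans-2": 0}}}.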

property metadata(self) → dict#

If metadata is set, it will be stored on RelevanceAI. This is useful when you are looking to compare the metadata of your clusters.
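
A minimal sketch of overriding metadata on the KMeansModel subclass from the example above; the keys returned here are an illustrative choice, not a required schema:

class KMeansModelWithMetadata(KMeansModel):
    @property
    def metadata(self) -> dict:
        # Stored on RelevanceAI so that clustering runs can be compared later.
        return {
            "n_clusters": self.n_clusters,
            "max_iter": self.max_iter,
            "tol": self.tol,
        }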

class relevanceai.operations.cluster.base.AdvancedCentroidClusterBase#

This centroid cluster base assumes that you want to specify your own, more advanced centroid documents.

You may want to use this if you want to get more control over what is actually inserted as a centroid.

abstract get_centroid_documents(self) → List[Dict]#

Get the centroid documents.
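
A minimal sketch of implementing get_centroid_documents; the self.centers attribute and the documentation_vector_ field name are assumptions made for illustration:

from typing import Dict, List

from relevanceai.operations.cluster.base import AdvancedCentroidClusterBase

class MyAdvancedClusterer(AdvancedCentroidClusterBase):
    def fit_predict(self, vectors):
        # Cluster the vectors and keep the resulting centers on self.centers.
        ...

    def get_centroid_documents(self) -> List[Dict]:
        # Each centroid document carries its own _id and centroid vector field.
        return [
            {"_id": f"cluster-{i}", "documentation_vector_": list(center)}
            for i, center in enumerate(self.centers)
        ]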

class relevanceai.operations.cluster.base.CentroidBase#

Simple centroid base for clusters.

vector_fields :list#
abstract get_centers(self) → List[List[float]]#

Implement how to retrieve the cluster centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, and so on, in order. If you need more fine-grained control, please see get_centroid_documents.
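
A minimal sketch, assuming the wrapped model stores its fitted centers on a cluster_centers_ attribute (as scikit-learn estimators do); self.model is an illustrative attribute name:

from typing import List

from relevanceai.operations.cluster.base import CentroidBase

class MyCentroidClusterer(CentroidBase):
    def get_centers(self) -> List[List[float]]:
        # One vector per cluster; the SDK labels them cluster-0, cluster-1, ...
        return [list(center) for center in self.model.cluster_centers_]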

get_centroid_documents(self) → List#

Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().

{
    "_id": "document-id-1",
    "centroid_vector_": [0.23, 0.24, 0.23]
}

If there are multiple vector fields, it returns multiple centroid vector fields, for example:

{
    "_id": "document-id-1",
    "blue_vector_": [0.12, 0.312, 0.42],
    "red_vector_": [0.23, 0.41, 0.3]
}
class relevanceai.operations.cluster.base.SklearnCentroidBase(model)#

Simple centroid base for clusters.
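
A minimal usage sketch, on the assumption that SklearnCentroidBase wraps a scikit-learn clustering estimator and reads its fitted cluster_centers_ for get_centers; the data below is random and purely illustrative:

import numpy as np
from sklearn.cluster import KMeans

from relevanceai.operations.cluster.base import SklearnCentroidBase

vectors = np.random.rand(100, 8).tolist()

model = SklearnCentroidBase(KMeans(n_clusters=5))
cluster_labels = model.fit_predict(vectors)  # one label per input vector
centers = model.get_centers()                # one center vector per cluster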

get_centers(self)#

Implement how to retrieve the cluster centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, and so on, in order. If you need more fine-grained control, please see get_centroid_documents.

fit_predict(self, X)#

Edit this method to implement a ClusterBase.

Parameters

vectors (list) – The vectors that are going to be clustered

Example

from typing import List, Union
import numpy as np

class KMeansModel(ClusterBase):
    def __init__(self, k=10, init="k-means++", n_init=10,
                 max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm="auto"):
        self.init = init
        self.n_init = n_init
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.random_state = random_state
        self.copy_x = copy_x
        self.algorithm = algorithm
        self.n_clusters = k

    def _init_model(self):
        from sklearn.cluster import KMeans
        self.km = KMeans(
            n_clusters=self.n_clusters,
            init=self.init,
            verbose=self.verbose,
            max_iter=self.max_iter,
            tol=self.tol,
            random_state=self.random_state,
            copy_x=self.copy_x,
            algorithm=self.algorithm,
        )
        return

    def fit_predict(self, vectors: Union[np.ndarray, List]):
        if not hasattr(self, "km"):
            self._init_model()
        self.km.fit(vectors)
        cluster_labels = self.km.labels_.tolist()
        # cluster_centroids = self.km.cluster_centers_
        return cluster_labels

get_unique_labels(self)#
get_centroid_documents(self) → List#

Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().

{
    "_id": "document-id-1",
    "centroid_vector_": [0.23, 0.24, 0.23]
}

If there are multiple vector fields, it returns multiple centroid vector fields, for example:

{
    "_id": "document-id-1",
    "blue_vector_": [0.12, 0.312, 0.42],
    "red_vector_": [0.23, 0.41, 0.3]
}
class relevanceai.operations.cluster.base.HDBSCANClusterBase(model)#

Simple centroid base for clusters.
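
A minimal usage sketch, assuming the hdbscan package is installed and that fit_predict is inherited from SklearnCentroidBase; min_cluster_size and the random data are illustrative:

import hdbscan
import numpy as np

from relevanceai.operations.cluster.base import HDBSCANClusterBase

vectors = np.random.rand(200, 8).tolist()

model = HDBSCANClusterBase(hdbscan.HDBSCAN(min_cluster_size=5))
cluster_labels = model.fit_predict(vectors)  # HDBSCAN labels noise points as -1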

model :hdbscan.HDBSCAN#
get_unique_labels(self)#
get_centers(self)#

Implement how to retrieve the cluster centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, and so on, in order. If you need more fine-grained control, please see get_centroid_documents.

get_centroid_documents(self) → List#

Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().

{
    "_id": "document-id-1",
    "centroid_vector_": [0.23, 0.24, 0.23]
}

If there are multiple vector fields, it returns multiple centroid vector fields, for example:

{
    "_id": "document-id-1",
    "blue_vector_": [0.12, 0.312, 0.42],
    "red_vector_": [0.23, 0.41, 0.3]
}
class relevanceai.operations.cluster.base.CentroidClusterBase#

Inherit this class if you have centroid-based clustering. The difference between this and ClusterBase is that you can additionally specify how to get your centers via the get_centers method. This allows you to store your centers (see the FaissKMeans example above).

class relevanceai.operations.cluster.base.BatchClusterBase#

Inherit this class if you have a batch-fitting algorithm that needs to be trained incrementally and then used for prediction separately.

abstract partial_fit(self, vectors)#

Partial fit the vectors.

abstract predict(self)#

Predict the vectors.
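
A minimal sketch of a BatchClusterBase subclass built on scikit-learn's MiniBatchKMeans; note that predict here takes the batch of vectors as an argument, which is an assumption about how the SDK invokes it rather than something the abstract signature above confirms:

from sklearn.cluster import MiniBatchKMeans

from relevanceai.operations.cluster.base import BatchClusterBase

class MiniBatchKMeansCluster(BatchClusterBase):
    def __init__(self, n_clusters=10):
        self.model = MiniBatchKMeans(n_clusters=n_clusters)

    def partial_fit(self, vectors):
        # Incrementally train on one batch of vectors.
        self.model.partial_fit(vectors)

    def predict(self, vectors):
        # Assign each vector in the batch to its nearest learned cluster.
        return self.model.predict(vectors).tolist()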

fit_predict(self, X)#

Edit this method to implement a ClusterBase.

Parameters

vectors (list) – The vectors that are going to be clustered

Example

from typing import List, Union
import numpy as np

class KMeansModel(ClusterBase):
    def __init__(self, k=10, init="k-means++", n_init=10,
                 max_iter=300, tol=1e-4, verbose=0, random_state=None,
                 copy_x=True, algorithm="auto"):
        self.init = init
        self.n_init = n_init
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.random_state = random_state
        self.copy_x = copy_x
        self.algorithm = algorithm
        self.n_clusters = k

    def _init_model(self):
        from sklearn.cluster import KMeans
        self.km = KMeans(
            n_clusters=self.n_clusters,
            init=self.init,
            verbose=self.verbose,
            max_iter=self.max_iter,
            tol=self.tol,
            random_state=self.random_state,
            copy_x=self.copy_x,
            algorithm=self.algorithm,
        )
        return

    def fit_predict(self, vectors: Union[np.ndarray, List]):
        if not hasattr(self, "km"):
            self._init_model()
        self.km.fit(vectors)
        cluster_labels = self.km.labels_.tolist()
        # cluster_centroids = self.km.cluster_centers_
        return cluster_labels