relevanceai.operations.cluster.base
The ClusterBase class is intended to be inherited so that users can add their own clustering algorithms and models. A cluster base has the following abstract methods (methods to be overridden):
- fit_predict
- metadata (optional; implement it if you want to store cluster metadata)
- get_centers (optional; implement it if you want to store cluster centroid documents)
ClusterBase is the most basic class to inherit. Use this class if you have an in-memory fitting algorithm.
If your clusters return centroids, you will want to inherit CentroidClusterBase.
If your clusters can fit on batches, you will want to inherit BatchClusterBase.
If you have both batches and centroids, you will want to inherit both.
import numpy as np
from faiss import Kmeans
from relevanceai import Client, CentroidClusterBase

client = Client()
df = client.Dataset("_github_repo_vectorai")


class FaissKMeans(CentroidClusterBase):
    def __init__(self, model):
        self.model = model

    def fit_predict(self, vectors):
        # FAISS expects float32 input
        vectors = np.array(vectors).astype("float32")
        self.model.train(vectors)
        # assign() returns (distances, labels); keep the labels
        cluster_labels = self.model.assign(vectors)[1]
        return cluster_labels

    def metadata(self):
        return self.model.__dict__

    def get_centers(self):
        return self.model.centroids


n_clusters = 10
d = 512  # dimensionality of the vectors
alias = f"faiss-kmeans-{n_clusters}"
vector_fields = ["documentation_vector_"]

model = FaissKMeans(model=Kmeans(d=d, k=n_clusters))
clusterer = client.ClusterOps(model=model, alias=alias)
clusterer.fit_predict_update(dataset=df, vector_fields=vector_fields)
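For an in-memory algorithm that has no centroids, only fit_predict needs to be implemented. The following is a minimal sketch of that case. It assumes a local stand-in named ClusterBase (not the SDK class) so that it runs without relevanceai installed, and ArgmaxCluster is a hypothetical toy model used purely for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Union


class ClusterBase(ABC):
    """Stand-in for the SDK's ClusterBase (assumption, not the real class)."""

    @abstractmethod
    def fit_predict(self, vectors: list) -> List[Union[str, float, int]]:
        ...

    @property
    def metadata(self) -> dict:
        # Optional: override to describe the model
        return {}


class ArgmaxCluster(ClusterBase):
    """Toy model: a vector's cluster is the index of its largest component."""

    def fit_predict(self, vectors: list) -> List[int]:
        return [max(range(len(v)), key=v.__getitem__) for v in vectors]

    @property
    def metadata(self) -> dict:
        return {"strategy": "argmax"}


model = ArgmaxCluster()
labels = model.fit_predict([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
```

With the real SDK, the same subclass would inherit from relevanceai's ClusterBase instead of the stand-in.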
Module Contents
- class relevanceai.operations.cluster.base.ClusterBase
A base class for clustering models to inherit. This is the most basic class to inherit. Use this class if you have an in-memory fitting algorithm.
If your clusters return centroids, you will want to inherit CentroidClusterBase.
If your clusters can fit on batches, you will want to inherit BatchClusterBase.
If you have both Batches and Centroids, you will want to inherit both.
- abstract fit_predict(self, vectors: list) → List[Union[str, float, int]]
Edit this method to implement a ClusterBase.
- Parameters
vectors (list) – The vectors that are going to be clustered
Example
from typing import List, Union

import numpy as np
from relevanceai.operations.cluster.base import ClusterBase


class KMeansModel(ClusterBase):
    def __init__(self, k=10, init="k-means++", n_init=10, max_iter=300, tol=1e-4,
                 verbose=0, random_state=None, copy_x=True, algorithm="auto"):
        self.init = init
        self.n_init = n_init
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.random_state = random_state
        self.copy_x = copy_x
        self.algorithm = algorithm
        self.n_clusters = k

    def _init_model(self):
        # Import lazily so scikit-learn is only needed when the model is used
        from sklearn.cluster import KMeans

        self.km = KMeans(
            n_clusters=self.n_clusters,
            init=self.init,
            n_init=self.n_init,
            verbose=self.verbose,
            max_iter=self.max_iter,
            tol=self.tol,
            random_state=self.random_state,
            copy_x=self.copy_x,
            algorithm=self.algorithm,
        )

    def fit_predict(self, vectors: Union[np.ndarray, List]):
        if not hasattr(self, "km"):
            self._init_model()
        self.km.fit(vectors)
        cluster_labels = self.km.labels_.tolist()
        # cluster_centroids = self.km.cluster_centers_
        return cluster_labels
- fit_documents(self, vector_fields: list, documents: List[dict], alias: str = 'default', cluster_field: str = '_cluster_', return_only_clusters: bool = True, inplace: bool = True)
Train the clustering algorithm on documents and then store the labels inside the documents.
- Parameters
vector_fields (list) – The vector fields of the documents
documents (list) – List of documents to run clustering on
alias (str) – What the clusters can be called
cluster_field (str) – What the cluster fields should be called
return_only_clusters (bool) – If True, return only clusters, otherwise returns the original document
inplace (bool) – If True, the documents are edited inplace otherwise, a copy is made first
kwargs (dict) – Any other keyword argument will go directly into the clustering algorithm
- property metadata(self) → dict
If metadata is set, it will be stored on RelevanceAI. This is useful when you want to compare the metadata of your clusters.
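The behaviour described for fit_documents (extract vectors, cluster them, write the labels back) can be sketched roughly as below. This is not the SDK's implementation: fit_documents_sketch is a hypothetical helper, and the nested label layout (cluster_field, then vector field name, then alias), the concatenation of multiple vector fields, and the retention of "_id" are all assumptions made for illustration.

```python
import copy
from typing import Callable, Dict, List


def fit_documents_sketch(
    fit_predict: Callable[[List[List[float]]], list],
    vector_fields: List[str],
    documents: List[Dict],
    alias: str = "default",
    cluster_field: str = "_cluster_",
    return_only_clusters: bool = True,
    inplace: bool = True,
) -> List[Dict]:
    if not inplace:
        # Work on a copy so the caller's documents stay untouched
        documents = copy.deepcopy(documents)

    # Assumption: several vector fields are concatenated into one vector
    vectors = [
        [x for field in vector_fields for x in doc[field]] for doc in documents
    ]
    labels = fit_predict(vectors)

    # Assumption: labels are nested as cluster_field -> field name -> alias
    field_key = ".".join(vector_fields)
    for doc, label in zip(documents, labels):
        doc.setdefault(cluster_field, {}).setdefault(field_key, {})[alias] = label

    if return_only_clusters:
        return [
            {"_id": doc.get("_id"), cluster_field: doc[cluster_field]}
            for doc in documents
        ]
    return documents


docs = [{"_id": "a", "v_": [1.0, 0.0]}, {"_id": "b", "v_": [0.0, 1.0]}]
clustered = fit_documents_sketch(
    lambda vs: [0 if v[0] > v[1] else 1 for v in vs],
    vector_fields=["v_"],
    documents=docs,
    inplace=False,
)
```

Passing inplace=False leaves docs unchanged, which makes it easy to compare the input with the clustered output.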
- class relevanceai.operations.cluster.base.AdvancedCentroidClusterBase
This centroid cluster base assumes that you want to specify the centroid documents yourself.
Use it if you want more control over what is actually inserted as a centroid.
- abstract get_centroid_documents(self) → List[Dict]
Get the centroid documents.
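As an illustration of that extra control, the sketch below returns centroid documents with hand-picked ids instead of automatically generated ones. The base class here is a local stand-in (an assumption, so the code runs without the SDK), and NamedCentroids, its constructor arguments, and the field name are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AdvancedCentroidClusterBase(ABC):
    """Stand-in for the SDK class of the same name (assumption)."""

    @abstractmethod
    def get_centroid_documents(self) -> List[Dict]:
        ...


class NamedCentroids(AdvancedCentroidClusterBase):
    """Hypothetical model whose centroids carry human-readable ids."""

    def __init__(self, centers: Dict[str, List[float]], vector_field: str):
        self.centers = centers
        self.vector_field = vector_field

    def get_centroid_documents(self) -> List[Dict]:
        # One document per centroid: chosen id plus the centroid vector
        return [
            {"_id": name, self.vector_field: vector}
            for name, vector in self.centers.items()
        ]


model = NamedCentroids(
    centers={"sports": [0.1, 0.9], "politics": [0.8, 0.2]},
    vector_field="centroid_vector_",
)
centroid_docs = model.get_centroid_documents()
```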
- class relevanceai.operations.cluster.base.CentroidBase
Simple centroid base for clusters.
- vector_fields: list
- abstract get_centers(self) → List[List[float]]
Add how you need to get centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, etc. in order. If you need more fine-grained control, see get_centroid_documents.
- get_centroid_documents(self) → List
Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().
{ "_id": "document-id-1", "centroid_vector_": [0.23, 0.24, 0.23] }
If there are multiple vector fields, each centroid document contains one vector per field:
{ "_id": "document-id-1", "blue_vector_": [0.12, 0.312, 0.42], "red_vector_": [0.23, 0.41, 0.3] }
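The relationship between get_centers and the stored centroid documents can be sketched as follows for a single vector field. The helper centers_to_documents is hypothetical, and placing the cluster-N label in "_id" is an assumption; the source only states that centers are labelled cluster-0, cluster-1, and so on in order.

```python
from typing import Dict, List


def centers_to_documents(
    centers: List[List[float]], vector_field: str = "centroid_vector_"
) -> List[Dict]:
    # Label each center by its position: cluster-0, cluster-1, ...
    return [
        {"_id": f"cluster-{i}", vector_field: list(center)}
        for i, center in enumerate(centers)
    ]


centroid_docs = centers_to_documents([[0.23, 0.24, 0.23], [0.5, 0.1, 0.9]])
```

Because the labels follow the order of the returned list, get_centers should return centers in the same order as the cluster labels produced by fit_predict.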
- class relevanceai.operations.cluster.base.SklearnCentroidBase(model)
Simple centroid base for clusters.
- get_centers(self)
Add how you need to get centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, etc. in order. If you need more fine-grained control, see get_centroid_documents.
- fit_predict(self, X)
Edit this method to implement a ClusterBase.
- Parameters
X (list) – The vectors that are going to be clustered
- get_unique_labels(self)
- get_centroid_documents(self) → List
Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().
{ "_id": "document-id-1", "centroid_vector_": [0.23, 0.24, 0.23] }
If there are multiple vector fields, each centroid document contains one vector per field:
{ "_id": "document-id-1", "blue_vector_": [0.12, 0.312, 0.42], "red_vector_": [0.23, 0.41, 0.3] }
- class relevanceai.operations.cluster.base.HDBSCANClusterBase(model)
Simple centroid base for clusters.
- model: hdbscan.HDBSCAN
- get_unique_labels(self)
- get_centers(self)
Add how you need to get centers here. This should return a list of vectors. The SDK will then label each center cluster-0, cluster-1, cluster-2, etc. in order. If you need more fine-grained control, see get_centroid_documents.
- get_centroid_documents(self) → List
Get the centroid documents to store. This enables you to use list_closest_to_center() and list_furthest_from_center().
{ "_id": "document-id-1", "centroid_vector_": [0.23, 0.24, 0.23] }
If there are multiple vector fields, each centroid document contains one vector per field:
{ "_id": "document-id-1", "blue_vector_": [0.12, 0.312, 0.42], "red_vector_": [0.23, 0.41, 0.3] }
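Density-based models such as HDBSCAN do not produce centroids themselves, so a get_centers implementation has to derive them from the labels. One plausible approach, sketched below, is to average the members of each cluster and skip noise points, which HDBSCAN labels -1. Whether the SDK computes centers this way is not stated in the source; centers_from_labels is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def centers_from_labels(
    vectors: List[List[float]], labels: List[int], noise_label: int = -1
) -> Dict[int, List[float]]:
    # Group vectors by cluster label, ignoring noise points
    groups: Dict[int, List[List[float]]] = defaultdict(list)
    for vector, label in zip(vectors, labels):
        if label != noise_label:
            groups[label].append(vector)

    # The center of a cluster is the component-wise mean of its members
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in sorted(groups.items())
    }


centers = centers_from_labels(
    vectors=[[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [5.0, 5.0]],
    labels=[0, 0, 1, -1],
)
```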
- class relevanceai.operations.cluster.base.CentroidClusterBase
Inherit this class if you have a centroid-based clustering algorithm. The difference between this and ClusterBase is that you can additionally specify how to get your centers in the get_centers method. This allows you to store your centers.
- class relevanceai.operations.cluster.base.BatchClusterBase
Inherit this class if you have a batch-fitting algorithm that needs to be trained and then predicted separately.
- abstract partial_fit(self, vectors)
Partial fit the vectors.
- abstract predict(self)
Predict the vectors.
- fit_predict(self, X)
Edit this method to implement a ClusterBase.
- Parameters
X (list) – The vectors that are going to be clustered
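A model that both fits on batches and exposes centroids can inherit from both bases. The sketch below shows the idea with a small sequential k-means written in plain Python. Both base classes are local stand-ins (assumptions, so the code runs without the SDK), OnlineKMeans is hypothetical, and predict is given a vectors argument here for illustration even though the signature listed above takes none.

```python
from abc import ABC, abstractmethod
from typing import List


class BatchClusterBase(ABC):
    """Stand-in for the SDK's BatchClusterBase (assumption)."""

    @abstractmethod
    def partial_fit(self, vectors): ...

    @abstractmethod
    def predict(self, vectors): ...


class CentroidBase(ABC):
    """Stand-in for the SDK's CentroidBase (assumption)."""

    @abstractmethod
    def get_centers(self) -> List[List[float]]: ...


class OnlineKMeans(BatchClusterBase, CentroidBase):
    def __init__(self, k: int):
        self.k = k
        self.centers: List[List[float]] = []
        self.counts: List[int] = []

    def _nearest(self, vector) -> int:
        # Index of the center with the smallest squared distance
        return min(
            range(len(self.centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, self.centers[i])),
        )

    def partial_fit(self, vectors):
        for vector in vectors:
            if len(self.centers) < self.k:
                # Seed the first k centers with the first k vectors seen
                self.centers.append([float(x) for x in vector])
                self.counts.append(1)
                continue
            i = self._nearest(vector)
            self.counts[i] += 1
            rate = 1.0 / self.counts[i]
            # Move the center toward the vector (running mean update)
            self.centers[i] = [c + rate * (x - c) for c, x in zip(self.centers[i], vector)]

    def predict(self, vectors):
        return [self._nearest(v) for v in vectors]

    def get_centers(self) -> List[List[float]]:
        return self.centers


model = OnlineKMeans(k=2)
model.partial_fit([[0.0, 0.0], [10.0, 10.0]])  # first batch
model.partial_fit([[0.0, 2.0], [10.0, 8.0]])   # second batch
labels = model.predict([[1.0, 1.0], [9.0, 9.0]])
```

Training and prediction stay separate, so partial_fit can be called once per batch before predict assigns labels.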