relevanceai.operations_new.cluster.ops#

Module Contents#

class relevanceai.operations_new.cluster.ops.ClusterOps(dataset_id: str, vector_fields: list, alias: str, model=None, model_kwargs=None, cluster_field: str = '_cluster_', byo_cluster_field: str = None, include_cluster_report: bool = False, verbose: bool = False, **kwargs)#

Cluster-related functionalities

model_name: str#
post_run(self, dataset, documents, updated_documents)#
insert_centroids(self, centroid_documents) None#

Insert centroids. Centroid documents should look like the example below.

cluster_ops = client.ClusterOps(
    vector_fields=["sample_1_vector_"],
    alias="sample"
)
cluster_ops.insert_centroids(
    centroid_documents=[
        {"_id": "cluster-0", "sample_1_vector_": [1, 1, 1]},
        {"_id": "cluster-1", "sample_1_vector_": [1, 2, 2]},
    ]
)
calculate_centroids(self, method='mean')#

Calculates the centroids from the dataset vectors.
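A minimal usage sketch, assuming a ClusterOps instance has already been created against a clustered dataset (the dataset name, alias, and vector field below are placeholders):

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-25",
    vector_fields=["sample_vector_"]
)
centroids = cluster_ops.calculate_centroids(method="mean")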

create_centroids(self, insert: bool = True)#

Calculate centroids from your dataset vectors.

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-25",
    vector_fields=['sample_vector_']
)
centroids = cluster_ops.create_centroids()
get_centroid_documents(self)#
property centroids(self)#

Access the centroids of your dataset easily

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.centroids
get_centroid_from_id(self, cluster_id: str) Dict[str, Any]#

It takes a cluster id and returns the centroid with that id.

Parameters

cluster_id (str) – The id of the cluster to get the centroid for.

Return type

The centroid with the given id.
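A short usage sketch, assuming the cluster_ops instance from the examples above and a hypothetical cluster id "cluster-0":

centroid = cluster_ops.get_centroid_from_id("cluster-0")
# Returns a dictionary such as {"_id": "cluster-0", "sample_vector_": [...]}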

list_cluster_ids(self, alias: str = None, minimum_cluster_size: int = 0, num_clusters: int = 1000)#

List unique cluster IDs

Example

from relevanceai import Client
client = Client()
cluster_ops = client.ClusterOps(
    alias="kmeans_8", vector_fields=["sample_vector_]
)
cluster_ops.list_cluster_ids()
Parameters
  • alias (str) – The alias to use for clustering

  • minimum_cluster_size (int) – The minimum size of the clusters

  • num_clusters (int) – The number of clusters

list_closest(self, cluster_ids: Optional[list] = None, select_fields: Optional[List] = None, approx: int = 0, page_size: int = 1, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[list] = None, facets: Optional[list] = None, include_vector: bool = False, cluster_properties_filters: Optional[Dict] = None, include_count: bool = False, include_facets: bool = False, verbose: bool = False)#

List documents closest to the center.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector fields where a clustering task was run

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity metric, choose from ['cosine', 'l1', 'l2', 'dp']

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

  • cluster_properties_filter (dict) – Filter if clusters with certain characteristics should be hidden in results
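A minimal usage sketch, assuming a ClusterOps instance built as in the earlier examples; the cluster id and field name below are placeholders:

closest = cluster_ops.list_closest(
    cluster_ids=["cluster-0"],
    select_fields=["title"],
    page_size=3
)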

list_furthest(self, cluster_ids: Optional[List] = None, centroid_vector_fields: Optional[List] = None, select_fields: Optional[List] = None, approx: int = 0, sum_fields: bool = True, page_size: int = 3, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[List] = None, min_score: int = 0, include_vector: bool = False, include_count: bool = True, cluster_properties_filter: Optional[Dict] = {})#

List documents furthest from the center.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results
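A short usage sketch mirroring list_closest, assuming the same cluster_ops instance; the cluster id and field name are placeholders:

furthest = cluster_ops.list_furthest(
    cluster_ids=["cluster-0"],
    select_fields=["title"],
    page_size=3
)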

store_operation_metadatas(self)#
merge(self, target_cluster_id: str, cluster_ids: list)#

Merge clusters into the target cluster. The centroid is re-calculated and becomes the new cluster center.
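A minimal sketch, assuming the cluster ids below exist in the clustering results (they are placeholders):

cluster_ops.merge(
    target_cluster_id="cluster-0",
    cluster_ids=["cluster-1", "cluster-2"]
)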

create_byo_clusters(self)#

Create bring-your-own (BYO) clusters for a given field
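A minimal sketch, assuming the BYO field is supplied through the byo_cluster_field argument when constructing ClusterOps (the "category" field name is a placeholder):

cluster_ops = ds.ClusterOps(
    alias="byo-category",
    vector_fields=["sample_vector_"],
    byo_cluster_field="category"
)
cluster_ops.create_byo_clusters()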

property labels(self)#
create_parent_cluster(self, to_merge: dict, new_cluster_field: str)#

to_merge should look similar to the example below:

to_merge = {
    0: [
        'cluster_1',
        'cluster_2'
    ]
}
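A short call sketch using the to_merge mapping above; the new_cluster_field value is a hypothetical field name:

cluster_ops.create_parent_cluster(
    to_merge=to_merge,
    new_cluster_field="_cluster_parent_"
)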
explain_text_clusters(self, text_field, encode_fn_or_model, n_closest: int = 5, highlight_output_field='_explain_', algorithm: str = 'relational', model_kwargs: Optional[dict] = None)#

It takes a text field and a function that encodes the text field into a vector, then returns the top n closest documents to each cluster centroid.

Example

# Dummy encoder: returns a fixed vector regardless of input;
# replace with a real text-to-vector model or function.
def encode(X):
    return [1, 2, 1]
cluster_ops.explain_text_clusters(text_field="hey", encode_fn_or_model=encode)
Parameters
  • text_field – The field in the dataset that contains the text to be explained.

  • encode_fn_or_model – The function or model that will be used to encode the text.

  • n_closest (int, optional) – The number of closest documents to each cluster to return.

  • highlight_output_field (optional) – The name of the field that will be added to the output dataset.

  • algorithm (str) – Algorithm is either “centroid” or “relational”

Return type

A new dataset with the same data as the original dataset, but with a new field called _explain_

aggregate(self, metrics: Optional[list] = None, sort: Optional[list] = None, groupby: Optional[list] = None, filters: Optional[list] = None, page_size: int = 20, page: int = 1, asc: bool = False, flatten: bool = True)#

Takes an aggregation query and gets the aggregate of each cluster in a collection. This helps you interpret each cluster and what is in them. It can only be used after a vector field has been clustered.

Aggregation/Groupby of a collection using an aggregation query. The aggregation query is a JSON body that follows this schema:

{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"}
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}

For example, one can use the following aggregation to group scores by region and player name.

{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
“groupby” specifies the fields you want to split the data into. These are the available groupby types:
  • category : groupby a field that is a category

  • numeric: groupby a field that is a numeric

“metrics” specifies the fields and metrics you want to calculate in each of those groups. Every aggregation includes a frequency metric. These are the available metric types:
  • “avg”, “max”, “min”, “sum”, “cardinality”

The response is returned in descending order.

If you want to return documents, specify a “group_size” parameter, and a “select_fields” parameter if you want to limit the specific fields returned (see the sketch below).

For array-aggregations, you can add “agg”: “array” into the aggregation query.
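A hedged sketch of how such a query might look; the placement of "group_size" and "select_fields" inside the groupby entry, and the "agg": "array" entry, are assumptions based on the description above rather than a verified schema:

{
    "groupby" : [
        {
            "name": "region",
            "field": "player_region",
            "agg": "category",
            "group_size": 3,
            "select_fields": ["player_region", "final_score"]
        },
        {"name": "tags", "field": "game_tags", "agg": "array"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"}
    ]
}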

Parameters
  • dataset_id (string) – Unique name of dataset

  • metrics (list) – Fields and metrics you want to calculate

  • groupby (list) – Fields you want to split the data into

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias} vector

Example
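A minimal usage sketch, assuming the cluster_ops instance from the earlier examples and hypothetical field names ("final_score", "player_region"):

cluster_ops.aggregate(
    metrics=[{"name": "average_score", "field": "final_score", "agg": "avg"}],
    groupby=[{"name": "region", "field": "player_region", "agg": "category"}],
    page_size=20
)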