Cluster#

Basic#

The easiest way to cluster a dataset is to use the cluster method from a Dataset object (an example is shown below).

from relevanceai import Client
client = Client()

from relevanceai import mock_documents
docs = mock_documents()
ds = client.Dataset("sample")
ds.upsert_documents(docs)

cluster_ops = ds.cluster(
    vector_fields=["sample_1_vector_"],
    model="kmeans"
)

Native Scikit-learn Integration#

You can cluster with any model that has a fit_predict method, including native Scikit-learn models.

from sklearn.cluster import KMeans
model = KMeans(n_clusters=100)
cluster_ops = ds.cluster(
    vector_fields=["sample_1_vector_"],
    model=model,
    alias="native-sklearn"  # alias can be anything you want
)

Once clustered, you can access all of the useful Scikit-learn integrations.

# List the documents closest to each cluster centroid
cluster_ops.list_closest()

# Launch a cluster app
ds.launch_cluster_app()

Reloading ClusterOps#

Often you will have already clustered a dataset and simply want to reload your ClusterOps object without re-fitting the model. You can do that in two ways.

# State the vector fields and alias in the ClusterOps object
ds = client.Dataset("sample_dataset_id")
cluster_ops = ds.ClusterOps(
    alias="kmeans-16",
    vector_fields=["sample_vector_"]
)

cluster_ops.list_closest()

# State the vector fields and alias in the operational call
cluster_ops = client.ClusterOps(alias="kmeans-16")
cluster_ops.list_closest(
    dataset="sample_dataset_id",
    vector_fields=["documentation_vector_"]
)

API Reference#

class relevanceai.operations.cluster.cluster.ClusterOps#
aggregate(vector_fields=None, metrics=None, sort=None, groupby=None, filters=None, page_size=20, page=1, asc=False, flatten=True, dataset=None)#

Takes an aggregation query and computes the aggregate of each cluster in a collection. This helps you interpret each cluster and what it contains. It can only be used after a vector field has been clustered.

The aggregation query is a JSON body that follows this schema:

{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}

For example, one can use the following aggregations to group scores by region and player name.

{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
“groupby” lists the fields you want to split the data by. These are the available groupby types:
  • category: group by a categorical field

  • numeric: group by a numeric field

“metrics” lists the fields and metrics you want to calculate within each group; every aggregation also includes a frequency metric. These are the available metric types:
  • “avg”, “max”, “min”, “sum”, “cardinality”

The response is returned in descending order.

If you want to return documents, specify a “group_size” parameter, and a “select_fields” parameter if you want to limit the specific fields returned.

For array-aggregations, you can add “agg”: “array” into the aggregation query.

Parameters
  • dataset_id (string) – Unique name of dataset

  • metrics (list) – Fields and metrics you want to calculate

  • groupby (list) – Fields you want to split the data into

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias} vector


Example
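As a sketch, the aggregation body from the schema above can be built as plain Python dicts. The field names here (player_region, final_score) are illustrative only, and the commented-out call assumes a fitted cluster_ops on a live dataset:

```python
# Build the aggregation body described above (field names are illustrative).
aggregation_query = {
    "groupby": [
        {"name": "region", "field": "player_region", "agg": "category"},
    ],
    "metrics": [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
    ],
}

# Hypothetical call sketch -- requires a live Relevance AI dataset and a
# fitted `cluster_ops`; shown for illustration only:
# cluster_ops.aggregate(
#     vector_fields=["sample_1_vector_"],
#     groupby=aggregation_query["groupby"],
#     metrics=aggregation_query["metrics"],
# )
```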

closest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={}, verbose=True)#

List the documents closest to the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

  • cluster_properties_filter (dict) – Filter if clusters with certain characteristics should be hidden in results
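Closeness here is measured with the chosen similarity_metric, by default cosine similarity between each document vector and the cluster centroid. A minimal local sketch of the idea (hypothetical helpers, not the library's implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_to_center(vectors, centroid, page_size=3):
    """Indices of the page_size vectors most similar to the centroid."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return order[:page_size]

idx = closest_to_center([[1, 0], [0, 1], [0.9, 0.1]], centroid=[1, 0], page_size=2)
# idx == [0, 2]
```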

furthest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={})#

List the documents furthest from the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

list_closest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={}, verbose=True)#

List the documents closest to the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

  • cluster_properties_filter (dict) – Filter if clusters with certain characteristics should be hidden in results

list_furthest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={})#

List the documents furthest from the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

merge(cluster_labels, alias=None, show_progress_bar=True, **update_kwargs)#
Parameters
  • cluster_labels (Tuple[int]) – A tuple of integers representing the cluster ids you would like to merge

  • alias (str) – The alias of the clustering within which you would like to merge labels

  • show_progress_bar (bool) – Whether or not to show the progress bar

Example
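merge collapses the listed cluster ids into a single cluster. As a rough local sketch of the relabeling it performs (hypothetical helper, not the library's implementation):

```python
def merge_labels(labels, cluster_labels):
    """Relabel so that every id in cluster_labels maps to the first id."""
    target = cluster_labels[0]
    merge_set = set(cluster_labels)
    return [target if label in merge_set else label for label in labels]

merged = merge_labels([0, 1, 2, 1, 3], cluster_labels=(1, 2))
# merged == [0, 1, 1, 1, 3]: clusters 1 and 2 collapse into cluster 1
```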

summarize_closest(summarize_fields, dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={}, model_name='philschmid/bart-large-cnn-samsum', tokenizer=None, max_length=100, deployable_id=None, first_sentence_only=True, **kwargs)#

List the documents closest to the cluster centers.

summarize_fields: list

Fields to perform summarization, empty array/list means all fields

dataset_id: string

Unique name of dataset

vector_field: list

The vector field where a clustering task was run.

cluster_ids: list

Any of the cluster ids

alias: string

Alias is used to name a cluster

centroid_vector_fields: list

Vector fields stored

approx: int

Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

sum_fields: bool

Whether to sum multiple vector similarity scores into one score or keep them separate

page_size: int

Size of each page of results

page: int

Page of the results

similarity_metric: string

Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

filters: list

Query for filtering the search results

facets: list

Fields to include in the facets, if [] then all

min_score: int

Minimum score for similarity metric

include_vectors: bool

Include vectors in the search results

include_count: bool

Include the total count of results in the search results

include_facets: bool

Include facets in the search results

cluster_properties_filter: dict

Filter if clusters with certain characteristics should be hidden in results

model_name: str

Huggingface model to use for summarization. Pick from https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

tokenizer: str

Tokenizer to use for summarization; allows you to bring your own tokenizer, otherwise a pre-trained tokenizer is instantiated from the selected model

Warning

This function is currently in beta and is liable to change in the future. We recommend not using this in production systems.

summarize_furthest(summarize_fields, dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={}, model_name='sshleifer/distilbart-cnn-6-6', tokenizer=None, **kwargs)#

List the documents furthest from the cluster centers.

summarize_fields: list

Fields to perform summarization, empty array/list means all fields

dataset_id: string

Unique name of dataset

vector_field: list

The vector field where a clustering task was run.

cluster_ids: list

Any of the cluster ids

alias: string

Alias is used to name a cluster

centroid_vector_fields: list

Vector fields stored

approx: int

Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

sum_fields: bool

Whether to sum multiple vector similarity scores into one score or keep them separate

page_size: int

Size of each page of results

page: int

Page of the results

similarity_metric: string

Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

filters: list

Query for filtering the search results

facets: list

Fields to include in the facets, if [] then all

min_score: int

Minimum score for similarity metric

include_vectors: bool

Include vectors in the search results

include_count: bool

Include the total count of results in the search results

include_facets: bool

Include facets in the search results

cluster_properties_filter: dict

Filter if clusters with certain characteristics should be hidden in results

model_name: str

Huggingface model to use for summarization. Pick from https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

tokenizer: str

Tokenizer to use for summarization; allows you to bring your own tokenizer, otherwise a pre-trained tokenizer is instantiated from the selected model

Warning

This function is currently in beta and is liable to change in the future. We recommend not using this in production systems.

class relevanceai.operations.cluster.cluster.ClusterWriteOps#

You can load ClusterOps instances in two ways.

# State the vector fields and alias in the ClusterOps object
cluster_ops = client.ClusterOps(
    alias="kmeans-25",
    dataset_id="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
cluster_ops.list_closest()

# State the vector fields and alias in the operational call
cluster_ops = client.ClusterOps(alias="kmeans-25")
cluster_ops.list_closest(
    dataset="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
property centroids#

Access the centroids of your dataset easily

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.centroids
create_centroids()#

Calculate centroids from your vectors

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-25",
    vector_fields=['sample_vector_']
)
centroids = cluster_ops.create_centroids()
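Conceptually, each centroid is just the mean of the vectors assigned to that cluster. A minimal local sketch of that calculation (hypothetical helper, not the library's implementation):

```python
from statistics import fmean

def compute_centroids(vectors, labels):
    """Mean vector per cluster label."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    return {
        label: [fmean(dim) for dim in zip(*vecs)]  # average each dimension
        for label, vecs in clusters.items()
    }

centroids = compute_centroids(
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]],
    [0, 0, 1],
)
# centroids == {0: [2.0, 0.0], 1: [0.0, 2.0]}
```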
fit_predict_update(*args, **kwargs)#

Note

This function has been deprecated as of 1.0.0

insert_centroids(centroid_documents)#

Insert your own centroids

Example

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.insert_centroids(
    [
        {
            "_id": "cluster-1",
            "sample_vector_": [1, 1, 1]
        }
    ]
)
run(dataset_id, vector_fields=None, filters=None, show_progress_bar=True, verbose=True, include_cluster_report=True, report_name='cluster-report')#

Run clustering on a dataset

Parameters
  • dataset_id (Optional[Union[str, Any]]) – The dataset ID

  • vector_fields (Optional[List[str]]) – List of vector fields

  • show_progress_bar (bool) – Whether to show the progress bar

Return type

None

ClusterOps.aggregate(vector_fields=None, metrics=None, sort=None, groupby=None, filters=None, page_size=20, page=1, asc=False, flatten=True, dataset=None)#

Takes an aggregation query and computes the aggregate of each cluster in a collection. This helps you interpret each cluster and what it contains. It can only be used after a vector field has been clustered.

The aggregation query is a JSON body that follows this schema:

{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}

For example, one can use the following aggregations to group scores by region and player name.

{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
“groupby” lists the fields you want to split the data by. These are the available groupby types:
  • category: group by a categorical field

  • numeric: group by a numeric field

“metrics” lists the fields and metrics you want to calculate within each group; every aggregation also includes a frequency metric. These are the available metric types:
  • “avg”, “max”, “min”, “sum”, “cardinality”

The response is returned in descending order.

If you want to return documents, specify a “group_size” parameter, and a “select_fields” parameter if you want to limit the specific fields returned.

For array-aggregations, you can add “agg”: “array” into the aggregation query.

Parameters
  • dataset_id (string) – Unique name of dataset

  • metrics (list) – Fields and metrics you want to calculate

  • groupby (list) – Fields you want to split the data into

  • filters (list) – Query for filtering the search results

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • asc (bool) – Whether to sort results by ascending or descending order

  • flatten (bool) – Whether to flatten

  • alias (string) – Alias used to name a vector field. Belongs in field_{alias} vector


Example

ClusterOps.list_closest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={}, verbose=True)#

List the documents closest to the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_field (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • centroid_vector_fields (list) – Vector fields stored

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

  • cluster_properties_filter (dict) – Filter if clusters with certain characteristics should be hidden in results

ClusterOps.list_furthest(dataset_id=None, vector_field=None, alias=None, cluster_ids=None, centroid_vector_fields=None, select_fields=None, approx=0, sum_fields=True, page_size=3, page=1, similarity_metric='cosine', filters=None, min_score=0, include_vector=False, include_count=True, cluster_properties_filter={})#

List the documents furthest from the cluster centers.

Parameters
  • dataset_id (string) – Unique name of dataset

  • vector_fields (list) – The vector field where a clustering task was run.

  • cluster_ids (list) – Any of the cluster ids

  • alias (string) – Alias is used to name a cluster

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields

  • approx (int) – Used for approximate search to speed up search. The higher the number, faster the search but potentially less accurate

  • sum_fields (bool) – Whether to sum multiple vector similarity scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • filters (list) – Query for filtering the search results

  • facets (list) – Fields to include in the facets, if [] then all

  • min_score (int) – Minimum score for similarity metric

  • include_vectors (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • include_facets (bool) – Include facets in the search results

property ClusterOps.centroids#

Access the centroids of your dataset easily

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.centroids
ClusterOps.insert_centroids(centroid_documents)#

Insert your own centroids

Example

ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.insert_centroids(
    [
        {
            "_id": "cluster-1",
            "sample_vector_": [1, 1, 1]
        }
    ]
)
ClusterOps.create_centroids()#

Calculate centroids from your vectors

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-25",
    vector_fields=['sample_vector_']
)
centroids = cluster_ops.create_centroids()
ClusterOps.merge(cluster_labels, alias=None, show_progress_bar=True, **update_kwargs)#
Parameters
  • cluster_labels (Tuple[int]) – A tuple of integers representing the cluster ids you would like to merge

  • alias (str) – The alias of the clustering within which you would like to merge labels

  • show_progress_bar (bool) – Whether or not to show the progress bar

Example