relevanceai.operations.cluster.cluster
#
Module Contents#
- class relevanceai.operations.cluster.cluster.ClusterWriteOps(credentials: relevanceai.client.helpers.Credentials, model: Any = None, alias: str = None, n_clusters: Optional[int] = None, cluster_config: Optional[Dict[str, Any]] = None, outlier_value: int = -1, outlier_label: str = 'outlier', verbose: bool = True, vector_fields: Optional[list] = None, **kwargs)#
You can load ClusterOps instances in 2 ways.
```python
# State the vector fields and alias in the ClusterOps object
cluster_ops = client.ClusterOps(
    alias="kmeans-25",
    dataset_id="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
cluster_ops.list_closest()

# State the vector fields and alias in the operational call
cluster_ops = client.ClusterOps(alias="kmeans-25")
cluster_ops.list_closest(
    dataset="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
```
- dataset_id :str#
- cluster_field :str#
- fit_predict_update(self, *args, **kwargs)#
- run(self, dataset_id: str, vector_fields: Optional[List[str]] = None, filters: Optional[list] = None, show_progress_bar: bool = True, verbose: bool = True, include_cluster_report: bool = True, report_name: str = 'cluster-report') → None#
Run clustering on a dataset
- Parameters
dataset_id (Optional[Union[str, Any]]) – The dataset ID
vector_fields (Optional[List[str]]) – List of vector fields
show_progress_bar (bool) – Whether to show the progress bar
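Based on the signature above, a run call can be sketched as follows; this is an illustrative sketch, and the dataset and field names are assumed placeholders.

```python
# Hedged sketch of a ClusterOps.run call; dataset and field names are
# illustrative, not real.
run_params = {
    "dataset_id": "sample_dataset_id",
    "vector_fields": ["sample_vector_"],
    "filters": [],              # optional: restrict which documents are clustered
    "show_progress_bar": True,
}

# With a connected client (not runnable offline):
# cluster_ops = client.ClusterOps(alias="kmeans-25", n_clusters=25)
# cluster_ops.run(**run_params)
```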
- cluster_report(self, X, cluster_labels, centroids)#
- property centroids(self)#
Access the centroids of your dataset easily
```python
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.centroids
```
- create_centroids(self)#
Calculate centroids from your vectors
Example
```python
from relevanceai import Client

client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-25",
    vector_fields=["sample_vector_"]
)
centroids = cluster_ops.create_centroids()
```
- insert_centroids(self, centroid_documents)#
Insert your own centroids
Example
```python
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    vector_fields=["sample_vector_"],
    alias="simple"
)
cluster_ops.insert_centroids(
    [
        {
            "_id": "cluster-1",
            "sample_vector_": [1, 1, 1]
        }
    ]
)
```
- class relevanceai.operations.cluster.cluster.ClusterOps(credentials: relevanceai.client.helpers.Credentials, model: Any = None, alias: str = None, n_clusters: Optional[int] = None, cluster_config: Optional[Dict[str, Any]] = None, outlier_value: int = -1, outlier_label: str = 'outlier', verbose: bool = True, vector_fields: Optional[list] = None, **kwargs)#
You can load ClusterOps instances in 2 ways.
```python
# State the vector fields and alias in the ClusterOps object
cluster_ops = client.ClusterOps(
    alias="kmeans-25",
    dataset_id="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
cluster_ops.list_closest()

# State the vector fields and alias in the operational call
cluster_ops = client.ClusterOps(alias="kmeans-25")
cluster_ops.list_closest(
    dataset="sample_dataset_id",
    vector_fields=["sample_vector_"]
)
```
- list_closest#
- list_furthest#
- aggregate(self, vector_fields: List[str] = None, metrics: Optional[list] = None, sort: Optional[list] = None, groupby: Optional[list] = None, filters: Optional[list] = None, page_size: int = 20, page: int = 1, asc: bool = False, flatten: bool = True, dataset=None)#
Takes an aggregation query and gets the aggregate of each cluster in a collection. This helps you interpret each cluster and what it contains. It can only be used after a vector field has been clustered.
Aggregation/Groupby of a collection using an aggregation query. The aggregation query is a json body that follows the schema of:
```
{
    "groupby": [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics": [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}
```
For example, one can use the following aggregation to group scores by region and player name:

```
{
    "groupby": [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics": [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
```
- “groupby” lists the fields you want to split the data by. These are the available groupby types:
category: group by a field that holds categorical values
numeric: group by a field that holds numeric values
- “metrics” lists the fields and metrics you want to calculate within each group; every aggregation automatically includes a frequency metric. These are the available metric types:
“avg”, “max”, “min”, “sum”, “cardinality”
The response is returned in descending order.
- If you want to return documents, specify a “group_size” parameter; if you want to limit the specific fields returned, also specify a “select_fields” parameter.
For array aggregations, you can add “agg”: “array” into the aggregation query.
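A grouped query that also returns documents might be sketched as below; where exactly “group_size” and “select_fields” sit in the query body is an assumption inferred from the description above, and the field names are placeholders.

```python
# Hedged sketch of an aggregation query that returns documents per group.
# The placement of "group_size" and "select_fields" is inferred, not confirmed.
agg_query = {
    "groupby": [
        {
            "name": "region",
            "field": "player_region",
            "agg": "category",
            "group_size": 3,  # return up to 3 documents per group
            "select_fields": ["name", "final_score"],  # limit returned fields
        }
    ],
    "metrics": [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        # Array aggregation example:
        # {"name": "tags", "field": "tag_list", "agg": "array"}
    ],
}
```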
- Parameters
dataset_id (string) – Unique name of dataset
metrics (list) – Fields and metrics you want to calculate
groupby (list) – Fields you want to split the data into
filters (list) – Query for filtering the search results
page_size (int) – Size of each page of results
page (int) – Page of the results
asc (bool) – Whether to sort results by ascending or descending order
flatten (bool) – Whether to flatten the aggregation results
alias (string) – Alias used to name a vector field. Belongs in field_{alias} vector
Example
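Following the signature above, an aggregate call might look like this sketch; it assumes a connected client, and all dataset and field names are illustrative.

```python
# Hedged sketch: per-cluster aggregation. Field and dataset names are
# illustrative placeholders.
metrics = [{"name": "average_score", "field": "final_score", "agg": "avg"}]
groupby = [{"name": "region", "field": "player_region", "agg": "category"}]

# With a connected client (not runnable offline):
# cluster_ops = client.ClusterOps(alias="kmeans-25")
# results = cluster_ops.aggregate(
#     vector_fields=["sample_vector_"],
#     metrics=metrics,
#     groupby=groupby,
#     dataset="sample_dataset_id",
# )
```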
- merge(self, cluster_labels: List, alias: Optional[str] = None, show_progress_bar: bool = True, **update_kwargs)#
- Parameters
cluster_labels (List[int]) – a list of integers representing the cluster IDs you would like to merge
alias (str) – the alias of the clustering within which you would like to merge labels
show_progress_bar (bool) – whether or not to show the progress bar
Example
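A merge call might look like the following sketch; the labels and alias are assumptions for illustration.

```python
# Hedged sketch: merge clusters 0 and 1 into a single cluster.
cluster_labels = [0, 1]  # integer labels of the clusters to merge

# With a connected client (not runnable offline):
# cluster_ops.merge(cluster_labels=cluster_labels, alias="kmeans-25")
```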
- closest(self, dataset_id: Optional[str] = None, vector_field: Optional[str] = None, alias: Optional[str] = None, cluster_ids: Optional[List] = None, centroid_vector_fields: Optional[List] = None, select_fields: Optional[List] = None, approx: int = 0, sum_fields: bool = True, page_size: int = 3, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[List] = None, min_score: int = 0, include_vector: bool = False, include_count: bool = True, cluster_properties_filter: Optional[Dict] = {}, verbose: bool = True)#
List the documents closest to the cluster center.
- Parameters
dataset_id (string) – Unique name of dataset
vector_field (str) – The vector field where a clustering task was run.
cluster_ids (list) – Any of the cluster ids
alias (string) – Alias is used to name a cluster
centroid_vector_fields (list) – Vector fields stored
select_fields (list) – Fields to include in the search results, empty array/list means all fields
approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate
sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate
page_size (int) – Size of each page of results
page (int) – Page of the results
similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]
filters (list) – Query for filtering the search results
facets (list) – Fields to include in the facets, if [] then all
min_score (int) – Minimum score for similarity metric
include_vector (bool) – Include vectors in the search results
include_count (bool) – Include the total count of results in the search results
include_facets (bool) – Include facets in the search results
cluster_properties_filter (dict) – Filter to hide clusters with certain characteristics from the results
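A closest call might be sketched as below; the filter shape follows the common {"field", "filter_type", "condition", "condition_value"} convention, which, like the dataset and field names, is an assumption here.

```python
# Hedged sketch: documents closest to each centroid, restricted by a filter.
# The filter dictionary shape is assumed, not confirmed by this reference.
filters = [
    {
        "field": "player_region",
        "filter_type": "exact_match",
        "condition": "==",
        "condition_value": "EU",
    }
]

# With a connected client (not runnable offline):
# closest_docs = cluster_ops.closest(
#     dataset_id="sample_dataset_id",
#     vector_field="sample_vector_",
#     alias="kmeans-25",
#     page_size=3,
#     filters=filters,
# )
```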
- furthest(self, dataset_id: Optional[str] = None, vector_field: Optional[str] = None, alias: Optional[str] = None, cluster_ids: Optional[List] = None, centroid_vector_fields: Optional[List] = None, select_fields: Optional[List] = None, approx: int = 0, sum_fields: bool = True, page_size: int = 3, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[List] = None, min_score: int = 0, include_vector: bool = False, include_count: bool = True, cluster_properties_filter: Optional[Dict] = {})#
List the documents furthest from the cluster center.
- Parameters
dataset_id (string) – Unique name of dataset
vector_field (str) – The vector field where a clustering task was run.
cluster_ids (list) – Any of the cluster ids
alias (string) – Alias is used to name a cluster
select_fields (list) – Fields to include in the search results, empty array/list means all fields
approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate
sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate
page_size (int) – Size of each page of results
page (int) – Page of the results
similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]
filters (list) – Query for filtering the search results
facets (list) – Fields to include in the facets, if [] then all
min_score (int) – Minimum score for similarity metric
include_vector (bool) – Include vectors in the search results
include_count (bool) – Include the total count of results in the search results
include_facets (bool) – Include facets in the search results
- static get_cluster_summary(summarizer, docs: Dict, summarize_fields: List[str], max_length: int = 100, first_sentence_only: bool = True)#
- summarize_closest(self, summarize_fields: List[str], dataset_id: Optional[str] = None, vector_field: Optional[str] = None, alias: Optional[str] = None, cluster_ids: Optional[List] = None, centroid_vector_fields: Optional[List] = None, approx: int = 0, sum_fields: bool = True, page_size: int = 3, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[List] = None, min_score: int = 0, include_vector: bool = False, include_count: bool = True, cluster_properties_filter: Optional[Dict] = {}, model_name: str = 'philschmid/bart-large-cnn-samsum', tokenizer: Optional[str] = None, max_length: int = 100, deployable_id: Optional[str] = None, first_sentence_only: bool = True, **kwargs)#
List the documents closest to the cluster center.
- Parameters
summarize_fields (list) – Fields to perform summarization, empty array/list means all fields
dataset_id (string) – Unique name of dataset
vector_field (str) – The vector field where a clustering task was run.
cluster_ids (list) – Any of the cluster ids
alias (string) – Alias is used to name a cluster
centroid_vector_fields (list) – Vector fields stored
approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate
sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate
page_size (int) – Size of each page of results
page (int) – Page of the results
similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]
filters (list) – Query for filtering the search results
facets (list) – Fields to include in the facets, if [] then all
min_score (int) – Minimum score for similarity metric
include_vector (bool) – Include vectors in the search results
include_count (bool) – Include the total count of results in the search results
include_facets (bool) – Include facets in the search results
cluster_properties_filter (dict) – Filter to hide clusters with certain characteristics from the results
model_name (str) – Hugging Face model to use for summarization. Pick from https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
tokenizer (str) – Tokenizer to use for summarization; allows you to bring your own tokenizer, otherwise a pre-trained tokenizer is instantiated from the selected model
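A summarize_closest call might be sketched as below; the model name mirrors the documented default, while the dataset and field names are illustrative assumptions.

```python
# Hedged sketch: summarize the closest documents per cluster.
# "sample_text_field" and the dataset/alias names are placeholders.
summarize_fields = ["sample_text_field"]
model_name = "philschmid/bart-large-cnn-samsum"  # documented default

# With a connected client (not runnable offline):
# summaries = cluster_ops.summarize_closest(
#     summarize_fields=summarize_fields,
#     dataset_id="sample_dataset_id",
#     vector_field="sample_vector_",
#     alias="kmeans-25",
#     model_name=model_name,
#     max_length=100,
# )
```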
- summarize_furthest(self, summarize_fields: List[str], dataset_id: Optional[str] = None, vector_field: Optional[str] = None, alias: Optional[str] = None, cluster_ids: Optional[List] = None, centroid_vector_fields: Optional[List] = None, approx: int = 0, sum_fields: bool = True, page_size: int = 3, page: int = 1, similarity_metric: str = 'cosine', filters: Optional[List] = None, min_score: int = 0, include_vector: bool = False, include_count: bool = True, cluster_properties_filter: Optional[Dict] = {}, model_name: str = 'sshleifer/distilbart-cnn-6-6', tokenizer: Optional[str] = None, **kwargs)#
List the documents furthest from the cluster center.
- Parameters
summarize_fields (list) – Fields to perform summarization, empty array/list means all fields
dataset_id (string) – Unique name of dataset
vector_field (str) – The vector field where a clustering task was run.
cluster_ids (list) – Any of the cluster ids
alias (string) – Alias is used to name a cluster
centroid_vector_fields (list) – Vector fields stored
approx (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate
sum_fields (bool) – Whether to sum the similarity scores of multiple vectors into one score or keep them separate
page_size (int) – Size of each page of results
page (int) – Page of the results
similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]
filters (list) – Query for filtering the search results
facets (list) – Fields to include in the facets, if [] then all
min_score (int) – Minimum score for similarity metric
include_vector (bool) – Include vectors in the search results
include_count (bool) – Include the total count of results in the search results
include_facets (bool) – Include facets in the search results
cluster_properties_filter (dict) – Filter to hide clusters with certain characteristics from the results
model_name (str) – Hugging Face model to use for summarization. Pick from https://huggingface.co/models?pipeline_tag=summarization&sort=downloads
tokenizer (str) – Tokenizer to use for summarization; allows you to bring your own tokenizer, otherwise a pre-trained tokenizer is instantiated from the selected model