relevanceai.operations.operations#

Module Contents#

class relevanceai.operations.operations.Operations(credentials: relevanceai.client.helpers.Credentials, dataset_id: str, **kwargs)#

A Pandas-like dataset API for interacting with the RelevanceAI Python package

dimensionality_reduction#
cluster(self, model: Any = None, vector_fields: Optional[List[str]] = None, alias: Optional[str] = None, filters: Optional[list] = None, include_cluster_report: bool = True, **kwargs)#

Run clustering on your dataset.

Example

from sklearn.cluster import KMeans
model = KMeans()

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.cluster(
    model=model, vector_fields=["sample_vector_"],
    alias="kmeans-8"
)
Parameters
  • model (Union[str, Any]) – The clustering model to use, e.g. a scikit-learn KMeans instance or a supported model name string

  • vector_fields (List[str]) – A list of vector fields to run clustering on

  • alias (str) – The alias to be used to store your model

  • cluster_config (dict) – The cluster config to use. You can change the number of clusters for kmeans using cluster_config={"n_clusters": 10}, as sketched below. For a full list of possible parameters for different models, simply check how the cluster models are instantiated.
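
For instance, a minimal sketch of overriding the number of clusters via cluster_config (assuming the same "sample" dataset and "sample_vector_" field as above, and that "kmeans" is an accepted model name string):

ds.cluster(
    model="kmeans",
    vector_fields=["sample_vector_"],
    alias="kmeans-10",
    cluster_config={"n_clusters": 10},
)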

reduce_dims(self, alias: str, vector_fields: List[str], model: Any = 'pca', n_components: int = 3, filters: Optional[list] = None, **kwargs)#

Reduce dimensions

Parameters
  • model (Callable) – The model used to reduce dimensions, e.g. "pca"

  • n_components (int) – The number of components

  • alias (str) – The alias of the model

  • vector_fields (List[str]) – The list of vector fields to reduce

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
ds.reduce_dims(
    alias="sample",
    vector_fields=["sample_1_vector_"],
    model="pca"
)

vectorize(self, fields: List[str] = None, filters: Optional[List] = None, **kwargs)#

Vectorize the specified fields of the dataset

Parameters
  • fields (List[str]) – A list of fields to vectorize

  • encoders (Dict[str, List[Any]]) – A dictionary that creates a mapping between your unstructured fields and a list of encoders to run over those unstructured fields

Returns

If the vectorization process is successful, this dict contains the added vector names. Else, the dict is the request result containing error information.

Return type

dict

Example

from relevanceai import Client

client = Client()

dataset_id = "sample_dataset_id"
ds = client.Dataset(dataset_id)

ds.vectorize(
    fields=["text_field_1", "text_field_2"],
    encoders={
        "text": ["mpnet", "use"]
    }
)

# This operation will create 4 new vector fields
#
# text_field_1_mpnet_vector_, text_field_2_mpnet_vector_
# text_field_1_use_vector_, text_field_2_use_vector_
advanced_vectorize(self, vectorizers: List[relevanceai.operations.vector.vectorizer.Vectorizer])#

Advanced vectorization. Pass a list of Vectorizer objects, each specifying the field to vectorize, the model to use, and an alias.

Example

# When first vectorizing
from relevanceai.operations import Vectorizer

# `model` is an encoder instance defined elsewhere
vectorizer = Vectorizer(field="field_1", model=model, alias="value")
ds.advanced_vectorize(
    [vectorizer],
)
Parameters

vectorizers (List[Vectorizer]) – A list of Vectorizer objects describing how each field should be vectorized

Allows you to leverage vector similarity search to create a semantic search engine. Powerful features of Relevance vector search:

1. Multivector search that allows you to search with multiple vectors and give each vector a different weight, e.g. search with a product image vector and a text description vector to find the most similar products by what they look like and what they are described to do. You can also give weightings of each vector field towards the search, e.g. image_vector_ weighted 100% whilst description_vector_ weighted 50%.

An example of a simple multivector query:

>>> [
>>>     {"vector": [0.12, 0.23, 0.34], "fields": ["name_vector_"], "alias":"text"},
>>>     {"vector": [0.45, 0.56, 0.67], "fields": ["image_vector_"], "alias":"image"},
>>> ]

An example of a weighted multivector query:

>>> [
>>>     {"vector": [0.12, 0.23, 0.34], "fields": {"name_vector_":0.6}, "alias":"text"},
>>>     {"vector": [0.45, 0.56, 0.67], "fields": {"image_vector_"0.4}, "alias":"image"},
>>> ]

An example of a weighted multivector query with multiple fields for each vector:

>>> [
>>>     {"vector": [0.12, 0.23, 0.34], "fields": {"name_vector_":0.6, "description_vector_":0.3}, "alias":"text"},
>>>     {"vector": [0.45, 0.56, 0.67], "fields": {"image_vector_"0.4}, "alias":"image"},
>>> ]
  2. Utilise faceted search with vector search. For information on how to apply facets/filters check out datasets.documents.get_where

  3. Sum Fields option to adjust whether you want multiple vectors to be combined in the scoring or compared in the scoring, e.g. image_vector_ + text_vector_ or image_vector_ vs text_vector_ (see the sketch below).

    When sum_fields=True:

    • Multi-vector search allows you to obtain search scores by taking the sum of these scores.

    • TextSearchScore + ImageSearchScore = SearchScore

    • We then rank by the new SearchScore, so for searching 1000 documents there will be 1000 search scores and results

    When sum_fields=False:

    • Multi vector search but not summing the score, instead including it in the comparison!

    • TextSearchScore = SearchScore1

    • ImageSearchScore = SearchScore2

    • We then rank by the 2 new SearchScores, so for searching 1000 documents there will be 2000 search scores and results.

  4. Personalization with positive and negative document ids.

    • For more information about the positive and negative document ids to personalize check out services.recommend.vector

For even more advanced configuration and customisation of vector search, reach out to us at dev@tryrelevance.com and learn about our new advanced_vector_search.
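
For instance, a minimal sketch of toggling sum_fields (hypothetical query values; ds is a Dataset instance as in the example further below):

# With sum_fields=False the per-vector scores are kept separate instead of
# being summed into a single ranking.
query = [
    {"vector": [0.12, 0.23, 0.34], "fields": ["name_vector_"], "alias": "text"},
    {"vector": [0.45, 0.56, 0.67], "fields": ["image_vector_"], "alias": "image"},
]
results = ds.vector_search(multivector_query=query, sum_fields=False)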

Parameters
  • multivector_query (list) – Query for advance search that allows for multiple vector and field querying.

  • positive_document_ids (dict) – Positive document IDs to personalize the results with; this will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with; this will retrieve the vectors from the document IDs and consider them in the operation.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from ['mean', 'sum', 'min', 'max', 'divide', 'multiply']

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into Relevance. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (string) – Search history ID, only used for storing search histories.

  • query (string) – What to store as the query name in the dashboard

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
MULTIVECTOR_QUERY = [{"vector": [0, 1, 2], "fields": ["sample_vector_"]}]
results = ds.vector_search(multivector_query=MULTIVECTOR_QUERY)

Combine the best of traditional keyword faceted search with semantic vector search to create the best possible search.

For information on how to use vector search check out services.search.vector.

For information on how to use traditional keyword faceted search check out services.search.traditional.

Parameters
  • multivector_query (list) – Query for advance search that allows for multiple vector and field querying.

  • text (string) – Text Search Query (not encoded as vector)

  • fields (list) – Text fields to search against

  • positive_document_ids (dict) – Positive document IDs to personalize the results with; this will retrieve the vectors from the document IDs and consider them in the operation.

  • negative_document_ids (dict) – Negative document IDs to personalize the results with; this will retrieve the vectors from the document IDs and consider them in the operation.

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate.

  • vector_operation (string) – Aggregation for the vectors when using positive and negative document IDs, choose from ['mean', 'sum', 'min', 'max', 'divide', 'multiply']

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (float) – Minimum score for similarity metric

  • select_fields (list) – Fields to include in the search results, empty array/list means all fields.

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into Relevance. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • search_history_id (string) – Search history ID, only used for storing search histories.

  • edit_distance (int) – The number of letter edits it takes to get from one string to another, e.g. "band" vs "bant" has an edit distance of 1. Use -1 if you would like this to be automated.

  • ignore_spaces (bool) – Whether to consider cases when there is a space in the word. E.g. Go Pro vs GoPro.

  • traditional_weight (float) – Multiplier of the traditional search score. A value between 0.025 and 0.075 is the ideal range

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
MULTIVECTOR_QUERY = [{"vector": [0, 1, 2], "fields": ["sample_vector_"]}]
results = ds.vector_search(multivector_query=MULTIVECTOR_QUERY)

Chunks are data that has been divided into different units, e.g. a paragraph is made of many sentence chunks, a sentence is made of many word chunks, or a video is made of many image-frame chunks. By searching through chunks you can pinpoint more specifically where a match is occurring. When creating a chunk in your document, use the suffixes "_chunk_" and "_chunkvector_". An example of a document with chunks:

>>> {
>>>     "_id" : "123",
>>>     "title" : "Lorem Ipsum Article",
>>>     "description" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
>>>     "description_vector_" : [1.1, 1.2, 1.3],
>>>     "description_sentence_chunk_" : [
>>>         {"sentence_id" : 0, "sentence_chunkvector_" : [0.1, 0.2, 0.3], "sentence" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry."},
>>>         {"sentence_id" : 1, "sentence_chunkvector_" : [0.4, 0.5, 0.6], "sentence" : "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."},
>>>         {"sentence_id" : 2, "sentence_chunkvector_" : [0.7, 0.8, 0.9], "sentence" : "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."},
>>>     ]
>>> }

For combining chunk search with other search check out services.search.advanced_chunk.

Parameters
  • multivector_query (list) – Query for advance search that allows for multiple vector and field querying.

  • chunk_field (string) – Field where the array of chunked documents are.

  • chunk_scoring (string) – Scoring method for determining for ranking between document chunks.

  • chunk_page_size (int) – Size of each page of chunk results

  • chunk_page (int) – Page of the chunk results

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate.

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into Relevance. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • query (string) – What to store as the query name in the dashboard

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
# A hypothetical placeholder query
MULTIVECTOR_QUERY = [{"vector": [0, 1, 2], "fields": ["sample_chunkvector_"]}]
results = ds.chunk_search(
    chunk_field="_chunk_",
    multivector_query=MULTIVECTOR_QUERY
)

Multistep chunk search involves a vector search followed by a chunk search, used to accelerate chunk searches or to identify context before delving into relevant chunks, e.g. search against the paragraph vector first, then against the sentence chunkvector.

For more information about chunk search check out datasets.search.chunk.

For more information about vector search check out services.search.vector

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
# Hypothetical placeholder queries
MULTIVECTOR_QUERY = [{"vector": [0, 1, 2], "fields": ["sample_chunkvector_"]}]
FIRST_STEP_MULTIVECTOR_QUERY = [{"vector": [0, 1, 2], "fields": ["sample_vector_"]}]
results = ds.search.multistep_chunk(
    chunk_field="_chunk_",
    multivector_query=MULTIVECTOR_QUERY,
    first_step_multivector_query=FIRST_STEP_MULTIVECTOR_QUERY
)
Parameters
  • multivector_query (list) – Query for advance search that allows for multiple vector and field querying.

  • chunk_field (string) – Field where the array of chunked documents are.

  • chunk_scoring (string) – Scoring method for determining for ranking between document chunks.

  • chunk_page_size (int) – Size of each page of chunk results

  • chunk_page (int) – Page of the chunk results

  • approximation_depth (int) – Used for approximate search to speed up search. The higher the number, the faster the search, but potentially less accurate.

  • sum_fields (bool) – Whether to sum the multiple vector similarity search scores into one score or keep them separate

  • page_size (int) – Size of each page of results

  • page (int) – Page of the results

  • similarity_metric (string) – Similarity Metric, choose from [‘cosine’, ‘l1’, ‘l2’, ‘dp’]

  • facets (list) – Fields to include in the facets, if [] then all

  • filters (list) – Query for filtering the search results

  • min_score (int) – Minimum score for similarity metric

  • include_vector (bool) – Include vectors in the search results

  • include_count (bool) – Include the total count of results in the search results

  • asc (bool) – Whether to sort results by ascending or descending order

  • keep_search_history (bool) – Whether to store the history into Relevance. This will increase the storage costs over time.

  • hundred_scale (bool) – Whether to scale up the metric by 100

  • first_step_multivector_query (list) – Query for advance search that allows for multiple vector and field querying.

  • first_step_page (int) – Page of the results

  • first_step_page_size (int) – Size of each page of results

  • query (string) – What to store as the query name in the dashboard

launch_cluster_app(self, configuration: dict = None)#

Launch an app with a given configuration

Example

ds.launch_cluster_app()
Parameters

configuration (dict) – The configuration can be found in the deployable once created.

subcluster(self, model, alias: str, vector_fields, parent_field, filters: Optional[list] = None, cluster_ids: Optional[list] = None, min_parent_cluster_size: Optional[int] = None, **kwargs)#
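
Example

A minimal usage sketch; parent_field below is a hypothetical field assumed to hold the cluster labels from a previous cluster() run stored under the "kmeans-8" alias.

from sklearn.cluster import KMeans

ds.subcluster(
    model=KMeans(n_clusters=3),
    alias="kmeans-3-sub",
    vector_fields=["sample_vector_"],
    parent_field="_cluster_.sample_vector_.kmeans-8",
)
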
analyze_sentiment(self, text_fields: list, model_name: str = 'siebert/sentiment-roberta-large-english', output_field: str = None, highlight: bool = False, positive_sentiment_name: str = 'positive', max_number_of_shap_documents: Optional[int] = None, min_abs_score: float = 0.1, **apply_args)#

Easily add sentiment to your dataset

Example

ds.analyze_sentiment(text_fields=["sample_1_label"])
Parameters
  • text_fields (list) – The text fields to add sentiment to

  • output_field (str) – Where to store the sentiment values

  • model_name (str) – The HuggingFace Model name.

  • log_to_file (bool) – If True, puts the logs in a file.

  • highlight (bool) – If True, this will include a SHAP explainer of what is causing positive and negative sentiment, as shown in the sketch below

  • max_number_of_shap_documents (int) – The maximum number of shap documents

  • min_abs_score (float) – The minimum absolute score for it to be considered important based on SHAP algorithm.
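
A minimal sketch of requesting SHAP-based highlights ("review_body" is a hypothetical text field):

ds.analyze_sentiment(
    text_fields=["review_body"],
    highlight=True,
    max_number_of_shap_documents=5,
    min_abs_score=0.1,
)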

question_answer(self, input_field: str, questions: Union[List[str], str], output_field: Optional[str] = None, model_name: str = 'mrm8488/deberta-v3-base-finetuned-squadv2', verbose: bool = True, log_to_file: bool = True, filters: Optional[list] = None)#

Question your dataset and retrieve answers from it.

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("ecommerce")
ds.question_answer(
    input_field="product_title",
    question="What brand shoes",
    output_field="_question_test"
)
Parameters
  • input_field (str) – The field to retrieve answers from

  • questions (Union[List[str], str]) – The question or questions to ask

  • output_field (str) – Where to store the answers

  • model_name (str) – The HuggingFace Model name.

  • verbose (bool) – If True, prints the workflow progress bar

  • log_to_file (bool) – If True, puts the logs in a file.

search(self, query: str = None, vector_search_query: Optional[dict] = None, fields_to_search: Optional[List] = None, select_fields: Optional[List] = None, include_vectors: bool = True, filters: Optional[List] = None, page: int = 0, page_size: int = 10, sort: dict = None, minimum_relevance: int = 0, query_config: dict = None, **kwargs)#

Advanced Search

Parameters
  • query (str) – The query to use

  • vector_search_query (dict) – The vector search query

  • fields_to_search (list) – The list of fields to search

  • select_fields (list) – The fields to select
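
Example

A minimal sketch; "product_title" is a hypothetical text field in the "sample" dataset.

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
results = ds.search(
    query="shoes",
    fields_to_search=["product_title"],
    select_fields=["product_title"],
    page_size=10,
)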

list_deployables(self)#

Use this function to list available deployables
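
Example

A minimal usage sketch:

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
deployables = ds.list_deployables()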

train_text_model_with_gpl(self, text_field: str, title_field: Optional[str] = None)#

Train a text model using GPL (Generative Pseudo-Labelling). This can be helpful for domain adaptation.

Example

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
# "sample_text_field" is a placeholder for your text field
ds.train_text_model_with_gpl(text_field="sample_text_field")
Parameters

text_field (str) – The text field to train on

train_text_model_with_tripleloss(self, text_field: str, label_field: str, output_dir: str = 'trained_model', percentage_for_dev=None)#

Supervised training of a text model using triplet loss

Example
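
A minimal sketch; "sample_text_field" and "sample_label_field" are placeholder field names.

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
ds.train_text_model_with_tripleloss(
    text_field="sample_text_field",
    label_field="sample_label_field",
    output_dir="trained_model",
    percentage_for_dev=0.2,
)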

Parameters
  • text_field (str) – The field you want to use as input text for fine-tuning

  • label_field (str) – The field indicating the classes of the input

  • output_dir (str) – The path of the output directory

  • percentage_for_dev (float) – A number between 0 and 1 indicating how much of the data should be used for evaluation; no evaluation if None

ClusterOps(self, alias, vector_fields: List, verbose: bool = False, **kwargs)#

ClusterOps object
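
Example

A minimal sketch, assuming a clustering was previously stored under the alias "kmeans-8" as in the cluster example above.

from relevanceai import Client
client = Client()
ds = client.Dataset("sample")
cluster_ops = ds.ClusterOps(
    alias="kmeans-8",
    vector_fields=["sample_vector_"],
)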

label_from_list(self, vector_field: str, model: Callable, label_list: list, similarity_metric='cosine', number_of_labels: int = 1, score_field: str = '_search_score', alias: Optional[str] = None)#

Label from a given list.

Parameters
  • vector_field (str) – The vector field to label in the original dataset

  • model (Callable) – This will take a list of strings and then encode them

  • label_list (List) – A list of labels to accept

  • similarity_metric (str) – The similarity metric to use

  • number_of_labels (int) – The number of labels to accept

  • score_field (str) – What to call the scoring of the labels

  • alias (str) – The alias of the labels

Example

from relevanceai import Client
client = Client()
df = client.Dataset("sample")

# Get a model to help us encode
from vectorhub.encoders.text.tfhub import USE2Vec
enc = USE2Vec()

# The list of labels to choose from
label_list = ["dog", "cat"]

df = client.Dataset("_github_repo_vectorai")

df.label_from_list("documentation_vector_", enc.bulk_encode, label_list, alias="pets")