Filters (Complex)#

Filters are great tools to retrieve a subset of documents whose data match the criteria specified in the filter. For instance, in an e-commerce dataset, we can retrieve all products:

  • with prices between 200 and 300 dollars

  • with the phrase “free return” included in the description field

  • that were produced after January 2020

📘 Filters help us find what we need.

Because only documents that meet the filtering criteria are displayed, filtering gives us a more fine-grained overview of the data.

How to form a filter?

Filters at Relevance AI are defined as Python dictionaries with four main keys:

  • field (the data field in the document you want to filter on)

  • condition (an operator such as greater than or equal to)

  • filter_type (the type of filter you want to apply, such as date, numeric or text)

  • condition_value (the value to filter on; its form depends on the filter type)

filter = [
    {
        "field": "description",
        "filter_type": "contains",
        "condition": "==",
        "condition_value": "Durian Club"
    }
]
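The four keys can also be assembled with a small helper that catches common mistakes, such as an unquoted field name or a misspelled operator. This is only a minimal sketch: `make_filter` and `ALLOWED_CONDITIONS` are hypothetical names used for illustration and are not part of the Relevance AI SDK.

```python
# Hypothetical helper (not part of the relevanceai package): builds one
# filter dictionary with the four keys described above.
ALLOWED_CONDITIONS = {"==", "!=", ">=", ">", "<", "<="}


def make_filter(field, filter_type, condition, condition_value):
    """Return a filter dict after basic checks on the string-valued keys."""
    for name, value in (("field", field), ("filter_type", filter_type)):
        if not isinstance(value, str):
            raise TypeError(f"{name} must be a string, got {type(value).__name__}")
    if condition not in ALLOWED_CONDITIONS:
        raise ValueError(f"unsupported condition: {condition!r}")
    return {
        "field": field,
        "filter_type": filter_type,
        "condition": condition,
        "condition_value": condition_value,
    }


filters = [make_filter("description", "contains", "==", "Durian Club")]
print(filters)
```

The resulting list has the same shape as the hand-written filter above, so it can be passed to the dataset in the same way.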

Filtering operators#

Relevance AI covers all common operators:

  • “==” (a == b, a equals b)

  • “!=” (a != b, a does not equal b)

  • “>=” (a >= b, a greater than or equal to b)

  • “>” (a > b, a greater than b)

  • “<” (a < b, a smaller than b)

  • “<=” (a <= b, a smaller than or equal to b)
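To make the meaning of each operator concrete, the sketch below maps the condition strings onto Python's own comparisons and applies them to a made-up list of prices. It only illustrates the semantics locally; `CONDITIONS` and `compare` are names introduced here, not SDK functions.

```python
import operator

# Map each condition string to the matching Python comparison.
# Illustrative only; the names below are not part of the SDK.
CONDITIONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
    "<": operator.lt,
    "<=": operator.le,
}


def compare(a, condition, b):
    """Evaluate `a <condition> b`, e.g. compare(250, ">=", 200)."""
    return CONDITIONS[condition](a, b)


prices = [150, 200, 250, 300, 350]
# Prices between 200 and 300 dollars, inclusive.
in_range = [p for p in prices if compare(p, ">=", 200) and compare(p, "<=", 300)]
print(in_range)  # [200, 250, 300]
```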

Filter types

Supported filter types at Relevance AI are listed below.

  • contains

  • exact_match

  • word_match

  • categories

  • exists

  • date

  • numeric

  • ids

  • regexp

  • or (for mixing together multiple filters in OR situations)

Each filter type is explained below, followed by a sample code snippet. There is also a guide on how to combine filters and vector search.


Exists#

This filter returns documents depending on whether a certain field exists in them, as opposed to the other filter types, which test field values. For instance, it can filter out documents in which there is no ‘purchase-info’ field. Note that this filter is case-sensitive.

You can sign up/login and find your credentials here: https://cloud.relevance.ai/sdk/api Once you have signed up, click on the value under Activation token and paste it here

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filters = [
    {
        "field": "brand",
        "filter_type": "exists",
        "condition": "==",
        "condition_value": " "
    }
]

filtered_data = ds.get_documents(filters=filters)
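The behaviour described above can be pictured with a few in-memory documents. This is a local sketch of the semantics only, using made-up documents and a hypothetical `field_exists` helper; it does not show how the service implements the filter.

```python
# Sample documents (made up for illustration).
documents = [
    {"_id": "1", "product_name": "Sofa", "brand": "Durian"},
    {"_id": "2", "product_name": "Curtain"},             # no "brand" field
    {"_id": "3", "product_name": "Lamp", "Brand": "X"},  # different case
]


def field_exists(document, field):
    """True if the document has a key with exactly this (case-sensitive) name."""
    return field in document


with_brand = [d["_id"] for d in documents if field_exists(d, "brand")]
without_brand = [d["_id"] for d in documents if not field_exists(d, "brand")]
print(with_brand)     # ['1']
print(without_brand)  # ['2', '3']
```

Document 3 is reported as missing the field because "Brand" and "brand" are different names under a case-sensitive check.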

Exact Match#

This filter works with string values and only returns documents with a field value that exactly matches the filter value. For instance, when filtering by ‘Samsung galaxy s21’, the result will only contain products that have exactly ‘Samsung galaxy s21’ in the specified field. Note that this filter is case-sensitive.

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "product_name",
        "filter_type": "exact_match",
        "condition": "==",
        "condition_value": "Durian Leather 2 Seater Sofa"
    }
]

filtered_data = ds.get_where(filter)

Categories#

This filter checks the entries in a database and returns those in which a field value appears in a given list. For instance, it can return the products whose brand is any of Sony, Samsung, or LG. Note that this filter is case-sensitive.

filter = [
    {
        "field": "brand",
        "filter_type": "categories",
        "condition": ">=",
        "condition_value": ["LG", "samsung"]
    }
]

filtered_data = ds.get_where(filter)
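As a rough local picture of what the categories filter selects, the sketch below keeps the documents whose field value appears in the given list. The documents and the `in_categories` helper are made up for illustration.

```python
# Made-up documents; note the two spellings of the same brand.
documents = [
    {"product_name": "TV A", "brand": "LG"},
    {"product_name": "TV B", "brand": "samsung"},
    {"product_name": "TV C", "brand": "Samsung"},
    {"product_name": "TV D", "brand": "Sony"},
]


def in_categories(document, field, allowed):
    """True if the document's field value is one of the allowed values."""
    return document.get(field) in allowed


matches = [
    d["product_name"]
    for d in documents
    if in_categories(d, "brand", ["LG", "samsung"])
]
print(matches)  # ['TV A', 'TV B']
```

"TV C" is not returned because the comparison is case-sensitive: "Samsung" is not the same value as "samsung".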

Contains#

This filter returns a document only if the specified field contains the given string value. Note that substrings are covered in this category. For instance, if a product name is composed of a name and a number (e.g. ABC-123), one might remember the name but not the number. This filter can easily return all products including the ABC string. Note that this filter is case-sensitive.


from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "description",
        "filter_type": "contains",
        "condition": "==",
        "condition_value": "Durian BID"
    }
]

filtered_data = ds.get_where(filter)

Date#

This filter performs date analysis and filters documents based on their date information. For instance, it is possible to filter out any documents with a production date before January 2021.

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "insert_date_",
        "filter_type": "date",
        "condition": "==",
        "condition_value": "2020-07-01"
    }
]

Note that the default format is “yyyy-MM-dd”, but it can be changed, for example to “yyyy-dd-MM”, through the format key, as shown in the example below.

filters = [
    {
        "field": "insert_date_",
        "filter_type": "date",
        "condition": "==",
        "condition_value": "2020-07-01",
        "format": "yyyy-dd-MM"
    }
]

filtered_data = ds.get_documents(filters=filters)
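The format matters because the same string can denote two different dates. The sketch below parses "2020-07-01" under both patterns, assuming the usual convention that `yyyy` is the year, `MM` the month and `dd` the day. `parse_date` and `TOKEN_TO_DIRECTIVE` are hypothetical names, and the parsing here is done with Python's `datetime`, not by the service.

```python
from datetime import datetime

# Assumed translation of the pattern tokens into strptime directives
# (yyyy = year, MM = month, dd = day).
TOKEN_TO_DIRECTIVE = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}


def parse_date(value, pattern):
    """Parse `value` according to a dash-separated pattern such as "yyyy-MM-dd"."""
    directives = [TOKEN_TO_DIRECTIVE[token] for token in pattern.split("-")]
    return datetime.strptime(value, "-".join(directives)).date()


print(parse_date("2020-07-01", "yyyy-MM-dd"))  # 2020-07-01 (1 July 2020)
print(parse_date("2020-07-01", "yyyy-dd-MM"))  # 2020-01-07 (7 January 2020)
```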

Word Match#

This filter has similarities to both exact_match and contains. It returns a document only if it contains words matching the filter value. Substrings are covered, but only when they can be separated from the rest of the text by common word separators such as white-space. For instance, the filter value “Home Gallery” can return a document with “Buy Home Fashion Gallery Polyester …” in the description field, as both words appear explicitly in the text. Note that this filter is case-sensitive.

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "description",
        "filter_type": "word_match",
        "condition": "==",
        "condition_value": "Home curtain"
    }
]

filtered_data = ds.get_where(filter)
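The difference between exact_match, contains and word_match is easiest to see side by side. The three functions below are simplified local stand-ins written for this comparison (splitting on white-space only); the actual tokenization used by the service may differ, for example around punctuation.

```python
def exact_match(text, value):
    """The whole field value must equal the filter value."""
    return text == value


def contains(text, value):
    """The filter value must appear as a contiguous substring."""
    return value in text


def word_match(text, value):
    """Every white-space separated word of `value` must be a whole word in `text`."""
    words = set(text.split())
    return all(word in words for word in value.split())


description = "Buy Home Fashion Gallery Polyester"

print(exact_match(description, "Home Gallery"))  # False
print(contains(description, "Home Gallery"))     # False: not contiguous
print(word_match(description, "Home Gallery"))   # True: both words occur
print(contains(description, "Gall"))             # True: substring
print(word_match(description, "Gall"))           # False: not a whole word
```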

IDs#

This filter returns documents whose unique id exists in a given list. It may look similar to ‘categories’. The main difference is the search speed.

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "_id",
        "filter_type": "ids",
        "condition": "==",
        "condition_value": "7790e058cbe1b1e10e20cd22a1e53d36"
    }
]

filtered_data = ds.get_documents(filters=filter)

Numeric#

This filter applies the filtering operators to a numeric value. For instance, it can return the documents with a price larger than 1000 dollars.


from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "retail_price",
        "filter_type": "numeric",
        "condition": ">",
        "condition_value": 5000
    }
]

filtered_data = ds.get_documents(filters=filter)

Or#

The or filter helps you filter for multiple conditions, returning documents that satisfy at least one of them. Unlike other filters, the only keys used here are filter_type and condition_value, where condition_value is a list of filters.

from relevanceai import Client
client = Client()

# ds is a client.Dataset(...) object, created as in the earlier examples

filters = [
    {
        "filter_type": "or",
        "condition_value": [
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": "<=",
                "condition_value": 90
            },
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": ">=",
                "condition_value": 150
            }
        ]
    }
]

filtered_data = ds.get_documents(filters=filters)

(A or B) and (C or D)#

Below, we show an example of how to use two or filters together. Each or filter holds a list of conditions, and a document must satisfy both or filters to be returned.

from relevanceai import Client
client = Client()

# ds is a client.Dataset(...) object, created as in the earlier examples

filter = [
    {
        # A or B
        "filter_type": "or",
        "condition_value": [
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": "<=",
                "condition_value": 90
            },
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": ">=",
                "condition_value": 150
            }
        ]
    },
    {
        # C or D
        "filter_type": "or",
        "condition_value": [
            {
                "field": "animal",
                "filter_type": "category",
                "condition": "==",
                "condition_value": "cat"
            },
            {
                "field": "animal",
                "filter_type": "category",
                "condition": "==",
                "condition_value": "dog"
            }
        ]
    }
]

filtered_data = ds.get_where(filter)
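To check what such a nested filter selects, it can help to evaluate it by hand on a few documents. The evaluator below is a local sketch of the (A or B) and (C or D) logic only: it treats the top-level list as AND, an or filter as "any nested filter passes", and compares values with Python operators. `matches`, `matches_all` and the sample documents are made up for illustration.

```python
import operator

CONDITIONS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
              ">": operator.gt, "<": operator.lt, "<=": operator.le}


def matches(document, flt):
    """Evaluate one filter dict against one document (local sketch only)."""
    if flt["filter_type"] == "or":
        # An "or" filter passes if any nested filter passes.
        return any(matches(document, sub) for sub in flt["condition_value"])
    if flt["field"] not in document:
        return False
    compare = CONDITIONS[flt["condition"]]
    return compare(document[flt["field"]], flt["condition_value"])


def matches_all(document, filters):
    """Filters in the top-level list are combined with AND."""
    return all(matches(document, flt) for flt in filters)


filters = [
    {"filter_type": "or", "condition_value": [
        {"field": "price", "filter_type": "numeric", "condition": "<=", "condition_value": 90},
        {"field": "price", "filter_type": "numeric", "condition": ">=", "condition_value": 150},
    ]},
    {"filter_type": "or", "condition_value": [
        {"field": "animal", "filter_type": "category", "condition": "==", "condition_value": "cat"},
        {"field": "animal", "filter_type": "category", "condition": "==", "condition_value": "dog"},
    ]},
]

documents = [
    {"price": 80, "animal": "cat"},    # passes both groups
    {"price": 120, "animal": "dog"},   # fails the price group
    {"price": 200, "animal": "bird"},  # fails the animal group
]
print([matches_all(d, filters) for d in documents])  # [True, False, False]
```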

(A or B or C) and D#

Below, we show an example of how to combine an or filter holding three conditions with one regular filter.

from relevanceai import Client
client = Client()

# ds is a client.Dataset(...) object, created as in the earlier examples

filter = [
    {
        # A or B or C
        "filter_type": "or",
        "condition_value": [
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": "<=",
                "condition_value": 90
            },
            {
                "field": "price",
                "filter_type": "numeric",
                "condition": ">=",
                "condition_value": 150
            },
            {
                "field": "value",
                "filter_type": "numeric",
                "condition": ">=",
                "condition_value": 2
            }
        ]
    },
    {
        # D
        "field": "animal",
        "filter_type": "category",
        "condition": "==",
        "condition_value": "cat"
    }
]

filtered_data = ds.get_documents(filters=filter)
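Deeply nested dictionaries make it easy to misplace a bracket. One way to reduce that risk is to build the structure with small functions. The sketch below assumes nothing beyond the dictionary layout shown above; `or_filter` and `numeric` are hypothetical helpers, not SDK functions.

```python
def or_filter(*conditions):
    """Wrap individual filter dicts in a single "or" filter."""
    return {"filter_type": "or", "condition_value": list(conditions)}


def numeric(field, condition, value):
    """Build a numeric filter dict."""
    return {"field": field, "filter_type": "numeric",
            "condition": condition, "condition_value": value}


filters = [
    or_filter(                      # A or B or C
        numeric("price", "<=", 90),
        numeric("price", ">=", 150),
        numeric("value", ">=", 2),
    ),
    {                               # and D
        "field": "animal",
        "filter_type": "category",
        "condition": "==",
        "condition_value": "cat",
    },
]
print(filters)
```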

Regex#

This filter returns a document only if the field value matches a regular expression (regexp). Partial matches are covered: for instance, if a product name is composed of a name and a number (e.g. ABC-123), one might remember the name but not the number, and a pattern can return all products whose name includes the ABC string.

Relevance AI uses the same regular expression syntax as Elasticsearch, which is built on Apache Lucene, to parse queries.

Note that this filter is case-sensitive.

from relevanceai import Client
client = Client()

DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)

filter = [
    {
        "field": "description",
        "filter_type": "regexp",
        "condition": "==",
        "condition_value": r".*Durian (\w+)"
    }
]
filtered_data = ds.get_where(filter)
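Before sending a pattern to the service, it can be tried out locally on a few strings. The sketch below uses Python's `re` module as a stand-in, which is not identical to the syntax the service accepts, and it assumes the pattern must match the whole value (hence `fullmatch`). The sample descriptions are made up.

```python
import re

# Pattern from the example above. Python's `re` syntax is a stand-in here;
# it is not identical to the syntax the service accepts.
pattern = re.compile(r".*Durian (\w+)")

descriptions = [
    "Buy Durian Sofa",
    "Durian leather sofa set",
    "durian sofa",
]

results = {}
for text in descriptions:
    # fullmatch requires the whole string to match (an assumption about
    # how patterns are anchored).
    match = pattern.fullmatch(text)
    results[text] = match.group(1) if match else None
    print(text, "->", results[text])
```

Only the first description matches: the second has more than one word after "Durian ", and the third differs in case.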

Combining filters#

It is possible to combine multiple filters. For instance, the sample code below shows a filter that searches for

  • a Lenovo flip cover

  • produced after January 2020

  • by either Lapguard or 4D brand.

A screenshot of the results can be seen on top.


from relevanceai import Client
client = Client()


DATASET_ID = "ecommerce-sample-dataset"
ds = client.Dataset(DATASET_ID)
filter = [
    {
        "field": "description",
        "filter_type": "contains",
        "condition": "==",
        "condition_value": "Lenovo"
    },
    {
        "field": "brand",
        "filter_type": "categories",
        "condition": "==",
        "condition_value": ["Lapguard", "4D"]
    },
    {
        "field": "insert_date_",
        "filter_type": "date",
        "condition": ">=",
        "condition_value": "2020-01-01"
    }
]

filtered_data = ds.get_where(filter)