

🤖 Basic Sub-clustering#

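This notebook is a quick guide to subclustering with Relevance AI. Subclustering lets you drill down into an existing cluster by running a further clustering step on only the documents inside it, and you can repeat this as many times as you need.

Basic subclustering reuses the ordinary clustering workflow: you choose a model, point it at a parent cluster field, and store the result under a new alias.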

For more details, please refer to the references.
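To make the idea concrete, here is a minimal pure-Python sketch of subclustering that does not use the Relevance AI SDK at all. It assumes a toy k-means seeded with the first k points; `tiny_kmeans` and `subcluster` are hypothetical helper names, and the data is made up. The nested labels follow the `cluster-<parent>-<child>` style that appears in the outputs later in this notebook.

```python
from collections import defaultdict


def tiny_kmeans(points, k, n_iter=10):
    """Toy Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def subcluster(points, parent_labels, k):
    """Re-cluster each parent cluster separately and build nested labels."""
    groups = defaultdict(list)
    for idx, lab in enumerate(parent_labels):
        groups[lab].append(idx)
    nested = [None] * len(points)
    for lab, idxs in groups.items():
        child = tiny_kmeans([points[i] for i in idxs], min(k, len(idxs)))
        for i, c in zip(idxs, child):
            nested[i] = f"cluster-{lab}-{c}"
    return nested


# Made-up 2-D points: two well-separated groups, each with two sub-groups.
points = [(0.0, 0.0), (10.0, 10.0), (5.0, 0.0), (15.0, 10.0),
          (0.0, 1.0), (10.0, 11.0), (5.0, 1.0), (15.0, 11.0)]

parents = tiny_kmeans(points, k=2)         # parent clustering
nested = subcluster(points, parents, k=2)  # subclustering inside each parent
print(nested)
```

Each parent cluster is clustered on its own, so the child index only has meaning within its parent: `cluster-0-1` and `cluster-1-1` are unrelated groups.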

!pip install -q RelevanceAI[notebook]

You can sign up or log in and find your credentials here: https://cloud.tryrelevance.com/sdk/api

Once you have signed up, click on the value under Authorization token and paste it here.

from relevanceai import Client

client = Client()

🚣 Inserting data#

We use a sample ecommerce dataset - with vectors product_image_clip_vector_ and product_title_clip_vector_ already encoded for us.

from relevanceai.utils.datasets import get_ecommerce_dataset_encoded

docs = get_ecommerce_dataset_encoded()
docs[0].keys()
dict_keys(['product_image', 'query', 'product_price', 'source', 'product_title', 'product_link', 'product_image_clip_vector_', 'product_title_clip_vector_', 'insert_date_', '_id'])
ds = client.Dataset("basic_subclustering")
ds.delete()
ds.upsert_documents(docs)
✅ All documents inserted/edited successfully.
ds.schema
{'insert_date_': 'date',
 'product_image': 'text',
 'product_image_clip_vector_': {'vector': 512},
 'product_link': 'text',
 'product_price': 'text',
 'product_title': 'text',
 'product_title_clip_vector_': {'vector': 512},
 'query': 'text',
 'source': 'text'}
vector_fields = ds.list_vector_fields()
vector_fields
['product_image_clip_vector_', 'product_title_clip_vector_']

πŸ’ Running the initial clustering approach:#

Let's instantiate a clustering model and set a parent alias that reflects n_clusters. We then cluster over each of the available vector fields.

n_clusters = 10
vector_field = "product_image_clip_vector_"
parent_alias = f"kmeans_{n_clusters}"

from sklearn.cluster import KMeans

model = KMeans(n_clusters=n_clusters)

for v in vector_fields:
    cluster_ops = ds.cluster(model, vector_fields=[v], alias=parent_alias)

ds.schema
Build your clustering app here: https://cloud.tryrelevance.com/dataset/basic_subclustering/deploy/recent/cluster/
{'_cluster_': 'dict',
 '_cluster_.product_image_clip_vector_': 'dict',
 '_cluster_.product_image_clip_vector_.kmeans_10': 'text',
 '_cluster_.product_title_clip_vector_': 'dict',
 '_cluster_.product_title_clip_vector_.kmeans_10': 'text',
 'insert_date_': 'date',
 'product_image': 'text',
 'product_image_clip_vector_': {'vector': 512},
 'product_link': 'text',
 'product_price': 'text',
 'product_title': 'text',
 'product_title_clip_vector_': {'vector': 512},
 'query': 'text',
 'source': 'text'}
# You can find the parent field in the schema or alternatively provide a field.
parent_field = f"_cluster_.{vector_field}.{parent_alias}"
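The dotted field name maps onto nested dictionaries inside each document, which is how the lookup code later in this notebook reads `d["_cluster_"][vector_field][parent_alias]`. The sketch below shows that correspondence on a made-up document; `get_nested` is a hypothetical helper, not part of the SDK.

```python
from functools import reduce


def get_nested(doc, dotted_field):
    """Follow a dotted field name through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_field.split("."), doc)


# Made-up document shaped like the schema above (values are illustrative).
example_doc = {
    "_id": "example-id",
    "_cluster_": {
        "product_image_clip_vector_": {"kmeans_10": "cluster-3"},
    },
}

vector_field = "product_image_clip_vector_"
parent_alias = "kmeans_10"
parent_field = f"_cluster_.{vector_field}.{parent_alias}"

parent_label = get_nested(example_doc, parent_field)
print(parent_field, "->", parent_label)
```

This only works because none of the individual field names contain a dot themselves.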

If we look at the resulting clusters in the clustering dashboard linked above, we can see potential for breaking them down further. At a high level we can see electronics and shoes, and subclustering lets us split these clusters into finer groups.

[Image: Screen Shot 2022-04-07 at 2.41.57 pm.png]

🫐 Running sub-clustering#

vector_field = "product_image_clip_vector_"

"""
Given the parent field - we now run subclustering
Let's dive deeper to view 3 subclusters
"""

subcluster_n_clusters = 3
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"

from sklearn.cluster import KMeans

model = KMeans(n_clusters=subcluster_n_clusters)

ds.subcluster(
    model=model,
    parent_field=parent_field,
    vector_fields=[vector_field],
    alias=subcluster_alias,
)
"""
We can see the new subcluster in the schema
"""

ds.schema
{'_cluster_': 'dict',
 '_cluster_.product_image_clip_vector_': 'dict',
 '_cluster_.product_image_clip_vector_.kmeans_10': 'text',
 '_cluster_.product_image_clip_vector_.kmeans_10_3': 'text',
 '_cluster_.product_title_clip_vector_': 'dict',
 '_cluster_.product_title_clip_vector_.kmeans_10': 'text',
 'insert_date_': 'date',
 'product_image': 'text',
 'product_image_clip_vector_': {'vector': 512},
 'product_link': 'text',
 'product_price': 'text',
 'product_title': 'text',
 'product_title_clip_vector_': {'vector': 512},
 'query': 'text',
 'source': 'text'}

You should also be able to track your subclusters using ds.metadata.

ds.metadata
{'_subcluster_': [{'parent_field': '_cluster_.product_image_clip_vector_.kmeans_10', 'cluster_field': '_cluster_.product_image_clip_vector_.kmeans_10_3'}]}
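Given metadata in the shape shown above, finding every subcluster field recorded for a parent is a simple filter. This is a sketch that assumes the metadata has been read into a plain dictionary; `subcluster_fields_for` is a hypothetical helper name.

```python
def subcluster_fields_for(metadata, parent_field):
    """List the subcluster fields recorded against a given parent field."""
    return [
        entry["cluster_field"]
        for entry in metadata.get("_subcluster_", [])
        if entry["parent_field"] == parent_field
    ]


# Same shape as the ds.metadata output shown above.
metadata = {
    "_subcluster_": [
        {
            "parent_field": "_cluster_.product_image_clip_vector_.kmeans_10",
            "cluster_field": "_cluster_.product_image_clip_vector_.kmeans_10_3",
        }
    ]
}

fields = subcluster_fields_for(
    metadata, "_cluster_.product_image_clip_vector_.kmeans_10"
)
print(fields)
```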

You can also view your subcluster results by indexing the dataset with the subcluster field:

subcluster_field = f"_cluster_.{vector_field}.{subcluster_alias}"

ds[subcluster_field]
/usr/local/lib/python3.7/dist-packages/relevanceai/dataset/series.py:93: UserWarning: Displaying using pandas. To get image functionality please install RelevanceAI[notebook].
  warnings.warn(Warning.MISSING_RELEVANCE_NOTEBOOK)
_cluster_.product_image_clip_vector_.kmeans_10_3
_id
0007a669-07e9-4a4a-b63c-40312690b381 cluster-9-1
00445000-a8ed-4523-b610-f70aa79d47f7 cluster-1-1
00a3d45e-2096-46aa-94c6-7d8480fb1436 cluster-1-2
01317a4c-2136-4fa3-be56-c07d79a646b3 cluster-4-1
0165f12a-cc93-4306-8161-750511e9a997 cluster-3-1
0186fa90-2de2-4b9c-9496-b395bf5cab51 cluster-0-0
01e4dba0-147e-41a7-8efa-95c33e23c93d cluster-5-2
026d370b-0660-468c-aeb0-63d6849713e2 cluster-5-0
02f4c283-23eb-432c-8dff-b6fece2aa869 cluster-5-0
03db6840-58de-4dd8-820a-0bd3a5f6b6d0 cluster-1-0
042210e2-382d-4483-8a94-72711505a56f cluster-3-0
0435795a-899f-4cdf-89be-a0f3f189d69e cluster-1-1
0478c702-b53c-46e6-8ab1-915670145163 cluster-3-1
04f72125-5a90-4574-a996-e41f2db7a767 cluster-7-0
050a9f63-3549-4720-9be7-9daa07f868e8 cluster-2-0
054c64cc-bb4b-48c6-a01d-99b532c07347 cluster-3-2
056cf704-162d-4ba5-8622-23695ee24216 cluster-5-1
05f401ef-d3f2-404b-a433-666fe410028d cluster-3-1
060bf51a-5918-4709-a2bc-8e74452ff853 cluster-9-0
0614f0a9-adcb-4c6c-939c-e7869525549c cluster-1-1
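Each label in the output above combines the parent cluster and the subcluster index, so `cluster-9-1` reads as subcluster 1 of parent `cluster-9`. A small sketch of splitting such labels, assuming they always end in `-<integer>`; `split_label` is a hypothetical helper.

```python
from collections import Counter


def split_label(label):
    """Split 'cluster-9-1' into its parent label and subcluster index."""
    parent, _, child = label.rpartition("-")
    return parent, int(child)


# A few labels copied from the output above.
labels = ["cluster-9-1", "cluster-1-1", "cluster-1-2", "cluster-4-1", "cluster-1-0"]

# Count how many of the sampled documents fall under each parent cluster.
per_parent = Counter(split_label(label)[0] for label in labels)
print(per_parent)
```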
"""
View dataset health
"""
ds.health()
You can view your dashboard at: https://cloud.tryrelevance.com/dataset/basic_subclustering/dashboard/monitor/schema
                                                  exists  missing
_cluster_                                         739     0
_cluster_.product_image_clip_vector_              739     0
_cluster_.product_image_clip_vector_.kmeans_10    739     0
_cluster_.product_image_clip_vector_.kmeans_10_3  739     0
_cluster_.product_title_clip_vector_              739     0
_cluster_.product_title_clip_vector_.kmeans_10    739     0
insert_date_                                      739     0
product_image                                     739     0
product_image_clip_vector_                        739     0
product_link                                      739     0
product_price                                     739     0
product_title                                     739     0
product_title_clip_vector_                        739     0
query                                             739     0
source                                            739     0

🧐 Looking into our subclusters#

Let's build a subcluster lookup to help us analyze our clusters further.

# subclusters =
# {
#  'parent_cluster_id': {
#     'subcluster_id': [ subcluster_docs ]
#   }
# }, ...
from collections import defaultdict
from pprint import pprint


def build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias):
    ## Let's retrieve our docs again with the new subcluster field
    docs = ds.get_all_documents(include_vector=True)
    subclusters = defaultdict(dict)
    doc_fields = [
        k
        for k in ds.schema.keys()
        if "." not in k
        if not any([f in k for f in ["_vector_", "insert_date_"]])
    ]

    for d in docs:
        parent_cluster = d["_cluster_"][vector_field][parent_alias]
        subcluster = d["_cluster_"][vector_field][subcluster_alias]
        doc = {k: v for k, v in d.items() if k in doc_fields}
        subclusters[parent_cluster].setdefault(subcluster, []).append(doc)
    return subclusters


subclusters_3 = build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias)
from relevanceai import show_json
from random import sample


def get_subcluster(subclusters, cluster_id, subcluster_ids=None):
    # Default to every subcluster under the parent cluster
    if not subcluster_ids:
        subcluster_ids = list(subclusters[cluster_id].keys())
    return {k: v for k, v in subclusters[cluster_id].items() if k in subcluster_ids}


def sample_subclusters(subclusters, cluster_id, subcluster_id=None, n_docs=10):
    docs = []
    subcluster_ids = (
        [subcluster_id] if subcluster_id else list(subclusters[cluster_id].keys())
    )

    # Pass each ID as a one-element list so membership is an exact match
    for sid in subcluster_ids:
        docs += get_subcluster(subclusters, cluster_id, [sid])[sid]

    # Never ask random.sample for more documents than we have
    n_docs = min(n_docs, len(docs))

    print("==========")
    print(f"Cluster: {cluster_id}")
    print(f"Subclusters: {' '.join(subcluster_ids)}")
    print(f"Displaying {n_docs} of {len(docs)} documents ... ")
    print("==========")

    display(
        show_json(
            sample(docs, n_docs),
            image_fields=["product_image"],
            text_fields=["query", "product_title", "product_price"],
        )
    )

# Sampling the parent cluster itself shows a mixture of items

cluster_id = "cluster-0"
sample_subclusters(subclusters_3, cluster_id)
==========
Cluster: cluster-0
Subclusters: cluster-0-0 cluster-0-1 cluster-0-2
Displaying 10 of 89 documents ...
==========
product_image query product_title product_price
0 nike womens Nike Women's Luxe Rectangular Sunglasses (As Is Item) $48.49
1 Levis Levi's Men's 514 Grey Twill Soft Washed Slim-straight Jeans $39.99
2 gold dress White Mark Women's Madelyn Mulitcolor Patterned Dress $40.99
3 gold dress Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress $85.99 - $86.99
4 Levis Levi's Women's Black Ink 'Perfect Waist' Straight Leg Jeans $44.99
5 workout clothes for women Marika Women's Heather Grey Leggings $24.99 - $27.99
6 yellow dress White Mark Women's Fit-and-Flare Floral Skater Dress $38.99
7 Levis Levi's Women's Petite Dark Ice Mid-rise Bootcut Jeans $39.99
8 gold dress White Mark Women's Plus 'Venezia' Gold Turquoise Dress $31.99
9 yellow dress White Mark Women's Teal and Yellow Printed Bell-sleeve Dress $38.99

Sub-clustering allows us to drill down further into our clusters to find more well-defined groups:

cluster_id = "cluster-0"
subcluster_id = "cluster-0-0"

print(f"Sampling {subcluster_alias} in {vector_field} ...")

sample_subclusters(subclusters_3, cluster_id, subcluster_id)
Sampling kmeans_10_3 in product_image_clip_vector_ ...
==========
Cluster: cluster-0
Subclusters: cluster-0-0
Displaying 10 of 45 documents ...
==========
product_image query product_title product_price
0 gold dress Daniella Collection Women's Black/ Gold Beaded Rhinestone Dress $329.99
1 gold dress A.B.S. by Allen Schwartz Women's Gold Sequined Fitted Cocktail Dress $84.99 - $226.99
2 gold dress Kayla Collection Women's Two-tone Metallic and Black Maxi Dress $114.99
3 yellow dress Amelia Women's Cotton Satin Front-zip Dress $43.99 - $53.49
4 gold dress White Mark Women's Madelyn Mulitcolor Patterned Dress $40.99
5 yellow dress White Mark Women's Fit-and-Flare Floral Skater Dress $38.99
6 gold dress Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress $85.99 - $86.99
7 gold dress R & M Richards Women's Plus Size Fortuny Pleated Metallic 2-piece Dress $84.99 - $88.99
8 yellow dress Von Ronen New York Women's Short Transformer Dress One Size Fits 0-12 $82.99 - $84.99
9 gold dress La Femme Gold Sequined Sweetheart Rhinestone Strapless Formal Dress $319.99
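One way to check how well-defined a subcluster is, beyond eyeballing a sample, is to count which `query` values its documents came from. The sketch below runs on a tiny made-up lookup with the same nesting as the one built earlier (parent cluster ID, then subcluster ID, then a list of documents); `query_breakdown` is a hypothetical helper.

```python
from collections import Counter


def query_breakdown(subclusters, cluster_id):
    """Count the `query` values inside each subcluster of one parent cluster."""
    return {
        subcluster_id: Counter(doc["query"] for doc in docs)
        for subcluster_id, docs in subclusters[cluster_id].items()
    }


# Tiny made-up lookup: parent cluster -> subcluster -> list of documents.
toy_lookup = {
    "cluster-0": {
        "cluster-0-0": [{"query": "gold dress"}, {"query": "gold dress"},
                        {"query": "yellow dress"}],
        "cluster-0-1": [{"query": "Levis"}, {"query": "Levis"}],
    }
}

breakdown = query_breakdown(toy_lookup, "cluster-0")

# Most frequent query (and its count) for every subcluster.
top_queries = {sid: counts.most_common(1)[0] for sid, counts in breakdown.items()}
print(top_queries)
```

A subcluster dominated by one or two related queries is usually a sign that the split is meaningful.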

πŸ‡ You can then run sub-clustering again on a separate parent alias!#

If our initial subclusters are not fine-grained enough, we can run subclustering again with more clusters to drill down even further.

You can repeat subclustering as many times as required by referring back to the parent alias each time.
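Because every run in this notebook names its alias `<parent_alias>_<n_subclusters>` and stores labels under `_cluster_.<vector_field>.<alias>`, the names for several runs can be derived up front. This is only a sketch of that naming pattern; `subcluster_names` is a hypothetical helper, not an SDK function.

```python
def subcluster_names(vector_field, parent_alias, n_subclusters):
    """Return the (alias, field) pair used for one subclustering run."""
    alias = f"{parent_alias}_{n_subclusters}"
    field = f"_cluster_.{vector_field}.{alias}"
    return alias, field


# Names for the 3-subcluster and 5-subcluster runs on the same parent.
runs = {
    n: subcluster_names("product_image_clip_vector_", "kmeans_10", n)
    for n in (3, 5)
}

for n, (alias, field) in runs.items():
    print(n, alias, field)
```

Using a distinct alias per run keeps the earlier subclusters in the schema, so the runs can be compared side by side.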

"""
Given the parent field - we now run subclustering
Before, we subclustered on 3 subclusters
Let's dive even deeper to view 5 subclusters
"""

subcluster_n_clusters = 5
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"

from sklearn.cluster import KMeans

model = KMeans(n_clusters=subcluster_n_clusters)

ds.subcluster(
    model=model,
    parent_field=parent_field,
    vector_fields=[vector_field],
    alias=subcluster_alias,
)

Let's sample again with 5 subclusters. Compared with the 3-subcluster run, these results are even more fine-grained.

subclusters_5 = build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias)

cluster_id = "cluster-0"
subcluster_id = "cluster-0-0"

print(f"Sampling {subcluster_alias} in {vector_field} ...")
sample_subclusters(subclusters_5, cluster_id, subcluster_id)
Sampling kmeans_10_5 in product_image_clip_vector_ ...
==========
Cluster: cluster-0
Subclusters: cluster-0-0
Displaying 10 of 13 documents ...
==========
product_image query product_title product_price
0 gold dress Aidan Mattox Gold Cap Sleeve Lace Side Pocket Evening Dress $469.99
1 gold dress Little Mistress Women's Black and Gold Sequin Dress $111.99 - $124.99
2 gold dress Daniella Collection Women's Black/ Gold Beaded Rhinestone Dress $329.99
3 gold dress Ignite Evenings by Carol Lin Women's Sequin Halter Gown $131.99
4 gold dress Halston Heritage Women's Gold Allover Sequined Evening Dress $154.99 - $172.99
5 gold dress Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress $85.99 - $86.99
6 yellow dress R & M Richards Women's Lurex Draped Jacket and Dress Set $89.99
7 gold dress La Femme Gold Sequined Sweetheart Rhinestone Strapless Formal Dress $319.99
8 gold dress Aidan Mattox Gold Sequin Tulle V-neck Sleeveless Long Evening Dress $399.99
9 gold dress Aidan Mattox Gold Strapless Fairy Tale Empire Waist Bead Evening Dress $469.99
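The outputs above report 45 documents in `cluster-0-0` with 3 subclusters and 13 with 5 subclusters. To compare two runs more systematically, you can tabulate subcluster sizes from each lookup. The sketch below uses made-up lookups with the same nesting as `subclusters_3` and `subclusters_5`; `subcluster_sizes` is a hypothetical helper.

```python
def subcluster_sizes(subclusters, cluster_id):
    """Map each subcluster ID under one parent cluster to its document count."""
    return {sid: len(docs) for sid, docs in subclusters[cluster_id].items()}


# Made-up lookups: the same six documents split into 2 and then 3 subclusters.
lookup_coarse = {
    "cluster-0": {"cluster-0-0": [{}] * 4, "cluster-0-1": [{}] * 2}
}
lookup_fine = {
    "cluster-0": {
        "cluster-0-0": [{}] * 2,
        "cluster-0-1": [{}] * 2,
        "cluster-0-2": [{}] * 2,
    }
}

coarse = subcluster_sizes(lookup_coarse, "cluster-0")
fine = subcluster_sizes(lookup_fine, "cluster-0")
print(coarse)
print(fine)
```

Note that subcluster IDs are assigned independently in each run, so `cluster-0-0` in one lookup need not correspond to `cluster-0-0` in the other.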

Next Steps#


If you need more in-depth knowledge of subclustering, we will be writing more guides on how to adapt these steps to different aliases and models in the near future.

For more details, please refer to the references.