Basic Sub-clustering
This notebook is a quick guide to subclustering with Relevance AI. Subclustering lets you drill down into existing clusters by running a further round of clustering inside each one.
Basic sub-clustering reuses the ordinary clustering workflow: you choose a parent cluster field and a model, and each parent cluster is split into smaller groups.
For more details, please refer to the references.
!pip install -q RelevanceAI[notebook]
You can sign up or log in and find your credentials here: https://cloud.tryrelevance.com/sdk/api
Once you have signed up, click on the value under Authorization token and paste it here.
from relevanceai import Client
client = Client()
Inserting data
We use a sample ecommerce dataset with the vectors product_image_clip_vector_ and product_title_clip_vector_ already encoded for us.
from relevanceai.utils.datasets import get_ecommerce_dataset_encoded
docs = get_ecommerce_dataset_encoded()
docs[0].keys()
dict_keys(['product_image', 'query', 'product_price', 'source', 'product_title', 'product_link', 'product_image_clip_vector_', 'product_title_clip_vector_', 'insert_date_', '_id'])
ds = client.Dataset("basic_subclustering")
ds.delete()
ds.upsert_documents(docs)
All documents inserted/edited successfully.
ds.schema
{'insert_date_': 'date',
'product_image': 'text',
'product_image_clip_vector_': {'vector': 512},
'product_link': 'text',
'product_price': 'text',
'product_title': 'text',
'product_title_clip_vector_': {'vector': 512},
'query': 'text',
'source': 'text'}
vector_fields = ds.list_vector_fields()
vector_fields
['product_image_clip_vector_', 'product_title_clip_vector_']
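As a rough illustration of where that list comes from, vector fields are the schema entries whose type is a dict with a `vector` key. The sketch below works on a plain copy of the schema shown above; the helper `vector_fields_from_schema` is hypothetical and is not how `ds.list_vector_fields()` is implemented.

```python
# Schema copied from the `ds.schema` output above.
schema = {
    "insert_date_": "date",
    "product_image": "text",
    "product_image_clip_vector_": {"vector": 512},
    "product_link": "text",
    "product_price": "text",
    "product_title": "text",
    "product_title_clip_vector_": {"vector": 512},
    "query": "text",
    "source": "text",
}


def vector_fields_from_schema(schema):
    """Return the names of fields whose schema entry is a dict with a 'vector' key.

    Hypothetical helper for illustration only.
    """
    return sorted(
        name
        for name, field_type in schema.items()
        if isinstance(field_type, dict) and "vector" in field_type
    )


found = vector_fields_from_schema(schema)
print(found)
```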
Running the initial clustering approach
Let's instantiate a clustering model and set a parent alias that reflects n_clusters. We then cluster over each available vector field.
n_clusters = 10
vector_field = "product_image_clip_vector_"
parent_alias = f"kmeans_{n_clusters}"
from sklearn.cluster import KMeans
model = KMeans(n_clusters=n_clusters)
for v in vector_fields:
    cluster_ops = ds.cluster(model, vector_fields=[v], alias=parent_alias)
ds.schema
Build your clustering app here: https://cloud.tryrelevance.com/dataset/basic_subclustering/deploy/recent/cluster/
{'_cluster_': 'dict',
'_cluster_.product_image_clip_vector_': 'dict',
'_cluster_.product_image_clip_vector_.kmeans_10': 'text',
'_cluster_.product_title_clip_vector_': 'dict',
'_cluster_.product_title_clip_vector_.kmeans_10': 'text',
'insert_date_': 'date',
'product_image': 'text',
'product_image_clip_vector_': {'vector': 512},
'product_link': 'text',
'product_price': 'text',
'product_title': 'text',
'product_title_clip_vector_': {'vector': 512},
'query': 'text',
'source': 'text'}
# You can find the parent field in the schema or alternatively provide a field.
parent_field = f"_cluster_.{vector_field}.{parent_alias}"
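To make the dotted parent field concrete, here is a minimal sketch of how such a path maps onto the nested `_cluster_` dict stored on each document. The `get_nested` helper and the example document are hypothetical; only the field-naming pattern comes from this guide.

```python
from functools import reduce


def get_nested(doc, dotted_field):
    """Follow a dotted field path such as '_cluster_.a.b' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_field.split("."), doc)


vector_field = "product_image_clip_vector_"
parent_alias = "kmeans_10"
parent_field = f"_cluster_.{vector_field}.{parent_alias}"

# Hypothetical document, shaped like the nested cluster fields in the schema above.
doc = {"_id": "example", "_cluster_": {vector_field: {parent_alias: "cluster-4"}}}

parent_label = get_nested(doc, parent_field)
print(parent_field, "->", parent_label)
```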
If we look at the resulting clusters in the clustering dashboard linked above, we can see potential to break the clusters down further. At a high level we can see electronics and shoes, but subclustering lets us split these clusters into finer groups.
Running sub-clustering
vector_field = "product_image_clip_vector_"
"""
Given the parent field - we now run subclustering
Let's dive deeper to view 3 subclusters
"""
subcluster_n_clusters = 3
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"
from sklearn.cluster import KMeans
model = KMeans(n_clusters=subcluster_n_clusters)
ds.subcluster(
model=model,
parent_field=parent_field,
vector_fields=[vector_field],
alias=subcluster_alias,
)
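Conceptually, subclustering runs a clustering model separately inside each parent cluster and joins the two labels. The sketch below shows that idea in plain Python under stated assumptions: `split_at_mean` is a hypothetical stand-in for a real model such as KMeans, the data are made-up 1-D values, and the `<parent>-<child>` label format is inferred from the results shown in this guide. It is not Relevance AI's implementation.

```python
from collections import defaultdict


def subcluster(points, parent_labels, cluster_fn):
    """Run cluster_fn separately inside each parent cluster.

    Returns one label per point in the form '<parent>-<child>'.
    """
    # Group point indices by their parent cluster label
    groups = defaultdict(list)
    for idx, parent in enumerate(parent_labels):
        groups[parent].append(idx)

    labels = [None] * len(points)
    for parent, indices in groups.items():
        # Cluster only the points that belong to this parent
        child_labels = cluster_fn([points[i] for i in indices])
        for i, child in zip(indices, child_labels):
            labels[i] = f"{parent}-{child}"
    return labels


def split_at_mean(values):
    """Hypothetical stand-in for a real model: two groups, at or below vs above the mean."""
    mean = sum(values) / len(values)
    return [0 if v <= mean else 1 for v in values]


points = [1.0, 1.2, 5.0, 5.1, 20.0, 21.0, 30.0, 31.0]
parents = ["cluster-0"] * 4 + ["cluster-1"] * 4
sub_labels = subcluster(points, parents, split_at_mean)
print(sub_labels)
```

Because each parent is clustered independently, the child index only has meaning within its parent: `cluster-0-1` and `cluster-1-1` are unrelated groups.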
"""
We can see the new subcluster in the schema
"""
ds.schema
{'_cluster_': 'dict',
'_cluster_.product_image_clip_vector_': 'dict',
'_cluster_.product_image_clip_vector_.kmeans_10': 'text',
'_cluster_.product_image_clip_vector_.kmeans_10_3': 'text',
'_cluster_.product_title_clip_vector_': 'dict',
'_cluster_.product_title_clip_vector_.kmeans_10': 'text',
'insert_date_': 'date',
'product_image': 'text',
'product_image_clip_vector_': {'vector': 512},
'product_link': 'text',
'product_price': 'text',
'product_title': 'text',
'product_title_clip_vector_': {'vector': 512},
'query': 'text',
'source': 'text'}
You should also be able to track your subclusters using ds.metadata.
ds.metadata
{'_subcluster_': [{'parent_field': '_cluster_.product_image_clip_vector_.kmeans_10', 'cluster_field': '_cluster_.product_image_clip_vector_.kmeans_10_3'}]}
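If several subclustering runs accumulate, it can help to group the tracked fields by parent. The following is a small sketch that works on a plain dict copied from the output above; `subclusters_by_parent` is a hypothetical helper, and it assumes the metadata keeps the `_subcluster_` list shape shown here.

```python
from collections import defaultdict

# Metadata copied from the `ds.metadata` output above.
metadata = {
    "_subcluster_": [
        {
            "parent_field": "_cluster_.product_image_clip_vector_.kmeans_10",
            "cluster_field": "_cluster_.product_image_clip_vector_.kmeans_10_3",
        }
    ]
}


def subclusters_by_parent(metadata):
    """Map each parent field to the list of subcluster fields derived from it."""
    mapping = defaultdict(list)
    for entry in metadata.get("_subcluster_", []):
        mapping[entry["parent_field"]].append(entry["cluster_field"])
    return dict(mapping)


tracked = subclusters_by_parent(metadata)
print(tracked)
```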
You can also view your subcluster results by indexing the dataset with the subcluster field:
subcluster_field = f"_cluster_.{vector_field}.{subcluster_alias}"
ds[subcluster_field]
/usr/local/lib/python3.7/dist-packages/relevanceai/dataset/series.py:93: UserWarning: Displaying using pandas. To get image functionality please install RelevanceAI[notebook].
warnings.warn(Warning.MISSING_RELEVANCE_NOTEBOOK)
_id | _cluster_.product_image_clip_vector_.kmeans_10_3 |
---|---|
0007a669-07e9-4a4a-b63c-40312690b381 | cluster-9-1 |
00445000-a8ed-4523-b610-f70aa79d47f7 | cluster-1-1 |
00a3d45e-2096-46aa-94c6-7d8480fb1436 | cluster-1-2 |
01317a4c-2136-4fa3-be56-c07d79a646b3 | cluster-4-1 |
0165f12a-cc93-4306-8161-750511e9a997 | cluster-3-1 |
0186fa90-2de2-4b9c-9496-b395bf5cab51 | cluster-0-0 |
01e4dba0-147e-41a7-8efa-95c33e23c93d | cluster-5-2 |
026d370b-0660-468c-aeb0-63d6849713e2 | cluster-5-0 |
02f4c283-23eb-432c-8dff-b6fece2aa869 | cluster-5-0 |
03db6840-58de-4dd8-820a-0bd3a5f6b6d0 | cluster-1-0 |
042210e2-382d-4483-8a94-72711505a56f | cluster-3-0 |
0435795a-899f-4cdf-89be-a0f3f189d69e | cluster-1-1 |
0478c702-b53c-46e6-8ab1-915670145163 | cluster-3-1 |
04f72125-5a90-4574-a996-e41f2db7a767 | cluster-7-0 |
050a9f63-3549-4720-9be7-9daa07f868e8 | cluster-2-0 |
054c64cc-bb4b-48c6-a01d-99b532c07347 | cluster-3-2 |
056cf704-162d-4ba5-8622-23695ee24216 | cluster-5-1 |
05f401ef-d3f2-404b-a433-666fe410028d | cluster-3-1 |
060bf51a-5918-4709-a2bc-8e74452ff853 | cluster-9-0 |
0614f0a9-adcb-4c6c-939c-e7869525549c | cluster-1-1 |
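The subcluster labels appear to follow a `<parent label>-<subcluster index>` pattern, which is an inference from the table above rather than documented behaviour. Under that assumption, a short sketch for recovering the parent and counting documents per group looks like this; `parent_of` is a hypothetical helper and the labels are the first ten rows of the table.

```python
from collections import Counter

# The first ten labels from the table above.
labels = [
    "cluster-9-1", "cluster-1-1", "cluster-1-2", "cluster-4-1", "cluster-3-1",
    "cluster-0-0", "cluster-5-2", "cluster-5-0", "cluster-5-0", "cluster-1-0",
]


def parent_of(subcluster_label):
    """Strip the trailing subcluster index (assumes '<parent>-<index>' labels)."""
    return subcluster_label.rsplit("-", 1)[0]


subcluster_counts = Counter(labels)
parent_counts = Counter(parent_of(label) for label in labels)

print(subcluster_counts.most_common(3))
print(parent_counts.most_common(3))
```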
"""
View dataset health
"""
ds.health()
You can view your dashboard at: https://cloud.tryrelevance.com/dataset/basic_subclustering/dashboard/monitor/schema
field | exists | missing |
---|---|---|
_cluster_ | 739 | 0 |
_cluster_.product_image_clip_vector_ | 739 | 0 |
_cluster_.product_image_clip_vector_.kmeans_10 | 739 | 0 |
_cluster_.product_image_clip_vector_.kmeans_10_3 | 739 | 0 |
_cluster_.product_title_clip_vector_ | 739 | 0 |
_cluster_.product_title_clip_vector_.kmeans_10 | 739 | 0 |
insert_date_ | 739 | 0 |
product_image | 739 | 0 |
product_image_clip_vector_ | 739 | 0 |
product_link | 739 | 0 |
product_price | 739 | 0 |
product_title | 739 | 0 |
product_title_clip_vector_ | 739 | 0 |
query | 739 | 0 |
source | 739 | 0 |
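The exists/missing counts can be pictured as a simple tally over documents. This is a minimal sketch for top-level fields only, using made-up documents; `field_health` is a hypothetical helper and is not how `ds.health()` is implemented.

```python
def field_health(docs, fields):
    """Count, per field, how many documents have a non-null value and how many do not."""
    health = {}
    for field in fields:
        exists = sum(1 for d in docs if d.get(field) is not None)
        health[field] = {"exists": exists, "missing": len(docs) - exists}
    return health


# Hypothetical documents; the third has no price and the second has a null price.
docs = [
    {"product_title": "Dress", "product_price": "$40.99"},
    {"product_title": "Jeans", "product_price": None},
    {"product_title": "Leggings"},
]

report = field_health(docs, ["product_title", "product_price"])
print(report)
```

In the table above every field shows 0 missing, which confirms that both clustering runs labelled all 739 documents.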
Looking into our subclusters
Let's build a subcluster lookup to help us further analyze our clusters.
# subclusters = {
#     'parent_cluster_id': {
#         'subcluster_id': [subcluster_docs],
#     },
#     ...
# }
from collections import defaultdict
from pprint import pprint


def build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias):
    # Retrieve the documents again, now including the new subcluster field
    docs = ds.get_all_documents(include_vector=True)
    subclusters = defaultdict(dict)
    # Keep only top-level schema fields that are neither vectors nor insert dates
    doc_fields = [
        k
        for k in ds.schema.keys()
        if "." not in k
        if not any(f in k for f in ["_vector_", "insert_date_"])
    ]
    for d in docs:
        parent_cluster = d["_cluster_"][vector_field][parent_alias]
        subcluster = d["_cluster_"][vector_field][subcluster_alias]
        doc = {k: v for k, v in d.items() if k in doc_fields}
        subclusters[parent_cluster].setdefault(subcluster, []).append(doc)
    return subclusters
subclusters_3 = build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias)
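Once the lookup exists, a quick size summary helps decide which parent clusters are worth inspecting. The sketch below uses a hypothetical toy lookup with the same nested shape that `build_subcluster_lut` returns; `summarize_lut` is an illustrative helper, not part of the SDK.

```python
def summarize_lut(subclusters):
    """Return {parent_cluster: {subcluster: number_of_documents}}, sorted by label."""
    return {
        parent: {sub: len(docs) for sub, docs in sorted(children.items())}
        for parent, children in sorted(subclusters.items())
    }


# Hypothetical lookup with the same shape as the one built above.
toy_lut = {
    "cluster-1": {"cluster-1-0": [{"query": "Levis"}]},
    "cluster-0": {
        "cluster-0-1": [{"query": "yellow dress"}],
        "cluster-0-0": [{"query": "gold dress"}, {"query": "gold dress"}],
    },
}

summary = summarize_lut(toy_lut)
print(summary)
```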
from relevanceai import show_json
from random import sample


def get_subcluster(subclusters, cluster_id, subcluster_ids=None):
    # Default to every subcluster of the parent cluster
    if not subcluster_ids:
        subcluster_ids = list(subclusters[cluster_id].keys())
    return {k: v for k, v in subclusters[cluster_id].items() if k in subcluster_ids}


def sample_subclusters(subclusters, cluster_id, subcluster_id=None, n_docs=10):
    docs = []
    subcluster_ids = (
        [subcluster_id] if subcluster_id else list(subclusters[cluster_id].keys())
    )
    for sid in subcluster_ids:
        # Pass a list so membership is checked per ID rather than per substring
        docs += get_subcluster(subclusters, cluster_id, [sid])[sid]
    # Never request more documents than are available
    n_docs = min(n_docs, len(docs))
    print("==========")
    print(f"Cluster: {cluster_id}")
    print(f"Subclusters: {' '.join(subcluster_ids)}")
    print(f"Displaying {n_docs} of {len(docs)} documents ... ")
    print("==========")
    # display() is available by default in Jupyter/IPython notebooks
    display(
        show_json(
            sample(docs, n_docs),
            image_fields=["product_image"],
            text_fields=["query", "product_title", "product_price"],
        )
    )
# Sampling the parent cluster itself shows a mixture of items
cluster_id = "cluster-0"
sample_subclusters(subclusters_3, cluster_id)
==========
Cluster: cluster-0
Subclusters: cluster-0-0 cluster-0-1 cluster-0-2
Displaying 10 of 89 documents ...
==========
index | product_image | query | product_title | product_price |
---|---|---|---|---|
0 | (image) | nike womens | Nike Women's Luxe Rectangular Sunglasses (As Is Item) | $48.49 |
1 | (image) | Levis | Levi's Men's 514 Grey Twill Soft Washed Slim-straight Jeans | $39.99 |
2 | (image) | gold dress | White Mark Women's Madelyn Mulitcolor Patterned Dress | $40.99 |
3 | (image) | gold dress | Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress | $85.99 - $86.99 |
4 | (image) | Levis | Levi's Women's Black Ink 'Perfect Waist' Straight Leg Jeans | $44.99 |
5 | (image) | workout clothes for women | Marika Women's Heather Grey Leggings | $24.99 - $27.99 |
6 | (image) | yellow dress | White Mark Women's Fit-and-Flare Floral Skater Dress | $38.99 |
7 | (image) | Levis | Levi's Women's Petite Dark Ice Mid-rise Bootcut Jeans | $39.99 |
8 | (image) | gold dress | White Mark Women's Plus 'Venezia' Gold Turquoise Dress | $31.99 |
9 | (image) | yellow dress | White Mark Women's Teal and Yellow Printed Bell-sleeve Dress | $38.99 |
Sub-clustering allows us to drill further down into our clusters to find more well-defined groups:
cluster_id = "cluster-0"
subcluster_id = "cluster-0-0"
print(f"Sampling {subcluster_alias} in {vector_field} ...")
sample_subclusters(subclusters_3, cluster_id, subcluster_id)
Sampling kmeans_10_3 in product_image_clip_vector_ ...
==========
Cluster: cluster-0
Subclusters: cluster-0-0
Displaying 10 of 45 documents ...
==========
index | product_image | query | product_title | product_price |
---|---|---|---|---|
0 | (image) | gold dress | Daniella Collection Women's Black/ Gold Beaded Rhinestone Dress | $329.99 |
1 | (image) | gold dress | A.B.S. by Allen Schwartz Women's Gold Sequined Fitted Cocktail Dress | $84.99 - $226.99 |
2 | (image) | gold dress | Kayla Collection Women's Two-tone Metallic and Black Maxi Dress | $114.99 |
3 | (image) | yellow dress | Amelia Women's Cotton Satin Front-zip Dress | $43.99 - $53.49 |
4 | (image) | gold dress | White Mark Women's Madelyn Mulitcolor Patterned Dress | $40.99 |
5 | (image) | yellow dress | White Mark Women's Fit-and-Flare Floral Skater Dress | $38.99 |
6 | (image) | gold dress | Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress | $85.99 - $86.99 |
7 | (image) | gold dress | R & M Richards Women's Plus Size Fortuny Pleated Metallic 2-piece Dress | $84.99 - $88.99 |
8 | (image) | yellow dress | Von Ronen New York Women's Short Transformer Dress One Size Fits 0-12 | $82.99 - $84.99 |
9 | (image) | gold dress | La Femme Gold Sequined Sweetheart Rhinestone Strapless Formal Dress | $319.99 |
You can then run sub-clustering again on a separate parent alias!
If our initial subclusters are not fine-grained enough, we can run subclustering again with more clusters to drill down even further.
You can keep subclustering for as long as required by referring back to the parent alias each time.
"""
Given the parent field - we now run subclustering
Before, we subclustered on 3 subclusters
Let's dive even deeper to view 5 subclusters
"""
subcluster_n_clusters = 5
subcluster_alias = f"{parent_alias}_{subcluster_n_clusters}"
from sklearn.cluster import KMeans
model = KMeans(n_clusters=subcluster_n_clusters)
ds.subcluster(
model=model,
parent_field=parent_field,
vector_fields=[vector_field],
alias=subcluster_alias,
)
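Both subclustering runs derive their alias and result field from the same parent alias, so the naming can be captured in one small function. This is a sketch of the convention used in this notebook; `subcluster_names` is a hypothetical helper, not an SDK function.

```python
def subcluster_names(vector_field, parent_alias, n_subclusters):
    """Build the subcluster alias and its dotted result field, as done in this notebook."""
    alias = f"{parent_alias}_{n_subclusters}"
    field = f"_cluster_.{vector_field}.{alias}"
    return alias, field


vector_field = "product_image_clip_vector_"
parent_alias = "kmeans_10"

# One entry per subclustering run in this guide (3 and 5 subclusters)
names = {n: subcluster_names(vector_field, parent_alias, n) for n in (3, 5)}
for n, (alias, field) in names.items():
    print(n, alias, field)
```

Giving each run its own alias keeps the earlier results intact, so the 3-subcluster and 5-subcluster fields can be compared side by side.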
Let's sample again with 5 subclusters. Compared with the 3-subcluster run, these results are even more fine-grained.
subclusters_5 = build_subcluster_lut(ds, vector_field, parent_alias, subcluster_alias)
cluster_id = "cluster-0"
subcluster_id = "cluster-0-0"
print(f"Sampling {subcluster_alias} in {vector_field} ...")
sample_subclusters(subclusters_5, cluster_id, subcluster_id)
Sampling kmeans_10_5 in product_image_clip_vector_ ...
==========
Cluster: cluster-0
Subclusters: cluster-0-0
Displaying 10 of 13 documents ...
==========
index | product_image | query | product_title | product_price |
---|---|---|---|---|
0 | (image) | gold dress | Aidan Mattox Gold Cap Sleeve Lace Side Pocket Evening Dress | $469.99 |
1 | (image) | gold dress | Little Mistress Women's Black and Gold Sequin Dress | $111.99 - $124.99 |
2 | (image) | gold dress | Daniella Collection Women's Black/ Gold Beaded Rhinestone Dress | $329.99 |
3 | (image) | gold dress | Ignite Evenings by Carol Lin Women's Sequin Halter Gown | $131.99 |
4 | (image) | gold dress | Halston Heritage Women's Gold Allover Sequined Evening Dress | $154.99 - $172.99 |
5 | (image) | gold dress | Tahari Arthur S. Levine Women's Sequin Animal Jacquard Bust Dress | $85.99 - $86.99 |
6 | (image) | yellow dress | R & M Richards Women's Lurex Draped Jacket and Dress Set | $89.99 |
7 | (image) | gold dress | La Femme Gold Sequined Sweetheart Rhinestone Strapless Formal Dress | $319.99 |
8 | (image) | gold dress | Aidan Mattox Gold Sequin Tulle V-neck Sleeveless Long Evening Dress | $399.99 |
9 | (image) | gold dress | Aidan Mattox Gold Strapless Fairy Tale Empire Waist Bead Evening Dress | $469.99 |
Next Steps
If you need more in-depth knowledge of subclustering, more guides on adapting it to different aliases and models are planned for the near future.
For more details, please refer to the references.