Numeric Summaries#

Describe#

Statistics.describe(return_type='pandas')#

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Example

from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
df.describe()  # returns a pandas DataFrame of stats
df.describe(return_type='dict')  # returns the raw JSON stats

Return type

dict
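
To make the idea of NaN-excluding summaries concrete, the following is a minimal local sketch using only the Python standard library. It is not how `describe` is implemented; the helper name `describe_values` and the chosen set of statistics are assumptions made for illustration.

```python
import math
import statistics


def describe_values(values):
    """Summarize a list of numbers, skipping NaN entries (hypothetical helper)."""
    clean = [v for v in values if not math.isnan(v)]
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": statistics.stdev(clean),  # sample standard deviation
        "min": min(clean),
        "max": max(clean),
    }


# The NaN entry is dropped before any statistic is computed.
summary = describe_values([1.0, 2.0, float("nan"), 3.0])
```

Dropping NaN values first matters because a single NaN would otherwise propagate into the mean and standard deviation.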

Value Counts#

Statistics.value_counts(field)#

Return a Series containing counts of unique values.

Parameters

field (str) – the dataset field whose unique values are counted

Return type

Series

Example

from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
field = "sample_field"
value_counts_df = df.value_counts(field)
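
As a rough picture of what counting unique values means, here is a small standard-library sketch over a list of documents. It only illustrates the concept locally and does not reflect the library's internals; `count_values` and the sample documents are hypothetical.

```python
from collections import Counter

# Hypothetical documents; the last one lacks "sample_field" and is skipped.
documents = [
    {"sample_field": "a"},
    {"sample_field": "b"},
    {"sample_field": "a"},
    {"other_field": 1},
]


def count_values(docs, field):
    """Count how often each value of `field` occurs across `docs`."""
    return Counter(doc[field] for doc in docs if field in doc)


counts = count_values(documents, "sample_field")
most_common = counts.most_common()  # (value, count) pairs, most frequent first
```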

Correlation#

Statistics.corr(X, Y, vector_field, alias, groupby=None, fontsize=16, show_plot=True)#

Note

This function was introduced in 1.2.2.

Aggregate#

Statistics.aggregate(groupby=None, metrics=None, filters=None, page_size=20, page=1, asc=False, aggregation_query=None, sort=None)#

Aggregate or group a collection using an aggregation query. The aggregation query is a JSON body that follows this schema:

Example

{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}

For example, the following aggregation query groups scores by region and player name:

{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
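
One way to assemble such a query in Python is sketched below. The helper `agg_spec` is hypothetical, and the commented-out call assumes that the `aggregation_query` parameter from the signature above accepts the whole dictionary; that call needs a live dataset and is not run here.

```python
import json


def agg_spec(name, field, agg):
    """Build one groupby or metric entry (hypothetical helper)."""
    return {"name": name, "field": field, "agg": agg}


aggregation_query = {
    "groupby": [
        agg_spec("region", "player_region", "category"),
        agg_spec("player_name", "name", "category"),
    ],
    "metrics": [
        agg_spec("average_score", "final_score", "avg"),
        agg_spec("max_score", "final_score", "max"),
    ],
}

# Serializing confirms the query is a valid JSON body.
payload = json.dumps(aggregation_query)

# Assumed usage against a live dataset (not executed in this sketch):
# results = df.aggregate(aggregation_query=aggregation_query)
```

Building entries through a single helper keeps the three keys consistent and avoids the mixed quoting and missing commas that are easy to introduce when writing the body by hand.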

Facets#

Statistics.facets(fields=[], date_interval='monthly', page_size=5, page=1, asc=False)#

Get a summary of the given fields, such as their most common values and their min/max.

Example

from relevanceai import Client
from relevanceai.datasets import mock_documents
client = Client()
documents = mock_documents(100)  # generate 100 mock documents
ds = client.Dataset("mock_documents")
ds.upsert_documents(documents)
ds.facets(["sample_1_value"])

Health#

Statistics.health(output_format='dataframe')#

Gives you a summary of the health of your vectors, e.g. how many documents are missing vectors and how many documents contain zero vectors.

Parameters

output_format (str) – The format of the output. Must be either “dataframe” or “json”.

Example

from relevanceai import Client
client = Client()
df = client.Dataset("sample_dataset_id")
df.health()  # returns a DataFrame by default
df.health(output_format="json")  # returns a dict

Return type

Union[DataFrame, dict]
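
The two checks mentioned above (missing vectors and zero vectors) can be pictured with a small local sketch. This is an illustration under assumptions, not the library's implementation: the function `vector_health`, the field name `sample_vector_`, and the output keys are all hypothetical.

```python
def vector_health(docs, vector_field):
    """Count documents that lack `vector_field` or store an all-zero vector."""
    missing = sum(1 for d in docs if vector_field not in d)
    zero = sum(
        1
        for d in docs
        if vector_field in d and all(x == 0 for x in d[vector_field])
    )
    return {"total": len(docs), "missing": missing, "zero_vectors": zero}


# Hypothetical documents: one healthy, one all-zero, one without a vector.
documents = [
    {"_id": "1", "sample_vector_": [0.1, 0.2, 0.3]},
    {"_id": "2", "sample_vector_": [0.0, 0.0, 0.0]},
    {"_id": "3"},
]

report = vector_health(documents, "sample_vector_")
```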