relevanceai.dataset.read.statistics.statistics#

Statistics API

Module Contents#

class relevanceai.dataset.read.statistics.statistics.Statistics(credentials: relevanceai.client.helpers.Credentials, dataset_id: str, **kwargs)#

Batch API client

value_counts(self, field: str)#

Return a Series containing counts of unique values.

Parameters

field (str) – The dataset field to perform value counts on

Return type

Series

Example

from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
field = "sample_field"
value_counts_df = df.value_counts(field)

describe(self, return_type='pandas') → dict#

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Example

from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
field = "sample_field"
df.describe() # returns pandas dataframe of stats
df.describe(return_type='dict') # returns raw JSON stats

corr(self, X: str, Y: str, vector_field: str, alias: str, groupby: Optional[str] = None, fontsize: int = 16, show_plot: bool = True)#

Returns the Pearson correlation between two fields.

Parameters
  • X (str) – A dataset field

  • Y (str) – The other dataset field

  • vector_field (str) – The vector field over which the clustering has been performed

  • alias (str) – The alias of the clustering

  • groupby (Optional[str]) – A field to group the correlations over

  • fontsize (int) – The font size of the values in the image
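
Example

A minimal sketch of a call; the dataset id, field names, vector field, and clustering alias below are placeholders and assume that clustering has already been run on the dataset.

from relevanceai import Client
client = Client()
df = client.Dataset("sample_dataset_id")
# Placeholder field and alias names; assumes a clustering run already exists
# on "sample_vector_" under the alias "kmeans-8"
df.corr(
    X="sample_numeric_field_1",
    Y="sample_numeric_field_2",
    vector_field="sample_vector_",
    alias="kmeans-8",
)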

health(self, output_format='dataframe') → Union[pandas.DataFrame, dict]#

Gives you a summary of the health of your vectors, e.g. how many documents are missing vectors and how many documents have zero vectors.

Parameters

output_format (str) – The format of the output. Must either be “dataframe” or “json”.

Example

from relevanceai import Client
client = Client()
df = client.Dataset("sample_dataset_id")
df.health()

aggregate(self, groupby: Optional[list] = None, metrics: Optional[list] = None, filters: Optional[list] = None, page_size: int = 20, page: int = 1, asc: bool = False, aggregation_query: Optional[dict] = None, sort: list = None)#

Aggregation/Groupby of a collection using an aggregation query. The aggregation query is a JSON body that follows this schema:

Example

{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"}
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}
For example, one can use the following aggregation query to group scores by region and player name.
{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
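
The same grouping can also be run directly from a Dataset by passing the groupby and metrics lists to aggregate. A minimal sketch, assuming the player fields above exist in a dataset called "sample_dataset_id":

from relevanceai import Client
client = Client()
df = client.Dataset("sample_dataset_id")
# Placeholder dataset and field names; the groupby/metrics dictionaries
# mirror the aggregation query shown above
results = df.aggregate(
    groupby=[{"name": "region", "field": "player_region", "agg": "category"}],
    metrics=[{"name": "average_score", "field": "final_score", "agg": "avg"}],
)
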
facets(self, fields: Optional[list] = [], date_interval: str = 'monthly', page_size: int = 5, page: int = 1, asc: bool = False)#

Get a summary of fields, such as their most common values, min/max, etc.

Example

from relevanceai import Client
client = Client()
from relevanceai.datasets import mock_documents
documents = mock_documents(100)
ds = client.Dataset("mock_documents")
ds.upsert_documents(documents)
ds.facets(["sample_1_value"])

abstract health_check(self, **kwargs)#