relevanceai.dataset.read.statistics.statistics
Statistics API
Module Contents
- class relevanceai.dataset.read.statistics.statistics.Statistics(credentials: relevanceai.client.helpers.Credentials, dataset_id: str, **kwargs)
Batch API client
- value_counts(self, field: str)
Return a Series containing counts of unique values.
- Parameters
field (str) – The dataset field on which to compute value counts
- Return type
Example
from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
field = "sample_field"
value_counts_df = df.value_counts(field)
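To make the returned counts concrete, here is a minimal local sketch of what a value count computes, using only the Python standard library. It does not call the Relevance AI API. The `documents` list and the `local_value_counts` helper are hypothetical stand-ins, and skipping documents that lack the field is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical documents standing in for the rows of a dataset.
documents = [
    {"sample_field": "a"},
    {"sample_field": "b"},
    {"sample_field": "a"},
    {"other_field": 1},  # no "sample_field": skipped in this sketch
]


def local_value_counts(docs, field):
    """Count each unique value of `field`, most frequent first."""
    counter = Counter(doc[field] for doc in docs if field in doc)
    return counter.most_common()


counts = local_value_counts(documents, "sample_field")
print(counts)  # [('a', 2), ('b', 1)]
```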
- describe(self, return_type='pandas') → dict
Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
Example
from relevanceai import Client
client = Client()
dataset_id = "sample_dataset_id"
df = client.Dataset(dataset_id)
field = "sample_field"
df.describe() # returns pandas dataframe of stats
df.describe(return_type='dict') # return raw json stats
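As a rough illustration of "excluding NaN values", the sketch below summarizes a single numeric field locally with the standard library. The `values` list and the `local_describe` helper are hypothetical. The statistics that `describe` actually returns may differ from the five shown here.

```python
import math
import statistics

# Hypothetical values of one numeric field; NaN marks a missing entry.
values = [1.0, 2.0, float("nan"), 3.0, 4.0]


def local_describe(xs):
    """Summarize a list of numbers, ignoring NaN entries."""
    clean = [x for x in xs if not math.isnan(x)]
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": statistics.stdev(clean),  # sample standard deviation
        "min": min(clean),
        "max": max(clean),
    }


summary = local_describe(values)
print(summary)
```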
- corr(self, X: str, Y: str, vector_field: str, alias: str, groupby: Optional[str] = None, fontsize: int = 16, show_plot: bool = True)
Returns the Pearson correlation between two fields.
- Parameters
X (str) – A dataset field
Y (str) – The other dataset field
vector_field (str) – The vector field over which the clustering has been performed
alias (str) – The alias of the clustering
groupby (Optional[str]) – A field to group the correlations over
fontsize (int) – The font size of the values in the image
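For reference, the Pearson correlation of two equal-length numeric sequences can be computed as their covariance divided by the product of their standard deviations. The sketch below is a plain-Python illustration of that formula, not the library's implementation. The `pearson` helper and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values of two dataset fields.
x_values = [1, 2, 3, 4]
y_values = [2, 4, 6, 8]
print(pearson(x_values, y_values))  # perfectly linear, so close to 1.0
```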
- health(self, output_format='dataframe') → Union[pandas.DataFrame, dict]
Gives you a summary of the health of your vectors, e.g. how many documents are missing vectors and how many documents contain zero vectors.
- Parameters
output_format (str) – The format of the output. Must either be “dataframe” or “json”.
Example
from relevanceai import Client
client = Client()
df = client.Dataset("sample_dataset_id")
df.health()
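The two checks named above, missing vectors and zero vectors, can be illustrated locally. The following is a minimal sketch over hypothetical in-memory documents. The `text_vector_` field name, the `local_vector_health` helper, and the keys of the returned report are assumptions, not the format that `health` returns.

```python
# Hypothetical documents; "text_vector_" is an assumed vector field name.
documents = [
    {"_id": "1", "text_vector_": [0.1, 0.2, 0.3]},
    {"_id": "2", "text_vector_": [0.0, 0.0, 0.0]},  # zero vector
    {"_id": "3"},  # vector field missing
]


def local_vector_health(docs, vector_field):
    """Count documents whose vector is missing or consists only of zeros."""
    missing = sum(1 for doc in docs if vector_field not in doc)
    zero = sum(
        1
        for doc in docs
        # `not any(...)` is True when every component is 0 (or the list is empty).
        if vector_field in doc and not any(doc[vector_field])
    )
    return {"total": len(docs), "missing": missing, "zero_vectors": zero}


report = local_vector_health(documents, "text_vector_")
print(report)  # {'total': 3, 'missing': 1, 'zero_vectors': 1}
```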
- aggregate(self, groupby: Optional[list] = None, metrics: Optional[list] = None, filters: Optional[list] = None, page_size: int = 20, page: int = 1, asc: bool = False, aggregation_query: Optional[dict] = None, sort: list = None)
Aggregates or groups a collection using an aggregation query. The aggregation query is a JSON body with the following schema:
Example
{
    "groupby" : [
        {"name": <alias>, "field": <field in the collection>, "agg": "category"},
        {"name": <alias>, "field": <another groupby field in the collection>, "agg": "numeric"}
    ],
    "metrics" : [
        {"name": <alias>, "field": <numeric field in the collection>, "agg": "avg"},
        {"name": <alias>, "field": <another numeric field in the collection>, "agg": "max"}
    ]
}
For example, one can use the following aggregation to group scores by region and player name.
{
    "groupby" : [
        {"name": "region", "field": "player_region", "agg": "category"},
        {"name": "player_name", "field": "name", "agg": "category"}
    ],
    "metrics" : [
        {"name": "average_score", "field": "final_score", "agg": "avg"},
        {"name": "max_score", "field": "final_score", "agg": "max"},
        {"name": "total_score", "field": "final_score", "agg": "sum"},
        {"name": "average_deaths", "field": "final_deaths", "agg": "avg"},
        {"name": "highest_deaths", "field": "final_deaths", "agg": "max"}
    ]
}
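To show what a groupby/metrics pair of this shape expresses, the sketch below applies one to a few hypothetical in-memory records using plain Python. It only emulates the idea locally. The `records` data, the `local_aggregate` helper, and the shape of the result rows are assumptions and do not reflect the API's response format.

```python
from collections import defaultdict

groupby = [{"name": "region", "field": "player_region", "agg": "category"}]
metrics = [
    {"name": "average_score", "field": "final_score", "agg": "avg"},
    {"name": "max_score", "field": "final_score", "agg": "max"},
]

# Hypothetical records standing in for documents in a collection.
records = [
    {"player_region": "eu", "final_score": 10},
    {"player_region": "eu", "final_score": 20},
    {"player_region": "na", "final_score": 30},
]

# Local equivalents of the "agg" names used in this sketch.
AGG_FUNCS = {"avg": lambda v: sum(v) / len(v), "max": max, "sum": sum}


def local_aggregate(docs, groupby, metrics):
    """Group docs by the groupby fields and compute each metric per group."""
    buckets = defaultdict(list)
    for doc in docs:
        key = tuple(doc[g["field"]] for g in groupby)
        buckets[key].append(doc)
    rows = []
    for key, group in sorted(buckets.items()):
        row = {g["name"]: value for g, value in zip(groupby, key)}
        for m in metrics:
            field_values = [doc[m["field"]] for doc in group]
            row[m["name"]] = AGG_FUNCS[m["agg"]](field_values)
        rows.append(row)
    return rows


results = local_aggregate(records, groupby, metrics)
for row in results:
    print(row)
```

Per the signature above, lists shaped like `groupby` and `metrics` are what the `groupby` and `metrics` parameters of `aggregate` accept.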
- facets(self, fields: Optional[list] = [], date_interval: str = 'monthly', page_size: int = 5, page: int = 1, asc: bool = False)
Get a summary of fields, such as their most common values and their min/max.
Example
from relevanceai import Client
from relevanceai.datasets import mock_documents
client = Client()
documents = mock_documents(100)
ds = client.Dataset("mock_documents")
ds.upsert_documents(documents)
ds.facets(["sample_1_value"])
- abstract health_check(self, **kwargs)