GroupedData

A set of methods for aggregations on a DataFrame, created by DataFrame#groupBy.

The main method is GroupedData#agg. For convenience, this class also provides some first-order statistics, such as mean and sum.

Constructor

new GroupedData()

Note: Do not call this constructor directly; obtain a GroupedData instance from DataFrame#groupBy.

Since:
  • 1.3.0

Methods

agg(expr)

Compute aggregates by specifying a series of aggregate columns. By default, the output retains the grouping columns. To drop them, set spark.sql.retainGroupColumns to false.

The available aggregate functions are defined in Functions.

Parameters:
Name Type Description
expr

Aggregate column expressions to compute for each group, such as F.max("age").

Since:
  • 1.3.0
Example
// Select the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(F.max("age"), F.sum("expense"));
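
To make the shape of the result concrete, the following plain JavaScript sketch performs the same kind of computation on an in-memory array instead of a DataFrame. It does not use this library: the sample `rows`, the `groupAggregate` helper, and the output field names `maxAge` and `totalExpense` are all hypothetical and only illustrate that each group yields one output row that keeps the grouping column.

```javascript
// Hypothetical sample data standing in for the rows of a DataFrame.
const rows = [
  { department: "eng", age: 31, expense: 120 },
  { department: "eng", age: 45, expense: 80 },
  { department: "ops", age: 28, expense: 50 },
];

// Illustrative helper (not part of this library): collect rows per value
// of `key`, then apply each named aggregate function to every group.
function groupAggregate(data, key, aggregates) {
  const groups = new Map();
  for (const row of data) {
    if (!groups.has(row[key])) groups.set(row[key], []);
    groups.get(row[key]).push(row);
  }
  const result = [];
  for (const [groupValue, members] of groups) {
    const out = { [key]: groupValue }; // the grouping column is retained
    for (const [name, fn] of Object.entries(aggregates)) {
      out[name] = fn(members);
    }
    result.push(out);
  }
  return result;
}

// Oldest employee and total expense per department.
const summary = groupAggregate(rows, "department", {
  maxAge: (group) => Math.max(...group.map((r) => r.age)),
  totalExpense: (group) => group.reduce((acc, r) => acc + r.expense, 0),
});
console.log(summary);
```

Each entry of `summary` corresponds to one department, mirroring the one-row-per-group result that agg is described as producing.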

avg(colNames)

Compute the average value of each numeric column for each group. The resulting DataFrame also contains the grouping columns. When columns are specified, the average is computed only for those columns.

Parameters:
Name Type Description
colNames

Array of columns to compute mean over.

Since:
  • 1.3.0
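Example

The difference between averaging every numeric column and averaging only the specified ones can be sketched in plain JavaScript on an in-memory array. This is an illustration of the described behavior, not this library's implementation; the sample `rows` and the `groupAvg` helper are hypothetical.

```javascript
// Hypothetical rows; `name` is non-numeric, so it is skipped by default.
const rows = [
  { department: "eng", name: "Ana", age: 30, expense: 100 },
  { department: "eng", name: "Bo", age: 40, expense: 50 },
  { department: "ops", name: "Cy", age: 20, expense: 10 },
];

// Illustrative helper (not part of this library): per-group average of
// `colNames`, or of every numeric column when `colNames` is empty.
function groupAvg(data, key, colNames = []) {
  const numericCols = Object.keys(data[0] || {}).filter(
    (c) => c !== key && typeof data[0][c] === "number"
  );
  const cols = colNames.length > 0 ? colNames : numericCols;
  const acc = new Map(); // group value -> running count and per-column sums
  for (const row of data) {
    if (!acc.has(row[key])) acc.set(row[key], { count: 0, sums: {} });
    const entry = acc.get(row[key]);
    entry.count += 1;
    for (const c of cols) entry.sums[c] = (entry.sums[c] || 0) + row[c];
  }
  return [...acc].map(([groupValue, { count, sums }]) => {
    const out = { [key]: groupValue }; // the grouping column is retained
    for (const c of cols) out[c] = sums[c] / count;
    return out;
  });
}

const allAverages = groupAvg(rows, "department");     // age and expense
const ageOnly = groupAvg(rows, "department", ["age"]); // age only
console.log(allAverages, ageOnly);
```

In this sketch the numeric columns are inferred from the first row, which is a simplification; a DataFrame would take them from its schema.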

count()

Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.

Since:
  • 1.3.0
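Example

As a rough illustration of the result, here is a plain JavaScript sketch that counts rows per group on an in-memory array. It does not use this library; the sample `rows`, the `groupCount` helper, and the `count` field name are assumptions made for the example.

```javascript
// Hypothetical rows standing in for a DataFrame.
const rows = [
  { department: "eng" },
  { department: "ops" },
  { department: "eng" },
];

// Illustrative helper (not part of this library): count the rows that
// share each distinct value of `key`.
function groupCount(data, key) {
  const counts = new Map();
  for (const row of data) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  // One output row per group, keeping the grouping column.
  return [...counts].map(([groupValue, count]) => ({ [key]: groupValue, count }));
}

const counts = groupCount(rows, "department");
console.log(counts);
```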

max(colNames)

Compute the max value of each numeric column for each group. The resulting DataFrame also contains the grouping columns. When columns are specified, the max is computed only for those columns.

Parameters:
Name Type Description
colNames

Array of columns to compute max over.

Since:
  • 1.3.0
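Example

A per-group maximum can be pictured as folding each group's values with a two-argument combiner. The plain JavaScript sketch below shows this on an in-memory array; it is not this library's implementation, and the sample `rows` and the `groupReduce` helper are hypothetical.

```javascript
// Hypothetical rows standing in for a DataFrame.
const rows = [
  { department: "eng", age: 30, expense: 100 },
  { department: "eng", age: 40, expense: 50 },
  { department: "ops", age: 20, expense: 10 },
];

// Illustrative helper (not part of this library): within each group,
// combine the values of every column in `colNames` using `combine`.
function groupReduce(data, key, colNames, combine) {
  const acc = new Map();
  for (const row of data) {
    const current = acc.get(row[key]);
    if (current === undefined) {
      // The first row of a group seeds its accumulator.
      const seed = { [key]: row[key] };
      for (const c of colNames) seed[c] = row[c];
      acc.set(row[key], seed);
    } else {
      for (const c of colNames) current[c] = combine(current[c], row[c]);
    }
  }
  return [...acc.values()];
}

// Wrap Math.max so that exactly two values are compared at a time.
const maxima = groupReduce(rows, "department", ["age", "expense"], (a, b) =>
  Math.max(a, b)
);
console.log(maxima);
```

Passing `(a, b) => Math.min(a, b)` or `(a, b) => a + b` as the combiner gives the analogous sketches for min and sum.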

mean(colNames)

Alias for GroupedData#avg.

Parameters:
Name Type Description
colNames

Array of columns to compute mean over.

Since:
  • 1.3.0

min(colNames)

Compute the min value of each numeric column for each group. The resulting DataFrame also contains the grouping columns. When columns are specified, the min is computed only for those columns.

Parameters:
Name Type Description
colNames

Array of columns to compute min over.

Since:
  • 1.3.0

sum(colNames)

Compute the sum of each numeric column for each group. The resulting DataFrame also contains the grouping columns. When columns are specified, the sum is computed only for those columns.

Parameters:
Name Type Description
colNames

Array of columns to compute sum over.

Since:
  • 1.3.0