DataFrame

A distributed collection of data organized into named columns.

DataFrames can be created using various functions in SQLContext. They can be manipulated using the domain-specific language (DSL) functions defined in DataFrame (this class), Column, and Functions.

Constructor

new DataFrame()

Note: Do not call this constructor directly; create DataFrames through the functions in SQLContext instead.

Since:
  • 1.3.0
Examples

To select a column from the data frame, use the col method.

  var ageCol = people.col("age");

Note that the Column type can also be manipulated through its various functions.

  // The following creates a new column that increases everybody's age by 10.
  people.col("age").plus(10);

A more complete example.

  var people = sqlContext.read().json("...");
  var department = sqlContext.read().json("...");

  people.filter("age > 30")
    .join(department, people.col("deptId").eq(department.col("id")))
    .groupBy(department.col("name"), people.col("gender"))
    .agg(F.avg(people.col("salary")), F.max(people.col("age")));

Methods

agg(cols)

Aggregates on the entire DataFrame without groups.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.3.0
Example
  // df.agg(...) is a shorthand for df.groupBy().agg(...)
  df.agg(F.max(df.col("age")), F.avg(df.col("salary")));
  df.groupBy().agg(F.max(df.col("age")), F.avg(df.col("salary")));

col(colName)

Selects a column based on the column name and returns it as a Column. Note that the column name can also reference a nested column like a.b.

Parameters:
  • colName: Name of the column to select.

Since:
  • 1.3.0

collect(cb)

Returns an array that contains all Rows in this DataFrame.

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0
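The error-first contract means the callback's first argument is an Error (or null on success) and the second is the result. A minimal plain-JavaScript sketch of that shape; `fakeCollect` is a hypothetical stand-in, not part of this API, and is synchronous here only so the example is self-contained (the real collect is asynchronous):

```javascript
// Hypothetical stand-in illustrating the error-first callback shape of collect(cb).
function fakeCollect(rows, cb) {
  if (!Array.isArray(rows)) return cb(new Error("no data"));
  cb(null, rows); // success: the error slot is null, the result comes second
}

var collected;
fakeCollect([{age: 30}, {age: 42}], function (err, rows) {
  if (err) throw err;
  collected = rows;
});
```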

collectSync()

The synchronous version of DataFrame#collect.

Since:
  • 1.3.0

columns(cb)

Returns all column names as an array.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

columnsSync()

The synchronous version of DataFrame#columns.

Since:
  • 1.3.0

count(cb)

Returns the number of rows in the DataFrame.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

countSync()

The synchronous version of DataFrame#count.

Since:
  • 1.3.0

cube(cols)

Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.4.0
Example
  // Compute the average for all numeric columns cubed by department and group.
  df.cube("department", "group").avg();

describe(colNames)

Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this computes statistics for all numerical columns.

This is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg method instead.

Parameters:
  • colNames: Array of column names.

Since:
  • 1.3.1
Example
  df.describe("age", "height").show();

  // output:
  // summary age   height
  // count   10.0  10.0
  // mean    53.3  178.05
  // stddev  11.6  15.7
  // min     18.0  163.0
  // max     92.0  192.0
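The statistics describe() reports can be sketched for a single numeric column in plain JavaScript. This is a hand-rolled illustration, not the Spark implementation, and it assumes the sample standard deviation (dividing by n - 1):

```javascript
// Plain-JS sketch of count/mean/stddev/min/max for one numeric column.
function describeColumn(values) {
  var n = values.length;
  var mean = values.reduce(function (s, v) { return s + v; }, 0) / n;
  var variance = values.reduce(function (s, v) {
    return s + (v - mean) * (v - mean);
  }, 0) / (n - 1); // sample variance
  return {
    count: n,
    mean: mean,
    stddev: Math.sqrt(variance),
    min: Math.min.apply(null, values),
    max: Math.max.apply(null, values)
  };
}

var stats = describeColumn([18, 35, 53, 71, 92]);
```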

distinct()

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

Since:
  • 1.3.0

drop(col)

Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.

Parameters:
  • col: Column to drop.

Since:
  • 1.4.0

dropDuplicates(colNames)

Returns a new DataFrame that contains only the unique rows from this DataFrame. When no column names are given, this is an alias for distinct; if column names are passed in, rows are compared only on those columns.

Parameters:
  • colNames: Array of column names.

Since:
  • 1.4.0
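The "compare only on those columns" behavior can be sketched over plain row arrays. The `dropDuplicates` helper below is hypothetical, not the Spark API; it keeps the first row seen for each combination of the chosen columns:

```javascript
// Plain-JS sketch of dropDuplicates semantics keyed on selected columns.
function dropDuplicates(rows, colNames) {
  var seen = {};
  return rows.filter(function (row) {
    var key = JSON.stringify(colNames.map(function (c) { return row[c]; }));
    if (seen[key]) return false; // duplicate on the chosen columns
    seen[key] = true;
    return true;
  });
}

var rows = [
  {name: "Ann", age: 30},
  {name: "Ann", age: 31},
  {name: "Bob", age: 30}
];
var distinctNames = dropDuplicates(rows, ["name"]);
```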

except(other)

Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0

explain([extended])

Prints the plans (logical and physical) to the console for debugging purposes.

Parameters:
  • extended (Boolean, optional, default false): If true, prints both the logical and physical plans; otherwise only the physical plan.

Since:
  • 1.3.0

filter(condition)

Filters rows using the given column expression or SQL expression.

Parameters:
  • condition: A Column of booleans or a string containing a SQL expression.

Since:
  • 1.3.0
Example

The following are equivalent:

  peopleDf.filter(peopleDf.col("age").gt(15));
  peopleDf.filter("age > 15");

groupBy(cols)

Groups the DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions to group by.

Since:
  • 1.3.0
Example

Compute the average for all numeric columns grouped by department.

  df.groupBy("department").avg();

head([n], cb)

Returns the first n rows.

Running head requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Parameters:
  • n (Number, optional, default 1): Number of rows to return.
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

headSync([n])

The synchronous version of DataFrame#head.

Parameters:
  • n (Number, optional, default 1): Number of rows to return.

Since:
  • 1.3.0

intersect(other)

Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0
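The EXCEPT and INTERSECT row semantics described above can be sketched over plain row arrays. The helpers below are hypothetical illustrations, not the Spark API:

```javascript
// Plain-JS sketch: EXCEPT keeps rows of `a` absent from `b`;
// INTERSECT keeps rows of `a` present in `b`.
function rowKey(r) { return JSON.stringify(r); }

function except(a, b) {
  var inB = {};
  b.forEach(function (r) { inB[rowKey(r)] = true; });
  return a.filter(function (r) { return !inB[rowKey(r)]; });
}

function intersect(a, b) {
  var inB = {};
  b.forEach(function (r) { inB[rowKey(r)] = true; });
  return a.filter(function (r) { return inB[rowKey(r)]; });
}

var left = [{id: 1}, {id: 2}, {id: 3}];
var right = [{id: 2}];
var onlyLeft = except(left, right);
var both = intersect(left, right);
```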

isLocal()

Returns true if the collect and take methods can be run locally (without any Spark executors).

Since:
  • 1.3.0

join(right, [col], [joinType])

Join with another DataFrame.

If no col is provided, does a Cartesian join. (Note that Cartesian joins are very expensive without an extra filter that can be pushed down.)

If a column name (string) is provided, does an equi-join.

If a column expression is provided, uses that as a join expression.

Parameters:
  • right: Right side of the join.
  • col (optional, default null): Column name or join expression.
  • joinType (optional, default "inner"): One of: inner, outer, left_outer, right_outer, leftsemi.

Since:
  • 1.3.0
Example

Perform a full outer join between df1 and df2:

  df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
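How two of the join types differ can be shown over plain row arrays: inner keeps only matched rows, while left_outer also keeps unmatched left rows with a null right side. The `join` helper below is a hypothetical sketch covering just those two types, not the Spark API:

```javascript
// Plain-JS sketch of inner vs left_outer equi-join semantics.
function join(left, right, key, joinType) {
  var out = [];
  left.forEach(function (l) {
    var matches = right.filter(function (r) { return r[key] === l[key]; });
    if (matches.length === 0 && joinType === "left_outer") {
      out.push({left: l, right: null}); // unmatched left row survives
    }
    matches.forEach(function (r) { out.push({left: l, right: r}); });
  });
  return out;
}

var people = [{deptId: 1, name: "Ann"}, {deptId: 9, name: "Bob"}];
var depts = [{deptId: 1, dept: "Eng"}];
var inner = join(people, depts, "deptId", "inner");
var leftOuter = join(people, depts, "deptId", "left_outer");
```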

limit(n)

Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.

Parameters:
  • n: Number of rows.

Since:
  • 1.3.0

printSchema()

Prints the schema to the console in a nice tree format.

This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).

Since:
  • 1.3.0

randomSplit(weights)

Randomly splits this DataFrame with the provided weights.

Parameters:
  • weights: Weights for splits; will be normalized if they don't sum to 1.

Since:
  • 1.4.0
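The weight normalization mentioned above can be sketched in plain JavaScript; `normalizeWeights` is a hypothetical helper, not the Spark implementation:

```javascript
// Plain-JS sketch: scale the weights so they sum to 1.
function normalizeWeights(weights) {
  var total = weights.reduce(function (s, w) { return s + w; }, 0);
  return weights.map(function (w) { return w / total; });
}

var normalized = normalizeWeights([2, 3, 5]);
```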

registerTempTable(tableName)

Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.

Parameters:
  • tableName: Table name.

Since:
  • 1.3.0

repartition([numPartitions], partitionExprs)

Returns a new DataFrame with the requested partitioning.

If partition expressions are provided, partition by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned. (This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).)

Parameters:
  • numPartitions (Number, optional): Number of partitions.
  • partitionExprs: Partitioning expressions.

Since:
  • 1.3.0

rollup(cols)

Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.4.0
Example
  // Compute the average for all numeric columns rolled up by department and group.
  df.rollup(df.col("department"), df.col("group")).avg();
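The difference between rollup and cube is which grouping sets they aggregate over: for columns [a, b], rollup uses the prefixes {a, b}, {a}, {}, while cube uses every subset of the columns. A plain-JavaScript sketch of the grouping sets each produces (hypothetical helpers, not the Spark API):

```javascript
// rollup: hierarchical prefixes of the column list.
function rollupSets(cols) {
  var sets = [];
  for (var i = cols.length; i >= 0; i--) sets.push(cols.slice(0, i));
  return sets;
}

// cube: every subset of the column list, via a bitmask.
function cubeSets(cols) {
  var sets = [];
  for (var mask = 0; mask < (1 << cols.length); mask++) {
    sets.push(cols.filter(function (_, i) { return (mask >> i) & 1; }));
  }
  return sets;
}

var r = rollupSets(["department", "group"]); // 3 prefixes
var c = cubeSets(["department", "group"]);   // 4 subsets
```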

sample(withReplacement, fraction)

Returns a new DataFrame by sampling a fraction of rows, using a random seed.

Parameters:
  • withReplacement: Whether to sample with replacement.
  • fraction: Fraction of rows to generate.

Since:
  • 1.3.0

select(cols)

Selects a set of column-based expressions.

Parameters:
  • cols: Array of column names or expressions. If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.

Since:
  • 1.3.0

selectExpr(exprs)

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

Parameters:
  • exprs: Array of SQL expressions.

Since:
  • 1.3.0

show([numRows], [truncate])

Displays the DataFrame in tabular form. Strings longer than 20 characters are truncated, and all cells are right-aligned.

This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).

Parameters:
  • numRows (Number, optional, default 20): Number of rows to show.
  • truncate (Boolean, optional, default true): If true, strings longer than 20 characters are truncated and all cells are right-aligned.

Since:
  • 1.3.0

sort(col, cols)

Returns a new DataFrame sorted by the specified columns, all in ascending order.

Parameters:
  • col: Column or column name to sort by.
  • cols: Array of additional column names or expressions to sort by.

Since:
  • 1.3.0

toDF(colNames)

Returns a new DataFrame with the specified column names.

Parameters:
  • colNames: Array of new column names.

Since:
  • 1.3.0
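The positional renaming toDF performs can be sketched over plain row arrays; the `toDF` helper below is a hypothetical illustration, not the Spark API:

```javascript
// Plain-JS sketch: rename columns by position, old name i -> new name i.
function toDF(rows, oldNames, newNames) {
  return rows.map(function (row) {
    var renamed = {};
    oldNames.forEach(function (name, i) { renamed[newNames[i]] = row[name]; });
    return renamed;
  });
}

var renamed = toDF([{c0: "Ann", c1: 30}], ["c0", "c1"], ["name", "age"]);
```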

unionAll(other)

Returns a new DataFrame containing the union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0

where(condition)

Filters rows using the given condition. This is an alias for filter.

Parameters:
  • condition: A Column of booleans or a string containing a SQL expression.

Since:
  • 1.3.0

withColumn(colName, col)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

Parameters:
  • colName: Name of the new column.
  • col: Column expression for the new column.

Since:
  • 1.3.0
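The add-or-replace behavior of withColumn can be sketched over plain row objects; `withColumn` below is a hypothetical helper, not the Spark API, and takes a plain function where the real method takes a Column expression:

```javascript
// Plain-JS sketch: copy each row and set (add or replace) one column.
function withColumn(rows, colName, fn) {
  return rows.map(function (row) {
    var copy = {};
    Object.keys(row).forEach(function (k) { copy[k] = row[k]; });
    copy[colName] = fn(row); // new name adds; existing name replaces
    return copy;
  });
}

var rows = [{age: 30}];
var older = withColumn(rows, "agePlus10", function (r) { return r.age + 10; });
```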

write()

Interface for saving the content of the DataFrame out into external storage.

Since:
  • 1.4.0