DataFrame

A distributed collection of data organized into named columns.

DataFrames can be created using various functions in SQLContext. They can be manipulated using the domain-specific language (DSL) functions defined in DataFrame (this class), Column, and Functions.

Constructor

new DataFrame()

Note: Do not call this constructor directly; create DataFrames through the functions in SQLContext instead.

Since:
  • 1.3.0
Examples

To select a column from the data frame, use the col method.

  var ageCol = people.col("age");

Note that the Column type can also be manipulated through its various functions.

  // The following creates a new column that increases everybody's age by 10.
  people.col("age").plus(10);

A more complete example.

  var people = sqlContext.read().json("...");
  var department = sqlContext.read().json("...");

  people.filter("age > 30")
    .join(department, people.col("deptId").eq(department.col("id")))
    .groupBy(department.col("name"), people.col("gender"))
    .agg(F.avg(people.col("salary")), F.max(people.col("age")));

Methods

agg(cols)

Aggregates on the entire DataFrame without groups.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.3.0
Example
  // df.agg(...) is a shorthand for df.groupBy().agg(...)
  df.agg(F.max(df.col("age")), F.avg(df.col("salary")));
  df.groupBy().agg(F.max(df.col("age")), F.avg(df.col("salary")));

col(colName)

Selects a column based on the column name and returns it as a Column. Note that the column name can also reference a nested column like a.b.

Parameters:
  • colName: Name of the column to select.

Since:
  • 1.3.0

collect(cb)

Returns an array that contains all Rows in this DataFrame.

Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0
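The error-first contract means the callback's first argument is an Error (or null on success) and the second is the result. A minimal plain-JavaScript sketch of that shape; `fakeCollect` is a hypothetical stand-in, not part of this API, and is synchronous here only so the example is self-contained (the real collect is asynchronous):

```javascript
// Hypothetical stand-in illustrating the error-first callback shape of collect(cb).
function fakeCollect(rows, cb) {
  if (!Array.isArray(rows)) return cb(new Error("no data"));
  cb(null, rows); // success: the error slot is null, the result comes second
}

var collected;
fakeCollect([{age: 30}, {age: 42}], function (err, rows) {
  if (err) throw err;
  collected = rows;
});
```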

collectSync()

The synchronous version of DataFrame#collect.

Since:
  • 1.3.0

columns(cb)

Returns all column names as an array.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

columnsSync()

The synchronous version of DataFrame#columns.

Since:
  • 1.3.0

count(cb)

Returns the number of rows in the DataFrame.

Parameters:
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

countSync()

The synchronous version of DataFrame#count.

Since:
  • 1.3.0

cube(cols)

Creates a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.4.0
Example
  // Compute the average for all numeric columns cubed by department and group.
  df.cube("department", "group").avg();

describe(colNames)

Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this computes statistics for all numerical columns.

This is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg method instead.

Parameters:
  • colNames: Array of column names.

Since:
  • 1.3.1
Example
  df.describe("age", "height").show();

  // output:
  // summary age   height
  // count   10.0  10.0
  // mean    53.3  178.05
  // stddev  11.6  15.7
  // min     18.0  163.0
  // max     92.0  192.0
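The statistics describe() reports can be sketched for a single numeric column in plain JavaScript. This is a hand-rolled illustration, not the Spark implementation, and it assumes the sample standard deviation (dividing by n - 1):

```javascript
// Plain-JS sketch of count/mean/stddev/min/max for one numeric column.
function describeColumn(values) {
  var n = values.length;
  var mean = values.reduce(function (s, v) { return s + v; }, 0) / n;
  var variance = values.reduce(function (s, v) {
    return s + (v - mean) * (v - mean);
  }, 0) / (n - 1); // sample variance
  return {
    count: n,
    mean: mean,
    stddev: Math.sqrt(variance),
    min: Math.min.apply(null, values),
    max: Math.max.apply(null, values)
  };
}

var stats = describeColumn([18, 35, 53, 71, 92]);
```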

distinct()

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for dropDuplicates.

Since:
  • 1.3.0

drop(col)

Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.

Parameters:
  • col: Column to drop.

Since:
  • 1.4.0

dropDuplicates(colNames)

Returns a new DataFrame that contains only the unique rows from this DataFrame. When no column names are given, this is an alias for distinct; if column names are passed in, rows are compared only on those columns.

Parameters:
  • colNames: Array of column names.

Since:
  • 1.4.0
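The "compare only on those columns" behavior can be sketched over plain row arrays. The `dropDuplicates` helper below is hypothetical, not the Spark API; it keeps the first row seen for each combination of the chosen columns:

```javascript
// Plain-JS sketch of dropDuplicates semantics keyed on selected columns.
function dropDuplicates(rows, colNames) {
  var seen = {};
  return rows.filter(function (row) {
    var key = JSON.stringify(colNames.map(function (c) { return row[c]; }));
    if (seen[key]) return false; // duplicate on the chosen columns
    seen[key] = true;
    return true;
  });
}

var rows = [
  {name: "Ann", age: 30},
  {name: "Ann", age: 31},
  {name: "Bob", age: 30}
];
var distinctNames = dropDuplicates(rows, ["name"]);
```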

except(other)

Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0

explain([extended])

Prints the plans (logical and physical) to the console for debugging purposes.

Parameters:
  • extended (Boolean, optional, default false): If true, prints both the logical and physical plans; otherwise only the physical plan.

Since:
  • 1.3.0

filter(condition)

Filters rows using the given column expression or SQL expression.

Parameters:
  • condition: A Column of booleans or a string containing a SQL expression.

Since:
  • 1.3.0
Example

The following are equivalent:

  peopleDf.filter(peopleDf.col("age").gt(15));
  peopleDf.filter("age > 15");

groupBy(cols)

Groups the DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions to group by.

Since:
  • 1.3.0
Example

Compute the average for all numeric columns grouped by department.

  df.groupBy("department").avg();

head([n], cb)

Returns the first n rows.

Running head requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

Parameters:
  • n (Number, optional, default 1): Number of rows to return.
  • cb: Node-style callback function (error-first).

Since:
  • 1.3.0

headSync([n])

The synchronous version of DataFrame#head.

Parameters:
  • n (Number, optional, default 1): Number of rows to return.

Since:
  • 1.3.0

intersect(other)

Returns a new DataFrame containing rows only in both this frame and another frame. This is equivalent to INTERSECT in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0
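The EXCEPT and INTERSECT row semantics described above can be sketched over plain row arrays. The helpers below are hypothetical illustrations, not the Spark API:

```javascript
// Plain-JS sketch: EXCEPT keeps rows of `a` absent from `b`;
// INTERSECT keeps rows of `a` present in `b`.
function rowKey(r) { return JSON.stringify(r); }

function except(a, b) {
  var inB = {};
  b.forEach(function (r) { inB[rowKey(r)] = true; });
  return a.filter(function (r) { return !inB[rowKey(r)]; });
}

function intersect(a, b) {
  var inB = {};
  b.forEach(function (r) { inB[rowKey(r)] = true; });
  return a.filter(function (r) { return inB[rowKey(r)]; });
}

var left = [{id: 1}, {id: 2}, {id: 3}];
var right = [{id: 2}];
var onlyLeft = except(left, right);
var both = intersect(left, right);
```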

isLocal()

Returns true if the collect and take methods can be run locally (without any Spark executors).

Since:
  • 1.3.0

join(right, [col], [joinType])

Join with another DataFrame.

If no col is provided, does a Cartesian join. (Note that Cartesian joins are very expensive without an extra filter that can be pushed down.)

If a column name (string) is provided, does an equi-join.

If a column expression is provided, uses that as a join expression.

Parameters:
  • right: Right side of the join.
  • col (optional, default null): Column name or join expression.
  • joinType (optional, default "inner"): One of: inner, outer, left_outer, right_outer, leftsemi.

Since:
  • 1.3.0
Example

Perform a full outer join between df1 and df2:

  df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
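How two of the join types differ can be shown over plain row arrays: inner keeps only matched rows, while left_outer also keeps unmatched left rows with a null right side. The `join` helper below is a hypothetical sketch covering just those two types, not the Spark API:

```javascript
// Plain-JS sketch of inner vs left_outer equi-join semantics.
function join(left, right, key, joinType) {
  var out = [];
  left.forEach(function (l) {
    var matches = right.filter(function (r) { return r[key] === l[key]; });
    if (matches.length === 0 && joinType === "left_outer") {
      out.push({left: l, right: null}); // unmatched left row survives
    }
    matches.forEach(function (r) { out.push({left: l, right: r}); });
  });
  return out;
}

var people = [{deptId: 1, name: "Ann"}, {deptId: 9, name: "Bob"}];
var depts = [{deptId: 1, dept: "Eng"}];
var inner = join(people, depts, "deptId", "inner");
var leftOuter = join(people, depts, "deptId", "left_outer");
```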

limit(n)

Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.

Parameters:
  • n: Number of rows.

Since:
  • 1.3.0

printSchema()

Prints the schema to the console in a nice tree format.

This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).

Since:
  • 1.3.0

randomSplit(weights)

Randomly splits this DataFrame with the provided weights.

Parameters:
  • weights: Weights for splits; will be normalized if they don't sum to 1.

Since:
  • 1.4.0
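The weight normalization mentioned above can be sketched in plain JavaScript; `normalizeWeights` is a hypothetical helper, not the Spark implementation:

```javascript
// Plain-JS sketch: scale the weights so they sum to 1.
function normalizeWeights(weights) {
  var total = weights.reduce(function (s, w) { return s + w; }, 0);
  return weights.map(function (w) { return w / total; });
}

var normalized = normalizeWeights([2, 3, 5]);
```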

registerTempTable(tableName)

Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.

Parameters:
  • tableName: Table name.

Since:
  • 1.3.0

repartition([numPartitions], partitionExprs)

Returns a new DataFrame with the requested partitioning.

If partition expressions are provided, partition by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned. (This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).)

Parameters:
  • numPartitions (Number, optional): Number of partitions.
  • partitionExprs: Partitioning expressions.

Since:
  • 1.3.0

rollup(cols)

Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.

Parameters:
  • cols: Array of column names or expressions.

Since:
  • 1.4.0
Example
  // Compute the average for all numeric columns rolled up by department and group.
  df.rollup(df.col("department"), df.col("group")).avg();
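The difference between rollup and cube is which grouping sets they aggregate over: for columns [a, b], rollup uses the prefixes {a, b}, {a}, {}, while cube uses every subset of the columns. A plain-JavaScript sketch of the grouping sets each produces (hypothetical helpers, not the Spark API):

```javascript
// rollup: hierarchical prefixes of the column list.
function rollupSets(cols) {
  var sets = [];
  for (var i = cols.length; i >= 0; i--) sets.push(cols.slice(0, i));
  return sets;
}

// cube: every subset of the column list, via a bitmask.
function cubeSets(cols) {
  var sets = [];
  for (var mask = 0; mask < (1 << cols.length); mask++) {
    sets.push(cols.filter(function (_, i) { return (mask >> i) & 1; }));
  }
  return sets;
}

var r = rollupSets(["department", "group"]); // 3 prefixes
var c = cubeSets(["department", "group"]);   // 4 subsets
```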

sample(withReplacement, fraction)

Returns a new DataFrame by sampling a fraction of rows, using a random seed.

Parameters:
  • withReplacement: Whether to sample with replacement.
  • fraction: Fraction of rows to generate.

Since:
  • 1.3.0

select(cols)

Selects a set of column-based expressions.

Parameters:
  • cols: Array of column names or expressions. If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.

Since:
  • 1.3.0

selectExpr(exprs)

Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

Parameters:
  • exprs: Array of SQL expressions.

Since:
  • 1.3.0

show([numRows], [truncate])

Displays the DataFrame in tabular form. Strings longer than 20 characters are truncated, and all cells are right-aligned.

This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).

Parameters:
  • numRows (Number, optional, default 20): Number of rows to show.
  • truncate (Boolean, optional, default true): If true, strings longer than 20 characters are truncated and all cells are right-aligned.

Since:
  • 1.3.0

sort(col, cols)

Returns a new DataFrame sorted by the specified columns, all in ascending order.

Parameters:
  • col: Column or column name to sort by.
  • cols: Array of additional column names or expressions to sort by.

Since:
  • 1.3.0

toDF(colNames)

Returns a new DataFrame with the specified column names.

Parameters:
  • colNames: Array of new column names.

Since:
  • 1.3.0
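The positional renaming toDF performs can be sketched over plain row arrays; the `toDF` helper below is a hypothetical illustration, not the Spark API:

```javascript
// Plain-JS sketch: rename columns by position, old name i -> new name i.
function toDF(rows, oldNames, newNames) {
  return rows.map(function (row) {
    var renamed = {};
    oldNames.forEach(function (name, i) { renamed[newNames[i]] = row[name]; });
    return renamed;
  });
}

var renamed = toDF([{c0: "Ann", c1: 30}], ["c0", "c1"], ["name", "age"]);
```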

unionAll(other)

Returns a new DataFrame containing the union of rows in this frame and another frame. This is equivalent to UNION ALL in SQL.

Parameters:
  • other: The other DataFrame.

Since:
  • 1.3.0

where(condition)

Filters rows using the given condition. This is an alias for filter.

Parameters:
  • condition: A Column of booleans or a string containing a SQL expression.

Since:
  • 1.3.0

withColumn(colName, col)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

Parameters:
  • colName: Name of the new column.
  • col: Column expression for the new column.

Since:
  • 1.3.0
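The add-or-replace behavior of withColumn can be sketched over plain row objects; `withColumn` below is a hypothetical helper, not the Spark API, and takes a plain function where the real method takes a Column expression:

```javascript
// Plain-JS sketch: copy each row and set (add or replace) one column.
function withColumn(rows, colName, fn) {
  return rows.map(function (row) {
    var copy = {};
    Object.keys(row).forEach(function (k) { copy[k] = row[k]; });
    copy[colName] = fn(row); // new name adds; existing name replaces
    return copy;
  });
}

var rows = [{age: 30}];
var older = withColumn(rows, "agePlus10", function (r) { return r.age + 10; });
```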

write()

Interface for saving the content of the DataFrame out into external storage.

Since:
  • 1.4.0