Constructor
new DataFrame()
Note: Do not use directly (see above).
- Since:
- 1.3.0
- Source:
Examples
var ageCol = people.col("age");
// The following creates a new column that increases everybody's age by 10.
people.col("age").plus(10);
var people = sqlContext.read().json("...");
var department = sqlContext.read().json("...");
people.filter("age > 30")
.join(department, people.col("deptId").eq(department.col("id")))
.groupBy(department.col("name"), people.col("gender"))
.agg(F.avg(people.col("salary")), F.max(people.col("age")));
Methods
agg(cols)
Aggregates on the entire DataFrame without groups.
Parameters:
Name | Type | Description |
---|---|---|
cols | | Array of column names or expressions. |
- Since:
- 1.3.0
- Source:
Example
// df.agg(...) is a shorthand for df.groupBy().agg(...)
df.agg(F.max(df.col("age")), F.avg(df.col("salary")));
df.groupBy().agg(F.max(df.col("age")), F.avg(df.col("salary")));
col(colName)
Selects a column based on the column name and returns it as a Column.
Note that the column name can also reference a nested column such as a.b.
Parameters:
Name | Type | Description |
---|---|---|
colName | | Name of the column to select. |
- Since:
- 1.3.0
- Source:
collect(cb)
Returns an array that contains all of the Rows in this DataFrame.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
Parameters:
Name | Type | Description |
---|---|---|
cb | | Node-style callback function (error-first). |
- Since:
- 1.3.0
- Source:
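A minimal usage sketch of the error-first callback, assuming `df` is an already-constructed DataFrame:

```javascript
// Hypothetical usage; `df` stands for any existing DataFrame,
// e.g. one produced by sqlContext.read().json(...).
df.collect(function(err, rows) {
  if (err) {
    console.error(err);
    return;
  }
  // `rows` is a plain array of Row objects.
  rows.forEach(function(row) {
    console.log(row);
  });
});
```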
collectSync()
The synchronous version of DataFrame#collect
- Since:
- 1.3.0
- Source:
columns(cb)
Returns all column names as an array.
Parameters:
Name | Type | Description |
---|---|---|
cb | | Node-style callback function (error-first). |
- Since:
- 1.3.0
- Source:
columnsSync()
The synchronous version of DataFrame#columns
- Since:
- 1.3.0
- Source:
count(cb)
Returns the number of rows in the DataFrame.
Parameters:
Name | Type | Description |
---|---|---|
cb | | Node-style callback function (error-first). |
- Since:
- 1.3.0
- Source:
countSync()
The synchronous version of DataFrame#count
- Since:
- 1.3.0
- Source:
cube(cols)
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregation functions.
Parameters:
Name | Type | Description |
---|---|---|
cols | | Array of column names or expressions. |
- Since:
- 1.4.0
- Source:
Example
// Compute the average for all numeric columns cubed by department and group.
df.cube("department", "group").avg();
describe(colNames)
Computes statistics for numeric columns, including count, mean, stddev, min, and max. If no columns are given, this computes statistics for all numerical columns.
This is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. If you want to programmatically compute summary statistics, use the agg method instead.
Parameters:
Name | Type | Description |
---|---|---|
colNames | | Array of column names. |
- Since:
- 1.3.1
- Source:
Example
df.describe("age", "height").show();
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
distinct()
Returns a new DataFrame that contains only the unique rows from this DataFrame.
This is an alias for dropDuplicates.
- Since:
- 1.3.0
- Source:
drop(col)
Returns a new DataFrame with a column dropped. This is a no-op if the schema doesn't contain the column name.
Parameters:
Name | Type | Description |
---|---|---|
col | | Name of the column to drop. |
- Since:
- 1.4.0
- Source:
dropDuplicates(colNames)
Returns a new DataFrame that contains only the unique rows from this DataFrame.
This is an alias for distinct.
If column names are passed in, rows are compared only in those columns.
Parameters:
Name | Type | Description |
---|---|---|
colNames | | Array of column names. |
- Since:
- 1.4.0
- Source:
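A short sketch, assuming a hypothetical `people` DataFrame with firstName and lastName columns:

```javascript
// Compare rows only on the listed columns:
people.dropDuplicates(["firstName", "lastName"]);
// With no column names, this behaves like people.distinct():
people.dropDuplicates();
```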
except(other)
Returns a new DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
Parameters:
Name | Type | Description |
---|---|---|
other | | Another DataFrame. |
- Since:
- 1.3.0
- Source:
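A minimal sketch, where `df1` and `df2` are assumed DataFrames with the same schema:

```javascript
// Rows present in df1 but absent from df2 (SQL EXCEPT):
var onlyInDf1 = df1.except(df2);
```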
explain(extendedopt)
Prints the plans (logical and physical) to the console for debugging purposes.
Parameters:
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
extended | string | <optional> | false | If true, prints both the logical and physical plans. |
- Since:
- 1.3.0
- Source:
filter(condition)
Filters rows using the given column expression or SQL expression.
Parameters:
Name | Type | Description |
---|---|---|
condition |
A Column of booleans or a string containing a SQL expression. |
- Since:
- 1.3.0
- Source:
Example
peopleDf.filter(peopleDf.col("age").gt(15));
peopleDf.filter("age > 15");
groupBy(cols)
Groups the DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.
Parameters:
Name | Type | Description |
---|---|---|
cols |
Array of column names or expressions to group by. |
- Since:
- 1.3.0
- Source:
Example
df.groupBy("department").avg();
head(nopt, cb)
Returns the first n rows.
Running head requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.
Parameters:
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
n | Number | <optional> | 1 | Number of rows to return. |
cb | | | | Node-style callback function (error-first). |
- Since:
- 1.3.0
- Source:
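A usage sketch, assuming an existing DataFrame `df`:

```javascript
// Fetch at most the first 5 rows through the error-first callback.
df.head(5, function(err, rows) {
  if (err) return console.error(err);
  console.log(rows.length); // at most 5
});
```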
headSync(nopt)
The synchronous version of DataFrame#head
Parameters:
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
n | Number | <optional> | 1 | Number of rows to return. |
- Since:
- 1.3.0
- Source:
intersect(other)
Returns a new DataFrame containing rows only in both this frame and another frame.
This is equivalent to INTERSECT in SQL.
Parameters:
Name | Type | Description |
---|---|---|
other | | Another DataFrame. |
- Since:
- 1.3.0
- Source:
isLocal()
Returns true if the collect and take methods can be run locally (without any Spark executors).
- Since:
- 1.3.0
- Source:
join(right, colopt, joinTypeopt)
Join with another DataFrame.
If no col is provided, does a Cartesian join. (Note that Cartesian joins are very expensive without an extra filter that can be pushed down.)
If a column name (string) is provided, does an equi-join.
If a column expression is provided, uses that as the join expression.
Parameters:
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
right | | | | Right side of the join. |
col | | <optional> | null | Column name or join expression. |
joinType | | <optional> | "inner" | One of: |
- Since:
- 1.3.0
- Source:
Example
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
limit(n)
Returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new DataFrame.
Parameters:
Name | Type | Description |
---|---|---|
n | | Number of rows. |
- Since:
- 1.3.0
- Source:
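A sketch of the contrast with head, assuming an existing DataFrame `df`:

```javascript
// `top10` is itself a DataFrame, so it can be transformed further,
// unlike the array of Rows that head returns.
var top10 = df.limit(10);
top10.filter("age > 30");
```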
printSchema()
Prints the schema to the console in a nice tree format.
This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).
- Since:
- 1.3.0
- Source:
randomSplit(weights)
Randomly splits this DataFrame with the provided weights.
Parameters:
Name | Type | Description |
---|---|---|
weights | | Weights for the splits; they will be normalized if they don't sum to 1. |
- Since:
- 1.4.0
- Source:
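The weight normalization can be illustrated with a small standalone sketch (this is not the library's implementation, just the arithmetic described above):

```javascript
// Weights that do not sum to 1 are scaled so that they do.
function normalizeWeights(weights) {
  var sum = weights.reduce(function (a, b) { return a + b; }, 0);
  return weights.map(function (w) { return w / sum; });
}

// A 2:3 split becomes fractions 0.4 and 0.6 of the rows, so
// df.randomSplit([2, 3]) splits like df.randomSplit([0.4, 0.6]).
var fractions = normalizeWeights([2, 3]);
```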
registerTempTable(tableName)
Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
Parameters:
Name | Type | Description |
---|---|---|
tableName |
Table name. |
- Since:
- 1.3.0
- Source:
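A usage sketch, assuming `df` and the `sqlContext` it was created from:

```javascript
// Expose the DataFrame to SQL under the name "people",
// then query it through the same SQLContext.
df.registerTempTable("people");
var adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18");
```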
repartition(numPartitionsopt, partitionExprs)
Returns a partitioned DataFrame.
If partition expressions are provided, partitions by the given partitioning expressions into numPartitions partitions. The resulting DataFrame is hash partitioned. (This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).)
Parameters:
Name | Type | Attributes | Description |
---|---|---|---|
numPartitions | Number | <optional> | Number of partitions. |
partitionExprs | | | Partitioning expressions. |
- Since:
- 1.3.0
- Source:
rollup(cols)
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregations on them. See GroupedData for all the available aggregation functions.
Parameters:
Name | Type | Description |
---|---|---|
cols | | Array of column names or expressions. |
- Since:
- 1.4.0
- Source:
Example
// Compute the average for all numeric columns rolled up by department and group.
df.rollup(df.col("department"), df.col("group")).avg();
sample(withReplacement, fraction)
Returns a new DataFrame by sampling a fraction of rows, using a random seed.
Parameters:
Name | Type | Description |
---|---|---|
withReplacement |
Sample with replacement or not. |
|
fraction |
Fraction of rows to generate. |
- Since:
- 1.3.0
- Source:
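A minimal sketch, assuming an existing DataFrame `df`:

```javascript
// Roughly 10% of the rows, sampled without replacement.
var tenPercent = df.sample(false, 0.1);
```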
select(cols)
Selects a set of column-based expressions.
Parameters:
Name | Type | Description |
---|---|---|
cols | | Array of column names or expressions. If one of the column names is '*', that column is expanded to include all columns in the current DataFrame. |
- Since:
- 1.3.0
- Source:
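A short sketch, assuming a hypothetical `df` with name and age columns:

```javascript
// Select by name, by expression, and with '*' expansion:
df.select("name", "age");
df.select(df.col("name"), df.col("age").plus(1));
df.select("*");
```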
selectExpr(exprs)
Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.
Parameters:
Name | Type | Description |
---|---|---|
exprs | | Array of SQL expressions. |
- Since:
- 1.3.0
- Source:
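A usage sketch, assuming hypothetical columns colA, colB, and colC:

```javascript
// Each argument is parsed as a SQL expression:
df.selectExpr("colA", "colB as newName", "abs(colC)");
```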
show(numRowsopt, truncateopt)
Displays the DataFrame in tabular form. Strings longer than 20 characters will be truncated, and all cells will be right-aligned.
This method runs a computation but is still synchronous, because it is used in an interactive setting (shell).
Parameters:
Name | Type | Attributes | Default | Description |
---|---|---|---|---|
numRows | Number | <optional> | 20 | Number of rows to show. |
truncate | Boolean | <optional> | true | If true, strings longer than 20 characters are truncated and all cells are right-aligned. |
- Since:
- 1.3.0
- Source:
sort(col, cols)
Returns a new DataFrame sorted by the specified columns, all in ascending order.
Parameters:
Name | Type | Description |
---|---|---|
col | | Column name or expression to sort by. |
cols | | Array of additional column names or expressions to sort by. |
- Since:
- 1.3.0
- Source:
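A short sketch, assuming a hypothetical `df` with age and name columns:

```javascript
// Sort by age ascending, ties broken by name:
df.sort("age", "name");
// Column expressions work too:
df.sort(df.col("age"));
```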
toDF(colNames)
Returns a new DataFrame with new specified column names.
Parameters:
Name | Type | Description |
---|---|---|
colNames | | Array of new column names. |
- Since:
- 1.3.0
- Source:
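A minimal sketch, assuming `df` has exactly two columns:

```javascript
// Rename both columns; the number of names should match
// the number of columns in the DataFrame.
var renamed = df.toDF("id", "name");
```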
unionAll(other)
Returns a new DataFrame containing the union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL.
Parameters:
Name | Type | Description |
---|---|---|
other | | Another DataFrame. |
- Since:
- 1.3.0
- Source:
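A minimal sketch, where `df1` and `df2` are assumed DataFrames with the same schema:

```javascript
// Concatenates the rows of both frames; duplicates are kept
// (combine with distinct() to drop them).
var combined = df1.unionAll(df2);
```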
where(condition)
Filters rows using the given condition. This is an alias for filter.
Parameters:
Name | Type | Description |
---|---|---|
condition | | A Column of booleans or a string containing a SQL expression. |
- Since:
- 1.3.0
- Source:
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
Name | Type | Description |
---|---|---|
colName | | Name of the new column. |
col | | Column expression for the new column. |
- Since:
- 1.3.0
- Source:
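A short sketch, assuming a hypothetical `df` with an age column:

```javascript
// Add a derived column, or overwrite "age" by reusing its name:
df.withColumn("ageNextYear", df.col("age").plus(1));
df.withColumn("age", df.col("age").plus(1));
```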
write()
Interface for saving the content of the DataFrame out into external storage.
- Since:
- 1.4.0
- Source: