Functions

Functions for manipulating DataFrames.

Constructor

new Functions()

Note: Do not use directly. Access via sqlFunctions.

Source:

Methods

(static) abs(col)

Computes the absolute value of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) acos(col)

Computes the cosine inverse of the given column; the returned angle is in the range 0.0 through pi.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) add_months(col, numMonths)

Returns the date that is numMonths after the date in the given column col.

Parameters:
Name Type Description
col
numMonths
Since:
  • 1.5.0
Source:

(static) approxCountDistinct(col)

Aggregate function. Returns the approximate number of distinct items in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) array(cols)

Creates a new array column. The input columns must all have the same data type.

Parameters:
Name Type Description
cols
Since:
  • 1.4.0
Source:

(static) array_contains(col, value)

Returns true if the array contains the value.

Parameters:
Name Type Description
col
value
Since:
  • 1.5.0
Source:

(static) asc(colName)

Returns a sort expression based on ascending order of the column.

Parameters:
Name Type Description
colName
Since:
  • 1.3.0
Source:
Example
// Sort by dept in ascending order, and then age in descending order.
  df.sort(F.asc("dept"), F.desc("age"));

(static) ascii(col)

Computes the numeric value of the first character of the string column, and returns the result as an int column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) asin(col)

Computes the sine inverse of the given column; the returned angle is in the range -pi/2 through pi/2.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) atan(col)

Computes the tangent inverse of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) atan2(l, r)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta). At least one parameter must be a Column; the other can be either a Number or a Column.

Parameters:
Name Type Description
l
r
Since:
  • 1.4.0
Source:

(static) avg(col)

Aggregate function. Returns the average of the values in a group. Alias for Functions.mean.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) base64(col)

Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) bin(col)

An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) bitwiseNOT(col)

Computes bitwise NOT.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) broadcast(dataframe)

Marks a DataFrame as small enough for use in broadcast joins.

The following example marks the right DataFrame for broadcast hash join using joinKey.

Parameters:
Name Type Description
dataframe
Since:
  • 1.5.0
Source:
Example
// left and right are DataFrames
  left.join(F.broadcast(right), "joinKey");

(static) cbrt(col)

Computes the cube root of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) ceil(col)

Computes the ceiling of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) coalesce(cols)

Returns the first column that is not null, or null if all inputs are null.

For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.

Parameters:
Name Type Description
cols
Since:
  • 1.3.0
Source:
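Example
A minimal sketch of the rule described above, applied to the values of a single row in plain JavaScript. It does not call the library; firstNonNull is a hypothetical helper used only for illustration.

```javascript
// Plain-JavaScript sketch of the coalesce rule for one row of values.
// firstNonNull is a hypothetical name, not part of the Functions API.
function firstNonNull(values) {
  for (const v of values) {
    // This sketch treats both null and undefined as "null".
    if (v !== null && v !== undefined) {
      return v;
    }
  }
  return null; // every input was null
}

console.log(firstNonNull([null, "b", "c"])); // prints b
console.log(firstNonNull([null, null, null])); // prints null
```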

(static) col(colName)

Returns the Column with the given column name.

Parameters:
Name Type Description
colName
Since:
  • 1.3.0
Source:

(static) collect_list(col)

Aggregate function. Returns an array of objects with duplicates.

For now this is an alias for the collect_list Hive UDAF.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) collect_set(col)

Aggregate function. Returns a set of objects with duplicate elements eliminated.

For now this is an alias for the collect_set Hive UDAF.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) column(colName)

Returns a Column based on the given column name. Alias of Functions.col.

Parameters:
Name Type Description
colName
Since:
  • 1.3.0
Source:

(static) concat(cols)

Concatenates multiple input string columns together into a single string column.

Parameters:
Name Type Description
cols
Since:
  • 1.5.0
Source:

(static) concat_ws(sep, cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

Parameters:
Name Type Description
sep

Separator.

cols

Columns.

Since:
  • 1.5.0
Source:

(static) conv(col, fromBase, toBase)

Convert a number in a string column from one base to another.

Parameters:
Name Type Description
col
fromBase
toBase
Since:
  • 1.5.0
Source:

(static) corr(col1, col2)

Aggregate function. Returns the Pearson Correlation Coefficient for two columns.

Parameters:
Name Type Description
col1
col2
Since:
  • 1.6.0
Source:

(static) cos(col)

Computes the cosine of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) cosh(col)

Computes the hyperbolic cosine of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) count(col)

Aggregate function. Returns the number of items in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) countDistinct(col, cols)

Aggregate function. Returns the number of distinct items in a group.

Parameters:
Name Type Description
col
cols
Since:
  • 1.3.0
Source:

(static) crc32(col)

Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a Number.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) cumeDist()

Window function. Returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are at or below the current row.

This is equivalent to the CUME_DIST function in SQL.

Since:
  • 1.4.0
Source:
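Example
A rough sketch of the computation over a single partition, in plain JavaScript rather than through the library. It assumes the standard SQL CUME_DIST definition, in which rows tied with the current row are counted; cumeDistSketch is a hypothetical name.

```javascript
// Sketch of CUME_DIST for one partition given as an array of ordering values.
// cumeDistSketch is a hypothetical helper, not the library's implementation.
function cumeDistSketch(values) {
  const n = values.length;
  return values.map(function (v) {
    // Count rows whose value is less than or equal to the current row's value.
    const atOrBelow = values.filter(function (other) {
      return other <= v;
    }).length;
    return atOrBelow / n;
  });
}

console.log(cumeDistSketch([10, 20, 20, 30]).join(", ")); // prints 0.25, 0.75, 0.75, 1
```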

(static) current_date()

Returns the current date as a date column.

Since:
  • 1.5.0
Source:

(static) current_timestamp()

Returns the current timestamp as a timestamp column.

Since:
  • 1.5.0
Source:

(static) date_add(col, days)

Returns the date that is the given number of days after the date in the given column col.

Parameters:
Name Type Description
col
days
Since:
  • 1.5.0
Source:

(static) date_format(col, format)

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

A pattern could be for instance dd.MM.yyyy and could return a string like '18.03.1993'. All pattern letters of java.text.SimpleDateFormat can be used.

NOTE: Whenever possible, use specialized functions such as Functions.year. These benefit from a specialized implementation.

Parameters:
Name Type Description
col
format
Since:
  • 1.5.0
Source:

(static) date_sub(col, days)

Returns the date that is the given number of days before the date in the given column col.

Parameters:
Name Type Description
col
days
Since:
  • 1.5.0
Source:

(static) datediff(startCol, endCol)

Returns the number of days from startCol to endCol.

Parameters:
Name Type Description
startCol
endCol
Since:
  • 1.5.0
Source:

(static) dayofmonth(col)

Extracts the day of the month as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) dayofyear(col)

Extracts the day of the year as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) decode(col, charset)

Converts the column argument into a string from a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

Parameters:
Name Type Description
col
charset
Since:
  • 1.5.0
Source:

(static) denseRank()

Window function. Returns the rank of rows within a window partition, without any gaps.

The difference between rank and denseRank is that denseRank leaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using denseRank and had three people tie for second place, you would say that all three were in second place and that the next person came in third.

This is equivalent to the DENSE_RANK function in SQL.

Since:
  • 1.4.0
Source:
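Example
The competition scenario above can be sketched in plain JavaScript. This is an illustration of the two numbering schemes only, not the library's code; rankSketch is a hypothetical helper and the scores are made up.

```javascript
// Computes rank and dense rank for scores already sorted in descending order.
// rankSketch is a hypothetical helper used only to illustrate the difference.
function rankSketch(sortedScores) {
  const ranks = [];
  const denseRanks = [];
  let rank = 0;
  let dense = 0;
  sortedScores.forEach(function (score, i) {
    if (i === 0 || score !== sortedScores[i - 1]) {
      rank = i + 1; // rank skips ahead past tied rows, leaving gaps
      dense += 1;   // dense rank always advances by exactly one
    }
    ranks.push(rank);
    denseRanks.push(dense);
  });
  return { rank: ranks, denseRank: denseRanks };
}

// One winner, three people tied for second place, then one more competitor.
const rankDemo = rankSketch([100, 90, 90, 90, 80]);
console.log(rankDemo.rank.join(","));      // prints 1,2,2,2,5
console.log(rankDemo.denseRank.join(",")); // prints 1,2,2,2,3
```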

(static) desc(colName)

Returns a sort expression based on the descending order of the column.

Parameters:
Name Type Description
colName
Since:
  • 1.3.0
Source:
Example
// Sort by dept in ascending order, and then age in descending order.
  df.sort(F.asc("dept"), F.desc("age"));

(static) encode(col, charset)

Converts the column argument from a string into a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.

Parameters:
Name Type Description
col
charset
Since:
  • 1.5.0
Source:

(static) exp(col)

Computes the exponential of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) explode(col)

Creates a new row for each element in the given array or map column.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) expm1(col)

Computes the exponential of the given column minus one.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) expr(expr)

Parses the expression string into the column that it represents, similar to DataFrame.selectExpr.

Parameters:
Name Type Description
expr

Expression string.

Source:
Example
// get the number of words of each length
  df.groupBy(F.expr("length(word)")).count();

(static) factorial(col)

Computes the factorial of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) first(col)

Aggregate function. Returns the first value in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) floor(col)

Computes the floor of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) format_number(col, d)

Formats the numeric column col to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.

If d is 0, the result has no decimal point or fractional part. If d < 0, the result will be null.

Parameters:
Name Type Description
col
d
Since:
  • 1.5.0
Source:

(static) format_string(format, args)

Formats the arguments in printf-style and returns the result as a string column.

Parameters:
Name Type Description
format

Format string.

args

Columns containing format arguments.

Since:
  • 1.5.0
Source:

(static) from_unixtime(col, formatopt)

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format (defaults to "yyyy-MM-dd HH:mm:ss").

Parameters:
Name Type Attributes Default Description
col
format <optional>
null
Since:
  • 1.5.0
Source:

(static) from_utc_timestamp(col, tz)

Assumes given timestamp column is UTC and converts to given timezone.

Parameters:
Name Type Description
col
tz
Since:
  • 1.5.0
Source:

(static) greatest(cols)

Returns the greatest value of the array of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Parameters:
Name Type Description
cols
Since:
  • 1.5.0
Source:

(static) hex(col)

Computes hex value of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) hour(col)

Extracts the hours as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) hypot()

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.

Since:
  • 1.4.0
Source:

(static) initcap(col)

Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

For example, "hello world" will become "Hello World".

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) inputFileName()

Creates a string column for the file name of the current Spark task.

Source:

(static) instr(col, substring)

Locates the position of the first occurrence of substring in the given column. Returns null if either of the arguments is null.

NOTE: The position is a 1-based index, not zero-based. Returns 0 if substring could not be found in the column.

Parameters:
Name Type Description
col
substring
Since:
  • 1.5.0
Source:
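Example
A small sketch of the 1-based position rule on a single string value, in plain JavaScript. It is not the library's implementation; instrSketch is a hypothetical name.

```javascript
// Sketch of the instr position rule for one value.
// instrSketch is a hypothetical helper, not part of the Functions API.
function instrSketch(str, sub) {
  if (str === null || sub === null) {
    return null; // null in, null out
  }
  // indexOf is zero-based and returns -1 when absent, so adding 1
  // gives a 1-based position, or 0 when the substring is not found.
  return str.indexOf(sub) + 1;
}

console.log(instrSketch("spark", "ark")); // prints 3
console.log(instrSketch("spark", "xyz")); // prints 0
```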

(static) isNaN(col)

Returns true iff the column is NaN.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) json_tuple(col, fields)

Creates a new row for a json column according to the given field names.

Parameters:
Name Type Description
col
fields
Since:
  • 1.6.0
Source:

(static) kurtosis(col)

Aggregate function. Returns the kurtosis of the values in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) lag(col, offset, defaultValueopt)

Window function. Returns the value that is offset rows before the current row, and null (or the optional defaultValue, if provided) if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL.

Parameters:
Name Type Attributes Default Description
col
offset
defaultValue <optional>
null
Since:
  • 1.4.0
Source:
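Example
The offset and default-value behavior can be sketched over an ordered array standing in for one window partition. This is plain JavaScript for illustration, not the library's code; lagSketch is a hypothetical name.

```javascript
// Sketch of LAG over one ordered partition represented as an array.
// lagSketch is a hypothetical helper, not part of the Functions API.
function lagSketch(rows, offset, defaultValue) {
  const fallback = defaultValue === undefined ? null : defaultValue;
  return rows.map(function (_, i) {
    // Look back `offset` rows; use the fallback when no such row exists.
    return i - offset >= 0 ? rows[i - offset] : fallback;
  });
}

console.log(JSON.stringify(lagSketch([1, 2, 3, 4], 1)));    // prints [null,1,2,3]
console.log(JSON.stringify(lagSketch([1, 2, 3, 4], 2, 0))); // prints [0,0,1,2]
```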

(static) last(col)

Aggregate function. Returns the last value in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) last_day(col)

Given a date column, returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:
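Example
A sketch of the same calculation on a single "yyyy-MM-dd" string in plain JavaScript, assuming a well-formed input. It does not use the library; lastDaySketch is a hypothetical name.

```javascript
// Sketch of last_day for one ISO date string.
// lastDaySketch is a hypothetical helper, not part of the Functions API.
function lastDaySketch(isoDate) {
  const parts = isoDate.split("-").map(Number);
  const year = parts[0];
  const month = parts[1]; // 1-based month number
  // Day 0 of the following month is the last day of this month.
  // Date.UTC avoids shifts caused by the local time zone.
  const d = new Date(Date.UTC(year, month, 0));
  return d.toISOString().slice(0, 10);
}

console.log(lastDaySketch("2015-07-27")); // prints 2015-07-31
```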

(static) lead(col, offset, defaultValueopt)

Window function. Returns the value that is offset rows after the current row, and null (or the optional defaultValue, if provided) if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL.

Parameters:
Name Type Attributes Default Description
col
offset
defaultValue <optional>
null
Since:
  • 1.4.0
Source:

(static) least(cols)

Returns the least value of the array of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Parameters:
Name Type Description
cols
Since:
  • 1.5.0
Source:

(static) length(col)

Computes the length of a given string or binary column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) levenshtein(l, r)

Computes the Levenshtein distance of the two given string columns.

Parameters:
Name Type Description
l
r
Since:
  • 1.5.0
Source:
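Example
For reference, the Levenshtein distance between two plain strings can be computed with the usual dynamic-programming recurrence. The sketch below is plain JavaScript, not the library's implementation; levenshteinSketch is a hypothetical name.

```javascript
// Classic two-row dynamic-programming Levenshtein distance.
// levenshteinSketch is a hypothetical helper, not part of the Functions API.
function levenshteinSketch(a, b) {
  // prev[j] is the distance between the processed prefix of a and the first j chars of b.
  let prev = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshteinSketch("kitten", "sitting")); // prints 3
```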

(static) lit(literal)

Creates a Column of literal value.

The passed in object is returned directly if it is already a Column. Otherwise, a new Column is created to represent the literal value.

Parameters:
Name Type Description
literal
Since:
  • 1.3.0
Source:

(static) locate(substring, column, positionopt)

Locates the position of the first occurrence of substring in a string column (after the optional position).

NOTE: The position is a 1-based index, not zero-based. Returns 0 if substring could not be found in the column.

Parameters:
Name Type Attributes Default Description
substring
column
position <optional>
null
Since:
  • 1.5.0
Source:

(static) log(col, baseopt)

Computes the logarithm of the given column.

Parameters:
Name Type Attributes Default Description
col
base <optional>
Math.E
Since:
  • 1.4.0
Source:

(static) log1p(col)

Computes the natural logarithm of the given column plus one.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) log2(col)

Computes the logarithm of the given column in base 2.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) log10(col)

Computes the logarithm of the given column in base 10.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) lower(col)

Converts a string column to lower case.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) lpad(col, len, pad)

Left-pad the string column with the contents of 'pad', to a max length of 'len'.

Parameters:
Name Type Description
col
len
pad
Since:
  • 1.5.0
Source:

(static) ltrim(col)

Trims the spaces from the left end of the specified string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) max(col)

Aggregate function. Returns the maximum value of the expression in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) md5(col)

Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) mean(col)

Aggregate function. Returns the average of the values in a group. Alias for Functions.avg.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) min(col)

Aggregate function. Returns the minimum value of the expression in a group.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) minute(col)

Extracts the minutes as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) monotonicallyIncreasingId()

A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.

Since:
  • 1.4.0
Source:
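Example
The bit layout described above can be reproduced with BigInt arithmetic. This is a sketch of the layout only, in plain JavaScript, not the library's implementation; idSketch is a hypothetical name.

```javascript
// Partition ID in the upper bits, record number in the lower 33 bits.
// idSketch is a hypothetical helper; BigInt is used because the values exceed 2^32.
function idSketch(partitionId, recordNumber) {
  return (BigInt(partitionId) << 33n) + BigInt(recordNumber);
}

// Two partitions with three records each, as in the example above.
const idDemo = [];
for (let p = 0; p < 2; p++) {
  for (let r = 0; r < 3; r++) {
    idDemo.push(String(idSketch(p, r)));
  }
}
console.log(idDemo.join(", "));
// prints 0, 1, 2, 8589934592, 8589934593, 8589934594
```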

(static) month(col)

Extracts the month as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) months_between(col1, col2)

Returns number of months between dates in col1 and col2.

Parameters:
Name Type Description
col1
col2
Since:
  • 1.5.0
Source:

(static) nanvl(col1, col2)

Returns col1 if it is not NaN, or col2 if col1 is NaN.

Both inputs should be floating point columns (DoubleType or FloatType).

Parameters:
Name Type Description
col1
col2
Since:
  • 1.5.0
Source:

(static) negate(col)

Unary minus, i.e. negate the expression.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:
Example
// Select the amount column and negates all values.
  df.select(F.negate(df.col("amount")));

(static) next_day(col, dayOfWeek)

Given a date column, returns the first date which is later than the value of the date column that is on the specified day of the week.

For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

Day of the week parameter is case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun".

Parameters:
Name Type Description
col
dayOfWeek
Since:
  • 1.5.0
Source:
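Example
A sketch of the date arithmetic on a single "yyyy-MM-dd" string, in plain JavaScript. It is not the library's implementation; nextDaySketch is a hypothetical name, and returning null for an unrecognized day name is an assumption of this sketch.

```javascript
// Sketch of next_day for one ISO date string.
// nextDaySketch is a hypothetical helper, not part of the Functions API.
function nextDaySketch(isoDate, dayOfWeek) {
  const names = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"];
  // Case-insensitive match on the first three letters ("Sun" or "Sunday").
  const target = names.indexOf(dayOfWeek.toLowerCase().slice(0, 3));
  if (target < 0) {
    return null; // assumption: unrecognized day names yield null
  }
  const d = new Date(isoDate + "T00:00:00Z");
  // Number of days to add is between 1 and 7, so the result is strictly later.
  const delta = ((target - d.getUTCDay() + 6) % 7) + 1;
  d.setUTCDate(d.getUTCDate() + delta);
  return d.toISOString().slice(0, 10);
}

console.log(nextDaySketch("2015-07-27", "Sunday")); // prints 2015-08-02
```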

(static) not(col)

Inversion of boolean expression, i.e. NOT.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:
Example
df.filter(F.not(df.col("isActive")));

(static) ntile(n)

Window function. Returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.

This is equivalent to the NTILE function in SQL.

Parameters:
Name Type Description
n
Since:
  • 1.4.0
Source:
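Example
A sketch of how group ids could be assigned to the rows of one ordered partition, in plain JavaScript. It assumes the usual SQL convention that, when the rows do not divide evenly, the earlier groups take the extra rows; ntileSketch is a hypothetical name and this is not the library's code.

```javascript
// Returns the ntile group id for each of rowCount ordered rows.
// ntileSketch is a hypothetical helper, not part of the Functions API.
function ntileSketch(rowCount, n) {
  const base = Math.floor(rowCount / n);
  const extra = rowCount % n; // assumption: the first `extra` groups get one more row
  const ids = [];
  for (let group = 1; group <= n; group++) {
    const size = base + (group <= extra ? 1 : 0);
    for (let k = 0; k < size; k++) {
      ids.push(group);
    }
  }
  return ids;
}

console.log(ntileSketch(8, 4).join(","));  // prints 1,1,2,2,3,3,4,4
console.log(ntileSketch(10, 4).join(",")); // prints 1,1,1,2,2,2,3,3,4,4
```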

(static) percentRank()

Window function. Returns the relative rank (i.e. percentile) of rows within a window partition.

This is computed by: (rank of row in its partition - 1) / (number of rows in the partition - 1)

This is equivalent to the PERCENT_RANK function in SQL.

Since:
  • 1.4.0
Source:
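Example
The formula above can be applied to one ordered partition as follows. This is a plain-JavaScript sketch, not the library's implementation; percentRankSketch is a hypothetical name, and returning 0 for a single-row partition is an assumption made to avoid dividing by zero.

```javascript
// Applies (rank - 1) / (rows - 1) to values sorted in ascending order.
// percentRankSketch is a hypothetical helper, not part of the Functions API.
function percentRankSketch(sortedValues) {
  const n = sortedValues.length;
  let rank = 0;
  return sortedValues.map(function (v, i) {
    if (i === 0 || v !== sortedValues[i - 1]) {
      rank = i + 1; // tied rows share the rank of the first row in the tie
    }
    // Assumption: a partition with a single row gets 0.
    return n > 1 ? (rank - 1) / (n - 1) : 0;
  });
}

console.log(percentRankSketch([10, 20, 20, 30, 40]).join(", "));
// prints 0, 0.25, 0.25, 0.75, 1
```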

(static) pmod(dividend, divisor)

Returns the positive value of dividend mod divisor.

Parameters:
Name Type Description
dividend

Column.

divisor

Column.

Since:
  • 1.5.0
Source:
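Example
JavaScript's % operator keeps the sign of the dividend, so a positive modulus needs one extra step. The sketch below shows the arithmetic for plain numbers with a positive divisor; it is not the library's implementation, and pmodSketch is a hypothetical name.

```javascript
// Positive modulus for plain numbers, assuming divisor > 0.
// pmodSketch is a hypothetical helper, not part of the Functions API.
function pmodSketch(dividend, divisor) {
  // The inner % may be negative; adding the divisor and taking % again fixes the sign.
  return ((dividend % divisor) + divisor) % divisor;
}

console.log(-7 % 3);            // prints -1
console.log(pmodSketch(-7, 3)); // prints 2
```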

(static) pow(col)

Returns the value of the first argument raised to the power of the second argument.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) quarter(col)

Extracts the quarter as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) rand(seedopt)

Generate a random column with i.i.d. samples from U[0.0, 1.0].

Parameters:
Name Type Attributes Description
seed <optional>
Since:
  • 1.4.0
Source:

(static) randn(seedopt)

Generate a column with i.i.d. samples from the standard normal distribution.

Parameters:
Name Type Attributes Description
seed <optional>
Since:
  • 1.4.0
Source:

(static) rank()

Window function. Returns the rank of rows within a window partition.

The difference between rank and denseRank is that denseRank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using denseRank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. With rank, the person after the three-way tie would instead come in fifth.

This is equivalent to the RANK function in SQL.

Since:
  • 1.4.0
Source:

(static) regexp_extract(col, regex, groupIdx)

Extracts a specific group identified by a Java regex from the specified string column.

Parameters:
Name Type Description
col
regex
groupIdx

Group index.

Since:
  • 1.5.0
Source:

(static) regexp_replace(col, re, replacement)

Replaces all substrings of the specified string column that match the regular expression re with replacement.

Parameters:
Name Type Description
col
re
replacement

Replacement string.

Since:
  • 1.5.0
Source:

(static) repeat(col, n)

Repeats a string column n times, and returns it as a new string column.

Parameters:
Name Type Description
col
n
Since:
  • 1.5.0
Source:

(static) reverse(col)

Reverses the string column and returns it as a new string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) rint(col)

Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) round(col, scaleopt)

Returns the value of the column rounded to scale decimal places (0 by default).

Parameters:
Name Type Attributes Default Description
col
scale <optional>
0
Since:
  • 1.5.0
Source:

(static) rowNumber()

Window function. Returns a sequential number starting at 1 within a window partition.

This is equivalent to the ROW_NUMBER function in SQL.

Since:
  • 1.4.0
Source:

(static) rpad(col, len, pad)

Right-pad the string column with the contents of 'pad', to a max length of 'len'.

Parameters:
Name Type Description
col
len
pad
Since:
  • 1.5.0
Source:

(static) rtrim(col)

Trims the spaces from the right end of the specified string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) second(col)

Extracts the seconds as an integer from a given date/timestamp/string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) sha1(col)

Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) sha2(col, numBits)

Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.

Parameters:
Name Type Description
col

Column to compute SHA-2 on.

numBits

One of 224, 256, 384, or 512.

Since:
  • 1.5.0
Source:

(static) shiftLeft(col, numBits)

Shift the given column numBits left.

Parameters:
Name Type Description
col
numBits
Since:
  • 1.5.0
Source:

(static) shiftRight(col, numBits)

Shift the given column numBits right.

Parameters:
Name Type Description
col
numBits
Since:
  • 1.5.0
Source:

(static) shiftRightUnsigned(col, numBits)

Unsigned shift the given column numBits right.

Parameters:
Name Type Description
col
numBits
Since:
  • 1.5.0
Source:

(static) signum(col)

Computes the signum of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) sin(col)

Computes the sine of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) sinh(col)

Computes the hyperbolic sine of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) size(col)

Returns length of array or map.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) skewness(col)

Aggregate function. Returns the skewness of the values in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) sort_array(col, ascopt)

Sorts the input array for the given column in ascending order (or in descending order if asc is false), according to the natural ordering of the array elements.

Parameters:
Name Type Attributes Default Description
col
asc <optional>
true
Since:
  • 1.5.0
Source:

(static) soundex(col)

Returns the soundex code for the specified expression.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) sparkPartitionId()

Partition ID of the Spark task.

Note that this is non-deterministic because it depends on data partitioning and task scheduling.

Since:
  • 1.4.0
Source:

(static) split(col, regex)

Splits the string column around matches of regex (which is a Java regular expression).

Parameters:
Name Type Description
col
regex
Since:
  • 1.5.0
Source:

(static) sqrt(col)

Computes the square root of the specified float Column.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) stddev(col)

Aggregate function. Alias for Functions.stddev_samp.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) stddev_pop(col)

Aggregate function. Returns the population standard deviation of the expression in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) stddev_samp(col)

Aggregate function. Returns the sample standard deviation of the expression in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) struct(cols)

Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (i.e. aliased), its name is the StructField's name, otherwise, the newly generated StructField's name is auto generated as col${index + 1}, i.e. col1, col2, col3, ...

Parameters:
Name Type Description
cols
Since:
  • 1.4.0
Source:

(static) substring(col, pos, len)

Returns the substring that starts at pos and is of length len when col is of String type, or the slice of the byte array that starts at pos (in bytes) and is of length len when col is of Binary type.

Parameters:
Name Type Description
col
pos
len
Since:
  • 1.5.0
Source:

(static) substring_index(col, delim, count)

Returns the substring from the string column col before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.

Parameters:
Name Type Description
col
delim
count
Source:
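Example
A sketch of the positive and negative count behavior on a single string, in plain JavaScript. It is not the library's implementation; substringIndexSketch is a hypothetical name, delim is assumed to be non-empty, and the handling of a zero count and of a count larger than the number of delimiters are assumptions of this sketch.

```javascript
// Sketch of substring_index for one string value.
// substringIndexSketch is a hypothetical helper, not part of the Functions API.
function substringIndexSketch(str, delim, count) {
  if (count === 0) {
    return ""; // assumption: a zero count yields an empty string
  }
  const parts = str.split(delim);
  if (Math.abs(count) >= parts.length) {
    return str; // assumption: too few delimiters returns the whole string
  }
  return count > 0
    ? parts.slice(0, count).join(delim) // everything left of the count-th delimiter
    : parts.slice(count).join(delim);   // everything right of it, counting from the right
}

console.log(substringIndexSketch("www.apache.org", ".", 2));  // prints www.apache
console.log(substringIndexSketch("www.apache.org", ".", -2)); // prints apache.org
```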

(static) sum(col)

Aggregate function. Returns the sum of all values in the expression.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) sumDistinct(col)

Aggregate function. Returns the sum of distinct values in the expression.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) tan(col)

Computes the tangent of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) tanh(col)

Computes the hyperbolic tangent of the given column.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) to_date(col)

Converts the column into DateType.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) to_utc_timestamp(col, tz)

Assumes given timestamp column is in given timezone and converts to UTC.

Parameters:
Name Type Description
col
tz
Since:
  • 1.5.0
Source:

(static) toDegrees(col)

Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) toRadians(col)

Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

Parameters:
Name Type Description
col
Since:
  • 1.4.0
Source:

(static) translate(col, matchingString, replaceString)

Translates any character in the column that appears in matchingString into the character at the same position in replaceString. The characters in replaceString correspond, position by position, to the characters in matchingString.

Parameters:
Name Type Description
col
matchingString
replaceString
Since:
  • 1.5.0
Source:
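Example
The character-by-character mapping can be sketched on a single string in plain JavaScript. This is not the library's implementation; translateSketch is a hypothetical name, and dropping characters that have no counterpart in replaceString is an assumption of this sketch.

```javascript
// Sketch of translate for one string value.
// translateSketch is a hypothetical helper, not part of the Functions API.
function translateSketch(src, matchingString, replaceString) {
  let out = "";
  for (const ch of src) {
    const idx = matchingString.indexOf(ch);
    if (idx < 0) {
      out += ch; // not listed in matchingString: keep unchanged
    } else if (idx < replaceString.length) {
      out += replaceString[idx]; // replace with the character at the same position
    }
    // Assumption: a matched character with no counterpart is dropped.
  }
  return out;
}

console.log(translateSketch("AaBbCc", "abc", "123")); // prints A1B2C3
```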

(static) trim(col)

Trims the spaces from both ends of the specified string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) trunc(col, format)

Returns date truncated to the unit specified by the format.

Parameters:
Name Type Description
col
format

'year', 'yyyy', 'yy' to truncate by year, or 'month', 'mon', 'mm' to truncate by month.

Since:
  • 1.5.0
Source:

(static) unbase64(col)

Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) unhex(col)

Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of number.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) unix_timestamp(colopt, formatopt)

Returns a Unix timestamp in seconds. If no arguments are passed, returns current time.

If col is passed, it is parsed with the given format (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html). The format defaults to "yyyy-MM-dd HH:mm:ss".

Parameters:
Name Type Attributes Default Description
col <optional>
null
format <optional>
null
Since:
  • 1.5.0
Source:

(static) upper(col)

Converts a string column to upper case.

Parameters:
Name Type Description
col
Since:
  • 1.3.0
Source:

(static) var_pop(col)

Aggregate function. Returns the population variance of the values in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) var_samp(col)

Aggregate function. Returns the unbiased variance of the values in a group.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) variance(col)

Aggregate function. Alias for Functions.var_samp.

Parameters:
Name Type Description
col
Since:
  • 1.6.0
Source:

(static) weekofyear(col)

Extracts the week number as an integer from a given date/timestamp/string column.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source:

(static) when(condition)

Evaluates an array of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.

Parameters:
Name Type Description
condition

Column.

Since:
  • 1.4.0
Source:
Example
// Example: encoding gender string column into integer.

  people.select(F.when(F.col("gender").equalTo("male"), 0)
    .when(F.col("gender").equalTo("female"), 1)
    .otherwise(2));

(static) year(col)

Extracts the year as an integer from a given date/timestamp/string.

Parameters:
Name Type Description
col
Since:
  • 1.5.0
Source: