Deletes inner whitespace and leaves leading and trailing whitespace
```scala
val actualDF = sourceDF.withColumn(
  "some_string_anti_trimmed",
  antiTrim(col("some_string"))
)
```
Removes all inner whitespace, but doesn't delete leading or trailing whitespace (e.g. changes " this has some " to " thishassome ").
Like the array() function, but doesn't include null elements.
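A minimal sketch of how this can be used (in spark-daria the function is named `arrayExNull`; treat the exact name and signature here as an assumption, and `best`/`worst` as hypothetical columns):

```scala
import org.apache.spark.sql.functions.col
import com.github.mrpowers.spark.daria.sql.functions._

// Behaves like array(), but null elements are dropped: a row with
// best = "a" and worst = null yields ["a"] rather than ["a", null]
val actualDF = sourceDF.withColumn(
  "words",
  arrayExNull(col("best"), col("worst"))
)
```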
Like initcap, but factors in additional delimiters.
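A hedged sketch of the idea (the exact shape of the delimiters argument is an assumption here and may differ from the spark-daria signature):

```scala
import org.apache.spark.sql.functions.{col, lit}
import com.github.mrpowers.spark.daria.sql.functions._

// initcap only capitalizes the letter after a space; this also
// capitalizes after the supplied delimiters, e.g. "bat/man" => "Bat/Man"
val actualDF = sourceDF.withColumn(
  "some_string_capitalized",
  capitalizeFully(col("some_string"), lit(List('/', ' ')))
)
```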
Convert an Excel epoch to date.
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToDate` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("date", excelEpochToDate("excel_time"))

actualDF.show()
```

```
+-----------------+----------+
|       excel_time|      date|
+-----------------+----------+
|43967.24166666666|2020-05-16|
|33966.78333333378|1992-12-28|
|43965.58383363244|2020-05-14|
|33964.58393533934|1992-12-26|
+-----------------+----------+
```
Convert an Excel epoch to date.
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToDate` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("date", excelEpochToDate(col("excel_time")))

actualDF.show()
```

```
+-----------------+----------+
|       excel_time|      date|
+-----------------+----------+
|43967.24166666666|2020-05-16|
|33966.78333333378|1992-12-28|
|43965.58383363244|2020-05-14|
|33964.58393533934|1992-12-26|
+-----------------+----------+
```
Convert an Excel epoch to timestamp.
Inspired by [Filip Czaja](http://fczaja.blogspot.com/2011/06/convert-excel-date-into-timestamp.html).
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToTimestamp` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("timestamp", excelEpochToTimestamp("excel_time"))

actualDF.show()
```

```
+-----------------+-------------------+
|       excel_time|          timestamp|
+-----------------+-------------------+
|43967.24166666666|2020-05-16 05:47:59|
|33966.78333333378|1992-12-28 18:48:00|
|43965.58383363244|2020-05-14 14:00:43|
|33964.58393533934|1992-12-26 14:00:52|
+-----------------+-------------------+
```
Convert an Excel epoch to timestamp.
Inspired by [Filip Czaja](http://fczaja.blogspot.com/2011/06/convert-excel-date-into-timestamp.html).
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToTimestamp` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("timestamp", excelEpochToTimestamp(col("excel_time")))

actualDF.show()
```

```
+-----------------+-------------------+
|       excel_time|          timestamp|
+-----------------+-------------------+
|43967.24166666666|2020-05-16 05:47:59|
|33966.78333333378|1992-12-28 18:48:00|
|43965.58383363244|2020-05-14 14:00:43|
|33964.58393533934|1992-12-26 14:00:52|
+-----------------+-------------------+
```
Convert an Excel epoch to unix timestamp.
Inspired by [Filip Czaja](http://fczaja.blogspot.com/2011/06/convert-excel-date-into-timestamp.html).
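The conversion described in that post can be sketched in plain Scala. The constants are standard: there are 25569 days between the Excel epoch (1899-12-30) and the Unix epoch (1970-01-01), and 86400 seconds in a day; the `excelEpochToUnix` helper below is purely illustrative, not the library's implementation.

```scala
// Excel stores timestamps as fractional days since 1899-12-30.
// Subtract the 25569-day offset to the Unix epoch, then scale days to seconds.
def excelEpochToUnix(excelDays: Double): Double =
  (excelDays - 25569) * 86400

// e.g. excelEpochToUnix(43967.24166666666) is roughly 1.58960808e9,
// matching the first row of the example output
```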
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToUnixTimestamp` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("unix_timestamp", excelEpochToUnixTimestamp("excel_time"))

actualDF.show()
```

```
+-----------------+--------------------+
|       excel_time|      unix_timestamp|
+-----------------+--------------------+
|43967.24166666666|1.5896080799999995E9|
|33966.78333333378| 7.255684800000383E8|
|43965.58383363244|1.5894648432258427E9|
|33964.58393533934| 7.253784520133189E8|
+-----------------+--------------------+
```
Convert an Excel epoch to unix timestamp.
Inspired by [Filip Czaja](http://fczaja.blogspot.com/2011/06/convert-excel-date-into-timestamp.html).
Let's see how it works:
Suppose we have the following testDF
```
+-----------------+
|       excel_time|
+-----------------+
|43967.24166666666|
|33966.78333333378|
|43965.58383363244|
|33964.58393533934|
+-----------------+
```
We can run the `excelEpochToUnixTimestamp` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDF = testDF
  .withColumn("unix_timestamp", excelEpochToUnixTimestamp(col("excel_time")))

actualDF.show()
```

```
+-----------------+--------------------+
|       excel_time|      unix_timestamp|
+-----------------+--------------------+
|43967.24166666666|1.5896080799999995E9|
|33966.78333333378| 7.255684800000383E8|
|43965.58383363244|1.5894648432258427E9|
|33964.58393533934| 7.253784520133189E8|
+-----------------+--------------------+
```
Returns true if multiple columns are equal to a given value
Suppose we have the following sourceDF:
```
+---+---+
| s1| s2|
+---+---+
|cat|cat|
|cat|dog|
|pig|pig|
+---+---+
```
We can use the `multiEquals` function to see if multiple columns are equal to "cat".
```scala
val actualDF = sourceDF.withColumn(
  "are_s1_and_s2_cat",
  multiEquals[String]("cat", col("s1"), col("s2"))
)

actualDF.show()
```

```
+---+---+-----------------+
| s1| s2|are_s1_and_s2_cat|
+---+---+-----------------+
|cat|cat|             true|
|cat|dog|            false|
|pig|pig|            false|
+---+---+-----------------+
```
Removes all whitespace in a string
```scala
val actualDF = sourceDF.withColumn(
  "some_string_without_whitespace",
  removeAllWhitespace(col("some_string"))
)
```
Since 0.16.0
Removes all whitespace in a string
```scala
val actualDF = sourceDF.withColumn(
  "some_string_without_whitespace",
  removeAllWhitespace(col("some_string"))
)
```
Removes all whitespace in a string (e.g. changes "this has some" to "thishassome").
Since 0.16.0
Removes all non-word characters from a string
```scala
val actualDF = sourceDF.withColumn(
  "some_string_remove_non_word_chars",
  removeNonWordCharacters(col("some_string"))
)
```
Removes all non-word characters from a string, excluding whitespace (e.g. changes " ni!!ce h^^air person " to " nice hair person ").
Replaces all whitespace in a string with single spaces
```scala
val actualDF = sourceDF.withColumn(
  "some_string_single_spaced",
  singleSpace(col("some_string"))
)
```
Replaces all runs of consecutive spaces with single spaces (e.g. changes "this  has     some" to "this has some").
Truncates the length of StringType columns
```scala
sourceDF.withColumn(
  "some_string_truncated",
  truncate(col("some_string"), 3)
)
```
Truncates the "some_string" column to only have three characters.
Returns the number of years from `start` to `end`.
There is a `datediff` function that calculates the number of days between two dates, but there isn't a `yeardiff` function that calculates the number of years between two dates.
The `com.github.mrpowers.spark.daria.sql.functions.yeardiff` function fills the gap. Let's see how it works!
Suppose we have the following testDf
```
+--------------------+--------------------+
|      first_datetime|     second_datetime|
+--------------------+--------------------+
|2016-09-10 00:00:...|2001-08-10 00:00:...|
|2016-04-18 00:00:...|2010-05-18 00:00:...|
|2016-01-10 00:00:...|2013-08-10 00:00:...|
|                null|                null|
+--------------------+--------------------+
```
We can run the `yeardiff` function as follows:
```scala
import com.github.mrpowers.spark.daria.sql.functions._

val actualDf = testDf
  .withColumn("num_years", yeardiff(col("first_datetime"), col("second_datetime")))

actualDf.show()
```

```
+--------------------+--------------------+------------------+
|      first_datetime|     second_datetime|         num_years|
+--------------------+--------------------+------------------+
|2016-09-10 00:00:...|2001-08-10 00:00:...|15.095890410958905|
|2016-04-18 00:00:...|2010-05-18 00:00:...| 5.923287671232877|
|2016-01-10 00:00:...|2013-08-10 00:00:...| 2.419178082191781|
|                null|                null|              null|
+--------------------+--------------------+------------------+
```
(Since version 0.38.2) Removing for Spark 3
Like the Scala Array exists method, but for ArrayType columns. Scala has an Array#exists function that works like this:
```scala
Array(1, 2, 5).exists(_ % 2 == 0) // true
```
Suppose we have the following sourceDF:
```
+---------+
|     nums|
+---------+
|[1, 4, 9]|
|[1, 3, 5]|
+---------+
```
We can use the spark-daria `exists` function to see if there are even numbers in the arrays in the `nums` column.
```scala
val actualDF = sourceDF.withColumn(
  "nums_has_even",
  exists[Int]((x: Int) => x % 2 == 0).apply(col("nums"))
)

actualDF.show()
```

```
+---------+-------------+
|     nums|nums_has_even|
+---------+-------------+
|[1, 4, 9]|         true|
|[1, 3, 5]|        false|
+---------+-------------+
```
(Since version 0.38.2) Removing for Spark 3
Like Scala Array forall method, but for ArrayType columns
Scala has an Array#forall function that works like this:
```scala
Array("catdog", "crazy cat").forall(_.contains("cat")) // true
```
Suppose we have the following sourceDF:
```
+------------+
|       words|
+------------+
|[snake, rat]|
|[cat, crazy]|
+------------+
```
We can use the spark-daria `forall` function to see if all the strings in an array begin with the letter "c".
```scala
val actualDF = sourceDF.withColumn(
  "all_words_begin_with_c",
  forall[String]((x: String) => x.startsWith("c")).apply(col("words"))
)

actualDF.show()
```

```
+------------+----------------------+
|       words|all_words_begin_with_c|
+------------+----------------------+
|[snake, rat]|                 false|
|[cat, crazy]|                  true|
+------------+----------------------+
```
(Since version 0.38.2) Removing for Spark 3
Returns an array of elements after applying a transformation to each element in the input array with its index.
Suppose we have the following sourceDF:
```
+---------+
|     nums|
+---------+
|[1, 2, 3]|
+---------+
```
```scala
val actualDF = sourceDF.withColumn(
  "idx_sum",
  transform(col("nums"), (x, i) => x + i)
)

actualDF.show()
```

```
+---------+---------+
|     nums|  idx_sum|
+---------+---------+
|[1, 2, 3]|[1, 3, 5]|
+---------+---------+
```
(Since version 0.38.2) Removing for Spark 3
Returns an array of elements after applying a transformation to each element in the input array.
Suppose we have the following sourceDF:
```
+---------+
|     nums|
+---------+
|[1, 4, 9]|
|[1, 3, 5]|
+---------+
```
```scala
val actualDF = sourceDF.withColumn(
  "squared",
  transform(col("nums"), x => x * x)
)

actualDF.show()
```

```
+---------+-----------+
|     nums|    squared|
+---------+-----------+
|[1, 4, 9]|[1, 16, 81]|
|[1, 3, 5]| [1, 9, 25]|
+---------+-----------+
```
(Since version 0.38.2) Removing for Spark 3
Spark [has a ton of SQL functions](https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/functions.html) and spark-daria is meant to fill in any gaps.