Runs regexp_replace on all StringType columns in a DataFrame
val actualDF = sourceDF.transform(
  transformations.bulkRegexpReplace("cool", "dude")
)
Replaces all "cool" strings in all the sourceDF columns of StringType with the string "dude".
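One way such a bulk replacement could be written with the public Spark API is sketched below. `bulkRegexpReplaceSketch` and the sample data are hypothetical names used for illustration; this is not spark-daria's source, only a minimal sketch that assumes a fold over the StringType columns of the schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

object BulkRegexpReplaceSketch {
  // Hypothetical stand-in for transformations.bulkRegexpReplace (an assumption, not the library code).
  def bulkRegexpReplaceSketch(pattern: String, replacement: String)(df: DataFrame): DataFrame = {
    // Collect the names of the columns whose data type is StringType.
    val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
    // Rewrite each string column in turn; columns of other types are left untouched.
    stringCols.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, regexp_replace(col(name), pattern, replacement))
    }
  }

  // Small demo on made-up data: one StringType column and one IntegerType column.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val sourceDF = Seq(("Bart cool", 25), ("Lisa frost", 27)).toDF("person", "age")
    sourceDF.transform(bulkRegexpReplaceSketch("cool", "dude"))
  }
}
```

Because the function is curried and ends with a DataFrame parameter list, it can be passed directly to Dataset#transform, as in `demo`.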
Convert camel case columns to snake case Example: SomeColumn -> some_column
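The conversion of a single column name can be sketched in plain Scala. `camelToSnake` is a hypothetical helper written for illustration, not spark-daria's implementation, and its treatment of acronyms is an assumption.

```scala
object CamelToSnakeSketch {
  // Hypothetical helper (an assumption): converts one camel case name to snake case.
  def camelToSnake(name: String): String =
    name
      // Separate a run of capitals from a following capitalized word: "HTTPServer" -> "HTTP_Server".
      .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
      // Separate a lowercase letter or digit from a following capital: "SomeColumn" -> "Some_Column".
      .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
      .toLowerCase
}
```

With this sketch, `CamelToSnakeSketch.camelToSnake("SomeColumn")` returns `some_column`.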
Extracts an object from a JSON field with a specified path expression
val sourceDF = spark.createDF(
  List(
    (10, """{"name": "Bart cool", "age": 25}"""),
    (20, """{"name": "Lisa frost", "age": 27}""")
  ),
  List(
    ("id", IntegerType, true),
    ("person", StringType, true)
  )
)

val actualDF = sourceDF.transform(
  transformations.extractFromJson("person", "name", "$.name")
)

actualDF.show()
+---+---------------------------------+----------------+
|id |person                           |name            |
+---+---------------------------------+----------------+
|10 |{"name": "Bart cool", "age": 25} |"Bart cool"     |
|20 |{"name": "Lisa frost", "age": 27}|"Lisa frost"    |
+---+---------------------------------+----------------+
Extracts an object from a JSON field with a specified schema
val sourceDF = spark.createDF(
  List(
    (10, """{"name": "Bart cool", "age": 25}"""),
    (20, """{"name": "Lisa frost", "age": 27}""")
  ),
  List(
    ("id", IntegerType, true),
    ("person", StringType, true)
  )
)

val personSchema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val actualDF = sourceDF.transform(
  transformations.extractFromJson("person", "personData", personSchema)
)

actualDF.show()
+---+---------------------------------+----------------+
|id |person                           |personData      |
+---+---------------------------------+----------------+
|10 |{"name": "Bart cool", "age": 25} |[Bart cool, 25] |
|20 |{"name": "Lisa frost", "age": 27}|[Lisa frost, 27]|
+---+---------------------------------+----------------+
Changes all the column names in a DataFrame
Runs regexp_replace on multiple columns
val actualDF = sourceDF.transform(
  transformations.multiRegexpReplace(
    List(col("person"), col("phone")),
    "cool",
    "dude"
  )
)
Replaces all "cool" strings in the person and phone columns with the string "dude".
snake_cases all the columns of a DataFrame
spark-daria defines a com.github.mrpowers.spark.daria.sql.transformations.snakeCaseColumns transformation to convert all the column names to snake_case.
import com.github.mrpowers.spark.daria.sql.transformations._
val sourceDf = Seq(
  ("funny", "joke")
).toDF("A b C", "de F")

val actualDf = sourceDf.transform(snakeCaseColumns)

actualDf.show()
+-----+----+
|a_b_c|de_f|
+-----+----+
|funny|joke|
+-----+----+
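The renaming rule suggested by this example (lowercase everything, turn whitespace into underscores) can be sketched on plain column-name strings. `toSnakeCase` and `snakeCaseAll` are hypothetical helpers, and the exact rule is an assumption inferred from the output shown, not taken from the library source.

```scala
object SnakeCaseNamesSketch {
  // Hypothetical rule (an assumption): trim, lowercase, and replace each whitespace run with one underscore.
  def toSnakeCase(name: String): String =
    name.trim.toLowerCase.replaceAll("\\s+", "_")

  // Apply the rule to every column name, preserving column order.
  def snakeCaseAll(columnNames: Seq[String]): Seq[String] =
    columnNames.map(toSnakeCase)
}
```

Applied to the names in the example, `snakeCaseAll(Seq("A b C", "de F"))` yields `a_b_c` and `de_f`. On a real DataFrame, the new names could be applied one by one with `withColumnRenamed`.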
snakifies all the columns of a DataFrame
import com.github.mrpowers.spark.daria.sql.transformations._
val sourceDf = Seq(
  ("funny", "joke")
).toDF("ThIs", "BiH")

val actualDf = sourceDf.transform(snakifyColumns())

actualDf.show()
+-----+----+
|th_is|bi_h|
+-----+----+
|funny|joke|
+-----+----+
Sorts the columns of a DataFrame alphabetically
The sortColumns transformation sorts the columns in a DataFrame alphabetically.
Suppose you start with the following sourceDF:
+-----+---+-----+
| name|age|sport|
+-----+---+-----+
|pablo| 3| polo|
+-----+---+-----+
Run the code:
val actualDF = sourceDF.transform(sortColumns())
Here’s the actualDF:
+---+-----+-----+
|age| name|sport|
+---+-----+-----+
| 3|pablo| polo|
+---+-----+-----+
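The reordering above can be modeled on plain collections to show that values stay aligned with their column names. This is only a sketch: `SortColumnsSketch` is a hypothetical object, and a table is represented here as a header plus rows of strings rather than a DataFrame.

```scala
object SortColumnsSketch {
  // Hypothetical model (an assumption): a table is a header and rows whose values follow header order.
  def sortColumns(header: Seq[String], rows: Seq[Seq[String]]): (Seq[String], Seq[Seq[String]]) = {
    // Positions of the original columns, listed in alphabetical order of their names.
    val order = header.zipWithIndex.sortBy(_._1).map(_._2)
    // Reorder the header and every row with the same permutation.
    (order.map(i => header(i)), rows.map(row => order.map(i => row(i))))
  }
}
```

For the table above, the header `name, age, sport` becomes `age, name, sport` and the row `pablo, 3, polo` becomes `3, pablo, polo`. On a DataFrame, the same effect can be had by selecting `df.columns.sorted` as columns.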
Title Cases all the columns of a DataFrame
Truncates multiple columns in a DataFrame
val columnLengths: Map[String, Int] = Map(
  "person" -> 2,
  "phone" -> 3
)

sourceDF.transform(
  truncateColumns(columnLengths)
)
Limits the "person" column to 2 characters and the "phone" column to 3 characters.
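The effect of the length map on a single row can be sketched without Spark. `truncateRow` and the sample values are hypothetical, and a row is modeled as a Map from column name to string value purely for illustration.

```scala
object TruncateColumnsSketch {
  // Hypothetical model (an assumption): a row is a Map from column name to string value.
  def truncateRow(row: Map[String, String], columnLengths: Map[String, Int]): Map[String, String] =
    row.map { case (name, value) =>
      columnLengths.get(name) match {
        case Some(maxLength) => name -> value.take(maxLength) // listed column: keep the first maxLength characters
        case None            => name -> value                 // unlisted column: leave the value unchanged
      }
    }
}
```

With `Map("person" -> 2, "phone" -> 3)`, a row holding `Bart cool` and `555-1234` comes back as `Ba` and `555`, and any column not named in the map is untouched.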
Categorizes a numeric column in various user specified "buckets"
Strips out invalid characters and replaces spaces with underscores to make Parquet compatible column names
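A sketch of such a name cleanup on a single string follows. `parquetCompatible` is a hypothetical helper, and the set of stripped characters (`,;{}()=`, newline, tab) is an assumption based on the characters Spark's Parquet writer rejects in field names; it is not taken from the library source.

```scala
object ParquetNamesSketch {
  // Hypothetical helper (an assumption): drop characters Parquet field names cannot contain,
  // then replace each space with an underscore.
  def parquetCompatible(name: String): String =
    name
      .replaceAll("[,;{}()\\n\\t=]", "") // strip , ; { } ( ) newline tab =
      .replaceAll(" ", "_")              // spaces become underscores
}
```

For example, `parquetCompatible("Person Name (full)")` returns `Person_Name_full`.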
Functions available for DataFrame operations.
SQL transformations take a DataFrame as an argument and return a DataFrame. They are suitable arguments for the Dataset#transform method.
It's convenient to work with DataFrames that have snake_case column names. Column names with spaces make it harder to write SQL queries.
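To illustrate why functions from DataFrame to DataFrame fit Dataset#transform, here is a minimal sketch that chains two custom transformations. `withGreeting`, `withUpperName`, and the sample data are hypothetical and are not part of spark-daria.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, upper}

object TransformChainSketch {
  // Hypothetical transformation: appends a constant "greeting" column.
  def withGreeting()(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hello"))

  // Hypothetical transformation: upper-cases the values of the named column.
  def withUpperName(colName: String)(df: DataFrame): DataFrame =
    df.withColumn(colName, upper(col(colName)))

  // Each call supplies the first parameter list; transform supplies the DataFrame.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("pablo").toDF("name")
      .transform(withGreeting())
      .transform(withUpperName("name"))
  }
}
```

Currying the configuration arguments apart from the DataFrame argument is what lets each step read as one link in the chain.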