Returns the columns in otherDF that aren't in self
Executes a list of transformations in CustomTransform objects. Uses function composition.
Executes a list of custom DataFrame transformations. Uses function composition to run a list of DataFrame transformations.
def withGreeting()(df: DataFrame): DataFrame = { df.withColumn("greeting", lit("hello world")) }
def withCat(name: String)(df: DataFrame): DataFrame = { df.withColumn("cats", lit(name + " meow")) }
sourceDF.composeTransforms(withGreeting(), withCat("sandy"))
Executes a list of custom DataFrame transformations. Uses function composition to run a list of DataFrame transformations.
def withGreeting()(df: DataFrame): DataFrame = { df.withColumn("greeting", lit("hello world")) }
def withCat(name: String)(df: DataFrame): DataFrame = { df.withColumn("cats", lit(name + " meow")) }
val transforms = List( withGreeting()(_), withCat("sandy")(_) )
sourceDF.composeTransforms(transforms)
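The function composition mentioned above can be pictured as a left fold over the list of transformations. The following is a minimal sketch of that idea, not necessarily how composeTransforms is implemented: ComposeSketch and composeAll are hypothetical names, and a plain Map[String, String] stands in for a DataFrame so the snippet runs without Spark.

```scala
object ComposeSketch {
  // Hypothetical helper: feeds the result of each transformation into the next one.
  def composeAll[A](transforms: List[A => A])(start: A): A =
    transforms.foldLeft(start)((acc, transform) => transform(acc))

  // Stand-in for a DataFrame (assumption for illustration): one "row" keyed by column name.
  type Row = Map[String, String]

  def withGreeting()(row: Row): Row = row + ("greeting" -> "hello world")

  def withCat(name: String)(row: Row): Row = row + ("cats" -> (name + " meow"))

  def main(args: Array[String]): Unit = {
    val transforms = List[Row => Row](withGreeting()(_), withCat("sandy")(_))
    println(composeAll(transforms)(Map("team" -> "jets")))
  }
}
```

The standard library's Function.chain builds the same left-to-right composition from a sequence of functions, so composeAll(transforms)(start) and Function.chain(transforms)(start) should agree.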
Returns true if the DataFrame contains the StructField
sourceDF.containsColumn(StructField("team", StringType, true))
Returns true if sourceDF contains the StructField and false otherwise.
Returns true if the DataFrame contains the column
sourceDF.containsColumn("team")
Returns true if sourceDF contains a column named "team" and false otherwise.
Returns true if the DataFrame contains all the columns
sourceDF.containsColumns("team", "city")
Returns true if sourceDF contains the "team" and "city" columns and false otherwise.
Drops multiple columns that satisfy the conditions of a function.
Here is how to drop all columns that start with an underscore:
df.dropColumns(_.startsWith("_"))
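One way a predicate-based drop like this could be written against the public Spark API is to filter the column names and pass the matches to drop. This is a sketch under that assumption, not necessarily how dropColumns is implemented. DropColumnsSketch and dropColumnsWhere are hypothetical names, and the snippet needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.DataFrame

object DropColumnsSketch {
  // Hypothetical helper: drops every column whose name satisfies the predicate.
  def dropColumnsWhere(df: DataFrame)(shouldDrop: String => Boolean): DataFrame =
    df.drop(df.columns.filter(shouldDrop).toSeq: _*)
}
```

With this helper, DropColumnsSketch.dropColumnsWhere(df)(_.startsWith("_")) would mirror the example above.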
Drops a nested column specified by its full name (for example, foo.bar).
Converts all the StructType columns to regular columns. This StackOverflow answer provides a detailed description of how to use flattenSchema: https://stackoverflow.com/a/50402697/1125159
Completely removes all duplicates from a DataFrame
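The wording "completely removes" suggests a contrast with a regular de-duplication, which keeps one copy of each duplicated row. Assuming that reading is right (the text above does not spell it out), the difference can be illustrated on a plain Scala collection. KillDuplicatesSketch and both helper names are hypothetical, and tuples stand in for rows.

```scala
object KillDuplicatesSketch {
  // Regular de-duplication: keeps one copy of every distinct row.
  def dropDuplicates[A](rows: Seq[A]): Seq[A] = rows.distinct

  // Assumed "kill" behavior: any row that occurs more than once is removed entirely.
  def killDuplicates[A](rows: Seq[A]): Seq[A] = {
    val counts = rows.groupBy(identity).map { case (row, copies) => row -> copies.size }
    rows.filter(row => counts(row) == 1)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("jets", "football"), ("jets", "football"), ("nacional", "soccer"))
    println(dropDuplicates(rows)) // one "jets" row and the "nacional" row remain
    println(killDuplicates(rows)) // only the "nacional" row remains
  }
}
```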
Prints the schema with StructType and StructFields so it's easy to copy into code.
Spark has a printSchema method to print the schema of a DataFrame and a schema method that returns a StructType object.
The output of the Dataset#schema method can be easily converted into working code for small DataFrames, but it can be a lot of manual work for DataFrames with a lot of columns.
The printSchemaInCodeFormat DataFrame extension prints the DataFrame schema as a valid StructType object.
Suppose you have the following sourceDF:

+--------+--------+---------+
|    team|   sport|goals_for|
+--------+--------+---------+
|    jets|football|       45|
|nacional|  soccer|       10|
+--------+--------+---------+

`sourceDF.printSchemaInCodeFormat()` will output the following rows in the console:

StructType(
  List(
    StructField("team", StringType, true),
    StructField("sport", StringType, true),
    StructField("goals_for", IntegerType, true)
  )
)
Renames columns.
Here is how to lowercase all the columns:
df.renameColumns(_.toLowerCase)
Here is how to trim all the columns:
df.renameColumns(_.trim)
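A function-based rename like this can be sketched with the public Spark API by folding over the column names and calling withColumnRenamed for each one. This is an illustration under that assumption, not necessarily how renameColumns is implemented. RenameColumnsSketch and renameAll are hypothetical names, and the snippet needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.DataFrame

object RenameColumnsSketch {
  // Hypothetical helper: renames each column to rename(oldName), one column at a time.
  def renameAll(df: DataFrame)(rename: String => String): DataFrame =
    df.columns.foldLeft(df)((acc, name) => acc.withColumnRenamed(name, rename(name)))
}
```

The sketch does not guard against two columns mapping to the same new name (for example "Team" and "team" under _.toLowerCase), so callers would need to check for that themselves.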
Reorders the columns in a DataFrame as specified.
val actualDF = sourceDF.reorderColumns( Seq("greeting", "team", "cats") )
The actualDF will have the greeting column first, then the team column, then the cats column.
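Reordering by name can be sketched with a plain select over the requested column names. This is an illustration that assumes every column is listed, not necessarily how reorderColumns is implemented. ReorderColumnsSketch and reorder are hypothetical names, and the snippet needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ReorderColumnsSketch {
  // Hypothetical helper: selects the columns in the given order.
  // Any column left out of colNames is not part of the result.
  def reorder(df: DataFrame, colNames: Seq[String]): DataFrame =
    df.select(colNames.map(col): _*)
}
```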
Makes all columns nullable or vice versa
This method is the opposite of flattenSchema. For example, if you have a flat DataFrame with snake case columns, it will convert it to a DataFrame with nested columns.
From:
root
 |-- person_id: long (nullable = true)
 |-- person_name: string (nullable = true)
 |-- person_surname: string (nullable = true)

To:
root
 |-- person: struct (nullable = false)
 |    |-- name: string (nullable = true)
 |    |-- surname: string (nullable = true)
 |    |-- id: long (nullable = true)
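The From and To schemas suggest that columns are grouped by the part of the name before the underscore. As a rough illustration of that grouping step only, the following plain Scala sketch works on column names. It assumes the split happens at the first underscore (the text above does not say how deeper nesting or names without underscores are handled), and StructureSketch and groupByPrefix are hypothetical names.

```scala
object StructureSketch {
  // Splits "person_name" into ("person", "name") and groups the suffixes by prefix.
  // A name without an underscore is kept under its own name with no nested fields.
  def groupByPrefix(columns: Seq[String]): Map[String, Seq[String]] =
    columns
      .map { name =>
        name.split("_", 2) match {
          case Array(prefix, rest) => prefix -> Some(rest)
          case _                   => name -> None
        }
      }
      .groupBy(_._1)
      .map { case (prefix, pairs) => prefix -> pairs.flatMap(_._2) }

  def main(args: Array[String]): Unit =
    println(groupByPrefix(Seq("person_id", "person_name", "person_surname")))
}
```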
Like transform(), but for CustomTransform objects. Enables you to specify the columns that should be added / removed by a custom transformation and errors out if the columns that are actually added / removed are different.
val actualDF = sourceDF
  .trans(
    CustomTransform(
      transform = ExampleTransforms.withGreeting(),
      addedColumns = Seq("greeting"),
      requiredColumns = Seq("something")
    )
  )
  .trans(
    CustomTransform(
      transform = ExampleTransforms.withCat("spanky"),
      addedColumns = Seq("cats")
    )
  )
  .trans(
    CustomTransform(
      transform = ExampleTransforms.dropWordCol(),
      removedColumns = Seq("word")
    )
  )
Returns a new DataFrame with the column columnName cast as newType.
columnName: the column to cast
newType: the new type for columnName
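Casting a single column in place can be sketched with withColumn and Column.cast from the public Spark API. This is an illustration, not necessarily how the method above is implemented. CastSketch and castColumn are hypothetical names, and the snippet needs Spark SQL on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

object CastSketch {
  // Hypothetical helper: replaces columnName with the same column cast to newType.
  // withColumn keeps the column's position when the name already exists.
  def castColumn(df: DataFrame, columnName: String, newType: DataType): DataFrame =
    df.withColumn(columnName, col(columnName).cast(newType))
}
```

Column.cast also accepts the type as a string (for example "int"), so a variant taking a String for newType could be written the same way.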