DataFrameHelpers

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def columnToArray[T](df: DataFrame, colName: String)(implicit arg0: ClassTag[T]): Array[T]

Converts a DataFrame column to an Array of values.
N.B. This method uses collect, which pulls every row to the driver, so it should only be called on small DataFrames.
Suppose we have the following sourceDF:
```
+---+
|num|
+---+
|  1|
|  2|
|  3|
+---+
```
Run the code to convert the num column to an Array of values.
```
val actual = DataFrameHelpers.columnToArray[Int](sourceDF, "num")

// println on an Array prints a JVM reference, so format the elements explicitly
println(actual.mkString("Array(", ", ", ")"))

// Array(1, 2, 3)
```
def columnToList[T](df: DataFrame, colName: String)(implicit arg0: ClassTag[T]): List[T]

Converts a DataFrame column to a List of values.
N.B. This method uses collect, which pulls every row to the driver, so it should only be called on small DataFrames.
Suppose we have the following sourceDF:
```
+---+
|num|
+---+
|  1|
|  2|
|  3|
+---+
```
Run the code to convert the num column to a List of values.
```
val actual = DataFrameHelpers.columnToList[Int](sourceDF, "num")

println(actual)

// List(1, 2, 3)
```
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef

def printAthenaCreateTable(df: DataFrame, athenaTableName: String, s3location: String): Unit

Generates a CREATE TABLE query for AWS Athena.

Suppose we have the following df:
```
+--------+--------+---------+
|    team|   sport|goals_for|
+--------+--------+---------+
|    jets|football|       45|
|nacional|  soccer|       10|
+--------+--------+---------+
```
Run the code to print the CREATE TABLE query.
```
DataFrameHelpers.printAthenaCreateTable(df, "my_cool_athena_table", "s3://my-bucket/extracts/people")
```
This prints:
```
CREATE TABLE IF NOT EXISTS my_cool_athena_table(
  team STRING,
  sport STRING,
  goals_for INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/extracts/people'
```
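The shape of the generated query can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: `AthenaDdlSketch` and `createTableQuery` are hypothetical names, and the column types are passed in as plain strings instead of being read from a DataFrame schema.

```scala
// Hypothetical sketch: AthenaDdlSketch and createTableQuery are illustrative
// names and are not part of spark-daria.
object AthenaDdlSketch {
  // columns: (column name, Athena type name) pairs, in column order.
  def createTableQuery(
      tableName: String,
      columns: Seq[(String, String)],
      s3location: String
  ): String = {
    // One indented "name TYPE" line per column, separated by commas.
    val fields = columns
      .map { case (name, dataType) => s"  $name $dataType" }
      .mkString(",\n")
    s"CREATE TABLE IF NOT EXISTS $tableName(\n$fields\n)\nSTORED AS PARQUET\nLOCATION '$s3location'"
  }
}
```

Calling `AthenaDdlSketch.createTableQuery("my_cool_athena_table", Seq("team" -> "STRING", "sport" -> "STRING", "goals_for" -> "INT"), "s3://my-bucket/extracts/people")` returns the same text as the query shown above.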

def readTimestamped(dirname: String): DataFrame
lazy val spark: SparkSession
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef

def toArrayOfMaps(df: DataFrame): Array[Map[String, Any]]

Converts a DataFrame to an Array of Maps.
N.B. This method uses collect and should only be called on small DataFrames.

Suppose we have the following sourceDF:
```
+----------+-----------+---------+
|profession|some_number|pay_grade|
+----------+-----------+---------+
|    doctor|          4|     high|
|   dentist|         10|     high|
+----------+-----------+---------+
```
Run the code to convert this DataFrame into an array of Maps.
```
val actual = DataFrameHelpers.toArrayOfMaps(sourceDF)
```
The resulting array is equivalent to:
```
Array(
  Map("profession" -> "doctor", "some_number" -> 4, "pay_grade" -> "high"),
  Map("profession" -> "dentist", "some_number" -> 10, "pay_grade" -> "high")
)
```
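Because the return type of toArrayOfMaps is `Array[Map[String, Any]]`, callers need to recover the concrete value types themselves. A minimal sketch of one way to do that with pattern matching follows; `RowMapSketch` and `describe` are hypothetical names used only for illustration.

```scala
// Hypothetical sketch: RowMapSketch and describe are illustrative names
// and are not part of spark-daria.
object RowMapSketch {
  // Values are typed as Any, so match on the runtime type to recover them.
  def describe(row: Map[String, Any]): String =
    (row.get("profession"), row.get("some_number")) match {
      case (Some(p: String), Some(n: Int)) => s"$p -> $n"
      case _                               => "unexpected row shape"
    }
}
```

Applied to the two rows in the example above, `describe` yields `"doctor -> 4"` and `"dentist -> 10"`. A row with a missing key or a value of another type falls through to the default case instead of throwing.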

def toString(): String

Definition Classes
AnyRef → Any
def twoColumnsToMap[keyType, valueType](df: DataFrame, keyColName: String, valueColName: String)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[keyType], arg1: scala.reflect.api.JavaUniverse.TypeTag[valueType]): Map[keyType, valueType]

Converts two columns in a DataFrame to a Map of key-value pairs.
N.B. This method uses collect and should only be called on small DataFrames.
Suppose we have the following sourceDF:
```
+-----------+---------+
|     island|fun_level|
+-----------+---------+
|    boracay|        7|
|long island|        9|
+-----------+---------+
```
Let's convert this DataFrame to a Map with island as the key and fun_level as the value.
```
val actual = DataFrameHelpers.twoColumnsToMap[String, Integer](
  sourceDF,
  "island",
  "fun_level"
)

println(actual)

// Map(
//   "boracay" -> 7,
//   "long island" -> 9
// )
```
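One point worth checking before relying on twoColumnsToMap is what happens when the key column contains duplicates. The sketch below assumes the Map is built from collected (key, value) pairs with Scala's standard `toMap`; that is an assumption about the implementation, not something this page states. `PairsToMapSketch` is a hypothetical name.

```scala
// Hypothetical sketch: PairsToMapSketch is an illustrative name and is not
// part of spark-daria. The Seq stands in for the collected column values.
object PairsToMapSketch {
  // Standard library behaviour: for duplicate keys, the last pair wins.
  def toMap[K, V](pairs: Seq[(K, V)]): Map[K, V] = pairs.toMap
}
```

Under that assumption, `PairsToMapSketch.toMap(Seq("boracay" -> 7, "boracay" -> 8))` keeps only `"boracay" -> 8`, so the key column should hold unique values if every row is meant to survive.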
def validateAbsenceOfColumns(df: DataFrame, prohibitedColNames: Seq[String]): Unit

Throws an error if the DataFrame contains any of the prohibited columns.
This code will error out:
```
val sourceDF = Seq(
  ("jets", "football"),
  ("nacional", "soccer")
).toDF("team", "sport")

val prohibitedColNames = Seq("team", "sport", "country", "city")

validateAbsenceOfColumns(sourceDF, prohibitedColNames)
```
This is the error message:
> com.github.mrpowers.spark.daria.sql.ProhibitedDataFrameColumnsException: The [team, sport] columns are not allowed to be included in the DataFrame with the following columns [team, sport]
Definition Classes
DataFrameValidator
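The check performed by validateAbsenceOfColumns can be sketched in plain Scala as an intersection between the DataFrame's column names and the prohibited names. This is an illustration under that assumption, not the library's code; `AbsenceCheckSketch` and `prohibitedColumnsPresent` are hypothetical names.

```scala
// Hypothetical sketch: AbsenceCheckSketch and prohibitedColumnsPresent are
// illustrative names and are not part of spark-daria.
object AbsenceCheckSketch {
  // Returns the DataFrame columns that appear in the prohibited list.
  // A non-empty result corresponds to the case where an error is thrown.
  def prohibitedColumnsPresent(
      dfColumns: Seq[String],
      prohibitedColNames: Seq[String]
  ): Seq[String] =
    dfColumns.intersect(prohibitedColNames)
}
```

For the example above, the columns `team` and `sport` checked against `Seq("team", "sport", "country", "city")` give `Seq("team", "sport")`, which matches the columns named in the error message.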
def validatePresenceOfColumns(df: DataFrame, requiredColNames: Seq[String]): Unit

Throws an error if the DataFrame doesn't contain all the required columns.
This code will error out:
```
val sourceDF = Seq(
  ("jets", "football"),
  ("nacional", "soccer")
).toDF("team", "sport")

val requiredColNames = Seq("team", "sport", "country", "city")

validatePresenceOfColumns(sourceDF, requiredColNames)
```
This is the error message:
> com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [country, city] columns are not included in the DataFrame with the following columns [team, sport]
Definition Classes
DataFrameValidator
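The check performed by validatePresenceOfColumns can be sketched as a difference: the required names minus the names the DataFrame actually has. This is a minimal sketch under that assumption; `PresenceCheckSketch` and `missingColumns` are hypothetical names.

```scala
// Hypothetical sketch: PresenceCheckSketch and missingColumns are
// illustrative names and are not part of spark-daria.
object PresenceCheckSketch {
  // Returns the required columns that the DataFrame does not contain.
  // A non-empty result corresponds to the case where an error is thrown.
  def missingColumns(
      dfColumns: Seq[String],
      requiredColNames: Seq[String]
  ): Seq[String] =
    requiredColNames.diff(dfColumns)
}
```

For the example above, the result is `Seq("country", "city")`, the same columns reported in the error message. Extra columns in the DataFrame do not affect the result.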
def validateSchema(df: DataFrame, requiredSchema: StructType): Unit

Throws an error if the DataFrame schema doesn't match the required schema.
This code will error out:
```
val sourceData = List(
  Row(1, 1),
  Row(-8, 8),
  Row(-5, 5),
  Row(null, null)
)

val sourceSchema = List(
  StructField("num1", IntegerType, true),
  StructField("num2", IntegerType, true)
)

val sourceDF = spark.createDataFrame(
  spark.sparkContext.parallelize(sourceData),
  StructType(sourceSchema)
)

val requiredSchema = StructType(
  List(
    StructField("num1", IntegerType, true),
    StructField("num2", IntegerType, true),
    StructField("name", StringType, true)
  )
)

validateSchema(sourceDF, requiredSchema)
```
This is the error message:
> com.github.mrpowers.spark.daria.sql.InvalidDataFrameSchemaException: The [StructField(name,StringType,true)] StructFields are not included in the DataFrame with the following StructFields [StructType(StructField(num1,IntegerType,true), StructField(num2,IntegerType,true))]
Definition Classes
DataFrameValidator
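The error message for validateSchema lists whole StructFields, which suggests fields are compared on name, type, and nullability together. The sketch below models that with a simplified stand-in for Spark's `StructField`. `SchemaCheckSketch`, `Field`, and `missingFields` are hypothetical names, and the comparison rule is an assumption, not confirmed library behaviour.

```scala
// Hypothetical sketch: SchemaCheckSketch, Field, and missingFields are
// illustrative names and are not part of spark-daria or Spark.
object SchemaCheckSketch {
  // Simplified stand-in for StructField(name, dataType, nullable).
  final case class Field(name: String, dataType: String, nullable: Boolean)

  // Returns the required fields with no exact match in the actual schema.
  def missingFields(actual: Seq[Field], required: Seq[Field]): Seq[Field] =
    required.diff(actual)
}
```

With this rule, a field that has the right name but a different type or nullability is reported as missing, just like the absent `name` field in the example above.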
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
def writeTimestamped(df: DataFrame, outputDirname: String, numPartitions: Option[Int] = None, overwriteLatest: Boolean = true): Unit

Related Doc: package sql

object DataFrameHelpers extends DataFrameValidator
