Package com.github.mrpowers.spark.daria.sql

package sql

Type Members

  1. case class CustomTransform(transform: (DataFrame) ⇒ DataFrame, requiredColumns: Seq[String] = Seq.empty[String], addedColumns: Seq[String] = Seq.empty[String], removedColumns: Seq[String] = Seq.empty[String], skipWhenPossible: Boolean = true) extends Product with Serializable

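
A hedged sketch of how the requiredColumns metadata on a custom transform can gate its execution. Column names are plain List[String] values here rather than a real DataFrame schema, and CustomTransformSketch / runWithValidation are hypothetical names for illustration, not spark-daria API:

```scala
// Sketch: use requiredColumns metadata to validate before transforming.
// List[String] stands in for a DataFrame's column names; the real
// CustomTransform carries the same metadata alongside a DataFrame => DataFrame.
case class CustomTransformSketch(
    transform: List[String] => List[String], // maps column names to column names
    requiredColumns: Seq[String] = Seq.empty
)

def runWithValidation(columns: List[String], ct: CustomTransformSketch): List[String] = {
  // Fail fast when a required column is absent, mirroring the library's
  // missing-columns validation errors.
  val missing = ct.requiredColumns.filterNot(columns.contains)
  require(missing.isEmpty, s"missing required columns: ${missing.mkString(", ")}")
  ct.transform(columns)
}

val ct  = CustomTransformSketch(cols => cols :+ "greeting", requiredColumns = Seq("name"))
val out = runWithValidation(List("name", "age"), ct)
```

Running the same transform against a schema without "name" raises an error instead of silently producing a bad DataFrame.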
  2. case class DariaValidationError(smth: String) extends Exception with Product with Serializable

  3. case class DataFrameColumnsException(smth: String) extends Exception with Product with Serializable

  4. trait DataFrameValidator extends AnyRef

  5. case class EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...) extends Product with Serializable


spark-daria can be used as a lightweight framework for running ETL analyses in Spark.

You can define EtlDefinitions, group them in a collection, and run the ETLs via jobs.

    Components of an ETL

    An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.

    The EtlDefinition class is generic and can be molded to suit all ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files on your local filesystem.
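
The wiring described above can be sketched without a SparkSession. In this hedged sketch, List[Map[String, String]] stands in for DataFrame, and EtlSketch mirrors how EtlDefinition composes sourceDF, transform, and write (the real class exposes a process() method that runs the pipeline):

```scala
// Sketch of the EtlDefinition pattern, with List[Map[String, String]]
// standing in for DataFrame so the example runs without Spark.
case class EtlSketch(
    sourceDF: List[Map[String, String]],
    transform: List[Map[String, String]] => List[Map[String, String]],
    write: List[Map[String, String]] => Unit,
    metadata: Map[String, Any] = Map.empty
) {
  // Mirrors EtlDefinition#process(): run the transformation, then write.
  def process(): Unit = write(transform(sourceDF))
}

// Extract: a tiny in-memory "source"
val source = List(Map("name" -> "alice"), Map("name" -> "bob"))

// Transform: uppercase every value
def upcase(df: List[Map[String, String]]): List[Map[String, String]] =
  df.map(_.map { case (k, v) => (k, v.toUpperCase) })

// Load: append to a buffer instead of writing Parquet
val sink = scala.collection.mutable.ListBuffer.empty[Map[String, String]]
val etl  = EtlSketch(source, upcase, rows => { sink ++= rows; () })
etl.process()
```

Swapping the stand-ins for a real DataFrame, a Spark transformation, and a Parquet or S3 writer recovers the actual EtlDefinition usage.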

  6. case class InvalidColumnSortOrderException(smth: String) extends Exception with Product with Serializable

  7. case class InvalidDataFrameSchemaException(smth: String) extends Exception with Product with Serializable

  8. case class MissingDataFrameColumnsException(smth: String) extends Exception with Product with Serializable

  9. class ParquetCompactor extends AnyRef

  10. case class ProhibitedDataFrameColumnsException(smth: String) extends Exception with Product with Serializable


Value Members

  1. object ColumnExt

    Additional methods for the Spark Column class

    Since

    0.0.1

  2. object DariaValidator

  3. object DariaWriters

  4. object DataFrameExt

  5. object DataFrameHelpers extends DataFrameValidator

  6. object FunctionsAsColumnExt

  7. object SparkSessionExt

  8. object functions

Spark has a ton of SQL functions (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/functions.html), and spark-daria is meant to fill in any gaps.

  9. object transformations

    Functions available for DataFrame operations.

    SQL transformations take a DataFrame as an argument and return a DataFrame. They are suitable arguments for the Dataset#transform method.

    It's convenient to work with DataFrames that have snake_case column names. Column names with spaces make it harder to write SQL queries.
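
Both points can be illustrated without Spark. In this hedged sketch, List[Map[String, String]] stands in for DataFrame, toSnakeCase / snakeCaseKeys / dropEmptyValues are hypothetical helpers (not spark-daria API), and andThen chains the functions the same way df.transform(t1).transform(t2) would:

```scala
// A transformation is just a function from "DataFrame" to "DataFrame";
// here List[Map[String, String]] plays the DataFrame role.

// Hypothetical helper mirroring the snake_case column-name cleanup:
// trim, lowercase, and replace runs of whitespace with one underscore.
def toSnakeCase(name: String): String =
  name.trim.toLowerCase.replaceAll("\\s+", "_")

def snakeCaseKeys(df: List[Map[String, String]]): List[Map[String, String]] =
  df.map(_.map { case (k, v) => (toSnakeCase(k), v) })

def dropEmptyValues(df: List[Map[String, String]]): List[Map[String, String]] =
  df.map(_.filter { case (_, v) => v.nonEmpty })

// Chaining with andThen mirrors Dataset#transform chaining.
val pipeline = (snakeCaseKeys _).andThen(dropEmptyValues _)
val result   = pipeline(List(Map("First Name" -> "Alice", "Nick Name" -> "")))
```

Because each step has the same shape, transformations compose freely and can be reordered or reused across pipelines.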

  10. package types

  11. package udafs

