Class

com.github.mrpowers.spark.daria.sql

EtlDefinition

case class EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...) extends Product with Serializable

spark-daria can be used as a lightweight framework for running ETL analyses in Spark.

You can define EtlDefinitions, group them in a collection, and run the ETLs via jobs, as in the sketch below.
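
A minimal sketch of that pattern, assuming a local SparkSession; the DataFrame, transform, and writer names are hypothetical stand-ins for real project code:

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import com.github.mrpowers.spark.daria.sql.EtlDefinition

  val spark = SparkSession.builder().master("local").getOrCreate()

  // hypothetical source DataFrame and ETL functions
  val numbersDF: DataFrame = spark.range(5).toDF("n")

  def doubler()(df: DataFrame): DataFrame = df.withColumn("doubled", df("n") * 2)

  def consoleWriter()(df: DataFrame): Unit = df.show()

  // group the EtlDefinitions in a collection, keyed by name
  val etls: Map[String, EtlDefinition] = Map(
    "doubleNumbers" -> EtlDefinition(numbersDF, doubler(), consoleWriter())
  )

  // a job picks the ETLs it needs out of the collection and runs them
  etls("doubleNumbers").process()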

Components of an ETL

An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.

The EtlDefinition class is generic and can be molded to suit many ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files on your local filesystem, as in the sketch below.
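
A sketch of that exact flow; the S3 bucket and output directory are placeholder paths, and withGreeting / parquetWriter are illustrative names:

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import org.apache.spark.sql.functions.lit
  import com.github.mrpowers.spark.daria.sql.EtlDefinition

  val spark = SparkSession.builder().getOrCreate()

  // extract: read a CSV file from S3 (placeholder bucket)
  val extractDF: DataFrame = spark.read
    .option("header", "true")
    .csv("s3a://my-bucket/people.csv")

  // transform: append a constant column
  def withGreeting()(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("hello"))
  }

  // load: write Parquet files to a local directory (placeholder path)
  def parquetWriter(outputPath: String)(df: DataFrame): Unit = {
    df.write.mode("overwrite").parquet(outputPath)
  }

  val csvToParquet = EtlDefinition(
    sourceDF = extractDF,
    transform = withGreeting(),
    write = parquetWriter("/tmp/people_parquet")
  )

  csvToParquet.process()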

Linear Supertypes
scala.Serializable, java.io.Serializable, Product, Equals, AnyRef, Any

Instance Constructors

  1. new EtlDefinition(sourceDF: DataFrame, transform: (DataFrame) ⇒ DataFrame, write: (DataFrame) ⇒ Unit, metadata: Map[String, Any] = ...)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  7. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  8. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  9. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  10. val metadata: Map[String, Any]

  11. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. final def notify(): Unit

    Definition Classes
    AnyRef
  13. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  14. def process(): Unit

    Runs an ETL process by applying transform to sourceDF and handing the result to write.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.{IntegerType, StringType}
    import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

    val sourceDF = spark.createDF(
      List(
        ("bob", 14),
        ("liz", 20)
      ), List(
        ("name", StringType, true),
        ("age", IntegerType, true)
      )
    )

    def someTransform()(df: DataFrame): DataFrame = {
      df.withColumn("cool", lit("dude"))
    }

    def someWriter()(df: DataFrame): Unit = {
      val path = new java.io.File("./tmp/example").getCanonicalPath
      df.repartition(1).write.csv(path)
    }

    // metadata has a default value, so it can be omitted here
    val etlDefinition = new EtlDefinition(
      sourceDF = sourceDF,
      transform = someTransform(),
      write = someWriter()
    )

    etlDefinition.process()
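    Because transform and write are plain function values, swapping someWriter() for, say, a Parquet writer changes the load step without touching the rest of the definition.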
  15. val sourceDF: DataFrame

  16. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  17. val transform: (DataFrame) ⇒ DataFrame

  18. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  19. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. val write: (DataFrame) ⇒ Unit
