## Additional methods for the Spark Column class
Spark [has a ton of SQL functions](https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/functions.html) and spark-daria is meant to fill in any gaps.
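Here's a sketch of how the Column extensions can be used, assuming spark-daria's `ColumnExt` implicits and the `isNullOrBlank` method from its docs (verify the import path and method names against your version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import com.github.mrpowers.spark.daria.sql.ColumnExt._ // assumed import path

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(Some("  "), Some("bob"), None).toDF("first_name")

// isNullOrBlank fills a gap in the built-in functions: it's true
// for null values and for whitespace-only strings
val flaggedDF = df.withColumn("missing_name", col("first_name").isNullOrBlank)
```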
## Functions available for DataFrame operations
SQL transformations take a DataFrame as an argument and return a DataFrame. They are suitable arguments for the `Dataset#transform` method.
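For example, a custom transformation with this signature chains fluently with the built-in DataFrame methods (this is plain Spark, nothing daria-specific; `withGreeting` is a made-up helper):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob").toDF("first_name")

// A SQL transformation: takes a DataFrame, returns a DataFrame
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello world"))

// The matching signature is what lets Dataset#transform chain it
val resultDF = df.transform(withGreeting)
```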
It's convenient to work with DataFrames that have snake_case column names. Column names with spaces make it harder to write SQL queries.
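spark-daria ships a transformation for this kind of column cleanup; the sketch below shows the idea in plain Spark so nothing is assumed about the library's API (`toSnakeCaseColumns` is a hypothetical helper):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val rawDF = Seq(("bob", 42)).toDF("First Name", "Person Age")

// Lowercase every column name and replace spaces with underscores
def toSnakeCaseColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, name) =>
    acc.withColumnRenamed(name, name.toLowerCase.replace(" ", "_"))
  }

val cleanDF = rawDF.transform(toSnakeCaseColumns) // first_name, person_age
```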
spark-daria can be used as a lightweight framework for running ETL analyses in Spark.
You can define `EtlDefinition` objects, group them in a collection, and run the ETLs via jobs (see the sketch at the end of this section).

## Components of an ETL
An ETL starts with a DataFrame, runs a series of transformations (filter, custom transformations, repartition), and writes out data.
The `EtlDefinition` class is generic and can be molded to suit all ETL situations. For example, it can read a CSV file from S3, run transformations, and write out Parquet files on your local filesystem.
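Here's a sketch of a complete ETL; the `sourceDF` / `transform` / `write` field names and the `process()` method follow the project's docs, but treat them as assumptions to verify against your spark-daria version (the paths and column name are made up):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.github.mrpowers.spark.daria.sql.EtlDefinition // assumed import path

val spark = SparkSession.builder().master("local").getOrCreate()

// Extract: read a CSV (an "s3a://bucket/extract.csv" path works the same way)
val extractDF: DataFrame = spark.read.option("header", "true").csv("/tmp/extract.csv")

// Transform: any DataFrame => DataFrame function
def someTransform(df: DataFrame): DataFrame =
  df.filter(df("first_name").isNotNull).repartition(4)

// Load: any DataFrame => Unit function, here a Parquet writer
def parquetWriter(df: DataFrame): Unit =
  df.write.mode("overwrite").parquet("/tmp/clean_extract")

val etl = EtlDefinition(
  sourceDF = extractDF,
  transform = someTransform,
  write = parquetWriter
)

// Group definitions in a collection and run them from a job
val etls = Map("extract" -> etl)
etls("extract").process()
```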