Local Development
DDataflow also enables you to develop with local data. We see this as a more advanced use case, though, so it might not be the first choice for everybody. First, make a copy of the files you need to download in DBFS:
```python
ddataflow.save_sampled_data_sources(ask_confirmation=False)
```
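The `ddataflow` object here is the DDataflow instance your project already defines in its configuration module. As a rough sketch, assuming a hypothetical `ddataflow_config.py` with placeholder source names, tables, and sampling filters:

```python
# ddataflow_config.py -- illustrative sketch; the source names, tables and
# sampling filters below are placeholders, adapt them to your project.
from ddataflow import DDataflow

config = {
    "project_folder_name": "your_project",
    "data_sources": {
        # how to read the full table and how to sample it for local use
        "demo_tours": {
            "source": lambda spark: spark.table("demo_tours"),
            "filter": lambda df: df.limit(500),
        },
    },
}

ddataflow = DDataflow(**config)
```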
Then, on your machine, download them:

```sh
ddataflow current_project download_data_sources
```
Now you can run the pipeline locally by exporting the following environment variable:

```sh
export ENABLE_OFFLINE_MODE=true
# run your pipeline as normal
python yourproject/train.py
```
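For the offline mode to kick in transparently, the pipeline should read its inputs through the DDataflow instance rather than hitting production tables directly. A minimal sketch, assuming the hypothetical `ddataflow_config.py` above and its placeholder `demo_tours` source:

```python
# yourproject/train.py -- illustrative sketch, not a real pipeline.
# With ENABLE_OFFLINE_MODE=true, source() resolves to the sampled copy
# downloaded to $HOME/.ddataflow instead of the production table.
from ddataflow_config import ddataflow

tours_df = ddataflow.source("demo_tours")
print(tours_df.count())
```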
The downloaded data sources will be stored in `$HOME/.ddataflow`.
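You can check what was downloaded by listing that folder:

```sh
# inspect the downloaded data sources
ls $HOME/.ddataflow
```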
Local setup for Spark
If you run Spark locally, you might need to tweak some parameters compared to your cluster. Below is a good example you can use.
```python
from pyspark.sql import SparkSession

def configure_spark():
    if ddataflow.is_local():
        import pyspark

        # Local runs get their own context: a temporary warehouse dir,
        # a Hive catalog and enough driver memory for the sampled data.
        spark_conf = pyspark.SparkConf()
        spark_conf.set("spark.sql.warehouse.dir", "/tmp")
        spark_conf.set("spark.sql.catalogImplementation", "hive")
        spark_conf.set("spark.driver.memory", "15g")
        spark_conf.setMaster("local[*]")
        sc = pyspark.SparkContext(conf=spark_conf)
        return pyspark.sql.SparkSession(sc)

    # On the cluster, reuse the session that is already provided.
    return SparkSession.builder.getOrCreate()
```
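With that helper in place, the rest of the pipeline can obtain its session the same way locally and on the cluster, for example:

```python
spark = configure_spark()
spark.sql("SELECT 1").show()  # sanity check that the session works
```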
If you run into a Snappy compression problem, reinstall PySpark.
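For example, one way to reinstall it with pip:

```sh
pip install --force-reinstall pyspark
```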