Transform data in a delta lake using mapping data flows in Azure Data Factory

In this tutorial, you’ll use the data flow canvas to create data flows that allow you to analyze and transform data in Azure Data Lake Storage (ADLS) Gen2 and store it in Delta Lake.

Source tutorial: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow-delta-lake#next-steps

The file we transform in this tutorial is MoviesDB.csv, which can be found on GitHub (adfdataflowdocs/moviesDB2.csv at master · kromerm/adfdataflowdocs). To retrieve the file from GitHub, copy its contents into a text editor of your choice and save it locally as a .csv file.

The examples reference a container named ‘sample-data’, created with the hierarchical namespace enabled.

Create a data factory

Create a pipeline with a data flow activity

  1. On the home page, select Orchestrate.
  2. In the General tab for the pipeline, enter DeltaLake as the name of the pipeline.
  3. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.


In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, so it's recommended that you turn on debug first if you plan to do data flow development.

Build transformation logic in the data flow canvas

You will build two data flows in this tutorial. The first data flow is a simple source-to-sink flow that generates a new Delta Lake from the movies CSV file above. The second data flow updates data in the Delta Lake using the design below:

  1. Take the MoviesCSV dataset source from above, and form a new Delta Lake from it.
  2. Build the logic to update ratings for 1988 movies to ‘1’.
  3. Delete all movies from 1950.
  4. Insert new movies for 2021 by duplicating the movies from 1960.
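The update, delete, and insert steps above can be sketched in plain Python to show their row-level effect (the column names and sample rows are assumptions for illustration, not the real MoviesDB contents):

```python
# In-memory stand-in for the Delta Lake table; columns are assumed.
movies = [
    {"title": "Movie A", "year": 1950, "rating": 7},
    {"title": "Movie B", "year": 1960, "rating": 8},
    {"title": "Movie C", "year": 1988, "rating": 6},
]

# Update: set the rating of every 1988 movie to 1.
for m in movies:
    if m["year"] == 1988:
        m["rating"] = 1

# Delete: drop all movies from 1950.
movies = [m for m in movies if m["year"] != 1950]

# Insert: duplicate each 1960 movie as a new 2021 release.
movies = movies + [dict(m, year=2021) for m in movies if m["year"] == 1960]
```

In the data flow itself, these become alter-row Update, Delete, and Insert policies rather than Python code.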

Start from a blank data flow canvas

  1. Click on the source transformation.
  2. Click New next to the dataset in the bottom panel, and create a new linked service for ADLS Gen2.
  3. Choose Delimited Text for the dataset type.
  4. Name the dataset “MoviesCSV”.
  5. Point to the MoviesCSV file that you uploaded to storage above.
  6. Set the delimiter to comma (the default) and include a header on the first row.
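If you want to sanity-check the file locally before uploading, the Delimited Text settings above (comma delimiter, header on first row) can be mimicked with Python's csv module — the sample rows and column names here are assumptions, not the real MoviesDB data:

```python
import csv
import io

# Tiny stand-in for MoviesDB.csv; the real file is in the GitHub repo above.
sample = "movie,title,year\n1,Movie A,1988\n2,Movie B,1950\n"

# delimiter="," plus a header row mirrors the dataset settings in step 6.
reader = csv.DictReader(io.StringIO(sample), delimiter=",")
rows = list(reader)
```

Note that DictReader reads every value as a string, which is why the projection step that follows needs to detect data types.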


  1. Go to the source Projection tab and click “Detect data types”.
  2. Once you have your projection set, you can continue.
  3. Add a sink transformation.
  4. Delta is an inline dataset type; you will need to point the sink to your ADLS Gen2 storage account.
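Conceptually, “Detect data types” samples the string values in each column and promotes the column to the narrowest type that fits. Here is a minimal sketch of that idea — my own simplification for illustration, not ADF's actual inference code:

```python
def detect_type(values):
    """Return the narrowest type name that every sampled value fits."""
    # Try integer first, then double; fall back to string. (Simplified:
    # ADF's real detection covers more types, e.g. dates and booleans.)
    for cast, name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"
```

For example, a column sampled as ["1995", "2001"] comes back as integer, while ["7.5", "8"] falls back to double and movie titles remain strings.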

To add the sink, press the small + sign on your last transformation and choose Sink; then select Delta as the sink type and your ADLS Gen2 linked service.

Go back to the pipeline designer and click Debug to execute the pipeline in debug mode with just this data flow activity on the canvas. This will generate your new Delta Lake in ADLS Gen2.

If the debug run errors, a sink setting was likely missed. Finish configuring the sink:

  1. Choose a folder name in your storage container where you would like ADF to create the Delta Lake (for example, a folder named datalake01 inside a container named lake).
  2. In the sink, edit the existing linked service or create a new one for saving the data, and click Test connection to verify it.

Go back to the pipeline designer and click Debug again. If the run errors again, go to the Settings tab on the sink and browse to the folder you chose. Run Debug once more; this time it executes successfully and generates your new Delta Lake in ADLS Gen2.


  1. From Factory Resources, click New > Data flow.
  2. Use the MoviesCSV dataset again as a source, and click “Detect data types” again.
  3. Add a filter transformation after your source transformation in the graph.

  4. Open the expression builder and enter the filter condition. Expressions in mapping data flows are written in the data flow expression language.
  5. Continue with steps 16 to 24 of the linked tutorial to build out the update, delete, and insert logic.
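As a hedged illustration, a filter condition such as `year==1950 || year==1960 || year==1988` (an assumed example in the data flow expression language, matching the years modified above) has this Python equivalent:

```python
# Assumed filter predicate: keep only the rows the later steps modify.
def keep(row):
    return row["year"] in (1950, 1960, 1988)

# Hypothetical sample rows to show which ones pass the filter.
sample_rows = [{"year": 1950}, {"year": 1975}, {"year": 1988}]
kept = [r for r in sample_rows if keep(r)]
```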
