- Design and implement data storage (15–20%)
- Develop data processing (40–45%)
- Secure, monitor, and optimize data storage and data processing (30–35%)
In most organizations, a data engineer is the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions. An Azure data engineer also helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.
Types of data
- Structured: data that comes from a table-based source such as a relational database or a flat file.
- Unstructured: data stored as key-value pairs, as well as files such as Portable Document Format (PDF) documents, word processor documents, and images.
Data Integration involves establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.
Operational data usually needs to be transformed into a suitable structure and format for analysis, often as part of an extract, transform, and load (ETL) process.
Data consolidation is the process of combining data that has been extracted from multiple data sources into a consistent structure – usually to support analytics and reporting.
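As a minimal sketch of the extract, transform, and load steps described above, the following Python example consolidates records from two sources into one consistent structure. The two sources (a CSV export and a JSON feed), the field names, and the sample values are all hypothetical, and an in-memory list stands in for the analytical store.

```python
import csv
import io
import json

# Hypothetical source 1: a flat-file (CSV) export from an operational system.
CSV_SOURCE = "order_id,customer,amount\n1,Alice,10.50\n2,Bob,20.00\n"

# Hypothetical source 2: a JSON feed that uses different field names.
JSON_SOURCE = '[{"id": 3, "customerName": "Carol", "total": "5.25"}]'


def extract_csv(text):
    """Extract: read rows from CSV text as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: parse a JSON array of records."""
    return json.loads(text)


def transform(csv_rows, json_rows):
    """Transform: map both sources onto one consistent structure."""
    consolidated = []
    for row in csv_rows:
        consolidated.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"],
            "amount": float(row["amount"]),
        })
    for row in json_rows:
        consolidated.append({
            "order_id": int(row["id"]),
            "customer": row["customerName"],
            "amount": float(row["total"]),
        })
    return consolidated


def load(rows, target):
    """Load: append the consolidated rows to an in-memory target list."""
    target.extend(rows)
    return target


analytical_store = load(
    transform(extract_csv(CSV_SOURCE), extract_json(JSON_SOURCE)), []
)
```

After the run, `analytical_store` holds three rows that share the same keys and types, regardless of which source they came from.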
SQL – One of the most common languages data engineers use is SQL.
Others – The use of notebooks is growing in popularity and allows collaboration using different languages within the same notebook.
Operational and analytical data
Operational data is usually transactional data that is generated and stored by applications, often in a relational or non-relational database.
Analytical data is data that has been optimized for analysis and reporting, often in a data warehouse.
Streaming data refers to perpetual sources of data that generate data values in real-time, often relating to specific events.
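To illustrate how values from an event-based source might be aggregated as they arrive, here is a small sketch that groups timestamped events into fixed, non-overlapping time windows. The event tuples and the 10-second window size are made-up assumptions, and a plain list stands in for a real streaming source.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window.

    events: iterable of (timestamp_in_seconds, value) tuples.
    Returns a dict mapping window start time -> number of events.
    """
    counts = defaultdict(int)
    for timestamp, _value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Hypothetical sensor readings: (seconds since start, temperature).
sample_events = [(1, 20.5), (4, 20.7), (12, 21.0), (19, 21.2), (25, 21.1)]
window_counts = tumbling_window_counts(sample_events, 10)
```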
Data pipelines are used to orchestrate activities that transfer and transform data. Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events.
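The idea of a pipeline as an ordered, repeatable set of activities can be sketched in plain Python. The `run_pipeline` function and the activity names below are hypothetical; an orchestration service would add scheduling, event triggers, and error handling on top of this basic shape.

```python
def run_pipeline(activities, data):
    """Run each activity in order, passing the output of one to the next.

    activities: list of (name, function) pairs.
    Returns the final data and a log of the activity names that ran.
    """
    log = []
    for name, activity in activities:
        data = activity(data)
        log.append(name)
    return data, log


# Hypothetical activities for a tiny transfer-and-transform pipeline.
def drop_missing(rows):
    return [row for row in rows if row is not None]


def to_upper(rows):
    return [row.upper() for row in rows]


pipeline = [("drop_missing", drop_missing), ("to_upper", to_upper)]

# A schedule or event trigger would call run_pipeline; here it is called directly.
result, run_log = run_pipeline(pipeline, ["a", None, "b"])
```

Because the pipeline is just data (a list of named steps), the same definition can be rerun on every trigger with different input.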
A data lake is a storage repository that holds large amounts of data in native, raw formats.
[…] The idea with a data lake is to store everything in its original, untransformed state. This approach differs from a traditional data warehouse, which transforms and processes the data at the time of ingestion.
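The contrast can be sketched as follows: the data lake keeps each record exactly as it arrived and applies structure only when the data is read, while the warehouse-style load transforms records before storing them. The record format and field names are hypothetical, and Python lists stand in for real storage.

```python
import json

# Raw events as they arrived; note the second record has no "temp" field.
raw_events = [
    '{"device": "a1", "temp": "21.5"}',
    '{"device": "b2"}',
]

# Data lake approach: store everything in its original, untransformed state.
data_lake = list(raw_events)


def read_temperatures(lines):
    """Parse raw JSON lines into typed (device, temperature) tuples.

    Records without a temperature are skipped.
    """
    readings = []
    for line in lines:
        record = json.loads(line)
        if "temp" in record:
            readings.append((record["device"], float(record["temp"])))
    return readings


# Lake: structure is applied only when the data is read.
lake_query_result = read_temperatures(data_lake)

# Warehouse-style: transform at ingestion and store only the typed rows.
warehouse_rows = read_temperatures(raw_events)
```

Both paths yield the same typed rows here, but only the lake still holds the original record for device `b2`, which a later analysis could reinterpret.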
A data warehouse is a centralized repository of integrated data from one or more disparate sources. Data warehouses store current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
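As an illustration of relational tables organized for analytical queries, the sketch below builds a tiny schema with one fact table and one dimension table in an in-memory SQLite database. SQLite is only a stand-in for a real data warehouse, and the table names, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        sale_year  INTEGER NOT NULL,
        amount     REAL NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO fact_sales VALUES
        (1, 1, 2022, 500.0),
        (2, 1, 2023, 700.0),
        (3, 2, 2023, 50.0);
""")

# Analytical query: total sales per category across all stored years.
totals = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
conn.close()
```

Keeping descriptive attributes in the dimension table and numeric measures in the fact table is one common way to arrange a schema so that aggregate queries like this stay simple.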
Apache Spark is a parallel processing framework that takes advantage of in-memory processing and distributed file storage. It’s a common open-source software (OSS) tool for big data scenarios.
Data engineers need to be proficient with Spark, using notebooks and other code artifacts to process data in a data lake and prepare it for modeling and analysis.
The core Azure technologies used to implement data engineering workloads include:
Azure Synapse Analytics
Azure Data Lake Storage Gen2
Azure Stream Analytics
Choose a stream processing technology in Azure
Stream Analytics has first-class integration with four kinds of resources as inputs:
- Azure Event Hubs
- Azure IoT Hub
- Azure Blob storage
- Azure Data Lake Storage Gen2
Azure Data Factory
- Pipeline: a logical grouping of activities that performs a unit of work.
- Mapping data flows: data transformation logic.
- Activity: a processing step in a pipeline.
- Datasets: data structures within the data store that point to or reference the data to use as input or output.
- Linked services: similar to connection strings.
- Integration runtime (IR): an activity defines the action to be performed, and a linked service defines the target data store or compute service. The IR is the bridge between the activity and the linked service.
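To show how these components refer to one another, here is a simplified sketch that builds their definitions as Python dictionaries and serializes the pipeline to JSON. The shape is modeled on the Data Factory JSON format, but every resource name (`BlobStoreLS`, `InputCsv`, and so on) is hypothetical and most required properties are omitted, so treat it as an illustration rather than a deployable definition.

```python
import json

# Linked service: connection information for a target store (like a connection string).
linked_service = {
    "name": "BlobStoreLS",  # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<connection string>"},
    },
}


def make_dataset(name, file_name):
    """Dataset: points to the data to use, reached through a linked service."""
    return {
        "name": name,
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": linked_service["name"],
                "type": "LinkedServiceReference",
            },
            # Simplified location; a real definition needs more properties.
            "typeProperties": {"location": {"fileName": file_name}},
        },
    }


input_dataset = make_dataset("InputCsv", "raw.csv")
output_dataset = make_dataset("OutputCsv", "clean.csv")

# Pipeline: a logical grouping of activities; this one has a single copy activity.
pipeline = {
    "name": "CopyPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawToClean",
                "type": "Copy",
                "inputs": [
                    {"referenceName": input_dataset["name"], "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": output_dataset["name"], "type": "DatasetReference"}
                ],
            }
        ]
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
```

The chain of references runs pipeline → activity → dataset → linked service; the integration runtime, not modeled here, is where the activity actually executes against the linked service.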