How to pass the Databricks Data Engineer Professional certification?

Number of questions: 60

Type of questions: Multiple-choice questions

Duration: 120 minutes

Passing score: 70%

Where to register for the certification: https://www.webassessor.com/databricks

Expiration: 2 years

Topics covered:

  • Databricks tooling
  • Data processing
  • Data modeling
  • Security and governance
  • Monitoring and logging
  • Testing and deployment

Practice tests: No practice exams are available yet.

How to prepare for the certification:

Complete the Advanced Data Engineering with Databricks course (Databricks Academy)

Complete the Advanced Data Engineering notebooks (strongly recommended)

Read the Databricks documentation (recommended)

Features you should know before taking the exam:

Delta optimization (OPTIMIZE, Z-Ordering, Auto Optimize)
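
For instance, a minimal sketch of compaction and Z-Ordering, run from a Databricks notebook where `spark` is predefined (the table name `events` and column `event_time` are placeholders):

```python
# Compact small files and co-locate data by a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_time)")

# Auto Optimize can be enabled per table through Delta table properties.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```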

Delta clones (shallow and deep)
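
A quick sketch of the difference between the two clone types (table names are hypothetical):

```python
# Shallow clone: copies only the metadata; data files are referenced in place.
spark.sql("CREATE TABLE events_dev SHALLOW CLONE events")

# Deep clone: copies metadata and data files, producing a fully independent table.
spark.sql("CREATE TABLE events_backup DEEP CLONE events")
```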

Databricks REST APIs
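
For example, a minimal sketch of triggering an existing job through the Jobs 2.1 endpoint with Python's `requests`; the workspace URL, token, and job ID are placeholders you would substitute:

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                       # placeholder

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123},  # ID of an existing job
)
resp.raise_for_status()
print(resp.json()["run_id"])  # ID of the run that was just started
```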

Databricks CLI

VACUUM
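
A minimal example, again with a placeholder table name:

```python
# Remove data files no longer referenced by the Delta log and older than
# the retention threshold (the default is 7 days, i.e. 168 hours).
spark.sql("VACUUM events RETAIN 168 HOURS")
```

Keep in mind that once files are vacuumed, time travel to versions that needed them is no longer possible.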

Data object privileges (GRANT/REVOKE)
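
For instance (the group `analysts` and table `sales.events` are made-up names):

```python
# Grant read access on a table to a group, then take it away again.
spark.sql("GRANT SELECT ON TABLE sales.events TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE sales.events FROM `analysts`")
```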

Delta Lake (time travel, MERGE, optimization, CTAS, INSERT)
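
A compact sketch of the operations that come up most often, using hypothetical `customers`/`updates` tables:

```python
# MERGE: upsert a batch of changes into a Delta table.
spark.sql("""
    MERGE INTO customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query an earlier version of the table.
old = spark.sql("SELECT * FROM customers VERSION AS OF 3")

# CTAS: create a new Delta table from the result of a query.
spark.sql("CREATE TABLE customers_eu AS SELECT * FROM customers WHERE region = 'EU'")
```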

Managed and External Delta Tables
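
The practical difference shows up when the table is dropped, as this sketch with placeholder names illustrates:

```python
# Managed table: Databricks controls the storage location;
# DROP TABLE removes both the metadata and the underlying data.
spark.sql("CREATE TABLE managed_events (id INT, ts TIMESTAMP)")

# External table: the data lives at a path you control;
# DROP TABLE removes only the metadata, the files stay behind.
spark.sql("""
    CREATE TABLE external_events (id INT, ts TIMESTAMP)
    LOCATION 's3://my-bucket/events'  -- hypothetical path
""")
```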

Widgets 
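
For example, in a notebook (where `dbutils` is available automatically):

```python
dbutils.widgets.text("env", "dev", "Environment")  # create a text widget
env = dbutils.widgets.get("env")                   # read its current value
print(f"Running against: {env}")
```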

Delta Live Tables (DLT + Auto Loader)
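
A minimal two-table pipeline sketch; note that `dlt` code only runs inside a DLT pipeline, and the source path is a placeholder:

```python
import dlt

# Bronze: ingest raw JSON files incrementally with Auto Loader.
@dlt.table(comment="Raw events ingested with Auto Loader")
def events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events")  # placeholder path
    )

# Silver: enforce a data-quality expectation, dropping rows that violate it.
@dlt.table(comment="Cleaned events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def events_silver():
    return dlt.read_stream("events_bronze").select("id", "ts", "payload")
```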

Structured Streaming (watermarking + windowing + joins)
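
For example, a windowed aggregation with a watermark (the source table and columns are assumptions):

```python
from pyspark.sql.functions import col, window

# Hypothetical source: a streaming read of a Delta table with an `event_time` column.
events = spark.readStream.table("events_bronze")

counts = (
    events
    .withWatermark("event_time", "5 minutes")  # tolerate 5 minutes of late data
    .groupBy(window(col("event_time"), "10 minutes"), col("event_type"))
    .count()
)

(counts.writeStream
    .outputMode("append")  # allowed here because the watermark bounds the state
    .option("checkpointLocation", "/mnt/checkpoints/event_counts")  # placeholder
    .toTable("event_counts"))
```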

Databricks Repos 

Incremental processing (Auto Loader, COPY INTO)
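
COPY INTO is the SQL-first alternative to Auto Loader; it loads each source file at most once, so re-running it is safe. A sketch with placeholder names:

```python
spark.sql("""
    COPY INTO sales_raw
    FROM 's3://my-bucket/landing/sales'  -- hypothetical landing path
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```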

Medallion Architecture
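
In batch form, the bronze → silver → gold flow can be as simple as this sketch (table and column names are illustrative):

```python
from pyspark.sql.functions import col, sum as sum_

# Bronze -> Silver: validate and deduplicate the raw data.
silver = (
    spark.read.table("orders_bronze")
    .filter(col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("orders_silver")

# Silver -> Gold: business-level aggregate ready for reporting.
gold = (
    spark.read.table("orders_silver")
    .groupBy("customer_id")
    .agg(sum_("amount").alias("total_spent"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("orders_gold")
```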

Databricks Workflows
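
Workflows are multi-task jobs with dependencies between tasks. A sketch of creating a two-task job through the Jobs 2.1 API (all identifiers and paths are placeholders):

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                       # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```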

Slowly Changing Dimensions (SCD)
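
One way to build a Type 2 dimension is `dlt.apply_changes`, which keeps historical row versions for you; the CDC source and columns below are assumptions, and this only runs inside a DLT pipeline:

```python
import dlt

dlt.create_streaming_table("customers_scd2")

# Apply CDC changes as a Type 2 slowly changing dimension:
# earlier versions of each row are retained with validity ranges.
dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc",          # hypothetical CDC feed
    keys=["customer_id"],
    sequence_by="change_timestamp",  # orders the change events
    stored_as_scd_type=2,
)
```

The same result can also be built by hand with MERGE, which is worth practicing as well.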

Additional resources:

Data Engineering with Databricks Session 1

Minimally Qualified Candidate:

The minimally qualified candidate should be able to:

  • Understand how to use the Databricks platform and its tools, and the benefits of doing so, including:
    • Platform (notebooks, clusters, Jobs, Databricks SQL, relational entities, Repos)
    • Apache Spark (PySpark, DataFrame API, basic architecture)
    • Delta Lake (SQL-based Delta APIs, basic architecture, core functions)
    • Databricks CLI (deploying notebook-based workflows)
    • Databricks REST API (configure and trigger production pipelines)
  • Build data processing pipelines using the Spark and Delta Lake APIs, including:
    • Building batch-processed ETL pipelines
    • Building incrementally processed ETL pipelines
    • Optimizing workloads
    • Deduplicating data
    • Using Change Data Capture (CDC) to propagate changes
  • Model data management solutions, including:
    • Lakehouse (bronze/silver/gold architecture, databases, tables, views, and the physical layout)
    • General data modeling concepts (keys, constraints, lookup tables, slowly changing dimensions)
  • Build production pipelines using best practices around security and governance, including:
    • Managing notebook and jobs permissions with ACLs
    • Creating row- and column-oriented dynamic views to control user/group access (see the sketch after this list)
    • Securely storing personally identifiable information (PII)
    • Securely deleting data as requested under GDPR and CCPA
  • Configure alerting and storage to monitor and log production jobs, including:
    • Setting up notifications
    • Configuring SparkListener
    • Recording logged metrics
    • Navigating and interpreting the Spark UI
    • Debugging errors
  • Follow best practices for managing, testing and deploying code, including:
    • Managing dependencies
    • Creating unit tests
    • Creating integration tests
    • Scheduling Jobs
    • Versioning code/notebooks
    • Orchestrating Jobs
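
For the dynamic-views bullet above, here is a hedged sketch combining column masking and row filtering; `is_member` is a built-in Databricks SQL function, while the table and group names are placeholders:

```python
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT
        customer_id,
        -- Column-level control: mask PII for users outside a privileged group.
        CASE WHEN is_member('pii_readers') THEN email ELSE 'REDACTED' END AS email,
        region
    FROM customers
    -- Row-level control: non-admins only see EU rows.
    WHERE is_member('admins') OR region = 'EU'
""")
```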

Article written by Youssef Mrini