Industry-focused diploma bridging academic excellence with real-world skills
Comprehensive training in data engineering with industry-standard tools and practices
2 Semesters • 12 Courses • 35 Total Credits
| S.No | Course Code | Course Title | Credits |
|---|---|---|---|
| | **SEMESTER 1: Foundation & Fundamentals (16 Credits)** | | |
| 1 | DADE101 | Python for Data Engineering | 3 |
| 2 | DADE102 | Python for Data Engineering Lab | 2 |
| 3 | DADE103 | Data Analysis & Visualization | 3 |
| 4 | DADE104 | Data Analysis & Visualization Lab | 2 |
| 5 | DADE105 | Database Systems for Data Engineering | 3 |
| 6 | DADE106 | Capstone Project 1 | 3 |
| | **SEMESTER 2: Advanced & Production Systems (19 Credits)** | | |
| 7 | DADE201 | ETL/ELT Pipeline Development | 3 |
| 8 | DADE202 | ETL/ELT Pipeline Development Lab | 2 |
| 9 | DADE203 | Big Data Processing & Data Lakes | 3 |
| 10 | DADE204 | Big Data Processing Lab | 2 |
| 11 | DADE205 | Cloud Data Platforms & Data Warehousing | 3 |
| 12 | DADE206 | Capstone Project 2 | 6 |
Introduction to Python – Why Python for Data Engineering? – Python Installation and Setup – Python IDEs: Jupyter Notebook, VS Code, PyCharm – Running Python Scripts – Python Syntax Basics – Indentation and Code Structure – Comments and Documentation. Variables and Data Types – Python Variables – Numeric Types: int, float, complex – String Operations – Boolean Type – Type Conversion – Dynamic Typing – None Type. Data Structures – Lists: Creation, Indexing, Slicing, List Methods, List Comprehensions – Tuples: Immutability, Use Cases – Dictionaries: Key-Value Pairs, Dictionary Methods, Nested Dictionaries – Sets: Unique Elements, Set Operations.
Create lists and perform basic operations; Manipulate dictionaries for data storage; Use list comprehensions for data transformation; Handle nested data structures; Convert between different data types.
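The exercises above could look like the following sketch, using made-up order records (the `orders` data and field names are illustrative, not part of the course materials):

```python
# Nested data structures and list comprehensions on sample order records.
orders = [
    {"id": 1, "item": "disk", "qty": 4, "price": 25.0},
    {"id": 2, "item": "ram", "qty": 2, "price": 40.0},
    {"id": 3, "item": "disk", "qty": 1, "price": 25.0},
]

# List comprehension: compute a per-order total for each record.
totals = [o["qty"] * o["price"] for o in orders]

# Dictionary as an accumulator: total quantity per item.
qty_by_item = {}
for o in orders:
    qty_by_item[o["item"]] = qty_by_item.get(o["item"], 0) + o["qty"]

# Set: the distinct items seen across all orders.
items = {o["item"] for o in orders}

# Type conversion: quantities arriving as strings, cast back to int.
qtys = [int(q) for q in ["4", "2", "1"]]
```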
Control Flow Statements – Conditional Statements: if, elif, else – Comparison Operators – Logical Operators – Nested Conditionals – Looping Statements: for loops, while loops, range() function – Loop Control: break, continue, pass – Nested Loops. Functions – Function Definition and Calling – Function Parameters: Positional, Keyword, Default Arguments – Variable Scope: Local vs Global – Return Values – Lambda Functions – Map, Filter, Reduce Functions – Function Decorators (Introduction). Error Handling – Types of Errors: Syntax Errors, Runtime Errors, Logical Errors – Exception Handling: try, except, finally – Raising Exceptions – Custom Exceptions – Best Practices for Error Handling.
Write functions for data validation; Implement data filtering using filter() and list comprehensions; Handle file reading errors with exception handling; Create reusable data processing functions; Use lambda functions for quick transformations.
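A minimal sketch of these exercises, combining reusable functions, exception handling, and `filter()` with a lambda (the `parse_amount` helper and its rules are hypothetical examples, not a prescribed API):

```python
def parse_amount(raw):
    """Validate and convert a raw string to float, with explicit error handling."""
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"not a number: {raw!r}")
    if value < 0:
        raise ValueError(f"negative amount: {value}")
    return value

def clean_amounts(rows):
    """Reusable processing function: keep only parseable, non-negative amounts."""
    cleaned = []
    for raw in rows:
        try:
            cleaned.append(parse_amount(raw))
        except ValueError:
            continue  # skip bad records instead of failing the whole batch
    return cleaned

# filter() with a lambda for a quick, throwaway transformation.
big = list(filter(lambda x: x > 10, [3.5, 12.0, 25.0]))
```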
File Operations – Opening Files: Read, Write, Append Modes – Reading Files: read(), readline(), readlines() – Writing Files: write(), writelines() – File Context Managers (with statement) – File Path Handling: os.path, pathlib. Working with CSV Files – CSV Module – Reading CSV Files: csv.reader(), csv.DictReader() – Writing CSV Files: csv.writer(), csv.DictWriter() – Handling CSV with pandas (Introduction) – CSV Data Cleaning. Working with JSON Files – JSON Module – json.load()/json.loads(), json.dump()/json.dumps() – Reading and Writing JSON – Nested JSON Structures – JSON Data Validation. Working with XML Files – XML Parsing with xml.etree.ElementTree – Reading XML Files – Extracting Data from XML – XML to Dictionary Conversion.
Read and process CSV files; Convert JSON data to Python dictionaries; Parse XML files and extract specific data; Clean and validate data from different file formats; Write processed data to files.
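As a small illustration of the CSV and JSON exercises (the sample data is invented; `io.StringIO` stands in for a real file):

```python
import csv
import json
import io

# csv.DictReader parses each row into a dict keyed by the header row.
raw_csv = "name,age\nana,34\nben,29\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Cleaning step: DictReader yields strings, so cast age to int.
for row in rows:
    row["age"] = int(row["age"])

# JSON round trip: dumps serializes to a string, loads parses it back
# into nested Python dictionaries and lists.
payload = json.dumps({"users": rows})
parsed = json.loads(payload)
ages = [u["age"] for u in parsed["users"]]
```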
NumPy Fundamentals – Introduction to NumPy – Creating Arrays: array(), arange(), linspace(), zeros(), ones() – Array Operations: Indexing, Slicing, Reshaping – Array Mathematics: Element-wise Operations, Broadcasting – NumPy Functions: sum(), mean(), std(), min(), max(). Pandas Introduction – Introduction to pandas – Series: Creation, Indexing, Operations – DataFrames: Creation, Indexing, Selection – Reading Data: read_csv(), read_json(), read_excel() – Basic DataFrame Operations: head(), tail(), info(), describe(). Data Cleaning with Pandas – Handling Missing Values: isnull(), fillna(), dropna() – Data Type Conversion – Removing Duplicates – String Operations on DataFrames – Date and Time Handling.
Perform array operations using NumPy; Create DataFrames from various data sources; Clean datasets with missing values; Convert data types and handle date columns; Perform basic statistical operations.
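A compact sketch of these exercises, assuming NumPy and pandas are installed (the cities and temperatures are made-up sample data):

```python
import numpy as np
import pandas as pd

# NumPy: vectorized arithmetic and summary statistics.
a = np.array([1.0, 2.0, 3.0, 4.0])
doubled = a * 2        # element-wise (broadcast) multiplication
mean = a.mean()

# pandas: build a DataFrame, handle missing values, convert types.
df = pd.DataFrame({"city": ["Pune", "Agra", None], "temp": ["31", "28", "30"]})
df["temp"] = df["temp"].astype(int)        # string column -> int
df["city"] = df["city"].fillna("unknown")  # impute the missing city
```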
Data Transformation Techniques – Filtering Data: Boolean Indexing, Query Methods – Sorting Data: sort_values(), sort_index() – Grouping and Aggregation: groupby(), agg() – Merging and Joining: merge(), join(), concat() – Pivot Tables. Data Validation and Quality – Data Type Validation – Range Checking – Pattern Matching – Data Completeness Checks – Outlier Detection (Basic Concepts) – Data Quality Metrics. Working with Large Files – Chunking Large Files – Memory-Efficient Processing – Streaming Data Processing – File Compression: gzip, zipfile. Performance Optimization – Vectorization vs Loops – Using NumPy for Performance – Profiling Python Code – Best Practices for Data Processing.
Transform and aggregate data using pandas; Merge multiple datasets; Validate data quality; Process large files in chunks; Optimize data processing code for performance.
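The grouping, merging, and chunking exercises could be sketched as follows (sample sales data invented for illustration; `pd.read_csv(path, chunksize=...)` yields DataFrames the same way the manual chunks do here):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 200, 50, 25],
})
regions = pd.DataFrame({"region": ["north", "south"], "manager": ["dev", "mia"]})

# Group and aggregate, then merge in attributes from a second dataset.
per_region = sales.groupby("region", as_index=False)["amount"].sum()
report = per_region.merge(regions, on="region", how="left")

# Chunked processing: combine partial aggregates instead of loading
# everything into memory at once.
chunks = [sales.iloc[:2], sales.iloc[2:]]
total = sum(chunk["amount"].sum() for chunk in chunks)
```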
DataFrame Operations – Advanced Indexing: loc, iloc – Boolean Indexing – MultiIndex DataFrames – Reshaping Data: pivot, melt, stack, unstack – Combining Data: concat, merge, join – Handling Time Series Data. Data Aggregation and Grouping – GroupBy Operations – Aggregation Functions: sum, mean, count, min, max – Custom Aggregation Functions – Transform and Filter Operations – Window Functions: rolling, expanding. Data Cleaning and Preprocessing – Handling Missing Data: Detection, Imputation Strategies – Removing Duplicates – Data Type Conversion – String Manipulation: str methods, regex – Date and Time Parsing – Categorical Data Handling.
Perform complex grouping and aggregation operations; Reshape data for analysis; Clean and preprocess messy datasets; Handle time series data; Transform categorical variables.
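A short reshaping sketch covering melt, pivot, and categorical dtypes (the wide temperature table is an invented example):

```python
import pandas as pd

wide = pd.DataFrame({
    "city": ["Pune", "Agra"],
    "jan": [31, 28],
    "feb": [33, 30],
})

# melt: wide -> long, one row per (city, month) observation.
long = wide.melt(id_vars="city", var_name="month", value_name="temp")

# pivot: long -> wide again, months back as columns.
back = long.pivot(index="city", columns="month", values="temp")

# Categorical dtype for a low-cardinality column saves memory.
long["city"] = long["city"].astype("category")
```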
Descriptive Statistics – Measures of Central Tendency: Mean, Median, Mode – Measures of Dispersion: Variance, Standard Deviation, Range, IQR – Skewness and Kurtosis – Percentiles and Quartiles – Correlation Analysis. Statistical Tests – Hypothesis Testing Concepts – t-tests – Chi-square Tests – ANOVA (Introduction) – Using scipy.stats for Statistical Tests – Interpreting p-values. Probability Distributions – Common Distributions: Normal, Binomial, Poisson – Distribution Fitting – Random Sampling – Using numpy.random for Simulations.
Calculate descriptive statistics for datasets; Perform correlation analysis; Conduct hypothesis tests; Fit probability distributions; Generate random samples from distributions.
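The descriptive-statistics exercises can be sketched with the standard-library `statistics` module alone (the course also uses scipy.stats for hypothesis tests; the dataset below is invented and includes a deliberate outlier):

```python
import statistics as st

data = [12, 15, 15, 18, 21, 24, 90]  # note the outlier at 90

mean = st.mean(data)
median = st.median(data)   # robust to the outlier, unlike the mean
mode = st.mode(data)
stdev = st.stdev(data)     # sample standard deviation

# Quartiles and the interquartile range.
q1, q2, q3 = st.quantiles(data, n=4)
iqr = q3 - q1
```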
Matplotlib Fundamentals – Matplotlib Architecture: Figure, Axes, Plot – Creating Basic Plots: Line, Bar, Scatter, Histogram – Customizing Plots: Colors, Markers, Labels, Titles – Figure and Axes Customization – Saving Figures. Advanced Plotting – Subplots: Creating Multiple Plots – Plot Annotations: Text, Arrows, Shapes – Legends and Color Bars – Plot Styling: Themes, Grids – 3D Plotting (Introduction). Time Series Visualization – Plotting Time Series Data – Date Formatting – Multiple Time Series – Annotations and Highlights – Seasonal Decomposition Visualization.
Create various types of plots; Customize plot appearance; Create multi-panel visualizations; Visualize time series data; Add annotations and labels to plots.
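A minimal multi-panel figure along the lines of these exercises, assuming matplotlib is installed (the Agg backend renders to a file so no display is needed; the data is synthetic):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to a file
import matplotlib.pyplot as plt
import tempfile
import os

x = list(range(10))
y = [v * v for v in x]

# Two panels on one figure: a customized line plot and a histogram.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y, marker="o", color="tab:blue", label="y = x^2")
ax1.set_title("Line plot")
ax1.set_xlabel("x")
ax1.legend()
ax2.hist(y, bins=5, color="tab:orange")
ax2.set_title("Histogram")
fig.tight_layout()

# Saving the figure instead of showing it interactively.
out = os.path.join(tempfile.mkdtemp(), "plot.png")
fig.savefig(out)
plt.close(fig)
```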
Seaborn Introduction – Seaborn vs Matplotlib – Seaborn Style and Themes – Setting Aesthetic Parameters – Color Palettes. Statistical Visualizations – Distribution Plots: histplot, kdeplot, displot, rugplot – Categorical Plots: barplot, countplot, boxplot, violinplot – Relationship Plots: scatterplot, lineplot, relplot – Regression Plots: regplot, lmplot. Advanced Seaborn Plots – Heatmaps and Clustered Heatmaps – Pair Plots and Pair Grids – Facet Grids for Multi-dimensional Data – Customizing Seaborn Plots.
Create statistical visualizations with seaborn; Build heatmaps for correlation analysis; Create pair plots for exploratory analysis; Use facet grids for multi-dimensional visualization; Customize seaborn plots.
EDA Workflow – Understanding the Dataset – Data Profiling – Identifying Patterns and Anomalies – Feature Relationships – Data Quality Assessment. Visual EDA Techniques – Univariate Analysis: Distributions, Box Plots – Bivariate Analysis: Scatter Plots, Correlation Heatmaps – Multivariate Analysis: Pair Plots, Parallel Coordinates – Categorical Analysis: Count Plots, Cross-tabulations. Data Storytelling – Choosing Appropriate Visualizations – Creating Dashboards (Basic Concepts) – Presenting Insights – Best Practices for Data Visualization – Common Visualization Mistakes.
Perform complete EDA on a dataset; Create comprehensive visualizations; Identify data patterns and anomalies; Build a data analysis report; Present findings effectively.
Introduction to Databases – Database Management Systems – Relational Model – Tables, Rows, Columns, Keys – Primary Keys, Foreign Keys – Database Normalization Basics – ACID Properties. SQL Fundamentals – DDL: CREATE TABLE, ALTER TABLE, DROP TABLE – DML: INSERT, UPDATE, DELETE – DQL: SELECT Statement – WHERE Clause – ORDER BY, GROUP BY, HAVING – Aggregate Functions: COUNT, SUM, AVG, MAX, MIN. Advanced SQL – JOINs: INNER, LEFT, RIGHT, FULL OUTER – Subqueries: Scalar, Correlated – Window Functions: ROW_NUMBER, RANK, DENSE_RANK – Common Table Expressions (CTEs) – Views and Materialized Views.
Create database schema for data engineering use case; Write complex queries with joins and subqueries; Use window functions for analytical queries; Create views for data access; Optimize query performance.
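A runnable sketch of the SQL exercises using the standard-library sqlite3 driver (the employee schema is invented; the same DDL, JOIN, and window-function syntax carries over to PostgreSQL and MySQL with minor differences):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: a tiny schema with a primary key / foreign key pair.
cur.executescript("""
CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emp  (id INTEGER PRIMARY KEY, name TEXT, salary REAL,
                   dept_id INTEGER REFERENCES dept(id));
INSERT INTO dept VALUES (1, 'eng'), (2, 'ops');
INSERT INTO emp VALUES (1, 'ana', 90, 1), (2, 'ben', 80, 1), (3, 'cal', 70, 2);
""")

# JOIN plus GROUP BY with aggregate functions.
cur.execute("""
SELECT d.name, COUNT(*), AVG(e.salary)
FROM emp e INNER JOIN dept d ON e.dept_id = d.id
GROUP BY d.name ORDER BY d.name;
""")
per_dept = cur.fetchall()

# Window function: rank employees by salary within each department.
cur.execute("""
SELECT name, RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC)
FROM emp ORDER BY id;
""")
ranks = cur.fetchall()
```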
Data Modeling – Entity-Relationship Modeling – Normalization: 1NF, 2NF, 3NF – Denormalization for Performance – Star Schema and Snowflake Schema – Fact and Dimension Tables. Indexing and Performance – Types of Indexes: B-tree, Hash, Bitmap – Creating Indexes – Query Optimization – EXPLAIN Plans – Index Selection Strategies. Partitioning – Table Partitioning Concepts – Range Partitioning – Hash Partitioning – List Partitioning – Partition Pruning.
Design star schema for analytics; Create appropriate indexes; Analyze query execution plans; Implement table partitioning; Optimize database performance.
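The indexing and EXPLAIN-plan exercises can be tried end-to-end in sqlite3 (the orders table is synthetic; production databases expose richer plans, but the workflow of indexing a predicate column and checking the plan is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"c{i % 100}", float(i)) for i in range(1000)])

# Without an index, the predicate below is a full table scan;
# a B-tree index on the filtered column lets the optimizer seek instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer)")

# EXPLAIN QUERY PLAN shows whether the optimizer picked the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'c7'"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
```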
PostgreSQL Features – PostgreSQL Installation and Setup – Advanced Data Types: JSON, JSONB, Arrays – Full-Text Search – PostgreSQL-specific Functions – Extensions and Extensibility. MySQL Features – MySQL Installation and Setup – Storage Engines: InnoDB, MyISAM – MySQL-specific Functions – Performance Schema – MySQL Optimization. Python Database Connectivity – Connecting to Databases: psycopg2, mysql-connector-python – Executing Queries – Parameterized Queries – Connection Pooling – pandas Integration: read_sql, to_sql.
Set up PostgreSQL/MySQL databases; Use advanced data types; Connect Python applications to databases; Execute queries from Python; Integrate pandas with databases.
Introduction to NoSQL – NoSQL Database Types: Document, Key-Value, Column-Family, Graph – When to Use NoSQL – CAP Theorem – MongoDB Overview. MongoDB Fundamentals – MongoDB Installation – Documents and Collections – CRUD Operations – Query Operators – Aggregation Pipeline – Indexing in MongoDB. MongoDB with Python – PyMongo Library – Connecting to MongoDB – CRUD Operations from Python – Aggregation Pipeline in Python – Data Import/Export.
Design MongoDB schema; Perform CRUD operations; Write aggregation pipelines; Connect MongoDB with Python; Import/export data.
ETL Patterns with Databases – Extract from Multiple Sources – Transform Data – Load to Target Database – Incremental Loading – Change Data Capture (CDC) Concepts. Data Quality and Validation – Data Validation Rules – Constraint Checking – Data Profiling – Data Quality Metrics – Handling Data Quality Issues. Database Backup and Recovery – Backup Strategies – Point-in-Time Recovery – Replication Concepts – High Availability Basics.
Build ETL pipeline with databases; Implement data validation; Handle incremental loads; Set up database backups; Design replication strategy.
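The incremental-load exercise can be sketched with a high-water-mark pattern (two in-memory sqlite databases stand in for source and target systems; the `events` schema is invented):

```python
import sqlite3

src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
tgt.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

def incremental_load(src, tgt):
    """Copy only rows newer than the target's high-water mark (max id)."""
    (mark,) = tgt.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()
    rows = src.execute(
        "SELECT id, payload FROM events WHERE id > ?", (mark,)
    ).fetchall()
    tgt.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return len(rows)

src.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
first = incremental_load(src, tgt)   # first run behaves like a full load
src.executemany("INSERT INTO events VALUES (?, ?)", [(3, "c")])
second = incremental_load(src, tgt)  # later runs move only the new rows
```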
Introduction to ETL/ELT – What is ETL? – Extract, Transform, Load Process – ETL vs ELT – When to Use Each Approach – ETL Architecture Patterns – Data Pipeline Components. Extraction Phase – Data Sources: Databases, APIs, Files, Cloud Storage – Extraction Methods: Full Load, Incremental Load – Change Data Capture (CDC) Concepts – Handling Different Data Formats – Error Handling in Extraction. Transformation Phase – Data Cleaning Operations – Data Validation Rules – Data Type Conversions – Aggregations and Calculations – Data Enrichment – Joining and Merging Data.
Design ETL architecture for given scenario; Implement data extraction from multiple sources; Transform data according to business rules; Handle extraction errors; Validate transformed data.
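A pure-Python sketch of the extract/transform/validate flow described above (the record format, the "tier" business rule, and the non-negative-amount validation rule are all hypothetical):

```python
def extract(records):
    """Extract phase: yield parsed rows, skipping ones that fail to parse."""
    for raw in records:
        try:
            uid, amount = raw.split(",")
            yield {"user": uid.strip(), "amount": float(amount)}
        except ValueError:
            continue  # malformed row: skip rather than crash the whole run

def transform(rows):
    """Transform phase: enrich each row by applying a business rule."""
    for row in rows:
        row["tier"] = "high" if row["amount"] >= 100 else "low"
        yield row

def validate(rows):
    """Validation: amounts must be non-negative."""
    return [r for r in rows if r["amount"] >= 0]

raw = ["u1, 150", "broken-line", "u2, 20", "u3, -5"]
loaded = validate(transform(extract(raw)))
```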
Python ETL Libraries – Introduction to ETL Tools – Building Custom ETL Scripts – Using pandas for Transformations – Handling Large Datasets – Memory Management. Pipeline Design Patterns – Sequential Pipelines – Parallel Processing – Pipeline Orchestration Basics – Dependency Management – Error Recovery Strategies. Data Quality in Pipelines – Data Quality Checks – Validation Rules – Data Profiling in Pipelines – Quality Metrics – Handling Quality Issues.
Build Python ETL script; Implement parallel processing; Add data quality checks; Handle pipeline errors; Optimize memory usage.
Introduction to Apache Airflow – What is Airflow? – Airflow Architecture: Scheduler, Executor, Webserver – DAGs (Directed Acyclic Graphs) – Tasks and Operators – Airflow Installation and Setup. Creating DAGs – DAG Definition – Task Definition – Task Dependencies – Scheduling DAGs – DAG Parameters and Configuration – DAG Best Practices. Airflow Operators – Built-in Operators: PythonOperator, BashOperator, SQLOperator – Custom Operators – Sensor Operators – Transfer Operators – Using Operators Effectively.
Create simple DAG; Define task dependencies; Schedule DAG execution; Use different operators; Handle task failures.
Task Dependencies and Control Flow – Complex Dependencies – Branching: BranchPythonOperator – Conditional Execution – Dynamic Task Generation – Task Groups. Variables, Connections, and Secrets – Airflow Variables – Managing Connections – Using Secrets – Environment Variables – Configuration Management. Monitoring and Logging – Airflow UI: Monitoring DAGs – Viewing Logs – Task Status Tracking – Alerting and Notifications – Performance Monitoring. Error Handling and Retries – Task Retries – Retry Policies – Error Notifications – Handling Failures – Recovery Strategies.
Implement complex task dependencies; Use variables and connections; Monitor pipeline execution; Configure retries; Set up alerting.
Pipeline Testing – Unit Testing DAGs – Integration Testing – Testing Data Transformations – Mocking External Dependencies – Test Data Management. Performance Optimization – Identifying Bottlenecks – Optimizing Transformations – Parallel Execution – Resource Management – Scaling Pipelines. Data Lineage and Documentation – Documenting Pipelines – Data Lineage Tracking – Pipeline Metadata – Version Control for DAGs – Best Practices. Deployment and CI/CD – Deploying DAGs – Version Control Integration – CI/CD for Data Pipelines – Environment Management – Rollback Strategies.
Write tests for pipelines; Optimize pipeline performance; Document pipeline architecture; Deploy pipelines; Implement CI/CD.
Introduction to Big Data – What is Big Data? – 3Vs: Volume, Velocity, Variety – Big Data Challenges – Big Data Use Cases – Big Data Technologies Overview. Distributed Computing Concepts – Why Distributed Computing? – Distributed Systems Architecture – Scalability and Fault Tolerance – Data Partitioning – Shuffle Operations. Hadoop Ecosystem Overview – HDFS (Hadoop Distributed File System) – MapReduce Concepts – YARN (Yet Another Resource Negotiator) – Hadoop Ecosystem Tools – When to Use Hadoop.
Identify big data use cases; Design distributed system architecture; Understand data partitioning strategies; Compare big data technologies; Plan data storage strategy.
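The hash-partitioning idea underlying the exercises above can be sketched in a few lines (the key format is invented; distributed engines use the same stable-hash-modulo scheme so that a given key always routes to the same partition):

```python
import hashlib

def partition_for(key, num_partitions):
    """Hash partitioning: a stable hash of the key, modulo partition count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

keys = ["user-1", "user-2", "user-3", "user-4"]
assignment = {k: partition_for(k, 4) for k in keys}

# Stability matters: the same key must always land on the same partition,
# otherwise joins and aggregations across nodes would miss matching rows.
stable = all(partition_for(k, 4) == assignment[k] for k in keys)
```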
Introduction to Apache Spark – What is Spark? – Spark vs MapReduce – Spark Architecture: Driver, Executors, Cluster Manager – Spark Components: Spark Core, Spark SQL, Spark Streaming – Spark Installation and Setup. Spark RDDs (Resilient Distributed Datasets) – RDD Concepts – Creating RDDs – RDD Operations: Transformations and Actions – Lazy Evaluation – RDD Persistence and Caching. Spark DataFrames – DataFrame API – Creating DataFrames – DataFrame Operations – DataFrame vs RDD – Catalyst Optimizer.
Create RDDs from data sources; Perform transformations and actions; Work with DataFrames; Optimize Spark operations; Use caching effectively.
Spark SQL – Spark SQL Overview – Creating Tables and Views – SQL Queries in Spark – User-Defined Functions (UDFs) – Window Functions in Spark. Data Processing with Spark – Reading Data: CSV, JSON, Parquet, Avro – Writing Data – Data Transformations – Joins and Aggregations – Handling Large Datasets. Performance Optimization – Understanding Spark Execution Plan – Partitioning Strategies – Broadcast Variables – Accumulators – Tuning Spark Applications.
Write Spark SQL queries; Process large datasets; Optimize join operations; Use broadcast variables; Tune Spark performance.
Data Lake Concepts – What is a Data Lake? – Data Lake vs Data Warehouse – Data Lake Architecture – Storage Layer: Object Storage (S3, GCS) – Metadata Management. Data Lake Zones – Raw Zone (Bronze) – Processed Zone (Silver) – Curated Zone (Gold) – Zone Design Patterns – Data Lake Best Practices. Data Lake Implementation – Building Data Lake on Cloud – Data Ingestion Patterns – Data Organization: Partitioning, Bucketing – Schema Evolution – Data Governance Basics.
Design data lake architecture; Implement data lake zones; Organize data in data lake; Handle schema evolution; Plan data governance.
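The zone and partitioning layout can be sketched on a local filesystem with `pathlib` (a temporary directory stands in for object storage like S3 or GCS; the Hive-style `year=/month=` path convention is what enables partition pruning later):

```python
from pathlib import Path
import tempfile
import json

root = Path(tempfile.mkdtemp())  # stand-in for an object-storage bucket

def write_event(zone, event, year, month):
    """Hive-style partition layout: zone/table/year=YYYY/month=MM/part.json"""
    part_dir = root / zone / "events" / f"year={year}" / f"month={month:02d}"
    part_dir.mkdir(parents=True, exist_ok=True)
    path = part_dir / "part-0000.json"
    path.write_text(json.dumps(event))
    return path

# The raw (bronze) zone keeps data as it arrived; silver and gold zones
# would be derived from it by later processing jobs.
p = write_event("raw", {"id": 1, "v": "x"}, 2024, 3)
```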
Spark Streaming Concepts – Stream Processing vs Batch Processing – Spark Streaming Architecture – DStreams (Discretized Streams) – Structured Streaming – Stream Processing Patterns. Working with Streaming Data – Reading from Streams – Transformations on Streams – Writing Streams – Window Operations – Watermarking – Handling Late Data. Spark Integration – Spark with Airflow – Spark with Cloud Storage – Spark with Databases – Spark Cluster Management – Monitoring Spark Applications.
Process streaming data; Implement window operations; Handle late-arriving data; Integrate Spark with other tools; Monitor Spark applications.
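The windowing and watermarking concepts can be illustrated without a Spark cluster; this pure-Python sketch assigns timestamped events to fixed-size tumbling windows and drops late arrivals past a watermark (Structured Streaming does the equivalent with `window()` and `withWatermark()`):

```python
def tumbling_windows(events, size):
    """Assign each (timestamp, value) event to a fixed-size tumbling
    window and sum the values per window."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size  # start of the window this event falls in
        windows[start] = windows.get(start, 0) + value
    return windows

def drop_late(events, watermark):
    """Watermarking sketch: discard events older than the watermark."""
    return [(ts, v) for ts, v in events if ts >= watermark]

events = [(1, 10), (4, 5), (7, 2), (12, 1)]
sums = tumbling_windows(events, size=5)   # windows [0,5), [5,10), [10,15)
on_time = drop_late(events, watermark=4)
```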
Introduction to Cloud Data Platforms – Cloud Computing for Data – Major Cloud Providers: GCP, AWS, Azure – Cloud Data Services Overview – Benefits of Cloud Data Platforms. Google Cloud Platform Data Services – GCP Data Services Overview – Cloud Storage – Cloud SQL – Cloud Spanner – BigQuery Introduction – Cloud Dataflow – Pub/Sub. Data Platform Architecture – Cloud Data Architecture Patterns – Data Ingestion – Data Storage – Data Processing – Data Analytics – Cost Optimization.
Compare cloud data platforms; Design cloud data architecture; Choose appropriate cloud services; Plan data platform migration; Optimize cloud costs.
BigQuery Introduction – What is BigQuery? – BigQuery Architecture – Serverless Data Warehouse – BigQuery Console – BigQuery Pricing Model. BigQuery Basics – Creating Datasets and Tables – Loading Data into BigQuery – Querying Data: Standard SQL – BigQuery SQL Syntax – Query Results and Export. Advanced BigQuery – Partitioning Tables – Clustering Tables – BigQuery ML (Introduction) – User-Defined Functions – BigQuery BI Engine.
Create BigQuery datasets; Load data to BigQuery; Write complex queries; Optimize table structure; Use BigQuery ML.
BigQuery Data Types – Supported Data Types – Nested and Repeated Data – JSON Data Handling – Array Operations – Struct Operations. BigQuery Performance Optimization – Query Optimization Techniques – Partitioning Strategies – Clustering Strategies – Query Caching – Best Practices. BigQuery Integration – BigQuery with Python – BigQuery API – Loading Data Programmatically – Streaming Inserts – BigQuery with Airflow.
Optimize BigQuery queries; Implement partitioning; Use BigQuery API; Integrate with Python; Load data from pipelines.
Snowflake Introduction – What is Snowflake? – Snowflake Architecture – Virtual Warehouses – Snowflake Pricing – Snowflake vs BigQuery. Snowflake Basics – Creating Accounts and Databases – Creating Tables – Loading Data – Querying Data – Snowflake SQL Syntax. Snowflake Advanced Features – Clustering Keys – Time Travel – Zero-Copy Cloning – Data Sharing – Snowflake Streams and Tasks.
Set up Snowflake account; Create tables and load data; Write queries; Use advanced features; Optimize performance.
Data Warehouse Fundamentals – What is a Data Warehouse? – OLTP vs OLAP – Data Warehouse Architecture – ETL for Data Warehousing – Data Warehouse vs Data Lake. Dimensional Modeling – Star Schema – Snowflake Schema – Fact Tables – Dimension Tables – Slowly Changing Dimensions (SCDs). Data Warehouse Design – Designing Fact Tables – Designing Dimension Tables – Choosing Grain – Handling Historical Data – Best Practices.
Design star schema; Create fact and dimension tables; Handle slowly changing dimensions; Design data warehouse; Implement dimensional model.
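A star schema and a typical analytical query over it can be sketched in sqlite3 (the date and product dimensions and the sales fact are invented; a warehouse like BigQuery or Snowflake would use the same shape at scale):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes...
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
-- ...while the fact table holds measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES
    (20240101, 1, 100), (20240101, 2, 50), (20240201, 1, 75);
""")

# A typical analytical query: join the fact to its dimensions, then aggregate.
rows = conn.execute("""
SELECT d.month, p.name, SUM(f.amount)
FROM fact_sales f
JOIN dim_date d    ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month, p.name
ORDER BY d.month, p.name;
""").fetchall()
```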