Experience: Daimler-Benz · Siemens · DHL · Deutsche Bahn · Infineon Technologies · BMW · VW · Swisscom
Languages: Swiss German · German · English
Education: Ph.D. in Data Science · MBA
Tools:
Microsoft Fabric: Python · PySpark · SQL · T-SQL · KQL · Data Pipelines · Dataflow Gen2 · Medallion Architecture · Lakehouse · Warehouse · Real-Time Analytics · Semantic Models · Power BI · DAX · Deployment Pipelines · CI/CD · Data Science · Data Agents · Swiss–US Data Privacy
Python vs. PySpark Notebooks in Microsoft Fabric
A simple rule of thumb: for small to medium datasets (under ~100M rows), plain Python is typically faster and cheaper on CU costs. PySpark is the better choice once you're dealing with large datasets — think 100M+ rows or 10 GB+.
One thing worth keeping in mind: processing costs don't just apply to notebooks. Data Pipelines also consume CUs, and they can rack up costs quickly if you're not careful.
In a recent project, I needed to reduce a dataset down to 310 million rows. Given the scale, I ran the entire pipeline in PySpark — and it was the right call.

Power BI in the Browser with Microsoft Fabric
I grew up on desktop tools — Power BI Desktop, Tableau, RStudio. But with Microsoft Fabric, I do everything in the browser, and honestly? I love it.
Notebooks, semantic models, Power BI reports — all in one place. Load the data, transform it, set up a pipeline, move it through Bronze → Silver → Gold, build the semantic model, and publish the report. No context switching. No desktop installs.
Sure, the browser isn't 100% there yet — advanced modelling and Row-Level Security still need the desktop. But we're at 80%, and that 80% covers most of the work.
One environment. One flow. That's the win.

From API to Report: Building an End-to-End Data Pipeline in Microsoft Fabric
A walkthrough of a fully automated pipeline that pulls data from a REST API, transforms it through a Lakehouse and Warehouse, and surfaces insights in a Power BI report — all orchestrated inside Fabric.

Why Fabric?
Microsoft Fabric brings together data engineering, warehousing, and BI into a single unified platform. Instead of stitching together multiple services and credentials, everything lives in one workspace — from the raw API call to the finished Power BI report.
Step 1 — Fetch data from the API (notebook)
The pipeline kicks off by setting a timestamp variable, then firing a Fabric notebook that calls the REST API. The raw JSON responses land in the Files section of the API Lakehouse, preserving the original payload for full auditability.
Step 2 — Parse JSON into a Lakehouse table (notebook)
A second notebook reads those JSON files and flattens them into a structured Delta table inside the same Lakehouse. This is where field mapping, type casting, and any light cleansing happen — keeping the transformation logic version-controlled and reproducible.
Step 3 — Copy data from Lakehouse to Warehouse (copy activity)
A Copy data activity moves the cleansed rows from the Lakehouse Delta table into a staging table in the API Warehouse. The separation between Lakehouse and Warehouse keeps raw storage costs low while giving the downstream SQL layer proper query performance and access control.
Step 4 — Load the reporting table via stored procedure
A stored procedure merges the staging data into the final reporting table and updates the accompanying reporting view. Encapsulating this logic in a stored procedure means the pipeline step stays thin — and the merge logic can be tested and iterated independently.
Step 5 — Refresh the semantic model and serve the report
The final pipeline step triggers a semantic model refresh in Power BI. Once the model is updated, the report surfaces the latest data — completing the journey from raw API payload to interactive dashboard, fully automated and repeatable on any schedule.
Data Pipeline "ForEach" = Python's for loop
Microsoft Fabric's ForEach activity works just like a Python for loop — iterate over a collection, run some logic for each item, repeat.
Processing files? Feed it ["file1.csv", "file2.csv"], wrap a Copy activity inside, and use @item() to reference each file dynamically. Same mental model as looping through a list in Python.

Combining Medallion Architecture with CI/CD in Microsoft Fabric
Everyone agrees Medallion architecture delivers competitive advantages — faster insights, better data trust, greater scalability. But knowing why it works is only half the battle. The part that's rarely talked about is how to actually implement it.
My take: pair it with CI/CD. Map the Bronze, Silver, and Gold layers across your development, test, and production pipeline, and suddenly Medallion stops being a theoretical framework and starts being something you can actually ship.

Reading PDFs: Azure AI Document Intelligence vs. LLMs
Traditional PDF extraction tools may feel outdated compared to today's powerful LLMs — but the distinction matters. Traditional OCR-based tools like Azure AI Document Intelligence extract rather than infer, which is their key advantage: they return field-level confidence scores that make outputs fully auditable.
LLMs are harder to audit. More critically, they carry hallucination risk — under uncertainty, an LLM may confabulate (fabricate to fill gaps!!!) a plausible-looking number rather than return a low confidence score or flag the ambiguity. In a high-stakes context like financial reporting, that silent failure is arguably worse than a flagged extraction failure.


Example Queries for Microsoft Fabric Data Agents
When you set up a Data Agent in Microsoft Fabric, you can give it a set of example queries — sometimes called "few-shot examples" — to help it understand how to respond to questions.
Think of them as cheat sheet entries: you provide a sample question alongside the query logic that should answer it. The agent learns from these patterns and uses them to handle similar questions on its own.
In the image below, you can see this in action:
-
(1) is where you define your example queries as a creator
-
(2) shows the agent's answer when that example is applied
-
(3) is where it gets interesting — here I simply asked the agent to visualize the data, and it knew exactly what to do, drawing on the patterns it had learned
That last part is the magic of few-shot examples: once the agent has good patterns to work from, it can go beyond the examples themselves and handle new requests with confidence.
FICTIVE DATA - NOT REAL NAMES

UDF (User Defined Functions) in PySpark
Standard DataFrame operations cover most needs, but UDFs fill the gaps when you need custom business logic that can't be expressed with built-in Spark functions — things like complex string parsing, conditional transformations involving multiple rules, or calling external libraries (e.g., dateutil, re, custom validators).
Simple Example: Classifying a Customer's Spend Tier
Imagine you have a sales DataFrame and want to tag each customer with a business-defined tier label based on their total spend — logic that's too specific for a simple when/otherwise chain.
Defining the schema explicitely
The rule of thumb: Schema inference is fine for exploration. Defined schemas are essential for production pipelines. In Fabric, where we're building data pipelines that run repeatedly on large datasets, defining the schema upfront is a best practice we want to build the habit of early

Partitioning in PySpark
Partitioning is one of those topics that becomes very important as our data grows. Spark processes data in parallel chunks called partitions. Partitioning strategy directly affects:
-
Query speed — Spark can skip irrelevant partitions entirely (called partition pruning)
-
Shuffle cost — bad partitioning causes expensive data movement across nodes
-
Memory pressure — too few partitions = large chunks per node, too many = overhead
Partition by columns you frequently filter on. A good partition column has low cardinality (not too many unique values) — region with 4 values is great, customer_id with millions of values would create millions of tiny files, which is harmful.

Sometimes we just need Python and Pandas
PySpark is amazing for large datasets (e.g., 10GB+). However, for smaller datasets, Python is faster and cheaper. For smaller datasets, Panda's visualization capability are superior to PySpark.

Inside the Apache Parquet Format
Apache Parquet is a columnar storage format that offers significant advantages over traditional row-based formats like CSV or JSON. Unlike row-oriented storage, Parquet organizes data by columns, which allows analytical queries to read only the relevant columns without scanning unnecessary data. It also embeds the data schema directly in the file and enables efficient compression of repetitive values — for example, a boolean column with many repeated 0s and 1s can be compressed far more effectively than in a plain text format. Microsoft Fabric makes it easy to work with Parquet files stored in OneLake or other cloud storage solutions. source: aka.ms/fabricnotes

Advantages of Managing CLS & RLS in Fabric(Not in Power BI)
Power BI's Column Level Security (CLS) and Row Level Security (RLS) were always a workaround — security defined at the reporting layer because the data source couldn't handle it properly.
With Microsoft Fabric, that excuse is gone.
Define your security rules once in the Fabric Warehouse using standard SQL. Connect Power BI via DirectQuery, and those rules are automatically enforced everywhere — in Power BI, notebooks, SQL clients, every tool. No bypass possible. No duplication. One place to maintain.
Power BI goes back to being what it should
always have been: a presentation layer.
Simpler. Safer. More maintainable.

Pipeline Expression Builder (adding utcNow time stamp)
When working with Microsoft Fabric Data Pipelines, one handy trick is using the Pipeline Expression Builder to dynamically generate timestamped filenames when copying files — for example, from one Lakehouse folder to another.
Instead of hardcoding a filename like movies.csv (which would get overwritten on every pipeline run), you can use an expression like this:
@concat('movies_', utcNow('yyyyMMddTHHmmssZ'), '.csv')

What does this expression do?
It builds a filename dynamically by joining three parts together:
-
'movies_' — a static prefix
-
utcNow('yyyyMMddTHHmmssZ') — the current UTC timestamp at the moment the pipeline runs, formatted as something like 20260511T104632Z
-
'.csv' — a static file extension
The result is a uniquely named file on every run, for example: movies_20260511T104632Z.csv
A few things worth knowing
The @ symbol at the start tells the pipeline engine that what follows is an expression to be evaluated — not a literal string. Without it, you'd end up with the text concat('movies_', ...) as your filename.
The format string yyyyMMddTHHmmssZ follows standard date/time tokens. You can customize it to suit your needs — for instance, yyyy-MM-dd if you only need a date, or yyyyMMdd_HHmmss if you prefer an underscore separator between date and time.
Why bother?
Timestamped filenames are a simple but effective pattern for:
-
Preventing data loss — no more accidentally overwriting yesterday's file
-
Auditability — you can immediately see when each file was generated
-
Debugging — if a pipeline run produces unexpected output, you know exactly which file to look at
It's a small touch, but it makes your pipelines significantly more robust in practice.
Data Lineage
As data travels through its lifecycle, how can you tell which systems have influenced it or what it's made up of as it moves and changes? Data lineage refers to the recording of an audit trail that follows data throughout its lifecycle, capturing both the systems that handle it and the upstream sources it relies on.
Data lineage is valuable for error tracking, accountability, and debugging — both for the data itself and the systems that work with it. It provides a clear audit trail for the data lifecycle and supports compliance efforts. For instance, if a user requests that their data be removed from your systems, having lineage for that data makes it straightforward to identify where it lives and what depends on it.
Data lineage has long been a practice in larger organizations with stringent compliance requirements. However, it is increasingly being adopted by smaller companies as data management grows into a mainstream concern.

Why general-purpose APIs won over custom DB connections
-
Security & access control — APIs let vendors expose only what they want to expose, with fine-grained scoping (OAuth, API keys, rate limits), without handing out raw database credentials.
-
Abstraction & stability — The API contract is stable even if the underlying storage engine changes. Snowflake can re-architect internals; your pipeline doesn't break.
-
Cloud-native architecture — Services like S3, Snowflake, and Salesforce are SaaS or cloud platforms. There is no "database" to connect to directly — the API is the interface.
-
Decoupling — Each hop in your example (Salesforce → S3 → Snowflake → S3 → Spark) is loosely coupled. Teams can swap one component without rewiring everything else.
-
Ecosystem tooling — Tools like Airbyte, Fivetran, dbt, and Prefect are built entirely around API-to-API orchestration, reinforcing the pattern.
PS: If you're into finance, tradingeconomics.com offers 500 free API calls per month (see screenshot). Excellent to pull the API data into a Notebook, use a Data Pipeline, create a semantic model, and visualize the API data in Power BI.

No Data, No AI.
It's that simple.
Why Data Engineering in Microsoft Fabric is the quiet engine behind every AI breakthrough.

Everyone wants AI. Smarter dashboards, automated insights, talking to your data instead of querying it. But here's the unglamorous truth: none of it works without solid Data Engineering underneath. Garbage in, garbage out — always.
Microsoft Fabric gets this. Its Data Engineering layer isn't a checkbox — it's the backbone. It handles ingestion, transformation, storage, and scale so that when AI shows up, it actually has something useful to work with. That's not a small thing.

The Fabric Data Agent — launched in March 2026 — is a glimpse of where this is heading. Ask a question in plain English, get an answer. No query writing, no dashboard hunting. It's not magic; it's what happens when your data engineering is done right and AI finally has a clean surface to work on.
Data Engineering lays the road. AI is the vehicle. But without the road, you're going nowhere.
NEW (Nov, 2025): Incremental Refresh in Data Pipeline "Copy Job"
´Copy Job is designed for simplified data movement from many sources to many destinations with no pipelines required, and it natively supports bulk copy, incremental copy, and change data capture (CDC) replication

Creating a Schema in Lakehouse in Spark SQL vs. PySpark
In Microsoft Fabric Lakehouse, we cannot use vanilla SQL to create a schema — we either need Spark SQL or PySpark.
Both approaches achieve the same result, but they differ in readability, flexibility, and use case. Here's a quick comparison.
-
Spark SQL is the more concise and readable option for DDL (Data Definition Language) operations like table creation. If you're already comfortable with SQL, the syntax will feel familiar.
-
PySpark's StructType approach is more verbose, but it shines when you need to programmatically define or manipulate schemas at runtime — for example, building schemas dynamically based on configuration or input data.
One Important Caveat: Regardless of which approach we use, Spark does not enforce varchar length limits. They are accepted syntactically, but internally all string columns are treated as STRING. So while you can define varchar(200) for documentation or compatibility purposes, Spark will not actually truncate or validate values against that length.


Don't hide your data —
mask it instead
A quick intro to Dynamic Data Masking in Microsoft Fabric
Your data warehouse holds sensitive info. Not everyone who queries it should see all of it. Dynamic Data Masking (DDM) is the pragmatic middle ground between full access and full lockdown.
DDM is a built-in security feature available in both Fabric Warehouse and the Lakehouse SQL Endpoint. The key idea: the real data stays exactly as stored, but non-privileged users see a masked version. No copies, no transformations — just a view-time filter applied per column.
Four ways to mask
You define masking rules directly in your CREATE TABLE statement. There are four flavours:
1. Default - Here full masks are applied depending on the data type of the data in the column. e.g. - Numeric types, use a 0 value.
CREATE TABLE masking_demo (
name VARCHAR(100) MASKED WITH (FUNCTION = 'default()'));
2. Email - Here the first letter of the email address and the suffix .com is exposed.
CREATE TABLE masking_demo (
email VARCHAR(100) MASKED WITH (FUNCTION = 'email()'));
3. Random - This is used on a numeric type wherein the original value is replaced with a random value within a specific range.
CREATE TABLE masking_demo (
number INT MASKED WITH (FUNCTION = 'random(1000, 9999)') );
4. Custom string - This masking method exposes the first and last letters of the string and adds a custom padding string in the middle.
CREATE TABLE masking_demo (
id VARCHAR(20) MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX",4)') );

One important caveat
Masking is not encryption. DDM prevents accidental exposure — a junior analyst running a broad SELECT query won't see raw credit card numbers. But a determined, privileged user can still work around it. If you're defending against malicious insiders, DDM alone isn't enough. Combine it with OLS, RLS, and CLS.
DATA ARCHITECTURE: can save millions
Data architecture is the overall design and organisation of data within an information system — and getting it wrong is expensive.
When building a data solution, you need a well-thought-out blueprint. A data architecture defines your high-level approach, the technologies you'll use, and how data flows through your solution. There's no "one size fits all" answer, which is exactly what makes the decision so challenging.
The framework below cuts through that complexity — focused specifically on Microsoft Fabric.

The blueprint routes you through three questions in order:
1. Is the data streaming or time-series? If yes, stop — Eventhouse is the answer. KQL is built for this; Warehouse and Lakehouse are not.
2. Is it structured and SQL-first? If yes, the tie-breaker is scale and ACID needs. Enterprise, governed reporting with clean schemas → Warehouse. Simpler or exploratory SQL → Lakehouse via its SQL analytics endpoint.
3. Mixed formats, raw files, or ML workloads? That's Lakehouse territory — Delta Lake, Spark, and the flexibility to evolve schemas later.
Then the overlays: data marts are a serving layer on top (great for domain/departmental teams), not a replacement. Data mesh is an organisational principle — it tells you who owns the domains, not which engine runs them.
The most expensive mistake is treating these as mutually exclusive. A mature Fabric architecture often uses all three engines — Eventhouse for real-time, Lakehouse as the medallion foundation, Warehouse for governed BI — connected via OneLake.
Solar Drone Inspection Analytics
Dataset 20-row mockup dataset simulating a drone inspection of a solar farm. GPS coordinates clustered around Sankt Gallen (47.43°N, 9.31°E). 3 drones (drone_01–03) covering panel zones A–D, spaced at realistic 60–90 second intervals.
Anomaly rate intentionally set at ~55% (11/20 rows) to give Power BI something interesting to visualize. In production, expect 5–15% on a well-maintained farm.
Anomaly Types & Severity
-
hot_spot — Localized cell failure · severity 6.8–8.9 · temp delta +17–26°C
-
bypass_diode — String-level failure · most critical · severity 9.1–9.7 · temp delta +28–31°C
-
soiling — Dirt/dust accumulation · lowest priority · severity 2.4–3.1 · temp delta +5–6°C
-
delamination — Physical panel degradation · severity 5.3 · temp delta +12°C
Architecture Decision
Eventhouse (KQL / real-time analytics) is overkill for solar inspection — panels degrade over days/weeks, not seconds. A Lakehouse + Power BI pipeline is the right pattern here:
SD card offload → Blob Storage → OneLake Lakehouse → Power BI
Real-time via Eventhouse would only make sense for edge cases like thermal runaway risk or live grid correlation.


Real-time via Eventhouse would only make sense for edge cases like thermal runaway risk or live grid correlation
.
Microsoft Fabric Advantage End-to-end analytics in a single platform — ingest, store, enrich, and visualize without leaving the Fabric ecosystem. The zone/drone structure sets up naturally as Power BI slicers (filter by drone, zone, anomaly type, severity threshold).

