Window Function

Ayush Prashar

Date: 8th November, 2024

Contents

Agenda

Introduction to Window Functions
Types of window function
- Aggregate Window Function
- Ranking Function
- Value Based Window Function
- Statistical Functions
Window Frame Clauses
Window Function Advantages
Window Function Limitations

Introduction to Window Functions

Window functions are a powerful feature in SQL and other analytical tools, allowing you to perform complex calculations across a set of table rows related to the current row without the need to group or aggregate data as in traditional aggregate functions.

Unlike aggregate functions, window functions operate on a "window" or a subset of rows defined by the query, making it easier to perform analytical computations like running totals, moving averages, and rank calculations.

Components of Window Functions

A window function consists of several key clauses and elements:

OVER Clause: Defines the window or subset of rows on which the function will operate.
PARTITION BY: Splits data into partitions to which the function is applied individually. It’s similar to GROUP BY but doesn’t reduce the rows.
ORDER BY: Orders rows within each partition, crucial for functions that require ordering, like ranking or cumulative sums.
Window Frame: Defines a subset of rows within the partition relative to the current row. By default, this frame includes all rows in the partition, but you can customize it with clauses like ROWS BETWEEN or RANGE BETWEEN.

Why Use Window Functions?

Window functions let you perform advanced calculations across sets of rows while keeping each individual row in the result, unlike GROUP BY, which combines rows into one per group. This means you can calculate metrics like running totals, rankings, and comparisons between rows without losing the original data structure. Here’s why they’re useful:

Detailed Analysis Without Aggregating Rows: You can keep individual rows while calculating cumulative sums, averages, or ranks within each subset of data. This is great for tracking trends over time (e.g., running total of monthly sales).
Easy Row Comparisons: Functions like LAG() and LEAD() allow you to compare data between rows directly (like finding differences between current and previous values), which is essential in time-series analysis and tracking changes.
Efficient Grouped Calculations: The PARTITION BY clause allows you to calculate metrics within groups (e.g., department-wise salary rankings) while keeping the row-level detail.
Simplicity and Efficiency: Instead of complex joins or subqueries, window functions let you achieve advanced analytics in a more streamlined and efficient way, making them essential for reports and dashboards that require detailed metrics and comparisons.

Syntax

function_name(column_name) OVER (
[PARTITION BY partition_column]
[ORDER BY order_column]
)

Types of window function

Window functions come in several types, each suited for specific kinds of analysis. Here’s a breakdown of the main types:

Aggregate Window Functions

Aggregate window functions in SQL and other data-processing tools are powerful tools that allow you to perform calculations across sets of rows while still returning a row for each individual record. These functions, often combined with OVER() and PARTITION BY clauses, enable more advanced analytical and reporting capabilities in database management.

Key Concepts of Aggregate Window Functions:

Window Functions: A window function performs a calculation across a set of table rows related to the current row, defined by a "window." The window is specified by the OVER() clause.
Aggregate Functions: Aggregate functions are commonly used to perform calculations on a group of rows and return a single result. Examples include SUM, AVG, COUNT, MIN, and MAX.
Window Frame: Window functions operate on a "frame," a subset of rows relative to the current row within the partition. For example, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW defines a frame from the start of the partition up to the current row.
Partitioning and Ordering: You can partition data using PARTITION BY to divide data into subsets (windows) before the calculation is applied. ORDER BY within OVER() allows you to define the order of rows within each partition.

Common Aggregate Window Functions

Sales table

Employees table

SUM(): Calculates the running or grouped sum of values.

SELECT product_id,
sale_date,
sale_amount,
SUM(sale_amount) OVER(PARTITION BY product_id ORDER BY sale_date) AS running_total
FROM sales;

Explanation: Calculates the running total of sale_amount per product_id ordered by sale_date.

AVG(): Computes the average across the window.

SELECT department,
employee_id,
salary,
AVG(salary) OVER(PARTITION BY department) AS department_avg_salary
FROM employees;

Explanation: Computes the average salary per department, assigning the same average to each employee within the department.

COUNT(): Counts the number of rows in the window.

SELECT product_category,
product_id,
COUNT(*) OVER(PARTITION BY product_category) AS total_products_in_category
FROM products;

Explanation: Counts the total number of products per category.

MIN() and MAX(): Find the minimum or maximum value within the window.

SELECT department,
employee_id,
hire_date,
MIN(hire_date) OVER(PARTITION BY department) AS earliest_hire_date
FROM employees;

Explanation: Retrieves the earliest hire date per department.

ROW_NUMBER() (Non-Aggregate but Related): Assigns a unique row number for each row within a partition.

SELECT department,
employee_id,
ROW_NUMBER() OVER(PARTITION BY department ORDER BY hire_date) AS row_num
FROM employees;

Explanation: Assigns a unique number to each employee within each department ordered by their hire date.

Ranking Functions

Ranking functions in SQL are powerful tools used to assign a rank to each row within a partition of a result set. They are commonly used in reporting, analytics, and complex data manipulations, where ordered data is grouped based on specific criteria. Ranking functions are window functions and must be used with the OVER() clause, which defines the partition and the order for the ranking.

ROW_NUMBER()

The ROW_NUMBER() function assigns a unique sequential integer to rows within a partition, starting from 1 for each new partition.

Syntax:

ROW_NUMBER() OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Employees table:

Query:

SELECT department,
employee_id,
salary,
ROW_NUMBER() OVER(PARTITION BY department ORDER BY hire_date) AS row_num
FROM employees;

RANK()

The RANK() function assigns a rank to each row within a partition, but unlike ROW_NUMBER(), it assigns the same rank to rows with identical values in the ORDER BY clause. If there are ties, it skips the subsequent ranks.

Syntax:

RANK() OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table:

Query:

SELECT product_id,
sale_date,
sale_amount,
RANK() OVER(PARTITION BY product_id ORDER BY sale_amount DESC) AS sale_rank
FROM sales;

DENSE_RANK()

The DENSE_RANK() function is similar to RANK(), but it does not skip ranks if there are ties. Rows with the same value in the ORDER BY clause get the same rank, and the next rank continues sequentially.

Syntax:

DENSE_RANK() OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table:

Query:

SELECT product_id,
sale_date,
sale_amount,
DENSE_RANK() OVER(PARTITION BY product_id ORDER BY sale_amount DESC) AS dense_sale_rank
FROM sales;

NTILE(n)

The NTILE(n) function divides rows into n roughly equal groups and assigns a group number to each row, from 1 to n.

Syntax:

NTILE(n) OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Employee Table:

Query:

SELECT department,
employee_id,
salary,
NTILE(2) OVER(PARTITION BY department ORDER BY salary DESC) AS salary_quartile
FROM employees;

Value Based Window Function

Value-based window functions are window functions that return specific values from a set of ordered data within a defined window frame. These functions are useful in retrieving values based on relative or absolute position within a partition of data, often used in time series analysis, financial analysis, and trend reporting.

FIRST_VALUE()

The FIRST_VALUE() function retrieves the first value in a specified window frame, ordered by a defined criterion.

Syntax:

FIRST_VALUE(column_name) OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table

Query:

SELECT product_id,
sale_date,
sale_amount,
FIRST_VALUE(sale_amount) OVER(PARTITION BY product_id ORDER BY sale_date) AS first_sale
FROM sales;

LAST_VALUE()

The LAST_VALUE() function retrieves the last value in a specified window frame, ordered by a defined criterion.

Syntax:

LAST_VALUE(column_name) OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales table

Query:

SELECT product_id,
sale_date,
sale_amount,
LAST_VALUE(sale_amount) OVER(PARTITION BY product_id ORDER BY sale_date RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sale
FROM sales;

LAG()

The LAG() function retrieves a value from a specified number of rows “before” the current row in the same result set partition.

Syntax:

LAG(column_name, offset, default_value) OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table

Query:

SELECT product_id,
sale_date,
sale_amount,
LAG(sale_amount, 1, 0) OVER(PARTITION BY product_id ORDER BY sale_date) AS previous_sale
FROM sales;

LEAD()

The LEAD() function retrieves a value from a specified number of rows “after” the current row in the same result set partition.

Syntax:

LEAD(column_name, offset, default_value) OVER(PARTITION BY column1 ORDER BY column2)

Example: Using Sales table

Query:

SELECT product_id,
sale_date,
sale_amount,
LEAD(sale_amount, 1, 0) OVER(PARTITION BY product_id ORDER BY sale_date) AS next_sale
FROM sales;

Statistical Functions

VARIANCE() - Variance

The VARIANCE() function calculates the statistical variance within the window frame. Variance measures the spread of data points around the mean.

Syntax:

VARIANCE(column_name) OVER (PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table

Query:

SELECT product_id, sale_date, sale_amount, VARIANCE(sale_amount) OVER(PARTITION BY product_id ORDER BY sale_date) AS variance_sale FROM sales;

STDDEV() - Standard Deviation

The STDDEV() function calculates the standard deviation, a measure of data dispersion, within the window frame.

Syntax:

STDDEV(column_name) OVER (PARTITION BY column1 ORDER BY column2)

Example: Using Sales Table:

Query:

SELECT product_id, sale_date, sale_amount, STDDEV(sale_amount) OVER(PARTITION BY product_id ORDER BY sale_date) AS stddev_sale FROM sales;

Window Frame Clauses

Window frame clauses define the specific subset of rows, or "frame," within a partition that the window function should operate on. By narrowing down the window to a subset of rows, you can control how calculations like running totals, moving averages, and comparisons are computed. Window frames are essential for fine-tuning window functions, especially when working with time series or ordered data.

Structure of a Window Frame Clause:

A window frame is defined within the OVER clause of a window function and includes:

Partitioning: Specifies how rows are grouped within the data set using the PARTITION BY clause.
Ordering: Orders the rows in each partition with the ORDER BY clause.
Frame Specification: Specifies the exact rows within the partition to include in the calculation.

Syntax:

<window_function> OVER (
PARTITION BY <column>
ORDER BY <column>
ROWS or RANGE BETWEEN <frame_start> AND <frame_end>
)

Frame Types: ROWS vs RANGE

ROWS: Specifies the frame based on a fixed number of rows before and after the current row.

Example: ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING includes two rows before and one row after the current row.
Use Case: Useful for specific row counts, such as calculating a three-day moving average for each day in a time series dataset.

RANGE: Specifies the frame based on a range of values rather than a fixed number of rows, which is useful for date and time ranges.

Example: RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW includes all rows within a 7-day range up to the current date.
Use Case: Useful for calculations like cumulative totals over specified time periods (e.g., one week).

Frame Boundaries

Frame boundaries further specify the start and end points of the frame and determine which rows are included in the calculation. The common frame boundaries are:

UNBOUNDED PRECEDING: Includes all rows from the start of the partition up to the current row.
UNBOUNDED FOLLOWING: Includes all rows from the current row to the end of the partition.
CURRENT ROW: Includes only the current row in the calculation.
n PRECEDING: Includes n rows before the current row.
n FOLLOWING: Includes n rows after the current row.

Common Frame Specifications and Their Uses

Cumulative Calculation: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Includes all rows from the start of the partition up to the current row, creating a cumulative calculation.
Use Case: Running totals, cumulative sums, and cumulative averages.

Example: Using Sales Table:

SELECT date, sales,
SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales_data;

Moving Window Calculation: ROWS BETWEEN n PRECEDING AND CURRENT ROW

Includes the n rows before the current row and the current row, useful for calculating moving averages or totals within a sliding window.
Use Case: Calculating moving averages or running sums over a specific number of rows.

Example:

SELECT date, sales,
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_data;

Centered Window Calculation: ROWS BETWEEN n PRECEDING AND n FOLLOWING

Includes n rows before and n rows after the current row.
Use Case: Useful for smoothing functions or trend analysis where surrounding values are equally considered.

Example:

SELECT date, sales,
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS centered_avg
FROM sales_data;

Fixed Range Calculation: RANGE BETWEEN INTERVAL 'n' DAY PRECEDING AND CURRENT ROW

Includes all rows within a specified date range, providing flexibility to include rows based on time intervals rather than a fixed number of rows.
Use Case: Cumulative sales over a rolling period, like the last 7 days, in time-based data.

Example:

SELECT date, sales,
SUM(sales) OVER (ORDER BY date RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS weekly_total
FROM sales_data;

Window Function Advantages

Window functions offer several significant advantages for data analysis in SQL, making them highly useful for complex calculations and aggregations that require row-by-row operations.

1. Enhanced Performance and Efficiency

Window functions perform aggregations without needing subqueries or self-joins, reducing complexity and improving query performance.
By processing calculations over defined windows within a single SQL pass, window functions are faster and more efficient than other aggregation approaches for large data sets.

2. Row-Level Analysis with Aggregate-Like Behavior

Unlike traditional aggregate functions, window functions allow you to perform aggregations without collapsing rows.
This means you can calculate cumulative sums, moving averages, and rankings while still retaining row-by-row visibility, providing richer insights for each individual record.

3. Flexible and Dynamic Windowing

Window functions allow you to define specific partitions, orders, and frames, offering control over how rows are grouped and the range of rows included in calculations.
This flexibility is ideal for creating dynamic analyses like running totals, year-to-date calculations, and comparisons over different time periods.

4. Simplifies Complex Calculations

Window functions can handle complex calculations like ranking (e.g., row numbering, ranking by value), percentiles, and lag/lead functions for comparing values between rows.
These functions are particularly useful in financial analysis, trend analysis, and any dataset where comparing rows across time or sorted groups is essential.

5. Partitioning for Group-Specific Analysis

Using the PARTITION BY clause, window functions can operate within distinct groups, allowing you to perform calculations within each partition independently.
This is valuable for grouped metrics, like cumulative sales per customer or ranking within each category, without requiring separate aggregation for each group.

Window Function Limitations

1. Performance Overhead on Large Datasets

Window functions can be computationally intensive, particularly on very large datasets or when combined with complex partitioning and ordering criteria. This can lead to performance degradation if not optimized carefully.
Unlike simple aggregations, window functions often process each row individually, which can result in higher resource consumption.

2. Limited Support for Nesting

SQL does not allow direct nesting of window functions. For example, you cannot use the output of a window function within another window function in the same query.
As a workaround, nested window function calculations often require Common Table Expressions (CTEs) or subqueries, which can make the query structure more complex.

3. Cannot Be Used in WHERE or HAVING Clauses

Window functions cannot be directly used in WHERE or HAVING clauses, as these clauses are applied before the window function calculations are executed.
To filter rows based on window function results, you typically need to use a subquery, a CTE, or filter in a later stage of the query (such as with a SELECT wrapping the windowed results).

4. Partitioning and Ordering Requirements

Window functions work best when partitions and ordering are clearly defined. Without appropriate partitioning or ordering, the results may not align with expectations, especially with functions that rely on row sequences, such as LAG, LEAD, or cumulative calculations.
Careful ordering is essential for meaningful calculations in cumulative sums, running totals, and moving averages, as row order affects the results.

5. Limited Use in Aggregations with GROUP BY

Window functions operate on individual rows and can’t be directly combined with GROUP BY clauses in the same query. This requires you to choose between row-by-row and grouped aggregations or to use separate query steps to combine both.
Often, this necessitates the use of nested queries or CTEs, which can make the SQL logic more complex to manage.

Query Solving

Download Dataset From Here

The dataset consists of 50 records, each representing a user with the following attributes:

employee_id: Used for identifying unique employees.

department_id: Used in PARTITION BY to group data by departments.

salary: A key column used for calculating cumulative totals, ranks, and percentiles.

hire_date: Used to order data chronologically within departments, which is crucial for functions like LAG(), LEAD(), and cumulative calculations.

manager_id: Could be used in advanced queries to explore hierarchical data (not used in the above queries).

Query 1:- Calculate the cumulative salary within each department using SUM().

SELECT
department_id,
employee_id,
salary,
SUM(salary) OVER (PARTITION BY department_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_salary
FROM
employees;

Query 2:- Use LAG() to compare each employee’s current salary with their previous salary (assuming one row per month).

SELECT
employee_id,
hire_date,
salary,
LAG(salary, 1) OVER (PARTITION BY employee_id ORDER BY hire_date) AS previous_salary
FROM
employees;

Query 3:- Calculate a moving average of the salary for the last 3 hires in each department using AVG().

SELECT
department_id,
employee_id,
salary,
AVG(salary) OVER (PARTITION BY department_id ORDER BY hire_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_salary
FROM
employees;

Query 4:- Find the top 3 highest-paid employees in each department using RANK().

SELECT * FROM (
SELECT
department_id,
employee_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM
employees
) AS ranked_salaries
WHERE salary_rank <= 3;