Monday, November 7, 2011

Sub query expressions vs. Joins.

It seems that most developers underestimate performance degradation when using sub queries as column expressions in comparison to using joins to retrieve a large dataset. The difference is minimal when the outer SELECT statement will return just a couple of rows but it is more noticeable when the SELECT statement returns thousands of rows. The best way to demonstrate this is to try it out and analyze the execution plans of both.

Test Scenario

I've used a simple two table database using the typical Managers - Employees tables, where multiple employees can be linked to the same Manager.
The attached scripts will create both tables together with the data population. The managers table will contain 3 rows, while 50001 rows will reside in the employees table.

Editor: How did you get this, might show some code and explain what this means
The following SQL will count the number of records for each distinct manager ID
Select COUNT(*),ManagerIDFrom EmployeesGroup By ManagerIDOrder By ManagerID
and this the result, displaying the records distribution.

The script will also create a non-clustered covering index (ix_Employees) on the employee table on the ManagerID column to eliminate any Key Lookups.


Running the Tests

Now for the actual test, I've written a simple query to retrieve all employees for a particular manager, returning also the manager name. This will involve scanning the employee table for the selected ManagerID and linking to the Managers table to retrieve the Manager Name.

For the above tests 1 & 2, there was no difference in performance; both returned 1 row as expected with identical query cost.


However the performance difference becomes more obvious when more rows are returned as in Tests 3 & 4. The number of rows returned for these tests is 40000.

Analyze the Results

Let's investigate what happened by analyzing the execution plans. In the first two tests, an Index Seek was performed on the Employees table to locate ManagerID 3, which returned one row, and then this output was joined to the Manager table to retrieve the Manager Name.

Execution plans for Tests 1 & 2.


Execution plans for Tests 3 & 4.

Although the same queries performed identically when retrieving one row, the result is much different now as the query cost suggests. This is because, when executing the query using the sub query expression (Test 3), the query optimizer introduced a table spool operator as a temporary lookup table for the selected manager and each row returned from the Employee table was evaluated x times, i.e. 40000 times in this case.
Only one row was passed from the Clustered Index Seek on the Managers table to the Table Spool Operator. However, the cost for this operator is 93% of the total cost. Expanding the properties for this operator, it's interesting to notice the actual Rebind and Rewinds counts, whose summation equals the total number of rows returned or the number of lookups performed on the temporary table to retrieve the manager name.


Conclusion

The aim of this article is to demonstrate that although both methods do return the correct results, and both work almost identically when returning a small number of rows, it's always wise to first verify in what circumstances the queries will be used since as demonstrated, when the number of rows returned is large, one solution will work better than the other.