/* ISC 561 Subqueries in the WHERE clause Kroenke text (11th ed.) pp. 71-76 Querying multiple tables with subqueries pp. 77 Comparing Subqueries and Joins Generic syntax As a Join SELECT attributelist FROM table1, table2 WHERE table1.Pk=table2.Fk AND condition list An uncorrelated subquery allows us to perform a JOIN on a smaller sublist of FKs (conceputallly anyway) In some cases, filtering a large "many" table in a 1:M JOIN with a subquery can produce an order of magnitude difference in performance. This is done by reducing the number of Pk=Fk comparisons in the JOIN. If the number of FKs is reduced before the JOIN operation, obviously there will be less comparisons and less time required to produce the result. But on balance, the optimizer will recognize this and tune a JOIN to perform as well. It just depends... The problem is, that it isn't really that simple. One must look at existing indices, views, and so on. However, there are some results that cannot be obtained by a JOIN. Specifically, those cases where you need to test the absence of a value. Pk <> Fk is not the same thing as the absence of a match. Well, we'll get to that in due course... WHERE subqueries can produce enumerated lists: WHERE Pk IN (value1, value2, ...) Or single values WHERE Attribute = (value1) ----Enumerated lists As an uncorrelated subquery SELECT attributelist FROM table1 WHERE condition list AND table1.Pk [NOT] IN(SELECT table2.Fk --only one attribute FROM table2 WHERE condition list) As a correlated subquery SELECT attributelist FROM table1 WHERE condition list AND [NOT] EXISTS (SELECT * FROM table2 WHERE table1.Pk=table2.Fk AND condition list) SELECT attributelist FROM table1 WHERE condition list AND table1.Pk IN (SELECT table2.Fk FROM table2 WHERE table1.Pk=table2.Fk AND condition list) A drawback of subqueries in the WHERE clause is that we cannot SELECT any attributes from tables in the inner query. A common use for WHERE subqueries on a particular table is to filter rows some other outer table rather than to retrieve values from the inner table. An EXISTS subquery can be used to improve the performance of a join that requires DISTINCT -----Single values SELECT attributelist FROM table1 WHERE condition list AND table1.attribute [comparison operator] (SELECT table2.Fk FROM table2 WHERE condition list) */ --Using EXISTS to improve performance of join with DISTINCT --List Lastname and Firstname of employees who have posted an order --We don't want duplicates SELECT Employees.LastName, Employees.FirstName FROM Employees, Orders WHERE Employees.EmployeeID=Orders.EmployeeID --We insert DISTINCT. The trouble is, it forces a Sort... SELECT DISTINCT Employees.LastName, Employees.FirstName FROM Employees, Orders WHERE Employees.EmployeeID=Orders.EmployeeID --Or, we can lookup whether an order exists for each employee as --we scan through the Employees table. This is faster in part because --the subquery terminates the search as soon as a match is found. SELECT Employees.LastName, Employees.FirstName FROM Employees WHERE EXISTS (SELECT * FROM Orders WHERE Employees.EmployeeID=Orders.EmployeeID) --An IN() uncorrelated will perform better too. In this case because --the optimizer recognized the access path would work for both results. SELECT Employees.LastName, Employees.FirstName FROM Employees WHERE Employees.EmployeeID IN(SELECT Orders.EmployeeID FROM Orders) /* Marketing just called and they want to know if we can produce a list of categories that contain products with a UnitPrice greater than $70. They are considering a new marketing campaign for their high-end product categories. One approach is this... Look in the Products table and retrieve the CategoryIDs we need */ USE Northwind SELECT Categories.CategoryID, CategoryName, UnitPrice FROM Categories, Products WHERE Categories.CategoryID=Products.CategoryID ORDER BY UnitPrice DESC SELECT Categories.CategoryID, CategoryName, UnitPrice FROM Categories, Products WHERE Categories.CategoryID=Products.CategoryID AND UnitPrice > 70 SELECT DISTINCT Categories.CategoryID, CategoryName--, UnitPrice FROM Categories, Products WHERE Categories.CategoryID=Products.CategoryID AND UnitPrice > 70 /* Then write a disjunct condition for a WHERE clause against the categories table. */ SELECT CategoryID, CategoryName FROM Categories WHERE CategoryID = 1 OR CategoryID = 3 OR CategoryID = 6 /* Note how much slower the first query is. Why? */ /* A slightly better approach would be to put the values 1 and 6 in an IN() clause. */ SELECT CategoryID, CategoryName FROM Categories WHERE CategoryID IN(1,3,6) /* An even better approach would be to dynamically look the values up using a subquery. */ SELECT CategoryID, CategoryName FROM Categories WHERE CategoryID IN(Subquery) --And so... SELECT CategoryID, CategoryName FROM Categories WHERE CategoryID IN(SELECT CategoryID FROM Products WHERE UnitPrice > 70) --As a join SELECT DISTINCT Categories.CategoryID, CategoryName--, UnitPrice FROM Categories, Products WHERE Categories.CategoryID=Products.CategoryID AND UnitPrice > 70 /* The sales staff is wondering if there is a seasonal trend in demand for vegetarian ingredients, for example, Tofu. List the OrderID and OrderDate for all orders that contain Tofu (ProductID = 14) */ --Join SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderID AND ProductID = 14 GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered --Is there a trend? Hard to tell. Maybe --Uncorrelated using IN() /* In this case, we only need [Order Details] in order to filter rows from the Orders table. We only want orders that contain orders for Tofu. So one strategy is to look up the list of OrderIDs from the Many table and use this list to filter the 1 table. */ SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders WHERE Orders.OrderID IN(10249, 10325) GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered SELECT COUNT(*) FROM [Order Details] --This will be our enumerated list of OrderIDs. SELECT OrderID FROM [Order Details] WHERE ProductID = 14 /* Insert the subquery into the IN() clause This is uncorrelated because it makes no reference to the outer table Orders. We can run the inner query separately. */ SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders WHERE Orders.OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID = 14) GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderID AND ProductID = 14 GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered /* Performance? The optimizer used the same execution plan for both formulations. In this case, it comes down to which query is easier to maintain and understand. I find the subquery makes more sense because I am only using the subquery to create a list of FKs. The real point of the query is to list Orders. */ --Correlated using EXISTS SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders WHERE Orders.OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID = 14) GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered SELECT DATEPART(Month, OrderDate) AS MonthOrdered, COUNT(*) MonthlyCount FROM Orders WHERE EXISTS (SELECT * FROM [Order Details] AS OD WHERE Orders.OrderID=OD.OrderID AND ProductID = 14) GROUP BY DATEPART(Month, OrderDate) ORDER BY MonthOrdered --And again, the optimizer chose the same execution plan despite the different SQL /* The customer Alfreds Futterkiste (CustomerID = 'ALFKI') just rang up to demand that a different employee be assigned to their account. They don't care who it is as long as it isn't an employee they have already worked with. It must be some one new or they are taking their account somewhere else. Period... Oh and they would appreciate it, "danke", if it were not an American employee. List the EmployeeID, FirstName, and LastName of all non 'USA' employees who have not worked on an order for 'ALFKI'. */ --Join (can we produce the result with a join?) --This shows the employees who have worked with alfki. SELECT DISTINCT Employees.EmployeeID, FirstName, LastName FROM Employees, Orders WHERE Employees.EmployeeID=Orders.EmployeeID AND Orders.CustomerID = 'ALFKI' /* For me an "obvious" first step is to look at the relationship Employees ||---< Orders and conclude that if we want to find who hasn't worked with ALFKI, just look for cases where the PKs and FKs don't match. Right? But what will we get if we try that? Change the "=" to "<>" in the WHERE clause. */ SELECT DISTINCT Employees.EmployeeID, FirstName, LastName FROM Employees, Orders WHERE Employees.EmployeeID<>Orders.EmployeeID AND Orders.CustomerID = 'ALFKI' /* We get all the employees. That's because every employee has an order with a customer other than ALFKI. If there were an employee who has only worked with ALFKI and no other customer, then that employee would not show in the list... Soo.... what to do? Tell the staff department to figure it out themselves, you're too busy? Let's try a different approach first. What we want is a list of employees right? So let' start there. */ --This will likely be our outer query. SELECT Employees.EmployeeID, FirstName, LastName FROM Employees /* If we want to know who has worked with ALFKI, we can just look in the Orders table */ SELECT EmployeeID FROM Orders WHERE Orders.CustomerID = 'ALFKI' /* It looks like we have five employees to choose from. But let's apply our previous approach */ SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE EmployeeID IN(Subquery) /* This will give us the list of employees who "have" worked with ALFKI. */ SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE EmployeeID IN(SELECT EmployeeID FROM Orders WHERE Orders.CustomerID = 'ALFKI') /* Now to find those who haven't, negate the IN() clause! */ SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE EmployeeID NOT IN(SELECT EmployeeID FROM Orders WHERE Orders.CustomerID = 'ALFKI') /* But wait, the German customer doesn't want to work with an American employee... */ SELECT Employees.EmployeeID, FirstName, LastName, * FROM Employees --Filter by Country SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE Employees.Country <> 'USA' AND EmployeeID NOT IN(SELECT EmployeeID FROM Orders WHERE Orders.CustomerID = 'ALFKI') /* Will it make a difference for performance if we put a DISTINCT in the subquery? Shannon Clemmons */ SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE Employees.Country <> 'USA' AND EmployeeID NOT IN(SELECT DISTINCT EmployeeID FROM Orders WHERE Orders.CustomerID = 'ALFKI') /* So, there's an example where we couldn't solve the problem with a JOIN. We can re-write the query to be correlated */ --Correlated using EXISTS SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE Employees.Country <> 'USA' AND NOT EXISTS(SELECT EmployeeID FROM Orders WHERE Employees.EmployeeID=Orders.EmployeeID AND Orders.CustomerID = 'ALFKI') SELECT Employees.EmployeeID, FirstName, LastName FROM Employees WHERE Employees.Country <> 'USA' AND EmployeeID NOT IN(SELECT EmployeeID FROM Orders WHERE Employees.EmployeeID=Orders.EmployeeID AND Orders.CustomerID = 'ALFKI') /* Is there a performance difference? The NOT EXISTS might be a little faster. */ /* Now for an example of a single value subquery in the WHERE clause Every year Northwind constructs a performance index for all of it products. One dimension of this index is "Gross Product Orders" or GPO (I'm making this up...). Products are classified as either above or below average on this measure. GPO is computed as GPO=UnitsOnOrder*UnitPrice. Discontinued products are not included in the calculation of GPO. List the ProductID, ProductName, and GPO of all products that have a GPO higher than the average GPO for all products. Exclude discontinued products. Order the output by GPO. */ --Inspect Products table SELECT * FROM Products /* It seems likely that the Products table will be our outer query. So let's start there. */ SELECT ProductID, ProductName, ('to be computed') AS GPO FROM Products --Compute GPO SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products --Remove discontinue products SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products WHERE Discontinued = 0 ORDER BY GPO /* We only want to list products that are above the average GPO for all products. I'll just put in an arbitrary value of 50 to test the outer query. */ SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products WHERE Discontinued = 0 AND UnitsOnOrder*UnitPrice > 50 --AverageGPO --Seems to work! --Compute Avg(GPO) SELECT AVG(UnitsOnOrder*UnitPrice) FROM Products --Test by hardcoding the value SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products WHERE Discontinued = 0 AND UnitsOnOrder*UnitPrice > 186.7207 --You may have a different value here --Replace hardcoded value with subquery SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products WHERE Discontinued = 0 AND UnitsOnOrder*UnitPrice > (SELECT AVG(UnitsOnOrder*UnitPrice) FROM Products) --Order the output SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO FROM Products WHERE Discontinued = 0 AND UnitsOnOrder*UnitPrice > (SELECT AVG(UnitsOnOrder*UnitPrice) FROM Products) ORDER BY GPO /* The sales folks loved the report but they wondered how difficult it would be to add an attribute that shows the value above or below the average for each product? Not difficult at all. We move our subquery to the SELECT statement and order by the beta value. */ SELECT ProductID, ProductName, UnitsOnOrder*UnitPrice AS GPO, (UnitsOnOrder*UnitPrice)-(SELECT AVG(UnitsOnOrder*UnitPrice) FROM Products) AS Beta FROM Products WHERE Discontinued = 0 ORDER BY Beta /* Marketing is kicking off an end-of-the-year cross-selling campaign. They want to send promotional flyers to customers who have ordered certain combinations of products on the same order, for example, blue cheese and snails. List CustomerID, CompanyName, and Country for all customers who have placed an order for both ProductID=31 and ProductID=58 on the same order. Hmmm.... how to begin? BTW, we'll see other uses for this type of query when doing bottom-up normalization (functional dependancies and keys) with legacy databases or flat files later on. */ /* I could think of at least two ways to approach this: 1. Look for COUNT(*)s greater than 1 in the [Order Details] table 2. Write two seperate queries against [Order Details] and do a self-join. There are probably other approaches... Would a simple JOIN work? */ SELECT * FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID = 31 AND ProductID =58 --Hmmm... that didn't seem to work. SELECT * FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID = 31 OR ProductID =58 SELECT * FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID IN (31,58) ORDER BY Orders.OrderID /* The above gave us all Orders that contained either 31 or 58 but not on the same order. Well, except for orders 10374 and 10537. Is there a way to get those rows using GROUP BY? Yes. */ SELECT OrderDetails.OrderID, COUNT(*) AS OrderIDcount FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID IN (31,58) GROUP BY OrderDetails.OrderID ORDER BY OrderDetails.OrderID /* And as we can see, the orders 10374 and 10537 have a count greater than 1. So.... Add a HAVING clause to filter out GROUPS */ SELECT OrderDetails.OrderID, COUNT(*) AS OrderIDcount FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID IN (31,58) GROUP BY OrderDetails.OrderID HAVING COUNT(*) > 1 ORDER BY OrderDetails.OrderID /* And actually, we don't need the Orders table at all. So we can remove the JOIN and write it more simply as: */ SELECT OrderID FROM [Order Details] WHERE ProductID IN(31, 58) GROUP BY OrderID HAVING COUNT(*) > 1 ORDER BY OrderID /* We will use this exact technique in the normalization section to look for functional dependencies. BTW, if we want attributes from the Orders Table, we can put the OrderID enumerated list into a IN() clause */ SELECT OrderID, CustomerID, EmployeeID, OrderDate --and so on... FROM Orders WHERE OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID IN(31, 58) GROUP BY OrderID HAVING COUNT(*) > 1) /* Another approach would be to run our first attempt twice, once for each condition and then do a self-join. This obviously isn't as generalizable as the GROUP BY/HAVING solution, but it is worth examining. */ --Or first query SELECT * FROM Orders, [Order Details] AS OrderDetails WHERE Orders.OrderID=OrderDetails.OrderId AND ProductID = 31 AND ProductID =58 /* These two queries produce a list of OrderIDs for orders that contain 31 and 58 respectively. To get the Intersection, do a JOIN. */ SELECT OrderID FROM [Order Details] WHERE PRoductID = 31 INTERSECT SELECT OrderID FROM [Order Details] WHERE PRoductID = 58 --Like so... SELECT * FROM (subquery) AS BlueCheese, (subquery) AS Snails WHERE BlueCheese.OrderID=Snails.OrderID SELECT * FROM (SELECT OrderID FROM [Order Details] WHERE PRoductID = 31) AS BlueCheese, (SELECT OrderID FROM [Order Details] WHERE PRoductID = 58) AS Snails WHERE BlueCheese.OrderID=Snails.OrderID SELECT OrderID FROM [Order Details] WHERE PRoductID = 31 INTERSECT SELECT OrderID FROM [Order Details] WHERE PRoductID = 58 --And then join to the Orders table to get the non-aggregate values SELECT Orders.OrderID, Orders.CustomerID, Orders.EmployeeID, OrderDate --and so on... FROM (SELECT OrderID FROM [Order Details] WHERE PRoductID = 31) AS BlueCheese, (SELECT OrderID FROM [Order Details] WHERE PRoductID = 58) AS Snails, Orders WHERE BlueCheese.OrderID=Snails.OrderID AND BlueCheese.OrderID=Orders.OrderID --And performance? SELECT OrderID, CustomerID, EmployeeID, OrderDate --and so on... FROM Orders WHERE OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID IN(31, 58) GROUP BY OrderID HAVING COUNT(*) > 1) SELECT Orders.OrderID, Orders.CustomerID, Orders.EmployeeID, OrderDate FROM Orders, [Order Details] od1, [Order Details] od2 WHERE Orders.OrderID=od1.OrderId AND od1.ProductId = 31 AND od2.ProductId = 58 AND od1.orderId = od2.orderId ORDER BY Orders.OrderID /* Contributed by Doug Williams Fall 2007 In the above query, Doug aliasied the Order Details table and did a self join */ /* The Uncorrelated will be easier to extend to an arbitrary number of productIDs. We could write a subquery for the IN(31, 58) subquery for example. */ /* Writing JOINs as nested subqueries SELECT * FROM table1, table2, table3 WHERE table1.Pk=table2.Fk AND table3.Pk=table2.Fk SELECT * FROM table1 INNER JOIN Table2 ON table1.Pk=table2.Fk INNER JOIN table3 ON table3.Pk=table2.Fk SELECT * FROM table1 WHERE table1.Pk IN(SELECT table2.FK FROM table2 WHERE table2.PK IN(SELECT table3.Fk FROM table3 WHERE condition) */ /* List the company names for companies who have ordered 'Chai' */ SELECT DISTINCT CompanyName FROM Customers, Orders, [Order Details], Products WHERE Customers.CustomerID=Orders.CustomerID AND Orders.OrderID=[Order Details].OrderID AND [Order Details].ProductID = Products.ProductID AND ProductName = 'Chai' /* We need DISTINCT because some of the customers ordered chai more than once. However, we could write this as a series of nested subqueries. In order to print a list of companynames, we will need to "follow" the PK=FK "path" through the database design. Orders.CustomerID, [Order Details].OrderID,Products.ProductID, ProductName='Chai' */ --Inner most query first SELECT ProductID FROM Products WHERE ProductName = 'Chai' --So, the productID for Chai is 1 --List all the [Order Details].OrderID FKs for ProductID = 1 SELECT OrderID FROM [Order Details] WHERE ProductID = 1 --Plug in the subquery SELECT OrderID FROM [Order Details] WHERE ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai') --next retreive the list of Orders.Customer FKs SELECT CustomerID FROM Orders WHERE OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai')) --Now, once more to retrieve CompanyName using the list of Customer FKs SELECT CompanyName FROM Customers WHERE CustomerID IN (SELECT CustomerID FROM Orders WHERE OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai'))) --How many comparisons as a JOIN? SELECT CAST((SELECT COUNT(*) FROM Products) AS BIGINT) * CAST((SELECT COUNT(*) FROM [Order Details])AS BIGINT) * CAST((SELECT COUNT(*) FROM Orders) AS BIGINT) * CAST((SELECT COUNT(*) FROM Customers) AS BIGINT) AS JoinComparisons --12,533,070,550 --Of course, the optimizer didn't do a natural join... --How many as a subquery? /* 77 + (1*2155) = 2232 [scan Products and join 1 row to OrderDetails] 38 * 830 = 31540 [Join result from first two subqueries to Orders table] 38 * 91 = 3458 [Join result with Customers table] Total comparisons: 2232 + 31540 + 3458 = 37,230 */ SELECT CAST((SELECT COUNT(*) FROM Products) AS BIGINT) * CAST((SELECT COUNT(*) FROM [Order Details])AS BIGINT) * CAST((SELECT COUNT(*) FROM Orders) AS BIGINT) * CAST((SELECT COUNT(*) FROM Customers) AS BIGINT) AS JoinComparisons SELECT CompanyName FROM Customers WHERE CustomerID IN (SELECT CustomerID FROM Orders WHERE OrderID IN(SELECT OrderID FROM [Order Details] WHERE ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai'))) SELECT DISTINCT CompanyName FROM Customers, Orders, [Order Details], Products WHERE Customers.CustomerID=Orders.CustomerID AND Orders.OrderID=[Order Details].OrderID AND [Order Details].ProductID = Products.ProductID AND ProductName = 'Chai' --Notice, in the subquery, we didn't need to use DISTINCT --JOIN solution SELECT DISTINCT CompanyName FROM Customers, Orders, [Order Details] WHERE Customers.CustomerID=Orders.CustomerID AND Orders.OrderID=[Order Details].OrderID AND ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai') --We could write it as a Correlated subquery SELECT CompanyName FROM Customers WHERE EXISTS ( SELECT * --doesn't matter what we put for the attribute FROM Orders WHERE Customers.CustomerID=Orders.CustomerID AND EXISTS ( SELECT * FROM [Order Details] WHERE Orders.OrderID=[Order Details].OrderID AND ProductID = (SELECT ProductID FROM Products WHERE ProductName = 'Chai'))) /* A drawback to the subquery solution is that we cann't list attributes from the inner query. For example, the following query won't work. */ SELECT CompanyName, OrderDate FROM Customers WHERE CustomerID IN ( --pk SELECT CustomerID --fk FROM Orders WHERE OrderID IN(--pk SELECT OrderID --fk FROM [Order Details] WHERE ProductID = 1))