Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Eliminating Hash Match Aggregate or Sort Distinct.

Author  Topic 

phlogiston
Starting Member

1 Post

Posted - 2009-07-16 : 12:02:03
Hello, Folks,

We have a main table like this, with a little more than 200 million rows:


CREATE TABLE dbo.MainTable
(
DateHourStamp DATETIME NOT NULL,
RecordIdentifier NCHAR(17) NOT NULL,
LatestStatusDateTime DATETIME NULL,
LatestStatusValue NCHAR(2) NULL,
...
)

ALTER TABLE dbo.MainTable
ADD CONSTRAINT PK_MainTable
PRIMARY KEY CLUSTERED
(
DateHourStamp ASC,
RecordIdentifier ASC
)

CREATE UNIQUE NONCLUSTERED INDEX IX_MainTable_RecordIdentifier
ON dbo.MainTable
(
RecordIdentifier ASC
)


We have a status table like this, with a little more than 2 million rows:


CREATE TABLE dbo.StatusTable
(
RecordIdentifier NCHAR(17) NOT NULL,
StatusDateTime DATETIME NOT NULL,
StatusRank TINYINT NOT NULL,
StatusValue NCHAR(2) NOT NULL
)

ALTER TABLE dbo.StatusTable
ADD CONSTRAINT PK_StatusTable
PRIMARY KEY CLUSTERED
(
RecordIdentifier ASC,
StatusDateTime DESC,
StatusRank DESC
)


The main table is populated once an hour with one row per record identifier, and the distribution of record identifiers is pretty even and mostly ever increasing.

The status table is populated once an hour (but later than the main table) with multiple rows per record identifier, and the distribution of record identifiers could be pretty random.

We use the following statement to update the main table with the latest status for a record identifier:


WITH NewStatuses AS
(
SELECT
RecordIdentifier,
StatusDateTime,
StatusValue,
ROW_NUMBER() OVER
(
PARTITION BY
RecordIdentifier
ORDER BY
StatusDateTime DESC,
StatusRank DESC
)
AS RowNumber
FROM
dbo.StatusTable
)
UPDATE
dbo.MainTable
SET
LatestStatusDateTime = NewStatuses.StatusDateTime,
LatestStatusValue = NewStatuses.StatusValue
FROM
dbo.MainTable
INNER JOIN
NewStatuses
ON (MainTable.RecordIdentifier = NewStatuses.RecordIdentifier)
WHERE
NewStatuses.RowNumber = 1;


MainTable and StatusTable have a one-to-many relationship. The ROW_NUMBER() function allows us to filter the StatusTable to make a pseudo one-to-one relationship.

Here is the problem. Even though we make that pseudo one-to-one relationship, SQL Server still treats it like a one-to-many, and throws a Hash Match Aggregate in the plan to eliminate possible duplicates before the update.

The number of rows input to the Hash Match Aggregate is always equal to the number of rows output. And because the StatusTable and Hash create a pretty random mess, the actual update to the table causes a ton of random disk access.

We tried adding OPTION (ORDER GROUP) to the end, but that just changed the Hash Match Aggregate into a Sort Distinct. Granted, the random disk access was reduced. The hash takes about 90 minutes, the sort takes about 45 minutes.

My question is, how can we eliminate the Hash Match Aggregate or Sort Distinct all together, when we know that the number of rows input to the operator is always equal to the number of rows output? Thanks.

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-07-17 : 11:41:49
lateral thinking....

either get a
"union all" clause somewhere into the "from" statement...can you put the main query inside a subquery/subtable?
or
force in some dummy conditions equality/inequality which nudge the optimiser to skip the sort,etc.
or
break the update into smaller chunks, which may minimise the problem (10x smallproblems < 1x bigsize10problem)

also your "transaction logging" on a 2m row update may be part of the problem...maybe the data is all over the place.
Go to Top of Page
   

- Advertisement -