Hello, Folks,We have a main table like this, with a little more than 200 million rows:CREATE TABLE dbo.MainTable( DateHourStamp DATETIME NOT NULL, RecordIdentifier NCHAR(17) NOT NULL, LatestStatusDateTime DATETIME NULL, LatestStatusValue NCHAR(2) NULL, ...)ALTER TABLE dbo.MainTableADD CONSTRAINT PK_MainTablePRIMARY KEY CLUSTERED ( DateHourStamp ASC, RecordIdentifier ASC)CREATE UNIQUE NONCLUSTERED INDEX IX_MainTable_RecordIdentifierON dbo.MainTable( RecordIdentifier ASC)
We have a status table like this, with a little more than 2 million rows:CREATE TABLE dbo.StatusTable( RecordIdentifier NCHAR(17) NOT NULL, StatusDateTime DATETIME NOT NULL, StatusRank TINYINT NOT NULL, StatusValue NCHAR(2) NOT NULL)ALTER TABLE dbo.StatusTableADD CONSTRAINT PK_StatusTablePRIMARY KEY CLUSTERED( RecordIdentifier ASC, StatusDateTime DESC, StatusRank DESC)
The main table is populated once an hour with one row per record identifier, and the distribution of record identifiers is pretty even and mostly ever increasing.The status table is populated once an hour (but later than the main table) with multiple rows per record identifier, and the distribution of record identifiers could be pretty random.We use the following statement to update the main table with the latest status for a record identifier:WITH NewStatuses AS( SELECT RecordIdentifier, StatusDateTime, StatusValue, ROW_NUMBER() OVER ( PARTITION BY RecordIdentifier ORDER BY StatusDateTime DESC, StatusRank DESC ) AS RowNumber FROM dbo.StatusTable)UPDATE dbo.MainTableSET LatestStatusDateTime = NewStatuses.StatusDateTime, LatestStatusValue = NewStatuses.StatusValueFROM dbo.MainTableINNER JOIN NewStatuses ON (MainTable.RecordIdentifier = NewStatuses.RecordIdentifier)WHERE NewStatuses.RowNumber = 1;
MainTable and StatusTable have a one-to-many relationship. The ROW_NUMBER() function allows us to filter the StatusTable to make a pseudo one-to-one relationship.Here is the problem. Even though we make that pseudo one-to-one relationship, SQL Server still treats it like a one-to-many, and throws a Hash Match Aggregate in the plan to eliminate possible duplicates before the update.The number of rows input to the Hash Match Aggregate is always equal to the number of rows output. And because the StatusTable and Hash create a pretty random mess, the actual update to the table causes a ton of random disk access.We tried adding OPTION (ORDER GROUP) to the end, but that just changed the Hash Match Aggregate into a Sort Distinct. Granted, the random disk access was reduced. The hash takes about 90 minutes, the sort takes about 45 minutes.My question is, how can we eliminate the Hash Match Aggregate or Sort Distinct all together, when we know that the number of rows input to the operator is always equal to the number of rows output? Thanks.