Distinct sets in a large(ish) table

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

Distinct sets in a large(ish) table

Author

Topic

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-01-28 : 17:41:52

Does anyone have any favorite ways of finding distinct (smallish) sets in a large(ish) table?

Here's what I've got:


-- A table of about 10 million samples:
CREATE TABLE Samples (
  sampleID int PRIMARY KEY,
  sampleDate datetime NOT NULL,
  sampleAnalyzed datetime NOT NULL
  -- etc.
)

-- A table of about 44 million results of tests performed on those samples:
CREATE TABLE SampleResults (
  sampleID int NOT NULL REFERENCES Samples,
  testID smallint NOT NULL,
  testValue numeric(9, 3) NOT NULL,
  PRIMARY KEY (sampleID, testID)
)

So there's a mean of 4.4 tests per sample, but the distribution of tests per sample is highly skewed: a geometric distribution would be roughly right. Let's say there's no more than 25 tests done on any sample.
There are about 650 possible testID values, though again, there are popular and unpopular tests. The testIDs themselves can be considered arbitrary.

What I want to do is find all the distinct sets of tests done on a sample, and how many samples have that set. So a really bad (but quite funny

) way to do this would be find all the pairs of samples where the square of the number of matching tests is the same as the product of the number of tests for both samples, and then group by the larger sampleID to get the minimum sampleID (which can act as a setID) for each set, finally grouping again to get the number of occurrences of each set. Like this:


SELECT setID, COUNT(*) AS sampleCount
FROM (
    SELECT MIN(s1) AS setID, s2
    FROM (
        SELECT R1.sampleID AS s1, R2.sampleID AS s2, COUNT(*) AS ct,
          COUNT(CASE WHEN R1.testID = R2.testID THEN 1 END) AS ct_match
        FROM SampleResults AS R1
        INNER JOIN SampleResults AS R2 ON R1.sampleID <= R2.sampleID
        GROUP BY R1.sampleID, R2.sampleID
      ) AS A
    WHERE ct = ct_match * ct_match
    GROUP BY s2
  ) AS A
GROUP BY setID
ORDER BY setID

(Actually what I really want to do is take that result and draw a graph of all the subset-superset relations (not including transitive links).)

I have to admit that I'm struggling with impure thoughts about writing a CLR aggregate function to union bitmaps of the test values...

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-01-30 : 06:16:58

What's that you say? Where's the test data? Yeah, it's not easy to come up with something that gets the distribution of testID values right: in the real data, there's clearly there's some clustering of related tests, but it's tricky to fake. The consequence is that this test data has fewer distinct testID values, and probably produces far more sets.

For the sake of simplicity, I've lost the Samples table and ignored the testValue column of SampleResults:


CREATE TABLE SampleResults (
  sampleID int NOT NULL,
  testID smallint NOT NULL,
  PRIMARY KEY (sampleID, testID)
)
GO

-- prime the random number generator
DECLARE @dummy float
SET @dummy = RAND(1)

DECLARE @endp float, @sampleID int, @testID smallint
SELECT @endp = 0.5, @sampleID = 1, @testID = 0

SET NOCOUNT ON
BEGIN TRANSACTION

WHILE @sampleID <= 1000000
BEGIN
  SET @testID = @testID + 1 + CAST(FLOOR(RAND()*RAND()*10.0) AS int)

  INSERT INTO SampleResults (sampleID, testID) VALUES (@sampleID, @testID)
  
  IF RAND() < @endp
  BEGIN
    SET @testID = 0
    SET @sampleID = @sampleID + 1
  END
END

COMMIT TRANSACTION
SET NOCOUNT ON
GO

SELECT COUNT(*)
FROM SampleResults

SELECT testsPerSample, COUNT(*) AS sampleCount
FROM (
    SELECT COUNT(*) AS testsPerSample
    FROM SampleResults
    GROUP BY sampleID
  ) AS A
GROUP BY testsPerSample
ORDER BY testsPerSample

SELECT testID, COUNT(*)
FROM SampleResults
GROUP BY testID
ORDER BY testID

So that gives 1000000 samples, with a mean of 2 tests per sample (actual total 1998197* rows), a nice geometric distribution of tests per sample, with a maximum of 21* tests. We have testID values from 1 to 96* with 75* of them used, and something fairly close to a geometric distribution of testID usage.

* these are the values I get: if SQL Server doesn't give the same pseudorandom sequence from a given seed, they won't necessarily match exactly.

Here's a fairly standard 'two sets are equal if the number of matches is the same as the number of members in each set' approach:


SELECT setID, COUNT(*) AS sampleCount
FROM (
    SELECT s2, MIN(s1) AS setID
    FROM (
        SELECT S1.sampleID AS s1, S2.sampleID AS s2
        FROM SampleResults AS S1
        INNER JOIN SampleResults AS S2
          ON S1.sampleID <= S2.sampleID AND S1.testID = S2.testID
        GROUP BY S1.sampleID, S2.sampleID
        HAVING COUNT(*) = (SELECT COUNT(*) FROM SampleResults AS Sa
                           WHERE Sa.sampleID = S1.sampleID)
           AND COUNT(*) = (SELECT COUNT(*) FROM SampleResults AS Sb
                           WHERE Sb.sampleID = S2.sampleID)
      ) AS A
    GROUP BY s2
  ) AS A
GROUP BY setID

Much better than the silly query in the other post: the estimated cost has dropped from to 183646580 to a mere 22191956!

Ideally, though, I'd like this query to finish before the heat death of the universe.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-01-30 : 15:51:15

> I have to admit that I'm struggling with impure thoughts about writing a CLR aggregate function to union bitmaps of the test values...

I lost

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61031


SELECT MIN(sampleID) AS setID, COUNT(*)
FROM (
    SELECT sampleID, dbo.UnionBinary(testID) AS testSet
    FROM SampleResults
    GROUP BY sampleID
  ) AS A
GROUP BY testSet
ORDER BY MIN(sampleID)

SamC
White Water Yakist

3467 Posts

Posted - 2006-01-31 : 08:48:15

quote:
Originally posted by Arnold Fribble

What I want to do is find all the distinct sets of tests done on a sample, and how many samples have that set.

This is the problem? (I'm sure I'm missing something... but what! what!) It sounds like:

-- Distinct sets of tests done on a sample
SELECT sampleID, COUNT(DISTINCT testID) As DistinctTestDoneOnASample
FROM SampleResults
GROUP BY sampleID

-- how many samples have that set?
SELECT testID, COUNT(DISTINCT sampleID) As SamplesThatHaveThatSet
FROM SampleResults
GROUP BY testID

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-01-31 : 09:27:12

Hmm, it sounds like I needed to give some simpler sample data.


INSERT INTO SampleResults (sampleID, testID)
SELECT 1, 10
UNION ALL SELECT 1, 20
UNION ALL SELECT 1, 30
UNION ALL SELECT 2, 10
UNION ALL SELECT 2, 20
UNION ALL SELECT 2, 30
UNION ALL SELECT 2, 40
UNION ALL SELECT 3, 10
UNION ALL SELECT 3, 30
UNION ALL SELECT 4, 10
UNION ALL SELECT 4, 20
UNION ALL SELECT 4, 30
UNION ALL SELECT 5, 20
UNION ALL SELECT 5, 40
UNION ALL SELECT 6, 40
UNION ALL SELECT 7, 40

The result we want here is:


setID sampleCount
1     2
2     1
3     1
5     1
6     2

There are 5 distinct sets of tests that have been carried out on samples. These sets can be identified with the MIN(sampleID) that they apply to. So the sets in question are (this isn't a query result):


setID tests
1     {10,20,30}
2     {10,20,30,40}
3     {10,30}
5     {20,40}
6     {40}

If we'd wanted the sampleIDs that all have the same set of tests we could have just used the outermost derived table of the query directly instead of wrapping it in another SELECT...GROUP BY.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2006-01-31 : 09:58:10

As long as the # of samples in each set is not too big, you could use a concatenation UDF to concatenate the tests into a single string. Then a GROUP By and a MIN() gives you the answer you need. Not sure if this would be faster or slow than using a custom CLR aggregate.

For example:


create function SampleTests(@Sample int)
returns varchar(8000)
as
begin
	declare @ret varchar(8000)
	set @ret = ''
	
	select @ret = @ret + ', ' + convert(varchar(10), TestID) 
        from SampleResults
	where SampleID = @Sample
	order by TestID

	return @ret
end

and then your result is:


select Min(SampleID),Count(*), Tests
from
(
select SampleID, dbo.Sampletests(SampleID) as Tests
from
	(select distinct SampleID from Sampleresults) a
) b
group by Tests

Of course, you may be able to do something much more efficient than a simple concatenation of the results together depending on the data; i.e, the binary OR you mentioned, or some sort of hash algorithm.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-01-31 : 16:58:09

Thanks Jeff, that's just what I needed! It's running a little faster than the CLR aggregate for the example data, and it's a lot less fussy about the size (or type, really) of the testID values.

I had worried that doing a seek inside a function for each distinct sampleID was going to destroy the performace, but as long I remember to
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
it's fine! Goodness knows what it does with all the stats from the SELECTs inside the functions.

Actually I did have another method, even faster (on the example data, anyway) than both the CLR aggregate and your per-sample concatenation function. It still uses a bit set, but splits it up into 63 bit chunks and stores them in bigints.


SELECT MIN(sampleID) AS setID, COUNT(*) AS sampleCount
FROM (
    SELECT sampleID,
      SUM(CASE WHEN testID BETWEEN 0 AND 62 THEN
        POWER(CAST(2 AS bigint), testID) ELSE 0 END) AS ts1,
      SUM(CASE WHEN testID BETWEEN 63 AND 125 THEN
        POWER(CAST(2 AS bigint), testID - 63) ELSE 0 END) AS ts2
      // extend as necessary
    FROM SampleResults
    GROUP BY sampleID
  ) AS A
GROUP BY ts1, ts2 // extend as necessary
ORDER BY sampleCount DESC, setID ASC

Pretty disgusting, eh?
Though the fact that it is faster probably says something about the CLR marshalling. Either that, or my lack of prowess in writing fast C# code!

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources