Arnold Fribble
Yak-finder General
1961 Posts
Posted - 2006-01-28 : 17:41:52
Does anyone have any favorite ways of finding distinct (smallish) sets in a large(ish) table?

Here's what I've got:

-- A table of about 10 million samples:
CREATE TABLE Samples (
  sampleID int PRIMARY KEY,
  sampleDate datetime NOT NULL,
  sampleAnalyzed datetime NOT NULL
  -- etc.
)

-- A table of about 44 million results of tests performed on those samples:
CREATE TABLE SampleResults (
  sampleID int NOT NULL REFERENCES Samples,
  testID smallint NOT NULL,
  testValue numeric(9, 3) NOT NULL,
  PRIMARY KEY (sampleID, testID)
)

So there's a mean of 4.4 tests per sample, but the distribution of tests per sample is highly skewed: a geometric distribution would be roughly right. Let's say no more than 25 tests are done on any sample. There are about 650 possible testID values, though again, there are popular and unpopular tests. The testIDs themselves can be considered arbitrary.

What I want to do is find all the distinct sets of tests done on a sample, and how many samples have that set. So a really bad (but quite funny) way to do this would be to find all the pairs of samples where the square of the number of matching tests is the same as the product of the number of tests for both samples, then group by the larger sampleID to get the minimum sampleID (which can act as a setID) for each set, and finally group again to get the number of occurrences of each set.

Like this:

SELECT setID, COUNT(*) AS sampleCount
FROM (
  SELECT MIN(s1) AS setID, s2
  FROM (
    SELECT R1.sampleID AS s1, R2.sampleID AS s2,
      COUNT(*) AS ct,
      COUNT(CASE WHEN R1.testID = R2.testID THEN 1 END) AS ct_match
    FROM SampleResults AS R1
    INNER JOIN SampleResults AS R2
      ON R1.sampleID <= R2.sampleID
    GROUP BY R1.sampleID, R2.sampleID
  ) AS A
  WHERE ct = ct_match * ct_match
  GROUP BY s2
) AS A
GROUP BY setID
ORDER BY setID

(Actually what I really want to do is take that result and draw a graph of all the subset-superset relations (not including transitive links).)

I have to admit that I'm struggling with impure thoughts about writing a CLR aggregate function to union bitmaps of the test values...
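For what it's worth, the "square of matches equals product of sizes" trick in the silly query really is an equality test for non-empty sets: since |A∩B| ≤ min(|A|, |B|), we can only have |A∩B|² = |A|·|B| when A = B. A quick sanity check of that predicate, sketched in Python rather than T-SQL (the function name is mine, purely for illustration):

```python
from itertools import product

def sets_equal_by_counts(a, b):
    # The silly query's test: |A ∩ B|^2 == |A| * |B|.
    # Valid only for non-empty sets (an empty set on either side
    # would make both sides zero against any other set).
    return len(a & b) ** 2 == len(a) * len(b)

# Check the predicate agrees with real set equality on some test sets.
sets = [frozenset(s) for s in ({10, 20, 30}, {10, 20, 30, 40}, {10, 30}, {40})]
for a, b in product(sets, repeat=2):
    assert sets_equal_by_counts(a, b) == (a == b)
```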
Arnold Fribble
Yak-finder General
1961 Posts
Posted - 2006-01-30 : 06:16:58
What's that you say? Where's the test data? Yeah, it's not easy to come up with something that gets the distribution of testID values right: in the real data there's clearly some clustering of related tests, but it's tricky to fake. The consequence is that this test data has fewer distinct testID values, and probably produces far more sets.

For the sake of simplicity, I've lost the Samples table and ignored the testValue column of SampleResults:

CREATE TABLE SampleResults (
  sampleID int NOT NULL,
  testID smallint NOT NULL,
  PRIMARY KEY (sampleID, testID)
)
GO
-- prime the random number generator
DECLARE @dummy float
SET @dummy = RAND(1)

DECLARE @endp float, @sampleID int, @testID smallint
SELECT @endp = 0.5, @sampleID = 1, @testID = 0
SET NOCOUNT ON
BEGIN TRANSACTION
WHILE @sampleID <= 1000000
BEGIN
  SET @testID = @testID + 1 + CAST(FLOOR(RAND()*RAND()*10.0) AS int)
  INSERT INTO SampleResults (sampleID, testID) VALUES (@sampleID, @testID)
  IF RAND() < @endp
  BEGIN
    SET @testID = 0
    SET @sampleID = @sampleID + 1
  END
END
COMMIT TRANSACTION
SET NOCOUNT OFF
GO

SELECT COUNT(*)
FROM SampleResults

SELECT testsPerSample, COUNT(*) AS sampleCount
FROM (
  SELECT COUNT(*) AS testsPerSample
  FROM SampleResults
  GROUP BY sampleID
) AS A
GROUP BY testsPerSample
ORDER BY testsPerSample

SELECT testID, COUNT(*)
FROM SampleResults
GROUP BY testID
ORDER BY testID

So that gives 1000000 samples, with a mean of 2 tests per sample (actual total 1998197* rows), a nice geometric distribution of tests per sample, with a maximum of 21* tests.
We have testID values from 1 to 96* with 75* of them used, and something fairly close to a geometric distribution of testID usage.

* These are the values I get: if SQL Server doesn't give the same pseudorandom sequence from a given seed, they won't necessarily match exactly.

Here's a fairly standard 'two sets are equal if the number of matches is the same as the number of members in each set' approach:

SELECT setID, COUNT(*) AS sampleCount
FROM (
  SELECT s2, MIN(s1) AS setID
  FROM (
    SELECT S1.sampleID AS s1, S2.sampleID AS s2
    FROM SampleResults AS S1
    INNER JOIN SampleResults AS S2
      ON S1.sampleID <= S2.sampleID
      AND S1.testID = S2.testID
    GROUP BY S1.sampleID, S2.sampleID
    HAVING COUNT(*) = (SELECT COUNT(*) FROM SampleResults AS Sa
                       WHERE Sa.sampleID = S1.sampleID)
       AND COUNT(*) = (SELECT COUNT(*) FROM SampleResults AS Sb
                       WHERE Sb.sampleID = S2.sampleID)
  ) AS A
  GROUP BY s2
) AS A
GROUP BY setID

Much better than the silly query in the other post: the estimated cost has dropped from 183646580 to a mere 22191956! Ideally, though, I'd like this query to finish before the heat death of the universe.
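The logic of that HAVING clause — a pair of samples matches when the number of shared tests equals the size of each sample's own test set — can be modelled in Python (the function name here is illustrative, not from the post):

```python
def same_test_set(a, b):
    # Mirrors the HAVING clause: the count of matching rows must equal
    # both samples' own row counts, which forces the two sets to be equal.
    matches = len(a & b)
    return matches == len(a) and matches == len(b)

# Equal sets pass; proper subsets and disjoint sets fail.
assert same_test_set({10, 20, 30}, {10, 20, 30})
assert not same_test_set({10, 20, 30}, {10, 20, 30, 40})
assert not same_test_set({10, 30}, {20, 40})
```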
 |
Arnold Fribble
Yak-finder General
1961 Posts
Posted - 2006-01-30 : 15:51:15
> I have to admit that I'm struggling with impure thoughts about writing a CLR aggregate function to union bitmaps of the test values...

I lost: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61031

SELECT MIN(sampleID) AS setID, COUNT(*)
FROM (
  SELECT sampleID, dbo.UnionBinary(testID) AS testSet
  FROM SampleResults
  GROUP BY sampleID
) AS A
GROUP BY testSet
ORDER BY MIN(sampleID)
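For readers without the linked CLR source to hand, the idea behind a UnionBinary-style aggregate is just a bitwise OR of one bit per testID, with MIN(sampleID) per bitmap acting as the setID. A minimal Python sketch of the same grouping (union_bitmap and the sample data are my illustration, not the actual CLR code):

```python
from collections import defaultdict

def union_bitmap(test_ids):
    # OR one bit per testID into a single integer bitmap.
    bits = 0
    for t in test_ids:
        bits |= 1 << t
    return bits

# Group samples by their test bitmap; MIN(sampleID) per group is the setID.
tests_by_sample = {1: [10, 20, 30], 4: [10, 20, 30], 6: [40], 7: [40]}
samples_by_bitmap = defaultdict(list)
for sample_id, tests in sorted(tests_by_sample.items()):
    samples_by_bitmap[union_bitmap(tests)].append(sample_id)

counts = {min(ids): len(ids) for ids in samples_by_bitmap.values()}
assert counts == {1: 2, 6: 2}
```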
 |
SamC
White Water Yakist
3467 Posts
Posted - 2006-01-31 : 08:48:15
quote: Originally posted by Arnold Fribble

What I want to do is find all the distinct sets of tests done on a sample, and how many samples have that set.
This is the problem? (I'm sure I'm missing something... but what! what!) It sounds like:

-- Distinct sets of tests done on a sample
SELECT sampleID, COUNT(DISTINCT testID) AS DistinctTestDoneOnASample
FROM SampleResults
GROUP BY sampleID

-- how many samples have that set?
SELECT testID, COUNT(DISTINCT sampleID) AS SamplesThatHaveThatSet
FROM SampleResults
GROUP BY testID
 |
Arnold Fribble
Yak-finder General
1961 Posts
Posted - 2006-01-31 : 09:27:12
Hmm, it sounds like I needed to give some simpler sample data.

INSERT INTO SampleResults (sampleID, testID)
          SELECT 1, 10
UNION ALL SELECT 1, 20
UNION ALL SELECT 1, 30
UNION ALL SELECT 2, 10
UNION ALL SELECT 2, 20
UNION ALL SELECT 2, 30
UNION ALL SELECT 2, 40
UNION ALL SELECT 3, 10
UNION ALL SELECT 3, 30
UNION ALL SELECT 4, 10
UNION ALL SELECT 4, 20
UNION ALL SELECT 4, 30
UNION ALL SELECT 5, 20
UNION ALL SELECT 5, 40
UNION ALL SELECT 6, 40
UNION ALL SELECT 7, 40

The result we want here is:

setID  sampleCount
1      2
2      1
3      1
5      1
6      2

There are 5 distinct sets of tests that have been carried out on samples. These sets can be identified by the MIN(sampleID) that they apply to. So the sets in question are (this isn't a query result):

setID  tests
1      {10,20,30}
2      {10,20,30,40}
3      {10,30}
5      {20,40}
6      {40}

If we'd wanted the sampleIDs that all have the same set of tests, we could have just used the outermost derived table of the query directly instead of wrapping it in another SELECT...GROUP BY.
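As a cross-check, that expected result can be reproduced by grouping the samples on their set of tests and taking MIN(sampleID) per group — here modelled in Python rather than SQL, purely to verify the worked example:

```python
from collections import defaultdict

# The sample data from the INSERT above, as (sampleID, testID) pairs.
sample_results = [
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20), (2, 30), (2, 40),
    (3, 10), (3, 30),
    (4, 10), (4, 20), (4, 30),
    (5, 20), (5, 40),
    (6, 40), (7, 40),
]

# Collect each sample's set of tests.
tests_by_sample = defaultdict(set)
for sample_id, test_id in sample_results:
    tests_by_sample[sample_id].add(test_id)

# Group samples by test set; MIN(sampleID) per group identifies the set.
samples_by_set = defaultdict(list)
for sample_id, tests in sorted(tests_by_sample.items()):
    samples_by_set[frozenset(tests)].append(sample_id)

result = {min(ids): len(ids) for ids in samples_by_set.values()}
assert result == {1: 2, 2: 1, 3: 1, 5: 1, 6: 2}
```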
 |
jsmith8858
Dr. Cross Join
7423 Posts
Posted - 2006-01-31 : 09:58:10
As long as the # of tests per sample is not too big, you could use a concatenation UDF to concatenate the tests into a single string. Then a GROUP BY and a MIN() gives you the answer you need. Not sure if this would be faster or slower than using a custom CLR aggregate.

For example:

create function SampleTests(@Sample int)
returns varchar(8000)
as
begin
  declare @ret varchar(8000)
  set @ret = ''
  select @ret = @ret + ', ' + convert(varchar(10), TestID)
  from SampleResults
  where SampleID = @Sample
  order by TestID
  return @ret
end

and then your result is:

select Min(SampleID), Count(*), Tests
from (
  select SampleID, dbo.SampleTests(SampleID) as Tests
  from (select distinct SampleID from SampleResults) a
) b
group by Tests

Of course, you may be able to do something much more efficient than a simple concatenation of the results together depending on the data; i.e., the binary OR you mentioned, or some sort of hash algorithm.
 |
Arnold Fribble
Yak-finder General
1961 Posts
Posted - 2006-01-31 : 16:58:09
Thanks Jeff, that's just what I needed! It's running a little faster than the CLR aggregate for the example data, and it's a lot less fussy about the size (or type, really) of the testID values.

I had worried that doing a seek inside a function for each distinct sampleID was going to destroy the performance, but as long as I remember to

SET STATISTICS IO OFF
SET STATISTICS TIME OFF

it's fine! Goodness knows what it does with all the stats from the SELECTs inside the functions.

Actually, I did have another method, even faster (on the example data, anyway) than both the CLR aggregate and your per-sample concatenation function. It still uses a bit set, but splits it up into 63-bit chunks and stores them in bigints.

SELECT MIN(sampleID) AS setID, COUNT(*) AS sampleCount
FROM (
  SELECT sampleID,
    SUM(CASE WHEN testID BETWEEN 0 AND 62
        THEN POWER(CAST(2 AS bigint), testID) ELSE 0 END) AS ts1,
    SUM(CASE WHEN testID BETWEEN 63 AND 125
        THEN POWER(CAST(2 AS bigint), testID - 63) ELSE 0 END) AS ts2
    -- extend as necessary
  FROM SampleResults
  GROUP BY sampleID
) AS A
GROUP BY ts1, ts2 -- extend as necessary
ORDER BY sampleCount DESC, setID ASC

Pretty disgusting, eh?

Though the fact that it is faster probably says something about the CLR marshalling. Either that, or my lack of prowess in writing fast C# code!
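The chunking in that query — testID t sets bit (t mod 63) of chunk (t div 63), so each chunk fits comfortably in a signed 64-bit bigint — can be sketched in Python to show that two samples land in the same group exactly when their test sets are equal (bitset_chunks is my illustration, and it assumes a non-empty list of testIDs):

```python
def bitset_chunks(test_ids, chunk_bits=63):
    # Same layout as the SQL: testID t sets bit (t % 63) of chunk (t // 63).
    # Using 63 bits per chunk keeps each value within a signed bigint.
    n_chunks = max(test_ids) // chunk_bits + 1
    chunks = [0] * n_chunks
    for t in test_ids:
        chunks[t // chunk_bits] |= 1 << (t % chunk_bits)
    return tuple(chunks)

assert bitset_chunks([0, 62]) == ((1 << 0) | (1 << 62),)   # one chunk
assert bitset_chunks([0, 63]) == (1, 1)                    # spills into ts2
assert bitset_chunks([10, 20, 30]) == bitset_chunks([30, 20, 10])
```

One caveat the SQL shares: samples whose highest testID falls in a later chunk produce longer tuples here, whereas the SQL always emits a fixed number of ts columns, so in SQL every sample compares on the same set of chunks.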
 |