 Clustered non-unique index ?


PhilipLewis
Starting Member

30 Posts

Posted - 2015-01-29 : 17:50:33
Here is a question, for which I solicit input.

I have a table requirement, where a large number of processes will be inserting large numbers of rows into a single table concurrently.

While the data has a natural unique key value, it is irrelevant for the task at hand, which is to write as many rows as possible in the shortest amount of time - high concurrency.

Every row has an integer representation of time (TK) as its retrieval access key (e.g. 567812345).

Rows are always retrieved from the table for TK values >= some integer (and some time later deleted for TK < some integer value).
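
To illustrate the access pattern (table and column names here are just placeholders), the readers and the purge process do something like:

DECLARE @FromTK int = 550700000, @PurgeTK int = 550600000;

-- Readers pull every row at or after a given time key
SELECT *
FROM [T]
WHERE [TK] >= @FromTK;

-- A housekeeping process later purges rows older than some time key
DELETE FROM [T]
WHERE [TK] < @PurgeTK;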

For application dependent reasons, TK is not the actual time of insertion, but of some other previous event in the process. Consequently, while rows arrive from each individual writer process in ascending TK sequence, seen across all writers together the rows will not be inserted into the table in ascending order of TK.

The granularity of the integer time value is seconds, so potentially many rows will have the same value.

My initial plan was to simply use an integer non-unique clustered index, and let SQL Server apply its 4-byte "uniqueness" attribute behind the scenes.

Is this the best approach for high write concurrency and low locking overhead? Am I missing anything obvious?

It may be possible to increase the granularity of TK by changing its data type to numeric(n,n), e.g. 550784850.0267439, but this would not eliminate duplicates, nor would it alter the out-of-order insert characteristic. Range retrieval would be unaltered in this scenario.
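
As a sketch of what I have in mind (table, column and index names are placeholders):

CREATE TABLE [T]
(
    [TK] integer NOT NULL,        -- second-granularity time key
    [Payload] varchar(200) NULL   -- the other staging columns
);

-- Non-unique clustered index on TK; SQL Server adds its own hidden uniquifier to duplicates
CREATE CLUSTERED INDEX [IX_T_TK] ON [T] ([TK] ASC);

-- Possible variant with finer granularity, still non-unique:
-- [TK] numeric(16,7) NOT NULL    -- e.g. 550784850.0267439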


Feel free to offer any insights based on this information.

nagino
Yak Posting Veteran

75 Posts

Posted - 2015-01-29 : 21:17:50
Have you thought about using the IDENTITY property?
You can get the inserted value via the OUTPUT clause or the @@IDENTITY function.
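
For example (assuming the table has a bigint IDENTITY column, here called [InsertID]):

DECLARE @NewRows TABLE ([InsertID] bigint);

INSERT INTO [T] ([TK], [Payload])
OUTPUT inserted.[InsertID] INTO @NewRows
VALUES (550784850, 'some data');

SELECT [InsertID] FROM @NewRows;
-- or, for a single-row insert in the same scope: SELECT SCOPE_IDENTITY();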


-------------------------------------
From Japan
Sorry, my English ability is limited.

PhilipLewis
Starting Member

30 Posts

Posted - 2015-01-30 : 06:14:03
IDENTITY does not address the functional requirement.
The data must also be read efficiently by the reader processes, as well as inserted.

The processes further down have no clue about IDENTITIES, they just care about TK values for retrieval.

"Every row has an integer representation of time (TK) as its retrieval access key"

I was mostly concerned about the out-of-order clustered key insert sequence and its effect on write speed.

The "best practice" for clustered keys seems to be to make them unique with a monotonically ascending integer value This is not useful for functional requirement in my case.


Thanks
Philip

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-01-30 : 09:10:44
"IDENTITY does not address the functional requirement."

Nor would it impact the functional requirement. If your CI key were (TK, ID), the applications would not need to change. Though that's not much different from letting SQL do it behind the scenes by adding its own uniqueifier. You'd just be making it explicit.

I'm also a little concerned by your use of a time-based key. Does it wrap after 24 hours? That would be disastrous, I think. If not after 24 hours, at what actual date and time will it wrap?

PhilipLewis
Starting Member

30 Posts

Posted - 2015-01-30 : 09:29:48
Your concern is unfounded - time moves inexorably forward, reflected in my integer values. Some time around when the sun implodes, I will still be counting (hyperbole).

And yes, I know I could explicitly add an IDENTITY such that TK+IDENTITY=PK; however, I see no need to do so unless there is some compelling performance aspect that makes it advantageous. I neither need nor want an IDENTITY column declared.

This is an internals technical question, more related to the out-of-order insert sequence and non-unique clustered index behaviour than anything else.

Perhaps I should have posted somewhere more appropriate?

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-01-30 : 09:58:42
This is a good spot to post the question.

"I know I could explicitly add an IDENTITY such that TK+IDENTITY=PK, however I see no need to do so"

Nor do I; I was simply responding to your unnecessary concern that "The processes further down have no clue about IDENTITIES"

So... the answer to my question "Does it wrap after 24 hours" is "No?"

FWIW I think your scheme will never have an out of order insert sequence.

Also FWIW, your key will wrap around sometime on 2083-02-17, if the starting point is today, I believe. I expect we will both be long gone by then, unless Ray Kurzweil is correct and we can transfer our consciousness to androids by 2050.


PhilipLewis
Starting Member

30 Posts

Posted - 2015-01-30 : 11:01:19
" unnecessary concern .."

It wasn't a "concern"; it was just an observation that an IDENTITY cannot serve any useful purpose in my application.

There will be out-of-order inserts, because there are 100s of processes, each of which takes a different amount of time between when it calculates the TK and when it attempts to insert a row. Also, an insert can fail and be retried; the TK will not have changed, and other processes will have written to the table in the meantime. And so on ...

Your calculation is quite a way out, actually. The integer value represents (day number * 100000) plus a value between 0 and 86400. The granularity is seconds, but it is not a count of seconds - a subtle but important difference. The current day number is 5507, FWIW.

Thanks for your input, and feel free to update your calculation :)

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-01-30 : 11:14:23
Aha! Didn't know about the day number component. Makes sense now (not that 2083 would be a problem -- at least not for you and me!)

So, you will indeed have out-of-order inserts. That's problematic, since it can (and probably will) lead to page splits and fragmentation. What about using an IDENTITY column for the CI key, thus eliminating potential splits, and adding an NCI on (time, id)? Yes, you'll have the maintenance cost of the NCI, but it will be much smaller than the CI (since it carries no other data columns) and will split much less frequently.
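
Something like this sketch (names are placeholders):

CREATE TABLE [T]
(
    [InsertID] bigint IDENTITY(1,1) NOT NULL,
    [TK] integer NOT NULL,
    [Payload] varchar(200) NULL,
    CONSTRAINT [PK_T] PRIMARY KEY CLUSTERED ([InsertID])   -- ever-increasing key: inserts always go at the end
);

-- Narrow nonclustered index to support the TK range reads and deletes
CREATE NONCLUSTERED INDEX [IX_T_TK] ON [T] ([TK], [InsertID]);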

ScottPletcher
Aged Yak Warrior

550 Posts

Posted - 2015-01-30 : 11:40:06
quote:
My initial plan was to simply use an integer non-unique clustered index, and let SQL Server apply its 4-byte "uniqueness" attribute behind the scenes.

Is this the best approach for high write concurrency and low locking overhead? Am I missing anything obvious?

It really depends on how many duplicates you expect to have. If it could often be more than a page's worth, I'd use an identity as the second key column to uniqueify the key yourself. If we walk through SQL's INSERT process, to uniqueify a row it will need to find the highest duplicate counter currently in use for the main key value. Naturally, the more pages it has to search to find that value, the worse it is.

But you are definitely right to keep the time value as the lead clustering key; under no circumstances do you want to just use an identity as the lead clustering key on this table.

PhilipLewis
Starting Member

30 Posts

Posted - 2015-01-30 : 13:07:58
"It really depends on how many duplicates you expect to have."

With a 1 second granularity, 100s.

If I change the TK to a decimal value with a nanosecond granularity, then the duplicate count will be pretty low.

These are just staging tables, the data doesn't persist any longer than it takes for the rows to be subsequently processed and sent onward in the system. They are being deleted at very regular intervals.

I have no idea whether a clustered index is positively or negatively affected by changing the data type from integer to numeric(16,7). Nor do I know exactly how the internals of clustered index building work. My assumption was that it simply tacked a bigint identity to every row and pretended it was unique (all behind the scenes). Your description is somewhat different.


For background, this is already a well functioning system with 4000+ staging tables and 10,000+ insert processes peaking at 6000+ rows/second total inserts. The ratio of writers to tables varies from 1 to 1000+.

I am chasing milliseconds because my numbers are big, and it is very hard to measure this stuff.
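
For what it's worth, one thing I suppose I can at least watch is fragmentation and page-latch contention on a given staging table, roughly along these lines (names are placeholders):

-- Leaf-level fragmentation of the clustered index (index_id = 1)
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.T'), 1, NULL, 'LIMITED');

-- Cumulative insert and page-latch activity on the clustered index
SELECT leaf_insert_count, leaf_allocation_count,
       page_latch_wait_count, page_latch_wait_in_ms
FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.T'), 1, NULL);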

GB: The current setup has a completely pointless IDENTITY PK and a regular non-unique index on TK plus a couple of other redundant columns. I am looking for improved insert performance by essentially conflating the two indices into a single non-unique clustered index.

Will it work???

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-01-30 : 13:50:46
I second Scott's solution. Yes, it will work.

ScottPletcher
Aged Yak Warrior

550 Posts

Posted - 2015-01-30 : 16:42:26
A one-second granularity should be fine; 100s of dups will not be an issue, since they will still be clustered together. Just go ahead and add the identity column to the end of the clustered index to ensure uniqueness. Something like this:

CREATE UNIQUE CLUSTERED INDEX <index_name>
ON <table_name> ( time_key, identity_value )
WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) --or OFF, as you prefer
ON [<filegroup_name>]


graz
Chief SQLTeam Crack Dealer

4149 Posts

Posted - 2015-01-30 : 17:07:31
Just a few questions...

1. What version of SQL Server are you using?
2. How many rows per second are you thinking of inserting? Any guess on the max value per second?
3. How long will the data stay in table before being deleted?

=================================================
Creating tomorrow's legacy systems today. One crisis at a time.

PhilipLewis
Starting Member

30 Posts

Posted - 2015-02-01 : 06:45:05
graz:

1: 2008R2
2: I gave some application values previously, possibly 1000s per second per table.
3: The data has time value, so it is processed endlessly. Data rarely stays in the staging table for more than 30 minutes.

I am expecting the rate of data inserted across all tables to increase to 10,000 per second; for any specific table I do not know (I didn't bother to count the current peaks by table).

PhilipLewis
Starting Member

30 Posts

Posted - 2015-02-01 : 07:09:43
Really, there are two options, and it rather depends on how SQL Server manages non-unique clustered indices internally, something about which I am uncertain. There is the secondary issue of whether the index management for clustered indices is affected by which numeric data type is used, and the effect of changing TK from integer to numeric(16,7).

-- [TK] integer not null
-- [InsertID] bigint IDENTITY (1,1) not null
ALTER TABLE [T] ADD CONSTRAINT [PK] PRIMARY KEY CLUSTERED ([TK] ASC, [InsertID] ASC) ON [PRIMARY]

versus

CREATE CLUSTERED INDEX [idx] ON [T] ([TK] ASC) ON [PRIMARY]

In general the duplicate count and incidence will be low, but it could be a couple of thousand on any specific table.


This is pretty much it, and I guess that the second option should be no worse than the first. Further, all things being equal, the data type should not make much difference other than storage footprint.

ScottPletcher
Aged Yak Warrior

550 Posts

Posted - 2015-02-02 : 10:21:31
If a clustered key value is not unique, SQL will generate a "uniquifier" (as SQL calls it), which is a sequential counter for the duplicate key values. It's effectively an identity for that one key.

So you insert 'A'. Then you insert 'A' again -- SQL now adds 1 to the first 'A', and 2 to the second. You insert another 'A', it becomes 3, and so on.

The kicker is that I think SQL's pointer is always to the first 'A', so if the list gets long, SQL may have to read several pages to figure out what the next dup key value counter is.
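
If you want to see it for yourself, the undocumented DBCC IND / DBCC PAGE commands will show a UNIQUIFIER value on the duplicate rows -- roughly like this (database, table and page numbers are placeholders):

DBCC IND ('MyDb', 'T', 1);         -- list the pages of the clustered index (index_id = 1)
DBCC TRACEON (3604);               -- send DBCC PAGE output to the client
DBCC PAGE ('MyDb', 1, 12345, 3);   -- file 1, a data page id taken from DBCC IND, dump style 3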

PhilipLewis
Starting Member

30 Posts

Posted - 2015-02-02 : 12:45:57
Scott. Yes, I read some stuff yesterday that mentioned the "uniquifier". This is (I think) effectively a row id on the elements of the duplicate index node. I haven't looked at the internals of clustered indices, and I am trying to avoid the need. I assume it does things in the classic indexing manner - more or less.

It is probably true (but essentially unknowable) that a decimal TK to 7 decimal places would be for all practical purposes unique, and that I could manage the trivial number of duplicate insert fails by incrementing TK by .0000001 until I got success. This would lower the structure footprint (no IDENTITY column), so that would be an alternate solution. The absolute sequence is not that important for the application. But I don't like to guess about these things.
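
Roughly like this, as a sketch (assuming a unique clustered index on a numeric(16,7) [TK]; names are placeholders):

DECLARE @TK numeric(16,7) = 550784850.0267439;
DECLARE @Done bit = 0;
DECLARE @Msg nvarchar(2048);

WHILE @Done = 0
BEGIN
    BEGIN TRY
        INSERT INTO [T] ([TK], [Payload]) VALUES (@TK, 'some data');
        SET @Done = 1;
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() IN (2601, 2627)       -- duplicate key: bump TK and retry
            SET @TK = @TK + 0.0000001;
        ELSE
        BEGIN
            SET @Msg = ERROR_MESSAGE();         -- anything else: re-raise and stop
            RAISERROR (@Msg, 16, 1);
            SET @Done = 1;
        END
    END CATCH
END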

ScottPletcher
Aged Yak Warrior

550 Posts

Posted - 2015-02-02 : 12:57:23
quote:
Originally posted by PhilipLewis

Scott. Yes, I read some stuff yesterday that mentioned the "uniquifier". This is (I think) effectively a row id on the elements of the duplicate index node. I haven't looked at the internals of clustered indices, and I am trying to avoid the need. I assume it does things in the classic indexing manner - more or less.

It is probably true (but essentially unknowable) that a decimal TK to 7 decimal places would be for all practical purposes unique, and that I could manage the trivial number of duplicate insert fails by incrementing TK by .0000001 until I got success. This would lower the structure footprint (no IDENTITY column), so that would be an alternate solution. The absolute sequence is not that important for the application. But I don't like to guess about these things.



Grr, that's an ugly idea. Adding an identity to the table or allowing SQL to gen its own uniquifier would be far better.

PhilipLewis
Starting Member

30 Posts

Posted - 2015-02-02 : 23:19:54
Nah, it's genius!

Those 7 decimals are fractions of a single second. I have 10 million options every second, and insertion rates of a couple of thousand maximum for any given table - let's say 2500/sec. The odds of a collision are 2.5:10000, and I get uniqueness, which clustered indices "like". It is like 2 extra lines of code compared to what is already in place; I already detect insert failure and retry etc. etc. - the code base for this is already in place.

I think I will give it a go :D
   
