
 beware of avg() for large datasets


jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-19 : 13:20:21
I already posted this on dbforums - skip it if you saw it there. Just a little warning so you don't spend hours debugging as I did.

This is not a question, just an FYI about something I ran into that surprised me (and took me a couple of hours to debug and track down!)

I have a pretty big dataset - about 700m rows, one column of which is a tinyint. I needed to take an average of that column, so I did this:

select avg(mytinyint) from mytable

which returned:

Msg 8115, Level 16, State 2, Line 6
Arithmetic overflow error converting expression to data type int.

It turns out that if the sum() over the column is bigger than MAX_INT = 2147483647, avg() throws this error. This surprised me a little - I guess I was assuming SQL Server would use bigint internally when computing an integer avg, but it uses int.

so you have to do this:

select avg(cast(mytinyint as bigint)) from mytable

What's annoying to me is that an average of values of type T can never be larger or smaller than the max/min allowed for type T. This means the SQL Server devs could have always used bigint internally for the sum and count, and then returned cast(sum/count as T) from avg() with no possibility of overflow (unless the sum or count exceeded the max bigint - you could cast to float in that case, I guess...)

Because of this, if your dataset is large enough you have to cast your column to bigint to make avg() work, even if the column is only 1 measly byte! Kind of a waste of space if you ask me.

Here's a little example of what I'm talking about, one that doesn't require you to import a 700m row dataset:

declare @t table (id int)
insert into @t select 1
union all select 2147483647
select avg(cast(id as bigint)) from @t -- works
select avg(id) from @t -- fails


So, be warned: this little feature could bite you one day without warning if you have a table which is slowly growing without bound and you are computing averages on a tinyint/smallint/int column in that table.
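By the way, sum() hits the same internal limit, so the cast-inside-the-aggregate trick applies there too. A quick sketch against the same (hypothetical) mytable:

-- sum() over tinyint/smallint/int also accumulates in int, so it overflows at the same point
select sum(mytinyint) from mytable -- fails once the total passes 2147483647
select sum(cast(mytinyint as bigint)) from mytable -- accumulates in bigint instead
select count_big(*) from mytable -- bigint row count, for tables past 2^31 rows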

I found this on 2005; it manifests on 2000 as well.


SqlSpec - a fast, cheap, and comprehensive data dictionary generator for
SQL Server 2000/2005 and Analysis Server 2005 - http://www.elsasoft.org

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-11-19 : 14:07:20
This can be read in Books Online:
quote:
Return Types
The return type is determined by the type of the evaluated result of expression.

Expression result                Return type
integer category                 int
decimal category (p, s)          decimal(38, s) divided by decimal(10, 0)
money and smallmoney category    money
float and real category          float



Peter Larsson
Helsingborg, Sweden

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2006-11-19 : 14:15:05
Nice pick up.

One of the things that makes a good DBA is understanding the limitations of the DBMS you use.

Imagine the horrors that mySQL DBAs go through!

DavidM

Production is just another testing cycle

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-19 : 14:27:12
Peso, I saw that in BOL too, but that's not the issue. It makes sense to return an int from avg() if you are averaging ints, or a byte if you are averaging bytes.

What I have a problem with is that internally, an int is used to accumulate the sum, which will overflow if the ints/smallints/tinyints you are averaging sum to more than MAX_INT. This issue has nothing to do with the return type, however: simple mathematics guarantees that the avg of any type T is always between MIN_T and MAX_T, as I posted above, so it would always be safe to cast back to T once the calc is finished internally.

does that make sense?


Go to Top of Page

snSQL
Master Smack Fu Yak Hacker

1837 Posts

Posted - 2006-11-19 : 19:47:08
quote:
kind of a waste of space if you ask me.


No, what would be a waste of space is if they always used bigint, as you suggest. They figured that int was big enough for most cases and that you can cast to bigint when you work with huge totals. It has to sum all your numbers before it can calculate the average, so it doesn't matter that each number is only one byte - the sum is huge, so the type used to accumulate it must be able to hold huge values.
Go to Top of Page

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2006-11-19 : 20:23:29
quote:
Originally posted by jezemine

Peso, I saw that in BOL too, but that's not the issue. It makes sense to return an int from avg() if you are averaging ints, or a byte if you are averaging bytes.

What I have a problem with is that internally, an int is used to accumulate the sum, which will overflow if the ints/smallints/tinyints you are averaging sum to more than MAX_INT. This issue has nothing to do with the return type, however: simple mathematics guarantees that the avg of any type T is always between MIN_T and MAX_T, as I posted above, so it would always be safe to cast back to T once the calc is finished internally.

does that make sense?




It sounds like your complaint is that SQL Server works exactly the way BOL says it works. What exactly would you expect?

I also don't understand why it should take you several hours to figure out an error like this:

select
avg(1000000000)
from
f_table_number_range(1,100000)


Server: Msg 8115, Level 16, State 2, Line 3
Arithmetic overflow error converting expression to data type int.

The error message seems fairly explicit and obvious to me.





CODO ERGO SUM

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-19 : 21:01:30
quote:
Originally posted by snSQL

quote:
kind of a waste of space if you ask me.


No, what would be a waste of space is if they always used bigint, as you suggest. They figured that int was big enough for most cases and that you can cast to bigint when you work with huge totals. It has to sum all your numbers before it can calculate the average, so it doesn't matter that each number is only one byte - the sum is huge, so the type used to accumulate it must be able to hold huge values.



Not sure I agree. If I were implementing an averaging function for integer types, I would use a long to accumulate the sum, divide by the count, and then cast to int/short/byte as appropriate for the return value. That way you could cover MANY more cases while only using 4 more bytes internally. 4 extra bytes seems pretty cheap to me, and you don't hit overflow until MAX_LONG, which is VERY big.

But SQL Server doesn't work this way, so I had to cast 700m separate bytes to longs. Now *that* is a waste of space. I agree that it may be a rare case, but why not use 4 more bytes internally to make it 2^32 times rarer?

Finally, I didn't post this to bash SQL Server. I like SQL Server as much as the next guy. Just calling it out so others would be aware. Sheesh!






Go to Top of Page

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-19 : 21:28:24
quote:
Originally posted by Michael Valentine Jones
It sounds like your complaint is that SQL Server works exactly the way BOL says it works. What exactly would you expect?


From reading BOL, I would expect that when I take an avg of bytes, the return value is an int. That's what's advertised in BOL as I read it. What BOL doesn't mention is that if the sum calculated internally happens to be bigger than MAX_INT, you get an overflow.

quote:

I also don't imderstand why it should take you several hours to figure out an error like this:

select
avg(1000000000)
from
f_table_number_range(1,100000)


Server: Msg 8115, Level 16, State 2, Line 3
Arithmetic overflow error converting expression to data type int.

The error message seems fairly explicit and obvious to me.



It took a while to debug mainly because the dataset was so large (each time I thought I had a fix, it took a while to try it out), and the query was much more complex than I showed in the post (I oversimplified a lot there in the interest of clarity - I was doing a bit more than a simple avg over all rows). In the post I just showed the root cause. Plus, I wasn't explicitly casting to int anywhere, so the root cause was not immediately apparent to me.

Again, this was just an FYI; to me it was unexpected behavior. But if it strikes you as expected behavior, that's fair, I guess.




Go to Top of Page

snSQL
Master Smack Fu Yak Hacker

1837 Posts

Posted - 2006-11-19 : 23:54:13
Your FYI is appreciated.

You must remember, though, that SQL queries come in all shapes and sizes, and SQL Server has to try to optimize each one in the best way possible while still staying generic enough to handle all the different things we throw at it. One of the ways it does this is by having strict rules for how types are used in expressions. It's not about space: converting a number from one format to another, even between complementary types like int and bigint, is still a conversion. So the designers of SQL Server chose to have as few conversions as possible - they left ints as ints. If you want the sum to be a bigint, you have to convert the two numbers you're adding into bigints first, so if SQL Server did what you suggest it would be converting your 700m ints into bigints anyway and you'd save nothing over your explicit conversion - BUT all the cases where the sum stays below MAX_INT would be converting unnecessarily.

We didn't assume you were SQL bashing, and we aren't jezemine bashing, just explaining why it works the way it does

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-20 : 00:38:25
quote:
Originally posted by snSQL

Your fyi is appreciated

You must remember, though, that SQL queries come in all shapes and sizes, and SQL Server has to try to optimize each one in the best way possible while still staying generic enough to handle all the different things we throw at it. One of the ways it does this is by having strict rules for how types are used in expressions. It's not about space: converting a number from one format to another, even between complementary types like int and bigint, is still a conversion. So the designers of SQL Server chose to have as few conversions as possible - they left ints as ints. If you want the sum to be a bigint, you have to convert the two numbers you're adding into bigints first, so if SQL Server did what you suggest it would be converting your 700m ints into bigints anyway and you'd save nothing over your explicit conversion - BUT all the cases where the sum stays below MAX_INT would be converting unnecessarily.

We didn't assume you were SQL bashing, and we aren't jezemine bashing, just explaining why it works the way it does



Fair enough. However, I would point out that avg() seems to be doing exactly the casts you describe when it averages a column of tinyint or smallint, because it casts to int; internally an int is used to accumulate the sum. If it weren't, then the avg of two tinyints with value 255 would overflow, and it doesn't. So my point was: why not go whole hog, use 4 more bytes, and accumulate in a bigint instead, since you are already casting in 2 of 3 cases anyway?
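Here's a quick check of that claim, in the same style as my earlier repro:

declare @t table (v tinyint)
insert into @t select 255 union all select 255
select avg(v) from @t -- returns 255: the sum is 510, which no tinyint can hold, yet no overflow

So the accumulator has to be something wider than tinyint - presumably int.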

Perhaps the reasoning on the dev's part was (as you say) that the most common integer column type is an int32, in which case no casts are necessary if you accumulate in an int32 - basically, optimize for the most common case. If that was the reasoning, I could buy it. :)





Go to Top of Page

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2006-11-20 : 00:42:54
quote:

Perhaps the reasoning on the dev's part was (as you seem to be hinting at) that the most common integer column type is an int32, in which case no casts are necessary if you accumulate in an int32 - basically, optimize for the most common case. If that was the reasoning, I could buy it. :)



How about the fact that SQL Server 7 never had bigint? Upcasting to int was the only option.
Notice the bigint operators (COUNT_BIG etc.) that were added when SQL2k arrived.

Go to Top of Page

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-20 : 01:00:36
quote:
Originally posted by byrmol
How about the fact that SQL Server 7 never had bigint? Upcasting to int was the only option.
Notice the bigint operators (COUNT_BIG etc.) that were added when SQL2k arrived.



that seems more likely actually. code that works has a way of persisting. :)


Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-11-20 : 05:46:26
How would you calculate (2^63 - 1) + 1?
That's too great even for BIGINT, yet each operand is still less than the max for BIGINT.


Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-11-20 : 06:17:13
Jezemine, you DO NOT have to convert your table column to INT. As long as the calculated SUM does not exceed the max for INT, it is OK to SUM up all TINYINTs too.


Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2006-11-20 : 08:40:09
"I agree that it may be a rare case"

Not sure I agree it's rare. You are at risk as soon as:

Number of rows > 2,147,483,647 / (actual average value)

So for 64K rows an average of 32K would be enough to break it.
At 1M rows the average drops to 2K.
At 16M rows it's down to 128.
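Those break-even points are just MAX_INT divided by the average value:

select 2147483647 / 32768 -- 65535, i.e. ~64K rows at an average of 32K
select 2147483647 / 2048 -- 1048575, i.e. ~1M rows at an average of 2K
select 2147483647 / 128 -- 16777215, i.e. ~16M rows at an average of 128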

"This can be read in Books Online"

Which is silent on AVG(MyBIGINT) [i.e. Expression result = BIGINT] altogether :-(

Strikes me that it would be better if SQL Server silently promoted the internal calculation to BIGINT when the first overflow was hit; there isn't going to be a problem with the result being rounded back down to INT!

I guess it's more effort to do a moving calculation rather than SUM the lot and divide by COUNT(*), but that's how my HP calculator used to do it, IIRC. I can't even remember what the formula was, but it needed some delta value or somesuch. I'm sure you mathematicians know what I mean! Mind you, a single-user HP calculator didn't need a minimal-cpu-cycles approach!
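The formula I'm half-remembering is the incremental mean: new_avg = old_avg + (x - old_avg) / n, which never builds up a big sum at all. A cursor-based sketch (NOT how AVG() really works internally - the table/column names here are made up):

declare @n bigint, @avg float, @x float
set @n = 0
set @avg = 0.0
declare c cursor for select mytinyint from mytable
open c
fetch next from c into @x
while @@fetch_status = 0
begin
    set @n = @n + 1
    set @avg = @avg + (@x - @avg) / @n -- the running mean stays within the data's own range
    fetch next from c into @x
end
close c
deallocate c
select @avg as running_avg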

Kristen

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2006-11-20 : 09:03:53
You had a HP41 too?


Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2006-11-20 : 09:27:13
Can't remember. It was a long time ago. Might have been an HP0.9 ... I had version 0.9 of WordStar!

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-20 : 09:58:19
quote:
Originally posted by Kristen

"I agree that it may be a rare case"

Not sure I agree it's rare. You are at risk as soon as:

Number of rows > 2,147,483,647 / (actual average value)

So for 64K rows an average of 32K would be enough to break it.
At 1M rows the average drops to 2K.
At 16M rows it's down to 128.



Yeah, the math says it's not that rare at all. I figure it's rare in practice, though, since I found no mention of it in BOL or in the Google searches I did (I did find references to similar issues on Oracle and DB2). That's why I thought it would be useful to post it.


Go to Top of Page

jezemine
Master Smack Fu Yak Hacker

2886 Posts

Posted - 2006-11-20 : 10:07:21
quote:
Originally posted by Peso

How would you calculate (2^63 - 1) + 1?
That's too great even for BIGINT, yet each operand is still less than the max for BIGINT.



One way would be to cast to float (as I wrote in the original post), then back to bigint when you have the answer. But that's a limitation I could live with. Or you could use Mathematica.

quote:
Originally posted by Peso
Jezemine, you DO NOT have to convert your table column to INT. As long as the calculated SUM does not exceed max for INT, it is ok to SUM up all TINYINTS too.



I understand that I don't have to convert anything to int. All I was trying to say is that internally, avg() appears to use an int to accumulate the sum. That means it must cast your column to int when you avg a tinyint or smallint column; otherwise it wouldn't be able to accumulate the sum in an int.


Go to Top of Page
   
