Database Server Sizing

By Chris Miller on 12 December 2000 | Tags: Application Design

Ascii255 writes "I am trying to create a site, which users can save their bookmarks upon registration . . . Lets say the site has 100k registered users, each cheking in daily. That means 100k queries made each day, among 2M records (assuming an avg user has 20 bookmarks). It is like one query/sec. What kind of hardware is adequate for that kind of process?"

The full text of the question is I am trying to create a site, which users can save their bookmarks upon registration. I created 2 tables, a user table to keep user information, and a table to keep bookmarks. (the second table is like, UsrID, Url) and I join the tables on UsrID. The thing works out fine now. But imagine this. Lets say the site has 100k registered users, each cheking in daily. That means 100k queries made each day, among 2M records (assuming an avg user has 20 bookmarks). It is like one query/sec. What kind of hardware is adequate for that kind of process?

Another database I am building is supposed to keep track of all the operations on 320 share quotes. That means like 3-5M insert statements everyday. I would like to keep track of at least 2 months, which gives me 300M rows. Can Mssql handle that much rows?

Big questions. I've got some answers, but more pointers on where you'll need to go to find more.

First off, good database design is more important than hardware. Good database design beats good hardware every time. Spend a lot of time on database design and indexing. OK, I think I've emphasized that enough.

Now, a few words on capacity planning. There are some interesting things to keep in mind with respect to how to predict traffic. For example, for sites that have predominantly US audiences, they receive about 10% of their daily traffic in their highest traffic hour. So if you're expecting 100K queries per day, you can probably expect 10K per hour. Well, that's the case if you're running a site here in the US, it just works out that way probably because of the number of covered timezones and usage patterns. Your mileage will vary. Ok, 10K per hour doesn't sound nearly as bad as you think, even out of 2M recs.

Let's look at what your bottlenecks will be. First, let's assume we're dealing with an integer UserID and a bookmark that can be at most 1 KB. So that's 1 KB plus 4 bytes per record, so for 2M records, you're looking at about 2.2 GB for the data space at maximum. If you use varchars to hold your URL's (a definite must in this case) you're probably looking at a lot less than half that. You'll probably also have an index on UserID, so that's another 400K for the index (rough estimates). The queries you're running are pretty simple, so it should run faster than hell, and then it'll get even faster as data gets into cache. If you're reading the data off disk too much, the server will die. OK, it won't die, but you'll have slow response instead of the crisp response you're looking for.

It shouldn't take a lot of box should to pull that off, but the big constraint is going to be the amount of memory you can get into the system. You want to have at least enough memory to cache most of the database and all of the indexes. Now, how big a box do you need to get? Well, I'd suggest building the database and then grabbing a separate workstation box and running a bunch of scripts to determine how much traffic you can actually handle on a box that size. Sizing hardware is very difficult and very dependent on your specific needs and database layouts.

The processor time consumed by this should be relatively small--that's where you've got to have good indexes. SQL Server 7 introduced a whole bunch of join types, but if you're using anything besides nice clean loop joins (the only join type available in SQL 6.5) you're chewing up huge amounts of processor time. How do you get loop joins? Good indexes.

Let's take a look at your overhead. First, you've got connection overhead from making and breaking connections, most of which should be alleviated by our good friend the Internet Information Server, which will pool the connections to keep the overhead of connect/disconnect/connect minimized by leaving connections open and reusing them. Second is insert/update overhead, and once again, if you've got the right indexes, this won't be too bad. Third is logging overhead: where and how are you going to log who hits your site? How do you know that someone didn't just connect, make 100 bookmarks, and then promptly forget their password and submit 100 new passwords. After a month or two, you'll need to be able to know which records aren't being used...

OK, as for 3-5 million insert statements daily: How about a different methodology for getting rows into the database? How about 3-5 million Bulk inserts, either via (icky) DTS or via (my favorite) command line BCP. Significantly lower overhead, and performance like you won't believe. Read up on BCP and the Select Into/BulkCopy option on a database for more information. Then once again it comes down to estimating database size and memory requirements, just like in the exercise above. This is the vastly preferred solution for high-capacity inserts.

Have fun, post responses with questions, I'll keep an eye on this one.

rocketscientist.