
 Optimising big queries


icx
Starting Member

11 Posts

Posted - 2008-03-26 : 11:01:54
Hi there

Recently our company purchased a product from ip2location.com: a database containing 2.9 million IP address ranges and their approximate cities/countries of registration.

Naturally, I thought - "Hey, wouldn't it be great if we could cross reference this with our IIS logs so we could see where our visitors are from?".

So, I set about doing just that. Our IIS logs are already in SQL.

The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.

One of the issues is that each record in the ip2location database comprises a FROM_IP and TO_IP range to describe a range of IPs. So to find which IP range a particular IP resides in, I have to join using a BETWEEN statement (or so, I think anyway!).
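The join I'm using is shaped roughly like this (table and column names here are illustrative placeholders, not our real schema):

```sql
-- Illustrative lookup: which range does each logged IP fall into?
-- IISLog and IpLocation are placeholder names.
SELECT L.LogTime, L.ClientIPNumber, R.City, R.Country
FROM IISLog AS L
INNER JOIN IpLocation AS R
    ON L.ClientIPNumber BETWEEN R.IP_From AND R.IP_To
```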

Does anyone have any suggestions on how to improve this process, or has anyone done anything similar before?

Ideally, I'd like to write a trigger to grab the IP region data (i.e. City/Country) and update the IISLog with that value when the new row is inserted, saving me having to do it later.

I tried this, and the batch import of IIS logs into SQL took so long I got bored and gave up :)

Any help anyone can offer would be appreciated.

Many thanks

Richard.
P.S. Somebody is bound to ask - "Why couldn't you just use Google Analytics?"; my answer is because we want to slice up our log data into chunks, and give it to our customers in semi-real time. Plus the logs report on other services - not just HTTP. ;)

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-03-26 : 11:05:31
Create a clustered index over IP range.



E 12°55'05.25"
N 56°04'39.16"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-03-26 : 11:06:17
Also, is the IP range varchar or INT?




icx
Starting Member

11 Posts

Posted - 2008-03-26 : 11:10:31
quote:
Originally posted by Peso

Also is IP range varchar? Or INT?



Hi Peso

The IP range is bigint (it's stored in IPNumber format, so not the traditional '192.168.0.0'-style dotted notation).
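(For anyone unfamiliar with the format: IPNumber packs the four octets into a single integer. A rough T-SQL conversion from dotted notation, using PARSENAME to split on the dots, might look like this:)

```sql
-- a.b.c.d  ->  a*2^24 + b*2^16 + c*2^8 + d
-- PARSENAME numbers the dot-separated parts from the right.
DECLARE @ip varchar(15)
SET @ip = '192.168.0.1'

SELECT CAST(PARSENAME(@ip, 4) AS bigint) * 16777216
     + CAST(PARSENAME(@ip, 3) AS bigint) * 65536
     + CAST(PARSENAME(@ip, 2) AS bigint) * 256
     + CAST(PARSENAME(@ip, 1) AS bigint) AS IPNumber   -- 192.168.0.1 -> 3232235521
```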

So, create a clustered index with IP_FROM and IP_TO as the Included Columns?

Thanks
p.s. (new to clustering!)

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-03-26 : 11:18:54
Yes, a normal

CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To)

will do well.

And remember to use FillFactor 100 when creating.




icx
Starting Member

11 Posts

Posted - 2008-03-26 : 11:22:49
quote:
Originally posted by Peso

Yes, a normal

CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To)

will do well.

And remember to use FillFactor 100 when creating.


Ok, will try this now to see if it helps.... Many thanks.

icx
Starting Member

11 Posts

Posted - 2008-04-17 : 12:12:53
Sorry it's been a while, but I wanted to update the thread to say that your suggestion worked brilliantly - thank you for your help and advice.


quote:
Originally posted by icx

quote:
Originally posted by Peso

Yes, a normal

CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To)

will do well.

And remember to use FillFactor 100 when creating.


Ok, will try this now to see if it helps.... Many thanks.


SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-04-17 : 16:04:08
You wrote before that a search took about a second.
How long does it take now?




Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-04-18 : 08:55:33
quote:
Originally posted by Peso

You wrote before that a search took about a second.
How long does it take now?



The question is, at what value of n does a query like:

SELECT *
FROM ( SELECT TOP n * FROM #IPSample ) AS S
INNER JOIN #Range AS R
ON S.ip BETWEEN R.ip_from AND R.ip_to

go from having a query plan like:

|--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[#IPSample].[ip]))
|--Top(TOP EXPRESSION:((98)))
| |--Table Scan(OBJECT:([tempdb].[dbo].[#IPSample]))
|--Clustered Index Seek(OBJECT:([tempdb].[dbo].[#Range] AS
[R]), SEEK:([R].[ip_from] <= [tempdb].[dbo].[#IPSample].[ip]),
WHERE:([tempdb].[dbo].[#IPSample].[ip]<=[tempdb].[dbo].[#Range].[ip_to]
as [R].[ip_to]) ORDERED FORWARD)

to one like:

|--Nested Loops(Inner Join, OUTER REFERENCES:([R].[ip_from], [R].[ip_to]))
|--Index Scan(OBJECT:([tempdb].[dbo].[#Range] AS [R]))
|--Index Spool(SEEK:([tempdb].[dbo].[#IPSample].[ip] >=
[tempdb].[dbo].[#Range].[ip_from] as [R].[ip_from] AND
[tempdb].[dbo].[#IPSample].[ip] <= [tempdb].[dbo].[#Range].[ip_to] as
[R].[ip_to]))
|--Top(TOP EXPRESSION:((99)))
|--Table Scan(OBJECT:([tempdb].[dbo].[#IPSample]))

i.e. when does it become cheaper to find the logged IP addresses for each range rather than the ranges for each logged IP address?
With 3 million rows in #Range and no index on #IPSample.ip, I get a value of n=99.

Edit: added line breaks in the execution plan so it doesn't ruin the layout of the rest of the thread.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-04-18 : 09:05:51
Here's the test code I wrote:

-- drop the work tables if they exist from a previous run:
IF OBJECT_ID('tempdb..#R0') IS NOT NULL DROP TABLE #R0
IF OBJECT_ID('tempdb..#R1') IS NOT NULL DROP TABLE #R1
IF OBJECT_ID('tempdb..#Range') IS NOT NULL DROP TABLE #Range
IF OBJECT_ID('tempdb..#IPSample') IS NOT NULL DROP TABLE #IPSample

-- generate 4 million random 32-bit values to form the start of each range:
SELECT CAST(FLOOR(POWER(2.0, 32) * RAND(CAST(NEWID() AS binary(4)))) AS bigint) as ip
INTO #R0
FROM dbo.Numbers AS N1, dbo.Numbers AS N2
WHERE N2.n < 40

-- get the distinct ones and give them ids in ip order:
SELECT ROW_NUMBER() OVER (ORDER BY ip) AS range_id, ip
INTO #R1
FROM #R0
GROUP BY ip

-- cluster on range id:
ALTER TABLE #R1 ALTER COLUMN range_id int NOT NULL
ALTER TABLE #R1 ADD PRIMARY KEY (range_id)

-- get the end of one range from the start of the next range
-- (i.e. ranges are non-overlapping and contiguous):
CREATE TABLE #Range (
range_id int PRIMARY KEY NONCLUSTERED,
ip_from bigint NOT NULL,
ip_to bigint NOT NULL
)

INSERT INTO #Range
SELECT R_from.range_id, R_from.ip AS ip_from, R_to.ip - 1 AS ip_to
FROM #R1 AS R_from
INNER JOIN #R1 AS R_to
ON R_from.range_id = R_to.range_id - 1

-- throw away every fourth range so there are some holes and ~3 million ranges:
DELETE FROM #Range
WHERE (range_id % 4) = 0

-- create the unique clustered index:
CREATE UNIQUE CLUSTERED INDEX Range_from_to ON #Range(ip_from, ip_to) WITH (FILLFACTOR = 100)

-- create a million random IP addresses
SELECT CAST(FLOOR(POWER(2.0, 32) * RAND(CAST(NEWID() AS binary(4)))) AS bigint) as ip
INTO #IPSample
FROM dbo.Numbers AS N1, dbo.Numbers AS N2
WHERE N2.n < 10

-- the execution plan shows an estimated row count of 709,780,000,000 for the result!
SELECT *
FROM #IPSample AS S
INNER JOIN #Range AS R
ON S.ip BETWEEN R.ip_from AND R.ip_to

-- if we know that the ranges are non-overlapping, single IP address
-- lookups can be done much more quickly like this:
SELECT *
FROM ( SELECT TOP 1000 * FROM #IPSample ) AS S
OUTER APPLY (
SELECT *
FROM (
SELECT TOP 1 *
FROM #Range AS R
WHERE S.ip >= R.ip_from
ORDER BY R.ip_from DESC
) AS R
WHERE S.ip <= R.ip_to
) AS R


icx
Starting Member

11 Posts

Posted - 2008-04-28 : 04:30:31
Hi all

I must apologise - I completely forgot about this thread, and for some reason, my subscription to it stopped emailing me updates. Thank you for so many helpful replies, and also to Arnold Fribble, whose excellent code I'm still going through trying to understand :)

I think I will try to work some of Arnold's logic into my own query and then I'll post the results when I'm done...

Again, thanks for the assistance.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-04-28 : 08:08:27
I think what I was trying to get across (rather obliquely!) was that I was surprised about two things:
1. Peter (Peso) had suggested merely indexing those columns
2. That doing so had made it fast enough for you, particularly if you were getting a range for a single IP address each time.

Actually, if you do have overlapping ranges (and therefore the CROSS APPLY / TOP 1 solution doesn't work) then I'm not sure what the best way to do it is...

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-04-28 : 09:00:54
quote:
Originally posted by icx

The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.

This is the part I read.
I interpreted this as: the ASP page reads the client IP address, searches the IP table, and then writes to the log table.




SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-04-28 : 09:03:02
Or are you doing nightly "batch" updates?
Update all log entries for yesterday, with city information?




Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-04-28 : 09:32:37
quote:
Originally posted by Peso

quote:
Originally posted by icx

The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.

This is the part I read.
I interpreted this as: the ASP page reads the client IP address, searches the IP table, and then writes to the log table.



Yes, that's what I was assuming too. Which is why I didn't understand your answer, because I don't really see how it helps to any great degree. A simple index on (ip_from, ip_to) can only be seeked to return the ranges where ip_from <= ip -- which is essentially everything by the time you get to the far end.

One way I could see that index being useful in looking up ranges for a single IP address was, as I said, if the ranges don't overlap and therefore the one with the largest ip_from <= ip is the required one (hence the APPLY query).

And then when the OP said it was working well I thought maybe I'd misinterpreted and that it was actually a batch process.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-04-28 : 09:37:02
I have to test your code tonight.
Maybe a

SELECT TOP 1 * FROM IpLocation WHERE FromIP < @IpToSearch ORDER BY FromIP DESC

would suffice?

Oh, what the heck! Why not "unpack" the ranges to single IP addresses? A sort of semi-normalization?
IpFrom  IpTo  City
1       3     London

yields

IP  City
1   London
2   London
3   London
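A sketch of that unpacking, assuming a dbo.Numbers auxiliary table starting at 0 (a placeholder; adjust to whatever numbers table you have). Note that with wide ranges this explodes into billions of rows, so it's only viable if the ranges are narrow:

```sql
-- One row per individual address in each range (Numbers assumed 0-based).
SELECT R.IP_From + N.n AS IPNumber, R.City
FROM IpLocation AS R
INNER JOIN dbo.Numbers AS N
    ON N.n <= R.IP_To - R.IP_From
```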


Are the IP ranges all D-class addresses? Or C-class addresses?



Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2008-04-28 : 09:59:31
quote:
Originally posted by Peso

I have to test your code tonight.
Maybe a

SELECT TOP 1 * FROM IpLocation WHERE FromIP < @IpToSearch ORDER BY FromIP DESC

would suffice?



Quite true...
Can't stop thinking set-based even when I should!
In fact, that APPLY isn't doing much more than what you said -- it's just a bit complicated by trying to keep the seek-and-top execution plan while getting left-join-ish behaviour.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-04-28 : 10:43:46
You can also experiment with horizontally partitioned tables.
See http://weblogs.sqlteam.com/peterl/archive/2007/09/24/Horizontal-partitioning.aspx

I don't know the distribution of your ip-addresses, but about 16 base tables would be enough, with ranges like this

0 268435455
268435456 536870911
536870912 805306367
805306368 1073741823
1073741824 1342177279
1342177280 1610612735
1610612736 1879048191
1879048192 2147483647
2147483648 2415919103
2415919104 2684354559
2684354560 2952790015
2952790016 3221225471
3221225472 3489660927
3489660928 3758096383
3758096384 4026531839
4026531840 4294967295
Of course, you have to investigate your data to choose proper partition boundaries, but this is the general idea. And because the IP (which is part of the PK) is always included, it will be fast.
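On SQL Server 2005 you could alternatively express the same sixteen equal-width buckets with a native partition function (a sketch of a different technique from the partitioned-view approach in the blog post; RANGE LEFT means each listed value is a bucket's upper bound):

```sql
CREATE PARTITION FUNCTION pfIpBucket (bigint)
AS RANGE LEFT FOR VALUES (
     268435455,  536870911,  805306367, 1073741823,
    1342177279, 1610612735, 1879048191, 2147483647,
    2415919103, 2684354559, 2952790015, 3221225471,
    3489660927, 3758096383, 4026531839
)
```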


icx
Starting Member

11 Posts

Posted - 2008-04-29 : 06:00:13
Wow - you guys are going way over my head here :) I should explain that I am by no means a DBA :) My background is in software programming, more lately with the .NET framework ... so you can see my dilemma here.... I am trying to be a DBA and not doing too well! :)

(To my credit, I have now purchased several books on the subject).

Anyway, back to the thread. Just to clarify a few points for you both: the ranges do NOT overlap, and the updates to the log data from the IP2location database have been taking place as each row is inserted (hence the problem: with a 1s+ delay per lookup, a million rows = 1 million seconds to complete the insert, which is, of course, bad!).

I am actually now rewriting the entire logging component so that, for instance, once the initial log import has completed, I can fire a separate proc to do the updates in a batch, instead of using SQL triggers to do it row by row.
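The batch update I have in mind is shaped something like this (names are illustrative; it leans on the ranges being non-overlapping, per the TOP 1 trick above):

```sql
-- One set-based geolocation pass after the import completes.
-- With non-overlapping ranges, the range with the largest IP_From <= ip
-- is the only candidate; just check its IP_To actually covers ip.
UPDATE L
SET City = R.City
FROM IISLog AS L
CROSS APPLY (
    SELECT TOP 1 IP_To, City
    FROM IpLocation
    WHERE IP_From <= L.ClientIPNumber
    ORDER BY IP_From DESC
) AS R
WHERE L.City IS NULL
  AND L.ClientIPNumber <= R.IP_To
```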

Does this help clarify things a little?

Thanks again for your support and your patience...