| Author |
Topic |
|
icx
Starting Member
11 Posts |
Posted - 2008-03-26 : 11:01:54
|
| Hi there,

Recently our company purchased a product from ip2location.com: a database containing 2.9 million IP address ranges and their approximate cities/countries of registration.

Naturally, I thought, "Hey, wouldn't it be great if we could cross-reference this with our IIS logs so we could see where our visitors are from?" So I set about doing just that. Our IIS logs are already in SQL.

The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.

One of the issues is that each record in the ip2location database comprises a FROM_IP and TO_IP pair to describe a range of IPs. So to find which IP range a particular IP resides in, I have to join using a BETWEEN clause (or so I think, anyway!).

Does anyone have any suggestions on how to improve this process, or has anyone done anything similar before?

Ideally, I'd like to write a trigger to grab the IP region data (i.e. City/Country) and update the IISLog with that value when the new row is inserted, saving me having to do it later. I tried this, and the batch import of IIS logs into SQL took so long I got bored and gave up :)

Any help anyone can offer would be appreciated.

Many thanks,
Richard.

P.S. Somebody is bound to ask, "Why couldn't you just use Google Analytics?" My answer is because we want to slice up our log data into chunks and give it to our customers in semi-real time. Plus the logs report on other services - not just HTTP. ;) |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-03-26 : 11:05:31
|
Create a clustered index over the IP range columns.

E 12°55'05.25"N 56°04'39.16" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-03-26 : 11:06:17
|
Also, is the IP range varchar or INT? |
 |
|
|
icx
Starting Member
11 Posts |
Posted - 2008-03-26 : 11:10:31
|
quote: Originally posted by Peso Also is IP range varchar? Or INT?
Hi Peso,

The IP range is BIGINT (it's stored in IPNumber format, not the traditional '192.168.0.0' style dotted notation).

So, create a clustered index with IP_FROM and IP_TO as the included columns?

Thanks

p.s. (new to clustering!) |
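As a side note for readers unfamiliar with the IPNumber format mentioned above: a dotted-quad IPv4 address maps to a single integer by treating the four octets as one big-endian 32-bit number. A minimal sketch in Python (the function names are mine, not from ip2location's documentation):

```python
def ip_to_number(dotted: str) -> int:
    """Convert a dotted-quad IPv4 address to its IPNumber (big-endian 32-bit integer)."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def number_to_ip(n: int) -> str:
    """Inverse: convert an IPNumber back to dotted-quad notation."""
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

For example, ip_to_number("192.168.0.0") gives 3232235520, which fits comfortably in a BIGINT column.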
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-03-26 : 11:18:54
|
Yes, a normal

CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To)

will do well. And remember to use FILLFACTOR = 100 when creating it. |
 |
|
|
icx
Starting Member
11 Posts |
Posted - 2008-03-26 : 11:22:49
|
quote: Originally posted by Peso Yes, a normal CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To) will do well. And remember to use FillFactor 100 when creating.
Ok, will try this now to see if it helps.... Many thanks. |
 |
|
|
icx
Starting Member
11 Posts |
Posted - 2008-04-17 : 12:12:53
|
Sorry it's been a while, but I wanted to update the thread to say that your suggestion worked brilliantly - thank you for your help and advice.

quote: Originally posted by icx
quote: Originally posted by Peso Yes, a normal CREATE UNIQUE CLUSTERED INDEX IX_IP ON tbl_IP (IP_From, IP_To) will do well. And remember to use FillFactor 100 when creating.
Ok, will try this now to see if it helps.... Many thanks.
|
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-04-17 : 16:04:08
|
You wrote before that a search took about a second. How long does it take now? |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-04-18 : 08:55:33
|
quote: Originally posted by Peso You wrote before that a search took about a second. How long does it take now?
The question is, at what value of n does a query like:

SELECT *
FROM (
    SELECT TOP n *
    FROM #IPSample
) AS S
INNER JOIN #Range AS R ON S.ip BETWEEN R.ip_from AND R.ip_to

go from having a query plan like:

|--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[#IPSample].[ip]))
     |--Top(TOP EXPRESSION:((98)))
     |    |--Table Scan(OBJECT:([tempdb].[dbo].[#IPSample]))
     |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[#Range] AS [R]),
          SEEK:([R].[ip_from] <= [tempdb].[dbo].[#IPSample].[ip]),
          WHERE:([tempdb].[dbo].[#IPSample].[ip] <= [tempdb].[dbo].[#Range].[ip_to] as [R].[ip_to]) ORDERED FORWARD)

to one like:

|--Nested Loops(Inner Join, OUTER REFERENCES:([R].[ip_from], [R].[ip_to]))
     |--Index Scan(OBJECT:([tempdb].[dbo].[#Range] AS [R]))
     |--Index Spool(SEEK:([tempdb].[dbo].[#IPSample].[ip] >= [tempdb].[dbo].[#Range].[ip_from] as [R].[ip_from]
          AND [tempdb].[dbo].[#IPSample].[ip] <= [tempdb].[dbo].[#Range].[ip_to] as [R].[ip_to]))
          |--Top(TOP EXPRESSION:((99)))
               |--Table Scan(OBJECT:([tempdb].[dbo].[#IPSample]))

i.e. when does it become cheaper to find the logged IP addresses for each range rather than the ranges for each logged IP address?

With 3 million rows in #Range and no index on #IPSample.ip, I get a value of n = 99.

Edit: added line breaks in the execution plans so they don't ruin the layout of the rest of the thread. |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-04-18 : 09:05:51
|
Here's the test code I wrote:

DROP TABLE #R0
DROP TABLE #R1
DROP TABLE #Range

-- generate 4 million random 32-bit values to form the start of each range:
SELECT CAST(FLOOR(POWER(2.0, 32) * RAND(CAST(NEWID() AS binary(4)))) AS bigint) AS ip
INTO #R0
FROM dbo.Numbers AS N1, dbo.Numbers AS N2
WHERE N2.n < 40

-- get the distinct ones and give them ids in ip order:
SELECT ROW_NUMBER() OVER (ORDER BY ip) AS range_id, ip
INTO #R1
FROM #R0
GROUP BY ip

-- cluster on range id:
ALTER TABLE #R1 ALTER COLUMN range_id int NOT NULL
ALTER TABLE #R1 ADD PRIMARY KEY (range_id)

-- get the end of one range from the start of the next range
-- (i.e. ranges are non-overlapping and contiguous):
CREATE TABLE #Range (
    range_id int PRIMARY KEY NONCLUSTERED,
    ip_from bigint NOT NULL,
    ip_to bigint NOT NULL
)

INSERT INTO #Range
SELECT R_from.range_id, R_from.ip AS ip_from, R_to.ip - 1 AS ip_to
FROM #R1 AS R_from
INNER JOIN #R1 AS R_to ON R_from.range_id = R_to.range_id - 1

-- throw away every fourth range so there are some holes and ~3 million ranges:
DELETE FROM #Range
WHERE (range_id % 4) = 0

-- create the unique clustered index:
CREATE UNIQUE CLUSTERED INDEX Range_from_to ON #Range (ip_from, ip_to) WITH (FILLFACTOR = 100)

-- create a million random IP addresses:
SELECT CAST(FLOOR(POWER(2.0, 32) * RAND(CAST(NEWID() AS binary(4)))) AS bigint) AS ip
INTO #IPSample
FROM dbo.Numbers AS N1, dbo.Numbers AS N2
WHERE N2.n < 10

-- execution plan shows estimated row count for result of 709780000000!
SELECT *
FROM #IPSample AS S
INNER JOIN #Range AS R ON S.ip BETWEEN R.ip_from AND R.ip_to

-- if we know that the ranges are non-overlapping, single IP address
-- lookups can be done much more quickly like this:
SELECT *
FROM (
    SELECT TOP 1000 *
    FROM #IPSample
) AS S
OUTER APPLY (
    SELECT *
    FROM (
        SELECT TOP 1 *
        FROM #Range AS R
        WHERE S.ip >= R.ip_from
        ORDER BY R.ip_from DESC
    ) AS R
    WHERE S.ip <= R.ip_to
) AS R |
 |
|
|
icx
Starting Member
11 Posts |
Posted - 2008-04-28 : 04:30:31
|
| Hi all,

I must apologise - I completely forgot about this thread, and for some reason my subscription to it stopped emailing me updates. Thank you for so many helpful replies, and also to Arnold Fribble, whose excellent code I'm still going through trying to understand :)

I think I will try to work some of Arnold's logic into my own query, and then I'll post the results when I'm done...

Again, thanks for the assistance. |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-04-28 : 08:08:27
|
| I think what I was trying to get across (rather obliquely!) was that I was surprised about two things:

1. Peter (Peso) had suggested merely indexing those columns.
2. The result of doing so had made it fast enough for you, particularly if you were getting a range for a single IP address each time.

Actually, if you do have overlapping ranges (and therefore the CROSS APPLY / TOP 1 solution doesn't work), then I'm not sure what the best way to do it is... |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-04-28 : 09:00:54
|
quote: Originally posted by icx The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.
This is the part I read. I interpreted it as: the ASP page reads the client IP address, searches the IP table, and then writes to the log table. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-04-28 : 09:03:02
|
Or are you doing nightly "batch" updates? Updating all log entries for yesterday with city information? |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-04-28 : 09:32:37
|
quote: Originally posted by Peso
quote: Originally posted by icx The trouble is, the ip2location database is so large that executing a query against it to find which range a particular IP address is within takes me 1 second. Multiply that by 1,000,000 log rows, and Houston - we have a problem.
This is the part I read.I interpreted this as the ASP page read the client IP address, searched the IP-table and then wrote to the log table.
Yes, that's what I was assuming too, which is why I didn't understand your answer - I don't really see how it helps to any great degree. A simple index on (ip_from, ip_to) can only be seeked to return the ranges where ip_from <= ip, which is essentially everything by the time you get to the far end.

One way I could see that index being useful in looking up ranges for a single IP address was, as I said, if the ranges don't overlap and therefore the one with the largest ip_from <= ip is the required one (hence the APPLY query).

And then when the OP said it was working well, I thought maybe I'd misinterpreted and that it was actually a batch process. |
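For what it's worth, the non-overlapping-ranges trick described above (take the range with the largest ip_from <= ip, then verify ip <= ip_to) is the same idea as a binary search over the sorted ip_from values. A sketch in Python, with made-up table contents for illustration:

```python
import bisect

# Ranges sorted by ip_from, non-overlapping, with gaps: (ip_from, ip_to, city)
ranges = [(0, 99, "London"), (200, 299, "Paris"), (300, 399, "Oslo")]
froms = [r[0] for r in ranges]  # sorted list of range starts

def lookup(ip):
    """Return the city for ip, or None if it falls in a gap.

    Binary-search for the range with the largest ip_from <= ip,
    then check that ip has not run past that range's ip_to.
    """
    i = bisect.bisect_right(froms, ip) - 1
    if i >= 0 and ip <= ranges[i][1]:
        return ranges[i][2]
    return None
```

The ORDER BY ip_from DESC / TOP 1 / WHERE ip <= ip_to pattern in the earlier SQL performs exactly this seek-then-verify step against the clustered index.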
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-04-28 : 09:37:02
|
I have to test your code tonight. Maybe a

SELECT TOP 1 * FROM IpLocation WHERE FromIP < @IpToSearch ORDER BY FromIP DESC

would suffice?

Oh, what the heck! Why not "unpack" the ranges to single IP addresses? A sort of semi-normalization?

IpFrom  IpTo  City
1       3     London

yields

1  London
2  London
3  London

Are the IP ranges all D-class addresses? Or C-class addresses? |
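Peso's semi-normalization idea above can be sketched in a few lines of Python (illustrative only - with 2.9 million ranges covering much of the 32-bit address space, the unpacked table could be enormous, which is presumably why the thread sticks with range lookups):

```python
def unpack(ip_from, ip_to, city):
    """Expand one (IpFrom, IpTo, City) range into one row per address."""
    return [(ip, city) for ip in range(ip_from, ip_to + 1)]
```

Applied to the example range in the post, unpack(1, 3, "London") yields the three single-address rows shown.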
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2008-04-28 : 09:59:31
|
quote: Originally posted by Peso I have to test your code tonight. Maybe a SELECT TOP 1 * FROM IpLocation WHERE FromIP < @IpToSearch ORDER BY FromIP DESC would suffice?
Quite true... Can't stop thinking set-based even when I shouldn't! In fact, that APPLY isn't doing much more than what you said - it's just a bit complicated by trying to keep the seek-and-top execution plan and get left-join-ish behaviour. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2008-04-28 : 10:43:46
|
You can also experiment with horizontally partitioned tables. See http://weblogs.sqlteam.com/peterl/archive/2007/09/24/Horizontal-partitioning.aspx

I don't know the distribution of your IP addresses, but about 16 base tables would be enough, with ranges like this:

         0 -  268435455
 268435456 -  536870911
 536870912 -  805306367
 805306368 - 1073741823
1073741824 - 1342177279
1342177280 - 1610612735
1610612736 - 1879048191
1879048192 - 2147483647
2147483648 - 2415919103
2415919104 - 2684354559
2684354560 - 2952790015
2952790016 - 3221225471
3221225472 - 3489660927
3489660928 - 3758096383
3758096384 - 4026531839
4026531840 - 4294967295

Of course, you have to investigate your data to make proper partitioning decisions, but this is the general idea. And because the IP (which is part of the PK) is always included, it will be fast. |
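The 16 equal ranges above are just successive slices of 2^32 / 16 = 268,435,456 addresses each, so routing an IPNumber to its base table is a single integer division. A sketch in Python (the names are hypothetical, not part of any partitioning scheme from the thread):

```python
PARTITIONS = 16
SLICE_SIZE = 2**32 // PARTITIONS  # 268435456 addresses per base table

def partition_for(ip_number: int) -> int:
    """Index (0-15) of the base table whose range contains this IPNumber."""
    return ip_number // SLICE_SIZE
```

With real data you would size the slices by the observed distribution of addresses rather than equal widths, as Peso notes.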
 |
|
|
icx
Starting Member
11 Posts |
Posted - 2008-04-29 : 06:00:13
|
| Wow - you guys are going way over my head here :) I should explain that I am by no means a DBA :) My background is in software programming, more lately with the .NET framework... so you can see my dilemma: I am trying to be a DBA and not doing too well! :) (To my credit, I have now purchased several books on the subject.)

Anyway, back to the thread. Just to clarify a few points for you both: the ranges do NOT overlap, and currently the updates to the log data from the ip2location database were taking place as each row was inserted (hence the problem: with a 1s+ delay per lookup, a million rows = a million seconds to complete the insert, which is, of course, bad!).

I am actually now rewriting the entire logging component so that, for instance, once the initial log import has completed, I can fire a separate proc to do the updates in a batch, vs. using SQL's triggers to do it on a row-by-row basis.

Does this help clarify things a little?

Thanks again for your support and your patience... |
 |
|
|
|