Storing Large Graph - Inserting Millions of Rows

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Storing Large Graph - Inserting Millions of Rows

Author

Topic

aeblank
Starting Member

12 Posts

Posted - 2008-11-18 : 04:31:25

THE PROBLEM
I'm running into performance issues generating and storing a randomly created graph in a SQL Server database. I have a T-SQL script that generates a graph, and then randomly connects the vertices in that graph to each other. I can see that my hard-drive is working very hard throughout the operation. The operation took about 2 hours (I just canceled it at this point, looked like it was about 10% done) with 200,000 vertices and an average of 50 edges per vertex (essentially, it had to insert about 12 million rows).

How can I speed this up? Is there a clever way to get SQL Server to do all the operations in RAM, and then just copy the data from RAM to disk in one contiguous stream after all the inserting is done?

I am developing on a 4gb RAM laptop, and the database should easily fit entirely into RAM several times over--I estimate that my "test" script would generate a 150mbyte database.

THE BACKGROUND
I'm new to database programming and am trying to store a directed graph in a database for a research project. The graph will ultimately contain approximately 1,000,000 vertices with an average of about 150 edges per vertex. There will be more than one graph stored. When this is deployed, it will be deployed on a cluster with (much) more RAM, storage and computing power.

Each edge represents a uni-directional connection between two points (vertices) on the graph). So if point 10 connects to point 23 in a graph, there would be an "edge" entry in the database containing "10" and "23".

If a vertex/point is isolated (it connects to nothing), it is stored as an edge which "connects" to itself: i.e. if point 56 connects to no other points, then there would be an "edge" entry in the database containing "56" and "56".

THE SCHEMA
The table is of the form:
[CODE]
CREATE TABLE [dbo].[edges](
[edge_id] [int] IDENTITY(1,1) NOT NULL,
[graph_id] [int] NOT NULL,
[primary_vertex_id] [int] NOT NULL,
[adjacent_vertex_id] [int] NULL CONSTRAINT [DF_edges_adjacent_vertex_id] DEFAULT (NULL),
CONSTRAINT [PK_edges] PRIMARY KEY CLUSTERED
(
[edge_id] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
[/CODE]

THE SCRIPT THAT TAKES TOO LONG
This was cobbled together from two functions ("add vertices" and "connect vertices"), I hope it isn't too difficult to read.
[CODE]
DECLARE @graph_id INT
DECLARE @vertices INT
DECLARE @average_edges INT
SET @graph_id = 35
SET @vertices = 200000 /* 200,000 vertex graph */
SET @average_edges = 50; /* approx 50 connections per vertex */
BEGIN
/* get the first new vertex id -- should usually be 1, but just in case there's already some vertices in graph */
DECLARE @first_vertex_id INT
SELECT @first_vertex_id = (1 + isnull(max(dbo.edges.primary_vertex_id),0)) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id;
BEGIN
DECLARE @counter INT
DECLARE @new_vertex_id INT
SET @counter = 0
WHILE @counter < @vertices
BEGIN
SET @new_vertex_id = @first_vertex_id + @counter;
SET @counter = @counter + 1;
/* insert a partial edge with the new vertex id */
INSERT INTO dbo.edges(graph_id,
primary_vertex_id,
adjacent_vertex_id)
VALUES(@graph_id, @new_vertex_id, @new_vertex_id);
END;
RETURN 0;
END;
END;

BEGIN
DECLARE @lower INT
DECLARE @upper INT
SELECT @lower = min(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id
SELECT @upper = max(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id

/* note: this algorithm assumes that the vertex_ids are CONTIGUOUS!! and may create duplicate edges */
BEGIN
DECLARE @loop_max INT
DECLARE @counter INT
DECLARE @rn_one INT
DECLARE @rn_two INT
SET @counter = 0
SELECT @loop_max = (@upper - @lower + 1) * @average_edges
WHILE @counter < @loop_max
BEGIN
/* pick two random vertex IDs that exist on the graph */
SELECT @rn_one = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);
SELECT @rn_two = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);
/* insert a bi-directional edge for those two vertices */
INSERT INTO dbo.edges(graph_id,
primary_vertex_id,
adjacent_vertex_id)
VALUES(@graph_id, @rn_one, @rn_two);

INSERT INTO dbo.edges(graph_id,
primary_vertex_id,
adjacent_vertex_id)
VALUES(@graph_id, @rn_two, @rn_one);
SET @counter = @counter + 1;
END
END;
END;
[/CODE]

sakets_2000
Master Smack Fu Yak Hacker

1472 Posts

Posted - 2008-11-18 : 04:48:40

Have you tried taking off the indexes when running the script ?

aeblank
Starting Member

12 Posts

Posted - 2008-11-18 : 05:19:38

How would I do that from within my script? (This code is called from a .NET app as a stored procedure)
And of course, how do I re-create the indexes after the operation is done?

Can you point me to the right commands, and then I can Google it from there.
Thanks!

sakets_2000
Master Smack Fu Yak Hacker

1472 Posts

Posted - 2008-11-18 : 05:23:56

Include the drop indices command towards the start of the proc, before you start inserting.
Something like,
IF EXISTS (SELECT name FROM sysindexes WHERE name = 'PK_edges')
DROP INDEX edges.PK_edges

After you're through with the insert, Recreate the index.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources