Exporting records that change - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Exporting records that change

Author

Topic

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 08:25:19

Hi all,

I have imported a .csv file into Server 2008 (via access to add in the 'ID' field because I couldn't figure out how to do that using the import wizard) and the resulting table looks like this:

ID, Date, Price, Volume,
1, 20090610 153000, 936, 21
2, 20090610 153000, 935.75, 1
3, 20090610 153010, 935.5, 1
4, 20090610 153017, 935.5, 1
5, 20090610 153000, 935.5, 1
6, 20090610 153000, 935.75, 5

... and so on for many millions (in some cases 25+) of rows. What I would like to do is to remove the rows where the Price field hasn't changed from the previous record - so the new table would look like this:

ID, Date, Price, Volume,
1, 20090610 153000, 936, 21
2, 20090610 153000, 935.75, 1
3, 20090610 153010, 935.5, 1
6, 20090610 153000, 935.75, 5

(rows 4 and 5 have been removed from the table).

As you may be able to tell, my SQL experience is... none! I have spent about a day now searching various forums and blogs for a solution, but I am afraid to say that any proposed solution just goes way over my head. In my mind I see a couple of routes to a solution:

1) "Look at the Price of the record where ID = Current ID - 1, if Price (then) = Price(now), delete the row / if Price(then) ~= Price(now), add the record to a new table (preferred to deleting)

But I think this may run into problems where there are instances of prices repeating over more than one record, for example:

ID, Date, Price, Volume,
1, 20090610 153000, 936, 21
2, 20090610 153000, 935.75, 1
3, 20090610 153010, 935.5, 1
4, 20090610 153017, 935.5, 1
5, 20090610 153000, 935.5, 1
6, 20090610 153000, 935.75, 5

Applying argument 1) to row 5, for example, would mean looking at row 4 which would already have been deleted - I guess we can get around that my copying the "new price" rows into a new table rather than deleting them from the existing table.

2) Calculate the change between successive rows, exporting that into a new table along with the other fields, and then deleting the rows where the "change" field is zero. I think this could be very resource intensive on 25m+ records!!

It is imperitve that the data remains (or can be returned to) the order in which it appears in the original file... it is Time and Sales data for a futures contract, the data is useless out of order.

I would be very grateful if somebody could help me write a query that can do what I describe - I am asking for help not via lack of effort, but lack of talent!

I can provide sample data if required.

Thanks in advance

Dash

MSquared
Yak Posting Veteran

52 Posts

Posted - 2010-08-18 : 10:38:06

I'm kinda confused. Your header had 4 columns, but your data has 5. I think the ID field is what's confusing you. Do you use it to sort? How do you know the order of the prices? In your last example, what are you expecting as the end-result in the table? Is the CSV file loaded into a staging table and then compared to a production table? Not sure what you mean by "the data remains (or can be returned to) the order in which it appears in the original file"

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-18 : 10:39:36

OK, Let's take this one piece by piece

quote:

Originally posted by Dash Riprock
...resulting table looks like this:


ID,     Date,        Price,     Volume,
1,             20090610 153000,    936,              21
2,             20090610 153000,    935.75,           1
3,             20090610 153010,    935.5,            1
4,             20090610 153017,    935.5,            1
5,             20090610 153000,    935.5,            1
6,             20090610 153000,    935.75,           5

What I would like to do is to remove the rows where the Price field hasn't changed from the previous record - so the new table would look like this:


ID,     Date,        Price,     Volume,
1,             20090610 153000,    936,              21
2,            20090610 153000,    935.75,           1
3,             20090610 153010,    935.5,            1
6,             20090610 153000,    935.75,           5

(rows 4 and 5 have been removed from the table).

What do you mean by

>> What I would like to do is to remove the rows where the Price field hasn't changed from the previous record

How do Rows 4 and 5 meet your requirement?

I don't see it

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-18 : 10:42:07

Maybe Row 5, but then Row 2 also fits that description

Also..you Must Understand

The Physical Order of Data in a Database has no meaning

Say it with me again

The Physical Order of Data in a Database has no meaning

If you want to say, relate rows of data based on a date, we can go from there...but then what happens when you have the same datetime???

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 11:41:42

Thanks everybody for taking an interest :-)

OK, I will try and clear up any confusions. The data is a list of trades for a particular futures contract... so the first row is a trade that was executed at 15:30:00 on 10/06/2009 - 21 contracts were traded at 936.00 (just for context).

The ID column is one I added after importing the data into access (I'm embarrassed but I couldn't figure out how to do that in SQL). The data in the .txt file has the rows in order of the time that they occurred (the Time and Date details are together), but as you can see some trades occur at the same time (because the time data doesn't go below seconds). I included the ID field to start at 1 for the top line and work its way down (second line = 2 etc), so that I can re-order the data back to the original state if it gets jumbled up.

The order comes direct from the exchange, they list the details of each trade that occurs in the order that they happen.

I'm not sure of the specifics about "staging tables" but what I think you mean is replicating the whole table, and then comparing the price at row 2 in table A to the price in row 1 table B? I think I have read about solutions that take that approach (inner-joining?) but as you can gather I'm a total SQL "n00b"!!

Moving onto rows 4 and 5, I'll try and explain it another way:

The series goes 936.00...935.75...935.5...935.5...935.5...935.75...

The prices I have put in italics are the same as the price that occurred before it in the real world. I understand that databases don't have an 'order', so thats why I put the ID field in - my (maybe naive) idea was that the query would look think "Ok, what is the ID of this record? 3.. ok, what was the price of record ID = 3-1? Is it the same as the price for record ID = 3? If it isn't the same (i.e. the prices are different) put a copy of this record in a new table, if it is then do nothing and move onto the next"

Maybe another way to explain it would be in a loop:

For i = 2 to i = max[id]

Select record where ID = i
A = price of record where ID=1

Select the record where ID = i - 1
B = price of record where ID = i-1

IF A == B DO NOTHING
ELSEIF A ~= B EXPORT ROW B TO NEW TABLE
END
i = i+1
END

Sorry for the child-like decription but thats the only way I can think to do it :-/

Eventually the data will be sent out to things like R and matlab to do time-series analysis on it, so thats what I mean by the order... in a new table I could just list/sort all the records in ascending ID.

Thanks again, I hope I have made myself clear

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-18 : 11:55:45

What are the datattypes???

Your datetime is varchar???


CREATE TABLE myTable99(ID int, [Date] datetime, Price money, Volume int)
GO

INSERT INTO myTable99(ID, [Date], Price, Volume)
SELECT 1, '2009/06/10 15:30:00', 936,    21  UNION ALL
SELECT 2, '2009/06/10 15:30:00', 935.75,  1  UNION ALL
SELECT 3, '2009/06/10 15:30:10', 935.5,   1  UNION ALL
SELECT 4, '2009/06/10 15:30:17', 935.5,   1  UNION ALL
SELECT 5, '2009/06/10 15:30:00', 935.5,   1  UNION ALL
SELECT 6, '2009/06/10 15:30:00', 935.75,  5
GO

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 12:10:12

Hi,

From what I can tell, the data types are as follows:

ID = int
Date = nvarchar(50)
Price = float
Volume = int

Do you think you understand what I am trying to do?

Thanks for the help mate.

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-18 : 12:37:55

Maybe...

If the "previous" row (based on ID) has the same value as the row before, you want to ignore it frm the result set.

Correct?

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-18 : 12:42:01

Is this it?


CREATE TABLE myTable99(ID int, [Date] datetime, Price money, Volume int)
GO

INSERT INTO myTable99(ID, [Date], Price, Volume)
SELECT 1, '2009/06/10 15:30:00', 936,    21  UNION ALL
SELECT 2, '2009/06/10 15:30:00', 935.75,  1  UNION ALL
SELECT 3, '2009/06/10 15:30:10', 935.5,   1  UNION ALL
SELECT 4, '2009/06/10 15:30:17', 935.5,   1  UNION ALL
SELECT 5, '2009/06/10 15:30:00', 935.5,   1  UNION ALL
SELECT 6, '2009/06/10 15:30:00', 935.75,  5
GO

    SELECT a.[ID], b.[ID], a.Price, b.Price 
      FROM myTable99 a
INNER JOIN myTable99 b
		ON a.[ID] = b.[ID] - 1
     WHERE a.Price <> b.Price
GO

DROP TABLE myTable99
GO

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

MSquared
Yak Posting Veteran

52 Posts

Posted - 2010-08-18 : 12:42:42

This will work, but I'm not sure it's correct. I am sorting by the order in which the data was entered into the table and not by the SellDate. If we use the SellDate, then it will work differently sonce the order is different. To change the order, just change the order by clause in the cte



declare @t TABLE (ID int, Sell_Date datetime, Price decimal(8,2), Volume int)

-- See how 3 comes after 4 when sorting by date and 5 occurs before 3.  
--That's what I'm talking about with the order by clause

INSERT INTO @t(ID, Sell_Date, Price, Volume)
SELECT 1, '2009/06/10 15:30:00', 936,    21  UNION ALL
SELECT 2, '2009/06/10 15:30:00', 935.75,  1  UNION ALL
SELECT 3, '2009/06/10 15:30:10', 935.5,   1  UNION ALL
SELECT 4, '2009/06/10 15:30:17', 935.5,   1  UNION ALL
SELECT 5, '2009/06/10 15:30:00', 935.5,   1  UNION ALL
SELECT 6, '2009/06/10 15:30:00', 935.75,  5

select * from @t

begin tran

;with cte as
(select ID, Sell_Date, Price, Volume,
		Row_Number() over (order by id, sell_date) RowID
from @t)

delete t
from @t t
where ID in 
	(select a.id from cte a 
		 left outer join cte b 
			on a.RowID = b.RowID - 1
		where a.Price = b.Price) 
 
select * from @t

rollback

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 13:59:56

quote:
Originally posted by X002548

Maybe...

If the "previous" row (based on ID) has the same value as the row before, you want to ignore it frm the result set.

Correct?

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

Hi guys,

Yes, I think we are starting to talk about the same thing... but - and sorry for being pedantic - the row before the previous row would be 2 rows back from the present record being examined?

I am going to take what everyone's said so far and try to make sense of it, try copying some of the SQL code that has been suggessted and try it on my table and stuff like that. I only need to do the query once on each table, but I would like to have an understanding of what I am doing, self improvement and all that :-)

Thanks very much for taking your time to help me out, I assure you it is appreciated!

I will check back tomorrow with details of how I got on.

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 14:24:48

Oh Goodness, I just noticed something I said in an earlier post isn't quite accurate...

Looking at the original "table" that I used as an example, I didn't change the time field on the last two records (5 and 6)... so they should look something like this:

ID, Date, Price, Volume,
1, 20090610 153000, 936, 21
2, 20090610 153000, 935.75, 1
3, 20090610 153010, 935.5, 1
4, 20090610 153017, 935.5, 1
5, 20090610 153017, 935.5, 1
6, 20090610 153019, 935.75, 5

Corrections are in Bold, sorry if this is confusing anybody - It's my bad I just forgot to provide the correct time data. In case you are thinking about ordering by Time, it is important if two trades are made at the same time (for instance 1 and 2 in the table here) that they are still preserved in the original order because of the volumes... (I won;t go into the trading bit, but its important for that).

Sorry!

Dash Riprock
Starting Member

6 Posts

Posted - 2010-08-18 : 15:33:08

I made a diagram to try and explain it better...

http://img291.imageshack.us/img291/8029/sqlj.png

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources