Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 08:25:19
|
| Hi all,I have imported a .csv file into Server 2008 (via access to add in the 'ID' field because I couldn't figure out how to do that using the import wizard) and the resulting table looks like this:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 14, 20090610 153017, 935.5, 15, 20090610 153000, 935.5, 16, 20090610 153000, 935.75, 5... and so on for many millions (in some cases 25+) of rows. What I would like to do is to remove the rows where the Price field hasn't changed from the previous record - so the new table would look like this:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 16, 20090610 153000, 935.75, 5(rows 4 and 5 have been removed from the table).As you may be able to tell, my SQL experience is... none! I have spent about a day now searching various forums and blogs for a solution, but I am afraid to say that any proposed solution just goes way over my head. In my mind I see a couple of routes to a solution:1) "Look at the Price of the record where ID = Current ID - 1, if Price (then) = Price(now), delete the row / if Price(then) ~= Price(now), add the record to a new table (preferred to deleting)But I think this may run into problems where there are instances of prices repeating over more than one record, for example:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 14, 20090610 153017, 935.5, 15, 20090610 153000, 935.5, 16, 20090610 153000, 935.75, 5Applying argument 1) to row 5, for example, would mean looking at row 4 which would already have been deleted - I guess we can get around that my copying the "new price" rows into a new table rather than deleting them from the existing table.2) Calculate the change between successive rows, exporting that into a new table along with the other fields, and then deleting the rows where the "change" field is zero. I think this could be very resource intensive on 25m+ records!!It is imperitve that the data remains (or can be returned to) the order in which it appears in the original file... it is Time and Sales data for a futures contract, the data is useless out of order.I would be very grateful if somebody could help me write a query that can do what I describe - I am asking for help not via lack of effort, but lack of talent!I can provide sample data if required.Thanks in advanceDash |
|
|
MSquared
Yak Posting Veteran
52 Posts |
Posted - 2010-08-18 : 10:38:06
|
| I'm kinda confused. Your header had 4 columns, but your data has 5. I think the ID field is what's confusing you. Do you use it to sort? How do you know the order of the prices? In your last example, what are you expecting as the end-result in the table? Is the CSV file loaded into a staging table and then compared to a production table? Not sure what you mean by "the data remains (or can be returned to) the order in which it appears in the original file" |
 |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2010-08-18 : 10:39:36
|
OK, Let's take this one piece by piecequote: Originally posted by Dash Riprock...resulting table looks like this:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 14, 20090610 153017, 935.5, 15, 20090610 153000, 935.5, 16, 20090610 153000, 935.75, 5 What I would like to do is to remove the rows where the Price field hasn't changed from the previous record - so the new table would look like this:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 16, 20090610 153000, 935.75, 5 (rows 4 and 5 have been removed from the table).
What do you mean by>> What I would like to do is to remove the rows where the Price field hasn't changed from the previous recordHow do Rows 4 and 5 meet your requirement?I don't see itBrett8-)Hint: Want your questions answered fast? Follow the direction in this linkhttp://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspxAdd yourself!http://www.frappr.com/sqlteam |
 |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2010-08-18 : 10:42:07
|
| Maybe Row 5, but then Row 2 also fits that descriptionAlso..you Must UnderstandThe Physical Order of Data in a Database has no meaningSay it with me againThe Physical Order of Data in a Database has no meaningIf you want to say, relate rows of data based on a date, we can go from there...but then what happens when you have the same datetime???Brett8-)Hint: Want your questions answered fast? Follow the direction in this linkhttp://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspxAdd yourself!http://www.frappr.com/sqlteam |
 |
|
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 11:41:42
|
| Thanks everybody for taking an interest :-)OK, I will try and clear up any confusions. The data is a list of trades for a particular futures contract... so the first row is a trade that was executed at 15:30:00 on 10/06/2009 - 21 contracts were traded at 936.00 (just for context).The ID column is one I added after importing the data into access (I'm embarrassed but I couldn't figure out how to do that in SQL). The data in the .txt file has the rows in order of the time that they occurred (the Time and Date details are together), but as you can see some trades occur at the same time (because the time data doesn't go below seconds). I included the ID field to start at 1 for the top line and work its way down (second line = 2 etc), so that I can re-order the data back to the original state if it gets jumbled up.The order comes direct from the exchange, they list the details of each trade that occurs in the order that they happen.I'm not sure of the specifics about "staging tables" but what I think you mean is replicating the whole table, and then comparing the price at row 2 in table A to the price in row 1 table B? I think I have read about solutions that take that approach (inner-joining?) but as you can gather I'm a total SQL "n00b"!!Moving onto rows 4 and 5, I'll try and explain it another way:The series goes 936.00...935.75...935.5...935.5...935.5...935.75...The prices I have put in italics are the same as the price that occurred before it in the real world. I understand that databases don't have an 'order', so thats why I put the ID field in - my (maybe naive) idea was that the query would look think "Ok, what is the ID of this record? 3.. ok, what was the price of record ID = 3-1? Is it the same as the price for record ID = 3? If it isn't the same (i.e. the prices are different) put a copy of this record in a new table, if it is then do nothing and move onto the next"Maybe another way to explain it would be in a loop:For i = 2 to i = max[id]Select record where ID = iA = price of record where ID=1Select the record where ID = i - 1B = price of record where ID = i-1IF A == B DO NOTHINGELSEIF A ~= B EXPORT ROW B TO NEW TABLEENDi = i+1ENDSorry for the child-like decription but thats the only way I can think to do it :-/Eventually the data will be sent out to things like R and matlab to do time-series analysis on it, so thats what I mean by the order... in a new table I could just list/sort all the records in ascending ID.Thanks again, I hope I have made myself clear |
 |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2010-08-18 : 11:55:45
|
What are the datattypes???Your datetime is varchar???CREATE TABLE myTable99(ID int, [Date] datetime, Price money, Volume int)GOINSERT INTO myTable99(ID, [Date], Price, Volume)SELECT 1, '2009/06/10 15:30:00', 936, 21 UNION ALLSELECT 2, '2009/06/10 15:30:00', 935.75, 1 UNION ALLSELECT 3, '2009/06/10 15:30:10', 935.5, 1 UNION ALLSELECT 4, '2009/06/10 15:30:17', 935.5, 1 UNION ALLSELECT 5, '2009/06/10 15:30:00', 935.5, 1 UNION ALLSELECT 6, '2009/06/10 15:30:00', 935.75, 5GO Brett8-)Hint: Want your questions answered fast? Follow the direction in this linkhttp://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspxAdd yourself!http://www.frappr.com/sqlteam |
 |
|
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 12:10:12
|
| Hi,From what I can tell, the data types are as follows:ID = intDate = nvarchar(50)Price = floatVolume = intDo you think you understand what I am trying to do?Thanks for the help mate. |
 |
|
|
X002548
Not Just a Number
15586 Posts |
|
|
X002548
Not Just a Number
15586 Posts |
Posted - 2010-08-18 : 12:42:01
|
Is this it?CREATE TABLE myTable99(ID int, [Date] datetime, Price money, Volume int)GOINSERT INTO myTable99(ID, [Date], Price, Volume)SELECT 1, '2009/06/10 15:30:00', 936, 21 UNION ALLSELECT 2, '2009/06/10 15:30:00', 935.75, 1 UNION ALLSELECT 3, '2009/06/10 15:30:10', 935.5, 1 UNION ALLSELECT 4, '2009/06/10 15:30:17', 935.5, 1 UNION ALLSELECT 5, '2009/06/10 15:30:00', 935.5, 1 UNION ALLSELECT 6, '2009/06/10 15:30:00', 935.75, 5GO SELECT a.[ID], b.[ID], a.Price, b.Price FROM myTable99 aINNER JOIN myTable99 b ON a.[ID] = b.[ID] - 1 WHERE a.Price <> b.PriceGODROP TABLE myTable99GO Brett8-)Hint: Want your questions answered fast? Follow the direction in this linkhttp://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspxAdd yourself!http://www.frappr.com/sqlteam |
 |
|
|
MSquared
Yak Posting Veteran
52 Posts |
Posted - 2010-08-18 : 12:42:42
|
This will work, but I'm not sure it's correct. I am sorting by the order in which the data was entered into the table and not by the SellDate. If we use the SellDate, then it will work differently sonce the order is different. To change the order, just change the order by clause in the ctedeclare @t TABLE (ID int, Sell_Date datetime, Price decimal(8,2), Volume int)-- See how 3 comes after 4 when sorting by date and 5 occurs before 3. --That's what I'm talking about with the order by clauseINSERT INTO @t(ID, Sell_Date, Price, Volume)SELECT 1, '2009/06/10 15:30:00', 936, 21 UNION ALLSELECT 2, '2009/06/10 15:30:00', 935.75, 1 UNION ALLSELECT 3, '2009/06/10 15:30:10', 935.5, 1 UNION ALLSELECT 4, '2009/06/10 15:30:17', 935.5, 1 UNION ALLSELECT 5, '2009/06/10 15:30:00', 935.5, 1 UNION ALLSELECT 6, '2009/06/10 15:30:00', 935.75, 5select * from @tbegin tran;with cte as(select ID, Sell_Date, Price, Volume, Row_Number() over (order by id, sell_date) RowIDfrom @t)delete tfrom @t twhere ID in (select a.id from cte a left outer join cte b on a.RowID = b.RowID - 1 where a.Price = b.Price) select * from @trollback |
 |
|
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 13:59:56
|
quote: Originally posted by X002548 Maybe...If the "previous" row (based on ID) has the same value as the row before, you want to ignore it frm the result set.Correct?Brett8-)Hint: Want your questions answered fast? Follow the direction in this linkhttp://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspxAdd yourself!http://www.frappr.com/sqlteam
Hi guys,Yes, I think we are starting to talk about the same thing... but - and sorry for being pedantic - the row before the previous row would be 2 rows back from the present record being examined?I am going to take what everyone's said so far and try to make sense of it, try copying some of the SQL code that has been suggessted and try it on my table and stuff like that. I only need to do the query once on each table, but I would like to have an understanding of what I am doing, self improvement and all that :-)Thanks very much for taking your time to help me out, I assure you it is appreciated!I will check back tomorrow with details of how I got on. |
 |
|
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 14:24:48
|
| Oh Goodness, I just noticed something I said in an earlier post isn't quite accurate...Looking at the original "table" that I used as an example, I didn't change the time field on the last two records (5 and 6)... so they should look something like this:ID, Date, Price, Volume,1, 20090610 153000, 936, 212, 20090610 153000, 935.75, 13, 20090610 153010, 935.5, 14, 20090610 153017, 935.5, 15, 20090610 153017, 935.5, 16, 20090610 153019, 935.75, 5Corrections are in Bold, sorry if this is confusing anybody - It's my bad I just forgot to provide the correct time data. In case you are thinking about ordering by Time, it is important if two trades are made at the same time (for instance 1 and 2 in the table here) that they are still preserved in the original order because of the volumes... (I won;t go into the trading bit, but its important for that).Sorry! |
 |
|
|
Dash Riprock
Starting Member
6 Posts |
Posted - 2010-08-18 : 15:33:08
|
| I made a diagram to try and explain it better...http://img291.imageshack.us/img291/8029/sqlj.png |
 |
|
|
|
|
|
|
|