bcp with "strange" characters? - SQL Server Forums

Home | Weblogs | Forums

Site Sponsored By: SQLDSC - SQL Server Desired State Configuration

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Import/Export (DTS) and Replication (2000)

bcp with "strange" characters?

Author

Topic

visoth
Starting Member

7 Posts

Posted - 2002-01-09 : 12:28:24

Hi,

I'm having quite a bit of trouble with bcp's lack of support for "escape sequences". My two problems are a) I need to import text fields that contain newlines (comming from Unix only a line feed) b) I can't think of a single character that could be the field terminator and nothing else, so my original file also escapes the ft when necessary. I could perhaps try to set the row terminator as CRLF, leave the LFs without special meaning, and set CR or NULL as my ft. Does this sound all right? Is there something better? This is a task I have to automate, and not with VBA or stuff like that, so little chance of using any Microsoft program to help me. I can use a C compiler though. I will be trying several variations of the above theme until I find a solution, but got a bit tired for now and decided to ask the community.
Cheers,

Tasos

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2002-01-09 : 12:51:31

I haven't tried this, but can you use a high ASCII value like char(254) as a field terminator?

--------------------------------------------------------------
1000 Posts, Here I come! I wonder what my new title will be...

visoth
Starting Member

7 Posts

Posted - 2002-01-09 : 13:56:52

and how do I specify a "high" character on the command line? I can't write char(254), can I? Mind you, I have a lot of multilingual text, have to exercise caution in selecting the code.

graz
Chief SQLTeam Crack Dealer

4149 Posts

Posted - 2002-01-09 : 15:56:40

Can you export to XML rather than a text file? Can you convert your line feeds to something else before you export them. Maybe "__LF__" or something elese that wouldn't exist in the source? Then you could convert them back to vbCRLFs when they're loaded.

===============================================
Creating tomorrow's legacy systems today.
One crisis at a time.

graz
Chief SQLTeam Crack Dealer

4149 Posts

Posted - 2002-01-09 : 16:34:58

FYI - I moved this post from the Article Discussion forum to the Import/Export forum. That seemed more appropriate.

===============================================
Creating tomorrow's legacy systems today.
One crisis at a time.

visoth
Starting Member

7 Posts

Posted - 2002-01-09 : 18:51:43

I could export mark CRs with a unique string, but I feel that there is a serious performance hit - the idea of using bcp is to reduce a previous process that used to take hours to minutes. what would be the best way to achieve this conversion? T-SQL does have a search-and-replace feature, but it seems to me that the loading process is when I stand my best chance to preserve performance. From what I 've seen so far, bcp and DTS don't offer enough of pattern matching to do the _LF_ conversion, do they?

I could possibly export (or convert) to XML as well, but then how do I load?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-01-09 : 19:43:18

I assume that accent marks (`) and tilde (~) are out; I use them a lot because I never have data with them in it. I also like to use the pipe (|), backslash (\), caret (^) and asterisk (*). It's unusual for text data to have all of these characters in them...are you sure none of these can be used? What about that old favorite, the tab character (\t)?

If you need to put in a high-byte character, like 254, here's an olllllllld DOS trick, updated for Windows: hold down the Alt key, and type in the byte code for the character using the numeric keypad. You should enter the code with a leading 0 so that it's 4 digits long. For example, the ¢ symbol is Alt-0162.

You can set up a bcp format file or files for your imports, using the proper delimiters, so that you don't have to re-enter the high-bytes codes each time you import.

Edited by - robvolk on 01/09/2002 19:44:06

visoth
Starting Member

7 Posts

Posted - 2002-01-09 : 20:08:47

Thanks for the answers so far. My files are, as I said multilingual, and amount to around 50MB of text I think, which tends to contain all the characters you can readily enter on the keyboard, and then some - it doesn't contain any euros right now, but it probably will quite soon. I like the Alt- idea, just like the old days, what about the in-field carriage returns though? Is it healthy enough to have another strange character in their place and ask DTS to do a 1-1 translation? Like I said previously, I think it would be too slow to try to do a search-n-replace with T-SQL.

visoth
Starting Member

7 Posts

Posted - 2002-01-09 : 20:15:24

having said all that, I 'll probably be fine, but I was that close to having to bulk import blobs. I 've tried quite hard to find if bcp could important blobish data types, and all I found was a quite lousy example in Microsoft's knowledge base, which would only work with one blob (a Word document in the example) at a time, and you even had to state the size of the blob. Any ideas/experience with this more general bcp case?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-01-09 : 20:39:00

If the data is truly multilingual, I can't see how the characters would be properly interpreted using one code page or collation without resorting to national (double byte) data...meaning that at least SOME characters can't really be used for data purposes.

Take a look in the Character Map application in Windows; there have to be a few characters out of the 255 that can't appear in the data stream. Alt-0172 (¬) I've seen used in some apps as a delimiter. Here's an overline Alt-0175 (¯) character too. These might represent different characters if you are using another code page; I'm on CP 437.

Let me ask this...are you ABSOLUTELY certain that the horizontal tab (code 9, \t) character appears in your data? This is a nearly universal column delimiter on a wide range of systems for a number of years, both DOS/Windows and Unix. As far as row terminators, if you can use another control character (1-25, vertical tab (11) is probably a good one) instead of some kind of LF combination, it should save you some trouble.

visoth
Starting Member

7 Posts

Posted - 2002-01-10 : 04:45:02

Yes, they do contain tabs, look at them as articles from print media from around the world, they may contain unprintable(Western (TM)) Ascii codes, but look at them on a Korean machine or through the Thai encoding and voila. But I am NOT claiming they contain ALL Ascii codes, although I was asking around to gauge SQL Server's (and bcp's) tolerance for such unprintable codes. From what I gather, NULL ( char(0) ) is not tolerated very well.

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-01-10 : 08:53:17

You're right, NUL (char 0) is not a good idea, nor is EOF (char 26).

Found a cool ASCII chart here:

http://www.mindspring.com/~jc1/serial/Resources/ASCII.html

And lo and behold, char 30 and 31 are listed as record separator and unit separator! Since these are lo-byte control characters, they won't interfere with character data.

visoth
Starting Member

7 Posts

Posted - 2002-01-10 : 17:01:34

Newsflash2: bcp works as well. One funny thing about this process is that, under Windows NT console, I cannot put all of my bcps in one .bat file, because the special terminator character gets corrupted somehow during execution. I understand that the console underwent numerous improvements in 2K and XP, maybe it works there. Instead, I am firing up the batch file using Cygwin's shell instead, and it works fine! The synergy of Windows and Unix. For the record, I was missing the /c option previously, here is how I am invoking it right now:

bcp NewMedia.dbo.ext_adds in ext_adds -c -t ¬ /Usa /P

As you can see, the tables and the files have exactly the same names. For the record, 100MB of CSV imports in 7 minutes on a good Pentium III machine.

Newsflash: it works! I used Ascii 172 as field separator, converted the end-of-row line feeds to CRLF, and left the "embedded" line feeds as they were, they don't seem to confuse DTS. I 'm having some difficulties still with bcp, but probably I have to battle with the options a bit more. So far, the only discrepancy I detected was that my input file has a date field 0000-00-00 (obviously an attempt at a null value), which in SQL Server becomes "This action has not been performed". I'm not sure if this is going to cause problems, or if it at all possible to set the date in SQLS as 0000-00-00, any ideas on that?

so, I 'm betting my working week that bcp or DTS can load reliably my files with the strange separators and multitude of data types - will know in about 24 hours. Meanwhile, CSVs and text exports in general with escape sequences in them are very common in the Unix, perl, MySQL world. To give a bit back to the community I 've put online my little C program that transforms "escaped" input to output that conforms to Microsoft's Bulk Copy expectations:
http://www.users.totalise.co.uk/~tasos/mmfilter.c

It is reasonably general, safe and easy to change, and works either with pipes and/or explicit filenames for input and output.
Cheers

Edited by - visoth on 01/10/2002 19:13:03

Edited by - visoth on 01/10/2002 22:53:19

chadmat
The Chadinator

1974 Posts

Posted - 2002-01-15 : 13:26:49

The Date 00-00-0000 is invalid in SQL Server. Datetime can only store dates as far back as: January 1, 1753.

HTH
-Chad

aiken
Aged Yak Warrior

525 Posts

Posted - 2002-01-17 : 16:54:56

Don't forget about the option of pre-processing. You might consider using something like BK Replace (http://download.cnet.com/downloads/0-10096-100-4931801.html?tag=st.dl.10001-103-1.lst-1-1.4931801) to do a search/replace from 00-00-0000 to NULL before doing the BCP insert.

And, while it's too late to be useful, my approach to finding the termination character would have been to write a little app to scan the files for characters which don't occur.

Cheers
-b

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources