
 Bulk Load UTF-8?


Keith.Macdonald
Starting Member

5 Posts

Posted - 2006-10-03 : 06:39:57
I know that BULK LOAD does not like Unix UTF-8 files. This is a great shame because our old email system (Imail) was running on an MS server and so it gave us log files that bulk load would accept. Now, we're moving to Communigate on a Unix platform. So the logs are in UTF-8 files.

The actual problem is that our old log processing relies on stored procedures that start with a bulk load of the logfile before running queries to separate out things like POP3 connections, SMTP messages sent & received, etc.

Has anyone else had a similar problem and what did they use instead of bulk load? All suggestions gratefully received.

Kristen
Test

22859 Posts

Posted - 2006-10-03 : 06:53:02
I don't know what a Unix UTF-8 file looks like, but isn't it just a question of telling Bulk Load what the Delimiter and Row Separators are?

Or, failing that, running the file through something that converts it to whatever Bulk Load wants?

Kristen

Keith.Macdonald
Starting Member

5 Posts

Posted - 2006-10-03 : 07:09:59
Re << isn't it just a question of telling Bulk Load what the Delimiter and Row Separators >>
I've tried that.
BULK INSERT CMailBulkLoadRaw_tbl
FROM "T:\CMailLog\Test\2006-09-16.log"
WITH ( ROWTERMINATOR = '\n' )

I've tried ROWTERMINATOR = '\r' as well.

I've found that if I open the file in Textpad, I can Save As and then choose PC instead of Unix as the format. Then the BULK INSERT works fine. Of course, that's OK for hacking in development, but no good for a production system running 365x24 where the stored procedure is being run at 03:00 from an SQL Agent Job.

I did a Google search for "convert UTF-8" and found TextPipe, but functionally it does nothing more for me than Textpad does for free.

My options (so far) seem to be:
1) use something else in the stored procedures instead of bulk load (probably DTS?)
2) find a freeware utility that can convert the file format before the SQL job runs
3) write a convertor utility
4) umm?
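For what it's worth, option 3 needn't be much code. Here's a minimal converter sketch in Python — the file names are placeholders, and it assumes the eventual BULK INSERT would use DATAFILETYPE = 'widechar' to read UTF-16 input:

```python
# Sketch of a converter utility (option 3): read a Unix UTF-8 log and
# write a UTF-16 copy with a BOM and CRLF line endings, which BULK INSERT
# should accept with DATAFILETYPE = 'widechar'. File names are placeholders.

def convert_log(src_path: str, dst_path: str) -> None:
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    # Normalise any line ending to CRLF ("PC format").
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # encoding="utf-16" writes a byte-order mark automatically.
    with open(dst_path, "w", encoding="utf-16", newline="") as dst:
        dst.write(text)
```

Something like this could be called from the same batch file that pulls the daily log (e.g. `convert_log("2006-09-16.log", "2006-09-16.txt")`) before the SQL Agent job runs.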

Kristen
Test

22859 Posts

Posted - 2006-10-03 : 08:08:39
Might be worth having a look at the file before/after saving with Textpad and seeing what it has actually changed. I would expect that it has CHAR(10) for linebreaks before, and CHAR(13)+CHAR(10) after, but maybe there are other changes too.
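A few lines of script make that before/after comparison easy. A sketch in Python (the file name is whatever log you're inspecting; a UTF-8 byte-order mark would show up as EF BB BF at the start):

```python
# Count the two kinds of line ending and dump the first few bytes,
# to see exactly what Textpad changed when saving as "PC" format.

def describe(path: str) -> str:
    data = open(path, "rb").read()
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf
    head = " ".join(f"{b:02X}" for b in data[:8])
    return f"CRLF={crlf} bare-LF={lf_only} first-bytes={head}"
```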

Perhaps a FORMATFILE would help you more precisely define the file than just using ROWTERMINATOR?
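For example, a minimal SQL Server 2000 format file — assuming, purely for illustration, that the raw table is a single varchar(8000) column named RawLine — might look something like this:

```
8.0
1
1  SQLCHAR  0  8000  "\r\n"  1  RawLine  SQL_Latin1_General_CP1_CI_AS
```

The real file would need one row per column of CMailBulkLoadRaw_tbl, with the right terminators and collation; it gets referenced via the FORMATFILE option of BULK INSERT.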

Kristen

QuantumBunny
Starting Member

2 Posts

Posted - 2006-10-06 : 18:42:12
quote:
Originally posted by Keith.Macdonald

I know that BULK LOAD does not like Unix UTF-8 files. This is a great shame because our old email system (Imail) was running on an MS server and so it gave us log files that bulk load would accept. Now, we're moving to Communigate on a Unix platform. So the logs are in UTF-8 files.

The actual problem is that our old log processing relies on stored procedures that start with a bulk load of the logfile before running queries to separate out things like POP3 connections, SMTP messages sent & received, etc.

Has anyone else had a similar problem and what did they use instead of bulk load? All suggestions gratefully received.



Okay, as I am sure you are aware, the only benefit of UTF-8 over ASCII is the ability to handle "special characters" like accented letters in French, Chinese characters, etc. The only benefit of UTF-8 over full Unicode (UTF-16) is character size (and thus, over large files, file size). With DATAFILETYPE = 'widechar' in your BULK INSERT you should be able to upload Unicode files perfectly. UTF-8 still didn't work perfectly for me, though, so the best way to get the file in without losing anything was to convert the file to full Unicode and then BULK INSERT it.

PS: you shouldn't open flat text files you intend to BULK INSERT with Notepad, as it can change EOL characters. I recommend getting a file editor like TextPad or UltraEdit, which will not change the basic aspects of the file.

Keith.Macdonald
Starting Member

5 Posts

Posted - 2006-10-11 : 04:30:50
QuantumBunny
Re your comment << I recommend getting a file editor like TextPad or UltraEdit, which will not change the basic aspects of the file.>>
Thanks for the suggestion, but I recommend you read what I've already posted i.e. << I've found that if I open the file in Textpad, I can Save As and then choose PC instead of Unix as the format. Then the BULK INSERT works fine.>>

Keith.Macdonald
Starting Member

5 Posts

Posted - 2006-10-11 : 07:35:05
Kristen

I've found a solution. In the batch file that pulls the daily log from the email server to the SQL Server ready for uploading, I've added this one-line command:
TYPE yyyymmdd.log | FIND "" /V > yyyymmdd.txt
which streams the entire log file to a new version in a PC format.

Keith

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-10-11 : 08:02:58
quote:
Originally posted by Keith.Macdonald
TYPE yyyymmdd.log | FIND "" /V > yyyymmdd.txt
which streams the entire log file to a new version in a PC format.



While this converts the line endings from Unix to DOS/Windows (i.e. it prefixes each LF character with a CR), it doesn't change the character encoding from UTF-8 to something else, does it?
You still have to either:
(a) convert it from UTF-8 to UTF-16 and import it with the DATAFILETYPE = 'widechar' setting;
(b) convert it to Windows-1252 (down-converting any characters that are not representable) and import it with CODEPAGE = 'ACP'; or
(c) leave it as UTF-8 and lie about the encoding when you import it. If you do this, any non-ASCII characters in the original text will be misrepresented as a (nonsensical-looking) sequence of characters in the database.
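To make (c) concrete, here is a small Python sketch of what that misrepresentation looks like: the UTF-8 bytes of one accented character, read as if they were Windows-1252, become a short run of nonsense characters, while plain ASCII survives untouched.

```python
# Option (c) in miniature: UTF-8 bytes decoded as Windows-1252.

original = "café"                        # é is the only non-ASCII character
utf8_bytes = original.encode("utf-8")    # b'caf\xc3\xa9' - é becomes 2 bytes
as_1252 = utf8_bytes.decode("cp1252")    # what would land in the table
print(as_1252)                           # prints cafÃ©
```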

Kristen
Test

22859 Posts

Posted - 2006-10-11 : 08:21:26
"will be misrepresented as a (nonsensical-looking) sequence of characters"

... which you might be able to convert to something more visually correct with

REPLACE(MyColumn, CHAR(111), CHAR(222))

(Obviously not the real numbers for CHAR() function!)

Kristen

Keith.Macdonald
Starting Member

5 Posts

Posted - 2006-10-11 : 08:33:48
Re << it doesn't change the character encoding from UTF-8 to something else, does it? >>
I might be wrong, but as a completely new file is being created on the SQL Server, I believe the answer might be Yes. Without making any changes to the BULK INSERT command,
BULK INSERT CMailBulkLoadRaw_tbl FROM "T:\CMailLog\Test\Unix\2006-09-29.txt" WITH ( ROWTERMINATOR = '\n')
it can now read the file and bulk load data that is not only readable but also produces meaningful results when queried.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-10-11 : 09:51:57
quote:
Originally posted by Kristen

"will be misrepresented as a (nonsensical-looking) sequence of characters"

... which you might be able to convert to something more visually correct with

REPLACE(MyColumn, CHAR(111), CHAR(222))

(Obviously not the real numbers for CHAR() function!)

Kristen



It's a little more involved than that: a single unicode character can be represented by up to 4 bytes in UTF-8. But if you do get into that situation, this might be useful:
http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=62406
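A quick Python illustration of that variable width — one Unicode character becomes anywhere from 1 to 4 UTF-8 bytes, which is why a single-byte CHAR() substitution can't undo the damage:

```python
# UTF-8 byte lengths for a 1-, 2-, 3- and 4-byte character.
for ch in ("A", "é", "€", "\U0001F600"):
    print(repr(ch), len(ch.encode("utf-8")))
# A -> 1 byte, é -> 2 bytes, € -> 3 bytes, the plane-1 character -> 4 bytes
```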

ciaran
Starting Member

40 Posts

Posted - 2006-10-11 : 10:50:28
Keith,
Yeah, I had this problem. It was a pain. Basically you'll need to convert the file to Unicode, or any non-ASCII characters will appear translated as something else. I wrote a .NET console application that does the conversion for me and called it from my DTS package. Bit of a hack, I know, but this was the only way I could think of getting around it.

Kristen
Test

22859 Posts

Posted - 2006-10-11 : 14:07:43
If your only tool is a hammer, all your problems are nails!