Splitting files with different encodings (UTF-8, Unicode)

airbase3 · August 25, 2012, 10:24pm

I tried to split a large file (12 GB) into multiple smaller files, about 500MB each.
The text file had a BOM (byte order mark) of 0xFF 0xFE, so it was Unicode (UTF-16 little-endian).
I wasn’t aware initially that the file was Unicode, and tried to split it by lines, every 611000 lines, using 0x0D 0x0A (newline) as the delimiter.
It was confusing that it didn’t find any occurrence of that pattern in the file. (I should have specified 0x0D 0x00 0x0A 0x00.)
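
To illustrate why the 0x0D 0x0A search comes up empty, here is a small Python sketch. The sample text is made up for the demonstration; it only shows how the same line break is laid out in ANSI versus UTF-16 little-endian bytes.

```python
# Sample text invented for this demonstration.
text = "first line\r\nsecond line\r\n"

utf16 = text.encode("utf-16-le")   # 2 bytes per character, no BOM added
ansi = text.encode("cp1252")       # 1 byte per character

crlf_ansi = b"\r\n"                        # 0x0D 0x0A
crlf_utf16 = "\r\n".encode("utf-16-le")    # 0x0D 0x00 0x0A 0x00

print(crlf_utf16.hex(" "))        # 0d 00 0a 00
print(ansi.count(crlf_ansi))      # 2
print(utf16.count(crlf_ansi))     # 0  (a 0x00 sits between 0x0D and 0x0A)
print(utf16.count(crlf_utf16))    # 2
```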

Another issue is that the file parts do not get a BOM, so Notepad will not display them properly unless the source file is ANSI. For Unicode, the characters appear separated by extra spaces in Notepad. If such a part is saved from Notepad, the file will be corrupted (each 0x00 becomes a space character).

I think the application should be aware of encodings (by reading the BOM, and checking for Unicode characters if the BOM is missing), and should write a BOM at the beginning of each generated file part.
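
A rough Python sketch of that kind of detection is below. It is not how GSplit works; `detect_encoding`, the zero-byte thresholds, and the `cp1252` fallback are all assumptions chosen for illustration. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
# starts with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(head, fallback="cp1252"):
    """Guess (encoding, bom_bytes) from the first bytes of a file.

    Hypothetical helper: the thresholds below are arbitrary and only
    work for text that is mostly ASCII characters.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, bom

    # No BOM: mostly-ASCII UTF-16 text has a 0x00 in every other byte.
    if len(head) >= 4:
        half = len(head) // 2
        even_zeros = head[0::2].count(0)
        odd_zeros = head[1::2].count(0)
        if odd_zeros > 0.4 * half and even_zeros < 0.1 * half:
            return "utf-16-le", b""
        if even_zeros > 0.4 * half and odd_zeros < 0.1 * half:
            return "utf-16-be", b""

    return fallback, b""


print(detect_encoding(b"\xff\xfeh\x00i\x00"))            # ('utf-16-le', b'\xff\xfe')
print(detect_encoding("no bom".encode("utf-16-le")))     # ('utf-16-le', b'')
print(detect_encoding(b"plain text"))                    # ('cp1252', b'')
```

The returned BOM bytes are what a splitter would need to repeat at the start of every part.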

I ended up splitting the file with a custom program I wrote, which was already aware of encodings, and this had an added benefit. By specifying that each file part should be ANSI (because there weren’t any extended Unicode characters in the source file), I got files that occupied only 6GB in total (half the size).
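
The custom program itself isn't shown here, but the idea can be sketched in a few lines of Python. Everything in this sketch is an assumption for illustration: the function name `split_by_lines`, the `partNNNN.txt` naming, and the default encodings. Python's `"utf-16"` codec consumes the BOM when reading and writes a fresh BOM at the start of every file it creates; passing `out_encoding="cp1252"` converts each part to ANSI instead, and fails with an error if a character cannot be represented.

```python
import itertools
from pathlib import Path


def split_by_lines(src, out_dir, lines_per_part,
                   src_encoding="utf-16", out_encoding="utf-16"):
    """Split a text file every `lines_per_part` lines, decoding and
    re-encoding so each part is valid on its own. Hypothetical helper."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as reader:
        for index in itertools.count(1):
            lines = itertools.islice(reader, lines_per_part)
            first = next(lines, None)
            if first is None:        # source exhausted
                break
            part = out_dir / f"part{index:04d}.txt"
            with open(part, "w", encoding=out_encoding, newline="") as writer:
                writer.write(first)
                writer.writelines(lines)   # streams, never holds a whole part
            parts.append(part)
    return parts
```

For the file described above, a call such as `split_by_lines("big.txt", "parts", 611000, out_encoding="cp1252")` (with `big.txt` standing in for the real file name) would produce ANSI parts at roughly half the original size.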

gdgsupport · August 28, 2012, 4:53pm

Thank you for your feedback. Unicode support will be implemented in the next version of GSplit. Until then, you can insert a BOM in each piece with the custom header functionality.