UTF8 Woes With Ruby 1.9
One of my projects at work is to consume the “halfhose” from Gnip, which is half of the full Twitter firehose. Lots of fast data. Lots of UTF8. While working on the code, I kept getting “invalid byte sequence in UTF-8” errors from Ruby 1.9. Since I am consuming Twitter, the data should already be UTF8. You can’t blindly call .toutf8 on the string, as that tries to re-encode text that is already properly encoded as UTF8.
So I went googling and found this post, which linked to this, which has this snippet of code:
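require 'iconv'
# //IGNORE tells iconv to drop bytes it cannot convert; the padding space is sliced off again by [0..-2]
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]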
This block of code did the trick. It basically removes invalid UTF8 byte sequences. I can’t say that I took the time to fully understand it; I just know that it works.
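For anyone on a newer Ruby: Iconv was dropped from the standard library in Ruby 2.0, so the snippet above needs the iconv gem there. A rough equivalent, assuming Ruby 2.1 or later, can be sketched with String#scrub. This is not the code I actually ran against the halfhose; the helper name strip_invalid_utf8 and the sample string are made up for illustration.

```ruby
# Sketch only: assumes Ruby 2.1+, where String#scrub is available.
# strip_invalid_utf8 is a hypothetical helper name, not from the original snippet.
def strip_invalid_utf8(untrusted_string)
  # Copy the string and tag its bytes as UTF-8 without transcoding them.
  candidate = untrusted_string.dup.force_encoding(Encoding::UTF_8)
  # Well-formed input is returned as-is.
  return candidate if candidate.valid_encoding?
  # Replace every invalid byte sequence with an empty string, i.e. drop it.
  candidate.scrub('')
end

# Made-up sample: "\xFF" can never appear in well-formed UTF-8.
dirty = "caf\xC3\xA9 \xFF ok".b
clean = strip_invalid_utf8(dirty)
puts clean.valid_encoding?
```

Like the Iconv version, this silently throws the bad bytes away rather than raising, which is usually what you want when the input is a fast stream you do not control.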







