Tekker Blog

My little home on the internet.

UTF8 Woes With Ruby 1.9

One of my projects at work is to consume the “halfhose” from Gnip, which is half of the full Twitter firehose. Lots of fast data. Lots of UTF8. While working on the code, I kept getting “ruby 1.9: invalid byte sequence in UTF-8”. Now, since I am consuming Twitter, the data should already be UTF8. You can’t blindly call .toutf8 on the string, as that tries to re-encode text that is already encoded as UTF8.

So I went googling and found a post that linked to another one, which has this snippet of code:

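require 'iconv'  # part of the standard library in Ruby 1.9
# //IGNORE tells iconv to drop bytes that are not valid UTF-8
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]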

This block of code did the trick. It basically removes invalid UTF8 byte sequences from the string. I can’t say that I took the time to fully understand it; I just know that it works.
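Iconv was removed from Ruby's standard library in 2.0, so the snippet above needs the separate iconv gem on newer Rubies. On Ruby 2.1 and later, a similar cleanup can probably be done with core String methods alone. The following is only a rough sketch under that assumption: the helper name strip_invalid_utf8 and the sample string are made up for illustration, and it is not the code from the linked posts.

```ruby
# Hypothetical helper (not from the linked posts): drops byte sequences
# that are not valid UTF-8, using only core String methods (Ruby 2.1+).
def strip_invalid_utf8(untrusted_string)
  # dup so the caller's string is left alone, then tag the bytes as UTF-8
  utf8 = untrusted_string.dup.force_encoding(Encoding::UTF_8)
  # Nothing to do if the bytes already form valid UTF-8
  return utf8 if utf8.valid_encoding?
  # scrub('') replaces each invalid byte sequence with an empty string
  utf8.scrub('')
end

# Made-up sample: a valid "é", a stray 0xFF byte, and a truncated
# multi-byte sequence (0xE2) at the very end of the string
dirty = "caf\xC3\xA9 \xFFok\xE2"
clean = strip_invalid_utf8(dirty)
puts clean.valid_encoding?
```

Unlike the Iconv version, this does not need the trick of appending a space and chopping it off again, since scrub also handles a broken sequence at the end of the string.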

Written by admin

August 3rd, 2011 at 11:15 am
