ExpressionEngine CMS
Open, Free, Amazing

Thread

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

The active forums are here.

Having Converted from iso-8859-1 to UTF-8, how to convert characters IN the data?

November 29, 2008 8:30pm

Subscribe [2]
  • #1 / Nov 29, 2008 8:30pm

    Ryan M.

    1511 posts

    I went through a whole thing today of taking a client’s older EE database and converting it from the basic latin-1 to UTF-8, which has helped solve a bunch of character headaches with é, ó, á, etc with certain extensions. The problem I’m having is that many of the characters are still not converted, and I haven’t been able to find a reliable source of info for how to do this (other than going into the entry with the offending characters, fixing them and re-saving the entry). Here were the steps I took:

    1) Exported my dev database dev_ee02 to SQL file.
    2) Changed all instances of default character set on the tables to utf8 from whatever it was.
    3) Created a new dev database. Set the collation and default character set to utf-8 (general) and imported all those tables I had exported earlier.
    4) Pointed EE to the new database.
    5) Changed the default charset in EE from the iso version to utf-8.

    For the most part, everything seems to be working pretty well. I can type an é and it doesn’t get screwed up at all when rendering on the template. However, I think some of the old characters are still encoded as iso, so they’re throwing those funky black-background question marks on the templates. When I go into the database via phpMyAdmin, I can see that:

    ó is being rendered as � (�)
    é is being rendered as both é and �

    I’m sure there are other offenses. Is there any way to sweep through the exp_weblog_data table (if nothing else) and mass convert those oddities? Or, if it is ultimately easier, I could do the export/import again if there is something I overlooked in the 5 steps above. Thanks.

  • #2 / Nov 29, 2008 9:17pm

    Ryan M.

    1511 posts

    I am utterly confused. When I use Navicat to go in and look at the titles I have “corrected” by typing the correct character in and re-saving the entry, they look like:

    Bernard-Henri Lévy: Suffering From “American Vertigo”

    That’s the correct version, at least on the template side. The incorrect versions are showing the é in the database as I would like it to appear on the template, but on the template I’m getting the diamond question mark. What is “right” here?

  • #3 / Nov 29, 2008 9:57pm

    Ryan M.

    1511 posts

    I just went to SQL Manager >  System Variables in the EE admin and I see this:

    character_set_client   latin1
    character_set_connection   latin1
    character_set_database   utf8
    character_set_filesystem   binary
    character_set_results   latin1
    character_set_server   latin1
    character_set_system   utf8
    character_sets_dir   /usr/share/mysql/charsets/
    collation_connection   latin1_swedish_ci
    collation_database   utf8_general_ci
    collation_server   latin1_swedish_ci

    How and why is that latin stuff still in there? Everything in phpMyAdmin shows that this DB is utf-8 (collation, tables). This stuff is way confusing!

    And to go back to the characters that are “working” correctly on the template - they all look odd in the DB…just wondering if this is how the following characters should look in the DB:
    á = á
    é = é
    í = í
    ó = ó
    ú = ú

  • #4 / Nov 30, 2008 4:13am

    Ingmar

    29245 posts

    Well, character encodings is a tricky field. First off, the Navicat view is fine, it’s just that the utf-8 characters are displayed as iso-8859-1—not iso-9951, incidentally, which refers to “measurement of gas flow in closed conduits” 😊—so you need to configure the character encoding there.

    In your five steps of conversion, did you also actually find & replace the encoded characters, or just the character set declaration?

  • #5 / Nov 30, 2008 12:32pm

    Ryan M.

    1511 posts

    Ingmar, I couldn’t get the data to do what I wanted, so I went back to dev database 2. (Always nice to have a few dev DB’s around to mess with). So I set EE back to iso-8859-1. as well, and now all the characters appear OK (although my beloved Playa multi-relationship custom field is having problems with the special characters).

    I did eventually go through and run ALTER TABLE CONVERT on all the tables - so my DB was utf-8, th data in it was utf-8, and I set EE to utf-8. It just never seemed to fully work out, and I would have had to sweep through the data again replacing é í á ó and curly quotes (maybe the ALTER TABLE command didn’t work as it should have?)

    Even though I may just stick with iso-8859-1 for now, are these characters:
    á = á
    é = é
    í = í
    ó = ó
    ú = ú
    correct when stored in the DB? Meaning, if I were using utf-8, should those special vowels look like those two characters when looking at the fields directly in the db?

  • #6 / Nov 30, 2008 12:34pm

    Ryan M.

    1511 posts

    ...it’s just that the utf-8 characters are displayed as iso-8859-1—not iso-9951, incidentally, which refers to “measurement of gas flow in closed conduits” 😊

    Heh, whoops. By that time I was so ready for my day to be over, especially considering that it was a Saturday. I could have used some of that gas had it been laughing gas.

  • #7 / Nov 30, 2008 2:13pm

    Erdal Demirtas

    84 posts

    Hi,

    this is how I handle the iso-8859-1 to UTF-8 conversion (in the data):

    1) Export the iso-8859-1 DB to a txt file with phpMyAdmin (send as zip and if the DB is too large, export it to a few files by selecting tables)

    2) Change the format of the file(s) with UltraEdit or Notepad++ to “UTF-8 without BOM”

    3) Search replace the strange characters (do not forget uppercase, lowercase)

    4) Search replace the other necessary stuff in table definitions (for example latin1_general_ci -> utf8_general_ci)

    5) Now we have a utf-8 data, export this to the new DB (utf-8) with phpMyAdmin

    It looks like a lot of job, but it does not take that much time, this way always worked for me.

  • #8 / Nov 30, 2008 2:23pm

    Ryan M.

    1511 posts

    Thanks, Erdal. I’m curious about #3. How do I know what all the strange characters are? What if I happen to miss one? Then I drop all the tables, search and replace for that one and re-import again? Just keep doing this until I’m sure I have them all? Is there a defined list of well-known messed-up characters? How did you know what to search and replace?

  • #9 / Nov 30, 2008 2:36pm

    Erdal Demirtas

    84 posts

    No, I do not have a list of strange characters.
    I do this mostly for my turkish or german sites.
    I know all those characters to replace. So it is just experience.

    And, yes, if i miss one, i search replace it and import again.
    I usually have also the drop table statements in the file.
    So it is not a big job.

    Edit:
    I replace not only strange characters but also the html encoded ones if there are any. This way you really have a clean data.
    For example

    ö-> ö 
    ä-> ä
    ü-> ü
  • #10 / Dec 01, 2008 6:08pm

    Ingmar

    29245 posts

    Ryan, does that clear up the confusion somewhat? What’s your status, exactly?

.(JavaScript must be enabled to view this email address)

ExpressionEngine News!

#eecms, #events, #releases