ExpressionEngine CMS
Open, Free, Amazing

Thread

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

Foreign alphabet characters.

June 30, 2008 1:09pm

  • #1 / Jun 30, 2008 1:09pm

    julianps

    175 posts

    This question may be related to a resolved thread. - not so fast!

    I set up ExpressionEngine on EngineHosting’s servers and set the Default Character Set to UTF-8.

    I created one entry called “Franche-Comté” and another called “Île-de-France” (as a separate test I also created a category under each name, sorted alphabetically). EE sorts both entries and categories so that Île-de-France is listed first.

    This is because, as a result of a “feature” between EE and EH/MySQL, the Î character is written to the database as ÃZ and consequently sorted first.

    It was initially suggested that this was a collation (sorting) issue and that we were confused into seeing the ÃZ as a result of faulty phpMyAdmin settings.

    So we added a field to the weblog entry to point to the Wiki and for the title of the link we used the Weblog {title} tag.

    Sure enough the Wiki then tells us that we do not have an entry for ÃZli-de-France (and would we like to create one?).

    So we tested it further and edited the original article using the Pages Module, creating a Page called Île-de-France.html

    Having created the entry we attempted to connect to that page in the browser and things went fine, EE provided that Page with that URL.

    So we then changed the Default Character Set to ISO-8859-1 and re-entered the entry Title and Page Title.

    EE can no longer find this page at the Pages Title. In other words it needs to see ÃZ in order to translate back to Î.

    So we appear able to demonstrate emphatically that using ExpressionEngine on EngineHosting will not generate reliable international characters under either ISO-8859-1 (and all these characters are in the latin-general-ci character-set) or UTF-8.

    I am more than happy to provide usernames and passwords for this EH-hosted site to any EllisLab staff wishing to review these phenomena in the wild.

    Of course getting to understand them better may mean we can use ExpressionEngine for real, so there’s a potential commercial benefit in showing us how the application is intended to work.

    jiF
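A minimal Python sketch of the symptom described in the post above, assuming the UTF-8 bytes of each title are being read back as Windows-1252 (the encoding MySQL calls latin1). The helper names are hypothetical, and `accent_insensitive_key` is only a rough stand-in for an accent-insensitive collation, not MySQL's actual sorting algorithm.

```python
import unicodedata


def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def accent_insensitive_key(text: str) -> str:
    """Strip combining accents and fold case (assumed stand-in for a collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


titles = ["Franche-Comté", "Île-de-France"]
stored = [misread_utf8_as_cp1252(title) for title in titles]

print(stored)                                      # titles as they appear when misread
print(sorted(titles, key=accent_insensitive_key))  # order of the intended titles
print(sorted(stored, key=accent_insensitive_key))  # order of the misread titles
```

Under these assumptions the misread title begins with Ã followed by Ž, which an accent-insensitive comparison treats much like "AZ". That would put it among the A's rather than after the F's.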

  • #2 / Jun 30, 2008 4:13pm

    Lisa Wess

    20502 posts

    Jules, you’re going to need to settle on one charset, and settling on UTF-8 should resolve all of the problems you are experiencing.  But you cannot keep switching between them and expect everything to keep working.

  • #3 / Jun 30, 2008 4:23pm

    julianps

    175 posts

    > Jules, you’re going to need to settle on one charset, and settling on UTF-8 should resolve all of the problems you are experiencing.  But you can not continue to switch between them and expect everything to keep working.

    Lisa, you are jumping to conclusions; I am using UTF8 and it fails for the reasons stated.

    You previously asked for repeatable steps - here you have them and I bought you a hosting account to try them on.

    What more can I do?

    jiF

  • #4 / Jun 30, 2008 4:25pm

    Lisa Wess

    20502 posts

    > Having created the entry we attempted to connect to that page in the browser and things went fine, EE provided that Page with that URL.

    So it worked when you stayed as UTF-8, yes?

    > So we then changed the Default Character Set to ISO-8859-1 and re-entered the entry Title and Page Title.
    >
    > EE can no longer find this page at the Pages Title. In other words it needs to see ÃZ in order to translate back to Î.

    and then broke when you changed your character set.

    You can’t go randomly changing character sets like that.  You need to choose one and stick to it.

  • #5 / Jun 30, 2008 4:46pm

    julianps

    175 posts

    > So it worked when you stayed as UTF-8, yes?

    How do you define “worked”?

    Yes, when I key in the names of French regions that include French alphabet characters what I see on the input screen is what I get on the web page. That “works”.

    When I type in the names of more French regions that include French alphabet characters they too “work” but the sort-order, both when sorting weblog entries using these names, and category name order when using the same names as categories, is wrong. That does “not work”.

    To avoid “jumping to conclusions” I will say that certain French characters do not appear in the database the way they do on screen; for example the capital circumflex-i (as in Î) appears as AZ. This may, or may not, explain why the French region of Île-de-France is ALWAYS sorted at the end of the A’s.

    Interestingly, if I then create a link from the Île-de-France weblog entry to the Wiki, using the weblog entry {title}, the Wiki tries to create a new entry with the title AZle-de-France (remarkably like the gibberish value within the database).

    So this too “does not work”.

    So, for the very easiest of tests EE looks like it works, but for real-world use it does not work.

    Is that any clearer?

    jiF

  • #6 / Jun 30, 2008 5:59pm

    Lisa Wess

    20502 posts

    Alright - first, switching your charset *is* going to break links, and I’d expect this.

    So, we need the character set to remain the same.  Now, you mentioned a problem with sorting.  I’d like to test this, so can you give me a few entry titles that I can work with to try to reproduce?

    Thank you.  Also - what version and build of ExpressionEngine are you on now?

  • #7 / Jun 30, 2008 6:14pm

    julianps

    175 posts

    > Alright - first, switching your charset *is* going to break links, and I’d expect this.
    >
    > So, we need the character set to remain the same.  Now, you mentioned a problem with sorting.  I’d like to test this, so can you give me a few entry titles that I can work with to try to reproduce?
    >
    > Thank you.  Also - what version and build of ExpressionEngine are you on now?

    Lisa, in fairness to us we had finalised on ISO-8859-1 ages ago, until we found that if we created a Pages Module URL like Île-de-France.html we could not type that into the browser and reach the page on our site. Changing the Default Character Set to UTF-8 meant we could, and so the wholesale change of our data started to take place. Ordinarily we would have stuck with what we had.

    We’re still running 1.6.3 on this site.

    You can go to http://www.immocherche.com/i/ (that “i” is really important) and you’ll see that the first entry is not in title order - view the image file to see why.

    For the rest, we dropped the latin-swedish-ci database and reloaded using utf8-default-ci and as a result can now move from EE to the Wiki and, critically, back to the Pages Module URL of the Île-de-France page.

    We still cannot write our required URLs into the Wiki module, but we have since received a software-engineering explanation for that and simply have to live with the limitation.

  • #8 / Jun 30, 2008 6:20pm

    Lisa Wess

    20502 posts

    UTF-8 really is the best option to use here, but you’re switching from UTF-8 to ISO-8859-1 then back again, and you really just need to choose UTF-8 and stick with it.

    Ok, can you put this into a template:

    {exp:weblog:entries weblog="default_site" orderby="title" sticky="off" limit="20"}
    {title} - {url_title}
    
    {/exp:weblog:entries}

    replace the weblog= with the appropriate shortname, and give me a link to the test template please? This template should have *only this code* - no HTML tags, nothing, just this code.

  • #9 / Jun 30, 2008 6:28pm

    julianps

    175 posts

    > UTF-8 really is the best option to use here, but you’re switching from UTF-8 to ISO-8859-1 then back again, and you really just need to choose UTF-8 and stick with it.
    >
    > Ok, can you put this into a template:
    >
    > {exp:weblog:entries weblog="default_site" orderby="title" sticky="off" limit="20"}
    > {title} - {url_title}
    >
    > {/exp:weblog:entries}
    >
    > replace the weblog= with the appropriate shortname, and give me a link to the test template please? This template should have *only this code* - no HTML tags, nothing, just this code.

    http://www.immocherche.com/i/index2

    It’s the last, and second to last entries you’re looking for.

    jiF

  • #10 / Jun 30, 2008 6:29pm

    Derek Jones

    7561 posts

    Just a note that with your above example, you’ll need to force the browser to interpret the output as UTF-8, since there’s no meta tag in the document instructing as such.  What you see otherwise is a Latin-1 interpretation of the unicode characters.

  • #11 / Jun 30, 2008 6:33pm

    julianps

    175 posts

    > Just a note that with your above example, you’ll need to force the browser to interpret the output as UTF-8, since there’s no meta tag in the document instructing as such.  What you see otherwise is a Latin-1 interpretation of the unicode characters.

    Thanks; why Latin1 particularly?

    jiF

  • #12 / Jun 30, 2008 6:38pm

    Derek Jones

    7561 posts

    Well it’s one of two things: either what the server is set to send, and without the HTML overriding it, the browser uses that, or on servers that do not force a charset, the browser makes its best guess based on the first output it encounters, which varies from vendor to vendor.  Nine times out of ten, it will be Latin-1 / ISO-8859-1.
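The fallback described above can be sketched in Python. This is only an illustration under stated assumptions: `resolve_charset` and `charset_from_header` are hypothetical helpers, the precedence shown (HTTP header, then a meta tag near the top of the document, then a Latin-1 guess) is a simplification, and real browsers apply more elaborate detection that differs between vendors.

```python
import re
from email.message import Message

# Looks for charset=... inside a <meta> tag (simplified pattern, assumption).
META_RE = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def charset_from_header(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def resolve_charset(content_type, body: bytes, fallback: str = "iso-8859-1") -> str:
    """Pick a charset: server header first, then meta tag, then the fallback guess."""
    if content_type:
        header_charset = charset_from_header(content_type)
        if header_charset:
            return header_charset
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


page = "Île-de-France".encode("utf-8")
charset = resolve_charset("text/html", page)  # no charset in header, no meta tag
print(charset, repr(page.decode(charset)))
```

With neither a header charset nor a meta tag, this sketch falls back to ISO-8859-1, so the UTF-8 bytes for Î are decoded as two separate characters, which matches the garbled titles seen on the bare test template.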

  • #13 / Jun 30, 2008 6:53pm

    julianps

    175 posts

    > Well it’s one of two things: either what the server is set to send, and without the HTML overriding it, the browser uses that, or on servers that do not force a charset, the browser makes its best guess based on the first output it encounters, which varies from vendor to vendor.  Nine times out of ten, it will be Latin-1 / ISO-8859-1.

    Good; good.

    So if I change the default stylesheet for my browser from Latin1 to UTF8; yes, there we go, the entry corrects itself to Île-de-France on my screen.

    However in phpMyAdmin I’m still seeing the title (in the weblog_titles table) as the Latin-1 interpretation of the unicode characters even though my stylesheet is UTF8 and the page meta is UTF8.

    This could of course be to do with the MySQL configuration on this server, which has latin1 defaults all over it (though not on this database).

    Well, that’s not the answer, because I’m also seeing the Latin-1 interpretation of the unicode characters on EngineHosting and they are 101% UTF8.

    What was that you said about servers defaulting to sending latin1 characters?

    And more importantly, why is MySQL using the Latin-1 interpretation of the unicode characters for its sort-order?

    jiF

  • #14 / Jun 30, 2008 6:57pm

    Derek Jones

    7561 posts

    Not defaulting to send Latin-1 characters, but to send instructions to the browser as to what character set the output should be interpreted as.  The output is the same either way.

    Check with your host for this install that your database, table, and columns all have unicode collation.  And of course, the entries in question will all have to have been entered into the CP with UTF-8 selected as your character set in your preferences, and the database collations would need to match at that time as well.  If any setting was incorrect or switched at any point along the way, then manual data conversion would be necessary.

    And some info on MySQL’s sort and order behavior:  http://dev.mysql.com/doc/refman/4.1/en/charset-configuration.html
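As an illustration of what "manual data conversion" could involve, here is a hedged Python sketch that reverses one specific kind of damage: UTF-8 text that was misread as Windows-1252 and stored that way. The function name is hypothetical, it works on individual strings rather than on a MySQL table, and it assumes the damage happened exactly once; any real conversion should be tested on a backup first.

```python
def repair_double_encoded(text: str) -> str:
    """Undo a single UTF-8-read-as-Windows-1252 misinterpretation, if one is present.

    Strings that do not fit the pattern are returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # assume the text was never damaged in this particular way.
        return text


for stored_title in ["ÃŽle-de-France", "Franche-ComtÃ©", "Île-de-France", "Paris"]:
    print(repr(stored_title), "->", repr(repair_double_encoded(stored_title)))
```

The try/except is a deliberate design choice: correctly stored titles such as "Île-de-France" fail the round trip and are left alone, so the function can be applied to a mix of damaged and undamaged values.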

  • #15 / Jun 30, 2008 6:58pm

    Ingmar

    29245 posts

    As Derek said, please add

    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />

    to your test template.

    In a quick test, my entries sorted correctly.

    ETA: I see you’re ahead of me…
