OT? Encoding Question

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

OT? Encoding Question

Diane Houdek
Can someone explain to me the benefits and/or differences between UTF-8 and Latin-1 encoding in MySQL?  The reason I'm asking is that we're using Lasso 8.5.5/MySQL 4.3.x on Mac OS 10.4.11 to run a research data collection instrument (all in English).  A large number of the 705 fields are text fields that allow description, including text that may be quoted or have apostrophes in it.  When I output the data (I'm using Navicat), there are multiple escape characters before these apostrophes or quotes.  For example: mother's becomes mother\\\\'s or "grandmother's" becomes "\\\\\grandmother\\\'s\\\\\".  Does anyone know why?  Is this a function of the encoding type or something else?

Another problem we have is that there are three textarea fields that may have large amounts of data in them (think paragraphs) and of course when I output the data, I get \r\n characters where the respondents have entered hard returns in the field.  Mac deals with this just fine, but when I try to open the file on Windows (which is what the research crew will be using) it treats those characters as end of record markers rather than just internal line feeds.  Again, should I be doing a different kind of encoding to fix this problem, or stripping this out using Lasso before passing it to MySQL?

Thanks for any insight!
Diane
[hidden email]



--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Tami Williams
Diane:

I think this is caused by the encode_sql that's on the form entries  
before they're saved in the database (to prevent sql injection).



On Oct 21, 2008, at 8:14 PM, Diane Houdek wrote:

> Can someone explain to me the benefits and/or differences between  
> UTF-8 and Latin-1 encoding in MySQL?  The reason I'm asking is that  
> we're using Lasso 8.5.5/MySQL 4.3.x on Mac OS 10.4.11 to run a  
> research data collection instrument (all in English).  A large  
> number of the 705 fields are text fields that allow description,  
> including text that may be quoted or have apostrophes in it.  When  
> I output the data (I'm using Navicat), there are multiple escape  
> characters before these apostrophes or quotes.  For example:  
> mother's becomes mother\\\\'s or "grandmother's" becomes "\\\\
> \grandmother\\\'s\\\\\".  Does anyone know why?  Is this a function  
> of the encoding type or something else?
>
> Another problem we have is that there are three textarea fields  
> that may have large amounts of data in them (think paragraphs) and  
> of course when I output the data, I get \r\n characters where the  
> respondents have entered hard returns in the field.  Mac deals with  
> this just fine, but when I try to open the file on Windows (which  
> is what the research crew will be using) it treats those characters  
> as end of record markers rather than just internal line feeds.  
> Again, should I be doing a different kind of encoding to fix this  
> problem, or stripping this out using Lasso before passing it to MySQL?
>
> Thanks for any insight!
> Diane
> [hidden email]
>
>
>
> --
> This list is a free service of LassoSoft: http://www.LassoSoft.com/
> Search the list archives: http://www.ListSearch.com/Lasso/Browse/
> Manage your subscription: http://www.ListSearch.com/Lasso/
>


--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Wade Maxfield
In reply to this post by Diane Houdek
On 22/10/08 1:14 PM, Diane Houdek wrote:

> Can someone explain to me the benefits and/or differences between UTF-8 and Latin-1 encoding in MySQL?  The reason I'm asking is that we're using Lasso 8.5.5/MySQL 4.3.x on Mac OS 10.4.11 to run a research data collection instrument (all in English).  A large number of the 705 fields are text fields that allow description, including text that may be quoted or have apostrophes in it.  When I output the data (I'm using Navicat), there are multiple escape characters before these apostrophes or quotes.  For example: mother's becomes mother\\\\'s or "grandmother's" becomes "\\\\\grandmother\\\'s\\\\\".  Does anyone know why?  Is this a function of the encoding type or something else?
>
> Another problem we have is that there are three textarea fields that may have large amounts of data in them (think paragraphs) and of course when I output the data, I get \r\n characters where the respondents have entered hard returns in the field.  Mac deals with this just fine, but when I try to open the file on Windows (which is what the research crew will be using) it treats those characters as end of record markers rather than just internal line feeds.  Again, should I be doing a different kind of encoding to fix this problem, or stripping this out using Lasso before passing it to MySQL?
>
> Thanks for any insight!
> Diane
> [hidden email]
>
>
>

The \\\\'s look like someone has escaped the ' multiple times.  I don't
think that would have anything to do with Latin-1 vs UTF-8.  We switched
to UTF-8 along time ago, since it handled characters with umlauts and
things like the accented e in café, which we needed for a clients site.
Dealing with Classical music in particular needed the extended
characters for all those weird composers names.  8)

For the new lines, get Lasso to process the data before
inserting/updating the records in MySQL. It's always a good idea to
normalise the line endings.  Depending on what browser/platform the user
was on, you get different line endings in the text fields.

Most people I know have settled on using \r.

Var: 'theText' = (Action_Param: 'theText');
$theText->(Replace: '\r\n', '\r');
$theText->(Replace: '\n', '\r');

  - Wade


--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Diane Houdek
What's weird is that these folks are just typing in the field and then
maybe copying and pasting the same text repeatedly when necessary.
They're not at all computer savvy so they don't understand the concept
of escape characters or anything.  Also I can't figure out why I'm
getting both \r and \n for line endings - I would expect one or the
other, just not both!!

Thanks for the info, though.
Diane

>>> On Tue, Oct 21, 2008 at  7:30 PM, in message
<[hidden email]>,
Wade Maxfield <[hidden email]> wrote:
>
> The \\\\'s look like someone has escaped the ' multiple times.  I
don't
> think that would have anything to do with Latin-1 vs UTF-8.  We
switched
> to UTF-8 along time ago, since it handled characters with umlauts and

> things like the accented e in café, which we needed for a clients
site.
> Dealing with Classical music in particular needed the extended
> characters for all those weird composers names.  8)
>
> For the new lines, get Lasso to process the data before
> inserting/updating the records in MySQL. It's always a good idea to
> normalise the line endings.  Depending on what browser/platform the
user

> was on, you get different line endings in the text fields.
>
> Most people I know have settled on using \r.
>
> Var: 'theText' = (Action_Param: 'theText');
> $theText->(Replace: '\r\n', '\r');
> $theText->(Replace: '\n', '\r');
>
>   - Wade
>
>
> --
> This list is a free service of LassoSoft: http://www.LassoSoft.com/
> Search the list archives: http://www.ListSearch.com/Lasso/Browse/
> Manage your subscription: http://www.ListSearch.com/Lasso/


--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Wade Maxfield
On 22/10/08 2:13 PM, Diane Houdek wrote:
> What's weird is that these folks are just typing in the field and then
> maybe copying and pasting the same text repeatedly when necessary.
> They're not at all computer savvy so they don't understand the concept
> of escape characters or anything.  Also I can't figure out why I'm
> getting both \r and \n for line endings - I would expect one or the
> other, just not both!!
>

The reason you are getting both is that the data is probably coming from
a Windows machine.

Macs use CR (\r), Unix uses LF (\n), and Windows uses CRLF (\r\n) for
their line breaks.

  - Wade

--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Diane Houdek
Wade,

Thank you for the explanation, that is VERY helpful.

Diane

>>> On Tue, Oct 21, 2008 at  8:26 PM, in message <[hidden email]>,
Wade Maxfield <[hidden email]> wrote:

> On 22/10/08 2:13 PM, Diane Houdek wrote:
>> What's weird is that these folks are just typing in the field and then
>> maybe copying and pasting the same text repeatedly when necessary.
>> They're not at all computer savvy so they don't understand the concept
>> of escape characters or anything.  Also I can't figure out why I'm
>> getting both \r and \n for line endings - I would expect one or the
>> other, just not both!!
>>
>
> The reason you are getting both is that the data is probably coming from
> a Windows machine.
>
> Macs use CR (\r), Unix uses LF (\n), and Windows uses CRLF (\r\n) for
> their line breaks.
>
>   - Wade
>
> --
> This list is a free service of LassoSoft: http://www.LassoSoft.com/
> Search the list archives: http://www.ListSearch.com/Lasso/Browse/
> Manage your subscription: http://www.ListSearch.com/Lasso/



--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Bil Corry-3
In reply to this post by Diane Houdek
Diane Houdek wrote on 10/21/2008 7:14 PM:
> Can someone explain to me the benefits and/or differences between
> UTF-8 and Latin-1 encoding in MySQL?

Use UTF-8 for everything (database, page encoding, etc).  Latin-1 doesn't have curly-quotes, em-dashes, etc, and undoubtedly you'll have a user who will copy/paste from Word and it won't show up properly with Latin-1.


> The reason I'm asking is that
> we're using Lasso 8.5.5/MySQL 4.3.x on Mac OS 10.4.11 to run a
> research data collection instrument (all in English).  A large number
> of the 705 fields are text fields that allow description, including
> text that may be quoted or have apostrophes in it.  When I output the
> data (I'm using Navicat), there are multiple escape characters before
> these apostrophes or quotes.  For example: mother's becomes
> mother\\\\'s or "grandmother's" becomes "\\\\\grandmother\\\'s\\\\\".
> Does anyone know why?  Is this a function of the encoding type or
> something else?

That's unrelated to charsets.  Most likely someone has encode_sql a few times too many.


> Another problem we have is that there are three textarea fields that
> may have large amounts of data in them (think paragraphs) and of
> course when I output the data, I get \r\n characters where the
> respondents have entered hard returns in the field.  Mac deals with
> this just fine, but when I try to open the file on Windows (which is
> what the research crew will be using) it treats those characters as
> end of record markers rather than just internal line feeds.  Again,
> should I be doing a different kind of encoding to fix this problem,
> or stripping this out using Lasso before passing it to MySQL?

Output the data to what?  A tab-delimited file?  An Excel spreadsheet?  You have to be careful with tabs and linefeeds that are inside of the data when you're exporting to a tab-delimited file since tabs and linefeeds have special meaning (field and record separators).


- Bil



--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/

Reply | Threaded
Open this post in threaded view
|

Re: OT? Encoding Question

Jonathan Vanherpe (T & T NV)
Bil Corry wrote:
> Diane Houdek wrote on 10/21/2008 7:14 PM:
>> Can someone explain to me the benefits and/or differences between
>> UTF-8 and Latin-1 encoding in MySQL?
>
> Use UTF-8 for everything (database, page encoding, etc).  Latin-1 doesn't have curly-quotes, em-dashes, etc, and undoubtedly you'll have a user who will copy/paste from Word and it won't show up properly with Latin-1.
>
[snip]
> - Bil

I agree,

there's no reason *not* to use utf-8 for everything nowadays.

Jonathan

--
Jonathan Vanherpe - Tallieu & Tallieu NV - [hidden email]

--
This list is a free service of LassoSoft: http://www.LassoSoft.com/
Search the list archives: http://www.ListSearch.com/Lasso/Browse/
Manage your subscription: http://www.ListSearch.com/Lasso/