Discussion:
Converting ISO-8859 to text
Lewis@Gmail
2008-03-11 00:27:25 UTC
I have a file encoded as ISO-8859 (according to the file command at
the command line). It is the ratings.file from IMDb's database, and
BBEdit says it's "Western (Mac OS Roman)".

I need the file to be plain ASCII so that I can do grep searches
against it via a PHP script. Here is some sample data:

0000000123 119567 8.6 LÈon (1994)
0000000124 120390 8.6 Fabuleux destin d'AmÈlie Poulain, Le (2001)
0000000123 24627 8.5 RashÙmon (1950)
0000000124 69931 8.4 Vita Ë bella, La (1997)
0000000123 12564 8.3 Smultronst‰llet (1957)
0000000114 17411 8.2 8Ω (1963)

I can use Zap Gremlins to replace the characters with their hex codes:

0000000123 119567 8.6 L\0xC8on (1994)
0000000124 120390 8.6 Fabuleux destin d'Am\0xC8lie Poulain, Le (2001)

But that doesn't help me in doing a grep search through the file.

I also don't understand why "Smultronstället" shows up as
"Smultronst‰llet" or why 'LÈon' appears instead of 'Léon', etc.

What I want is 'Leon', 'Fabuleux destin d'Amelie Poulain, Le',
'Rashomon', 'Vita e bella, La', 'Smultronstallet', and '8 1/2'.

And it needs to be fairly quick and easy to fix because I need to
update this file every month or two.

And if anyone knows what I am doing: yes, I did try to compile the
moviedb-3.24 package under Leopard and failed badly.
--
We will fight for Bovine Freedom and hold our large heads high
We will run free with the Buffalo or die
--
------------------------------------------------------------------
Have a feature request? Not sure the software's working correctly?
If so, please send mail to <***@barebones.com>, not to the list.
List FAQ: <http://www.barebones.com/support/lists/bbedit_talk.shtml>
List archives: <http://www.listsearch.com/BBEditTalk.lasso>
To unsubscribe, send mail to: <bbedit-talk-***@barebones.com>
Fidelis Semper
2008-03-11 03:16:16 UTC
Yo~

Plain ASCII is, of course, just the first 128 characters from the
ASCII table, so it doesn't surprise me that your accented characters
got knocked down to unaccented characters most closely resembling the
original -- È and Ë become E, Ü becomes U, etc. -- when you did a
"Zap Gremlins" on your data. You can get the same effect if you
perform the "Convert to ASCII" function from BBEdit's "Text" menu.
(Actually, the Convert to ASCII has one advantage over Zap Gremlins
in that *some* of the special characters will be converted to literal
equivalents -- π will become pi, © will become (c), ∑ becomes Sum, ¥
will become Yen, etc.)

As for the ‰ ("per thousand") symbol, the closest ASCII equivalent
would be what you got: 0/00, which most people would interpret as per
thousand, thus retaining the meaning (if not the look) of the
character you zapped. Many other special characters in the ASCII
table also get a "literal" translation when they are converted
(knocked down, really) from their special character status to plain
ol' ASCII. For example, the 8Ω pair of characters in your sample data
becomes 8Ohm when reduced to plain ASCII by BBEdit's "Convert to
ASCII" method.

Perhaps my explanation doesn't help you prep your file to make it
easier to handle your data with grep, but at least you can figure out
most of the plain ASCII equivalents you will get when you look up the
special characters greater than 128 in the ASCII table.
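
[Editor's note] Since plain ASCII stops at byte value 127, a quick way to see which bytes in a file fall outside it is to count every byte at or above 0x80. The following is only a minimal sketch: the function name is made up for illustration, and the sample bytes are an assumption about what the ratings file contains (0xE9 and 0xE4 are how ISO Latin 1 stores é and ä), not data read from the real file.

```python
def non_ascii_bytes(data: bytes) -> dict:
    """Map each byte value >= 0x80 found in data to its number of occurrences."""
    counts = {}
    for value in data:  # iterating over bytes yields integers 0..255
        if value >= 0x80:
            counts[value] = counts.get(value, 0) + 1
    return counts


# Assumed sample: two ratings lines as they might be stored in ISO Latin 1.
sample = (
    b"0000000123 119567 8.6 L\xe9on (1994)\n"
    b"0000000123 12564 8.3 Smultronst\xe4llet (1957)\n"
)

for value, count in sorted(non_ascii_bytes(sample).items()):
    print(f"0x{value:02X} occurs {count} time(s)")
```

An empty result would mean the data is already plain ASCII and needs no conversion.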

HTH!

~Semper Fi, Mac!

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on a mailing list?

Lewis@Gmail
2008-03-11 04:37:57 UTC
Post by Fidelis Semper
Plain ASCII is, of course, just the first 128 characters from the
ASCII table, so it doesn't surprise me that your accented characters
got knocked down to unaccented characters most closely resembling
the original -- È and Ë become E, Ü becomes U,
But that isn't what happens, that is what I WANT to happen.
Post by Fidelis Semper
etc. -- when you did a "Zap Gremlins" on your data. You can get the
same effect if you perform the "Convert to ASCII" function from
BBEdit's "Text" menu. (Actually, the Convert to ASCII has one
advantage over Zap Gremlins in that *some* of the special characters
will be converted to literal equivalents -- π will become pi, ©
will become (c), ∑ becomes Sum, ¥ will become Yen, etc.)
But "Fabuleux destin d'AmÈlie Poulain, Le (2001)" gets converted to
"Fabuleux destin d'AmElie Poulain, Le (2001)" when it should be
displayed as "Fabuleux destin d'Amélie Poulain, Le (2001)" and
converted to "Fabuleux destin d'Amelie Poulain, Le (2001)" (note the
case).

BBEdit is not showing the text the same way that Firefox or nvi shows
it. It has the character as È instead of é (wrong case and wrong
accent).
Post by Fidelis Semper
As for the ‰ ("per thousand") symbol, the closest ASCII equivalent
would be what you got: 0/00, which most people would interpret as
per thousand,
But the character is supposed to be ä, and that is how it appears in
both Firefox and nvi.

There are two issues here; the main one is that BBEdit shows the wrong
character (Omega for 1/2, for example), which means that when I convert
to ASCII, I still have gibberish.
Post by Fidelis Semper
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on a mailing list?
Seriously? You top posted with this signature? Trying for ironic? ;)
--
Critics look at actresses one of two ways: you're either bankable or
boinkable.
Fidelis Semper
2008-03-11 04:47:00 UTC
As you've now noted in another reply in this same thread,
***@gmail.com hit the nail on the head with his suggestion that
the solution is to re-open the file as Windows (Latin-1) so that the
correct translation occurs.

Sorry if I led you astray with all the "Convert to ASCII" stuff I posted
earlier; glad to hear that your solution is now at hand, thanks to
gkreme.

~Semper Fi, Mac!

Jan Pieter Kunst
2008-03-11 04:13:00 UTC
Post by ***@Gmail
I have a file encoded as ISO-8859 (according to the file command at
the command line). it is the ratings.file from imdb's database, and
BBEdit says it's "Western (Mac OS Roman)"
[...]
Post by ***@Gmail
I also don't understand why "Smultronstället" shows up as
"Smultronst‰llet" or why 'LÈon' appears instead of 'Léon', etc.
BBEdit thinks the file is encoded as Mac OS Roman, but it is actually
ISO Latin 1. Try 'Reopen Using Encoding' and choose Western (ISO
Latin 1). Or, when opening, choose Western (ISO Latin 1) in the 'Read
As' select menu.
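
[Editor's note] The mismatch described above can be reproduced outside BBEdit by decoding the same bytes two ways. This is a sketch under assumptions: the byte strings are what ISO Latin 1 would store for the sample titles (they are not read from the real file), and `decode_both` is a hypothetical helper name. `mac_roman` and `latin-1` are Python's standard codec names for the two encodings.

```python
# Assumed bytes for the sample titles, as ISO Latin 1 would store them.
raw_titles = [
    b"L\xe9on",
    b"Rash\xf4mon",
    b"Vita \xe8 bella, La",
    b"Smultronst\xe4llet",
    b"8\xbd",
]


def decode_both(raw: bytes) -> tuple:
    """Return (text read as Mac OS Roman, text read as ISO Latin 1)."""
    return raw.decode("mac_roman"), raw.decode("latin-1")


for raw in raw_titles:
    as_mac_roman, as_latin1 = decode_both(raw)
    print(f"{as_mac_roman}  <->  {as_latin1}")
```

The bytes never change; only the table used to turn them into characters does, which is why reopening with the right encoding fixes the display without touching the file.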

JP
Lewis@Gmail
2008-03-11 04:39:50 UTC
Post by Jan Pieter Kunst
BBEdit thinks the file is encoded as Mac OS Roman, but it is actually
ISO Latin 1. Try 'Reopen Using Encoding' and choose Western (ISO
Latin 1). Or, when opening, choose Western (ISO Latin 1) in the 'Read
As' select menu.
Yes!

(I had tried this before, but for some reason the 'reopen as' submenu
was all grey, and changing the encoding popup on the file's bottom
toolbar did nothing. Reopened the file, changed it to ISO Latin-1,
Converted to Text, and all is right with the world. Woot.)
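
[Editor's note] Because the file has to be refreshed every month or two, the same two steps (read as ISO Latin 1, then reduce to ASCII) could also be scripted. What follows is a rough sketch, not BBEdit's own conversion: it uses Unicode decomposition to drop accents, and the `SPECIAL` table, the function names, and any file paths passed in are assumptions made for illustration.

```python
import unicodedata

# Characters whose wanted ASCII spelling differs from what decomposition gives.
# Illustrative assumption only; extend the table as needed.
SPECIAL = {"\u00bd": " 1/2", "\u00bc": " 1/4", "\u00be": " 3/4"}


def latin1_to_ascii(raw: bytes) -> str:
    """Decode ISO Latin 1 bytes and strip accents, e.g. b'L\\xe9on' gives 'Leon'."""
    text = raw.decode("latin-1")
    for char, replacement in SPECIAL.items():
        text = text.replace(char, replacement)
    # NFKD splits an accented letter into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_file(src_path: str, dst_path: str) -> None:
    """Write an ASCII copy of an ISO Latin 1 file (paths are hypothetical)."""
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(latin1_to_ascii(line))
```

One caveat with this approach: characters that have no decomposition (ø, ß, æ and the like) are silently dropped rather than transliterated, so they would need entries in the `SPECIAL` table.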
--
Hey, baby, I've got just the cure for that penis envy back at my
apartment...