8859-1 Encoded Filenames, Directory Names And Content
I have a lot of Perl scripts and text data files on an older Linux system. Together they make up a management system that I still need. I now want to move this management system to a current Linux system, because I suspect the old one will stop working some day.
The old Linux system uses ISO 8859-1 as the encoding for filenames, directory names and the content of text files.
The current Linux system uses UTF-8 as the encoding for filenames, directory names and the content of text files.
Of course, as the system is in German, many files, filenames and directory names contain umlauts such as ÄÖÜäöü.
To work with the system on the current Linux, everything must be converted from ISO 8859-1 to UTF-8, including file and directory names.
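To get a first impression of how many names are affected, something like the following could be used. It is only a sketch: the function name list_non_ascii_names is made up for this example, and it lists every name containing a non-ASCII byte regardless of encoding, so it is most useful on a freshly copied tree where nothing has been converted yet.

```bash
#!/bin/bash
# Hypothetical helper: list all files and directories below a directory
# (default: the current one) whose names contain a byte outside the
# printable ASCII range. LC_ALL=C makes find compare plain bytes.
list_non_ascii_names() {
    LC_ALL=C find "${1:-.}" -mindepth 1 -name '*[! -~]*' -print
}
```

Calling list_non_ascii_names without an argument inspects the current directory and everything below it.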
Side Note: Bash Glob Problem
Without the conversion, even something as simple as a bash glob no longer works. Example: on the old ISO 8859-1 system I have two files, Maus and Möhre, in one directory.
I copy these files to the new system and list them with ls:
> ls
Maus M?hre
>
> ls M*
Maus
Unbelievable. But true.
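The reason becomes visible when looking at the raw bytes. In ISO 8859-1 the letter ö is the single byte 0xF6, while in UTF-8 it is the two bytes 0xC3 0xB6. A lone 0xF6 is not valid UTF-8, which is presumably why ls prints a question mark and the glob did not match on my system. The helper name hex_bytes is made up for this illustration.

```bash
#!/bin/bash
# Print the bytes of a string as one hex string.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes "$(printf 'M\366hre')"       # ISO 8859-1 name: 4df6687265
hex_bytes "$(printf 'M\303\266hre')"   # UTF-8 name:      4dc3b6687265
```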
Use Iconv to Change Encoding
The iconv tool is available on most Linux systems, including my new one. With iconv, you can convert text between any of the many encodings it supports.
As I have to change the encoding not only of the file contents but also of the file and directory names, I have written a bash script that does the whole conversion.
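For a single file, the basic usage could look like the sketch below. The two function names are invented for this example. The validity check relies on iconv exiting with an error when its input is not valid in the source encoding; note that a pure ASCII file also counts as valid UTF-8.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if the content of file $1 is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: write a UTF-8 copy of the ISO 8859-1 file $1 to $2.
convert_copy() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}
```

A call such as convert_copy old.txt new.txt leaves old.txt untouched, so the result can be checked with is_valid_utf8 new.txt before anything is overwritten.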
The script creates a backup with the suffix .bak.8859-1 of every file it transcodes, and it creates an empty .eonc file (eonc stands for "encoding of name changed") for every file or directory whose name it has changed. These markers are needed because a second run of the script would otherwise convert the already converted names again and damage them.
The script transcodes all filenames, directory names and file content in and below the current directory from ISO 8859-1 to UTF-8. Don’t run it in / 😉
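Before letting the script loose, a dry run that only prints the planned renames may be reassuring. The following is a minimal sketch that is separate from the script itself; the name preview_renames is made up. It assumes that every name is still ISO 8859-1: a name that is already UTF-8 would also show up here, with a wrongly double-encoded target, which is exactly the damage the .eonc markers are meant to prevent.

```bash
#!/bin/bash
# Hypothetical dry-run helper: print "old --> new" for every file or
# directory name below $1 (default: .) that would change when converted
# from ISO 8859-1 to UTF-8. Nothing is renamed.
preview_renames() {
    local root=${1:-.} path name target
    while IFS= read -r -d '' path; do
        name=$(basename -- "$path")
        target=$(printf '%s' "$name" | iconv -f ISO-8859-1 -t UTF-8)
        if [ "$name" != "$target" ]; then
            printf '%s --> %s\n' "$name" "$target"
        fi
    done < <(find "$root" -mindepth 1 -print0)
}
```

Pure ASCII names such as Maus produce no output, because their bytes are identical in both encodings.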
#! /bin/bash
eoncExtension=.eonc
# Change encoding of the content of a file.
# But only if the file is not empty and has no backup yet.
function iconvContent {
    local tmpf="tmp.$(basename "$1")"
    if [[ -s "$1" && ! -e "$1.bak.8859-1" ]] ; then
        echo "iconvContent $PWD $1"
        cp -p "$1" "$1.bak.8859-1"
        # iconv does not preserve file attributes.
        # chmod --reference ... does not work on my Linux version.
        if iconv -f ISO_8859-1 -t UTF8 "$1" -o "$tmpf" ; then
            cat "$tmpf" > "$1" # this preserves the file attributes of $1
        fi
        rm -f "$tmpf"
    fi
}
# Change encoding of a file or directory name.
# But only if the encoding has not already been changed.
function iconvName {
    if [[ -e "$1" && ! -f "$1$eoncExtension" ]] ; then
        local target
        target=$(printf '%s' "$1" | iconv -f ISO_8859-1 -t UTF8)
        if [[ "$1" != "$target" ]] ; then
            echo "iconvName $PWD $1 --> $target"
            mv -- "$1" "$target"
            # Create a .eonc file as a marker that the encoding is
            # already changed.
            touch "$target$eoncExtension"
        fi
    fi
}
# Return success (0) if the file is one of those that the script has
# created itself as backup, marker or temp file.
function isSelfCreatedFile {
    local bak='.*\.bak.*'
    local eonc='.*\.eonc'
    local tmpre=".*tmp\..*"
    local f=$1
    local self=1
    if [[ $f =~ $bak || $f =~ $eonc || $f =~ $tmpre ]] ; then
        self=0
    fi
    return $self
}
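# Change encoding of all names in the current directory.
# Reading NUL-separated names keeps names with spaces intact.
function iconvNamesInDir {
    local f
    while IFS= read -r -d '' f ; do
        if [[ -e "$f" ]] && ! isSelfCreatedFile "$f" ; then
            iconvName "$f"
        fi
    done < <(find . -maxdepth 1 -print0)
}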
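# Change encoding of the content of all regular files in the
# current directory.
function iconvContentInDir {
    local f
    while IFS= read -r -d '' f ; do
        if [[ -s "$f" ]] && ! isSelfCreatedFile "$f" ; then
            iconvContent "$f"
        fi
    done < <(find . -maxdepth 1 -type f -print0)
}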
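# Convert the current directory, then descend into each subdirectory.
function iconvCurrentDir {
    echo "iconvCurrentDir $PWD"
    iconvNamesInDir
    iconvContentInDir
    local f dir
    while IFS= read -r -d '' f ; do
        dir=$PWD
        # Only recurse if the cd succeeded, to avoid endless recursion.
        if cd "$f" ; then
            iconvCurrentDir
            cd "$dir"
        fi
    done < <(find . -mindepth 1 -maxdepth 1 -type d -print0)
}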
if [ "$1" == "run" ] ; then
    iconvCurrentDir
else
    echo "USAGE: iconv-multi run"
    echo " This script changes the encoding of all filenames, "
    echo " directory names and all file content in and below the "
    echo " current directory from ISO 8859-1 to UTF-8."
    echo " Running it twice does not hurt because the script "
    echo " creates backup and marker files to detect which "
    echo " files have already been converted."
    echo " (c) 2014 by Andreas Wicker."
fi
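If the converted content of some files turns out to be wrong, the .bak.8859-1 backups can be used to bring the original content back. The following is only a sketch under the assumption that each backup still sits next to its file; the function name restore_content_backups is made up. It restores file content only, not the old names.

```bash
#!/bin/bash
# Hypothetical helper: copy the content of every *.bak.8859-1 backup
# below $1 (default: .) back into the file it was made from.
restore_content_backups() {
    local root=${1:-.} backup original
    while IFS= read -r -d '' backup; do
        original=${backup%.bak.8859-1}
        if [ -f "$original" ]; then
            # cat keeps the attributes of the existing file, as in the script.
            cat "$backup" > "$original"
            echo "restored $original"
        fi
    done < <(find "$root" -type f -name '*.bak.8859-1' -print0)
}
```

The backups themselves are left in place, so the script would skip these files on a later run until the backups are removed.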