8859-1 Encoded Filenames, Directory Names And Content
I have a lot of Perl scripts and text data files on an older Linux system. Together they form a management system that I still need. Now I want to move this management system to a current Linux system, because I suspect that the old one will stop working some day.
The old Linux system uses ISO 8859-1 as the encoding for filenames, directory names and the content of text files.
The current Linux system uses UTF-8 as the encoding for filenames, directory names and the content of text files.
Of course, as the system is in German, many files, filenames and directory names contain umlauts like ÄÖÜäöü.
To work with the system on the current Linux, everything must be converted from ISO 8859-1 to UTF-8, including file and directory names.
Side Note: Bash Glob Problem
Otherwise, even something as simple as a bash glob no longer works. Example: I have two files, Maus and Möhre, in a directory on the old 8859-1 system.
Now I bring these files to the new system and try to list them with ls.
> ls
Maus M?hre
>
> ls M*
Maus
Unbelievable. But true.
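Looking at the raw bytes makes the effect less mysterious. The following is a minimal sketch, assuming iconv and od are installed; the helper name hexBytes is made up for this illustration. In ISO 8859-1 the ö is the single byte f6, which is not a valid sequence in UTF-8, and that is probably why a shell running in a UTF-8 locale shows it as ? and may not match it with M*.

```shell
#!/bin/bash
# Hypothetical helper: print the bytes on stdin as space-separated hex.
hexBytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "Möhre" as stored by the old system (ISO 8859-1): ö is the one byte f6.
latin1=$(printf 'M\xf6hre' | hexBytes)
# The same name after conversion to UTF-8: ö becomes the two bytes c3 b6.
utf8=$(printf 'M\xf6hre' | iconv -f ISO-8859-1 -t UTF-8 | hexBytes)

echo "ISO 8859-1: $latin1"
echo "UTF-8:      $utf8"
```

The first line shows five bytes, the second six. Only the second form is something a UTF-8 system can decode as text.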
Use Iconv to Change Encoding
The iconv tool is available on most Linux systems, including my new one. With iconv, you can convert text between any two of the many encodings it knows; iconv -l lists them.
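As a minimal sketch of the basic call, iconv reads text in the encoding given with -f and writes it in the encoding given with -t. The file names and the scratch directory below are made up for the example.

```shell
#!/bin/bash
# Work in a scratch directory so that nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a small ISO 8859-1 sample file: "Grüße" with ü = fc and ß = df.
printf 'Gr\xfc\xdfe\n' > sample-latin1.txt

# Convert the content; the original file is left untouched.
iconv -f ISO-8859-1 -t UTF-8 sample-latin1.txt > sample-utf8.txt

# The UTF-8 copy is two bytes longer, because ü and ß take two bytes each.
latin1Size=$(wc -c < sample-latin1.txt)
utf8Size=$(wc -c < sample-utf8.txt)
echo "ISO 8859-1: $latin1Size bytes, UTF-8: $utf8Size bytes"
```

Writing to a second file, as done here, avoids reading and overwriting the same file in one step.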
As I have to change the encoding not only of the file contents but also of the filenames and directories, I have written a bash script to do all the transformation.
The script creates a backup .bak.8859-1 of every file it transcodes, and it creates an empty .eonc file (eonc for encoding of name changed) for every file or directory as a marker that its name has been changed. We need these markers for the names that are already transcoded; otherwise a second run of the script would convert them again and garble them.
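What a second conversion does to a name can be shown on the byte level. This is only an illustration, assuming iconv and od are installed; toUtf8 and hexBytes are helper names invented for it. iconv cannot know that its input is already UTF-8, so it treats each of the two bytes of the ö as a separate ISO 8859-1 character and encodes both again.

```shell
#!/bin/bash
# Hypothetical helper: convert stdin from ISO 8859-1 to UTF-8.
toUtf8() {
    iconv -f ISO-8859-1 -t UTF-8
}
# Hypothetical helper: print the bytes on stdin as space-separated hex.
hexBytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

once=$(printf 'M\xf6hre' | toUtf8 | hexBytes)
twice=$(printf 'M\xf6hre' | toUtf8 | toUtf8 | hexBytes)

echo "converted once:  $once"    # valid UTF-8 for Möhre
echo "converted twice: $twice"   # displays as MÃ¶hre
```

The marker files exist to prevent exactly this second pass over a name.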
The script transcodes all filenames, directory names and file content in and below the current directory from ISO 8859-1 to UTF-8. Don't run it in / 😉
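Before running anything that renames files, it can help to preview which names would be affected. The following is a sketch and not part of the script; it assumes a Linux filesystem that accepts arbitrary bytes in names and a find that honours LC_ALL=C. It lists every name that contains a byte outside printable ASCII.

```shell
#!/bin/bash
# Scratch directory with one ASCII name and one ISO 8859-1 name.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch Maus "$(printf 'M\xf6hre')"

# In the C locale every byte counts as one character, so the bracket
# expression [! -~] matches any byte outside the range space to tilde.
LC_ALL=C find . -name '*[! -~]*'
candidates=$(LC_ALL=C find . -name '*[! -~]*' | wc -l)
echo "names with non-ASCII bytes: $candidates"
```

In a real tree, run the find command from the directory where the conversion is supposed to start. Names that are already UTF-8 show up in the list as well, because they also contain non-ASCII bytes.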
#! /bin/bash
eoncExtension=.eonc
# Convert the content of one file from ISO 8859-1 to UTF-8.
# Empty files and files that already have a backup are skipped.
function iconvContent {
    local tmpf="tmp.$(basename "$1")"
    if [[ -s "$1" && ! -e "$1.bak.8859-1" ]] ; then
        echo "iconvContent $PWD $1"
        cp -p "$1" "$1.bak.8859-1"
        # iconv does not preserve file attributes, and
        # chmod --reference ... does not work on my Linux version.
        # Only overwrite the original if the conversion succeeded.
        if iconv -f ISO_8859-1 -t UTF8 "$1" -o "$tmpf" ; then
            cat "$tmpf" > "$1" # this preserves the file attributes of $1
        fi
        rm -f "$tmpf"
    fi
}

# Change the encoding of a file or directory name,
# but only if the encoding has not already been changed.
function iconvName {
    if [[ -e "$1" && ! -f "$1$eoncExtension" ]] ; then
        local target
        target=$(printf '%s' "$1" | iconv -f ISO_8859-1 -t UTF8)
        if [[ "$1" != "$target" ]] ; then
            echo "iconvName $PWD $1 --> $target"
            # Create a .eonc file as a marker that the encoding
            # has already been changed, but only if the rename worked.
            mv "$1" "$target" && touch "$target$eoncExtension"
        fi
    fi
}

# Return 0 (true) if the file is one of those that the script has
# created itself as backup, marker or temp file.
function isSelfCreatedFile {
    local bak='.*\.bak.*'
    local eonc='.*\.eonc'
    local tmpre='.*tmp\..*'
    local f=$1
    if [[ $f =~ $bak || $f =~ $eonc || $f =~ $tmpre ]] ; then
        return 0
    fi
    return 1
}

# Transcode the names of all entries in the current directory.
function iconvNamesInDir {
    local f
    # -print0 together with read -d '' keeps names with spaces intact.
    while IFS= read -r -d '' f ; do
        if [[ -e "$f" ]] && ! isSelfCreatedFile "$f" ; then
            iconvName "$f"
        fi
    done < <(find . -maxdepth 1 -print0)
}

# Transcode the content of all regular files in the current directory.
function iconvContentInDir {
    local f
    while IFS= read -r -d '' f ; do
        if [[ -s "$f" ]] && ! isSelfCreatedFile "$f" ; then
            iconvContent "$f"
        fi
    done < <(find . -maxdepth 1 -type f -print0)
}

# Transcode the current directory, then descend into its subdirectories.
function iconvCurrentDir {
    echo "iconvCurrentDir $PWD"
    iconvNamesInDir
    iconvContentInDir
    local f
    local dir=$PWD
    while IFS= read -r -d '' f ; do
        if [[ "$f" != . ]] ; then
            # Skip directories we cannot enter instead of recursing in place.
            cd "$f" || continue
            iconvCurrentDir
            cd "$dir" || return
        fi
    done < <(find . -maxdepth 1 -type d -print0)
}
if [ "$1" == "run" ] ; then
    iconvCurrentDir
else
    echo "USAGE: iconv-multi run"
    echo " This script changes the encoding of all filenames, "
    echo " directory names and all file content in and below the "
    echo " current directory from ISO 8859-1 to UTF-8."
    echo " Running it twice does not hurt because the script "
    echo " creates backup and marker files to detect which "
    echo " files already have been encoded."
    echo " (c) 2014 by Andreas Wicker."
fi
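Once the converted system has been checked, the backup and marker files are no longer needed. The following sketch is not part of the script above; listHelperFiles and the sample files are invented for the illustration. It shows one way the helper files could be listed and removed, and it assumes a find that supports -delete.

```shell
#!/bin/bash
# Scratch directory that mimics the state after a run of the script.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'text\n' > Maus.txt
printf 'text\n' > Maus.txt.bak.8859-1
mkdir sub
touch sub/Daten sub/Daten.eonc

# Hypothetical helper: list the backup and marker files below a directory.
listHelperFiles() {
    find "$1" -type f \( -name '*.bak.8859-1' -o -name '*.eonc' \) -print
}

before=$(listHelperFiles . | wc -l)
# Review the list first, then delete exactly those files.
listHelperFiles .
find . -type f \( -name '*.bak.8859-1' -o -name '*.eonc' \) -delete
after=$(listHelperFiles . | wc -l)
echo "helper files before: $before, after: $after"
```

After the markers are gone, the conversion script must not be run on that tree again, because it would then convert the already converted names a second time.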