How to Change the Encoding of Filenames, Directory Names and File Content From ISO 8859-1 to UTF-8

ISO 8859-1 Encoded Filenames, Directory Names and Content

I have a lot of Perl scripts and text data files on an older Linux system. Together they make up a management system that I still need. Now I want to move this management system to a current Linux system, because I suspect that the old one will stop working some day.

The old Linux system uses ISO 8859-1 as the encoding for filenames, directory names and the content of text files.

The current Linux system uses UTF-8 as the encoding for filenames, directory names and the content of text files.

Of course, as the system is in German, many files, filenames and directory names contain umlauts such as ÄÖÜäöü.

To work with the system on the current Linux, everything must be converted from ISO 8859-1 to UTF-8, including the file and directory names.

Side Note: Bash Glob Problem

Otherwise, even simple things such as a bash glob no longer work. Example: on the old ISO 8859-1 system I have two files, Maus and Möhre, in one directory.

Now I copy these files to the new system and try to list them with ls.

> ls
Maus M?hre
>
> ls M*
Maus

Unbelievable. But true.
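The byte 0xF6, which is the ö in ISO 8859-1, is not a valid UTF-8 sequence in this position, and that is presumably why ls shows a question mark and the glob skips the name. The following is a minimal sketch for finding such names before converting anything. The function names isValidUtf8 and listNonUtf8Names are my own invention for this example, and it assumes that iconv exits with an error on invalid input, as the glibc version does.

```bash
#!/bin/bash

# Succeeds if the given string is valid UTF-8.
# Assumption: iconv exits non-zero on an invalid input sequence.
isValidUtf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Prints every entry of a directory (default: the current one)
# whose name is not valid UTF-8. %q makes the raw bytes visible.
listNonUtf8Names() {
  local f
  while IFS= read -r -d '' f ; do
    isValidUtf8 "$f" || printf '%q\n' "$f"
  done < <(find "${1:-.}" -mindepth 1 -maxdepth 1 -print0)
}
```

Calling listNonUtf8Names in the directory from the example above should report Möhre but not Maus.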

Use Iconv to Change Encoding

The iconv tool is available on most Linux systems, including my new one. With iconv, you can convert text between a large number of encodings; iconv -l lists the ones your installation supports.
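To see what iconv does with a single name, here is a small sketch that converts the word Möhre and prints the bytes before and after. The variable names latin1 and utf8 are only for illustration.

```bash
#!/bin/bash

# "Möhre" as ISO 8859-1 bytes: the ö is the single byte 0xF6.
latin1=$(printf 'M\xf6hre')

# Convert to UTF-8: the ö becomes the two bytes 0xC3 0xB6.
utf8=$(printf '%s' "$latin1" | iconv -f ISO-8859-1 -t UTF-8)

# Show both versions as hex bytes.
printf '%s' "$latin1" | od -An -tx1
printf '%s' "$utf8"   | od -An -tx1
```

The first od line shows 4d f6 68 72 65, the second 4d c3 b6 68 72 65. The same pipe works for file content if you pass a filename to iconv instead of piping a string into it.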

As I have to change the encoding not only of the file contents but also of the filenames and directory names, I have written a bash script that does the whole conversion.

The script creates a backup .bak.8859-1 of every file it transcodes, and it creates an empty .eonc file (eonc for "encoding of name changed") for every file or directory whose name it changes, as a marker that the name has already been converted. These markers are necessary: without them, a second run of the script would convert the already converted names again and damage them.
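Why a second run would do damage can be shown with one name. iconv cannot know that its input is already UTF-8, so it treats each of the two bytes of the ö as a separate ISO 8859-1 character. A small sketch, with variable names chosen only for this example:

```bash
#!/bin/bash

# First conversion: ISO 8859-1 "Möhre" becomes UTF-8 "Möhre".
once=$(printf 'M\xf6hre' | iconv -f ISO-8859-1 -t UTF-8)

# Second conversion of the same name: the UTF-8 bytes 0xC3 0xB6
# are read as the two ISO 8859-1 characters Ã and ¶.
twice=$(printf '%s' "$once" | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$once" "$twice"
```

On a UTF-8 terminal this prints Möhre and then MÃ¶hre. The marker and backup files exist to prevent exactly this double encoding.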

The script transcodes all filenames, directory names and file content in and below the current directory from ISO 8859-1 to UTF-8. Don’t run it in / 😉

#! /bin/bash

eoncExtension=.eonc

function iconvContent {
  local tmpf="tmp.$(basename "$1")"
  if [[ -s $1 && ! -e $1.bak.8859-1 ]] ; then
    echo "iconvContent $PWD $1"
    cp -p "$1" "$1.bak.8859-1"

    # Only overwrite the original if iconv succeeded. Otherwise a
    # failed conversion would leave a truncated file behind.
    if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmpf" ; then
      # iconv does not preserve file attributes.
      # chmod --reference  ... does not work on my Linux version.
      cat "$tmpf" > "$1"      # this preserves the file attributes of $1
    fi
    rm -f "$tmpf"
  fi
}

# Change encoding of a file or directory name.
# But only if the encoding has not already been changed.
function iconvName {
  if [[ -e $1 && ! -f $1$eoncExtension ]] ; then
    # printf passes the name on unchanged, unlike an unquoted echo.
    local target
    target=$(printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8)

    # Skip the name if iconv failed or if nothing would change.
    if [[ -n $target && $1 != "$target" ]] ; then
      echo "iconvName $PWD $1  --> $target"
      mv -- "$1" "$target"

      # Create a .eonc file as a marker that the encoding is
      # already changed.
      touch -- "$target$eoncExtension"
    fi
  fi
}

# Check if the file is not one of those that I've created myself 
# as backup or marker or temp file.
function isSelfCreatedFile {
  local bak='.*\.bak.*'
  local eonc='.*\.eonc'
  local tmpre=".*tmp\..*"
  local f=$1
  local self=1
  if [[ $f =~ $bak || $f =~ $eonc || $f =~ $tmpre ]] ; then   
    self=0
  fi
  return $self
}

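function iconvNamesInDir {
  # Read NUL-separated names so that blanks in names do no harm.
  local f
  while IFS= read -r -d '' f ; do
    if [[ -e $f ]] ; then
      if ! isSelfCreatedFile "$f" ; then
        iconvName "$f"
      fi
    fi
  done < <(find . -maxdepth 1 -print0)
}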

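function iconvContentInDir {
  # Read NUL-separated names so that blanks in names do no harm.
  local f
  while IFS= read -r -d '' f ; do
    if [[ -s $f ]] ; then
      if ! isSelfCreatedFile "$f" ; then
        iconvContent "$f"
      fi
    fi
  done < <(find . -maxdepth 1 -type f -print0)
}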

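function iconvCurrentDir {
  echo "iconvCurrentDir  $PWD"

  iconvNamesInDir
  iconvContentInDir

  local f
  while IFS= read -r -d '' f ; do
    if [[ $f != . ]] ; then
      # The subshell returns to the parent directory by itself.
      # If cd fails, the directory is skipped instead of
      # processing the current directory again and again.
      ( cd "$f" && iconvCurrentDir )
    fi
  done < <(find . -maxdepth 1 -type d -print0)
}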

if [ "$1" == "run" ] ;  then 
  iconvCurrentDir
else 
  echo "USAGE: iconv-multi run"
  echo "  This script changes the encoding of all filenames, "
  echo "  directory names and all file content in and below the "
  echo "  current directory from ISO 8859-1 to UTF-8."
  echo "  Running it twice does not hurt because the script "
  echo "  creates backup and marker files to detect which "
  echo "  files already have been encoded."
  echo "  (c) 2014 by Andreas Wicker."
fi
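
The .bak.8859-1 backups also make it possible to undo the content conversion if something went wrong. The following is only a sketch of how that could look, not part of the script above. The function name restoreContent is hypothetical, and the sketch puts back file content only; it does not undo the renaming of files and directories.

```bash
#!/bin/bash

# Copies every *.bak.8859-1 backup below a directory (default: the
# current one) back over the file it was made from.
# Hypothetical helper for illustration, not part of iconv-multi.
restoreContent() {
  local bak orig
  while IFS= read -r -d '' bak ; do
    orig=${bak%.bak.8859-1}       # strip the backup extension
    if [[ -f $orig ]] ; then
      # cp -p also restores mode and timestamps from the backup.
      cp -p -- "$bak" "$orig"
    fi
  done < <(find "${1:-.}" -type f -name '*.bak.8859-1' -print0)
}
```

The backups themselves stay in place, so the script still treats the restored files as already converted and will not transcode them again until the backups have been removed.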