0000588: Support of string.latin for character conversion

FLAM Issue Tracker - FL5
View Issue Details

ID	Project	Category	View Status	Date Submitted	Last Update
0000588	FL5	2.2 Subprogram FLUC (CONV)	public	2014-11-07 10:05	2015-04-15 15:36

Reporter	Falk Reichbott
Assigned To	Falk Reichbott
Priority	normal	Severity	feature	Reproducibility	N/A
Status	resolved	Resolution	fixed
Platform	General	OS	General	OS Version	General
Product Version	5.0
Target Version	5.1	Fixed in Version	5.1.02

Summary	0000588: Support of string.latin for character conversion
Description	In Germany and also in the European Union, a UNICODE subset (Unicode for Latin) was specified and must be used for statutory reporting and other purposes. It would be very helpful to have support in the character conversion module of FLAM for this and other kinds of subsets.
Steps To Reproduce
Additional Information
Tags	No tags attached.
Relationships
Attached Files

Issue History
Date Modified	Username	Field	Change
2014-11-07 10:05	Falk Reichbott	New Issue
2014-11-07 10:05	Falk Reichbott	Status	new => assigned
2014-11-07 10:05	Falk Reichbott	Assigned To	=> Falk Reichbott
2014-11-07 10:07	Falk Reichbott	Note Added: 0000675
2014-11-07 10:11	Falk Reichbott	Note Added: 0000676
2014-11-07 10:15	Falk Reichbott	Note Added: 0000677
2014-11-07 10:15	Falk Reichbott	Status	assigned => resolved
2014-11-07 10:15	Falk Reichbott	Fixed in Version	=> 5.1
2014-11-07 10:15	Falk Reichbott	Resolution	open => fixed
2014-11-07 10:16	Falk Reichbott	Summary	Support for string.latin subset => Support of string.latin for character conversion
2014-11-07 10:20	Falk Reichbott	Note Added: 0000678
2014-11-13 11:02	Falk Reichbott	Note Edited: 0000677	bug_revision_view_page.php?bugnote_id=677#r192
2015-04-15 15:36	Falk Reichbott	Fixed in Version	5.1 => 5.1.02

Notes

Support for freely defined subsets in character conversion is available with version 5.1.1 of FLAM. The solution was implemented via the user table definition. A sample user table (FLUTNPAS) for string.latin is now part of the installation package. Below you can find the manpage from our user manual.

MANPAGE FOR CNVCHR.USRTAB

A user substitution table (USRTAB) has the syntax listed below:

----------------------------------------------------------------------
   usr_tab_file -> usr_definition_list
   usr_definition_list -> usr_definition usr_definition_list
                       | EMPTY
   usr_definition -> '(' CODEPOINT '=' code_point_list ')'
                       | '(' '+' CODEPOINT '=' code_point_list ')'
                       | '(' '*' CODEPOINT '=' code_point_list ')'
                       | '(' '^' CODEPOINT '=' code_point_list ')'
                       | '(' CODEPOINT ')'
                       | '(' '+' CODEPOINT ')'
                       | '(' '*' CODEPOINT ')'
                       | '(' '-' CODEPOINT ')'
                       | '(' '-' CODEPOINT '-' CODEPOINT ')'
                       | '(' '+' CODEPOINT '-' CODEPOINT ')'
                       | '(' CODEPOINT '-' CODEPOINT ')'
   code_point_list -> CODEPOINT '/' code_point_list
                       | CODEPOINT ',' code_point_list
                       | CODEPOINT ';' code_point_list
                       | CODEPOINT
----------------------------------------------------------------------

Note: In the description below "/" is used as code point separator.

The optional sign before the initial code point defines the kind of
code point definition:

----------------------------------------------------------------------
   '+' - a valid transliteration (only if CP not in the target set)
   '*' - a valid mapping (this translation is always done)
   '^' - a case mapping/folding (only if case=usrtab activated)
   '-' - an invalid definition (removes CP for subset definitions)
----------------------------------------------------------------------

If no sign is used, then a transliteration for a valid code point is
defined. Valid code points are accepted; invalid code points result in
error handling depending on the mode (STOP, IGNORE, SUBSTITUTE).
Defining a '+' code point without a code point list, and thus without
'=', is equivalent to defining a valid code point. With an asterisk
"*", you can define an enforced mapping for a code point. For example,
this can be used to delete the character (if no code point list is
provided) or to always translate the character into another value. It
can also be used to convert code points outside of a subset into the
subset.
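
As a rough illustration of the sign rules above, the following Python
sketch classifies a definition by its leading sign. It is not part of
FLAM; the names SIGN_KINDS and classify are assumptions made for this
example only.

```python
# Hypothetical helper, not part of FLAM: maps the optional sign of a
# definition to the kind of definition described in the text.
SIGN_KINDS = {
    "+": "transliteration",
    "*": "mapping",
    "^": "case mapping",
    "-": "invalid",
}

def classify(definition):
    """Return the kind of a definition such as '(*30=39)'."""
    start = definition.index("(") + 1
    body = definition[start:].lstrip()
    # Without a sign, a transliteration for a valid code point is defined.
    return SIGN_KINDS.get(body[:1], "transliteration")

print(classify("(*30=39)"))          # mapping
print(classify("(00DF=0073/0073)"))  # transliteration
print(classify("(-00-7F)"))          # invalid
```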

In contrast to a transliteration (+), which is only done if a character
does not exist in the target encoding, a mapping (*) is always done for
each character. Invalid (-) definitions can be used to deactivate a
code point. To activate (+) or deactivate (-) a range of code points,
put a minus sign (-) followed by a second code point after the first
one. The range goes from the smaller code point to the bigger one. For
example:

----------------------------------------------------------------------
   (-00-7F) deactivate all US-ASCII code points
   (+39-30) activate all decimal digits
----------------------------------------------------------------------
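
The range rule can be sketched in Python as follows. This is only an
illustration under assumptions, not FLAM's implementation: the function
name parse_range and the returned tuple layout are hypothetical.

```python
import re

# Hypothetical pattern for range definitions like "(-00-7F)" or "(+39-30)".
_RANGE = re.compile(
    r"^\(\s*([+-]?)\s*([0-9A-Fa-f]+)\s*-\s*([0-9A-Fa-f]+)\s*\)$"
)

def parse_range(definition):
    """Return (active, first, last) for a range definition."""
    match = _RANGE.match(definition.strip())
    if match is None:
        raise ValueError(f"not a range definition: {definition!r}")
    sign, a, b = match.groups()
    # The range goes from the smaller code point to the bigger one,
    # regardless of the order in which the two are written.
    first, last = sorted((int(a, 16), int(b, 16)))
    return sign != "-", first, last

print(parse_range("(-00-7F)"))  # (False, 0, 127)
print(parse_range("(+39-30)"))  # (True, 48, 57)
```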

The plus sign (+) is optional here: the range operator (-) without a
leading sign also activates the code points in the range.

To define your own case mapping or folding (required for certain
subsets), the sign '^' can be used. This mapping is only done if the
user table is activated for case mapping (CASE=USRTAB).

To define transliterations, substitutions or mappings, an assignment
"=" of a code point list is required. The code point list can contain a
maximum of 8 code points. If no code point list is specified, then the
character will be ignored. A transliteration to itself simply activates
the code point. The same is valid for mapping definitions. A code point
is a hexadecimal number representing a 21-bit UNICODE code point.

----------------------------------------------------------------------
   00000000 to 001FFFFF hex
----------------------------------------------------------------------

If a code point is not defined in the substitution table (USRTAB and/or
SYSTAB), the appropriate SUBCHR will be used. If the SUBCHR is not
defined, the substitution will not be executed. Depending on the MODE, a
STOP or an IGNORE will be performed.

Text before and after the parentheses is treated as a comment.

Between "(" and "=" or "/" or ")" you can define hex digits until the
first non-hex character is reached. Everything from there up to the
next separator is interpreted as a comment. Leading whitespace is
ignored.

-----------------------------------------------------------------------
   REPLACE GERMAN SZ (00DF=0073/0073) WITH ss
THIS IS AN EXAMPLE FOR COMMENTS IN CODE POINT LIST:
   REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
   REPLACE BOM MARK (FEFF=) WITH NOTHING
-----------------------------------------------------------------------

Leading zeros are allowed. If no hex value is written, then 0x00 is
used.

-----------------------------------------------------------------------
   REPLACE GERMAN SZ (00DF=/) WITH 0x00 0x00
   REPLACE EURO MARK (20AC=00000045/00055/000052/000004F) WITH EURO
-----------------------------------------------------------------------

ATTENTION: Please don't use parentheses '()' or the operators in your
           comments.
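
The comment and default rules for assignments can be sketched in Python
as follows. This is a simplified reading of the rules above, not FLAM's
parser: it handles unsigned definitions only, the function names are
hypothetical, and treating a completely empty list as "replace with
nothing" is an assumption.

```python
import re

MAX_LIST_LENGTH = 8          # maximum code points per list, per the text
MAX_CODE_POINT = 0x1FFFFF    # 21-bit limit, per the text

def parse_code_point(field):
    """Read the leading hex digits; the rest of the field is a comment."""
    digits = re.match(r"\s*([0-9A-Fa-f]*)", field).group(1)
    value = int(digits, 16) if digits else 0x00  # no hex value: 0x00
    if value > MAX_CODE_POINT:
        raise ValueError(f"code point exceeds 21 bits: {digits}")
    return value

def parse_assignment(line):
    """Parse an unsigned '(CODEPOINT=list)' line into (source, targets)."""
    # Text before '(' and after ')' is a comment and is skipped.
    inner = line[line.index("(") + 1 : line.index(")")]
    left, _, right = inner.partition("=")
    # Assumption: an entirely empty list means "replace with nothing".
    fields = re.split(r"[/,;]", right) if right.strip() else []
    if len(fields) > MAX_LIST_LENGTH:
        raise ValueError("code point list longer than 8 entries")
    return parse_code_point(left), [parse_code_point(f) for f in fields]

line = "REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO"
source, targets = parse_assignment(line)
print(hex(source), [hex(t) for t in targets])
```

With this sketch, "(00DF=/)" yields two 0x00 targets, because each
empty field between separators defaults to 0x00.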

To describe your own subsets, you can use a user table without a system
table. The USRTAB can also be used to overwrite or add transliterations
when a system substitution table (for example SYSTAB=ICONV) is used. The
transliteration works recursively: if one of the substitution code
points is not in the target set, its own substitution is used instead,
and so on.

-----------------------------------------------------------------------
   REPLACE GERMAN OE (0000D6=004F/0045) WITH O E
   REPLACE GERMAN SZ (0000DF=0073/0073) WITH s s
   REPLACE EURO MARK (0020AC=00D6/00DF/52/4F) WITH OE SZ R O
-----------------------------------------------------------------------

If you have a EURO sign in your text and convert it to 'Latin1', the
resulting byte string will be 'D6DF524F'. On the other hand, if you
convert it to 'ASCII', the byte string will be '4F457373524F'.
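
The two results can be reproduced with a small Python sketch of the
recursive transliteration. This is an illustration, not FLAM code: the
names TRANSLIT and transliterate are hypothetical, the target sets are
approximated by simple range checks, and a missing substitution raises
an error instead of applying SUBCHR handling.

```python
# Transliterations from the example table above (source -> replacement).
TRANSLIT = {
    0xD6: [0x4F, 0x45],                # O umlaut -> O E
    0xDF: [0x73, 0x73],                # sharp s  -> s s
    0x20AC: [0xD6, 0xDF, 0x52, 0x4F],  # euro     -> OE SZ R O
}

def transliterate(code_point, in_target):
    """Expand code_point until every result is in the target set."""
    if in_target(code_point):
        return [code_point]
    if code_point not in TRANSLIT:
        raise ValueError(f"no substitution for U+{code_point:04X}")
    result = []
    for cp in TRANSLIT[code_point]:
        # A substitution that is itself outside the target set is
        # replaced by its own substitution, and so on.
        result.extend(transliterate(cp, in_target))
    return result

def latin1(cp):
    return cp <= 0xFF   # Latin-1 covers U+0000 to U+00FF

def ascii_(cp):
    return cp <= 0x7F   # US-ASCII covers U+0000 to U+007F

print(bytes(transliterate(0x20AC, latin1)).hex().upper())  # D6DF524F
print(bytes(transliterate(0x20AC, ascii_)).hex().upper())  # 4F457373524F
```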

The mapping described above is always done, and it is also applied
recursively, including to the result of a transliteration. This means
that if you define the mapping (*30=39), the target data no longer
contains any zero digit.
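
The difference between a mapping and a transliteration can be sketched
in Python as follows. This is a single-level illustration under
assumptions, not FLAM's implementation: the function convert is
hypothetical, and the transliteration of superscript zero (U+2070) to
'0' is an invented entry used only to show that the mapping is also
applied to a transliteration result.

```python
TRANSLIT = {0x2070: [0x30]}  # hypothetical: superscript zero -> '0'
MAPPING = {0x30: [0x39]}     # (*30=39): '0' -> '9', always

def convert(text, in_target):
    """Transliterate where needed, then map every resulting code point."""
    out = []
    for ch in text:
        cps = [ord(ch)]
        if not in_target(cps[0]):
            # Transliteration: only if the character is not in the target.
            cps = TRANSLIT.get(cps[0], cps)
        for cp in cps:
            # Mapping: always done, also for transliteration results.
            out.extend(MAPPING.get(cp, [cp]))
    return "".join(map(chr, out))

def is_ascii(cp):
    return cp <= 0x7F

print(convert("2014-10-30", is_ascii))  # 2914-19-39
print(convert("10\u2070", is_ascii))    # 199
```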

By replacing other code points recursively, you could easily cause
infinite loops. To prevent this, the number of replacements and the
length of one replacement are limited to a maximum of 64. Note that this
could still result in a large expansion of data. Therefore, be careful
when defining recursive substitutions.
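
A loop guard of this kind could look like the following Python sketch.
How FLAM actually counts replacements is not described here, so the
accounting below (one count per substituted code point) is an
assumption, as are the names expand and MAX_REPLACEMENTS. Code points
without a substitution are passed through unchanged for simplicity.

```python
MAX_REPLACEMENTS = 64  # limit named in the text; accounting is assumed

def expand(code_point, table, in_target):
    """Substitute iteratively until all code points are in the target."""
    pending, output, replacements = [code_point], [], 0
    while pending:
        cp = pending.pop(0)
        if in_target(cp) or cp not in table:
            output.append(cp)
            continue
        replacements += 1
        if replacements > MAX_REPLACEMENTS:
            raise RuntimeError("replacement limit exceeded")
        # Put the substitution in front so it is checked again.
        pending = list(table[cp]) + pending
    return output

# A self-referencing definition: U+0100 is replaced by itself plus 'x'.
looping = {0x100: [0x100, 0x78]}

try:
    expand(0x100, looping, lambda cp: cp <= 0xFF)
except RuntimeError as error:
    print(error)  # replacement limit exceeded
```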

A sample user table for the ICONV system table to change the
transliteration of German umlauts to AE, OE or UE is located in the
SAMPLE directory under FLUTDEXL. Another sample user table called
FLUTNPAS defines the 'string.latin' subset which is mainly used for
statutory reporting.

The definitions are evaluated in the open function in the order in
which they are written. This means that if you define a mapping or a
transliteration for code point X and later make this code point
invalid, then the code point is invalid. Conversely, you can first
deactivate all code points, then activate your subset, and then define
your transliterations or mappings.
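
The effect of the definition order can be sketched in Python with range
definitions only. This is an illustration under assumptions: the
function build_subset and the (sign, first, last) tuples are
hypothetical, and the set of valid code points starts out empty, as if
no system table were used.

```python
def build_subset(definitions):
    """Apply (sign, first, last) range definitions in the order given."""
    valid = set()
    for sign, first, last in definitions:
        lo, hi = sorted((first, last))
        block = set(range(lo, hi + 1))
        if sign == "-":
            valid -= block   # deactivate the range
        else:
            valid |= block   # activate the range
    return valid

# (-00-7F) followed by (+30-39): only the decimal digits remain valid.
digits_only = build_subset([("-", 0x00, 0x7F), ("+", 0x30, 0x39)])
# Reversed order: the later deactivation wins and nothing remains valid.
nothing = build_subset([("+", 0x30, 0x39), ("-", 0x00, 0x7F)])

print(sorted(chr(cp) for cp in digits_only))
print(len(nothing))  # 0
```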

All user table definitions are pre-calculated once at the beginning of
execution to reduce the CPU time required for the conversion. This
means that using a user table increases the effort in the open function
but has no effect on the CPU usage of the actual conversion in the run
function.
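
The split between open and run can be sketched in Python as follows.
This only illustrates the idea of resolving all substitutions once up
front; the names precompute and run, the depth limit handling and the
pass-through of undefined code points are assumptions, not FLAM's
implementation.

```python
def precompute(table, in_target, limit=64):
    """'Open' step: resolve every table entry once, recursively."""
    def resolve(cp, depth):
        if in_target(cp) or cp not in table:
            return [cp]
        if depth >= limit:
            raise RuntimeError("substitution nested too deeply")
        return [r for sub in table[cp] for r in resolve(sub, depth + 1)]
    return {cp: resolve(cp, 0) for cp in table}

def run(text, flat):
    """'Run' step: one dictionary lookup per character, no recursion."""
    return "".join(
        chr(cp) for ch in text for cp in flat.get(ord(ch), [ord(ch)])
    )

table = {
    0xD6: [0x4F, 0x45],
    0xDF: [0x73, 0x73],
    0x20AC: [0xD6, 0xDF, 0x52, 0x4F],
}
flat = precompute(table, lambda cp: cp <= 0x7F)  # target: US-ASCII
print(run("5 \u20ac", flat))  # 5 OEssRO
```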