FLAM Issue Tracker - FL5
View Issue Details
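ID: 0000588
Project: FL5
Category: 2.2 Subprogram FLUC (CONV)
View Status: public
Date Submitted: 2014-11-07 10:05
Last Update: 2015-04-15 15:36
Reporter: Falk Reichbott
Assigned To: Falk Reichbott
Priority: normal
Severity: feature
Reproducibility: N/A
Status: resolved
Resolution: fixed
Platform: General
OS: General
OS Version: General
Product Version: 5.0
Target Version: 5.1
Fixed in Version: 5.1.02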
0000588: Support of string.latin for character conversion
In Germany and also in the European Union, a Unicode subset (Unicode for Latin) was specified and must be used for statutory reporting and other purposes. It would be very helpful to have support for this and other kinds of subsets in the character conversion module of FLAM.
No tags attached.
Issue History
2014-11-07 10:05  Falk Reichbott  New Issue
2014-11-07 10:05  Falk Reichbott  Status: new => assigned
2014-11-07 10:05  Falk Reichbott  Assigned To: => Falk Reichbott
2014-11-07 10:07  Falk Reichbott  Note Added: 0000675
2014-11-07 10:11  Falk Reichbott  Note Added: 0000676
2014-11-07 10:15  Falk Reichbott  Note Added: 0000677
2014-11-07 10:15  Falk Reichbott  Status: assigned => resolved
2014-11-07 10:15  Falk Reichbott  Fixed in Version: => 5.1
2014-11-07 10:15  Falk Reichbott  Resolution: open => fixed
2014-11-07 10:16  Falk Reichbott  Summary: Support for string.latin subset => Support of string.latin for character conversion
2014-11-07 10:20  Falk Reichbott  Note Added: 0000678
2014-11-13 11:02  Falk Reichbott  Note Edited: 0000677
2015-04-15 15:36  Falk Reichbott  Fixed in Version: 5.1 => 5.1.02

Notes
(0000675)
Falk Reichbott   
2014-11-07 10:07   
String.latin also defines a dedicated case folding with capitals. This differs from the Unicode case folding specification. To support string.latin, it must therefore also be possible to define one's own case folding.
(0000676)
Falk Reichbott   
2014-11-07 10:11   
The string.latin specification defines, for different code pages, dedicated transliterations for each code point. These definitions are made per code page and not at code point level, but we ran a plausibility check and verified that all defined transliterations are code point equivalent.
(0000677)
Falk Reichbott   
2014-11-07 10:15   
(edited on: 2014-11-13 11:02)
The freely defined subset support for character conversion is available with version 5.1.1 of FLAM. The solution was realised via the user table definition. A sample user table (FLUTNPAS) for string.latin is now part of the installation package. Below you can find the manpage from our user manual.

MANPAGE FOR CNVCHR.USRTAB

A user substitution table (USRTAB) has the syntax listed below:

----------------------------------------------------------------------
   usr_tab_file -> usr_definition_list
   usr_definition_list -> usr_definition usr_definition_list
                       | EMPTY
   usr_definition -> '(' CODEPOINT '=' code_point_list ')'
                       | '(' '+' CODEPOINT '=' code_point_list ')'
                       | '(' '*' CODEPOINT '=' code_point_list ')'
                       | '(' '^' CODEPOINT '=' code_point_list ')'
                       | '(' CODEPOINT ')'
                       | '(' '+' CODEPOINT ')'
                       | '(' '*' CODEPOINT ')'
                       | '(' '-' CODEPOINT ')'
                       | '(' '-' CODEPOINT '-' CODEPOINT ')'
                       | '(' '+' CODEPOINT '-' CODEPOINT ')'
                       | '(' CODEPOINT '-' CODEPOINT ')'
   code_point_list -> CODEPOINT '/' code_point_list
                       | CODEPOINT ',' code_point_list
                       | CODEPOINT ';' code_point_list
                       | CODEPOINT
----------------------------------------------------------------------

Note: In the description below "/" is used as code point separator.

The optional sign before the initial code point defines the kind of code
point definition:

----------------------------------------------------------------------
   '+' - a valid transliteration (only if CP not in the target set)
   '*' - a valid mapping (this translation is always done)
   '^' - a case mapping/folding (only if case=usrtab activated)
   '-' - an invalid definition (removes CP for subset definitions)
----------------------------------------------------------------------

If no sign is used, a transliteration for a valid code point is
defined. Valid code points are accepted; invalid code points result in
error handling depending on the mode (STOP, IGNORE, SUBSTITUTE). A '+'
definition without a code point list, and thus without '=', simply
defines the code point as valid. With an asterisk "*", you can define an
enforced mapping for a code point. For example, this can be used to
delete the character if no code point list is provided, or to always
translate the character into another value. It can also be used to
convert code points outside of a subset into the subset.

In contrast to the transliteration (+), which is only done if a
character doesn't exist in the target encoding, the mapping (*) is
always done for each character. Invalid (-) code points can be used to
deactivate a code point. To activate (+) or deactivate (-) a range of
code points, write a minus sign (-) followed by a second code point
after the first one. The range goes from the smaller code point to the
bigger one. For example:

----------------------------------------------------------------------
   (-00-7F) deactivate all US-ASCII code points
   (+39-30) activate all decimal digits
----------------------------------------------------------------------
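
As a rough illustration of the range semantics (not FLAM's actual
implementation), the following Python sketch applies range definitions
to a set of valid code points. The function name apply_range and the
set-based model are assumptions made for this example only.

```python
def apply_range(valid, sign, first, second):
    """Activate ('+' or no sign) or deactivate ('-') an inclusive range.

    The bounds may be given in either order; the range always runs from
    the smaller code point to the bigger one.
    """
    low, high = min(first, second), max(first, second)
    points = set(range(low, high + 1))
    if sign == "-":
        return valid - points
    return valid | points


valid = set(range(0x00, 0x100))               # assume Latin-1 is valid at first
valid = apply_range(valid, "-", 0x00, 0x7F)   # (-00-7F)
valid = apply_range(valid, "+", 0x39, 0x30)   # (+39-30), bounds reversed
print(sorted(cp for cp in valid if cp < 0x80))
```

After both definitions, the only code points below 0x80 that remain
valid in this model are the decimal digits 0x30 to 0x39.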

The range operator (-) with the optional plus sign (+) can also be used
to activate code points.

To define your own case mapping or folding (required for certain
subsets), the sign '^' can be used. This mapping is only done if the
user table is activated for case mapping (CASE=USRTAB).

To define transliterations/substitutions/mappings, an assignment "=" of
a code point list is required. The code point list can contain a maximum
of 8 code points. If no code point list is specified, then the character
will be ignored. A transliteration to itself simply activates this code
point. The same is valid for mapping definitions. A code point is a
hexadecimal number representing a 21-bit Unicode code point.

----------------------------------------------------------------------
   00000000 to 001FFFFF hex
----------------------------------------------------------------------

If a code point is not defined in the substitution table (USRTAB and/or
SYSTAB), the appropriate SUBCHR will be used. If the SUBCHR is not
defined, the substitution will not be executed. Depending on the MODE, a
STOP or an IGNORE will be performed.

Text before and after the parentheses is treated as a comment.

Between "(" and "=" or "/" or ")" you can define hex digits until the
first non-hex digit is reached. All non-hex digits up to the next
separator are interpreted as comment. Leading whitespace is ignored.

-----------------------------------------------------------------------
   REPLACE GERMAN SZ (00DF=0073/0073) WITH ss
 THIS IS AN EXAMPLE FOR COMMENTS IN CODE POINT LIST:
   REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
   REPLACE BOM MARK (FEFF=) WITH NOTHING
-----------------------------------------------------------------------

Leading zeros are possible. If no hex value is written, then 0x00 is
used.

-----------------------------------------------------------------------
   REPLACE GERMAN SZ (00DF=/) WITH 0x00 0x00
   REPLACE EURO MARK (20AC=00000045/00055/000052/000004F) WITH EURO
-----------------------------------------------------------------------
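
The comment and default rules above can be sketched as a small parser.
This is only a minimal illustration under stated assumptions, not the
FLAM parser: the names parse_code_point, parse_definition and
parse_table are hypothetical, and range definitions are deliberately
left out.

```python
import re

SIGNS = "+*^-"


def parse_code_point(text):
    """Read leading hex digits; everything after them is a comment."""
    digits = re.match(r"\s*([0-9A-Fa-f]*)", text).group(1)
    return int(digits, 16) if digits else 0x00   # no hex value means 0x00


def parse_definition(body):
    """Parse the text between '(' and ')' (ranges are not handled here).

    Returns (sign, code_point, replacement). The replacement is None if
    there is no '=', otherwise a (possibly empty) list of code points.
    """
    body = body.strip()
    sign = ""
    if body and body[0] in SIGNS:
        sign, body = body[0], body[1:]
    if "=" not in body:
        if "-" in body:
            raise NotImplementedError("range definitions are out of scope")
        return sign, parse_code_point(body), None
    left, right = body.split("=", 1)
    if not right.strip():
        return sign, parse_code_point(left), []  # empty list: delete character
    items = re.split(r"[/,;]", right)
    return sign, parse_code_point(left), [parse_code_point(i) for i in items]


def parse_table(text):
    """Yield all definitions; text outside the parentheses is comment."""
    for body in re.findall(r"\(([^()]*)\)", text):
        yield parse_definition(body)


sample = """
   REPLACE GERMAN SZ (00DF=0073/0073) WITH ss
   REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
   REPLACE GERMAN SZ (00DF=/) WITH 0x00 0x00
"""
for definition in parse_table(sample):
    print(definition)
```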

ATTENTION: Please don't use parentheses '()' or the operators in your
           comments.

To describe your own subsets, you can use a user table without a system
table. The USRTAB can also be used to overwrite or add transliterations
when a system substitution table (for example SYSTAB=ICONV) is used. The
transliteration works recursively: if one of the substitution code
points is not in the target set, its own substitution is used instead,
and so on.

-----------------------------------------------------------------------
   REPLACE GERMAN OE (0000D6=004F/0045) WITH O E
   REPLACE GERMAN SZ (0000DF=0073/0073) WITH s s
   REPLACE EURO MARK (0020AC=00D6/00DF/52/4F) WITH OE SZ R O
-----------------------------------------------------------------------

If you have a euro sign in your text and convert it to 'Latin1', the
resulting byte string will be 'D6DF524F'. On the other hand, if you
convert it to 'ASCII', the byte string will be '4F457373524F'.
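
The recursive behaviour can be reproduced with a short Python sketch.
It is a simplified model, not FLAM code: the function transliterate,
the dictionary layout and the two target predicates are assumptions
chosen for this example, and no loop protection is included.

```python
def transliterate(code_point, table, in_target):
    """Expand a code point until every result exists in the target set."""
    if in_target(code_point):
        return [code_point]
    if code_point not in table:
        raise ValueError(f"no transliteration for U+{code_point:04X}")
    result = []
    for replacement in table[code_point]:
        result.extend(transliterate(replacement, table, in_target))
    return result


def is_latin1(code_point):
    return code_point <= 0xFF


def is_ascii(code_point):
    return code_point <= 0x7F


def to_hex(code_points):
    return "".join(f"{cp:02X}" for cp in code_points)


table = {
    0x00D6: [0x004F, 0x0045],               # (0000D6=004F/0045)
    0x00DF: [0x0073, 0x0073],               # (0000DF=0073/0073)
    0x20AC: [0x00D6, 0x00DF, 0x52, 0x4F],   # (0020AC=00D6/00DF/52/4F)
}

print(to_hex(transliterate(0x20AC, table, is_latin1)))   # D6DF524F
print(to_hex(transliterate(0x20AC, table, is_ascii)))    # 4F457373524F
```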

The mapping as described above is always done, and it is also applied
recursively to the result of a transliteration. This means that if you
define the mapping (*30=39), the target data no longer contains any
zero digit.
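
The difference between an enforced mapping and a transliteration might
be modelled as follows. This is a sketch under assumptions: the
function convert, the two dictionaries and the transliteration of
U+2070 (superscript zero) are hypothetical and only serve to show that
a mapping is also applied to a transliteration result.

```python
def convert(cp, mappings, translits, in_target, depth=0):
    """Mappings ('*') always apply; transliterations only when needed."""
    if depth > 64:
        raise RuntimeError("too many nested replacements")
    if cp in mappings and mappings[cp] != [cp]:
        replacement = mappings[cp]          # enforced, even for valid targets
    elif in_target(cp):
        return [cp]
    elif cp in translits:
        replacement = translits[cp]         # only because cp is not encodable
    else:
        raise ValueError(f"U+{cp:04X} is not convertible")
    out = []
    for nxt in replacement:
        out.extend(convert(nxt, mappings, translits, in_target, depth + 1))
    return out


def is_ascii(cp):
    return cp <= 0x7F


mappings = {0x30: [0x39]}        # (*30=39)
translits = {0x2070: [0x30]}     # hypothetical: superscript zero to digit zero

print(convert(0x30, mappings, translits, is_ascii))     # [57]
print(convert(0x2070, mappings, translits, is_ascii))   # [57]
```

In this model both the plain zero and the transliterated superscript
zero end up as 0x39, so no zero digit survives in the output.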

By replacing other code points recursively, you could easily cause
infinite loops. To prevent this, the number of replacements and the
length of one replacement are each limited to a maximum of 64. Note that
this could still result in a large expansion of data. Therefore, be
careful about defining recursive substitutions.
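
One way such a limit can stop a looping table is sketched below. How
FLAM counts replacements internally is not described here, so the
counting rule, the constants and the function expand_bounded are
assumptions made for illustration.

```python
MAX_REPLACEMENTS = 64   # assumed to correspond to the documented maximum
MAX_LENGTH = 64         # assumed limit for the length of one replacement


def expand_bounded(code_point, table, in_target):
    """Expand one code point, giving up when a limit is exceeded."""
    pending = [code_point]      # code points still to be checked, in order
    output = []
    replacements = 0
    while pending:
        current = pending.pop(0)
        if in_target(current):
            output.append(current)
        elif current in table:
            replacements += 1
            if replacements > MAX_REPLACEMENTS:
                raise RuntimeError("replacement limit exceeded (loop?)")
            pending = list(table[current]) + pending
        else:
            raise ValueError(f"no transliteration for U+{current:04X}")
        if len(output) + len(pending) > MAX_LENGTH:
            raise RuntimeError("replacement became too long")
    return output


def is_ascii(cp):
    return cp <= 0x7F


# Two hypothetical entries that reference each other and never terminate.
looping = {0x20AC: [0x20AD], 0x20AD: [0x20AC]}
try:
    expand_bounded(0x20AC, looping, is_ascii)
except RuntimeError as err:
    print("stopped:", err)
```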

A sample user table for the ICONV system table to change the
transliteration of German umlauts to AE, OE or UE is located in the
SAMPLE directory under FLUTDEXL. Another sample user table called
FLUTNPAS defines the 'string.latin' subset which is mainly used for
statutory reporting.

The order of the definitions is the order in which they are evaluated in
the open function. This means that if you define a mapping or a
transliteration for code point X and later make this code point invalid,
then the code point is invalid. Conversely, you can first deactivate all
code points and then activate your subset and define your
transliterations or mappings.
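
The effect of the definition order can be sketched like this. The
tuple layout and the function build_table are assumptions for this
example; the sketch only models that a later definition overrides an
earlier one for the same code point.

```python
def build_table(definitions):
    """Apply definitions in file order; later ones override earlier ones."""
    valid = set()
    replacements = {}
    for kind, first, last, replacement in definitions:
        for cp in range(min(first, last), max(first, last) + 1):
            if kind == "invalid":
                valid.discard(cp)
                replacements.pop(cp, None)
            else:
                valid.add(cp)
                if replacement is not None:
                    replacements[cp] = replacement
    return valid, replacements


definitions = [
    ("invalid", 0x0000, 0xFFFF, None),       # (-0000-FFFF) deactivate all
    ("valid", 0x20, 0x7E, None),             # (+20-7E) activate the subset
    ("valid", 0xDF, 0xDF, [0x73, 0x73]),     # (+00DF=0073/0073)
    ("invalid", 0x24, 0x24, None),           # (-24) later definition wins
]
valid, replacements = build_table(definitions)
print(0x24 in valid, 0x41 in valid, replacements)
```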

All user table definitions are pre-calculated once at the beginning of
execution to reduce the CPU time required for the conversion. This means
that using a user table increases the effort in the open function but
has no effect on the CPU usage of the actual conversion in the run
function.
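
The idea of resolving everything at open time can be sketched as
follows. The class Converter and its methods are hypothetical and do
not mirror the FLAM interfaces; error handling for undefined code
points is omitted, so such characters simply pass through unchanged.

```python
class Converter:
    """Resolve the whole user table once, then convert with plain lookups."""

    def __init__(self, table, in_target):
        # "open" step: every entry is expanded recursively exactly once.
        self._in_target = in_target
        self._resolved = {cp: self._resolve(cp, table, 0) for cp in table}

    def _resolve(self, cp, table, depth):
        if self._in_target(cp):
            return (cp,)
        if cp not in table or depth >= 64:
            raise ValueError(f"cannot resolve U+{cp:04X}")
        out = ()
        for nxt in table[cp]:
            out += self._resolve(nxt, table, depth + 1)
        return out

    def run(self, text):
        # "run" step: no recursion, one dictionary lookup per character.
        out = []
        for ch in text:
            cp = ord(ch)
            out.extend(self._resolved.get(cp, (cp,)))
        return "".join(chr(cp) for cp in out)


table = {
    0x00D6: [0x004F, 0x0045],
    0x00DF: [0x0073, 0x0073],
    0x20AC: [0x00D6, 0x00DF, 0x52, 0x4F],
}
conv = Converter(table, lambda cp: cp <= 0x7F)
print(conv.run("5 \u20ac"))   # 5 OEssRO
```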

(0000678)
Falk Reichbott   
2014-11-07 10:20   
We plan to support string.latin also as a system table (SYSTAB=STRING.LATIN). This makes it easier to use with the byte and record interfaces in applications, because no additional file is required to ensure that the character data is handled as defined in the string.latin specification.