0000588: Support of string.latin for character conversion

Notes
(0000675) Falk Reichbott (administrator) 2014-11-07 10:07	String.latin also defines a dedicated case folding with capitals. This differs from the Unicode case folding specification. To support string.latin, it must therefore also be possible to define a custom case folding.

(0000676) Falk Reichbott (administrator) 2014-11-07 10:11	The string.latin specification defines dedicated transliterations for each code point, separately for different code pages. These definitions are made per code page rather than at code point level, but a plausibility check has verified that all defined transliterations are equivalent at code point level.

(0000677) Falk Reichbott (administrator) 2014-11-07 10:15 edited on: 2014-11-13 11:02	Support for freely defined subsets in character conversion is available as of version 5.1.1 of FLAM. The solution was implemented via the user table definition. A sample user table (FLUTNPAS) for string.latin is now part of the installation package. Below you can find the manpage from our user manual.

MANPAGE FOR CNVCHR.USRTAB

A user substitution table (USRTAB) has the syntax listed below:
----------------------------------------------------------------------
usr_tab_file        -> usr_definition_list
usr_definition_list -> usr_definition usr_definition_list
                    |  EMPTY
usr_definition      -> '(' CODEPOINT '=' code_point_list ')'
                    |  '(' '+' CODEPOINT '=' code_point_list ')'
                    |  '(' '*' CODEPOINT '=' code_point_list ')'
                    |  '(' '^' CODEPOINT '=' code_point_list ')'
                    |  '(' CODEPOINT ')'
                    |  '(' '+' CODEPOINT ')'
                    |  '(' '*' CODEPOINT ')'
                    |  '(' '-' CODEPOINT ')'
                    |  '(' '-' CODEPOINT '-' CODEPOINT ')'
                    |  '(' '+' CODEPOINT '-' CODEPOINT ')'
                    |  '(' CODEPOINT '-' CODEPOINT ')'
code_point_list     -> CODEPOINT '/' code_point_list
                    |  CODEPOINT ',' code_point_list
                    |  CODEPOINT ';' code_point_list
                    |  CODEPOINT
----------------------------------------------------------------------
Note: In the description below "/" is used as code point separator.

The optional sign of the initial code point defines the kind of code point definition:
----------------------------------------------------------------------
'+' - a valid transliteration (only if CP not in the target set)
'*' - a valid mapping (this translation is always done)
'^' - a case mapping/folding (only if case=usrtab activated)
'-' - an invalid definition (removes CP for subset definitions)
----------------------------------------------------------------------
If no sign is used, then a transliteration for a valid code point is defined. Valid code points are accepted; invalid code points result in error handling depending on the mode (STOP, IGNORE, SUBSTITUTE).
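As an illustration of the sign rules above, the following Python sketch classifies a single definition by its optional sign and extracts the initial code point. It is not FLAM code: the function name `classify_definition`, the kind labels, and the tolerance for whitespace after the sign are assumptions made for this example.

```python
import re

# Kinds taken from the sign table in the manpage; the label strings are
# illustrative and not part of FLAM.
SIGN_KINDS = {
    "": "transliteration",   # no sign: transliteration for a valid code point
    "+": "transliteration",  # only applied if CP is not in the target set
    "*": "mapping",          # always applied
    "^": "case",             # only applied if CASE=USRTAB is active
    "-": "invalid",          # removes CP from the subset
}

# "(" then an optional sign, then the hex digits of the initial code point.
# Allowing whitespace around the sign is an assumption of this sketch.
_HEAD = re.compile(r"\(\s*([+*^-]?)\s*([0-9A-Fa-f]+)")


def classify_definition(definition):
    """Return (kind, initial_code_point) for one '(...)' definition."""
    match = _HEAD.match(definition)
    if match is None:
        raise ValueError(f"not a user table definition: {definition!r}")
    sign, digits = match.groups()
    return SIGN_KINDS[sign], int(digits, 16)
```

For instance, `classify_definition("(*30=39)")` yields the kind `"mapping"` together with code point 0x30.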
Defining a '+' code point without a code point list (that is, without '=') is equivalent to defining a valid code point.

With an asterisk "*", you can define an enforced mapping for this code point. For example, this can be used to delete the character if no code point list is provided, or to always translate the character into another value. It can also be used to convert code points outside of a subset into this subset. In contrast to the transliteration (+), which is only applied if a character does not exist in the target encoding, the mapping is always applied to each character.

Invalid (-) code points can be used to deactivate a code point.

To activate (+) or deactivate (-) a range of code points, append a minus sign (-) and a second code point to the first one. The range goes from the smaller code point to the bigger one. For example:
----------------------------------------------------------------------
(-00-7F) deactivate all US-ASCII code points
(+39-30) activate all decimal digits
----------------------------------------------------------------------
The range operator (-) with the optional plus sign (+) can also be used to activate code points.

To define your own case mapping or folding (required for certain subsets), the sign '^' can be used. This mapping is only applied if the user table is activated for case mapping (CASE=USRTAB).

To define transliterations, substitutions or mappings, an assignment "=" of a code point list is required. The code point list can contain a maximum of 8 code points. If no code point list is specified, then the character will be ignored. A transliteration to itself simply activates this code point. The same is valid for mapping definitions.

A code point is a hexadecimal number representing a 21 bit UNICODE point.
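The range rules above can be modelled as operations on a set of valid code points. The Python sketch below is only an illustration under that assumption; `apply_range` is a hypothetical helper and says nothing about how FLAM stores its tables internally.

```python
def apply_range(valid, sign, first, last):
    """Activate or deactivate an inclusive range in the set `valid`.

    The range always runs from the smaller to the bigger code point, so
    (+39-30) and (+30-39) select the same code points. A '-' sign
    deactivates the range; '+' or no sign activates it.
    """
    low, high = min(first, last), max(first, last)
    points = range(low, high + 1)
    if sign == "-":
        valid.difference_update(points)
    else:
        valid.update(points)
    return valid


# Start with all US-ASCII code points valid, then apply the two
# definitions from the example in the order they are written.
valid = set(range(0x80))
apply_range(valid, "-", 0x00, 0x7F)  # (-00-7F) deactivate all US-ASCII
apply_range(valid, "+", 0x39, 0x30)  # (+39-30) activate all decimal digits
digits = "".join(chr(cp) for cp in sorted(valid))
```

After both steps only the ten decimal digits remain in `valid`.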
The valid range of a code point is:
----------------------------------------------------------------------
00000000 to 001FFFFF hex
----------------------------------------------------------------------
If a code point is not defined in the substitution table (USRTAB and/or SYSTAB), the appropriate SUBCHR will be used. If the SUBCHR is not defined, the substitution will not be executed. Depending on the MODE, a STOP or an IGNORE will be performed.

Text before and after the parentheses is treated as a comment. Between "(" and "=" or "/" or ")" you can write hex digits until the first non-hex digit is reached. Everything from the first non-hex digit up to the next separator is interpreted as a comment. Leading whitespace is ignored.
----------------------------------------------------------------------
REPLACE GERMAN SZ (00DF=0073/0073) WITH ss
THIS IS AN EXAMPLE FOR COMMENTS IN CODE POINT LIST:
REPLACE EURO MARK (20AC= 45#E / 55#U / 52#R / 4F#O) WITH EURO
REPLACE BOM MARK (FEFF=) WITH NOTHING
----------------------------------------------------------------------
Leading zeros are possible. If no hex value is written, then 0x00 is used.
----------------------------------------------------------------------
REPLACE GERMAN SZ (00DF=/) WITH 0x00 0x00
REPLACE EURO MARK (20AC=00000045/00055/000052/000004F) WITH EURO
----------------------------------------------------------------------
ATTENTION: Please don't use parentheses '()' or the operators in your comments.

To describe your own subsets, you can use a user table without a system table. The USRTAB can also be used to overwrite or add transliterations when a system substitution table (for example SYSTAB=ICONV) is used.

The transliteration works recursively: if one of the substitution code points is not in the target set, its own substitution is used instead, and so on.
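The comment and leading-zero rules for code point lists can be sketched in Python as follows. This is an illustration only, not the FLAM parser: `parse_code_point_list` is a hypothetical helper that receives the text between "=" and ")", and treating an entirely empty list as "no code points" (as in the BOM example) is an assumption of this sketch.

```python
import re

MAX_LIST_LENGTH = 8        # a code point list holds at most 8 code points
MAX_CODE_POINT = 0x1FFFFF  # 21 bit code point range

# Leading whitespace, then hex digits up to the first non-hex digit.
_HEX_PREFIX = re.compile(r"\s*([0-9A-Fa-f]*)")


def parse_code_point_list(text):
    """Parse the text between '=' and ')' into a list of code points."""
    if text.strip() == "":
        return []  # no list given: the character is dropped
    code_points = []
    for item in re.split(r"[/,;]", text):  # '/', ',' and ';' separate items
        digits = _HEX_PREFIX.match(item).group(1)
        # Everything after the hex prefix is a comment; no digits means 0x00.
        value = int(digits, 16) if digits else 0
        if value > MAX_CODE_POINT:
            raise ValueError(f"code point out of range: {item!r}")
        code_points.append(value)
    if len(code_points) > MAX_LIST_LENGTH:
        raise ValueError("code point list has more than 8 entries")
    return code_points
```

With this sketch, the commented list ` 45#E / 55#U / 52#R / 4F#O` and the zero-padded list `00000045/00055/000052/000004F` both parse to the code points of "EURO".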
Example of recursive definitions:
----------------------------------------------------------------------
REPLACE GERMAN OE (0000D6=004F/0045) WITH O E
REPLACE GERMAN SZ (0000DF=0073/0073) WITH s s
REPLACE EURO MARK (0020AC=00D6/00DF/52/4F) WITH OE SZ R O
----------------------------------------------------------------------
If you have a EURO sign in your text and convert it to 'Latin1', the resulting byte string will be 'D6DF524F'. On the other hand, if you convert it to 'ASCII', the byte string will be '4F457373524F'.

The mapping as described above is always applied, and it is also applied recursively, including to the result of a transliteration. For example, if you define the mapping (*30=39), the target data no longer contains any zero digit.

By replacing other code points recursively, you could easily cause infinite loops. To prevent this, the number of replacements and the length of a single replacement are each limited to a maximum of 64. Note that this could still result in a large expansion of data. Therefore, be careful when defining recursive substitutions.

A sample user table for the ICONV system table, which changes the transliteration of German umlauts to AE, OE or UE, is located in the SAMPLE directory under FLUTDEXL. Another sample user table called FLUTNPAS defines the 'string.latin' subset, which is mainly used for statutory reporting.

The definitions are evaluated in the open function in the order in which they are written. This means that if you define a mapping or a transliteration for code point X and later make this code point invalid, the code point is invalid. Conversely, you can first deactivate all code points, then activate your subset, and then define your transliterations or mappings.

All user table definitions are pre-calculated once at the beginning of execution to reduce the CPU time required for the conversion. This means that using a user table increases the effort in the open function but has no effect on CPU usage during the actual conversion in the run function.
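The Latin1 and ASCII results quoted in the note above can be reproduced with a small Python sketch of recursive transliteration. It illustrates the described behaviour only: the function `transliterate`, the dictionary representation of the table, and the way the limit of 64 is applied here (as a nesting depth) are assumptions, not the FLAM implementation.

```python
MAX_DEPTH = 64  # the manpage states a limit of 64; applying it as depth is assumed

# The three definitions from the example, keyed by the initial code point.
USER_TABLE = {
    0x00D6: [0x004F, 0x0045],                  # OE   -> O E
    0x00DF: [0x0073, 0x0073],                  # SZ   -> s s
    0x20AC: [0x00D6, 0x00DF, 0x0052, 0x004F],  # EURO -> OE SZ R O
}


def transliterate(code_point, table, in_target, depth=0):
    """Expand `code_point` until every result is inside the target set."""
    if in_target(code_point):
        return [code_point]
    if depth >= MAX_DEPTH:
        raise ValueError("replacement limit reached, table is probably cyclic")
    if code_point not in table:
        raise LookupError(f"no transliteration for U+{code_point:04X}")
    result = []
    for substitute in table[code_point]:
        result.extend(transliterate(substitute, table, in_target, depth + 1))
    return result


def is_latin1(code_point):
    return code_point <= 0xFF


def is_ascii(code_point):
    return code_point <= 0x7F


latin1_hex = bytes(transliterate(0x20AC, USER_TABLE, is_latin1)).hex().upper()
ascii_hex = bytes(transliterate(0x20AC, USER_TABLE, is_ascii)).hex().upper()
```

Here `latin1_hex` is 'D6DF524F' and `ascii_hex` is '4F457373524F', matching the byte strings given in the manpage. A cyclic table such as `{1: [2], 2: [1]}` ends in the `ValueError` instead of looping forever.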

(0000678) Falk Reichbott (administrator) 2014-11-07 10:20	We also plan to support string.latin as a system table (SYSTAB=STRING.LATIN). This makes it easier to use with the byte and record interfaces in applications, because no additional file is required to ensure that the character data is handled as defined in the string.latin specification.

Issue History
Date Modified	Username	Field	Change
2014-11-07 10:05	Falk Reichbott	New Issue
2014-11-07 10:05	Falk Reichbott	Status	new => assigned
2014-11-07 10:05	Falk Reichbott	Assigned To	=> Falk Reichbott
2014-11-07 10:07	Falk Reichbott	Note Added: 0000675
2014-11-07 10:11	Falk Reichbott	Note Added: 0000676
2014-11-07 10:15	Falk Reichbott	Note Added: 0000677
2014-11-07 10:15	Falk Reichbott	Status	assigned => resolved
2014-11-07 10:15	Falk Reichbott	Fixed in Version	=> 5.1
2014-11-07 10:15	Falk Reichbott	Resolution	open => fixed
2014-11-07 10:16	Falk Reichbott	Summary	Support for string.latin subset => Support of string.latin for character conversion
2014-11-07 10:20	Falk Reichbott	Note Added: 0000678
2014-11-13 11:02	Falk Reichbott	Note Edited: 0000677
2015-04-15 15:36	Falk Reichbott	Fixed in Version	5.1 => 5.1.02

Relationships