Of course, validating this data with OpenCDISC 1.5 generates errors for these two values: “Value for ETHNIC not found in (ETHNIC) CT codelist” and “Value for RACE not found in (RACE) CT codelist”.
This modified terminology.txt is encoded in EUC-JP, but OpenCDISC 1.5 interprets delimited text files using the platform’s default text encoding. EUC-JP is not the default on Windows or Mac in the Japanese locale, so OpenCDISC 1.5 still reports the same errors, even though the terminology file now contains the values.
Try applying my patch above and specifying Engine.ControlledTerminology.FileEncoding = EUC-JP in the settings.properties file. After that, these errors disappear.
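For clarity, this is the shape of the settings.properties entry (the property name comes from the patch above; the exact file location may vary with your OpenCDISC installation):

```properties
# settings.properties — requires the patch above.
# Tell the engine which encoding the CT terminology text files use.
Engine.ControlledTerminology.FileEncoding = EUC-JP
```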
I’m testing OpenCDISC 1.5 with some datasets containing Japanese characters.
OpenCDISC 1.5 can handle Japanese values correctly when they are passed in Dataset-XML format (though the .xpt format does not seem to be supported yet).
But there is still a problem when validating datasets with non-ASCII characters. Controlled terminology checks are done by parsing text files in the config/data/CDISC/SDTM/yyyy-mm-dd/ folder, but there is no way to specify the character encoding for these terminology files. This is a barrier for users who want to check their data against localized terminologies.
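To illustrate why this matters, here is a minimal, hypothetical Java sketch (not OpenCDISC’s actual code) showing that the same EUC-JP bytes only round-trip correctly when the charset is named explicitly; decoding them with a non-matching default encoding garbles the value, so a codelist lookup against it fails:

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Decode raw terminology-file bytes using an explicitly named charset.
    static String decode(byte[] fileBytes, String charsetName) {
        return new String(fileBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String term = "日本人"; // a Japanese term as it might appear in terminology.txt
        // Bytes as they would be stored in an EUC-JP encoded file.
        byte[] fileBytes = term.getBytes(Charset.forName("EUC-JP"));

        // With the correct charset the value round-trips...
        System.out.println(term.equals(decode(fileBytes, "EUC-JP")));
        // ...but decoding with a non-matching default (e.g. UTF-8) garbles it,
        // so a controlled-terminology match on the decoded value fails.
        System.out.println(term.equals(decode(fileBytes, "UTF-8")));
    }
}
```

This is the same reason the validator reports “not found in CT codelist” even when the term is present in the file: the comparison happens on the mis-decoded string.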