Of course, validating this data with OpenCDISC 1.5 generates errors for these two values: “Value for ETHNIC not found in (ETHNIC) CT codelist” and “Value for RACE not found in (RACE) CT codelist”.
This modified terminology.txt is encoded in EUC-JP, but OpenCDISC 1.5 interprets delimited text files using the platform’s default text encoding. EUC-JP is not the default on Windows or Mac in the Japanese locale, so OpenCDISC 1.5 still reports the same errors, even though the terminology file now contains the values.
Try applying my patch above and specifying Engine.ControlledTerminology.FileEncoding = EUC-JP in the settings.properties file. After that, these errors disappear.
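For clarity, this is the shape of the settings.properties entry (the property name comes from the patch above; the exact file location may vary with your OpenCDISC installation):

```properties
# settings.properties — requires the patch above.
# Tell the engine which encoding the CT terminology text files use.
Engine.ControlledTerminology.FileEncoding = EUC-JP
```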
I’m testing OpenCDISC 1.5 with some datasets containing Japanese characters.
OpenCDISC 1.5 can handle Japanese values correctly when they are passed in Dataset-XML format (though the .xpt format does not seem to be supported yet).
But there is still a problem when validating datasets with non-ASCII characters. Controlled terminology checks are done by parsing text files in the config/data/CDISC/SDTM/yyyy-mm-dd/ folder, but there is no way to specify the character encoding for these terminology files. This is a barrier for users who want to check their data against localized terminologies.
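To illustrate why this matters, here is a minimal, hypothetical Java sketch (not OpenCDISC’s actual code) showing that the same EUC-JP bytes only round-trip correctly when the charset is named explicitly; decoding them with a non-matching default encoding garbles the value, so a codelist lookup against it fails:

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Decode raw terminology-file bytes using an explicitly named charset.
    static String decode(byte[] fileBytes, String charsetName) {
        return new String(fileBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String term = "日本人"; // a Japanese term as it might appear in terminology.txt
        // Bytes as they would be stored in an EUC-JP encoded file.
        byte[] fileBytes = term.getBytes(Charset.forName("EUC-JP"));

        // With the correct charset the value round-trips...
        System.out.println(term.equals(decode(fileBytes, "EUC-JP")));
        // ...but decoding with a non-matching default (e.g. UTF-8) garbles it,
        // so a controlled-terminology match on the decoded value fails.
        System.out.println(term.equals(decode(fileBytes, "UTF-8")));
    }
}
```

This is the same reason the validator reports “not found in CT codelist” even when the term is present in the file: the comparison happens on the mis-decoded string.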