
File path: {Installation Path}/DgSecure/IDPs/HDFSIDP/expandedArchive/WEB-INF/classes/com/dataguise/hadoop/util/

Following are the properties defined in the file which the user may need to configure:

  1. Tez Support: This property is used when you are using Tez to run your HDFS tasks. The default is set to ‘N’ meaning disabled. To enable, change ‘N’ to ‘Y’.
    When enabled, add the following parameter to point to the tez config file path.
    tezConfigPath=absolute path where tez-site.xml is located

    Example: tez=Y
    tezConfigPath=/home/conf/tez (tez-site.xml will be within this directory)

  2. Time-sync Properties: Used to correctly process incremental jobs

    1. cluster.useNTPPORT: Whether the NTP port is used. Set to Y to use the NTP port.

    2. cluster.namenode.ip: IP address of cluster name node.

    3. cluster.timediff.millisecs: Sets the time differential, in milliseconds.

  3. KeyStore Config Params: Sets the keystore parameters.

    1. keystore.fullpath: The path of the keystore file as defined in the code.

    2. keystore.type: Defines the type of keystore file (e.g., JCEKS or JKS).

  4. Dir Traversal Properties: Determines which Hadoop directories are displayed on the Hadoop Attribute Assignment screen.

    1. dir.traverse.level: Specifies which directory level is displayed. For a value of “1” (the default setting), the screen displays all root-level directories.

    2. dir.traverse.ignore: Filters out specific directories from being shown. By default, dataguise$, dgout, and dg$output are filtered out.

  5. Directory Settings: dg.output.parallel.dir: Used to define the output directory for masked files.

    1. Name of the DG meta dir.

    2. Parallel Directories

    3. Branch Points

  6. Suppress Real Values in Results File: results.suppressRealValues: Optionally suppresses sensitive values in the Samples tab on the Hadoop Results page.

    If this property is set to “Y”, the sample discovery results for each file in Hadoop will only contain the types and offsets of the sensitive data found in that file. The results will not contain the data that matched a sensitive type.

    If this flag is set to “N” (the default setting), the sample discovery results will contain actual sensitive data.

  7. Compress Results File: results.CompresstionType: Optionally compresses the results file in HDFS. By default, the property is set to ‘none’, which disables compression.

  8. Consistent Masking Engine Type: Sets the consistent masking engine. The value can be either ‘hbase’ or ‘couchbase’.

  9. Consistent Masking Server Details: This property defines the Couchbase server address. A single server or multiple servers can be defined. Use a comma-separated list to specify multiple addresses when a cluster is used.

  10. HBase Properties (Consistent Masking Parameters)

    1. - The HBase table that stores the consistent values. This must be configured with the Consistent Masking Coprocessor.

    2. hbase.zk.quorum - IP address of the ZooKeeper Quorum. Can be found in hbase-site.xml.

    3. - Port of the ZooKeeper Quorum. Can be found in hbase-site.xml.

  11. Couchbase Properties (Consistent Masking Parameters): These properties define Couchbase’s retry counter, retry time, and retry increment. The default setting for each is shown below:

    1. couchbaseRetryCounter=5

    2. couchbaseRetryTime=1000

    3. couchbaseIncrementRetryTime=1000

    4. multiplexKeys: This property limits the number of new entries created in Couchbase. If it is set to 'Y', Couchbase groups multiple values together into a single key. If it is set to 'N', it does not. By default, it is set to 'Y'.
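
The source does not say how the three retry properties interact. As a rough sketch only, assuming the retry time is a base wait in milliseconds and the increment is added once per further attempt (both are assumptions, and `retry_delays` is a hypothetical helper, not part of the product), the default values would produce this schedule:

```python
def retry_delays(retry_counter=5, retry_time=1000, increment=1000):
    """Return the assumed wait before each retry attempt.

    Assumption (not confirmed by the documentation): attempt k (0-based)
    waits retry_time + k * increment, with all values in milliseconds.
    """
    return [retry_time + k * increment for k in range(retry_counter)]


if __name__ == "__main__":
    # Defaults taken from couchbaseRetryCounter, couchbaseRetryTime,
    # and couchbaseIncrementRetryTime as listed above.
    print(retry_delays())
```

If the product applies the increment differently (for example, as a multiplier), the schedule would differ; check the observed behavior before relying on this.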

  12. Discovery Engine Settings: Determines whether a task that discovers or masks names will include names that are part of an address. If this property is set to “true”, a name that is part of an address (e.g., Francis St.) is not included in the results of a discovery or masking task.

    If this property is set to “false”, a name that is part of an address is included in the results of a discovery or masking task.

  13. MR Job Settings: This property is used when there are multiple queues defined on a cluster. Tasks are sent to specific queues based on PK Protect domains. By default, this property only has one queue listed “”

    To associate a specific domain with a specific queue, add queue.<queue name>=<domain name>

    Example: queue.first=sales
    Where queue name = first and domain name = sales.
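
To illustrate the `queue.<queue name>=<domain name>` format, here is a minimal sketch that reads such lines and builds a domain-to-queue lookup. The function name `parse_queue_mappings` is hypothetical, and the sketch assumes one domain per entry; it is not the IDP's own parser.

```python
def parse_queue_mappings(lines):
    """Build a {domain: queue} lookup from `queue.<queue name>=<domain name>` lines.

    Assumption: each entry maps exactly one domain to one queue.
    Blank lines, comments, and entries with an empty value are ignored.
    """
    domain_to_queue = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if sep and key.startswith("queue.") and value:
            queue_name = key[len("queue."):]
            domain_to_queue[value] = queue_name
    return domain_to_queue


if __name__ == "__main__":
    # The example from the text: queue "first" serves domain "sales".
    print(parse_queue_mappings(["queue.first=sales"]))
```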

  14. FPE Key Alias

  15. Keytab File Location: Three properties make up the keytab file location settings. They are only used when a CDH (Cloudera) cluster is configured with Kerberos. For a MapR cluster, all of these properties can be commented out.

    1. keytabLocation: This is the location of keytab file on your local system.

    2. kerberosTicketCache: This is the location of Kerberos ticket cache on local system.

    3. kinit.path: This is the location of kinit file. E.g., /usr/bin/kinit.

  16. Kinit File Path: This property defines the Absolute Path to the kinit file.

  17. DGUSER for Kerberos Environment

  18. Absolute Path to Directory containing Hadoop Config Files

  19. hadoopConfigPath: This property defines the Absolute Path to the directory containing hadoop config files like core-site, hdfs-site, and mapred-site.
    For Linux: /opt/mapr/conf.

  20. ACL Fetch

  21. acl.source: Set to HDFS to fetch the ACL from HDFS, or to controller to fetch it from the controller.

  22. Cluster ID associated with IDP in Admin:

  23. PKWARE Interval Scheduler
    workflow.status.interval: Length in seconds the PKWARE Scheduler waits before checking for job status of a given task.

  24. Define Authentication – Group
    preproc.suppressGroupPermissionsCheck: If set to “Y”, the HDFS IDP will delegate the permissions check to Hadoop. If set to “N”, the IDP will run the permissions check itself.

  25. Define Authentication – All
    preproc.suppressAllPermissionsCheck: If set to “Y”, all permissions checks are disabled for the browser and preprocessing (which may cause task failure due to insufficient privileges). If set to “N”, all permissions checks are enabled for the browser and preprocessing, so that all permissions are verified before the task is submitted for MapReduce.

  26. MapR Authentication Type: Set to kerberos for Kerberos authentication or PAM for PAM authentication. Leave blank for a non-secure cluster. It is blank by default.

  27. MapR Login Path: Absolute Path to directory containing maprlogin.

  28. Binary Files
    processTextFilesOnly: If set to “Y”, binary files will not be added to the scan path. If set to “N”, all files within the scan path will be added.

  29. Files Skipped by Name

    1. processTextFilesOnly: This property should be set to either “Y” or “N”. When set to “Y”, the system skips any binary files it detects in the scan path of a task run against text or Avro files. A log entry for the file will appear in the
      `hdfs://<host>:<port>/dataguise$/tmp/<taskName>/<taskInstanceID>/_skipped` file.

    2. skip.filename.regex: If a regular expression (regex) is specified, it is matched only against the file name of each file in the scan path. If the pattern specified by the regex appears in the file name, the file is skipped and a log entry is appended to `hdfs://<host>:<port>/dataguise$/tmp/<taskName>/<taskInstanceID>/_skipped`.

    3. skip.filepath.regex: When this property is uncommented and a regex is specified, the regex is matched against the file path (as opposed to just the file name). If a match is found, the file is skipped and a log entry should appear in –
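
The difference between matching on the file name and matching on the file path can be sketched as follows. This is an illustration only: `should_skip` is a hypothetical helper, it uses Python's `re` module (whose syntax differs in places from Java regular expressions), and it assumes "appears in" means a search anywhere in the string rather than a full match.

```python
import posixpath
import re


def should_skip(path, filename_regex=None, filepath_regex=None):
    """Return True if the file would be skipped under the assumed rules.

    filename_regex is searched only in the last path component;
    filepath_regex is searched in the whole path.
    """
    name = posixpath.basename(path)
    if filename_regex and re.search(filename_regex, name):
        return True
    if filepath_regex and re.search(filepath_regex, path):
        return True
    return False


if __name__ == "__main__":
    path = "/data/tmp/report.csv"
    # "tmp" appears in the directory part only, so the name-based check passes the file.
    print(should_skip(path, filename_regex=r"tmp"))
    # The path-based check sees the directory part as well.
    print(should_skip(path, filepath_regex=r"tmp"))
```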

  30. Files Skipped by Path

  31. Specific Signature CCN Masking: This property determines the format of Random -> Retain Card Type masking. Enter the number of characters (in digits) that the mask should retain. Alternatively, set the property as retainFiveOnesAlgo to retain the first digit, replace each of the following five digits with a “1”, and randomly mask the remaining digits.
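
As a rough illustration of the retainFiveOnesAlgo behavior described above (retain the first digit, set the next five digits to “1”, randomize the rest), consider the sketch below. `retain_five_ones` is a hypothetical function, not the product's implementation; it assumes a digits-only result and makes no attempt to keep the number valid under any checksum.

```python
import random


def retain_five_ones(card_number, rng=None):
    """Mask a card number in the style described for retainFiveOnesAlgo.

    Assumptions: non-digit characters (spaces, dashes) are dropped, and the
    remaining digits are replaced with independent random digits.
    """
    if rng is None:
        rng = random.Random()
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) < 6:
        raise ValueError("expected at least six digits")
    tail = [str(rng.randrange(10)) for _ in digits[6:]]
    return digits[0] + "1" * 5 + "".join(tail)


if __name__ == "__main__":
    # The input is an arbitrary sample value, not real card data.
    print(retain_five_ones("4000 0000 0000 0000"))
```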

  32. Positional Structure Masking: This property defines the positional structure padding character(s) to be used when the original string is longer than the masking string. Enter the padding characters inside the brackets.

  33. Structured Discovery

  34. Additional Delimiters

  35. Seed for Hashing: Seed for the hashing-based, per-instance general encryption marker.

  36. KMIP Configuration: When using a key management interoperability protocol (KMIP), these properties define the keystore location, XML files location, and key retrieval. If Safenet or RSA is the chosen KMIP, refer to Section 3.1.7 in this document.

  37. Safenet Properties Location: kmip.stubconfig.path=/home/cloudera/software/KMIPV2/KMIPClientTest/src/ch/ntb/inf/kmip/stub/config/
    kmip.retrieval: Should be set to ‘Y’ when using Safenet keys, else to ‘N’
    safenet.ingriannae.props.location: Absolute path of ‘’ file

    *Note: When using Safenet keys, set ‘key.retrieval.source’ to ‘other’

  38. RSA Configuration
    kmip.retrieval: Should be set to ‘Y’ when using Safenet keys, else to ‘N’
    safenet.ingriannae.props.location: Absolute path of ‘’ file

    *Note: When using Safenet keys, set ‘key.retrieval.source’ to ‘other’

  39. RC/ORC Discovery: This property determines whether a task recognizes the columnar boundaries in a given row. When the property is set to “true”, columnar boundaries in a row are recognized. When the property is set to “false”, a given row is treated as though it has no internal boundaries. It is set to “false” by default.


  40. Generate Multiple Reducers: This property can be set to ‘Y’ or ‘N’. The default is ‘N’

    If it is set to ‘Y’, multiple reducers are run for large input files instead of one reducer per file. This applies to masking and encryption tasks only. Multiple small masked/encrypted files are created for each large input file. The number of reducers is calculated based on the cluster configuration.

    If it is set to ‘N’, it works as before, i.e., one reducer runs per file.

  41. New Encryption Marker: This property can be set to “Y” or “N”. When set to “Y”, the encryption marker ends in ? instead of a blank.

  42. Size of thread pool for Auto Discovery
    auto.discovery.thread.pool.size: Default is 10. This determines how many MR jobs can run simultaneously. Note that a high number of threads may decrease overall system performance. You can tune IDP performance through “auto.discovery.thread.pool.size”.

  43. Auto Discovery with Structured Discovery: This property can be set to “Y” or “N”. This is used to enable or disable Structured Discovery with Auto Discovery.

  44. Sequence File Schema jar File Path
    sequenceFileSchemaJarPath: This is used when running Auto Discovery Tasks on Self Defined Sequence Files. It references the location of the schema jars for Self-Defined Sequence Files. The referenced directory should be on the machine where the HDFS IDP is installed so the HDFS IDP can access it.

    If the schema file jars are not provided for Self-Defined Sequence files, these files would be skipped when executing Auto Discovery tasks. The file path must be referencing the exact folder where the files are located.

  45. Direct Kerberos keytab: If this property is set to “Y”, the IDP logs into the system using keytab directly. If this property is missing or set to “N”, it uses keytab to kinit and then will use cache file for login. Only applicable when Kerberos is used.

  46. Metadata Push Timer: This property sets the interval (in seconds) at which data is pushed to the metadata repository.

  47. Encryption Delimiter

  48. Masked/Encrypted Files Permissions

  49. Detection: Number of Rows Read: Set the maximum number of rows of a file a task instance reads to determine if the file is structured or unstructured. This property only applies to auto discovery tasks. The property is set to 100 by default.


  50. Separator to Split Original and PartialFPE Values: If this property is left empty, only PartialFPE is performed; original values will not be prepended with the PartialFPE values.

  51. FPM Consistent Masking Compatibility: Implemented for backward compatibility. When set to “Version 1.0”, the IDP generates the same masked values as newMaskingAlgo=false in DgSecure 4.4.2.x builds. When set to “Version 2.0”, the IDP generates the same masked values as newMaskingAlgo=true in DgSecure 4.4.2.x builds.

    When set to “Version 3.0”, the IDP generates the same masked values as any of the DgSecure 5.0.x.x builds.

  52. Custom Validation Plugins: Map custom validation plugins with the custom expression names.
    Format:  <Custom Expression Name>:<Validation Plugin full name>
    Use a comma separated list for multiple validation functions mapping.
    Ex. <Custom Expression Name>:<Validation Plugin full name>,<Custom Expression Name>:<Validation Plugin full name>
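
The mapping format above can be read as a comma-separated list of `name:plugin` pairs. A minimal sketch of a parser for that format follows; `parse_validation_plugins` and the expression and class names in the usage example are hypothetical, chosen only for illustration.

```python
def parse_validation_plugins(value):
    """Parse '<name>:<plugin>,<name>:<plugin>' into a {name: plugin} dict.

    Assumption: neither the expression name nor the plugin name contains
    a comma, and the first ':' in each entry separates the two parts.
    """
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, plugin = entry.partition(":")
        name, plugin = name.strip(), plugin.strip()
        if not sep or not name or not plugin:
            raise ValueError(f"malformed entry: {entry!r}")
        mapping[name] = plugin
    return mapping


if __name__ == "__main__":
    # Hypothetical expression and plugin names.
    print(parse_validation_plugins(
        "EmployeeId:com.example.EmployeeIdValidator,BadgeNo:com.example.BadgeValidator"
    ))
```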

  53. PK Protect tmp Files: Local tmp file home to store PK Protect tmp files.

  54. Additional Text File Handling: This property is used to make the IDP handle additional file types (in MIME type format such as application/example) to be treated as a text file. If you are adding more than one type, separate them by comma (,). By default, the IDP handles all files whose type starts with "text/", "application/xml", and "application/json" as a text file.
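
The default rule described above, plus a comma-separated list of extra types, can be sketched like this. `is_text_type` is a hypothetical helper; the sketch assumes the default entries are matched as prefixes (as the text states) and that additional types are matched exactly, which the source does not specify.

```python
DEFAULT_TEXT_PREFIXES = ("text/", "application/xml", "application/json")


def is_text_type(mime_type, additional_types=""):
    """Return True if a MIME type would be handled as a text file.

    additional_types mimics the comma-separated property value.
    Assumption: additional types must match the MIME type exactly.
    """
    extra = {t.strip() for t in additional_types.split(",") if t.strip()}
    return mime_type.startswith(DEFAULT_TEXT_PREFIXES) or mime_type in extra


if __name__ == "__main__":
    print(is_text_type("text/csv"))
    print(is_text_type("application/example"))
    print(is_text_type("application/example", "application/example"))
```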

  55. Key Rotation
    do.xattr.keyrotation: This property is used if you want to support key rotation and can use extended attributes in your system.

  56. Concise Detail Results
    enable.concise.details: This property is used if the user would like to see detailed results after the task is completed in HDFS. By default, this is set to ‘Y’, however it can be changed to ‘N’ in case the detailed results are not wanted.

  57. Detail and Summary Report Location: This property is used to specify whether the report is saved to HDFS or to S3/GCS (S3 or GCS will be defined in Admin). Use ‘processor’ for HDFS and ‘source’ for S3 or GCS.

  58. EDI Dictionary HDFS Path: This property points to the place where the HDFS dictionary will be copied when the EDI Auto Discovery task runs. Modifying this property is necessary only when the HDFS IDP does not use the default /Dataguise$ folder.

  59. Single UGI
    use.single.ugi: This is set to ‘N’ by default and should only be changed when using the DGUSER feature.

  60. S3 Filesystem Property

  61. Encryption Provider: Currently, both Java and Bouncy Castle are installed as part of PK Protect. To specify which keystore should be used in encryption tasks, set the property to either “bc” for Bouncy Castle or “jce” for Java. It is set to Java by default.

  62. FPE Salt Use: Specifies whether to use the FPE salt. Valid values are “Y” and “N”. A value of “N” means the salt is ignored and “Y” means the salt is processed. Set this property to “N” to ensure backward compatibility with older versions of PK Protect. It is set to “Y” by default.

  63. Hive/Beeline Decrypter6 UDF: Two properties must be set for the DgDecrypter6 UDF to work: “OTF.decryption.configuration” and “OTF.decryption.metadata.path”. The first property generates Hive/Beeline parameters to the cluster and must be set to “Y” before initial encryption. It is set to “N” by default. The second property determines the repository for the decryption metadata.

  64. Decryption Metadata Location

  65. In Memory Masking: When this property is set to Y, random name masking uses a reference name list instead of HBase or Couchbase.

  66. Preprocessing Property
    preprocessing.synchronized: This property determines whether preprocessing runs synchronously or asynchronously. Asynchronous (“N”) is the preferred option, especially if you have many files in a single discovery task.

  67. High Availability Flag: When set to Y, this property sends metadata to the controller. This enables a secondary IDP to resume tasks if the first IDP fails. This property is set to N by default.

  68. Binary File with Magic Header
    custom.magic.headers: Binary files with the magic headers specified here are treated as unstructured files. Separate multiple values with ‘,’.
    Ex. custom.magic.headers=SPL,META

  69. Navigator Integration: There are four properties to set to integrate HDFS results with Cloudera Navigator.
    dg.access.control: This property can be set to either “NONE” or “Navigator”. When set to NONE, HDFS results will not be sent to Navigator. When set to Navigator, HDFS results can be sent to Navigator.

    Ex. dg.access.control = Navigator
    dg.access.control.url: Enter Navigator’s URL

    Ex. dg.access.control.url = http://localhost:7187

    Enter the access credentials for Navigator
    Ex. dg.access.control.username=i0BVPEH0IcuYBRw8QI9Xdw==

    Lastly, enter the prefix that all PK Protect tags in Navigator should have. Tag prefix length should not exceed 15 characters. The maximum tag length in Navigator is 50 characters.

    Ex. dg.access.control.tag.prefix=DG_
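
The two length limits mentioned above (a prefix of at most 15 characters, a Navigator tag of at most 50 characters) can be checked before configuring the prefix. The sketch below is illustrative only: `build_tag` is hypothetical, and it assumes a tag is formed by simple concatenation of the prefix and a name, which the source does not state.

```python
MAX_PREFIX_LEN = 15  # limit on dg.access.control.tag.prefix, per the text
MAX_TAG_LEN = 50     # maximum tag length in Navigator, per the text


def build_tag(prefix, name):
    """Return prefix + name, enforcing both documented length limits.

    Assumption: tags are the prefix followed directly by a name.
    """
    if len(prefix) > MAX_PREFIX_LEN:
        raise ValueError("tag prefix exceeds 15 characters")
    tag = prefix + name
    if len(tag) > MAX_TAG_LEN:
        raise ValueError("tag exceeds 50 characters")
    return tag


if __name__ == "__main__":
    # "SSN" is a made-up tag name used only for illustration.
    print(build_tag("DG_", "SSN"))
```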

  70. Custom Masking: PK Protect allows customers to leverage their own custom masking solution. To leverage custom masking, there are two properties that need to be set. The first property “local.custom.masking.jar.path” is the location of the custom masking jar. The second property “hdfs.custom.masking.jar.path” is the location where PK Protect will place the jar on the cluster. The HDFS IDP uploads the custom jar from local.custom.masking.jar.path to hdfs.custom.masking.jar.path.

    Ex. local.custom.masking.jar.path=/opt/Dataguise/DgSecure/IDPs/HDFSIDP/expandedArchive/WEB-INF/plugins/custom_protection/

  71. Machine Learning: This property covers a feature that is currently in beta. To try PKWARE’s machine learning on HDFS, please contact professional services.

  72. Yarn Memory Management: This property is needed when we need to override the setting from the mapred-site.xml. At times, the load hitting the cluster will be greater than can be set in mapred-site.xml and therefore we can set the parameter manually within the file.

    By default, this parameter will be blank. We will populate it only when we need to override the automatic setting from mapred-site.xml. The format should be [FileSize1],[MemorySize1];[FileSize2],[MemorySize2]
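
To make the `[FileSize1],[MemorySize1];[FileSize2],[MemorySize2]` format concrete, the sketch below parses such a value into pairs. It assumes the square brackets are placeholders rather than literal characters and that both values are plain integers; the units are not stated in the source, and `parse_memory_overrides` is a hypothetical name.

```python
def parse_memory_overrides(value):
    """Parse 'size,memory;size,memory' into a list of (size, memory) tuples.

    Assumptions: pairs are separated by ';', the two numbers in a pair by
    ',', and a blank value (the default) means no override.
    """
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        file_size, sep, memory_size = entry.partition(",")
        if not sep:
            raise ValueError(f"malformed pair: {entry!r}")
        pairs.append((int(file_size), int(memory_size)))
    return pairs


if __name__ == "__main__":
    # Arbitrary sample numbers; units are unspecified in the documentation.
    print(parse_memory_overrides("1024,2048;4096,8192"))
```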


  73. Combine Small Files to One Mapper in Detection: This property is used to combine small files in a batch to a single mapper and all Sequence file type tasks.

  74. Controller ID: This property defines the controller ID of the Dg Controller.

  75. Submit Job Time Out
    submit.job.timeout: This property defines the timeout, in seconds, for submitting a job. Set the property to ‘0’ if you do not want to use the feature.

  76. Spark
    spark: This property is used if you want to use Spark instead of MapReduce.

  77. Masked Data Output File
    dg.masked.output.append.newlinechar: This property determines whether a newline character is appended at the end of a masked output file.

  78. Local File Size Limit
    local.file.size.limitation: This property defines the file size limit. A file larger than the specified size is skipped. If the value is ‘-1’, no files are skipped.
