
Detection - HDFS

Detection tasks scan the target source system for sensitive data elements. Click the + Show Advanced Options button to view the advanced options.

  • (Re)Scan All Files: Check this checkbox to scan all available files for a given connection that were modified between the dates specified in the Files Modified After and Files Modified Before fields. These fields are visible only when the (Re)Scan All Files option is selected.

  • Skip Files with No Extension: Check this checkbox to skip scanning files that have no extension. Once the task has executed successfully, the skipped files are listed under the Skipped Files tab of the By Task screen in the Results module. The Skipped Reason column cites the reason as File Name without extension.

  • Unstructured Input Files: Check this checkbox if you know that the scan location contains only text files. This option allows the system to bypass the file type detection process and finish the task in less time.

    *Note: To sample unstructured files during discovery, set enable.discovery.unstructured.sampling=True in the HDFSAgentConfig.properties file, located at .../installation directory/Agents/<IDP>/expandedArchive/WEB-INF/classes. By default, this property is set to false. Restart the IDP after changing the property.
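As a rough illustration of the property change described in the note above, the following Python sketch updates a key in properties-style text. Only the property name comes from this page. The helper name set_property and the sample file content are hypothetical, and the sketch works on a string rather than on the real HDFSAgentConfig.properties file.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties-style text with `key` set to `value`.

    If the key is not present, it is appended at the end.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or '!').
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            break
    else:
        # Key was not found anywhere in the text.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical file content, used only for illustration.
sample = "# agent settings\nenable.discovery.unstructured.sampling=false\n"
updated = set_property(sample, "enable.discovery.unstructured.sampling", "True")
print(updated)
```

After editing the real file, the IDP still has to be restarted for the change to take effect, as the note states.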

  • Don’t Report File Size: Check this checkbox if you do not want the file size to be reported once scanning is complete.

  • Create Metadata Info File: Check this checkbox to write the metadata of the scanned files to a separate file named Metadata.txt. It is recommended to first run a task with this flag on a few files to understand the metadata, because writing metadata for a large number of files takes a long time.

    *Note: By default, the Metadata.txt file is available at location: ‘resultsDirectory/task_name/task_id’.

  • Read Files: Choose to read the entire file or a part of the file at random. 

    • Entire File: Choose this option if you want to read all the content of the selected files.

    • Part of Files: Choose this option if you want to read the content of a structured file at random. When this option is selected, the Exit on first hit option becomes visible.

      To scan an unstructured file partially, set the enable.discovery.unstructured.sampling property to True in the HDFSAgentConfig.properties file, located at .../installation directory/Agents/<IDP>/expandedArchive/WEB-INF/classes.

  • Exit on first hit: Check this checkbox to stop scanning a file as soon as the first sensitive record in it is detected. This option is visible when you select Part of Files in the Read Files option.

  • Header Row Number: The value in this field specifies the row number where the column headers are defined in the file. By default, the value is set to -1, which means that the file does not contain any column headers. Any other value is the row number where the column headers are placed.
    For example, if the header row number is set to 5, the column headers are defined in the fifth row.
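The header row convention above can be sketched in Python as follows. This is a minimal illustration, not the product's parser. The function name split_header is hypothetical, the input is assumed to be comma-separated text, and rows above the header row are assumed to be ignored.

```python
import csv
import io


def split_header(text: str, header_row: int):
    """Split CSV text into (headers, data_rows).

    header_row is 1-based; -1 means the file has no column headers.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if header_row == -1:
        # No header row: every row is treated as data.
        return None, rows
    if not 1 <= header_row <= len(rows):
        raise ValueError(f"header_row {header_row} is outside the file")
    headers = rows[header_row - 1]
    # Assumption: rows above the header row are not treated as data.
    data = rows[header_row:]
    return headers, data


text = "report\ngenerated 2024\nname,ssn\nalice,123\n"
headers, data = split_header(text, 3)
print(headers, data)
```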

  • Filename Regex: Enter a regular expression for the file names. Only files whose names match the regex entered in this field are scanned.
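To show how a file name filter of this kind behaves, here is a small Python sketch. It assumes that the whole file name must match the expression. Whether the product applies a full or a partial match, and which regex dialect it uses, is not stated on this page. The function name and the file names are hypothetical.

```python
import re


def select_files(names, pattern):
    """Return the file names that match the pattern in full."""
    regex = re.compile(pattern)
    # Assumption: the entire name must match, not just a substring.
    return [name for name in names if regex.fullmatch(name)]


names = ["customers_2023.csv", "customers_2023.csv.bak", "orders.json", "README"]
selected = select_files(names, r"customers_\d{4}\.csv")
print(selected)
```

Under a partial match, customers_2023.csv.bak would also be selected, so it is worth testing an expression on a few files before running a large task.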

  • Auto Batch Size: This defines the number of files per batch during the scanning process.

    • Batch Size (Files): This option sets the number of files scanned per batch during the detection process. By default, the batch size is 30 files. This option is visible when the Auto Batch Size checkbox is not checked.

  • Auto Batch Size (MB): This option allows you to enter the minimum batch size in the Batch min size (MB) field, which is visible when you check the Auto Batch Size (MB) option. By default, the batch size is 512 MB.
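The two batching modes described above can be sketched in Python as follows. This is only an illustration of the idea: the function names are hypothetical, and the rule that a size-based batch is closed as soon as it reaches the minimum size is an assumption, since the page does not describe the exact algorithm.

```python
def batch_by_count(files, batch_size=30):
    """Group files into batches of at most batch_size files."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def batch_by_min_size(files, min_mb=512):
    """Group (name, size_in_bytes) pairs into batches.

    Assumption: a batch is closed once its total size reaches min_mb.
    """
    threshold = min_mb * 1024 * 1024
    batches, current, total = [], [], 0
    for name, size in files:
        current.append(name)
        total += size
        if total >= threshold:
            batches.append(current)
            current, total = [], 0
    if current:
        # Remaining files form a final, smaller batch.
        batches.append(current)
    return batches


MB = 1024 * 1024
count_batches = batch_by_count([f"file_{i}" for i in range(65)])
size_batches = batch_by_min_size([("a", 300 * MB), ("b", 300 * MB), ("c", 100 * MB)])
print([len(b) for b in count_batches], size_batches)
```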

  • Force Distribution: Select this checkbox to recreate both the distribution and the model type for the unique count extrapolation process.

  • Unique Count Extrapolation: Select either No or Stochastic from the Unique Count Extrapolation drop-down. This field appears only when Detection is selected as the Task Type.

    • Selecting Stochastic in this field displays the unique count of sensitive data types at the database level.

  • Sampling Configuration: PK Protect is equipped with data sampling to limit the area of the scan, which helps reduce the time taken for detection. By default, there are two options for scanning sample data from the database:

    • Top 1000 Rows - It will sample approximately 1000 rows from the top of the database.

    • Read top 5% of data - It will sample 5 percent of the data from the top of the database.

      By default, the Top 1000 Rows option is selected. To create a sample, go to the HDFS > Tasks > Sampling Configuration tab and click the + Add Configuration button. You can also create a sample by clicking the Add button next to the Sampling Configuration drop-down on this screen.


      Perform the following steps to create a sample:

      • Enter the name and description of the sampling configuration in the Name and Description textboxes.

      • Check the option Set Sampling Config as Default to set the Sampling Configuration as the default configuration for all your tasks.

      • Check the option Show Advance Sampling Details to set the advanced settings for sampling. Below are the options for advanced settings:

        • File Size Range: Enter the range for the sample in Bytes.

        • To: Enter the ending range for the sample.

        • By: Specify how to pick data for sampling from the source system. There are two ways:

          • Percent: Select Percent from the drop-down to sample a percentage of the data.

          • Rows: Select Rows from the drop-down to sample data based on the number of rows.

        • Value: Enter a numeric value. If By is set to Rows, this is the total number of records to be processed. If By is set to Percent, this is the percentage of the data to be sampled.

      • After setting up the required configuration, click Add to add the user-defined sampling configuration to the list. Click the Save button to make the changes effective; otherwise, click the Cancel button.
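The By and Value settings above can be illustrated with a short Python sketch that computes how many rows a sample would contain. The function name sample_row_count is hypothetical, and rounding a percentage up to a whole row is an assumption rather than documented product behavior.

```python
import math


def sample_row_count(total_rows: int, by: str, value: float) -> int:
    """Return the number of rows to sample from the top of the data."""
    if by == "Rows":
        # Never sample more rows than the source contains.
        return min(total_rows, int(value))
    if by == "Percent":
        if not 0 <= value <= 100:
            raise ValueError("percentage must be between 0 and 100")
        # Assumption: fractional rows are rounded up.
        return min(total_rows, math.ceil(total_rows * value / 100))
    raise ValueError(f"unknown sampling mode: {by}")


top_1000 = sample_row_count(50_000, "Rows", 1000)
top_5_percent = sample_row_count(50_000, "Percent", 5)
print(top_1000, top_5_percent)
```

With these assumptions, the two default options (Top 1000 Rows and Read top 5% of data) correspond to By set to Rows with Value 1000 and By set to Percent with Value 5.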

For the remaining steps, refer to step 4 of Create a HDFS task.
