Detection - Files

Detection tasks scan the target source system for sensitive data elements. When you select Detection as the Task Type, the following options are shown.

*Note: Detection can also be performed on image files.

  1. Click the + Show Advanced Options button to set the advanced settings for the task.

    1. (Re)Scan All Files: Check this checkbox to scan all the available files for a given connection between the dates specified in the Files Modified After and Files Modified Before fields. These fields appear only when the (Re)Scan All Files option is checked.

    2. Skip Files with No Extension: Check this checkbox to skip files that have no extension. After the task has executed successfully, the skipped files are listed under the Skipped Files tab of the By Task screen in the Results module. The reason for skipping the file is cited as File Name without extension under the Skipped Reason column.

    3. Unstructured Input Files: Check this checkbox if you know that the scan location contains only text files. This option allows the system to bypass the file type detection process and finish the task in less time.

      *Note: To sample unstructured files during discovery, set the property enable.discovery.unstructured.sampling=True in the HDFSAgentConfig.properties file, located at .../installation directory/Agents/<IDP>/expandedArchive/WEB-INF/classes. By default, this property is set to false. Restart the IDP after changing the property.

    4. Don’t Report File Size: Check this checkbox if you do not want the file size to be reported once scanning is complete.

    5. Create Metadata Info File: Check this checkbox to write the metadata of the scanned files to a separate file named Metadata.txt. It is recommended to first run a task with this option on a few files to understand the metadata, because writing it for a large number of files takes a long time.

      *Note: By default, the Metadata.txt file is available at the location ‘resultsDirectory/task_name/task_id’.

    6. Read Files: Choose to read the entire file or a part of the file at random.

      1. Entire File: Choose this option if you want to read all the content of the selected files.

      2. Part of Files: Choose this option if you want to read the content of a structured file at random. When you select this option, the Exit on first hit option becomes visible.

        To scan an unstructured file partially, set the enable.discovery.unstructured.sampling property to True in the HDFSAgentConfig.properties file, located at .../installation directory/Agents/<IDP>/expandedArchive/WEB-INF/classes.

    7. Exit on first hit: Check this checkbox to stop scanning a file as soon as the first sensitive record in it is detected. This option is visible only when you select Part of Files in the Read Files option.


      Header Row Number: This field specifies the row number where the column headers are defined in the file. By default, the value is -1, which indicates that the file does not contain any column headers. Any other value is the row number where the column headers are placed.

      For example, if the header row number is set to 5, the column headers are defined in the fifth row.

    8. Filename Regex: Enter a regular expression for file names. Only files whose names match the regex entered in this field are scanned.

    9. Sync Results with Privacy: Check this checkbox to push the detection results to the Privacy IDP.

    10. Apply Label: Check the Apply Label checkbox to classify and protect the data by applying labels to the output file. For more information, refer to Appendix I: Mapping.
      When you check the Apply Label checkbox, the following fields appear:

      1. Mapping: This field displays the mappings that you have created under the Mapping screen of the Policy module. Select the desired mapping from the drop-down. This field is visible only when a mapping is defined. For more information, refer to Appendix I: Mapping.

      2. Overwrite Files: Check this checkbox to create the labeled file at the same location as the input file. The original input file is deleted and replaced by the new labeled file.

      3. Destination Path: If you do not wish to overwrite the files, provide another file path where you want to save the labeled file.


        *Note: The Overwrite Files and Destination Path fields appear only when the mapping selected in the Mapping drop-down has its Source value set to MIP.
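
The effect of the Filename Regex option described above can be illustrated with a short Python sketch. This is not PK Protect code. It assumes that the whole file name must match the pattern; the regex flavor and matching rules the product actually applies are not stated in this section. The function name filter_by_filename_regex and the sample file names are hypothetical.

```python
import re

def filter_by_filename_regex(file_names, pattern):
    """Return the file names that match the pattern in full.

    Assumption: the entire name must match (re.fullmatch); the product
    may apply the regex differently.
    """
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]

# Hypothetical file names found at a scan location.
candidates = ["customers_2023.csv", "customers_2024.csv", "notes.txt", "README"]

# Keep only CSV files whose names start with "customers_" followed by a year.
selected = filter_by_filename_regex(candidates, r"customers_\d{4}\.csv")
print(selected)
```

Testing a pattern against a handful of known file names in this way, before entering it in the task, can help confirm that it selects only the intended files.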

  2. Sampling Configuration: PK Protect provides data sampling to limit the scan area, which reduces the time taken for detection. By default, there are two options for sampling data from files:

    1. Top 1000 Rows - Samples approximately 1000 records from the top of the file.

    2. Read top 5% of data - Samples 5 percent of the data from the top of the file.

      By default, the Top 1000 Rows option is selected. To create a sample, go to the Files > Tasks > Sampling Configuration tab and click the + Add Configuration button. You can also create a sample by clicking the Add button next to the Sampling Configuration drop-down on the Add New Task Definition screen.


      Perform the following steps to create a sample:


      1. Enter the name and description of the sampling configuration in the Name and Description textboxes.

      2. Check the option Set Sampling Config as Default to set the Sampling Configuration as the default configuration for all tasks.

      3. Check the option Show Advance Sampling Details to set the advanced settings for sampling. The options for advanced settings are:

        1. File Size Range (Bytes): Enter the file size range for the sample, in bytes.

        2. To: Enter the end of the range for the sample.

        3. By: Specify how data is picked for sampling from the source system. There are two options:

          1. Rows: Select ‘Rows’ from the drop-down to sample data based on the number of rows.

          2. Percent: Select ‘Percent’ from the drop-down to sample a percentage of the data.

      4. Value: Enter a numeric value. It specifies the total number of records to be processed if Rows is selected in By, or the percentage of data to be sampled if Percent is selected.

      5. After setting up the required configuration, click Add to add the user-defined sampling configuration to the list. Click the Save button to apply the changes; otherwise, click the Cancel button.
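
To make the Rows and Percent choices above concrete, here is a minimal Python sketch of sampling from the top of a file. It is illustrative only and is not PK Protect's implementation. The function name sample_top, the rounding rule for percentages (round up), and the in-memory list of rows are all assumptions.

```python
import math

def sample_top(rows, by, value):
    """Return a sample taken from the top of `rows`.

    by="rows":    keep the first `value` rows.
    by="percent": keep the first `value` percent of the rows.
    Assumption: percentages round up, so a non-empty input with a
    positive percentage yields at least one row.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if by == "rows":
        count = int(value)
    elif by == "percent":
        if value > 100:
            raise ValueError("percent must be between 0 and 100")
        count = math.ceil(len(rows) * value / 100)
    else:
        raise ValueError("by must be 'rows' or 'percent'")
    return rows[:count]

# Hypothetical file with 20,000 rows.
rows = [f"record-{i}" for i in range(20000)]

print(len(sample_top(rows, "rows", 1000)))   # corresponds to Top 1000 Rows
print(len(sample_top(rows, "percent", 5)))   # corresponds to Read top 5% of data
```

For a file of this size the two default options happen to select the same number of rows. On smaller files a percentage yields fewer rows than a fixed row count, and on larger files it yields more.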

For the remaining steps, refer to step 4 of Create a task in files.
