Detection tasks scan the target source system for sensitive data elements. Click the + Show Advanced Options button to view the advanced settings for the detection task.
(Re)Scan All Files: Check this checkbox to scan all the available files for a given connection that were modified between the dates specified in the Files Modified After and Files Modified Before drop-downs. The Files Modified After and Files Modified Before fields are visible only when the (Re)Scan All Files option is checked.
Skip Files with No Extension: Check this checkbox to skip files that have no extension. After the task has executed successfully, the skipped files are listed under the Skipped Files tab of the By Task screen in the Results module. The Skipped Reason column cites the reason as File Name without extension.
Unstructured Input Files: Check this checkbox if you know that the scan location contains only text files. This option allows the system to bypass the file type detection process and finish the task in less time.
*Note: To perform sampling for unstructured files under discovery, set the property enable.discovery.unstructured.sampling=True in the HDFSAgentConfig.properties file, located at .../installation/directory/Agents/<IDP>/expandedArchive/WEB-INF/classes. By default, the value of this property is false. Restart the IDP after changing the property.
Don’t Report File Size: Check this checkbox if you do not want the file size to be reported once scanning is complete.
Create Metadata Info File: Check the Create Metadata Info File checkbox to write the metadata of the scanned files to a separate file named Metadata.txt. It is recommended to first run a task with this flag on a few files to understand the metadata, because writing metadata for a large number of files takes a long time.
*Note: By default, the Metadata.txt file is available at location: ‘resultsDirectory/task_name/task_id’.
Read Files: Choose to read the entire file or a part of the file at random. Selecting Entire File option reads all the content of the selected files.
Choose the Part of Files option if you want to read the content of a structured file at random. When you select this option, the Exit on first hit option becomes visible. To scan any unstructured file partially, set the value of the enable.discovery.unstructured.sampling property to True in the HDFSAgentConfig.properties file, located at .../installation/directory/Agents/<IDP>/
Exit on first hit: Check this checkbox to stop scanning a file as soon as the first sensitive record in it is detected. This option is visible when you select the Part of Files option under Read Files.
Header Row Number: The value in this field specifies the row number where the column headers are defined in the file. By default, the value is set to -1, which means that the file does not contain any column headers. Any other value is the number of the row that holds the column headers.
For example, in the screenshot below, the header row number is set to 5, which means that the column headers are defined in the fifth row.
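The meaning of the Header Row Number value can be illustrated with a minimal Python sketch. It is not PK Protect code: the function name split_header is hypothetical, and the handling of rows that precede the header (they are simply ignored) is an assumption made for illustration.

```python
from typing import List, Optional, Tuple

def split_header(
    rows: List[List[str]], header_row_number: int = -1
) -> Tuple[Optional[List[str]], List[List[str]]]:
    """Split parsed rows into (header, data rows).

    header_row_number is 1-based; -1 means the file has no header row.
    """
    if header_row_number == -1:
        # No column headers: every row is treated as data.
        return None, rows
    if header_row_number < 1 or header_row_number > len(rows):
        raise ValueError("header row number is outside the file")
    header = rows[header_row_number - 1]
    # Assumption: rows above the header are preamble and are not scanned as data.
    data = rows[header_row_number:]
    return header, data
```

With header_row_number set to 5, the fifth row supplies the column names and the rows after it are treated as records.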
Objectname Regex: Enter a regular expression for the object names. Only objects whose names match the regular expression entered in this field are scanned.
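As a rough illustration of name-based filtering, the Python sketch below keeps only the object names that match a pattern. The function name select_objects is hypothetical, and the sketch assumes the whole name must match; the regex dialect and matching rule that the product actually applies may differ, so test a pattern on a small task first.

```python
import re
from typing import Iterable, List

def select_objects(names: Iterable[str], pattern: str) -> List[str]:
    """Return the names that match the pattern in full."""
    regex = re.compile(pattern)
    # Assumption: the entire object name must match, not just a substring.
    return [name for name in names if regex.fullmatch(name)]

if __name__ == "__main__":
    candidates = ["customers_2023.csv", "orders.csv", "customers.txt"]
    print(select_objects(candidates, r"customers_.*\.csv"))
```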
Auto Batch Size: This option defines the number of files per batch during the scanning process.
Batch Size (Files): This option sets the number of files scanned per batch during the detection process. By default, the batch size is 30 files. This option is visible when the Auto Batch Size checkbox is not checked.
Auto Batch Size (MB): This option allows you to enter the minimum batch size in the Batch min size (MB) field. The field is visible when you check the Auto Batch Size (MB) option. By default, the batch size is 512 MB.
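The difference between the two batching modes can be sketched in Python as follows. This is an illustration only: both function names are hypothetical, and the rule used for size-based batching (a batch is closed once its accumulated size reaches the minimum) is an assumption, not a description of the product's internal logic.

```python
from typing import Iterable, Iterator, List, Tuple

def batches_by_count(files: List[str], batch_size: int = 30) -> Iterator[List[str]]:
    """Fixed number of files per batch (default 30, as in Batch Size (Files))."""
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]

def batches_by_min_size(
    files: Iterable[Tuple[str, int]], min_mb: int = 512
) -> Iterator[List[str]]:
    """Group (name, size_in_bytes) pairs until a batch reaches min_mb megabytes."""
    threshold = min_mb * 1024 * 1024
    batch: List[str] = []
    total = 0
    for name, size in files:
        batch.append(name)
        total += size
        # Assumption: close the batch as soon as the minimum size is reached.
        if total >= threshold:
            yield batch
            batch, total = [], 0
    if batch:
        # Leftover files form a final, smaller batch.
        yield batch
```

Under these assumptions, 65 files with the default count of 30 produce batches of 30, 30, and 5 files, while size-based batching produces fewer batches when files are small and more when they are large.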
Sync Results with Privacy: Check this checkbox to push detection results to the Privacy IDP.
Sampling Configuration: PK Protect is equipped with data sampling to limit the area of the scan, which helps reduce the time taken for detection. By default, there are two options to scan sample data from the database:
Top 1000 Rows - It will sample approximately 1000 records from the top of the database.
Read top 5% of data - It will sample 5 percent of the data from the top of the database. It shows the result by percent.
By default, the Top 1000 Rows option is selected. To create a sample, go to Files > Tasks > Sampling Configuration tab and click the + Add Configuration button. You can also create a sample in the Add New Task Definition screen by clicking the Add button next to the Sampling Configuration drop-down.
Perform the below steps to create a sample:
Enter the name and description of the sampling configuration in the Name and Description textbox.
Check the option Set Sampling Config as Default to set the Sampling Configuration as the default configuration for all your tasks.
Check the option Show Advance Sampling Details to set the advanced settings for sampling as shown above.
File Size Range (Bytes): Enter the range for the sample in Bytes.
To: Enter the ending range for the sample.
By: Specify how to pick data for sampling from the source system. There are two ways:
By Rows: Select ‘Rows’ from the drop-down to sample data based on the number of rows.
By Percent: Select ‘Percent’ from the drop-down to sample a percentage of the data.
Value: Enter a numeric value. It specifies the total number of records to be processed if sampling By Rows is selected, and the percentage of data to be sampled if sampling By Percent is selected.
After setting up the required configuration, click Add to add the user-defined sampling configuration to the list.
For the remaining steps, refer to step 4 of Create AWS task.
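The By and Value settings described above can be sketched in Python to show how the two sampling modes differ. The function name sample_top is hypothetical, and rounding the percentage up to a whole record is an assumption; the product may round differently or sample approximately.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def sample_top(records: Sequence[T], by: str, value: float) -> List[T]:
    """Take a sample from the top of the records.

    by="Rows":    value is the number of records to read.
    by="Percent": value is the percentage of records to read.
    """
    if by == "Rows":
        count = int(value)
    elif by == "Percent":
        # Assumption: round up so that a non-empty source yields at least one record.
        count = math.ceil(len(records) * value / 100)
    else:
        raise ValueError("by must be 'Rows' or 'Percent'")
    return list(records[:max(count, 0)])
```

For a source of 20,000 records, Rows with a value of 1000 and Percent with a value of 5 both read the first 1,000 records; on a smaller source the Rows setting reads everything available, while Percent scales with the data.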