Detection tasks scan the target source system for sensitive data elements. Click the + Show Advanced Options button to view the advanced options specific to detection.
Filter Type: Select either the Exclusion or the Inclusion option.
Exclusion: Select Exclusion to exclude the selected tables while performing the detection task.
Inclusion: Select Inclusion to include the selected tables while performing the detection task.
Sampling Configuration: For Hive detection, PK Protect supports enhanced sampling options that accommodate Hive concepts such as partitions and buckets. You can also define different sampling sizes for tables of different sizes. Partitioning splits a table into multiple segments based on one or more columns, while bucketing organizes the data in each partition into multiple files.
Three predefined sampling options are available:
Scan 1000 Rows: Randomly samples approximately 1000 rows/maps from the table.
Scan 1 MB of data: Randomly samples 1 MB of data from the table.
Scan 5% of data: Randomly samples 5 percent of the data from the table.
Perform the following steps to create a sampling configuration:
Go to the Hadoop > Hive > Add/Edit Sampling Configuration tab. Click the + Add Configuration button. You can also create a sampling configuration by clicking the +Add button next to the Sampling Configuration drop-down on the Add New Task Definition screen.
Provide values for the fields depicted in the screenshot:
Enter a name for the sampling configuration in the Name field. This field accepts numerals and characters.
Enter a brief description of the sampling configuration in the Description field.
Check the Set Sampling Config As Default option to make this sampling configuration the default for all tasks.
Check the Show Advanced Sampling Details option to view and set the advanced settings for the sampling. Below are the options for advanced settings:
Table Partition Count Range: Specify the start of the Hive table partition count range.
To: Specify the end of the Hive table partition count range.
Partition Sampling options: There are three ways to specify how to pick partition values for sampling:
Percentage of Partitions: Enter the percentage of partitions to be scanned. Hive IDP randomly picks the defined percentage of partition values from the partition column with the highest number of unique partitions.
Number of Partitions: Enter the number of partitions to be scanned. Hive IDP randomly picks the defined number of partition values from the partition column with the highest number of unique partitions.
Value of Partitions: Enter one or more actual partition values. Separate multiple values with commas.
Value: Provide the desired value for the option selected in the Partition Sampling options field.
Auto: When this option is selected, Hive IDP automatically identifies the partition column(s) in the table. If more than one partition column exists, the one with the highest number of partitions is picked.
On Column: When this option is selected, you need to specify the partition column names manually.
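The partition sampling choices above can be illustrated with a short Python sketch. It is not Hive IDP's implementation: the function names, the mode strings ("percentage", "number", "values"), and the round-up rule for percentages are all assumptions made for illustration.

```python
import math
import random


def pick_partition_column(partitions_by_column):
    """Mimic the Auto option: choose the column with the most unique partition values."""
    return max(partitions_by_column,
               key=lambda col: len(set(partitions_by_column[col])))


def pick_partition_values(values, mode, setting, rng=None):
    """Return the partition values to scan for one partition column.

    mode is a hypothetical stand-in for the Partition Sampling options field;
    setting stands in for the Value field.
    """
    rng = rng or random.Random()
    unique = sorted(set(values))
    if mode == "percentage":
        # Rounding up and scanning at least one partition is an assumption.
        count = max(1, math.ceil(len(unique) * setting / 100))
        return sorted(rng.sample(unique, min(count, len(unique))))
    if mode == "number":
        return sorted(rng.sample(unique, min(int(setting), len(unique))))
    if mode == "values":
        # Comma-separated list of explicit partition values.
        wanted = [v.strip() for v in setting.split(",") if v.strip()]
        return [v for v in wanted if v in unique]
    raise ValueError(f"unknown mode: {mode}")


columns = {"country": ["us", "in", "uk", "us"], "year": ["2020", "2021"]}
column = pick_partition_column(columns)  # "country" has 3 unique values
print(column, pick_partition_values(columns[column], "number", 2))
```

In this sketch, explicit values that do not exist in the table are silently skipped; a real tool might report them instead.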
Table row count range: Specify the start of the Hive table row count range.
To: Specify the end of the Hive table row count range.
By: There are three ways to specify how to pick data for sampling from the table:
Number of Bytes: Select this option and define the size of the data that needs to be sampled in a table. Size can be specified in terms of Bytes, KB, MB, and GB.
Rows: Select this option and enter the number of rows to be sampled in a table.
Percent: Select this option and enter the percentage of the data to be sampled in a table.
Value: Provide the desired value for the option selected in the By field.
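To show how a row count range and a By/Value pair might combine, the following Python sketch models each user-defined entry as a rule and resolves the sample size for a table. The class and function names are hypothetical, and several behaviors are assumptions rather than documented product behavior: ranges are treated as inclusive, the first matching rule wins, and 1 KB is taken as 1024 bytes.

```python
from dataclasses import dataclass

# Assumed binary units; the product may define KB/MB/GB differently.
UNIT_BYTES = {"Bytes": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


@dataclass
class RowSamplingRule:
    min_rows: int          # Table row count range (start)
    max_rows: int          # To (end of the range)
    by: str                # "bytes", "rows", or "percent" (hypothetical names)
    value: float           # Value field
    unit: str = "Bytes"    # Only used when by == "bytes"


def find_rule(rules, table_rows):
    """Return the first rule whose row count range contains the table, else None."""
    for rule in rules:
        if rule.min_rows <= table_rows <= rule.max_rows:
            return rule
    return None


def describe_sample(rule, table_rows):
    """Translate a rule into a (kind, amount) pair for a table of the given size."""
    if rule.by == "rows":
        return ("rows", min(int(rule.value), table_rows))
    if rule.by == "percent":
        return ("rows", int(table_rows * rule.value / 100))
    if rule.by == "bytes":
        return ("bytes", int(rule.value * UNIT_BYTES[rule.unit]))
    raise ValueError(f"unknown sampling method: {rule.by}")


rules = [
    RowSamplingRule(0, 10_000, "percent", 50),
    RowSamplingRule(10_001, 1_000_000_000, "bytes", 5, "MB"),
]
rule = find_rule(rules, 2_000)
print(describe_sample(rule, 2_000))
```

Defining one rule per size band is what lets small tables be sampled heavily while very large tables are capped at a fixed amount of data.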
Use Bucket Sampling: Check this option to apply bucket sampling. Hive IDP divides the buckets into groups of a specified size and then picks the bucket at a specified position in each group. It displays two options:
In the first box, specify the index number of the bucket to be picked and sampled in each group. In the second box, specify the number of buckets in each group.
For example, with a total of 32 buckets and the values shown in the screenshot above, Hive IDP divides the 32 buckets into four groups of 8 buckets and picks the 6th bucket of each group, that is, the 6th, 14th, 22nd, and 30th buckets.
Auto Select Buckets But Samples %: When this second option is selected, Hive IDP selects the buckets on its own. You only need to specify the percentage of buckets to be sampled.
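The bucket arithmetic can be checked with a small Python sketch. The function names are hypothetical, buckets are numbered from 1 to match the example with 32 buckets, and the round-up rule in the percentage variant is an assumption; this illustrates the selection logic only, not how Hive IDP queries Hive.

```python
import math
import random


def buckets_by_index(total_buckets, index_in_group, group_size):
    """Pick the bucket at position index_in_group within every group of group_size."""
    if not 1 <= index_in_group <= group_size:
        raise ValueError("index must be between 1 and the group size")
    # Bucket numbers are 1-based: index, index + group_size, index + 2 * group_size, ...
    return list(range(index_in_group, total_buckets + 1, group_size))


def buckets_by_percent(total_buckets, percent, rng=None):
    """Sketch of the automatic option: randomly pick a percentage of the buckets."""
    rng = rng or random.Random()
    # Rounding up and picking at least one bucket is an assumption.
    count = min(total_buckets, max(1, math.ceil(total_buckets * percent / 100)))
    return sorted(rng.sample(range(1, total_buckets + 1), count))


print(buckets_by_index(32, 6, 8))  # [6, 14, 22, 30]
print(len(buckets_by_percent(32, 25)))  # 8
```

The index-based selection resembles Hive's `TABLESAMPLE (BUCKET x OUT OF y)` clause; the source does not state whether Hive IDP uses that clause internally.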
After specifying the values, click the Add button to add the user-defined sampling configuration to the list. Click the Save button to save the configuration in the system; otherwise, click Cancel.
For the remaining steps, refer to step 3 of Create a Hive task.