Detection - BigQuery

Detection tasks scan the target source system for sensitive data elements.

  1. PK Protect provides a Sampling Configuration that limits the scope of the scan, which reduces the time taken for detection. By default, there are two options for sampling data from the database:

    1. Top 1000 Rows - Samples approximately 1,000 records from the top of the database.

    2. Read top 5% of data - Samples 5 percent of the data from the top of the database.

      By default, the Top 1000 Rows option is selected. To create a sampling configuration, go to the Google Cloud > BigQuery > Tasks > Sampling Configuration tab and click the + Add Configuration button.

      You can also create a sampling configuration in the Add New Task Definition screen by clicking the Add button next to the Sampling Configuration drop-down.


      Perform the following steps to create a sampling configuration:

      1. Enter the name and description of the sampling configuration in the Name and Description text boxes.

      2. Check the Set Sampling Config as Default option to make this sampling configuration the default for all tasks.

      3. Check the Show Advance Sampling Details option to configure the advanced sampling settings. The following options are available:

        1. File Size Range (Bytes): Enter the file size range for the sample, in bytes.

        2. To: Enter the upper limit of the range for the sample.

        3. By: Select how data is picked from the source system for sampling. There are two options:

          1. By Rows: Select Rows from the drop-down to sample data based on the number of rows.

          2. By Percent: Select Percent from the drop-down to sample a percentage of the data.

        4. Value: Enter a numeric value. It specifies the total number of records to be processed if By Rows is selected, or the percentage of data to be sampled if By Percent is selected.

        5. Type: Select the sampling type. The available options are: Top and Random.

          1. Top: The sample data for the scan is selected from the entries at the top of the table, based on the range specified.

          2. Random: The scan samples random entries in the database. The number of entries is based on the value entered in the Value field.

      4. After setting up the required configuration, click Add to add the user-defined sampling configuration to the list. Click Save to apply the changes, or click Cancel to discard them.

  2. Click the Show Advanced Options button to view the advanced task setting named Incremental.


    The Incremental option executes detection only on entries added to the database since the last scan, which significantly decreases the time taken to scan the database. To set up the Incremental option, perform the following steps:

    1. Check the Incremental checkbox.

    2. Select the type of increment to the database that must be considered for scanning, that is, the addition of New Partitions or New Columns/Tables to the database.

      1. New Partitions: Check this checkbox to scan the new partitions added to the table after the last scan was executed.

      2. New Columns/Tables: Check this checkbox to scan the new columns/tables added to the database after the last scan was executed.

        *Note: You can use database object filters in BigQuery to specify which tables and/or columns to include or exclude during a detection scan. Once a filter is defined, only the tables and/or columns that match the filter are scanned.

        Perform the following steps to run object filters in BigQuery:

  1. Create a JSON file and provide the filter definitions in the following format:

      [
        {
          "colOperator": "equalsto",
          "columnFilter": "SSN,CCNO",
          "dataset": ["dataset1","dataset2"],
          "projectId": ["projectId1","projectId2"],
          "filterId": 18,
          "filterListType": 1,
          "joinOperator": "OR",
          "tabOperator": "equalsto",
          "tableFilter": "EmailSSN20"
        },
        {
          "colOperator": "notstartswith",
          "columnFilter": "*",
          "dataset": "*",
          "projectId": "*",
          "filterId": 19,
          "filterListType": 1,
          "joinOperator": "OR",
          "tabOperator": "contains",
          "tableFilter": "*"
        }
      ]

  2. Store the JSON file at a location that is accessible to the BigQuery IDP.

  3. In the ‘bigquery-config.properties’ file, set the ‘dg.filter.object.file.path’ property to the absolute path of the JSON file. If no path is specified, no filter is applied.

  4. Create and execute a BigQuery detection task in PK Protect.

*Note: You can provide comma-separated values in the ‘columnFilter’ and ‘tableFilter’ fields.
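
Hand-written JSON is easy to break with curly quotes or a missing comma, so it can help to generate the filter file from a script. The following is a minimal Python sketch under stated assumptions: the helper names build_filter and write_filter_file are hypothetical and not part of PK Protect, the field names are taken from the sample format above, and the value of "filterListType" is simply copied from that sample because its meaning is not described here.

```python
import json
from pathlib import Path

# Operator and join values as listed in this document.
OPERATORS = {
    "equalsto", "notequalsto", "contains", "notcontains",
    "startswith", "notstartswith", "endswith", "notendswith",
}
JOIN_OPERATORS = {"AND", "OR"}


def build_filter(filter_id, table_filter, column_filter,
                 tab_operator="equalsto", col_operator="equalsto",
                 join_operator="OR", dataset="*", project_id="*"):
    """Build one filter entry (hypothetical helper, not part of PK Protect)."""
    for op in (tab_operator, col_operator):
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
    if join_operator not in JOIN_OPERATORS:
        raise ValueError(f"unsupported joinOperator: {join_operator}")
    return {
        "colOperator": col_operator,
        # Multiple names are passed as one comma-separated string.
        "columnFilter": ",".join(column_filter),
        "dataset": dataset,
        "projectId": project_id,
        "filterId": filter_id,
        # Copied from the sample format; its meaning is not documented here.
        "filterListType": 1,
        "joinOperator": join_operator,
        "tabOperator": tab_operator,
        "tableFilter": ",".join(table_filter),
    }


def write_filter_file(filters, path):
    """Write the filter list as JSON and return the absolute path."""
    target = Path(path)
    # json.dumps always emits straight double quotes, as JSON requires.
    target.write_text(json.dumps(filters, indent=2), encoding="utf-8")
    return target.resolve()
```

For example, write_filter_file([build_filter(18, ["EmailSSN20"], ["SSN", "CCNO"])], "bigquery-filter.json") writes a one-entry file (the file name is a placeholder) and returns an absolute path that can be used as the value of ‘dg.filter.object.file.path’.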

Points to remember:

  • There are eight colOperator and tabOperator values that you can use to select table and column names: equalsto, notequalsto, contains, notcontains, startswith, notstartswith, endswith, and notendswith.

  • There are two joinOperator values that combine multiple filter conditions: ‘AND’ and ‘OR’.

  • The Datatype Filter and Upload Filter List functionalities are not yet supported in BigQuery.
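
To make the operator names above concrete, here is a small Python sketch of how such a filter entry could be evaluated against one table and column name. It is an illustration only, not PK Protect's matching logic. It assumes that matching is case-sensitive, that comma-separated values are alternatives, and that joinOperator combines the table condition with the column condition. The "*" wildcard is not modeled. The function names matches and filter_selects are hypothetical.

```python
# Illustrative only: models the operator names listed above,
# not the product's internal behavior.
POSITIVE = {
    "equalsto": lambda name, value: name == value,
    "contains": lambda name, value: value in name,
    "startswith": lambda name, value: name.startswith(value),
    "endswith": lambda name, value: name.endswith(value),
}


def matches(operator, filter_value, name):
    """Apply one of the eight operators to a table or column name."""
    negated = operator.startswith("not")
    base = operator[3:] if negated else operator
    if base not in POSITIVE:
        raise ValueError(f"unknown operator: {operator}")
    # Assumption: comma-separated values are treated as alternatives.
    values = [v.strip() for v in filter_value.split(",") if v.strip()]
    hit = any(POSITIVE[base](name, v) for v in values)
    return not hit if negated else hit


def filter_selects(entry, table, column):
    """Combine the table and column conditions with the entry's joinOperator."""
    table_ok = matches(entry["tabOperator"], entry["tableFilter"], table)
    column_ok = matches(entry["colOperator"], entry["columnFilter"], column)
    if entry["joinOperator"] == "AND":
        return table_ok and column_ok
    if entry["joinOperator"] == "OR":
        return table_ok or column_ok
    raise ValueError(f"unknown joinOperator: {entry['joinOperator']}")
```

Under these assumptions, an entry with "joinOperator": "OR" selects a column when either the table condition or the column condition holds, while "AND" requires both.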


For the remaining steps, refer to step 3 of Create a BigQuery task.
