
PK Protect Data Store Manager-FAQs

  1. What is PK Protect and what are its functions?
    PK Protect is a complete data security solution that enables enterprises to leverage their data to achieve greater business goals while minimizing the risk of exposure and of running afoul of data handling regulations such as PII, PCI, HIPAA, and GDPR. PK Protect provides core solutions for finding sensitive data (DETECT), encrypting or masking it (PROTECT), and providing an overall report (AUDIT).

  2. What are the system requirements to install PK Protect?
    The DSM Administrator’s machine must have the following configuration to install PK Protect:

    • Apache Tomcat (Tomcat 9.x)

    • Operating System 64 bit: For Linux: CentOS 7.9, RedHat 8 and 8.3, Ubuntu 18 and 20, Rocky Linux 9.1. For Windows: Windows 10, Windows Server 2012 and 2019.

    • Authentication Systems: The options include Active Directory Server, LDAP, and DB Authentication.

    • Metadata Repository: Options include PostgreSQL, MySQL, Oracle, and SQL Server.

    • Memory: Minimum 16 GB (RAM) (32 GB is preferred)

    • Disk Space: Minimum 60 GB

      The IDP’s machine must have the following configuration:

    • Operating System 64 bit: CentOS 6.4 or 6.7, CentOS 7.0, RHEL 6.5, RHEL 7.2, OpenSuSe 11.4, OpenSuSe 13.2

    • Memory: Minimum 8 GB (RAM)

    • Disk Space: Minimum 100 GB

    • HDFS IDP should be deployed on an edge node instead of a head node or worker node.

*Note: PK Protect uses the bundled JDK (Java Development Kit) and JRE (Java Runtime Environment) files.

  1. What are the conditions for installing PK Protect through non-root?
    To run the DSM Administrator as a non-root user, 777 permissions are required on the /tmp directory, and the user needs to manually start the PK Protect IDP services from the respective IDP directories after completing the installation.

  2. How can license be configured in PK Protect?
    A user must have an appropriate license configured to use PK Protect. In the case of a new installation, the user is prompted to specify the path to the license (.lic) file provided by the PK Protect team. After completing the necessary configuration, the user will be able to use the features listed in the installed license.

  3. What is the default path of PK Protect installation?
    The default path of PK Protect is /opt/Dataguise. It can be changed at the time of installation.

  4. What is the role of the DSM Administrator in PK Protect?
    The DSM Administrator acts as a mediator between the IDPs and the PK Protect front end. It authenticates users, routes requests to the appropriate IDP(s), conveys results back to the user, and logs all activities for auditing purposes.

  5. What are the privileges of a user with DEFAULT role?
    Users with the Default role have ‘Product’ and ‘Owner’ access on Policies, Sensitive Types, Domains, Structures, etc., by default. This means that users with this role can see the objects they created in addition to the predefined ones. They can also be given access to objects created by other users.

  6. Which databases are supported by PK Protect?
    The metadata repository database is used to store and retrieve user details, access control, IDP details, task definitions, and task results. PK Protect supports the following databases: PostgreSQL, MySQL, SQL Server, and Oracle; RDS variants of these are also supported.

  7. What is an IDP and how many IDPs does PK Protect support?
    IDP stands for Intelligent Data Processor. An IDP is the main component that processes data according to the instructions received from the DSM Administrator. PK Protect has the following IDPs:

    • Azure Cloud IDP

    • Azure Data IDP

    • DBMS Detection IDP

    • DBMS Masking IDP

    • DgWalker IDP

    • File IDP

    • Google BigQuery IDP

    • Google Cloud IDP

    • HBase IDP

    • Hadoop Controller IDP

    • Hadoop Data IDP

    • Hive IDP

    • NoSQL Detection IDP

    • Privacy IDP

    • S3 Cloud IDP

    • S3 IDP

  8. Can we install DSM Administrator and IDPs on different machines?
    Yes, DSM Administrator and IDPs can be installed on different machines.

  9. What is a sensitive type?
    A sensitive type is the basic building block used by PK Protect to identify sensitive data in the repositories it scans. Sensitive types are data elements within databases that indicate or comprise private and confidential information. For example, account number and credit card number can be considered sensitive types under the PCI policy. A user can also create custom sensitive types to search for industry-specific data unique to an enterprise.

  10. How is sensitive data different from a sensitive type?
    Sensitive data is the vital information of the user that needs to be secured from unauthorized access, e.g., social security number, account number, etc. On the other hand, a sensitive type is the component used by PK Protect to identify sensitive data within databases or file stores.

  11. What is the difference between Task and Task Instance?
    A Task is the primary unit of work, consisting of one or more policies, an action, and a target scan path. A Task Instance, on the other hand, is the unique record generated by each execution of the corresponding task.

  12. What is metadata repository?
    A metadata repository is a database created to store metadata. Metadata is information about the structures that contain the actual data.

  13. What is a policy?
    A policy contains a set of guidelines which are created to protect the sensitive information. A Policy is a combination of sensitive data types which are related to regulatory compliance requirements. Currently, PK Protect supports PII, PCI, HIPAA, and GDPR policies.

  14. Can a user create their own policy?
    Yes, PK Protect allows users to create a custom policy unique to an enterprise by adding additional sensitive types according to the corresponding industry-specific data. Note that a policy cannot be deleted once it has been assigned to a task.

  15. Is there any limit to define sensitive types in a policy?
    No, there is no limit to the number of sensitive types that can be associated with the policy. Sensitive types can be added by sensitive group or individually.

  16. Can a user delete a policy after creating it?
    A policy can’t be deleted after its creation if it has been assigned to any task. It can be hidden from the policy screen by checking the checkbox corresponding to the policy and clicking the hide button. To show hidden policies, click the <UserName> at the bottom of the navigation panel. Once the drop-down menu appears, select User Preferences. Check the Show Hidden Items checkbox and click Save.

  17. What are the conditions to define the name of a policy?
    The primary condition is that the name must be unique to each individual policy. The policy name field can hold up to 256 characters and accepts letters, numbers, and symbols. Also, the policy name cannot be edited once it is defined.

  18. What is data detection?
    Data Detection is a process of locating and identifying sensitive data in any data stores or file stores, based on the defined policy.
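    As an illustration of how regex-based detection can work, here is a minimal, hypothetical Python sketch (the pattern names and regexes are illustrative assumptions, not PK Protect's actual detection logic):

```python
import re

# Hypothetical regex patterns for two common sensitive types
# (illustrative only -- not PK Protect's actual detection rules).
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect(text):
    """Scan free text and report the hits found for each sensitive type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

hits = detect("Contact jane@corp.com, SSN 123-45-6789.")
# hits["US_SSN"] -> ["123-45-6789"], hits["EMAIL"] -> ["jane@corp.com"]
```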

  19. What is metadata discovery?
    Metadata Discovery is a process which scans the database to provide information about the type of data available in the database.

  20. What is data masking?
    Data masking is the process of replacing sensitive data with fictitious content using one of the many available protection options, or creating a cipher that restricts access to selected users.

  21. What is the full form of CUPS?
    CUPS refers to the data consistency options Consistent, Unique, Persistent, and Synchronized. These are parameters provided by PK Protect that determine how masking is applied to the data. Besides these, there are two other parameters: Keep Null and Stateless. Masking results can be enhanced by combining masking options with these data consistency options to generate fictitious but usable masked results.
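    The Consistent and Unique options can be illustrated with a toy Python sketch (an illustration of the concept only, not PK Protect's internal implementation): every distinct input gets its own fictitious value, and repeats reuse it.

```python
class ConsistentUniqueMasker:
    """Toy illustration of the C and U options: each distinct input gets
    its own fictitious value (Unique), and repeats reuse it (Consistent)."""

    def __init__(self):
        self._mapping = {}

    def mask(self, value):
        if value not in self._mapping:
            # Issue a new fictitious token; the counter guarantees uniqueness.
            self._mapping[value] = f"MASKED-{len(self._mapping) + 1:04d}"
        return self._mapping[value]

m = ConsistentUniqueMasker()
out = [m.mask(v) for v in ["alice@x.com", "bob@x.com", "alice@x.com"]]
# out[0] == out[2] (consistent), out[0] != out[1] (unique)
```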

  22. What is the difference between Masking and Encryption?
    In masking, the sensitive data is replaced with fictitious content using various masking methods; once data is masked, it cannot be restored to its original form. Encryption, by contrast, converts the sensitive data into ciphertext using a key, so that only certain users can decrypt it. Encrypted data can be restored through decryption.

  23. What is the difference between static and dynamic masking?
    Static data masking permanently replaces sensitive data by altering data at rest. Dynamic data masking aims to replace sensitive data in transit leaving the original at-rest data intact and unaltered.

  24. What do you understand by consistency during masking?
    To maintain consistency means that a value will be masked in the same way each time it occurs. It is significant in case of columns containing duplicate values.

  25. How can Persistent masking be useful for automation testing?
    With the Persistent option, the masked values are saved for future use. If this option is selected, reference tables are created at the backend to ensure that any future masking operations yield consistent masked data. This feature is very helpful in setting up test scenarios for automation.
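    Conceptually, the backend reference table behaves like a persisted lookup, as in this hedged Python sketch (the in-memory dict stands in for the reference table; the actual implementation is not shown here):

```python
import secrets

reference_table = {}  # stands in for the backend reference table

def mask_persistent(value):
    """Return the previously issued mask for a value, or create and store one,
    so later masking runs reproduce the same fictitious value."""
    if value not in reference_table:
        reference_table[value] = "".join(
            secrets.choice("0123456789") for _ in value
        )
    return reference_table[value]

first = mask_persistent("4111-1111-1111-1111")
second = mask_persistent("4111-1111-1111-1111")  # looked up, not regenerated
```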

  26. What is the purpose of synchronization in data masking?
    The purpose of the Synchronization option is to preserve relational integrity across data sets. In large databases, reference and ID columns are often mapped to several tables. To avoid creating multiple tasks and masking each table separately, the Synchronize (S) option is used. If this option is checked against a column, PK Protect masks all columns with similar entries using the specified masking option.
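    The idea can be sketched in Python: one shared mapping is applied across every table, so rows that joined on the column before masking still join after it (a simplified illustration, not PK Protect's code):

```python
def mask_synchronized(tables, column):
    """Apply one shared mapping across all tables so related key columns
    stay in sync after masking."""
    mapping = {}
    for rows in tables:
        for row in rows:
            if row[column] not in mapping:
                mapping[row[column]] = f"ID-{len(mapping) + 1:03d}"
            row[column] = mapping[row[column]]
    return tables

orders = [{"cust_id": "C1"}, {"cust_id": "C2"}]
invoices = [{"cust_id": "C2"}, {"cust_id": "C1"}]
mask_synchronized([orders, invoices], "cust_id")
# "C1" masks to the same token in both tables, preserving the join
```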

  27. What is the objective of Stateless Masking?
    Stateless Masking is used to mask data consistently. This option yields results similar to the Persistent option without creating reference tables.
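    One common way to get consistent results without a reference table is a keyed hash. The following Python sketch illustrates that idea (an assumed technique with a hypothetical key, shown for illustration; not necessarily how PK Protect implements Stateless masking):

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical key; real deployments manage keys securely

def mask_stateless(value):
    """A keyed hash yields the same fictitious digits for the same input on
    every run -- consistent results without storing any reference table."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    # Map hex digits onto decimal digits, preserving the input length.
    return "".join(str(int(ch, 16) % 10) for ch in digest[: len(value)])

# Re-running the function (even in a fresh process) reproduces the same mask.
masked = mask_stateless("123456789")
```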

  28. Where can you define the directory path for output masking results?
    The user can define the output directory by accessing the Domain screen and specifying the destination directory under the SELECT A DIRECTORY/DB/NAMESPACE/TABLE panel. This page shows the mapping of source directories and destination directories to which each has been mapped.

  29. What are the permissible characters that can be used in Character masking?
    The following characters can be used in Character masking to replace one or more characters at the beginning or end of a string:

    • Hash (#)

    • Dollar ($)

    • At the rate (@)

    • Percent (%)

    • Period (.)
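    A minimal Python sketch of Character masking with the permissible replacement characters listed above (illustrative only, not PK Protect's implementation):

```python
def character_mask(value, n, mask_char="#", from_end=False):
    """Replace n characters at the beginning (or end) of a string with one
    of the permissible mask characters."""
    if mask_char not in "#$@%.":
        raise ValueError("mask character must be one of # $ @ % .")
    if from_end:
        return value[:-n] + mask_char * n
    return mask_char * n + value[n:]

head = character_mask("1234567890", 6)             # "######7890"
tail = character_mask("1234567890", 4, "$", True)  # "123456$$$$"
```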

  30. Where is Format Preservation Masking (FPM) used?
    FPM is used where a random value can be inserted while preserving the format of the original data. FPM behaves as follows:

    • The number of characters or the length of the field is not changed.

    • Capital letters are masked with random capital letters.

    • Small letters are masked with random small letters.

    • Digits are masked with random digits.

    • Special characters are left as they are.
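    The rules above can be sketched in Python (a simplified illustration of the technique, not PK Protect's implementation):

```python
import random
import string

def format_preserving_mask(value):
    """Apply the FPM rules: uppercase -> random uppercase, lowercase ->
    random lowercase, digit -> random digit, special characters untouched."""
    out = []
    for ch in value:
        if ch in string.ascii_uppercase:
            out.append(random.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(random.choice(string.ascii_lowercase))
        elif ch in string.digits:
            out.append(random.choice(string.digits))
        else:
            out.append(ch)  # special characters are left as they are
    return "".join(out)

masked = format_preserving_mask("AB-1234-xy")
# same length, dashes preserved, letter/digit character classes preserved
```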

  31. What is the purpose of Intellimask?
    Intellimask is used to replace a subset of characters with random values based on a regular expression. In this scenario, only the specific sensitive portion is masked, and the rest of the non-sensitive portion remains intact.
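    Regex-driven partial masking can be sketched as follows (the pattern and the card-number example are illustrative assumptions, not PK Protect's code):

```python
import random
import re

def intellimask(value, pattern):
    """Replace only the regex-matched portion with random digits; the
    non-sensitive remainder stays intact."""
    def repl(match):
        return "".join(random.choice("0123456789") for _ in match.group())
    return re.sub(pattern, repl, value)

# Hypothetical example: mask the two middle groups of a card number,
# keeping the first and last four digits visible.
masked = intellimask("4111-2222-3333-4444", r"(?<=-)\d{4}(?=-)")
```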

  32. Where is random masking used?
    Random masking generates random data values according to the selected option (such as Random String, Credit Card Numbers, Full Address, Date, etc.), with little dependency on or relation to the original input data.

  33. Does a user need to make any configurational change to perform masking/encryption of Snappy compressed files?
    The following property needs to be updated in jetty-embedded.properties file:
    JavaOptions=-Dhdp.version=<HadoopVersion> -Djava.library.path=<path of snappy lib files>

  34. What is the primary condition to configure Custom Lookup Masking?
    To configure Custom Lookup masking, the user needs to specify the two columns that will form the join, i.e., the Base Reference Column (in the table of actual data) and the Lookup Reference Column (in the table of fictitious data). During execution of masking, the column with fictitious data replaces the actual data. The Lookup Reference Column must contain unique data for Custom Lookup masking to successfully mask the data.
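    The join and the uniqueness condition can be illustrated with a small Python sketch (table and column names are hypothetical):

```python
def custom_lookup_mask(base_rows, lookup_rows, ref_col, lookup_ref_col,
                       fict_col, mask_col):
    """Replace base_rows[mask_col] with the fictitious value joined via
    ref_col = lookup_ref_col. The lookup reference column must be unique."""
    keys = [row[lookup_ref_col] for row in lookup_rows]
    if len(keys) != len(set(keys)):
        raise ValueError("Lookup Reference Column must contain unique values")
    lookup = {row[lookup_ref_col]: row[fict_col] for row in lookup_rows}
    return [{**row, mask_col: lookup[row[ref_col]]} for row in base_rows]

base = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
fict = [{"id": 1, "alias": "Zelda"}, {"id": 2, "alias": "Quinn"}]
masked = custom_lookup_mask(base, fict, "id", "id", "alias", "name")
# masked[0]["name"] == "Zelda", masked[1]["name"] == "Quinn"
```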

  35. What are attributes?
    Attributes are variables that filter data and control how it is presented on PK Protect’s dashboards. Attributes compare sensitive information identified in databases or file stores across the various departments of the organization.

  36. What information does the Home page of PK Protect provide to the user?
    The Home page is the landing page upon logging into PK Protect. It displays different widgets that act as navigation shortcuts to the other screens. These widgets display information such as the running task instances a user is authorized to see along with associated notifications, reports, and the number of active, inactive, decommissioned, and total IDPs and controller services.

  37. How can notifications be classified?
    There are two types of notifications: Event-Based and Time-Based. The event-based notification will be received whenever an event is triggered and time-based notification is received after the set period.

  38. What kinds of reports does PK Protect provide to the user and how are they helpful?
    PK Protect provides audit reports, entitlement reports, time-based reports, etc., which display the types of sensitive data detected, the number of sensitive columns searched and masked, the location of sensitive columns, and many other useful aspects of data security.

  39. What is the difference between Audit reports by user and by event?
    The audit reports by user provide information about the users, their roles, and their contact information. The audit reports by event display the audited logs, including event type, user/role, event details, and time stamp.

  40. What kind of information does the Entitlement Report provide?
    The Entitlement Report can be processed either for a specific user/group or for a directory/file. If processed for a user/group, it indicates their access rights for a specific file or directory. If processed for a directory/file, it indicates the access rights of all users or of a specific user/group. The report indicates which files were scanned, contain sensitive data, and have been masked and/or encrypted.

  41. What is the significance of the Time-based report?
    The Time-based report contains various charts which display the following vital information; clicking a data point in one of the three charts opens the Time-based Report Details page:

    • Total number of files searched, encrypted and/or masked.

    • Total number of sensitive files found.

    • The number of files with sensitive data by policy.

  42. What is the significance of Reports Diff tab?
    The Report Diff tab allows insight into the similarities and differences between instances of a task. This comparison is useful in determining whether the risk profile of a target database is improving over time.

  43. What is the requirement of Custom Functions?
    Custom masking enables users to write their own logic for masking/protecting sensitive data. Custom validation functions help validate the discovery results and can improve accuracy; the user can define criteria that must be fulfilled to mark a token as sensitive.

    A custom function is created at the backend. The user can view and select these user-defined functions through the Policy screen or Static Masking screen of RDBMS. Custom functions are not supported in Privacy.

  44. What are the different file types supported by PK Protect for Hadoop?
    The file formats that can be processed by PK Protect for Hadoop are text (RTF, LOG, DAT), Sequence, RC, ORC, CSV, ZIP, PDF, Excel (XLS, XLSX), DOC, DOCX, PPT, PPTX, XML, JSON, Avro, Parquet, and EDI.

  45. How can HDFS IDP act as Local File IDP?
    In the HDFSIDPConfig.properties file, set the value of the ‘Distro’ property to ‘local’ to make the HDFS IDP act as a Local File IDP.

  46. What is the function of PK Protect Cloud IDP?
    The Cloud IDP is responsible for interacting with the Google Cloud and AWS platforms to spin up clusters, manage the cluster lifecycle, install the GCS IDP on the cluster, and browse buckets in the cloud platforms.

  47. Is it possible to access a mounted drive through the Local File IDP (LFA)?
    Yes, it is possible to access a mounted drive through the Local File IDP.

  48. What are the various decryption methods provided by PK Protect?
    PK Protect provides two decryption methods: Bulk Decryption and On-The-Fly (OTF) Decryption. Bulk decryption decodes specific policies or sensitive types provided the user has the appropriate permissions. OTF decryption decodes individual data elements and requires the use of a decryption library. In OTF decryption, data is decrypted on request from any user or application while reading data from physical storage.

  49. Where can a user find files for task results?
    The files for task results can be found at:
    hadoop fs -cat /dataguise$/results/(taskname)/(task_instance_ID)
    Detailed results can be found at:
    hadoop fs -cat /dataguise$/results/(taskname)/(task_instance_ID)/(summary_results_structured)

    The Field Results are only available for Detection tasks executed for structured files. This file provides information such as, sensitive types discovered, type of scan (full vs incremental), ratio of matched to total rows, and total number of columns in which sensitive data was found.

  50. Which are the files that a user needs to copy to process Snappy compressed file?
    The following files need to be copied from the cluster to any path on the machine where PK Protect is installed in order to process Snappy compressed files:

    • libsnappy.so

    • libsnappy.so.1

    • libsnappy.so.1.1.4

  51. At what path does the Hadoop configuration file get stored on a machine?
    The Hadoop configuration file is stored at the following path:
    {InstallationPath}/DgSecure/IDPs/HDFSIDP/expandedArchive/WEB-INF/classes/com/dataguise/hadoop/util/HDFSIDPConfig.properties

  52. Which are the files that a user needs to copy from the cluster before running HDFS installer?
    The following XML files need to be copied from the cluster and put into a single directory on the Linux server where the HDFS IDP will be set up:

    • hdfs-site.xml

    • core-site.xml

    • yarn-site.xml

    • mapred-site.xml
      After installation, hadoopConfigPath will be set to the absolute path of the directory containing the Hadoop config files.

  53. What types of tasks can be executed using PK Protect AWS?
    Currently, PK Protect can execute two types of AWS tasks: S3 and RDS/Redshift. S3 tasks search for sensitive data hosted in Amazon S3, while RDS/Redshift tasks search and mask sensitive data found in AWS databases.

  54. What is the default location of logs that are generated after installing/updating PK Protect?
    After installing or updating PK Protect, logs are generated in the /tmp folder with a file name in the format bitrock_installer.log.

  55. How can a user Encrypt or Decrypt a string using PK Protect?
    A user needs to execute the following steps to encrypt/decrypt a string:

    1. In Linux:

      1. Navigate to /opt/Dataguise/DgSecure/contrib

      2. Execute the following command:

        1. For encryption:
          java EncString <string> -E

        2. For decryption:
          java EncString <string> -d

    2. In Windows:

      1. Open command prompt

      2. Navigate to C:\Program Files\Dataguise\DgSecure\contrib

      3. Execute the following command:

        1. For encryption
          java EncString <string> -E

        2. For decryption
          java EncString <string> -d

  56. Does PK Protect support LDAP Authentication? If yes, how can a user change DB Authentication to LDAP Authentication?
    Yes, PK Protect supports LDAP Authentication. To modify authentication, log into Admin and navigate to the Authentication screen. Select ‘Active Directory’ as the directory type and ‘ldap’ as the protocol. Then, enter the required details.

  57. What does the Oracle error “ORA-01000: maximum open cursors exceeded” mean and how can it be resolved?
    The initialization parameter OPEN_CURSORS determines the maximum number of cursors per user. By default, its value is 300. This error occurs when the application attempts to open more ResultSets than the number of cursors configured on the database instance. To resolve the issue, raise the value of OPEN_CURSORS by executing the following command:
    ALTER SYSTEM SET open_cursors = 400 SCOPE=BOTH;

  58. What are Assets in the Assets in Scope chart?
    The Assets in Scope charts provide an overview of a user’s data assets, both on-premises and in the cloud. One asset is equal to one Hadoop cluster or one database.

  59. What is the objective of Mandatory Field Name Match option?
    Mandatory Field Name Match is a flag provided to exclude results that do not fall under the specified column heading. If a header does not match the column header regex, the results are not displayed even if the column contains sensitive data.
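    The effect of the flag can be sketched in Python: a hit is kept only when its column header matches the mandatory header regex (a simplified illustration with hypothetical field names):

```python
import re

def filter_by_header(results, header_pattern):
    """Keep a hit only when its column header matches the mandatory regex,
    even if the column values themselves look sensitive."""
    regex = re.compile(header_pattern, re.IGNORECASE)
    return [hit for hit in results if regex.search(hit["column"])]

hits = [
    {"column": "ssn", "value": "123-45-6789"},
    {"column": "order_ref", "value": "123-45-6789"},  # same shape, wrong header
]
kept = filter_by_header(hits, r"ssn|social")
# only the 'ssn' column survives the header check
```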

  60. What is the functionality of Detect if Found With option?
    When an element’s sensitivity is dependent on another element, it can be useful to build a sensitive type with built-in dependencies. For example, CCNO is dependent on CVV, First Name is dependent on Last Name etc. A simple relationship can be established in such a scenario, by using the Detect if Found with option when creating a new Sensitive Type.
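    A simple dependency rule of this kind can be sketched in Python (the column names and regexes are illustrative assumptions):

```python
import re

def detect_if_found_with(columns, target_regex, companion_regex):
    """Flag columns matching target_regex as sensitive only when some other
    column in the same data set matches companion_regex."""
    has_companion = any(
        re.fullmatch(companion_regex, col, re.I) for col in columns
    )
    if not has_companion:
        return []
    return [col for col in columns if re.fullmatch(target_regex, col, re.I)]

# CVV is only reported when a card-number column is also present:
found = detect_if_found_with(["cvv", "ccno", "name"], r"cvv", r"ccno|card_number")
alone = detect_if_found_with(["cvv", "name"], r"cvv", r"ccno|card_number")
```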

  61. What do you understand by the term False Positives?
    Sometimes, non-sensitive data is mistakenly identified as sensitive because it resembles common sensitive types of data. Such records are referred to as False Positives.

  62. Is there any way to avoid reporting False Positives in PK Protect?
    There are some options to reduce False Positives in PK Protect:

    • Using the Results Review functionality, which allows you to mark a record as Non-sensitive or Sensitive, or to set its Confidence Factor value.

    • Tweaking the value of confidence factor.

    • Enhancing custom sensitive types by improving regexes, using inclusion/exclusion list, using header matches etc.

    • Changing sample size.

    • Using the ‘Edit Safelist’ option on the UI, through which false positives can be reported by adding them to an exception list called the Safelist. The data columns in this list are excluded from future searches.
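    Conceptually, the Safelist acts as an exclusion filter over detection results, as in this hedged Python sketch (the field names are hypothetical):

```python
def apply_safelist(detections, safelist):
    """Drop detection results whose (table, column) pair has been reported
    as a false positive and added to the safelist."""
    return [d for d in detections
            if (d["table"], d["column"]) not in safelist]

detections = [
    {"table": "hr.employees", "column": "ssn"},
    {"table": "sales.orders", "column": "order_ref"},  # known false positive
]
safelist = {("sales.orders", "order_ref")}
kept = apply_safelist(detections, safelist)
# only the hr.employees hit remains
```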

  63. Why do we get the error “User hasn’t ACL permission to execute decryption task” while executing a decryption task?
    There is a difference between executing encryption and decryption tasks. To execute a decryption task, a new role needs to be defined on the Role Management screen, and that role should be mapped to the appropriate Access Control List (ACL). If such a role is not created, the user will get this error, since decryption restores sensitive data to its original state.
