Snowflake Data Classification — Script

John Thuma
5 min readSep 28, 2022

--

Classification enables answering questions about the data stored in tables and views, such as:

  • Does the table/view contain PII (Personally Identifiable Information) or sensitive data?
  • Where is the data stored and how long has it been stored?
  • How can the data be protected from exposure while still deriving insights?

The classification process samples all the supported columns in a table or view and uses the column names and values to classify the data into system categories provided by Snowflake. The categories can be assigned to the columns as tags, which can be set manually or using the provided stored procedure.

Once the tags produced by classification have been assigned to a table, view, or column, they can be used to enable a variety of data governance, sharing, and privacy use cases, including:

PII Classification

You can use classification to identify PII (Personally Identifiable Information) in your data to mitigate risk and meet compliance.

Data Access

You can use classification tags to configure security controls to prevent unauthorized access to personal data.

Policy Management

You can use classification tags to determine how to set masking policies to protect the privacy of the data.

Anonymization

You can use classification to streamline anonymization of personal data. Anonymization relies on classification privacy categories to protect the identity of the associated subjects while still making their data available for analysis.

Snowflake supports classifying data stored in all types of tables and views, including:

  • External tables
  • Materialized views
  • Secure views

Classification can be performed on table/view columns of all supported data types except for the following data types:

  • GEOGRAPHY
  • BINARY
  • VARIANT

If a table/view contains columns that are not of a supported data type or the column contains all NULL values, the classification process ignores the columns and does not include them in the output.

Classification Categories

Snowflake utilizes two category types for classifying data in table/view columns:

  • Semantic categories
  • Privacy categories

Semantic Categories

A semantic category identifies a column as storing personal attributes. Some of the semantic categories supported by Snowflake include:

  • Name
  • Address
  • Zip code
  • Phone number (currently US numbers only)
  • Age
  • Gender

For a complete list of the semantic categories supported in the current release, see Category Tag Values and Mappings. Additional semantic categories will be added in future releases.

Privacy Categories

If a column is determined to have a semantic category, the column is further classified according to one of the following privacy categories:

Identifier

Also known as direct identifiers, these attributes uniquely identify an individual (e.g. name, social security number, or phone number).

Quasi-identifier

Also known as indirect identifiers, these attributes, when combined with other attributes, can be used to uniquely identify an individual (e.g. age + gender + zip).

Sensitive

Personal attributes that are not identifying, but are information that individuals do not want disclosed for privacy reasons (e.g. salary or medical/healthcare status).

This is the output of my example script below. Notice that PAN and a few other fields did not get tagged

THE SCRIPT:

use role useradmin;
create role tag_admin;
use role accountadmin;
grant create tag on schema HC_DEMO.PUBLIC to role tag_admin;
grant apply tag on account to role tag_admin;

use role tag_admin;
use schema HC_DEMO.PUBLIC;
create tag cost_center;

— NEED TO TURN ON DATA CLASSIFICATION
/*
ALTER ACCOUNT CGA39353 SET
ENABLE_DATA_CLASSIFICATION=true,
ENABLE_TAGGING=true,
PARAMETER_COMMENT=’SNOW-273085: Enable Data Classification PrPr’;
*/



— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

— The category tags are stored in the DATA_PRIVACY schema in the SNOWFLAKE read-only shared database. To view the tag names, you can use the SHOW TAGS command.
— — For the SEMANTIC_CATEGORY tag, the possible tag values are the semantic categories (e.g. NAME, AGE).
— — For the PRIVACY_CATEGORY tag, the possible tag values are the privacy categories (e.g. IDENTIFIER, QUASI_IDENTIFIER).

use schema snowflake.data_privacy;

show tags;

— To classify data in a table or view, the role used to classify the data must have the following privileges:
— — IMPORTED PRIVILEGES privilege on the SNOWFLAKE read-only shared database.
— — SELECT and UPDATE privileges on the table/view, as well as the USAGE privilege on the database and schema where the table/view is located.

— Classification utilizes tags, which means the appropriate tagging privileges are required to perform certain classification tasks.
— — -For example, the global APPLY TAG privilege is required to enable applying classification category tags to table/view columns.

— Functions and Stored Procedures
— — Snowflake provides the following functions and stored procedures for classifying data and applying the generated categories as tags:

— — Function: EXTRACT_SEMANTIC_CATEGORIES

— — Stored Procedure: ASSOCIATE_SEMANTIC_CATEGORY_TAGS (utilizes caller’s rights)

— In addition, the following tag functions and views can be used for viewing and tracking the applied tags:

— — TAG_REFERENCES View (in Account Usage): View classification tags for all or specified columns. Note that the latency for the view may be up to 120 minutes.
— — TAG_REFERENCES table function (in Information Schema): Return the classification tags for a specified column.
— — SYSTEM$GET_TAG function: Return the tag value for a specified tag associated with a column.

— EXAMPLE:

— RESET
DROP database “Data_Classification”

use role accountadmin;
create or replace database Data_Classification;
use database Data_Classification;

create or replace schema dc;

use schema dc;

create or replace table Data_Classification.dc.Personal_Data
(“first_name” VARCHAR,
“last_name” varchar,
“street_address” varchar,
“city” varchar,
“state” varchar,
“country” varchar,
“zipcode” varchar,
“PAN” varchar,
“CVV” varchar,
“mobile” varchar,
“email” varchar,
“person_sk” integer,
“SSN” integer,
“lat” varchar,
“long” varchar

) COMMENT = ‘table has fake personal data!’;

— load data

create stage fake_persons
storage_integration = si_hc_demo
url = ‘s3://unstruct-data-hc-demo/fake_persons/’;

list @fake_persons;

copy into Personal_Data from @fake_persons FILE_FORMAT = ( TYPE = ‘CSV’ field_delimiter = ‘,’ skip_header = 1 validate_UTF8=false ) on_error=’CONTINUE’;

select count(*) from PERSONAL_DATA;

select * from DATA_CLASSIFICATION.DC.PERSONAL_DATA limit 10;

— COMMANDS TO RUN TO CLASSIFY DATA
select extract_semantic_categories(‘DATA_CLASSIFICATION.DC.PERSONAL_DATA’);

call associate_semantic_category_tags(‘DATA_CLASSIFICATION.DC.PERSONAL_DATA’,
extract_semantic_categories(‘DATA_CLASSIFICATION.DC.PERSONAL_DATA’));

— Applied tag semantic_category to 6 columns.
— Applied tag privacy_category to 6 columns.

select * from snowflake.account_usage.tag_references
where tag_name = ‘PRIVACY_CATEGORY’
and tag_value = ‘IDENTIFIER’;

select * from snowflake.account_usage.tag_references WHERE OBJECT_NAME like ‘%PERSONAL_DATA%’;

CREATE VIEW view_PERSONAL_DATA as select * from DATA_CLASSIFICATION.DC.PERSONAL_DATA;
DROP VIEW view_PERSONAL_DATA;

--

--

John Thuma

Experienced Data and Analytics guru. 30 years of hands-on keyboard experience. Love hiking, writing, reading, and constant learning. All content is my opinion.