
study document processing

2026-05-16·study-notes·source: docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex
certification · snowflake · document-processing · parse-document

Document processing functions - SnowPro Gen AI C02 study notes

Domain 4 covers document parsing pipelines built on Snowflake's processing functions, NOT the legacy Document AI extract-from-template product (deprecated).

Two functions, same family

Function | Status
SNOWFLAKE.CORTEX.PARSE_DOCUMENT | Legacy. Deprecated by end of 2026. Still on exam.
AI_PARSE_DOCUMENT | Modern. Recommended for new work.

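Signature (PARSE_DOCUMENT). AI_PARSE_DOCUMENT is similar, but takes a single FILE argument such as TO_FILE('@<stage>', '<path>') in place of the separate stage and path: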

SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@<stage>', '<path>', [<options>])

Modes

Mode | Output
OCR (default) | Plain text only, no structure
LAYOUT | Text + structural content (including tables) + reading order

SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@my_stage, 'report.pdf',
                                        {'mode': 'LAYOUT'});

Supported formats

Options

Option | Description
mode | 'OCR' or 'LAYOUT'
page_split | Boolean. If TRUE, output splits by page (PDF/PPTX/DOCX only)

Output shape

page_split = FALSE (default):

{
  "content": "Full extracted text...",
  "metadata": {...},
  "errorInformation": {...}
}

page_split = TRUE:

{
  "pages": [
    {"index": 0, "content": "Page 1 text..."},
    {"index": 1, "content": "Page 2 text..."}
  ],
  "metadata": {...}
}

Returns a JSON string; wrap it in PARSE_JSON() to manipulate it as a SQL object.
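
To make the two output shapes easier to remember, here is a minimal Python sketch that normalizes either one into (page_index, text) pairs. It only assumes the JSON layouts shown above; the function name pages_from_result and the choice to treat unsplit output as page 0 are illustrative, not part of any Snowflake API.

```python
import json


def pages_from_result(raw: str) -> list[tuple[int, str]]:
    """Return (page_index, text) pairs for either output shape."""
    doc = json.loads(raw)
    if "pages" in doc:
        # page_split = TRUE: one entry per page, index is 0-based
        return [(p["index"], p["content"]) for p in doc["pages"]]
    # page_split = FALSE: a single content string (treated as page 0 here,
    # which is an assumption of this sketch)
    return [(0, doc.get("content", ""))]


# Sample payloads mirroring the shapes in these notes
whole = '{"content": "Full extracted text", "metadata": {}}'
split = (
    '{"pages": [{"index": 0, "content": "Page 1 text"},'
    ' {"index": 1, "content": "Page 2 text"}], "metadata": {}}'
)

print(pages_from_result(whole))
print(pages_from_result(split))
```

The point to retain for the exam: with page_split = TRUE the text lives under pages[].content, not under a top-level content key, which is why step 3 of the pipeline flattens doc:pages.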

Canonical pipeline (memorize end-to-end)

-- 1. Create stage, upload files
CREATE OR REPLACE STAGE docs
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE);
PUT file:///local/*.pdf @docs AUTO_COMPRESS = FALSE;
ALTER STAGE docs REFRESH;

-- 2. Parse each file
CREATE OR REPLACE TABLE parsed_docs AS
SELECT
  relative_path,
  PARSE_JSON(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, relative_path,
                                              {'mode':'LAYOUT',
                                               'page_split': TRUE})) AS doc
FROM DIRECTORY(@docs);

-- 3. Flatten pages
CREATE OR REPLACE TABLE pages AS
SELECT
  relative_path,
  p.value:index::INT AS page_index,
  p.value:content::STRING AS page_text
FROM parsed_docs, LATERAL FLATTEN(input => doc:pages) p;

-- 4. Chunk text
CREATE OR REPLACE TABLE chunks AS
SELECT
  relative_path, page_index,
  c.value::STRING AS chunk
FROM pages,
     LATERAL FLATTEN(input =>
       SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
         page_text, 'markdown', 1500, 200)) c;

-- 5. Either embed manually OR create a Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE doc_search
  ON chunk
  ATTRIBUTES relative_path, page_index
  WAREHOUSE = wh_xs
  TARGET_LAG = '1 hour'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (SELECT chunk, relative_path, page_index FROM chunks);
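
The 1500 and 200 arguments in step 4 are the chunk size and the overlap, both in characters. The Python sketch below illustrates how those two numbers interact, using a plain fixed-size sliding window. This is a deliberate simplification: SPLIT_TEXT_RECURSIVE_CHARACTER splits on separators recursively, so its real chunk boundaries will differ. The function name window_chunks is hypothetical.

```python
def window_chunks(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Fixed-size sliding window; NOT Snowflake's recursive splitting."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this far after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# 4000 characters of non-constant text so the overlap is checkable
sample = "".join(chr(97 + i % 26) for i in range(4000))
chunks = window_chunks(sample)
sizes = [len(c) for c in chunks]
print(sizes)
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlap improves recall at the cost of more chunks to embed.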

PARSE_DOCUMENT vs AI_EXTRACT - common exam confusion

Function | Input | Output | When
PARSE_DOCUMENT / AI_PARSE_DOCUMENT | File in a stage | Raw text + layout | First step of any doc pipeline
AI_EXTRACT | Text, image, or file ref + a schema | Structured object matching schema | Pull specific fields (invoice number, total) from already-loaded content

You typically chain: PARSE_DOCUMENT → AI_EXTRACT with a schema if you need structured fields like invoice totals or contract dates.

Cost + governance

Pitfalls