Document processing functions - SnowPro Gen AI C02 study notes

Domain 4 covers document parsing pipelines built on Snowflake's processing functions, NOT the legacy Document AI extract-from-template product (deprecated).

Two functions, same family

Function	Status
`SNOWFLAKE.CORTEX.PARSE_DOCUMENT`	Legacy. Deprecated by end of 2026. Still on exam.
`AI_PARSE_DOCUMENT`	Modern. Recommended for new work.

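Signature (PARSE_DOCUMENT; AI_PARSE_DOCUMENT accepts a similar options object but takes a FILE value, e.g. from TO_FILE, instead of separate stage and path arguments):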

SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@<stage>', '<path>', [<options>])
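
For comparison, a minimal sketch of the equivalent call with the newer function, assuming the file is wrapped in a FILE value via TO_FILE. The stage name my_stage and the file report.pdf are placeholders, not names from the exam guide.

```sql
-- AI_PARSE_DOCUMENT takes a FILE value rather than separate stage/path arguments.
-- my_stage and report.pdf are hypothetical names used for illustration.
SELECT AI_PARSE_DOCUMENT(
         TO_FILE('@my_stage', 'report.pdf'),
         {'mode': 'LAYOUT'}
       ) AS parsed;
```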

Modes

Mode	Output
`OCR` (default)	Plain text only, no structure
`LAYOUT`	Text + structural content including tables + reading order

SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@my_stage, 'report.pdf',
                                        {'mode': 'LAYOUT'});

Supported formats

PDF
PowerPoint (.pptx)
Word (.docx)
Image files (PNG, JPG, TIFF, etc.)

Options

Option	Description
`mode`	`'OCR'` or `'LAYOUT'`
`page_split`	Boolean. If `TRUE`, output splits by page (PDF/PPTX/DOCX only)

Output shape

page_split = FALSE (default):

{
  "content": "Full extracted text...",
  "metadata": {...},
  "errorInformation": {...}
}

page_split = TRUE:

{
  "pages": [
    {"index": 0, "content": "Page 1 text..."},
    {"index": 1, "content": "Page 2 text..."}
  ],
  "metadata": {...}
}

Returns a JSON string; wrap it with PARSE_JSON() to get a VARIANT you can traverse with path notation and FLATTEN.
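
To practice navigating both shapes without a stage or a Cortex call, the sketch below parses literal JSON strings that mirror the two examples above. The sample values are the placeholders from these notes, not real function output, and the empty metadata object is an assumption made only to keep the literals valid JSON.

```sql
-- Literal samples mirroring the two output shapes; no stage or Cortex call needed.
WITH sample AS (
  SELECT
    PARSE_JSON('{"content": "Full extracted text...", "metadata": {}}') AS whole_doc,
    PARSE_JSON('{"pages": [{"index": 0, "content": "Page 1 text..."},
                           {"index": 1, "content": "Page 2 text..."}],
                 "metadata": {}}') AS split_doc
)
SELECT
  whole_doc:content::STRING          AS full_text,       -- page_split = FALSE shape
  ARRAY_SIZE(split_doc:pages)        AS page_count,      -- page_split = TRUE shape
  split_doc:pages[0]:content::STRING AS first_page_text  -- first element of the pages array
FROM sample;
```

The same path expressions apply to real output once it has been passed through PARSE_JSON().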

Canonical pipeline (memorize end-to-end)

-- 1. Create stage, upload files
CREATE OR REPLACE STAGE docs
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE);
PUT file:///local/*.pdf @docs AUTO_COMPRESS = FALSE;
ALTER STAGE docs REFRESH;

-- 2. Parse each file
CREATE OR REPLACE TABLE parsed_docs AS
SELECT
  relative_path,
  PARSE_JSON(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, relative_path,
                                              {'mode':'LAYOUT',
                                               'page_split': TRUE})) AS doc
FROM DIRECTORY(@docs);

-- 3. Flatten pages
CREATE OR REPLACE TABLE pages AS
SELECT
  relative_path,
  p.value:index::INT AS page_index,
  p.value:content::STRING AS page_text
FROM parsed_docs, LATERAL FLATTEN(input => doc:pages) p;

-- 4. Chunk text
CREATE OR REPLACE TABLE chunks AS
SELECT
  relative_path, page_index,
  c.value::STRING AS chunk
FROM pages,
     LATERAL FLATTEN(input =>
       SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
         page_text, 'markdown', 1500, 200)) c;

-- 5. Either embed manually OR create a Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE doc_search
  ON chunk
  ATTRIBUTES relative_path, page_index
  WAREHOUSE = wh_xs
  TARGET_LAG = '1 hour'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (SELECT chunk, relative_path, page_index FROM chunks);
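
Step 5 mentions two options. As a rough sketch of both, the first statement below embeds chunks manually with EMBED_TEXT_1024, and the second smoke-tests the search service with SEARCH_PREVIEW. The table name chunk_embeddings and the query text are hypothetical, and the service name may need to be fully qualified (database.schema.service) depending on session context.

```sql
-- Manual alternative to the search service: store one vector per chunk.
-- chunk_embeddings is a hypothetical table name.
CREATE OR REPLACE TABLE chunk_embeddings AS
SELECT
  relative_path,
  page_index,
  chunk,
  SNOWFLAKE.CORTEX.EMBED_TEXT_1024('snowflake-arctic-embed-l-v2.0', chunk) AS embedding
FROM chunks;

-- Quick check of the search service from SQL; the query text is made up.
SELECT PARSE_JSON(
         SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
           'doc_search',
           '{"query": "termination clause", "columns": ["chunk", "relative_path", "page_index"], "limit": 3}'
         )
       )['results'] AS results;
```

With the search service, embedding and index refresh are managed for you; the manual table is only needed if you want to run your own vector similarity queries.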

PARSE_DOCUMENT vs AI_EXTRACT - common exam confusion

Function	Input	Output	When
`PARSE_DOCUMENT` / `AI_PARSE_DOCUMENT`	File in a stage	Raw text + layout	First step of any doc pipeline
`AI_EXTRACT`	Text, image, or file ref + a schema	Structured object matching schema	Pull specific fields (invoice number, total) from already-loaded content

You typically chain: PARSE_DOCUMENT → AI_EXTRACT with a schema if you need structured fields like invoice totals or contract dates.
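
A minimal sketch of the second half of that chain, assuming the pages table from the canonical pipeline already exists. The field names and questions in responseFormat are illustrative only.

```sql
-- Pull two fields from text that PARSE_DOCUMENT already extracted.
-- invoice_number and total are hypothetical field names.
SELECT
  relative_path,
  page_index,
  AI_EXTRACT(
    text => page_text,
    responseFormat => {
      'invoice_number': 'What is the invoice number?',
      'total': 'What is the invoice total?'
    }
  ) AS extracted
FROM pages;
```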

Cost + governance

Billed per page processed; LAYOUT mode costs more per page than OCR mode
Requires the SNOWFLAKE.CORTEX_USER database role, the same grant as other Cortex functions
Cross-region inference applies if the document parsing model isn't local
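
The governance points above roughly translate to the statements below. This is a sketch: doc_pipeline_role is a hypothetical role name, and 'ANY_REGION' is just one of the allowed values for the cross-region parameter.

```sql
-- Run as a role allowed to manage account-level settings and grants.
USE ROLE ACCOUNTADMIN;

-- Give a custom role access to Cortex functions (doc_pipeline_role is hypothetical).
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE doc_pipeline_role;

-- Allow requests to be served from another region when the model is not local.
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';
```

By default CORTEX_USER is granted to PUBLIC, so the explicit grant matters mainly in accounts that have revoked that default.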

Pitfalls

DIRECTORY = (ENABLE = TRUE) is required on the stage to use the DIRECTORY() table function
Encryption type must be SNOWFLAKE_SSE for internal stages used with Cortex
Forgetting to run ALTER STAGE ... REFRESH after PUT leaves the directory listing stale
Confusing the old Document AI product (extract-from-template UI flow, deprecated) with document processing functions (the modern API on the exam)
Cost control: billing is per page, so very large PDFs in LAYOUT mode can be expensive; use page_split and process incrementally
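
One way to sketch the refresh and incremental-processing advice, assuming the docs stage and the parsed_docs table from the canonical pipeline: refresh the directory table, check what it sees, then parse only files that have no row yet.

```sql
-- Make sure the directory table reflects the latest PUT.
ALTER STAGE docs REFRESH;
SELECT relative_path, size, last_modified
FROM DIRECTORY(@docs);

-- Parse only files not already present in parsed_docs,
-- so re-runs do not pay again for pages that were already processed.
INSERT INTO parsed_docs (relative_path, doc)
SELECT
  d.relative_path,
  PARSE_JSON(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, d.relative_path,
                                              {'mode': 'LAYOUT',
                                               'page_split': TRUE}))
FROM DIRECTORY(@docs) d
WHERE NOT EXISTS (
  SELECT 1 FROM parsed_docs p WHERE p.relative_path = d.relative_path
);
```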