Document processing functions - SnowPro Gen AI C02 study notes
Domain 4 covers document parsing pipelines built on Snowflake's processing functions, NOT the legacy Document AI extract-from-template product (deprecated).
Two functions, same family
| Function | Status |
|---|---|
| SNOWFLAKE.CORTEX.PARSE_DOCUMENT | Legacy. Deprecated by end of 2026. Still on exam. |
| AI_PARSE_DOCUMENT | Modern. Recommended for new work. |
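Signature (PARSE_DOCUMENT; AI_PARSE_DOCUMENT is similar but takes a single FILE object, for example TO_FILE('@<stage>', '<path>'), in place of the separate stage and path arguments):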
```sql
SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@<stage>', '<path>', [<options>])
```
Modes
| Mode | Output |
|---|---|
| OCR (default) | Plain text only, no structure |
| LAYOUT | Text + structural content including tables + reading order |
```sql
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
  @my_stage,
  'report.pdf',
  {'mode': 'LAYOUT'}
);
```
Supported formats
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Image files (PNG, JPG, TIFF, etc.)
Options
| Option | Description |
|---|---|
| mode | 'OCR' or 'LAYOUT' |
| page_split | Boolean. If TRUE, output splits by page (PDF/PPTX/DOCX only) |
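The two options interact with the file type, so it can help to check them before issuing a query. Below is a minimal Python sketch of a client-side helper that builds the options object and rejects the combinations these notes rule out. The helper name `build_options` and the idea of validating on the client are assumptions for illustration, not part of any Snowflake library.

```python
from pathlib import PurePosixPath

# File types the notes list as supporting page_split.
PAGE_SPLIT_EXTENSIONS = {".pdf", ".pptx", ".docx"}
VALID_MODES = {"OCR", "LAYOUT"}


def build_options(path: str, mode: str = "OCR", page_split: bool = False) -> dict:
    """Return an options dict for one staged file.

    Hypothetical helper: it only mirrors the rules in the table above
    (mode is 'OCR' or 'LAYOUT'; page_split applies to PDF/PPTX/DOCX only).
    """
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    ext = PurePosixPath(path).suffix.lower()
    if page_split and ext not in PAGE_SPLIT_EXTENSIONS:
        raise ValueError(f"page_split is only supported for PDF/PPTX/DOCX, got {ext!r}")

    options = {"mode": mode}
    if page_split:
        options["page_split"] = True
    return options
```

Calling `build_options("report.pdf", "LAYOUT", page_split=True)` returns a dict with the `mode` and `page_split` keys, while a `.png` path with `page_split=True` raises `ValueError`.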
Output shape
page_split = FALSE (default):
```json
{
  "content": "Full extracted text...",
  "metadata": {...},
  "errorInformation": {...}
}
```
page_split = TRUE:
```json
{
  "pages": [
    {"index": 0, "content": "Page 1 text..."},
    {"index": 1, "content": "Page 2 text..."}
  ],
  "metadata": {...}
}
```
Returns a JSON string; wrap it with PARSE_JSON() to manipulate it as a SQL object.
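When the result is consumed outside SQL, the two shapes need slightly different handling. Below is a minimal Python sketch that normalizes either shape into a list of page records. It assumes only the keys shown above (`content`, `pages`, `index`); the function name `pages_from_result` is hypothetical.

```python
import json


def pages_from_result(raw: str) -> list[dict]:
    """Normalize either output shape into a list of {"index", "content"} dicts."""
    doc = json.loads(raw)
    if "pages" in doc:
        # page_split = TRUE: the result already has one entry per page.
        return [{"index": p["index"], "content": p["content"]} for p in doc["pages"]]
    # page_split = FALSE: treat the whole document as a single page 0.
    return [{"index": 0, "content": doc.get("content", "")}]


single = '{"content": "Full extracted text...", "metadata": {}}'
split = (
    '{"pages": [{"index": 0, "content": "Page 1 text..."},'
    ' {"index": 1, "content": "Page 2 text..."}], "metadata": {}}'
)

print(pages_from_result(single))
print(pages_from_result(split))
```

Normalizing early means downstream steps such as chunking can use one code path whether or not page_split was set.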
Canonical pipeline (memorize end-to-end)
```sql
-- 1. Create stage, upload files
CREATE OR REPLACE STAGE docs
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE);

PUT file:///local/*.pdf @docs AUTO_COMPRESS = FALSE;
ALTER STAGE docs REFRESH;

-- 2. Parse each file
CREATE OR REPLACE TABLE parsed_docs AS
SELECT
  relative_path,
  PARSE_JSON(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    @docs,
    relative_path,
    {'mode': 'LAYOUT', 'page_split': TRUE}
  )) AS doc
FROM DIRECTORY(@docs);

-- 3. Flatten pages
CREATE OR REPLACE TABLE pages AS
SELECT
  relative_path,
  p.value:index::INT AS page_index,
  p.value:content::STRING AS page_text
FROM parsed_docs,
  LATERAL FLATTEN(input => doc:pages) p;

-- 4. Chunk text
CREATE OR REPLACE TABLE chunks AS
SELECT
  relative_path,
  page_index,
  c.value::STRING AS chunk
FROM pages,
  LATERAL FLATTEN(input =>
    SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
      page_text, 'markdown', 1500, 200)) c;

-- 5. Either embed manually OR create a Cortex Search Service
CREATE OR REPLACE CORTEX SEARCH SERVICE doc_search
  ON chunk
  ATTRIBUTES relative_path, page_index
  WAREHOUSE = wh_xs
  TARGET_LAG = '1 hour'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (SELECT chunk, relative_path, page_index FROM chunks);
```
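To build intuition for the `1500, 200` arguments in step 4, here is a minimal Python sketch of how a chunk size and an overlap interact. It is a plain sliding window over characters, not the recursive, separator-aware algorithm that SPLIT_TEXT_RECURSIVE_CHARACTER applies, so real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window after the first starts `overlap` characters before the
    previous window ended, so neighbouring chunks share that much text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each new chunk begins 1300 characters after the previous one. The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.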
PARSE_DOCUMENT vs AI_EXTRACT - common exam confusion
| Function | Input | Output | When |
|---|---|---|---|
| PARSE_DOCUMENT / AI_PARSE_DOCUMENT | File in a stage | Raw text + layout | First step of any doc pipeline |
| AI_EXTRACT | Text, image, or file ref + a schema | Structured object matching schema | Pull specific fields (invoice number, total) from already-loaded content |
You typically chain: PARSE_DOCUMENT → AI_EXTRACT with a schema if you need structured fields like invoice totals or contract dates.
Cost + governance
- Billed per page processed (LAYOUT mode > OCR mode)
- Same `CORTEX_USER` role grant required as other Cortex functions
- Cross-region inference applies if the document parsing model isn't local
Pitfalls
- `DIRECTORY = (ENABLE = TRUE)` required on the stage to use the `DIRECTORY()` table function
- Encryption type must be `SNOWFLAKE_SSE` for internal stages used with Cortex
- Forgetting to call `ALTER STAGE ... REFRESH` after PUT: the directory listing is stale
- Confusing the old Document AI product (extract-from-template UI flow, deprecated) with document processing functions (the modern API on the exam)
- Token budget: very large PDFs in LAYOUT mode can be expensive; use `page_split` and process incrementally