Settings & Flags Reference

This page provides a complete overview of all configuration options in LLM Extractinator.

It is organized as follows:

  1. A quick summary table for fast scanning
  2. Detailed per‑flag descriptions for deeper understanding

1. CLI Flags Overview (Summary)

| Flag | Default | Description |
| --- | --- | --- |
| --task_id | required | Selects which task JSON file to run. |
| --run_name | "run" | Name used in logs and output folders. |
| --n_runs | 1 | Number of times to repeat the task. |
| --verbose | False | Enables detailed logging. |
| --overwrite | False | Overwrites existing outputs if enabled. |
| --seed | None | Random seed for reproducibility. |
| --model_name | "phi4" | Model used via Ollama. |
| --embedding_model | "nomic-embed-text" | Embedding model for few‑shot selection. |
| --temperature | 0.0 | Sampling randomness. |
| --top_k | None | Top‑K sampling. |
| --top_p | None | Nucleus sampling. |
| --num_predict | 512 | Maximum generated tokens. |
| --max_context_len | "max" | Context length strategy. |
| --reasoning_model | False | Enables reasoning‑model mode. |
| --num_examples | 0 | Number of few‑shot examples. |
| --chunk_size | None | Chunk size for long inputs. |
| --translate | False | Translate input to English first. |
| --output_dir | output/ | Output location. |
| --log_dir | output/ | Log location. |
| --data_dir | data/ | Input data directory. |
| --task_dir | tasks/ | Task JSON directory. |
| --example_dir | examples/ | Few‑shot example directory. |
| --translation_dir | translations/ | Translation output directory. |
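
For orientation, a typical invocation might combine several of these flags. This is a minimal sketch: the entry-point name extractinator and the bare-switch form of the boolean flags are assumptions; check your installation.

```bash
# Hypothetical invocation (entry-point name and switch syntax assumed):
# run task 1, name the run, log verbosely, and overwrite prior output.
extractinator --task_id 1 --run_name baseline --verbose --overwrite
```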

2. Detailed CLI Flag Descriptions

--task_id

Type: int
Default: required
Selects which task JSON file to run, based on its numeric prefix
(e.g., Task001_*.json is selected with --task_id 1).


--run_name

Type: str
Default: "run"
Human‑friendly name used to structure log and output folders.


--n_runs

Type: int
Default: 1
Runs the same extraction multiple times—useful for testing stability or variance.


--verbose

Type: bool
Default: False
Prints additional diagnostic information during execution.


--overwrite

Type: bool
Default: False
If enabled, existing run results in the output folder will be overwritten. If disabled, the tool will skip processing if output already exists.


--seed

Type: int
Default: None
Random seed for reproducible behavior where possible.
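
Together, --n_runs and --seed support simple stability checks. A hedged sketch (entry-point name assumed, as above):

```bash
# Repeat the same extraction five times with a fixed seed to gauge
# output variance across runs (illustrative flag values).
extractinator --task_id 1 --n_runs 5 --seed 42 --run_name stability_check
```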


--model_name

Type: str
Default: "phi4"
Name of the Ollama model to use (e.g., "phi4", "llama3.3", "deepseek-r1:8b"). See Ollama models for available options.


--embedding_model

Type: str
Default: "nomic-embed-text"
Name of the embedding model to use for few-shot example selection via semantic similarity. Only used when --num_examples > 0. See Ollama models for available embedding models (e.g., "mxbai-embed-large", "nomic-embed-text").


--temperature

Type: float
Default: 0.0
Controls randomness in generation:
- 0.0 = deterministic
- Higher values = more creative output


--top_k

Type: int
Default: None
Restricts sampling to the top‑K highest‑probability tokens.


--top_p

Type: float
Default: None
Nucleus sampling: sample from the smallest token set whose cumulative probability ≥ p.
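
The three sampling flags can be combined. An illustrative sketch (values and entry-point name are assumptions, not recommendations):

```bash
# Deterministic extraction (the default behavior):
extractinator --task_id 1 --temperature 0.0

# Looser sampling: temperature plus top-k and nucleus filtering.
extractinator --task_id 1 --temperature 0.8 --top_k 40 --top_p 0.9
```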


--num_predict

Type: int
Default: 512
Maximum number of tokens to generate for the model’s output.


--max_context_len

Type: str or int
Default: "max"
Controls the context length policy:
- "max": use the maximum available context length
- "split": split the dataset in two by input size
- integer: explicitly set the context length
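
One illustrative invocation per policy (entry-point name assumed):

```bash
extractinator --task_id 1 --max_context_len max    # maximum available length
extractinator --task_id 1 --max_context_len split  # split dataset by input size
extractinator --task_id 1 --max_context_len 8192   # explicit context length
```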


--reasoning_model

Type: bool
Default: False
Enable this for models such as DeepSeek‑R1 and Qwen3 that output chain‑of‑thought before the JSON answer; with this flag set, the model may emit its reasoning steps before the final answer is extracted.
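
For example, pairing the flag with a reasoning model (a sketch; model tag and switch syntax assumed as above):

```bash
# DeepSeek-R1 emits chain-of-thought before the JSON answer, so
# reasoning-model mode is enabled alongside it.
extractinator --task_id 1 --model_name deepseek-r1:8b --reasoning_model
```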


--num_examples

Type: int
Default: 0
Number of few‑shot examples to include in the prompt.
Requires setting Example_Path inside the task JSON file.
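
A hedged sketch combining few-shot selection with an explicit embedding model (assumes the task JSON sets Example_Path; see Section 3):

```bash
# Select the 3 most similar examples via embeddings (illustrative values).
extractinator --task_id 1 --num_examples 3 --embedding_model nomic-embed-text
```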


--chunk_size

Type: int
Default: None
Splits the dataset into chunks of this many documents for processing. Useful for very large datasets as the chunks are saved incrementally. If a crash occurs, only the current chunk needs to be reprocessed.
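
For instance (a sketch with assumed values):

```bash
# Save results every 100 documents so a crash only costs the current chunk.
extractinator --task_id 1 --chunk_size 100
```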


--translate

Type: bool
Default: False
If enabled, the input is translated to English before extraction, which adds an extra model step. Not recommended!


--output_dir

Type: Path
Default: output/
Where extracted results are written.


--log_dir

Type: Path
Default: output/
Location for logs; defaults to the output directory.


--data_dir

Type: Path
Default: data/
Directory containing datasets referenced by the Data_Path in task JSON files.


--task_dir

Type: Path
Default: tasks/
Folder containing task JSON files.


--example_dir

Type: Path
Default: examples/
Directory referenced by Example_Path in task JSON files.


--translation_dir

Type: Path
Default: translations/
Folder where translated versions of inputs are saved when using --translate.
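
The directory flags can be combined to relocate the whole layout. A sketch with hypothetical paths:

```bash
extractinator --task_id 1 \
  --data_dir /data/project \
  --task_dir /data/project/tasks \
  --output_dir /data/project/output \
  --log_dir /data/project/logs
```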


3. Task Configuration Files

Task files define what to extract and how to parse it.
Files follow the pattern:

TaskXXX_name.json

Example:
Task001_products.json → task ID 1.


Required Fields

Description

Type: str
Short human‑readable explanation of the task.


Data_Path

Type: str
Relative path (from data_dir) to the dataset file.


Input_Field

Type: str
Column name or JSON key containing the input text to extract from.


Parser_Format

Type: str
Filename of the parser module inside tasks/parsers/ that defines a Pydantic OutputParser model.

OutputParser is the schema Extractinator validates the LLM output against.
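
A minimal sketch of such a parser module, assuming only that it exposes a Pydantic model named OutputParser; the file name and fields are illustrative:

```python
# tasks/parsers/products.py (hypothetical file name)
from typing import Optional

from pydantic import BaseModel, Field

class OutputParser(BaseModel):
    """Schema the LLM output is validated against (example fields)."""

    product_name: str = Field(description="Name of the product in the text")
    price: Optional[float] = Field(default=None, description="Listed price, if present")
```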


Optional Fields

Example_Path

Type: str
Relative path (from example_dir) to few‑shot examples.
Required only if using --num_examples > 0.
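
Putting the fields together, a hypothetical Task001_products.json might look like this (all values illustrative):

```json
{
  "Description": "Extract product names and prices from listings",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "products.py",
  "Example_Path": "products_examples.json"
}
```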


4. Additional Commands

build-parser

Launches a Streamlit tool for interactively building Pydantic parser models.

launch-extractinator

Opens the Streamlit GUI for assembling datasets, parsers, and tasks.