Settings & Flags Reference¶
This page provides a complete overview of all configuration options in LLM Extractinator.
It is organized in two parts:
- A quick summary table for fast scanning
- Detailed per‑flag descriptions for deeper understanding
1. CLI Flags Overview (Summary)¶
| Flag | Default | Description |
|---|---|---|
| `--task_id` | required | Selects which task JSON file to run. |
| `--run_name` | `"run"` | Name used in logs and output folders. |
| `--n_runs` | `1` | Number of times to repeat the task. |
| `--verbose` | `False` | Enables detailed logging. |
| `--overwrite` | `False` | Overwrites existing outputs if enabled. |
| `--seed` | `None` | Random seed for reproducibility. |
| `--model_name` | `"phi4"` | Model used via Ollama. |
| `--embedding_model` | `"nomic-embed-text"` | Embedding model for few‑shot selection. |
| `--temperature` | `0.0` | Sampling randomness. |
| `--top_k` | `None` | Top‑K sampling. |
| `--top_p` | `None` | Nucleus sampling. |
| `--num_predict` | `512` | Maximum generated tokens. |
| `--max_context_len` | `"max"` | Context length strategy. |
| `--reasoning_model` | `False` | Enables reasoning‑model mode. |
| `--num_examples` | `0` | Number of few‑shot examples. |
| `--chunk_size` | `None` | Chunk size for long inputs. |
| `--translate` | `False` | Translate input to English first. |
| `--output_dir` | `output/` | Output location. |
| `--log_dir` | `output/` | Log location. |
| `--data_dir` | `data/` | Input data directory. |
| `--task_dir` | `tasks/` | Task JSON directory. |
| `--example_dir` | `examples/` | Few‑shot example directory. |
| `--translation_dir` | `translations/` | Translation output directory. |
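For orientation, a typical run combines a handful of these flags. The sketch below assumes the CLI entry point is named `extractinator` (an assumption; substitute whatever command your installation exposes) and that a task file with ID 1 exists:

```bash
# Minimal sketch: run task 1 once, with a named run folder and verbose logging.
# The `extractinator` entry-point name is an assumption; adjust to your install.
extractinator --task_id 1 --run_name demo-run --verbose
```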
2. Detailed CLI Flag Descriptions¶
--task_id¶
Type: int
Default: required
Selects which task JSON file to run, based on its numeric prefix
(e.g., Task001_*.json → --task_id 1).
--run_name¶
Type: str
Default: "run"
Human‑friendly name used to structure log and output folders.
--n_runs¶
Type: int
Default: 1
Runs the same extraction multiple times—useful for testing stability or variance.
--verbose¶
Type: bool
Default: False
Prints additional diagnostic information during execution.
--overwrite¶
Type: bool
Default: False
If enabled, existing run results in the output folder will be overwritten. If disabled, the tool will skip processing if output already exists.
--seed¶
Type: int
Default: None
Random seed for reproducible behavior where possible.
--model_name¶
Type: str
Default: "phi4"
Name of the Ollama model to use (e.g., "phi4", "llama3.3", "deepseek-r1:8b"). See Ollama models for available options.
--embedding_model¶
Type: str
Default: "nomic-embed-text"
Name of the embedding model to use for few-shot example selection via semantic similarity. Only used when --num_examples > 0. See Ollama models for available embedding models (e.g., "mxbai-embed-large", "nomic-embed-text").
--temperature¶
Type: float
Default: 0.0
Controls randomness in generation:
- 0.0 = deterministic
- Higher values = more creative output
--top_k¶
Type: int
Default: None
Restricts sampling to the top‑K highest‑probability tokens.
--top_p¶
Type: float
Default: None
Nucleus sampling: sample from the smallest token set whose cumulative probability ≥ p.
--num_predict¶
Type: int
Default: 512
Maximum number of tokens to generate for the model’s output.
--max_context_len¶
Type: str or int
Default: "max"
Controls context length policy:
- "max" — use maximum available length
- "split" — split dataset in two by input size
- integer — explicitly set context length
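Illustrative invocations for the three policies (the explicit length is a hypothetical value, and the entry-point name is assumed as above):

```bash
extractinator --task_id 1 --max_context_len max    # use the maximum available length
extractinator --task_id 1 --max_context_len split  # split the dataset in two by input size
extractinator --task_id 1 --max_context_len 8192   # hypothetical explicit context length
```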
--reasoning_model¶
Type: bool
Default: False
Enable this for models such as DeepSeek‑R1 and Qwen3 that emit chain‑of‑thought before their JSON output; with the flag set, the model is allowed to produce reasoning steps before the final answer is extracted.
--num_examples¶
Type: int
Default: 0
Number of few‑shot examples to include in the prompt.
Requires setting Example_Path inside the task JSON file.
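For example, a few‑shot run might look like the following sketch (illustrative values; the task's JSON must set Example_Path, and the entry-point name is assumed as above):

```bash
# Select 3 few-shot examples by semantic similarity using the embedding model
extractinator --task_id 1 --num_examples 3 --embedding_model nomic-embed-text
```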
--chunk_size¶
Type: int
Default: None
Splits the dataset into chunks of this many documents for processing. Useful for very large datasets as the chunks are saved incrementally. If a crash occurs, only the current chunk needs to be reprocessed.
--translate¶
Type: bool
Default: False
If enabled, input is translated to English before extraction, which adds an extra model step. Not recommended!
--output_dir¶
Type: Path
Default: output/
Where extracted results are written.
--log_dir¶
Type: Path
Default: output/
Location for logs; defaults to the output directory.
--data_dir¶
Type: Path
Default: data/
Directory containing datasets referenced by the Data_Path in task JSON files.
--task_dir¶
Type: Path
Default: tasks/
Folder containing task JSON files.
--example_dir¶
Type: Path
Default: examples/
Directory referenced by Example_Path in task JSON files.
--translation_dir¶
Type: Path
Default: translations/
Folder where translated versions of inputs are saved when using --translate.
3. Task Configuration Files¶
Task files define what to extract and how to parse it.
Files follow the pattern `TaskXXX_name.json`, where the numeric prefix is the task ID.
Example: `Task001_products.json` → task ID 1.
Required Fields¶
Description¶
Type: str
Short human‑readable explanation of the task.
Data_Path¶
Type: str
Relative path (from data_dir) to the dataset file.
Input_Field¶
Type: str
Column name or JSON key containing the input text that extraction should run on.
Parser_Format¶
Type: str
Filename of the parser module inside tasks/parsers/ that defines a Pydantic OutputParser model.
OutputParser is the schema Extractinator validates the LLM output against.
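As an illustration, a parser module could look like the minimal sketch below. The filename and field names are hypothetical; per this page, the Pydantic OutputParser class is what Extractinator validates the LLM output against:

```python
# tasks/parsers/products.py -- hypothetical parser module
from typing import Optional

from pydantic import BaseModel


class OutputParser(BaseModel):
    """Schema that the LLM output is validated against."""

    product_name: str              # hypothetical field
    price: Optional[float] = None  # hypothetical optional field
```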
Optional Fields¶
Example_Path¶
Type: str
Relative path (from example_dir) to few‑shot examples.
Required only if using --num_examples > 0.
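Putting the fields together, a task file such as `Task001_products.json` might look like this sketch (all values are illustrative):

```json
{
  "Description": "Extract product details from free-text listings",
  "Data_Path": "products.json",
  "Input_Field": "text",
  "Parser_Format": "products.py",
  "Example_Path": "products_examples.json"
}
```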
4. Additional Commands¶
build-parser¶
Launches a Streamlit tool for interactively building Pydantic parser models.
launch-extractinator¶
Opens the Streamlit GUI for assembling datasets, parsers, and tasks.
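Assuming both are exposed as standalone console commands (as their names suggest; check your installation), they are run directly:

```bash
build-parser            # interactively build a Pydantic parser model
launch-extractinator    # open the Streamlit GUI
```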