Extractor
LLM-powered data extractor for one-time extraction, unstructured data, and classic code-based scrapers.
extract
Extract structured data from a URL:
Endpoint: POST /v1/extractor/extract (alias: POST /v1/extract)
curl https://api.parsera.org/v1/extractor/extract \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "prompt": "Extract news metadata",
    "attributes": [
      {
        "name": "title",
        "description": "News title"
      },
      {
        "name": "points",
        "description": "Number of points"
      }
    ],
    "proxy_country": "Germany"
  }'
Parameters:
A minimal payload requires url and either prompt or attributes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | [] | A list of attribute objects with name and description fields to extract from the webpage. You can also specify Output Types |
| mode | string | standard | Mode of the extractor: standard or precision |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
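The payload rules above can be sketched as a small builder. This is a minimal illustration, not part of any official client: the function name `build_extract_payload` is hypothetical, the defaults mirror the parameter table, and cookies are passed through unchanged because their format is described on the Cookies page rather than here.

```python
from typing import Optional


def build_extract_payload(
    url: str,
    prompt: str = "",
    attributes: Optional[list] = None,
    mode: str = "standard",
    proxy_country: str = "UnitedStates",
    cookies: Optional[list] = None,
) -> dict:
    """Build a JSON-serializable body for POST /v1/extractor/extract.

    Hypothetical helper: it only enforces the rules stated in the table
    (url is required, plus either prompt or attributes).
    """
    attributes = attributes or []
    if not url:
        raise ValueError("url is required")
    if not prompt and not attributes:
        raise ValueError("provide either prompt or attributes")
    if mode not in ("standard", "precision"):
        raise ValueError("mode must be 'standard' or 'precision'")

    payload = {"url": url, "mode": mode, "proxy_country": proxy_country}
    # Optional fields are omitted when empty so the server defaults apply.
    if prompt:
        payload["prompt"] = prompt
    if attributes:
        payload["attributes"] = attributes
    if cookies:
        payload["cookies"] = cookies  # format assumed to follow the Cookies page
    return payload
```

Serializing the returned dictionary with `json.dumps` gives a body equivalent to the `--data` argument in the curl example.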
It's recommended to set the proxy_country parameter to a specific country since a page could be unavailable from some locations.
If some data is missing, you can retry with precision mode. See Precision Mode for details.
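The retry advice can be sketched as a fallback wrapper. The response format is not described in this section, so the sketch makes no assumption about it: the caller supplies both `send` (a function that posts a payload and returns the parsed response) and `is_complete` (a function that decides whether data is missing). Both names, and `extract_with_precision_fallback` itself, are hypothetical.

```python
from typing import Any, Callable


def extract_with_precision_fallback(
    payload: dict,
    send: Callable[[dict], Any],
    is_complete: Callable[[Any], bool],
) -> Any:
    """Try standard mode first; retry once in precision mode if data is missing.

    `send` and `is_complete` are caller-supplied because this page does not
    document the response shape.
    """
    result = send({**payload, "mode": "standard"})
    if is_complete(result):
        return result
    # Same payload, only the mode changes for the retry.
    return send({**payload, "mode": "precision"})
```

Keeping the transport and the completeness check as parameters lets the same wrapper work with any HTTP client and any definition of "missing data".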
parse
Parse data from HTML or text content you already have, instead of fetching from a URL:
Endpoint: POST /v1/extractor/parse (alias: POST /v1/parse)
curl https://api.parsera.org/v1/extractor/parse \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "content": "<HTML_OR_TEXT_HERE>",
    "prompt": "Extract news metadata",
    "attributes": [
      {
        "name": "title",
        "description": "News title"
      },
      {
        "name": "points",
        "description": "Number of points"
      }
    ]
  }'
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| content | string | - | Raw HTML or text content to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | - | A list of attribute objects with name and description fields to extract from the content. You can also specify Output Types |
| mode | string | standard | Mode of the extractor: standard or precision |
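Because content is a JSON string, raw HTML has to be escaped (quotes, newlines) before it goes into the request body, which is awkward to do by hand in a shell command. A minimal sketch using only the Python standard library follows; `build_parse_request` is a hypothetical helper, and it builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional

PARSE_URL = "https://api.parsera.org/v1/extractor/parse"


def build_parse_request(
    content: str,
    api_key: str,
    prompt: str = "",
    attributes: Optional[list] = None,
    mode: str = "standard",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the parse endpoint."""
    body = {"content": content, "mode": mode}
    if prompt:
        body["prompt"] = prompt
    if attributes:
        body["attributes"] = attributes
    # json.dumps escapes quotes and newlines inside the HTML string.
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        PARSE_URL,
        data=data,
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the call.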
extract_markdown
Get clean markdown from a URL:
Endpoint: POST /v1/extractor/extract_markdown (alias: POST /v1/extract_markdown)
curl https://api.parsera.org/v1/extractor/extract_markdown \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "proxy_country": "UnitedStates"
  }'
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
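The same call can be sketched in Python with the standard library. The helper names `build_markdown_request` and `send` are hypothetical, and since the response format is not documented in this section, `send` returns the raw response body as text instead of assuming any field names.

```python
import json
import urllib.request
from typing import Optional

MARKDOWN_URL = "https://api.parsera.org/v1/extractor/extract_markdown"


def build_markdown_request(
    url: str,
    api_key: str,
    proxy_country: str = "UnitedStates",
    cookies: Optional[list] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for extract_markdown."""
    body = {"url": url, "proxy_country": proxy_country}
    if cookies:
        body["cookies"] = cookies  # format assumed to follow the Cookies page
    return urllib.request.Request(
        MARKDOWN_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call would look like `send(build_markdown_request("https://news.ycombinator.com/", "<YOUR_API_KEY>"))`.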
Manage Scrapers
The Extractor API also lets you create and manage classic code-based scrapers. See Manage Scrapers for the full CRUD reference (create, generate, get, delete).
More Features
- Manage Scrapers — Create and manage classic scrapers
- Code Mode — Code generation details and benefits
- Precision Mode — Extract data hidden in HTML tags
- Specify Output Types
- Setting Proxy
- Setting Cookies
