Extractor
LLM-powered data extractor for one-time extraction, unstructured data, and extractor scrapers.
extract
Extract structured data from a URL:
Endpoint: POST /v1/extractor/extract (alias: POST /v1/extract)
curl https://api.parsera.org/v1/extractor/extract \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
"url": "https://news.ycombinator.com/",
"prompt": "Extract news metadata",
"attributes": [
{
"name": "title",
"description": "News title"
},
{
"name": "points",
"description": "Number of points"
}
],
"proxy_country": "Germany"
}'
Parameters:
A minimal payload requires url and either prompt or attributes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | [] | A list of attribute objects, each with name and description fields, to extract from the webpage. You can also specify Output Types |
| mode | string | standard | Mode of the extractor: standard |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
It's recommended to set the proxy_country parameter to a specific country since a page could be unavailable from some locations.
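The same extract request can be assembled from Python. The following is a minimal sketch using only the standard library, not an official client: the helper names build_extract_payload and build_extract_request are hypothetical, and the validation simply mirrors the minimal-payload rule stated above. The request is built but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.parsera.org"  # host taken from the curl example


def build_extract_payload(url, prompt="", attributes=None, proxy_country=None):
    """Build the JSON body for POST /v1/extractor/extract (hypothetical helper).

    Enforces the documented minimum: url plus either prompt or attributes.
    """
    attributes = attributes or []
    if not url:
        raise ValueError("url is required")
    if not prompt and not attributes:
        raise ValueError("provide a prompt, attributes, or both")
    payload = {"url": url}
    if prompt:
        payload["prompt"] = prompt
    if attributes:
        payload["attributes"] = attributes
    if proxy_country:
        payload["proxy_country"] = proxy_country
    return payload


def build_extract_request(payload, api_key):
    """Wrap the payload in a POST request; nothing is sent until urlopen is called."""
    return urllib.request.Request(
        f"{API_BASE}/v1/extractor/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )


payload = build_extract_payload(
    "https://news.ycombinator.com/",
    prompt="Extract news metadata",
    attributes=[
        {"name": "title", "description": "News title"},
        {"name": "points", "description": "Number of points"},
    ],
    proxy_country="Germany",
)
request = build_extract_request(payload, api_key="<YOUR_API_KEY>")
```

Passing the request to urllib.request.urlopen would send it. The response format is not described in this section, so the sketch stops before that step.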
parse
Parse data from HTML or text content you already have, instead of fetching from a URL:
Endpoint: POST /v1/extractor/parse (alias: POST /v1/parse)
curl https://api.parsera.org/v1/extractor/parse \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
"content": "<HTML_OR_TEXT_HERE>",
"prompt": "Extract news metadata",
"attributes": [
{
"name": "title",
"description": "News title"
},
{
"name": "points",
"description": "Number of points"
}
]
}'
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| content | string | - | Raw HTML or text content to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | - | A list of attribute objects, each with name and description fields, to extract from the content. You can also specify Output Types |
| mode | string | standard | Mode of the extractor: standard |
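Because content is sent as a JSON string, any quotes and newlines in the HTML must be escaped. Pasting raw markup into a curl command by hand is error-prone, so a serializer is the safer route. The sketch below shows one way to do that with Python's json module. The function name build_parse_body is hypothetical, and it treats attributes as required only because the table lists no default for it, which is an assumption.

```python
import json


def build_parse_body(content, attributes, prompt="", mode="standard"):
    """Serialize a body for POST /v1/extractor/parse (hypothetical helper).

    json.dumps escapes quotes, backslashes, and newlines inside the HTML,
    so the result is valid JSON regardless of the markup it carries.
    """
    payload = {"content": content, "attributes": attributes, "mode": mode}
    if prompt:
        payload["prompt"] = prompt
    return json.dumps(payload)


# Sample markup made up for illustration; it contains quotes and a newline.
html = (
    '<tr class="athing"><td><a href="https://example.com">Example story</a></td></tr>\n'
    '<span class="score">42 points</span>'
)

body = build_parse_body(
    html,
    attributes=[
        {"name": "title", "description": "News title"},
        {"name": "points", "description": "Number of points"},
    ],
    prompt="Extract news metadata",
)
```

The resulting string can be sent as the request body with the same Content-Type and X-API-KEY headers used in the curl example.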
extract_markdown
Get clean markdown from a URL:
Endpoint: POST /v1/extractor/extract_markdown (alias: POST /v1/extract_markdown)
curl https://api.parsera.org/v1/extractor/extract_markdown \
--header 'Content-Type: application/json' \
--header 'X-API-KEY: <YOUR_API_KEY>' \
--data '{
"url": "https://news.ycombinator.com/",
"proxy_country": "UnitedStates"
}'
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
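Only url lacks a default in the table above, so a client can plausibly leave the other fields out and rely on the listed defaults. The sketch below illustrates that choice. The function name markdown_request_body is hypothetical, and cookies are passed through unchanged because their structure is described on the Cookies page rather than here.

```python
import json


def markdown_request_body(url, proxy_country=None, cookies=None):
    """Body for POST /v1/extractor/extract_markdown (hypothetical helper).

    Options left unset are omitted, on the assumption that the service
    then applies the defaults listed in the parameter table.
    """
    if not url:
        raise ValueError("url is required")
    body = {"url": url}
    if proxy_country is not None:
        body["proxy_country"] = proxy_country
    if cookies:
        body["cookies"] = cookies  # passed through as-is; format not defined here
    return body


minimal = markdown_request_body("https://news.ycombinator.com/")
explicit = markdown_request_body(
    "https://news.ycombinator.com/", proxy_country="UnitedStates"
)
print(json.dumps(explicit, indent=2))
```

The explicit variant matches the body in the curl example. The minimal variant sends only url.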
Manage Scrapers
The Extractor API also lets you create and manage extractor scrapers. See Manage Scrapers for the full CRUD reference (create, generate, get, delete).
More Features
- Manage Scrapers — Create and manage extractor scrapers
- Specify Output Types
- Setting Proxy
- Setting Cookies
