Extractor
The LLM-powered data extractor is ideal for one-time extraction jobs and for unstructured data.
extract
Pass your API key in the X-API-KEY header when sending a request to the extract endpoint:
curl https://api.parsera.org/v1/extract \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "prompt": "Extract news metadata",
    "attributes": [
      {
        "name": "title",
        "description": "News title"
      },
      {
        "name": "points",
        "description": "Number of points"
      }
    ],
    "proxy_country": "Germany"
  }'

Parameters:
A minimal payload requires url and either prompt or attributes.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | [] | A list of attribute objects with name and description fields to extract from the webpage. Also, you can specify Output Types |
| mode | string | standard | Mode of the extractor, standard or precision. For details, see Precision mode |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
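
The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the helper name build_extract_request is hypothetical, and it only builds the request rather than sending it. It enforces the minimal-payload rule (url plus either prompt or attributes) and fills in the documented defaults. Cookies are left out for brevity.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.parsera.org/v1/extract"


def build_extract_request(api_key, url, prompt="", attributes=None,
                          mode="standard", proxy_country="UnitedStates"):
    """Build (but do not send) a POST request for the extract endpoint.

    Hypothetical helper for illustration; not part of any official client.
    """
    attributes = attributes or []
    # The minimal payload needs url plus either prompt or attributes.
    if not url:
        raise ValueError("url is required")
    if not prompt and not attributes:
        raise ValueError("provide either prompt or attributes")
    payload = {
        "url": url,
        "prompt": prompt,
        "attributes": attributes,
        "mode": mode,
        "proxy_country": proxy_country,
    }
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )


request = build_extract_request(
    api_key="<YOUR_API_KEY>",
    url="https://news.ycombinator.com/",
    attributes=[
        {"name": "title", "description": "News title"},
        {"name": "points", "description": "Number of points"},
    ],
    proxy_country="Germany",
)
# Passing `request` to urllib.request.urlopen would send it; that step is omitted here.
```

Validating the payload before sending avoids spending a request on a call that lacks both prompt and attributes.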
It's recommended to set the proxy_country parameter to a specific country since a page could be unavailable from some locations.
If some data is missing, you can retry with precision mode, which looks into data hidden in HTML tags. For details, see Precision mode.
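
The retry advice above could be automated roughly as follows. This is a sketch under stated assumptions: send is a hypothetical callable that performs the HTTP request and returns the extracted rows as a list of dicts. The actual response format is not described in this section, so adapt the gap check to whatever the API really returns.

```python
def extract_with_precision_fallback(send, payload, attribute_names):
    """Try standard mode first; retry in precision mode if fields are missing.

    `send` is a hypothetical callable (an assumption, not a Parsera API):
    it takes the payload dict and returns a list of row dicts.
    """
    def has_gaps(rows):
        # Treat an empty result, or any row lacking a requested field, as missing data.
        return not rows or any(
            row.get(name) in (None, "") for row in rows for name in attribute_names
        )

    rows = send({**payload, "mode": "standard"})
    if has_gaps(rows):
        rows = send({**payload, "mode": "precision"})
    return rows


def fake_send(payload):
    # Stand-in for a real HTTP call, used only to show the control flow.
    if payload["mode"] == "precision":
        return [{"title": "Example", "points": 10}]
    return [{"title": "Example", "points": None}]


rows = extract_with_precision_fallback(
    fake_send, {"url": "https://news.ycombinator.com/"}, ["title", "points"]
)
```

Starting in standard mode and escalating only when fields are missing keeps precision mode for the pages that need it.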
parse
In addition to extract, there is a parse endpoint for parsing content you supply yourself instead of content fetched from a URL.
Pass the data in the content attribute, which accepts either raw HTML or plain text:
curl https://api.parsera.org/v1/parse \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "content": "<HTML_OR_TEXT_HERE>",
    "prompt": "Extract news metadata",
    "attributes": [
      {
        "name": "title",
        "description": "News title"
      },
      {
        "name": "points",
        "description": "Number of points"
      }
    ]
  }'

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| content | string | - | Raw HTML or text content to extract data from |
| prompt | string | "" | Prompt for initial scraping |
| attributes | array | - | A list of attribute objects with name and description fields to extract from the webpage. Also, you can specify Output Types |
| mode | string | standard | Mode of the extractor, standard or precision. For details, see Precision mode |
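
Because content is a JSON string, any double quotes or newlines in the HTML must be escaped before it goes into the request body. A small sketch of letting a JSON encoder handle that, using Python's standard library (the sample HTML is made up for illustration):

```python
import json

# Made-up sample HTML containing double quotes that need escaping in JSON.
html = '<ul><li class="item">First "quoted" title</li></ul>'

payload = {
    "content": html,
    "prompt": "Extract news metadata",
    "attributes": [{"name": "title", "description": "News title"}],
}

# json.dumps escapes the quotes inside the HTML so the body stays valid JSON.
body = json.dumps(payload)
```

The resulting body string is what would be sent as the request data to the parse endpoint, with the same headers as in the curl example.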
extract_markdown
You can retrieve a page's content as Markdown with the extract_markdown endpoint:
curl https://api.parsera.org/v1/extract_markdown \
  --header 'Content-Type: application/json' \
  --header 'X-API-KEY: <YOUR_API_KEY>' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "proxy_country": "UnitedStates"
  }'

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | - | URL of the webpage to extract data from |
| proxy_country | string | UnitedStates | Proxy country, see Proxy Countries |
| cookies | array | Empty | Cookies to use during extraction, see Cookies |
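
A rough Python equivalent of the curl call above, using only the standard library. The function name fetch_markdown_response and its opener parameter are hypothetical, the latter added so the network call can be swapped out in tests. The response schema is not described in this section, so the sketch returns the body as text, assuming it is UTF-8 encoded, and leaves further parsing to the caller.

```python
import json
import urllib.request


def fetch_markdown_response(api_key, url, proxy_country="UnitedStates",
                            opener=urllib.request.urlopen):
    """POST to extract_markdown and return the raw response body as text.

    Hypothetical helper; `opener` defaults to urllib's urlopen.
    """
    request = urllib.request.Request(
        "https://api.parsera.org/v1/extract_markdown",
        data=json.dumps({"url": url, "proxy_country": proxy_country}).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-KEY": api_key},
        method="POST",
    )
    # Assumption: the body is UTF-8 text. Parse it further once its shape is known.
    with opener(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_markdown_response("<YOUR_API_KEY>", "https://news.ycombinator.com/") would perform the request with the default proxy country.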
More features
Check out further documentation to explore more features: