Scrape API
Get the web data you need in real-time
Overview
Our /scrape
endpoint allows you to get the web data you need in real-time.
Endpoint
POST https://api.datafiniti.co/v4/scrape
Input
A json object containing the urls
you want to scrape data from. The json object is structured like so:
{
"urls": []
}
You can either provide the URL you want to scrape as a string or you can choose to provide an object that contains the domain
along with an identifier
(e.g. SKU, ASIN) unique to the domain provided.
The following snippet contains an example of the accepted URL formats described above:
{
"urls": [
"https://www.amazon.com/dp/B07ZMGJCC4?productDetails&th=1&psc=1",
{
"domain": "amazon.com",
"identifier": "B07ZMGJCC4"
},
"https://www.newegg.com/p/1Z4-002P-015K6"
]
}
Which URL format should I use?
identifier
corresponds to fields in our Product Data Schema which helps users that are already integrated with Datafiniti. If you are unfamiliar with our schema or if identifiers are providing any confusion, simply provide URLs as strings instead.
The urls
array must contain 1 URL at minimum with a maximum of 10. The URLs provided do not have to be from the same domain. We will scrape all URLs provided in the request body in parallel so long as we support them. The sources that we currently support can be found Here. Please contact customer support if the source you want to scrape is not present in the hyperlinked section.
Rate Limiting
You may only make one request to the
/scrape
endpoint at a time.
Output
The response body returned to you is a json object that either contains a data
array or an errors
array. These arrays are mutually exclusive.
The presence of an errors
array indicates that a fatal error occurred and none of the URLs provided in your request body were processed. The errors
array will contain strings that indicate why we could not service your request. Please contact customer support and provide us with the request body that yielded an errors
response should this occur to you during your usage of this feature. Please provide us with the response body returned to you as well.
The presence of a data
array indicates that we were able to attempt scraping the URLs you provided in your request body. Each element in the data
array will correspond to one of the URLs you provided in your request. These elements are json objects and will always contain the following fields:
input
: An individual URL from the URLs array provided in your request body.finalized_url
: The resulting URL from generated from the URL you provided.results
: An array containing the data we scraped from the URL.errors
: An array containing error messages if any were encountered while attempting to scrape a specific URL. This errors array will always be present. It will be empty if no errors were encountered.credits_charged
: The number of scrape credits charged for the url. You will only get charged when the scrape is successful.
The following snippet is an example response body:
{
"data": [
{
"input": {
"domain": "amazon.com",
"identifier": "B07ZMGJCC4"
},
"final_url": "https://www.amazon.com/dp/B07ZMGJCC4?productDetails&th=1&psc=1",
"results": [
{
"dataType": "product",
"asins": "B07ZMGJCC4",
"brand": "CornBorn",
"name": "CornBorn Choose Your Design - ISU Long Sleeve T-Shirt",
"prices": [
{
"availability": "true",
"color": "Black",
"size": "X-Large",
"merchant": "amazon.com",
"shipping": "+ 6.99 shipping",
"sourceURLs": [
"https://www.amazon.com/dp/B07ZMGJCC4?productDetails&th=1&psc=1"
],
"dateSeen": [
"2021-01-13T16:48:00.000Z"
],
"currency": "USD",
"amountMin": 19.99,
"amountMax": 19.99
}
]
}
],
"errors": [],
"credits_charged": 1
}
]
}
We send the data scraped from the URLs you provide through our normalization pipeline in an effort to provide you with data that's easier to work with. It's possible that this pipeline rejects all fields collected from the URL. In these cases, the results
array will contain the following:
{
"error": "Scraped data failed to meet our validation standards."
}
The reason that error
is present in the results
array versus data.errors
is because more than 1 result can be returned from a given URL (e.g. scraping a listing page on amazon that contains many products)
Supported Sources
Source | Domain | Identifier | Health |
---|---|---|---|
Amazon | amazon.com | asins | ✅ |
Newegg | newegg.com | skus.value | ✅ |
Bestbuy | bestbuy.com | skus.value | ✅ |
Walmart | walmart.com | websiteIDs | ✅ |
What does health mean?
Health Indicates whether we're currently able to access a given source. There are times when our requests to a given source will fail for a short period which is when you would see an X
Usage
const axios = require('axios');
const DF_TOKEN = "INPUT YOUR DF TOKEN HERE";
(async () => {
const dfAPI = axios.create({
baseURL: 'https://api.datafiniti.co/v4',
headers: {
'Authorization': `Bearer ${DF_TOKEN}`,
'Content-type': 'application/json'
}
})
try {
const resp = await dfAPI.post('/scrape', {
urls: ['https://www.amazon.com/dp/B07ZMGJCC4?productDetails&th=1&psc=1']
});
console.log(JSON.stringify(resp.data, null, 4));
} catch (error) {
const { response } = error;
if (response) {
console.log(`Error ${(response.status)} Error: ${JSON.stringify(response.data, null, 4)}`);
} else {
console.log(`Error: ${error.stack}`);
}
}
})();
Updated 8 months ago