Creating Singer Tap to Capture Ebay Completed Items
What is Singer
The Singer specification bills itself as "the open-source standard for writing scripts that move data". PipelineWise and Meltano are popular open source platforms that use the Singer specification to accommodate ingest and replication of data from various sources to various destinations.
Singer describes how data extraction scripts—called "taps" —and data loading scripts—called "targets"— should communicate, allowing them to be used in any combination to move data from any source to any destination. Send data between databases, web APIs, files, queues, and just about anything else you can think of.
ETL Pipeline in Singer, image credit: panoply
Completed Items on eBay
For over 20 years, eBay has allowed users to search by keyword for auctions and Buy It Now listings that have recently ended. This is very helpful to provide buyer and sellers with a directional idea of how much comparable items sell for.
On the left side of any search page (near the bottom of the search results), users have the ability to filter by Completed Items as shown below:
The screen below shows a search for the search term "iphone 14":
URL: https://www.ebay.com/sch/i.html?LH_Complete=1&_nkw=iphone+14
The Challenge
For a long time I've wanted a standard method to be able to capture data about items that have sold on eBay and have an easy + repeatable way to save the data (to files, a database, etc.). I have pieced together various scripts in different languages (python, powershell, C#, php) at different times to accomplish this, but I wanted it to be based off a more standard and extensible framework.
eBay offers developer APIs to query their data which I have used. While I could use the API's findCompletedItems end point, it is a far more straight forward implementation and faster learning opportunity for me to use some common python libraries to get the data from eBay's public web site.
tap-ebaycompleted
Code for this Singer Tap is posted on github (click on image below):
Project Scope
So the goal is to create a Singer tap that will allow us to generate out properly formatted JSON data with the details of the listing, that can be consumed by a Singer target (such as writing to a .csv file, an API or PostGres database). Below is a visual illustration:
Helpful links to get background on Developing Singer Taps
- Singer provides getting started docs on creating taps.
- Meltano provides code samples for developing taps
- singer.write_schema - responsible for defining the JSON schema that data will be outputted
- singer.write_records - responsible for outputting each record (data row)
Setting up development for Singer Tap
In order to develop a tap, we need to install the Singer library:
1
pip install singer-python
Next we'll install cookiecutter and download the tap template to give us a starting point:
1
2
3
4
pip install cookiecutter
cookiecutter https://github.com/singer-io/singer-tap-template.git
project_name [e.g. 'tap-facebook']: tap-ebaycompleted
package_name [tap_ebaycompleted]:
Defining the schema
The JSON schema defines the structure that the data will be outputted by the tap. Below I am writing a small inline python function to do so:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def get_schema():
schema = {
"properties": {
"search_term": {"type": "string"},
"title": {"type": "string"},
"price": {"type": "string"},
"bids": {"type": "string"},
"buy_it_now": {"type": "boolean"},
"condition": {"type": "string"},
"image": {"type": "string"},
"link": {"type": "string"},
"end_date": {"type": "string"}
"has_sold": {"type": "boolean"}
"id": {"type": "string"}
}
}
return schema
We later call this function to create the schema object:
1
2
schema = get_schema()
singer.write_schema("completed_item_schema", schema, "id")
Getting data from eBay's Completed Items Search Results
In python, there are a few different libraries to parse the contents of a web page. In this tap, I use:
- requests for getting the HTML of the page
- BeautifulSoup 4 for parsing the elements on the page
While I added some extra logic in the actual tap, below will provide the basic idea of how we can parse the page if you are new to these libraries:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
search_term = 'iphone 13'
url = "https://www.ebay.com/sch/i.html?LH_Complete=1&_sop=13&_nkw={search_term}"
response = requests.get(url)
html_content = str(response.content)
soup = BeautifulSoup(html_content, "html.parser")
# Each completed item is within a <li> tag with class=s-item s-item__pl-on-bottom
listings = soup.find_all("li", class_="s-item s-item__pl-on-bottom")
# Iterate over completed listings
for listing in listings:
title = listing.find("div", class_="s-item__title").text
price = listing.find("span", class_="s-item__price").text
condition = listing.find("span", class_="SECONDARY_INFO").text
image= listing.find("img")['src']
link = listing.find("a", class_="s-item__link")['href']
id = link[0:link.index("?")].replace("https://www.ebay.com/itm/", "")
bids = ""
try:
bids = listing.find("span", class_="s-item__bids s-item__bidCount").text
except:
bids = ""
buy_it_now = False
try:
if listing.find("span", class_="s-item__dynamic s-item__buyItNowOption").text == "Buy It Now":
buy_it_now = True
except:
buy_it_now = False
has_sold = False
try:
if listing.find("div", class_="s-item__title--tag").find("span", class_="clipped").text == "Sold Item":
has_sold = True
end_date=listing.find("div", class_="s-item__title--tag").find("span", class_="POSITIVE").text
else:
end_date = listing.find("div", class_="s-item__title--tag").find("span", class_="NEGATIVE").text
except:
end_date=""
record = {
"search_term": search_term,
"title": title,
"price": price,
"condition": condition,
"image": image,
"link": link,
"id": id,
"bids": bids,
"buy_it_now": buy_it_now,
"end_date": end_date,
"has_sold": has_sold
}
We can then pass the schema ("completed_item_schema") that we defined earlier and the record to singer.write_records:
1
singer.write_records("completed_item_schema", [record])
Building and Configuring the Tap
The template created a directory named tap-ebaycompleted. Since this is a very simple tap, I removed the contents and created a single file called __init__.py. In the main function, I provided the logic specified in the sections above with a little bit of additional logic.
Create a configuration file
There is a template you can use at config.json.example, just copy it to config.json in the repo root and update the following values:
1
2
3
4
5
6
{
"search_terms": ["iphone 13", "iphone 14"],
"page_size": 240,
"min_wait": 2.1,
"max_wait": 4.4
}
Variable | Description |
search_terms | list of terms that the tap will search for REQUIRED |
page_size | number of records to return (values can be 240,120,60), default is 120 |
min_wait | minimum amount of time between searches, default is 2 seconds |
max_wait | maximum amount of time between searches, default is 5 seconds |
Running the Tap
Let's create a virtual environment to run our tap within:
1
2
3
4
5
6
7
8
9
cd tap-ebaycompleted
python3 -m venv ~/.virtualenvs/tap-ebaycompleted
source ~/.virtualenvs/tap-ebaycompleted/bin/activate
git clone git@github.com:jaygrossman/tap-ebaycompleted.git
cd tap-ebaycompleted
pip3 install requests
pip3 install BeautifulSoup4
pip3 install .
deactivate
We can run our tap with the following command:
1
~/.virtualenvs/tap-ebaycompleted/bin/tap-ebaycompleted -c config.json
Below is a sample record representing a completed item:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{
"type": "RECORD",
"stream":
"completed_item_schema",
"record": {
"search_term": "iphone 13",
"title": "Apple iPhone 13 - 128GB - Midnight",
"price": "$411.00",
"condition": "Pre-Owned",
"image": "https://shorturl.at/bqrsK",
"link": "https://www.ebay.com/itm/314645218752",
"id": "314645218752",
"bids": "",
"buy_it_now": false
"end_date": "Jun 21, 2023"
"has_sold": True
}
}
Running the Tap with a Target to Export the Data to a .csv File
Install target-csv:
1
2
3
4
python3 -m venv ~/.virtualenvs/target-csv
source ~/.virtualenvs/target-csv/bin/activate
pip3 install target-csv
deactivate
We can run our tap piped to target-csv with the following command:
1
2
~/.virtualenvs/tap-ebaycompleted/bin/tap-ebaycompleted -c config.json |
~/.virtualenvs/target-csv/bin/target-csv
Below is the first couple of lines of the .csv file the command generated:
search_term,title,price,condition,image,link,id,bids,buy_it_now,end_date,has_sold
iphone 13,Apple iPhone 13 - 128GB - Midnight (Verizon),$170.00,Parts Only,https://i.ebayimg.com/thumbs/images/g/AKkAAOSw55lkhLVt/s-l300.jpg,https://www.ebay.com/itm/225623521919,225623521919,0 bids,False,"Jun 21, 2023",True
Conclusions and What's Next
- You should not consider this a production quality tap. eBay often changes the format of the HTML on their search results page and this Tap will need to be updated when this occurs.
The tap-ebay project (developed by Drew Banin of Fishtown/dbt) is well done. It uses the fulfillment API to allow a seller to query for their orders received via eBay's marketplace. I may refactor this tap using some of Drew's concepts. - Singer is very robust and powerful. It has pretty slick support for Authentication to data source and State management that I did not need in this very simple idempotent example.
- This python code only returns the first page of results for the provided search terms. It would be beneficial to add in support to iterate through multiple pages of results.
Update 2023-06-27: Added support to configure the maximum search result pages to capture records from. - I'd like to extend the tap with the option of retrieving the list of search terms getting pulled from web url or an API call.
Update 2023-06-28: Added support to configure a public url where search terms and corresponding SKUs can be fed in from. - I will probably build a new Singer target to persist the data from this tap to my private API . And then I can orchestrate the pipeline on a cron with either Dagster or try out Pipelinewise.
Update 2023-06-30: Created a configurable singer target with blog post at Creating Singer Target to Send Data to Web Endpoint