Jay Grossman

Creating Singer Tap to Capture Ebay Completed Items

Creating Singer Tap to Capture Ebay Completed Items

What is Singer

The Singer specification bills itself as "the open-source standard for writing scripts that move data". PipelineWise and Meltano are popular open source platforms that use the Singer specification to accommodate ingest and replication of data from various sources to various destinations.

Singer describes how data extraction scripts—called "taps" —and data loading scripts—called "targets"— should communicate, allowing them to be used in any combination to move data from any source to any destination. Send data between databases, web APIs, files, queues, and just about anything else you can think of.


ETL Pipeline in Singer, image credit: panoply

Completed Items on eBay

For over 20 years, eBay has allowed users to search by keyword for auctions and Buy It Now listings that have recently ended. This is very helpful to provide buyer and sellers with a directional idea of how much comparable items sell for.

On the left side of any search page (near the bottom of the search results), users have the ability to filter by Completed Items as shown below:

The screen below shows a search for the search term "iphone 14":


URL: https://www.ebay.com/sch/i.html?LH_Complete=1&_nkw=iphone+14

The Challenge

For a long time I've wanted a standard method to be able to capture data about items that have sold on eBay and have an easy + repeatable way to save the data (to files, a database, etc.). I have pieced together various scripts in different languages (python, powershell, C#, php) at different times to accomplish this, but I wanted it to be based off a more standard and extensible framework.

eBay offers developer APIs to query their data which I have used. While I could use the API's findCompletedItems end point, it is a far more straight forward implementation and faster learning opportunity for me to use some common python libraries to get the data from eBay's public web site.

 

tap-ebaycompleted

Code for this Singer Tap is posted on github (click on image below):

Project Scope

So the goal is to create a Singer tap that will allow us to generate out properly formatted JSON data with the details of the listing, that can be consumed by a Singer target (such as writing to a .csv file, an API or PostGres database). Below is a visual illustration:

Helpful links to get background on Developing Singer Taps

There are 2 functions in the singer python library for my simple tap that we must care about:
  1. singer.write_schema - responsible for defining the JSON schema that data will be outputted
  2. singer.write_records - responsible for outputting each record (data row)

 

Setting up development for Singer Tap

In order to develop a tap, we need to install the Singer library:

1
pip install singer-python

 Next we'll install cookiecutter and download the tap template to give us a starting point:

1
2
3
4
pip install cookiecutter
cookiecutter https://github.com/singer-io/singer-tap-template.git
project_name [e.g. 'tap-facebook']: tap-ebaycompleted
package_name [tap_ebaycompleted]:

Defining the schema

The JSON schema defines the structure that the data will be outputted by the tap. Below I am writing a small inline python function to do so: 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def get_schema():
schema = {
    "properties": {
        "search_term": {"type": "string"},
        "title": {"type": "string"},
        "price": {"type": "string"},
        "bids": {"type": "string"},
        "buy_it_now": {"type": "boolean"},
        "condition": {"type": "string"},
        "image": {"type": "string"},
        "link": {"type": "string"},
        "end_date": {"type": "string"}
        "has_sold": {"type": "boolean"}
        "id": {"type": "string"}
        }
    }
return schema

We later call this function to create the schema object:

1
2
schema = get_schema()
singer.write_schema("completed_item_schema", schema, "id")

Getting data from eBay's Completed Items Search Results

In python, there are a few different libraries to parse the contents of a web page. In this tap, I use:

While I added some extra logic in the actual tap, below will provide the basic idea of how we can parse the page if you are new to these libraries:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
search_term = 'iphone 13'
url = "https://www.ebay.com/sch/i.html?LH_Complete=1&_sop=13&_nkw={search_term}"

response = requests.get(url)
html_content = str(response.content)
soup = BeautifulSoup(html_content, "html.parser")

# Each completed item is within a <li> tag with class=s-item s-item__pl-on-bottom
listings = soup.find_all("li", class_="s-item s-item__pl-on-bottom")

# Iterate over completed listings
for listing in listings:
    title = listing.find("div", class_="s-item__title").text
    price = listing.find("span", class_="s-item__price").text
    condition = listing.find("span", class_="SECONDARY_INFO").text
    image= listing.find("img")['src']
    link = listing.find("a", class_="s-item__link")['href']
    id = link[0:link.index("?")].replace("https://www.ebay.com/itm/", "")
    bids = ""
    try:
        bids = listing.find("span", class_="s-item__bids s-item__bidCount").text
    except:
        bids = ""
    buy_it_now = False
    try:
        if listing.find("span", class_="s-item__dynamic s-item__buyItNowOption").text == "Buy It Now":
            buy_it_now = True
    except:
        buy_it_now = False
    has_sold = False
    try:
        if listing.find("div", class_="s-item__title--tag").find("span", class_="clipped").text == "Sold Item":
            has_sold = True
            end_date=listing.find("div", class_="s-item__title--tag").find("span", class_="POSITIVE").text
        else:
            end_date = listing.find("div", class_="s-item__title--tag").find("span", class_="NEGATIVE").text
    except:
        end_date=""

    record = {
        "search_term": search_term,
        "title": title,
        "price": price,
        "condition": condition,
        "image": image,
        "link": link,
        "id": id,
        "bids": bids,
        "buy_it_now": buy_it_now,
        "end_date": end_date,
        "has_sold": has_sold
    } 

We can then pass the schema ("completed_item_schema") that we defined earlier and the record to singer.write_records:

1
singer.write_records("completed_item_schema", [record])

Building and Configuring the Tap

The template created a directory named tap-ebaycompleted. Since this is a very simple tap, I removed the contents and created a single file called __init__.py. In the main function, I provided the logic specified in the sections above with a little bit of additional logic.

Create a configuration file

There is a template you can use at config.json.example, just copy it to config.json in the repo root and update the following values:

1
2
3
4
5
6
{
    "search_terms": ["iphone 13", "iphone 14"],
    "page_size": 240,
    "min_wait": 2.1,
    "max_wait": 4.4
} 
Variable Description
search_terms list of terms that the tap will search for REQUIRED
page_size number of records to return (values can be 240,120,60), default is 120
min_wait minimum amount of time between searches, default is 2 seconds
max_wait maximum amount of time between searches, default is 5 seconds

 

Running the Tap

Let's create a virtual environment to run our tap within:

1
2
3
4
5
6
7
8
9
cd tap-ebaycompleted
python3 -m venv ~/.virtualenvs/tap-ebaycompleted
source ~/.virtualenvs/tap-ebaycompleted/bin/activate
git clone git@github.com:jaygrossman/tap-ebaycompleted.git
cd tap-ebaycompleted
pip3 install requests
pip3 install BeautifulSoup4
pip3 install .
deactivate 

We can run our tap with the following command:

1
~/.virtualenvs/tap-ebaycompleted/bin/tap-ebaycompleted -c config.json

Below is a sample record representing a completed item:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{
    "type": "RECORD", 
"stream": 
"completed_item_schema", 
"record": {
    "search_term": "iphone 13", 
    "title": "Apple iPhone 13 - 128GB - Midnight", 
    "price": "$411.00", 
    "condition": "Pre-Owned", 
    "image": "https://shorturl.at/bqrsK", 
    "link": "https://www.ebay.com/itm/314645218752", 
    "id": "314645218752", 
    "bids": "", 
    "buy_it_now": false
    "end_date": "Jun 21, 2023"
    "has_sold": True
    }
}

Running the Tap with a Target to Export the Data to a .csv File

Install target-csv:

1
2
3
4
python3 -m venv ~/.virtualenvs/target-csv
source ~/.virtualenvs/target-csv/bin/activate
pip3 install target-csv
deactivate

We can run our tap piped to target-csv with the following command:

1
2
~/.virtualenvs/tap-ebaycompleted/bin/tap-ebaycompleted -c config.json |
~/.virtualenvs/target-csv/bin/target-csv 

Below is the first couple of lines of the .csv file the command generated:

search_term,title,price,condition,image,link,id,bids,buy_it_now,end_date,has_sold

iphone 13,Apple iPhone 13 - 128GB - Midnight (Verizon),$170.00,Parts Only,https://i.ebayimg.com/thumbs/images/g/AKkAAOSw55lkhLVt/s-l300.jpg,https://www.ebay.com/itm/225623521919,225623521919,0 bids,False,"Jun 21, 2023",True

Conclusions and What's Next

  • You should not consider this a production quality tap. eBay often changes the format of the HTML on their search results page and this Tap will need to be updated when this occurs.  

    The tap-ebay project (developed by Drew Banin of Fishtown/dbt) is well done. It uses the fulfillment API to allow a seller to query for their orders received via eBay's marketplace.  I may refactor this tap using some of Drew's concepts.

  • Singer is very robust and powerful. It has pretty slick support for Authentication to data source and State management that I did not need in this very simple idempotent example.

  • This python code only returns the first page of results for the provided search terms. It would be beneficial to add in support to iterate through multiple pages of results.

    Update 2023-06-27: Added support to configure the maximum search result pages to capture records from.

  • I'd like to extend the tap with the option of retrieving the list of search terms getting pulled from web url or an API call.

    Update 2023-06-28: Added support to configure a public url where search terms and corresponding SKUs can be fed in from.

  • I will probably build a new Singer target to persist the data from this tap to my private API . And then I can orchestrate the pipeline on a cron with either Dagster or try out Pipelinewise.

    Update 2023-06-30: Created a configurable singer target with blog post at Creating Singer Target to Send Data to Web Endpoint