Predicting eBay Auction Sales with Machine Learning
Abstract:
Online auctions are one of the most popular ways to buy and sell items on the internet. With more than 100 million active users globally (as of Q4 2011), eBay is the world's largest online marketplace, where practically anyone can buy and sell practically anything. The total value of goods sold on eBay was $68.6 billion, more than $2,100 every second. This volume produces huge amounts of data that can be used to support services for buyers and sellers, market research, and product development.
In this analysis, I collect historical auction data from eBay and use machine learning algorithms to predict the sales results of auction items. I describe the features and formulations used to make the predictions. For the sports autograph category on eBay, the algorithms prove relatively accurate and could support a useful set of services for buyers and sellers.
1) Introduction
I run a subscription-based community of 30,000+ collectors of sports autographs and memorabilia: http://www.sportscollectors.net.
Figure 1. Homepage of SportsCollectors.net
Since eBay is the world’s largest marketplace for sports autographs, the vast majority of the site’s membership uses it to buy and/or sell items via auction format. The ability to provide a method to estimate auction sale prices is desirable to this community.
Members of most collectibles communities report that they usually estimate how much an auction will sell for by searching for the item and manually calculating the average sale price shown on the completed listings page (shown in Figure 2 below).
Figure 2. Example of eBay’s completed listings page for autographed baseballs by Jim Rice. In this example, 43 auction-format sales ended in the past 60 days, with an average sale price of $32.05.
To best serve this audience, I am interested in 2 things:
- Determine whether an auction listing will result in a sale.
- Predict final sale prices for auctions.
2) Data Collection
I run an automated process that collects the fixed-price and auction listing information available on eBay. The process queries for listings at the product sku level, where a sku is defined by the combination of:
- Player’s reference data from SportsCollectors.Net - every player to have played pro baseball, football, basketball, and hockey since 1948
- eBay autograph category by sport (shown in Table 1):
Sport | Categories |
Baseball | Balls; Bats; Hats; Helmets; Index Cards; Jerseys; Lithographs, Posters, & Prints; Magazines; Other Autographed Items; Photos; Plaques; Plates; Postcards; Programs; Ticket Stubs; Trading Cards |
Basketball | Balls; Floor, Floorboard; Index Cards; Jerseys; Lithographs, Posters, & Prints; Magazines; Other Autographed Items; Photos; Trading Cards |
Football | Balls; Hats; Helmets; Index Cards; Jerseys; Lithographs, Posters, & Prints; Magazines; Other Autographed Items; Photos; Plaques; Programs; Ticket Stubs; Trading Cards |
Hockey | Index Cards; Jerseys; Magazines; Other Autographed Items; Photos; Pucks; Sticks; Trading Cards |
Table 1. Breakdown of categories of autographed products by sport
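As a rough illustration of the sku definition above, the sketch below pairs each player with every category in that player's sport. The abbreviated reference data, the `make_sku` helper, and the key format are assumptions made for illustration, not the actual collection process.

```python
# Hypothetical, abbreviated reference data; the real lists are much longer.
CATEGORIES_BY_SPORT = {
    "Baseball": ["Balls", "Bats", "Photos"],
    "Hockey": ["Pucks", "Sticks", "Photos"],
}
PLAYERS = [("Jim Rice", "Baseball")]


def make_sku(player: str, sport: str, category: str) -> str:
    """Build an illustrative sku key from sport, category, and player."""
    return f"{sport}|{category}|{player}"


def enumerate_skus(players, categories_by_sport):
    """Yield one sku per (player, category) pair within the player's sport."""
    for player, sport in players:
        for category in categories_by_sport.get(sport, []):
            yield make_sku(player, sport, category)


skus = list(enumerate_skus(PLAYERS, CATEGORIES_BY_SPORT))
print(skus)
```

Each resulting sku would then drive one listing query.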
3) Features
3.1 Auction Features
Table 2. Features extracted from the auction’s meta data:
Feature | Description |
Price | Final price of the auction. If the listing does not result in a sale, the Price will be equal to the StartingBid. |
StartingBid | Minimum bid for the auction |
BidCount | Number of bids made for the auction |
Title | Auction title |
QuantitySold | The number of items sold in the listing. Represented by a 0 or 1. |
SellerRating | Seller’s eBay rating |
SellerAboutMePage | Whether the seller has an eBay About Me page |
StartDate | The beginning date and time of the auction |
EndDate | The ending date and time of the auction |
PositiveFeedbackPercent | The percent of positive feedback (of all the feedback) received by the seller |
HasPicture | Whether the seller included a picture with the listing. Represented by a 0 or 1. |
MemberSince | The date the seller created their online marketplace user account |
HasStore | Whether the seller has an eBay Store. Represented by a 0 or 1. |
SellerCountry | The country of the seller |
BuyItNowPrice | The optional price to buy the item instantly |
HighBidderFeedbackRating | Highest bidder’s eBay rating |
ReturnsAccepted | Whether the seller accepts returns. Represented by a 0 or 1. |
HasFreeShipping | Whether the seller provides free shipping. Represented by a 0 or 1. |
3.2 Derived Features
Table 3. Features derived from the auction’s meta data:
Feature | Description |
IsHOF | Whether the player is in their sport’s Hall of Fame. Represented by a 0 or 1. |
IsAuthenticated | Whether the item received third-party authentication. Represented by a 0 or 1. Determined by inspecting the auction’s title and description details for a whitelisted set of keywords and ruling out a blacklisted set of keywords. |
HasInscription | Whether the item has an inscription. Represented by a 0 or 1. Determined by inspecting the auction’s title and description details for a whitelisted set of keywords and ruling out a blacklisted set of keywords. |
AvgPrice | The average sale price by sku |
MedianPrice | The median sale price by sku |
AuctionCount | The number of auctions listed by sku |
SellerSaleToAveragePriceRatio | The ratio of the sale price realized by a specific seller divided by the average price of the same skus |
SellerAuctionSaleCount | The number of sales the seller has made |
SellerItemSellPercent | The ratio of the number of sales divided by number of auctions listed by seller |
StartDayOfWeek | The day of the week (number) that the auction Started |
EndDayOfWeek | The day of the week (number) that the auction Ended |
AuctionDuration | The number of days the auction lasted |
StartingBidPercent | The ratio of the StartingBid divided by sku’s AvgPrice |
SellerClosePercent | The ratio of the number of auctions resulting in sale for a seller divided by total number of auctions the seller listed |
ItemAuctionSellPercent | The ratio of the number of auctions resulting in sale for a sku divided by total number of auctions listed for the sku |
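A minimal sketch of how a few of the Table 3 features could be derived from a single auction record. The keyword lists are hypothetical (the actual whitelist and blacklist are not given here), the `Description` field name is an assumption, and the Monday=1 day numbering is a convention chosen for this example.

```python
from datetime import datetime

# Hypothetical keyword lists; the actual lists are not given in this write-up.
AUTH_WHITELIST = ("psa/dna", "jsa", "authenticated")
AUTH_BLACKLIST = ("not authenticated", "no coa")


def keyword_flag(text: str, whitelist, blacklist) -> int:
    """Return 1 if a whitelisted keyword appears and no blacklisted one does."""
    lowered = text.lower()
    if any(term in lowered for term in blacklist):
        return 0
    return int(any(term in lowered for term in whitelist))


def derive_features(auction: dict, sku_avg_price: float) -> dict:
    """Compute a few of the Table 3 features for one auction record."""
    start = auction["StartDate"]
    end = auction["EndDate"]
    text = auction["Title"] + " " + auction.get("Description", "")
    return {
        "IsAuthenticated": keyword_flag(text, AUTH_WHITELIST, AUTH_BLACKLIST),
        "StartDayOfWeek": start.isoweekday(),  # Monday=1 ... Sunday=7
        "EndDayOfWeek": end.isoweekday(),
        "AuctionDuration": (end - start).days,
        "StartingBidPercent": auction["StartingBid"] / sku_avg_price,
    }


# Made-up auction record for illustration.
example = {
    "Title": "Jim Rice Signed Baseball JSA",
    "StartingBid": 9.99,
    "StartDate": datetime(2013, 4, 1, 20, 0),
    "EndDate": datetime(2013, 4, 8, 20, 0),
}
features = derive_features(example, sku_avg_price=32.05)
print(features)
```

Plain substring matching is the simplest possible reading of the whitelist/blacklist rule; a production version would likely need word-boundary handling to avoid false matches.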
4) Training and Test Data
Data to determine whether an auction listing will result in a sale:
Data Set | Query Criteria | Records | Mean Sale Price | Median Sale Price | Sale Price Range |
Training Set | All Auctions ending in April 2013 | 258,588 | $28.96 | $9.99 | $0.01-$300.00 |
Test Set | All Auctions ending in first week of May 2013 | 37,460 | $24.65 | $9.99 | $0.01-$300.00 |
Data to predict final sale prices for auctions:
Data Set | Query Criteria | Records | Mean Sale Price | Median Sale Price | Sale Price Range |
Training Subset | Auctions ending with a sale in April 2013 | 79,732 | $33.04 | $14.99 | $0.01-$300.00 |
Test Subset | Auctions ending with a sale in first week of May 2013 | 9,392 | $29.17 | $12.55 | $0.01-$297.50 |
Filter Criteria
- Only Standard auction format.
- Only items signed by a single player.
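The date-based split and filter criteria above can be sketched as follows. The `Format` and `SignerCount` field names are hypothetical, and "first week of May" is assumed to mean May 1 through May 7.

```python
from datetime import datetime


def split_by_end_date(auctions):
    """Split auctions into April 2013 training and first-week-of-May test sets."""
    train_start = datetime(2013, 4, 1)
    test_start = datetime(2013, 5, 1)
    test_end = datetime(2013, 5, 8)  # exclusive bound; assumes May 1-7
    train, test = [], []
    for auction in auctions:
        # Keep only standard-format auctions of items signed by a single player.
        if auction["Format"] != "Standard" or auction["SignerCount"] != 1:
            continue
        end = auction["EndDate"]
        if train_start <= end < test_start:
            train.append(auction)
        elif test_start <= end < test_end:
            test.append(auction)
    return train, test


def sold_only(auctions):
    """Subset used for price prediction: auctions that ended with a sale."""
    return [a for a in auctions if a["QuantitySold"] == 1]


# Made-up records for illustration.
auctions = [
    {"EndDate": datetime(2013, 4, 15), "Format": "Standard", "SignerCount": 1, "QuantitySold": 1},
    {"EndDate": datetime(2013, 4, 20), "Format": "Standard", "SignerCount": 1, "QuantitySold": 0},
    {"EndDate": datetime(2013, 5, 3), "Format": "Standard", "SignerCount": 1, "QuantitySold": 1},
    {"EndDate": datetime(2013, 5, 20), "Format": "Standard", "SignerCount": 1, "QuantitySold": 1},
    {"EndDate": datetime(2013, 4, 10), "Format": "Standard", "SignerCount": 3, "QuantitySold": 1},
]
train, test = split_by_end_date(auctions)
print(len(train), len(test), len(sold_only(train)))
```

Splitting by end date rather than at random keeps the test set strictly later than the training set, which mirrors how predictions would be used on new listings.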
5) Analysis and Prediction
Since this analysis is trying to answer two questions, this section details the methodologies for solving each problem individually.
5.1 Determine whether an auction listing will result in a sale.
This is a binary classification problem, as the goal is to optimally predict QuantitySold (containing values of 0 or 1) as the target feature.
The model chosen to create the classification predictions is logistic regression. Logistic regression uses a set of covariates to predict probabilities of class membership. The best-performing model I found is based on a set of 5 derived features with standardized values.
Prediction via Logistic Regression | Baseline: Prediction of sale when AvgPrice is less than SalePrice |
85.97% | 42.74% |
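To make the approach concrete, here is a minimal logistic regression sketch on standardized features, fitted by plain batch gradient descent. It is an illustration only: the two features used (StartingBidPercent and SellerClosePercent), the toy values, the learning rate, and the 0.5 cutoff are all assumptions, since the five features of the actual model are not listed here.

```python
import math
from statistics import mean, pstdev


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled.append([(v - mu) / sigma for v in col])
    return scaled


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            error = sigmoid(z) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


def predict_sale(weights, x) -> int:
    """Predict QuantitySold (1 = sale) using a 0.5 probability cutoff."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return int(sigmoid(z) >= 0.5)


# Toy data: StartingBidPercent and SellerClosePercent per auction (made-up values).
starting_bid_pct = [0.2, 0.3, 0.5, 1.5, 2.0, 1.8]
seller_close_pct = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
sold = [1, 1, 1, 0, 0, 0]

cols = standardize([starting_bid_pct, seller_close_pct])
rows = list(zip(*cols))
weights = fit_logistic(rows, sold)
accuracy = mean(int(predict_sale(weights, x) == y) for x, y in zip(rows, sold))
print(f"training accuracy: {accuracy:.2f}")
```

Standardizing first puts features with very different scales on an equal footing, which also keeps a single learning rate workable for all weights.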
5.2 Predict final sale prices for auctions.
Figure 3. Histogram of Price feature in Training Subset.
Figure 3 shows a high concentration of sales under $20.00. Since sports autographs on eBay are not a commoditized item (as compared to consumer electronics or books), I have seen wide ranges in sale price for the same items. This leads to one of the challenges for this analysis: the prices in my Training Subset are not Gaussian. Figure 4 below shows the Price feature from the training data after a log transform. Even after the transform, the distribution is skewed, with a very high concentration at the low end (the minimum through the first quartile). The test set also has a very similar distribution.
Figure 4. Histogram of log(Price) feature in Training Subset after log transformation
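A small sketch of the log transform and a quick skew check, using made-up prices with a long right tail. Comparing the mean against the median is only a rough indicator of skew, chosen here for simplicity.

```python
import math
from statistics import mean, median, quantiles

# Made-up sale prices with a long right tail, for illustration only.
prices = [0.99, 4.99, 9.99, 9.99, 14.99, 19.99, 24.99, 49.99, 120.00, 300.00]

# Natural log of each price; valid because every price is above zero.
log_prices = [math.log(p) for p in prices]


def summarize(values):
    """Return mean, median, and first/third quartile cut points."""
    q1, _, q3 = quantiles(values, n=4)
    return {"mean": mean(values), "median": median(values), "q1": q1, "q3": q3}


raw_summary = summarize(prices)
log_summary = summarize(log_prices)

# A mean well above the median signals right skew in the raw prices.
print(raw_summary)
print(log_summary)
```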
I used 2 different approaches to frame price prediction as a machine learning problem:
5.2.1 Price Prediction by Regression
Classification and Regression Trees (CART)
Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values.
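The recursive partitioning described above can be sketched with a tiny single-feature regression tree. This is a simplified illustration, not the tree used in this analysis: it splits on one feature only, uses the leaf mean as the "simple prediction model", and stops when a node has fewer than `2 * min_size` rows (a stopping rule assumed for this example).

```python
from statistics import mean


def sse(values):
    """Sum of squared differences from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on one feature that minimizes total squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    return best


def grow_tree(xs, ys, min_size=2):
    """Recursively partition the data; each leaf predicts the mean target."""
    split = best_split(xs, ys) if len(ys) >= 2 * min_size else None
    if split is None or split[0] >= sse(ys):
        return {"leaf": mean(ys)}
    _, threshold = split
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return {
        "threshold": threshold,
        "left": grow_tree([x for x, _ in left], [y for _, y in left], min_size),
        "right": grow_tree([x for x, _ in right], [y for _, y in right], min_size),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]


# Toy data: sku AvgPrice (feature) vs. sale price (target), made-up values.
avg_price = [5, 6, 7, 8, 30, 32, 35, 40]
sale_price = [4, 6, 6, 8, 28, 30, 36, 42]
tree = grow_tree(avg_price, sale_price)
print(predict(tree, 6.5), predict(tree, 33))
```

A full CART implementation would search over all features at each node and usually prune the tree afterward; both are omitted here to keep the partitioning idea visible.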
I created an optimized decision tree using the Training Subset, and then used it to create predictions against the Test Subset. The Root Mean Squared Error (RMSE) of the CART predictions is 4.30.
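For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch with made-up numbers:

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values for illustration.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(observed, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE far more than many small ones.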
Figure 5. Plot of CART Predictions vs. Observed Sale Prices from the Test Subset. The green line represents Prediction=SalePrice and the blue line is the smoothed actual.
Figure 5 shows that the model does a good job for the majority of the listings and then becomes significantly less accurate at higher prices.
I created subsets of the Test Subset with different maximum predictions. A $50 cap gave the best balance between a low RMSE and keeping a high proportion of the records from the original data set.
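The comparison of maximum-prediction cutoffs can be sketched as a simple sweep that reports, for each cap, the RMSE of the retained rows and the share of rows kept. The candidate caps and the data below are made up for illustration.

```python
import math


def rmse_by_max_prediction(observed, predicted, caps):
    """For each cap, keep rows with prediction < cap; report RMSE and share kept."""
    results = []
    for cap in caps:
        kept = [(o, p) for o, p in zip(observed, predicted) if p < cap]
        if not kept:
            continue
        error = math.sqrt(sum((o - p) ** 2 for o, p in kept) / len(kept))
        results.append({"cap": cap, "rmse": error, "share": len(kept) / len(observed)})
    return results


# Made-up observed and predicted prices; errors grow with price.
observed = [8.0, 12.0, 22.0, 41.0, 95.0, 180.0]
predicted = [9.0, 11.0, 20.0, 45.0, 70.0, 120.0]
results = rmse_by_max_prediction(observed, predicted, caps=[25, 50, 100, 300])
for row in results:
    print(row)
```

Reading the two columns side by side shows the trade-off: a lower cap reduces RMSE but discards more of the data set.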
Baseline
I created an analysis baseline using the CART model. I created a decision tree based on the log(Price) and AvgPrice features in the Training Subset. The tree is used along with the Test Subset to create predictions. Using this method, the predictions had an RMSE of 5.30.
5.2.2 Multi-Class Classification
Using the Training Subset, I divided the Sale Price (target variable) into $5 intervals and created discrete categories. Each auction is assigned to one category. This allows for a multiclass classification problem in which case the output is a $5 range instead of the specific price.
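A minimal sketch of the $5 binning. The boundary handling (half-open intervals, so $5.00 falls in the second interval) and the 0-based index are assumptions; the exact convention used in the analysis is not stated.

```python
import math


def price_interval(price: float, width: float = 5.0) -> int:
    """Map a price to a 0-based interval index: [0, 5) -> 0, [5, 10) -> 1, ..."""
    return int(math.floor(price / width))


def interval_label(index: int, width: float = 5.0) -> str:
    """Human-readable price range for an interval index."""
    low = index * width
    return f"${low:.2f}-${low + width - 0.01:.2f}"


for price in (0.01, 9.99, 14.99, 32.05):
    idx = price_interval(price)
    print(price, idx, interval_label(idx))
```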
Prediction using K-Nearest Neighbors (KNN)
Wikipedia defines KNN as a non-parametric method for classifying objects based on closest training examples in the feature space.
I used KNN to build a model from the Training Subset, with the Price Interval as the factor and k=3. I then applied this model to the Test Subset to predict which $5 interval each item would fall into. Below are the results:
- accurately predicted: 50.92%
- predictions within one group: 76.68%
- predictions within two groups: 86.15%
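The sketch below shows a k=3 nearest-neighbor classifier over interval labels, along with the three accuracy measures listed above (exact, within one group, within two groups). The features (AvgPrice and MedianPrice), the unscaled Euclidean distance, and all values are assumptions for illustration; the features actually used for KNN are not listed here.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Return the most common label among the k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def share_within(actual, predicted, tolerance):
    """Share of predictions at most `tolerance` intervals from the actual one."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)


# Toy features (AvgPrice, MedianPrice) and $5-interval labels, made-up values.
train_rows = [(6, 5), (7, 6), (8, 8), (22, 20), (24, 21), (26, 25), (48, 45), (50, 47)]
train_labels = [1, 1, 1, 4, 4, 5, 9, 9]
test_rows = [(7, 7), (23, 22), (40, 38)]
test_labels = [1, 4, 7]

predicted = [knn_predict(train_rows, train_labels, q) for q in test_rows]
for tol in (0, 1, 2):
    print(tol, share_within(test_labels, predicted, tol))
```

Treating the interval index as an ordered number is what makes the "within one group" and "within two groups" measures meaningful, even though KNN itself treats the labels as unordered classes.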
5.2.3 Filtered Prediction
I combined the two methods to get optimal results:
- Use CART predictions only where the predicted price is under $50.00.
- Use KNN classification predictions to limit outliers. Filter out auctions where the KNN-predicted $5 interval is more than 2 intervals away from the CART-predicted price.
Method | RMSE | Sale-Prediction | Standard Deviation | % of Subset |
Baseline CART (using AvgPrice) | 5.30 | +21% | 40 | 100% |
CART | 4.30 | -15% | 28 | 100% |
CART for predictions under $50.00 | 3.52 | -14% | 15 | 88% |
CART for predictions under $50.00 and within 2 price intervals | 0.84 | -7% | 12 | 65% |
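The two filtering rules of section 5.2.3 can be sketched as follows. This reads the second rule as "the KNN interval and the interval containing the CART price differ by more than 2", which is an interpretation; the function name, parameters, and data are hypothetical.

```python
import math


def filtered_predictions(cart_prices, knn_intervals, max_price=50.0,
                         max_interval_gap=2, width=5.0):
    """Return indices where CART predicts under max_price and KNN roughly agrees."""
    kept = []
    for i, (price, knn_interval) in enumerate(zip(cart_prices, knn_intervals)):
        if price >= max_price:
            continue  # rule 1: drop CART predictions of $50.00 or more
        cart_interval = int(math.floor(price / width))
        if abs(cart_interval - knn_interval) > max_interval_gap:
            continue  # rule 2: drop rows where the two models disagree widely
        kept.append(i)
    return kept


# Made-up CART price predictions and KNN interval predictions for five auctions.
cart_prices = [12.0, 48.0, 75.0, 22.0, 9.0]
knn_intervals = [2, 3, 15, 4, 1]
kept = filtered_predictions(cart_prices, knn_intervals)
print(kept, f"{len(kept) / len(cart_prices):.0%} of subset kept")
```

The filter trades coverage for accuracy: it only reports a price when two independent models roughly agree, which is why the share of the subset covered shrinks as rules are added.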
6) Conclusions
How reliable is using the average price for predicting sales?
I would not be comfortable using the average price to predict auction sales. Compared with the other methods, it has a higher RMSE, a larger gap between predicted and realized prices, and a very high standard deviation.
Can we determine whether an auction listing will result in a sale?
Since logistic regression provided an almost 86% success rate for predicting if the auction would result in a sale, I would be comfortable considering the model’s prediction.
This would be quite useful from a seller’s perspective, to help minimize:
- The listing fees a seller would accumulate due to unsold listings.
- The time invested in listing unsuccessful items.
Can we predict final sale prices for auctions?
Combining the CART prediction with KNN to eliminate obvious outliers does a good job of predicting the final sale price when predictions are under $50 (about 65% of the observed cases).
Since I have yet to find a service or commercial product related to predicting eBay auction results, this analysis could be valuable in offering services for the following applications:
- Buying recommendations/arbitrage
- Listing Optimization
- Product Sourcing and Logistics
- Price Estimation for Complementary Services (Shipping, Insurance providers)
7) Future Exploration
There are some additional features I can consider to potentially enhance these models:
- Time of year, month, major events (Super Bowl, Spring Training)
- Bid timing, such as clusters of bids near the end of auctions. Chou et al 2007 have analyzed price prediction based on bidding patterns: A Simulation-Based Model for Final Price Prediction in Online Auctions (http://www.jem.org.tw/content/pdf/Vol.3No.1/01.pdf)
- Measure of Player’s Popularity/Demand/Interest (on eBay, sportscollectors.net, twitter, espn, etc.)
- Semantic parsing of auction descriptions for the most relevant keywords or phrases
Other areas to explore with this data:
- Sku representations for items signed by multiple players
- Guidance around 3rd party authentication services
- Product recommendations and predicting arbitrage scenarios.
- More Verticals: Other Collectibles (sports cards, coins, stamps, toys), Car parts, Consumer Electronics