Jay Grossman
Published on: 10 Jun 2013

Predicting eBay Auction Sales with Machine Learning

Abstract:

Online auctions are one of the most popular methods to buy and sell items on the internet. With more than 100 million active users globally (as of Q4 2011), eBay is the world's largest online marketplace, where practically anyone can buy and sell practically anything. The total value of goods sold on eBay was $68.6 billion, more than $2,100 every second. This kind of volume produces huge amounts of data that can be used to provide services to buyers and sellers and to support market research and product development.

In this analysis, I collect historical auction data from eBay and use machine learning algorithms to predict the sales results of auction items. I describe the features and formulations used to make the predictions. For the sports autograph category on eBay, the algorithms are relatively accurate and could support a useful set of services for buyers and sellers.

1) Introduction

I run a subscription-based community with 30,000+ collectors of sports autographs and memorabilia: http://www.sportscollectors.net.

Figure 1. Homepage of SportsCollectors.net

Since eBay is the world’s largest marketplace for sports autographs, the vast majority of the site’s membership uses it to buy and/or sell items via auction format. The ability to provide a method to estimate auction sale prices is desirable to this community.

Members of most collectibles communities report that they usually try to predict how much an auction will sell for by searching for the item and manually calculating the average sale price shown on the completed listings page (shown in Figure 2 below).

Figure 2. Example of eBay’s completed listings page for autographed baseballs by Jim Rice. In this example, there have been 43 sales via auction format with an average sale price of $32.05 ending in the past 60 days.

To best serve this audience, I am interested in 2 things:

Determine whether an auction listing will result in a sale.
Predict final sale prices for auctions.

2) Data Collection

I run an automated process that collects fixed price and auction listings information available on eBay. The process queries for listings at product sku level, defined by the combination of:

Player’s reference data from SportsCollectors.Net - every player to have played pro baseball, football, basketball, and hockey since 1948
eBay autograph category by sport (shown in Table 1):

Sport	Categories
Baseball	Balls, Bats, Hats, Helmets, Index Cards, Jerseys, Lithographs, Posters, & Prints, Magazines, Other Autographed Items, Photos, Plaques, Plates, Postcards, Programs, Ticket Stubs, Trading Cards
Basketball	Balls, Floor, Floorboard, Index Cards, Jerseys, Lithographs, Posters, & Prints, Magazines, Other Autographed Items, Photos, Trading Cards
Football	Balls, Hats, Helmets, Index Cards, Jerseys, Lithographs, Posters, & Prints, Magazines, Other Autographed Items, Photos, Plaques, Programs, Ticket Stubs, Trading Cards
Hockey	Index Cards, Jerseys, Magazines, Other Autographed Items, Photos, Pucks, Sticks, Trading Cards

Table 1. Breakdown of categories of autographed products by sport

3) Features

3.1 Auction Features

Table 2. Features extracted from the auction’s meta data:

Feature	Description
Price	Final price of the auction. If the listing does not result in a sale, the Price will be equal to the StartingBid.
StartingBid	Minimum bid for the auction
BidCount	Number of bids made for the auction
Title	Auction title
QuantitySold	The number of items sold in the listing. Represented by a 0 or 1.
SellerRating	Seller’s eBay rating
SellerAboutMePage	Whether the seller has an eBay About Me page
StartDate	The beginning date and time of the auction
EndDate	The ending date and time of the auction
PositiveFeedbackPercent	The percent of positive feedback (of all the feedback) received by the seller
HasPicture	Indicates the seller included a picture with the listing. Represented by a 0 or 1.
MemberSince	The date the seller created their online marketplace user account
HasStore	Indicates the seller has an eBay Store. Represented by a 0 or 1.
SellerCountry	The country of the seller
BuyItNowPrice	The optional price to buy the item instantly
HighBidderFeedbackRating	Highest bidder’s eBay rating
ReturnsAccepted	Whether the seller accepts returns. Represented by a 0 or 1.
HasFreeShipping	Whether the seller provides free shipping. Represented by a 0 or 1.

3.2 Derived Features

Table 3. Features derived from the auction’s meta data:

Feature	Description
IsHOF	Whether the player is in their sport’s Hall of Fame. Represented by a 0 or 1.
IsAuthenticated	Whether the item received third party authentication. Represented by a 0 or 1. Determined by inspecting the auction’s title and description details for a whitelisted set of keywords and ruling out a blacklisted set of keywords.
HasInscription	Whether the item has an inscription. Represented by a 0 or 1. Determined by inspecting the auction’s title and description details for a whitelisted set of keywords and ruling out a blacklisted set of keywords.
AvgPrice	The average sale price by sku
MedianPrice	The median sale price by sku
AuctionCount	The number of auctions listed by sku
SellerSaleToAveragePriceRatio	The ratio of the sale price realized by a specific seller divided by the average price of the same skus
SellerAuctionSaleCount	The number of sales the seller has made
SellerItemSellPercent	The ratio of the number of sales divided by number of auctions listed by seller
StartDayOfWeek	The day of the week (number) that the auction Started
EndDayOfWeek	The day of the week (number) that the auction Ended
AuctionDuration	The number of days the auction lasted
StartingBidPercent	The ratio of the StartingBid divided by sku’s AvgPrice
SellerClosePercent	The ratio of the number of auctions resulting in sale for a seller divided by total number of auctions the seller listed
ItemAuctionSellPercent	The ratio of the number of auctions resulting in sale for a sku divided by total number of auctions listed for the sku
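A few of the derived features in Table 3 can be sketched in a handful of lines. This is a minimal illustration, not the code used for the analysis: the keyword lists, the `keyword_flag` and `derive_features` helpers, the ISO date strings, and the ISO day-of-week numbering are all assumptions made for the example, and only the title is inspected here (the analysis also inspects the description).

```python
from datetime import datetime

# Hypothetical keyword lists; the real whitelist and blacklist are not given in this post.
AUTH_WHITELIST = ("psa/dna", "jsa", "authenticated")
AUTH_BLACKLIST = ("not authenticated", "no coa")


def keyword_flag(text, whitelist, blacklist):
    """Return 1 if a whitelisted keyword appears and no blacklisted keyword does."""
    lowered = text.lower()
    if any(term in lowered for term in blacklist):
        return 0
    return int(any(term in lowered for term in whitelist))


def derive_features(listing, sku_avg_price):
    """Compute a subset of the Table 3 features for one listing."""
    start = datetime.fromisoformat(listing["StartDate"])
    end = datetime.fromisoformat(listing["EndDate"])
    return {
        "IsAuthenticated": keyword_flag(listing["Title"], AUTH_WHITELIST, AUTH_BLACKLIST),
        "StartDayOfWeek": start.isoweekday(),  # assumed numbering: Monday=1 ... Sunday=7
        "EndDayOfWeek": end.isoweekday(),
        "AuctionDuration": (end - start).days,
        "StartingBidPercent": listing["StartingBid"] / sku_avg_price,
    }


# Made-up listing; the $32.05 average is the Jim Rice example from Figure 2.
example = {
    "Title": "Jim Rice Signed Baseball JSA",
    "StartingBid": 9.99,
    "StartDate": "2013-04-01T20:00:00",
    "EndDate": "2013-04-08T20:00:00",
}
features = derive_features(example, sku_avg_price=32.05)
print(features)
```

Checking the blacklist first means a title such as "not authenticated" is rejected even though it contains a whitelisted word.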

4) Training and Test Data

Data to determine whether an auction listing will result in a sale:

	Query Criteria	Records	Mean Sale Price	Median Sale Price	Range of Sale Price
Training Set	All Auctions ending in April 2013	258,588	$28.96	$9.99	$0.01-$300.00
Test Set	All Auctions ending in first week of May 2013	37,460	$24.65	$9.99	$0.01-$300.00

Data to predict final sale prices for auctions:

	Query Criteria	Records	Mean Sale Price	Median Sale Price	Range of Sale Price
Training Subset	Auctions ending with a sale in April 2013	79,732	$33.04	$14.99	$0.01-$300.00
Test Subset	Auctions ending with a sale in first week of May 2013	9,392	$29.17	$12.55	$0.01-$297.50

Filter Criteria

Only Standard auction format.
Only items signed by a single player.

5) Analysis and Prediction

Since this analysis is trying to answer two questions, this section details the methodologies for solving each problem individually.

5.1 Determine whether an auction listing will result in a sale.

This is a binary classification problem, as the goal is to optimally predict QuantitySold (containing values of 0 or 1) as the target feature.

The model chosen to create the classification predictions is logistic regression. Logistic regression uses a set of covariates to predict probabilities of class membership. The best-performing model I found uses a set of 5 derived features with standardized values.
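To make the approach concrete, here is a minimal sketch of logistic regression on standardized features, written in plain Python with batch gradient descent. It is not the model from this analysis: it uses two features instead of five, the toy rows are invented, and the learning rate, epoch count, and helper names (`standardize`, `fit_logistic`, `predict_sale`) are assumptions chosen for the example.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance; also return the scaling used."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_sale(w, b, x):
    """Predict QuantitySold (1 or 0) using a 0.5 probability cutoff."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Invented rows: [StartingBidPercent, SellerClosePercent]; target is QuantitySold.
raw = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.5, 0.85],
       [1.2, 0.3], [1.5, 0.2], [0.9, 0.4], [1.1, 0.1]]
sold = [1, 1, 1, 1, 0, 0, 0, 0]

X, means, stds = standardize(raw)
w, b = fit_logistic(X, sold)
train_accuracy = sum(predict_sale(w, b, x) == t for x, t in zip(X, sold)) / len(sold)

# New listings must be scaled with the training means and standard deviations.
low_start = [(v - m) / s for v, m, s in zip([0.15, 0.9], means, stds)]
high_start = [(v - m) / s for v, m, s in zip([1.4, 0.15], means, stds)]
print(train_accuracy, predict_sale(w, b, low_start), predict_sale(w, b, high_start))
```

Standardizing first keeps features on different scales (ratios, counts, prices) from dominating the gradient steps.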

Prediction via Logistic Regression	Baseline: Prediction of sale when AvgPrice is less than SalePrice
85.97%	42.74%

5.2 Predict final sale prices for auctions.

Figure 3. Histogram of Price feature in Training Subset.

Figure 3 shows a high concentration of sales under $20.00. Since sports autographs on eBay are not a commoditized item (as compared to consumer electronics or books), I have seen wide ranges in sale price for the same items. This leads to one of the challenges for this analysis: the Price values in my Training Subset are not Gaussian. Figure 4 below represents the Price feature from the training data after a log transform. We can see that the graph is still skewed, with a very high Min (first quartile). The test set also has a very similar distribution.
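The effect of a log transform on a right-skewed price distribution can be checked numerically. The sketch below uses a small invented price list (not the Training Subset) and a hypothetical `skewness` helper that computes the moment-based sample skewness. The transform reduces the skew in this toy case, though it does not have to remove it entirely.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented prices with many low values and a few expensive items.
prices = [0.99, 4.99, 7.50, 9.99, 9.99, 12.00, 14.99, 19.99, 45.00, 150.00, 300.00]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(raw_skew, log_skew)
```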

Figure 4. Histogram of the Price feature in Training Subset after log transformation

I used 2 different approaches to frame price prediction as a machine learning problem:

5.2.1 Price Prediction by Regression

Classification and Regression Trees (CART)

Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values.

I created an optimized decision tree using the Training Subset, and then used it to create predictions against the Test Subset. The Root Mean Squared Error (RMSE) of the CART predictions is 4.30.
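The recursive partitioning idea and the RMSE metric can be illustrated with a tiny regression tree in plain Python. This is a simplified sketch, not the tree fitted in this analysis: the rows, the depth limit, and the helper names are invented, and it fits raw prices on the training rows only.

```python
import math


def sse(ys):
    """Sum of squared errors around the mean of ys."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, y):
    """Return (cost, feature, threshold) minimizing the children's total SSE, or None."""
    best = None
    for j in range(len(X[0])):
        # Skipping the smallest value guarantees both children are non-empty.
        for t in sorted({row[j] for row in X})[1:]:
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best


def grow_tree(X, y, depth=0, max_depth=2):
    """Recursively partition the rows; each leaf predicts the mean target value."""
    if depth >= max_depth or len(y) < 2:
        return sum(y) / len(y)
    split = best_split(X, y)
    if split is None:
        return sum(y) / len(y)
    _, j, t = split
    left = [(r, v) for r, v in zip(X, y) if r[j] < t]
    right = [(r, v) for r, v in zip(X, y) if r[j] >= t]
    return (j, t,
            grow_tree([r for r, _ in left], [v for _, v in left], depth + 1, max_depth),
            grow_tree([r for r, _ in right], [v for _, v in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] < t else right
    return tree


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Invented rows: [AvgPrice, IsAuthenticated]; target is the sale Price.
X = [[10, 0], [12, 0], [11, 1], [30, 0], [32, 1], [35, 1], [80, 1], [85, 0]]
y = [8, 9, 12, 25, 33, 36, 90, 70]

tree = grow_tree(X, y)
tree_rmse = rmse(y, [predict(tree, row) for row in X])
mean_rmse = rmse(y, [sum(y) / len(y)] * len(y))
print(tree_rmse, mean_rmse)
```

A production CART implementation would also prune the tree and evaluate on held-out data, as the Test Subset is used here.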

Figure 5. Plot of CART Predictions vs. Observed Sale Prices from the Test Subset. The green line represents Prediction=SalePrice and the blue line is the smoothed actual.

Figure 5 shows that the model does a reasonably good job for the majority of the listings and then becomes significantly less accurate as the predicted price rises.

I created subsets of the Test Subset with different maximum predictions. I saw that $50 was the optimal balance in terms of RMSE and keeping a high portion of records of the original data set.

Baseline

I created an analysis baseline using the CART model. I created a decision tree based on the log(Price) and AvgPrice features in the Training Subset. The tree is used along with the Test Subset to create predictions. Using this method, the predictions had an RMSE of 5.30.

5.2.2 Multi-Class Classification

Using the Training Subset, I divided the Sale Price (target variable) into $5 intervals and created discrete categories. Each auction is assigned to one category. This allows for a multiclass classification problem in which case the output is a $5 range instead of the specific price.
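Turning a sale price into a $5 category is a one-line calculation. The sketch below assumes zero-based interval indexes and that a price exactly on a boundary (for example $5.00) belongs to the upper interval; the post does not specify either choice, and the function names are hypothetical.

```python
INTERVAL_WIDTH = 5.0  # dollars per category, as described in the text


def price_interval(price, width=INTERVAL_WIDTH):
    """Map a sale price to an interval index: [0, 5) -> 0, [5, 10) -> 1, and so on."""
    return int(price // width)


def interval_label(index, width=INTERVAL_WIDTH):
    """Human-readable label for an interval index."""
    low = index * width
    return f"${low:.2f}-${low + width - 0.01:.2f}"


for price in (0.01, 9.99, 32.05, 300.00):
    index = price_interval(price)
    print(price, index, interval_label(index))
```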

Prediction using K-Nearest Neighbors (KNN)

Wikipedia defines KNN as a non-parametric method for classifying objects based on closest training examples in the feature space.

I used KNN to create a model based on the Training Subset, using the Price Interval as the factor and k=3. This model was combined with the Test Subset to generate predictions for which $5 interval each item would fall into. Below are the results:

accurately predicted: 50.92%
predictions within one group: 76.68%
predictions within two groups: 86.15%
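The three metrics above (exact match, within one group, within two groups) can be computed as follows. This is a toy sketch with k=3 and Euclidean distance on a single invented feature; the training rows, interval labels, and the `knn_predict` and `within_groups` helpers are assumptions for illustration and do not reproduce the reported percentages.

```python
import math
from collections import Counter


def knn_predict(train_X, train_labels, row, k=3):
    """Majority vote among the k training rows closest to `row` (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_labels),
                     key=lambda pair: math.dist(pair[0], row))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def within_groups(actual, predicted, tolerance):
    """Share of predictions at most `tolerance` intervals from the actual interval."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)


# Invented training data: one feature (AvgPrice) and the $5 interval index as the label.
train_X = [[8.0], [9.0], [11.0], [12.0], [31.0], [33.0], [34.0]]
train_labels = [1, 1, 2, 2, 6, 6, 7]

test_X = [[10.4], [32.0], [13.0]]
test_actual = [2, 6, 4]

predicted = [knn_predict(train_X, train_labels, row) for row in test_X]
exact = within_groups(test_actual, predicted, 0)
within_one = within_groups(test_actual, predicted, 1)
within_two = within_groups(test_actual, predicted, 2)
print(predicted, exact, within_one, within_two)
```

With several features on different scales, the features would normally be standardized before computing distances.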

5.2.3 Filtered Prediction

I combined the two methods to get optimal results:

Use predictions from CART, keeping only predictions under $50.00.
Use KNN classification predictions to limit outliers. Filter out auctions where the KNN-predicted $5 interval is more than 2 intervals away from the CART-predicted price.
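The two filters listed above can be sketched as a single predicate. This reads the second rule as a gap measured in $5 intervals between the interval containing the CART price and the KNN-predicted interval, which is an interpretation; the `keep_prediction` name, its parameters, and the sample values are hypothetical.

```python
def keep_prediction(cart_price, knn_interval, max_price=50.0, width=5.0, max_gap=2):
    """Return True when a CART prediction passes both filters."""
    if cart_price >= max_price:
        return False  # filter 1: keep only CART predictions under $50.00
    cart_interval = int(cart_price // width)
    # filter 2: the KNN interval must be within max_gap intervals of the CART price
    return abs(cart_interval - knn_interval) <= max_gap


# Invented (CART price, KNN interval index) pairs.
predictions = [
    (12.40, 2),   # CART interval 2, KNN interval 2: kept
    (48.00, 5),   # CART interval 9, KNN interval 5: dropped, gap of 4
    (75.00, 15),  # CART prediction of $50.00 or more: dropped
    (22.00, 6),   # CART interval 4, KNN interval 6: kept, gap of 2
]
kept = [p for p in predictions if keep_prediction(*p)]
share_kept = len(kept) / len(predictions)
print(kept, share_kept)
```

The share of records that survive both filters corresponds to the "% of Subset" column in the results table.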

Method	RMSE	Sale-Prediction	Standard Deviation	% of Subset
Baseline CART (using AvgPrice)	5.30	+21%	40	100%
CART	4.30	-15%	28	100%
CART for predictions under $50.00	3.52	-14%	15	88%
CART for predictions under $50.00 and within 2 price intervals	0.84	-7%	12	65%

6) Conclusions

How reliable is using the average price for predicting sales?

I would not be comfortable using the average price to predict auction sales. It has a higher RMSE, a larger gap between predicted and actual sale prices, and a very high standard deviation.

Can we determine whether an auction listing will result in a sale?

Since logistic regression provided an almost 86% success rate for predicting if the auction would result in a sale, I would be comfortable considering the model’s prediction.

This would be quite useful from a seller’s perspective, to help minimize:

The listing fees a seller would accumulate due to unsold listings.
The time invested in listing unsuccessful items.

Can we predict final sale prices for auctions?

Combining the CART prediction with KNN to eliminate obvious outliers does a good job of predicting the final sale price when predictions are under $50 (about 65% of the observed cases).

Since I have yet to find a service or commercial product related to predicting eBay auction results, this analysis could be valuable in offering services for the following applications:

Buying recommendations/arbitrage
Listing Optimization
Product Sourcing and Logistics
Price Estimation for Complementary Services (Shipping, Insurance providers)

7) Future Exploration

There are some additional features I can consider to potentially enhance these models:

Time of year, month, major events (Super Bowl, Spring Training)
Bid Timing, clusters around bids near end of auctions.
Chou et al. (2007) have analyzed predicting price based on bidding patterns: "A Simulation-Based Model for Final Price Prediction in Online Auctions" (http://www.jem.org.tw/content/pdf/Vol.3No.1/01.pdf)
Measure of Player’s Popularity/Demand/Interest (on eBay, sportscollectors.net, twitter, espn, etc.)
Semantic parsing of auction descriptions for the most relevant keywords or phrases

Other areas to explore with this data:

Sku representations for items signed by multiple players
Guidance around 3rd party authentication services
Product recommendations and predicting arbitrage scenarios.
More Verticals:
Other Collectibles (sports cards, coins, stamps, toys)
Car parts
Consumer Electronics

R code used for this analysis:

https://github.com/jaygrossman/eBaySalesPrediction
