SQL Server 2005 Data Mining


452 93 282KB

English Pages 10 Year 2005

Report DMCA / Copyright

DOWNLOAD PDF FILE

Recommend Papers

SQL Server 2005 Data Mining

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

SQL Server 2005 Data Mining

Page 1

asp:feature

Subscription Services Subscribe Renew Subscription asp.netNOW Newsletter Change of Address Pay An Invoice Subscription Packages asp.netPRO Articles Affiliate Spotlight 411asp.net Directory New Products Book Reviews Product Reviews Opinion Back Issues Reprints/E-prints Search Downloads Premium Downloads Informant Contact Us Advertise with Us Write For Us

LANGUAGES: VB.NET ASP.NET VERSIONS: ALL

SQL Se rve r 2005 Da t a M ining Create a Web Crosssell Application By Raman Iyer and Jesper Lind The concept of cross-sell is familiar to most of us. W hat your friendly neighborhood McDonald’s salesperson does when you order a cheeseburger is exactly what Amazon.com or buy.com are doing online when you add items to your shopping cart and you get a list of other items you might also like. You can add this functionality to your ASP.NET page by employing the power of data mining, using simple SQL-like queries to produce high-quality recommendations. Microsoft SQL Server 2005, currently available to over 200,000 MSDN subscribers in Beta 2, includes advanced data mining capabilities that are available programmatically via standard interfaces like ADO.NET. This will allow you to integrate crosssell into your W eb store application with minimal effort. Before developing a W eb cross-sell application, we need to build the server-side intelligence that will enable the application to come up with smart product recommendations. This process involves:

Latest Features • NUnitASP • Under the Hood • Dynamic Dropdowns • Validate User-entered Data • Total Recall Article Rating Rate this article on a scale from 0 to 5

j 5 Best lk m n j 4 lk m n j 3 lk m n j 2 lk m n j 1 lk m n j 0 Worst lk m n Submit Email

Tell a friend about this article!

Preparing the data you already have about past customers for mining; Designing a mining model for the purpose of making recommendations to new customers; Deploying the model to Analysis Server and training it with the data prepared earlier; and Setting up security to allow ASP.NET to query the trained model.

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 2

The first part of this article will explain key concepts and walk you through the above process using the SQL Server 2005 Data Mining tools. In the latter half we’ll demonstrate the ease with which you can integrate the predictive power of the mining model into your ASP.NET W eb application using straightforward database access code.

Before You Start Install Microsoft SQL Server 2005 Analysis Services Beta 2. This will set up the server components as well as the design and management tools that we’ll use in this article: Business Intelligence Studio and SQL Server Management Studio, respectively. W e also recommend you go over the data mining tutorial included with Beta 2. Note: W e’ll refer to your running instance of SQL Server 2005 Analysis Services as “Analysis Server” in the rest of the article. The service shows up as “Analysis Services (MSSQLSERVER)” under Services in Administrative Tools.

Mining Your Customer Purchase Data Here we outline the process of building the back-end framework that mines your historical customer movie purchase transactions and extracts the knowledge needed to make cross-sell recommendations to new customers. This knowledge is embedded in the mining model that we’ll design. Data preparation is an important aspect of any data mining process. W ith SQL Server Data Mining, however, it is also possible to mine the transaction data in your relational database directly. For simplicity, we’ll take this approach and assume that a single transaction table contains your customer purchase information, as shown in Figure 1.

Figure 1: Movie purchase data used by sample application.

The Analysis Services project in SQL Server 2005 Business Intelligence Development Studio provides the framework for modeling data and building a mining model that learns customer buying patterns from existing data gathered from prior purchases. W e then use the trained model to generate recommendations for new customers. The first step is to identify the entity whose behavior we are interested in analyzing for the purpose of our crosssell application. A case represents all information (also referred to as “attributes”) known about this entity. In this scenario, each distinct customer in the CustomerMovies table and the set of movies they purchased forms a case. SQL 2005 Data Mining uses the concept of a nested table to represent a variable-length collection of attributes of the same kind associated with a case. For each customer there is a set of rows containing the list of movies purchased, which can be represented as a nested table (as shown in Figure 2).

Figure 2: The mining model’s view of the customer movie purchase data.

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 3

The definition of a case and its associated attributes is known as a mining structure in SQL 2005 Data Mining. The next step is to build a mining model by selecting a mining algorithm and specifying how the columns in the mining structure will be used by the algorithm to process the input data and extract useful knowledge from it. The Association Rules algorithm is a good fit for our scenario. It learns which items are likely to be bought together and utilizes that information to predict other items given the items the customer has selected. Marking the Movies nested table as both Input and Predictable enables the model to make predictions using the movies provided as input. Note that the mining model in SQL Server 2005 Data Mining is a database object that holds both the definition of the input to the knowledge extraction process and the output, which consists of patterns or rules learned by applying the selected data mining algorithm to process the input data. As we’ll see, the Mining Model W izard in SQL Server 2005 Business Intelligence Development Studio builds a mining structure and a model using the algorithm selected on the first page. Next, the mining model definition is sent as part of a deployment package to the server where it is trained. Training cases consist of information we have collected from past purchases. The deployment package is generated from the Analysis Services project that we design the model in, and it includes bindings to the data source that Analysis Server uses for obtaining the training cases. To improve the quality of recommendations made for new customers the model can be periodically re-trained as more customer data is added to the transaction database. For large datasets this will typically be carried out during off-peak hours or against a replica of the transaction database. SQL Server Data Transformation Services (DTS) can be used to set up a package to perform such periodic updates. Finally, we must set up permissions so our application can query the trained model.

Building a Cross-sell Mining Model in SQL Server 2005 See the end of this article for information about downloading the complete Analysis Server project built using the steps outlined here for producing the mining model we’ll utilize in the ASP.NET code sample later in this article: 1) Create a new Analysis Services project named MovieRecommendations in Business Intelligence Development Studio. 2) Add a new DataSource pointing to MovieData.mdb, the Access database included with the sample project (available for download; see end of article for details). 3) Add a DataSource View based on the DataSource. Select the only table, CustomerMovies, present in the DataSource. 4) Right-click on the Mining Models collection and select New Mining Model to launch the Mining Model W izard. 5) Pick From existing relational database or data warehouse on the Select Definition Method page and click Next. 6) Pick Microsoft Association Rules as the data mining technique to use on the next page. 7) Select the DataSource View created in step 3.

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 4

8) On the next page, the single CustomerMovies table present in our DataSource View is shown. Mark it as both Case and Nested. 9) Click Next to go to the Training Data page where we need to specify the columns we are interested in including from each selected table (see Figure 3). In our scenario, the CustomerMovies table serves as the source for our cases as well as the nested table data associated with each case (this is why it is listed twice). From the first instance pick CustomerId as the key since it uniquely identifies each customer. From the second instance pick Movie as Key, Input, and Predictable. 10) Click Finish to complete the wizard and build the Movie Recommendations mining structure and model (see Figure 4).

Figure 3: Selecting columns we are interested in modeling for cross-sell in the Data Mining Wizard.

Figure 4: The cross-sell mining model in Business Intelligence Development Studio.

The above steps create a definition of our cross-sell mining model and associated objects in the development environment. There are two steps to deploy the mining model to the Analysis Server and train it: 1) Right-click on the MovieRecommendations project in the Solution Explorer and select Properties. Verify that the Server property in the Deployment section of Configuration Properties points to the server hosting your Analysis Services instance. Close the dialog box. 2) Right-click again on the MovieRecommendations project in the Solution Explorer and select Deploy. This sends the client-side definitions to the server and initiates training of the mining model. W e must set up access permissions in Analysis Server for IIS using the SQL management tool for our ASP.NET application to use the trained mining model: 1) Open SQL Server Management Studio.

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 5

2) Click Connect in the Object Explorer, select Analysis Server, and connect to the Analysis Server hosting your model. 3) Locate the MovieRecommendations database in the Databases collection, right click on Roles, and select New Role. This brings up the Create Role dialog box. 4) On the General page enter Internet_User as the role name. Check the Read Definition checkbox under Set the database permissions for this role. 5) Click Membership in the left pane and select a page. Add your IIS user (the default is IUSR_machinename) to the role by clicking Add. 6) Now click Mining Structure in the left pane. This shows the Customer Movies mining structure and the Customer Movies mining model owned by the mining structure. Drop down the permission list under Access and select Read for both objects. Also check the Read Definition checkbox for both. 7) Click OK to add the Role with the above permission set.

Recommending Products Based on the User’s Shopping Basket Now we’re ready to produce movie recommendations in our W eb application by running a SQL-like query against the Analysis Server that holds our trained mining model. W e’ve put together a minimal application (shown in Figure 5) that demonstrates the ideas behind a real deployment, focusing on the generation of the prediction query for getting recommendations. The W eb customer is assumed to have one or more items in the shopping basket, and for simplicity we have a text box where items can be entered manually (separated by semicolons). Clicking Add Items to Cart displays the items in the shopping basket and also shows a list of recommendations.

Figure 5: A simple shopping basket application.

The code behind the button click is shown in Figure 6; you can see that it’s quite simple. Private Sub Button1_Click(ByVal sender As Object, _ ByVal e As System.EventArgs) ' Handles Me.Button1.Click ' Parse the input into an ArrayList of strings. Dim alInputItems As New ArrayList() Dim splitchar As Char() = {";"c} Dim szInputItems As String() = Me.TextBox1.Text.Split(splitchar, 20) Dim i As Integer For i = 0 To szInputItems.Length - 1 alInputItems.Add(szInputItems(i).Trim()) Next i ' Add items to the shopping basket. dgShoppingBasket.DataSource = alInputItems dgShoppingBasket.DataBind() ' Get top 5 recommendations. Dim alRecommendedItems As New ArrayList(5)

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 6

GetRecommendations(alInputItems, alRecommendedItems) ' Display recommendations. dgRecommendations.DataSource = alRecommendedItems dgRecommendations.DataBind() End Sub 'Button1_Click Figure 6: Populate shopping basket and recommendations.

The method builds an ArrayList of strings from the items in the shopping basket and passes it to the GetRecommendations subroutine, requesting the top five recommendations based on the input items. W e use two DataGrid objects: dgShoppingBasket, to hold the items in the user’s shopping basket; and dgRecommendations, to display the generated recommendations. The real workhorse is the GetRecommendations subroutine. The core of the GetRecommendations subroutine is the construction of the prediction join query (see Figure 7) that gets sent to Analysis Server and returns the list of five recommendations. SELECT FLATTENED TopCount(Predict([Customer Movies], INCLUDE_STATISTICS), $AdjustedProbability, 5) FROM [Movie Recommendations] NATURAL PREDICTION JOIN ( SELECT ( SELECT 'Star Wars' AS [Movie] UNION SELECT 'The Matrix' AS [Movie] ) AS [Customer Movies] ) AS t Figure 7: Obtain recommendations using DMX prediction join against mining model.

The SQL-like query language supported by Analysis Server for querying mining models is called DMX. The DMX query in Figure 7 performs a prediction join that takes the two movies from the user’s shopping basket, forms a case, and joins it with the mining model to produce an output rowset containing a list of predicted recommendations. The GetRecommendations subroutine, shown in Figure 8, stores most of the query in compile-time string templates. W hat needs to be filled in is the input data. Private Shared Sub GetRecommendations( _ ByVal vInputItems As ArrayList, _ ByRef vRecommendedItems As ArrayList) ' Templates for generating DMX prediction join statement. Dim strDMX1 As String = _ "SELECT FLATTENED TopCount(" + _ "Predict([Customer Movies], INCLUDE_STATISTICS)," + _ "$AdjustedProbability, 5) From [Movie Recommendations] " + _ "NATURAL PREDICTION JOIN (SELECT (" Dim strDMX2 As String = ") AS [Customer Movies]) AS t" ' Iterate shopping basket and produce input case. Dim cItems As Integer = vInputItems.Count Dim strDMX As String = "" Dim i As Integer For i = 0 To cItems - 1 Dim item As String = vInputItems(i).ToString() item = item.Replace("'", "''") strDMX += "SELECT " + "'" + item + "' AS "+ "[Movie]" If i < cItems - 1 Then strDMX += " UNION " End If Next i

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 7

' Put together DMX prediction query to ' get 5 recommendations. strDMX = strDMX1 + strDMX + strDMX2 ' Connect to Analysis Server and execute query. Dim asSession As New AnalysisServerSession asSession.Connect() If False = asSession.ExecuteAndFetchResult(strDMX) Then Return End If ' Read prediction results and build list ' of recommendations. vRecommendedItems.Clear() While asSession.asDataReader.Read() Dim type As String = asSession.asDataReader.GetDataTypeName(0) If type = "DBTYPE_WVARCHAR" Or type = "String" Then Try Dim val As String = asSession.asDataReader.GetString(0) vRecommendedItems.Add(val) Catch e As Exception Console.WriteLine(e.Message) End Try End If End While ' Disconnect from Analysis Server. asSession.DisConnect() End Sub ' GetRecommendations. Figure 8: Convert shopping basket to input case for mining model prediction and get recommendations.

As explained briefly during the mining model design phase, data is passed to mining models in the form of cases. Each case represents all the information we have about a single customer. W hen we make a prediction for a new customer, the known information about the customer is passed as an input case to the mining model. In this scenario the input case is the list of movies in the customer’s shopping basket. Note that the list of movies was represented as a nested table in the mining model definition; therefore, we need to supply the input case in the same form. This is accomplished by the nested sub-select after the NATURAL PREDICTION JOIN clause in Figure 7. The UNION operation is used to add each movie as a separate row in the nested table. W hen we supply cases inline like this, we don’t need to provide a value for the case key, CustomerId, since that is not used for the prediction operation. W ith the above input case data format in mind it’s now easy to walk through the GetRecommendations code and understand what it’s doing. W e start the query generation process by declaring templates for the DMX statement. W e then iterate over the input items, the contents of the user’s shopping basket, to produce an input case with a nested table format as described above. The input case is appended to the rest of the template to produce the query that will be sent to the server. Next, an instance of the AnalysisServerSession class is instantiated to connect to Analysis Server and execute the query against the trained mining model. The code for this class is included in the download. The AnalysisServerSession class uses ADOMD.NET, the managed provider for Analysis Server, to connect to the server. ADOMD.NET implements the same connection, command, and data reader interfaces as standard ADO.NET (System.Data.Oledb), so the rest of the code

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 8

looks exactly like ADO.NET code — but offers better performance. (To use ADOMD.NET you’ll need to add a reference to Microsoft.AnalysisServices.AdomdClient.dll, installed by the SQL Server 2005 Beta 2 Client Components Setup. If it is not listed under .NET in the Add Reference dialog box in Visual Studio, browse to Program Files\Microsoft.NET\ADOMD.NET\90 and add it from that location.) The query results are fetched, again using a standard data reader interface exposed by ADOMD.NET and the array list of output items populated with the recommendations returned by the prediction join. Note that the DMX query we generate uses the FLATTENED keyword to avoid having to perform hierarchical result navigation to fetch the results.

Under the Hood In this section we explain what happens on the server when it receives a prediction query and how the knowledge acquired by the mining model may be explored further. During the training process, the Association Rules model learns a set of rules that are used to generate product recommendations. If a rule such as “[Camera, Film] -> Batteries” was discovered and the customer’s shopping basket contains Camera and Film, then this rule fires. Of course, there may be other rules that predict Batteries as well, in which case the rule with the highest score is used. The score assigned to a rule, also known as its Importance, takes into account and compensates for the fact that the probability for a rule may be high just because the target item is popular in the dataset. Assume that 10% of customers buy Star Wars irrespective of other purchases, and Blade Runner is bought by just 3%. If two rules predict Star Wars and Blade Runner with the same probability, the score for Star Wars will be lower since it’s so popular. However, the advanced user can fine tune the score using algorithm parameters. Finally, the top n highest scoring rules are used to generate the recommendations. The last parameter in the TopCount function (again, see Figure 7) sets an upper limit on the number of items returned by the prediction algorithm. The rules are organized by the items predicted and sorted in descending order based on the score. The prediction algorithm can avoid looking at very large sets of rules to achieve good prediction performance. Additional insight into a customer’s purchasing behavior may be gained by using the viewers supplied as part of the Business Intelligence Development Studio. These include the Association Rules Viewer (shown in Figure 9) for browsing the rules and the Dependency Net Viewer that graphically shows the relationship between items. The graph layout algorithm illustrates how the strongest correlated products are clustered.

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 9

Figure 9: Viewing model content.

The Association Rules viewer displays the importance for each rule and ranks the rules by this score. As explained earlier, importance serves as a better measure than probability for finding interesting rules in your data.

Conclusion This article merely scratches the surface of the possibilities that SQL Server 2005 Data Mining presents for your ASP.NET applications. One idea for extending this sample would be to add other customer attributes such as demographics as inputs to potentially improve the quality of the recommendations. Other possible W eb applications include: Targeted ads based on analysis of browsing behavior using the Sequence_Clustering algorithm. An online Help desk that finds the most appropriate answer for a user’s query by using the Neural_Network algorithm in conjunction with text mining technologies available in SQL Server 2005 Data Transformation Services.

The data mining tutorial for SQL Server 2005 Analysis Services is available on the Readiness Kit CD included with the Beta 2 package. Visit the Microsoft beta newsgroup at microsoft.beta.yukon.analysisservices.datamining or the Data Mining forum at http://www.sqljunkies.com/Forums/ ShowForum.aspx?ForumID=38 if you have questions about SQL Server 2005 Data Mining. This article offers an outline of the simplified mining process used for this specific scenario. For a generalized version of this methodology, refer to CRISP-DM at http://www.crispdm.org. The sample code in this article is available for download . Raman Iyer is a Software Design Engineer at Microsoft Corp. and a founding member of the SQL Server Data Mining development team there. He can be reached at mailto:[email protected]. Jesper Lind is a Research Software Design Engineer at Microsoft Research and a member of the Machine Learning and Statistics team. He can be reached at mailto:[email protected] .

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27

SQL Server 2005 Data Mining

Page 10

Informant Communications Group, Inc. 5105 Florin Perkins Road Sacramento, CA 95826 Phone: (916) 379-0609 • Fax: (916) 379-0610 Copyright © 2005 Informant Communications Group. All Rights Reserved. • Site Use Agreement • Send feedback to the Webmaster • Important information about privacy

http://www.aspnetpro.com/newsletterarticle/2004/10/asp200410ri_l/asp200410ri_l.asp

09/03/2005 17:34:27