Friday, February 24, 2012

Design question from data mining newbie

Hi,

We currently have about 900 stored procedures which have logic to group healthcare claims into different 'Edit Groups' depending on the logic within each 'Edit' stored procedure.

Examples of the logic for the Edit stored procedures would be something like:

Edit1: Find all claims from the same patient and same provider (matching on SubsriberID and ProviderID) which have a procedure code in (P1, P2, P3....P345) and a diagnosis code in (D1, D2,...D123) and do NOT have a modifier code in (M1, M2, M3)

Edit2: Find all claims from the same patient and same provider (matching on SubsriberID and ProviderID) which have a procedure code in (P7, P8, P9....Pxxx) and a diagnosis code in (D1, D2,...Dyyy) and have a modifier code in (M3, M4, M7), and which are dated within 120 days of each other.
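For concreteness, the Edit1 logic above can be sketched in Python (a rough stand-in for the T-SQL stored procedure; the field names and code sets below are placeholders, not the real schema, and the 120-day window from Edit2 is omitted to keep the sketch short):

```python
# Hypothetical code sets standing in for the (P1...), (D1...), (M1...) lists.
EDIT1_PROCS = {"P1", "P2", "P3"}
EDIT1_DIAGS = {"D1", "D2"}
EDIT1_EXCLUDED_MODS = {"M1", "M2", "M3"}

def matches_edit1(claim):
    """True if a single claim satisfies Edit1's code conditions."""
    return (claim["proc"] in EDIT1_PROCS
            and claim["diag"] in EDIT1_DIAGS
            and not (claim["mods"] & EDIT1_EXCLUDED_MODS))

def edit1_groups(claims):
    """Group qualifying claims by (subscriber, provider)."""
    groups = {}
    for c in claims:
        if matches_edit1(c):
            key = (c["subscriber"], c["provider"])
            groups.setdefault(key, []).append(c)
    # Keep only patient/provider pairs with more than one matching claim,
    # mirroring the "claims from same patient and same provider" condition.
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Each of the ~900 stored procedures would then differ only in the three code sets at the top, which is exactly the "same logic, different parameters" pattern described.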

Do you think one of the SQL Server 2005 data mining algorithms (Clustering, Classification, or Association Rules) could play some part in this? Most of the 900 stored procs can be grouped based on their logic; the logic is similar within each group, and only the parameters (in brackets above) vary between the stored procs in the same group.

We're totally new to data mining, although we do have some moderately complex cubes running. Which algorithm (if any) would be the most appropriate for our needs?

Thanks for any help,

JGP

Hi,

From your examples, your data is multidimensional in nature. It would definitely be helpful to use Microsoft Analysis Services to create OLAP cube(s) to efficiently browse and explore your edit groups. If you are only interested in querying your edit groups, you don't need data mining.

Data mining is good at generalizing knowledge (such as patterns) from data. It can then use the learned knowledge to analyze your (new) data. For example, suppose you have a given set of claims described by the following table:

ClaimID  Income  HaveInsurance  City       Fraud
1        x0,000  Yes            FairyLand  No
2        y0,000  No             FraudLand  Yes
.......

Suppose you can collect the above data from your claim database. Now let's say you are interested in predicting whether or not a new claim is fraudulent. You can train a model with one of the Microsoft data mining algorithms (such as Microsoft Decision Trees). After training, you can use your model to predict new claims like this:

ClaimID  Income  HaveInsurance  City
10001    a0,000  Yes            FairyLand
1002     b0,000  No             FraudLand
.......

Depending on the query you use, you can get results like this:

ClaimID  Income  HaveInsurance  City       Predicted Result of Fraud
10001    a0,000  Yes            FairyLand  No
1002     b0,000  No             FraudLand  Yes
.......
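To make the train-then-predict workflow above concrete without SSAS, here is a toy Python stand-in (a real model would use Microsoft Decision Trees via DMX). It picks the single feature whose values best separate the Fraud label, i.e. a one-level "decision stump"; all data below is invented for illustration:

```python
from collections import Counter, defaultdict

def train_stump(rows, features, label):
    """Pick the feature whose majority-vote rule is right most often."""
    best = None
    for f in features:
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[f]][r[label]] += 1
        # Count training rows the per-value majority rule gets right.
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        if best is None or correct > best[1]:
            majority = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
            fallback = Counter(r[label] for r in rows).most_common(1)[0][0]
            best = (f, correct, majority, fallback)
    feature, _, majority, fallback = best
    def predict(row):
        return majority.get(row[feature], fallback)
    return feature, predict

train = [
    {"Income": "low",  "HaveInsurance": "Yes", "City": "FairyLand", "Fraud": "No"},
    {"Income": "low",  "HaveInsurance": "No",  "City": "FraudLand", "Fraud": "Yes"},
    {"Income": "high", "HaveInsurance": "Yes", "City": "FairyLand", "Fraud": "No"},
    {"Income": "high", "HaveInsurance": "No",  "City": "FraudLand", "Fraud": "Yes"},
]
feature, predict = train_stump(train, ["Income", "HaveInsurance", "City"], "Fraud")
```

The "train" step learns which feature matters; the returned `predict` function is then applied to new, unlabeled claims, just like the prediction query against a trained mining model.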

In general, data mining can play a role when you need to learn patterns from your data, and then apply the patterns to analyze (new) data.

|||

Thanks for the example.

Are you saying that data mining is relevant only for predicting things, and not for finding relationships based on pre-defined rules, like what I initially explained?

I did see some cases where classification was done: based on a bunch of parameters, historic claims can be classified into Edit1 (cluster1), Edit2 (cluster2), etc.

Isn't this possible?

Thanks,

JGP

|||

My post was just using prediction as an example. Data mining is relevant to prediction as well as to finding relationships among data. For example, you can use a clustering algorithm to cluster your data and check whether there is a natural mapping between your edit groups and the clusters you found (as you mentioned above). This process can help you understand your data better. For example, if an edit group maps naturally to some cluster, that edit group can be considered well defined, since it really corresponds to an existing grouping (or cluster) in your data. On the other hand, you might consider merging a few groups if they belong to the same cluster, etc.
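The mapping check described above can be sketched in a few lines of Python: for each edit group, measure what fraction of its claims land in a single cluster. High purity suggests the edit group corresponds to a natural cluster in the data. (The edit labels and cluster ids below are invented; in practice the cluster ids would come from a clustering model.)

```python
from collections import Counter, defaultdict

def edit_group_purity(assignments):
    """assignments: list of (edit_group, cluster_id) pairs, one per claim."""
    clusters_per_edit = defaultdict(Counter)
    for edit, cluster in assignments:
        clusters_per_edit[edit][cluster] += 1
    # Purity = share of the group's claims in its most common cluster.
    return {edit: counts.most_common(1)[0][1] / sum(counts.values())
            for edit, counts in clusters_per_edit.items()}

assignments = [
    ("Edit1", "C1"), ("Edit1", "C1"), ("Edit1", "C1"), ("Edit1", "C2"),
    ("Edit2", "C1"), ("Edit2", "C2"), ("Edit2", "C3"),
]
purity = edit_group_purity(assignments)
```

Here Edit1 would look well defined (75% of its claims share a cluster), while Edit2's claims are scattered across clusters.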

That said, data mining works on the basis of probability; in other words, it cannot be 100% correct most of the time. Say, for a problem with a set of rules, you can already classify each case (such as each claim) into its target group with 100% accuracy. Then you don't want to use data mining, because you cannot do any better than 100% correct, and data mining does not come free. But if you need to find something unknown about your data, such as whether a claim is fraudulent, or whether it belongs to some unknown group, you should resort to data mining.

Good luck,

|||

Thanks, I think I'm getting the picture now.

Just to get some hands-on experience, would you happen to know of any good tutorials/books on clustering/classification suitable for newbies?

Preferably one that works with 'well-defined' groups....

|||

A good book is Data Mining Techniques by Berry and Linoff.

I think in your situation you can use data mining for data discovery, to learn quite a bit about your data sets. It sounds like each of your "edits" is a fairly complicated set of rules, and it may be difficult to determine which rules end up being more important, or which "edits" are related (if I understand correctly, a single record could have multiple "edits", no?).

With this in mind, you could "reverse engineer" the edits with a simple classification model - e.g. trees - to predict which factors are "most important" in determining an edit. If all of your edits are "and" conditions, this won't do much, but if you have any with "or" conditions, you may find that a majority of records receive an edit for only a few of the possible conditions. Using this to classify old records, as Yimin notes, is not going to be 100% accurate though, since you already have an encoding of the 100% accurate rules.

If you created a table that had, for each record, all of the record data and all of the possible "edits" for that record (you would likely need a nested table), you could predict "edits" based on the record data plus other "edits". This would result in a relationship diagram describing how record data and edits are related. If you made such a model including only the edits, you would see how they are interrelated, independent of record data. You could perform a similar operation using Clustering, to see whether groups of edits cluster together.
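A minimal version of the edits-only relationship idea above is just counting how often pairs of edits fire on the same record; strongly co-occurring edits are candidates for the interrelationships a dependency-network or clustering model would surface. (The edit names per record below are hypothetical.)

```python
from collections import Counter
from itertools import combinations

def edit_cooccurrence(records):
    """records: list of sets, each the edit names assigned to one record."""
    pairs = Counter()
    for edits in records:
        # Sort so each unordered pair is counted under one canonical key.
        for a, b in combinations(sorted(edits), 2):
            pairs[(a, b)] += 1
    return pairs

records = [
    {"Edit1", "Edit2"},
    {"Edit1", "Edit2", "Edit5"},
    {"Edit3"},
    {"Edit1", "Edit5"},
]
pairs = edit_cooccurrence(records)
```

Pairs with high counts relative to each edit's total frequency would be the first candidates for merging or for a shared underlying rule.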

In the end you could come away with a greater understanding of your data - potentially removing redundant code or streamlining in other ways. The good thing about data mining is that it's painless - it's kind of fun to play with and you can get some good insights, but it doesn't cause any harm in the meantime....

Enjoy, and feel free to post any follow up questions.

-Jamie
