Sunday, February 19, 2012

design question - so many columns....

Got a situation where our main, core data in our main tables consist
of 15-20 normal columns (dates, integers, varchar, etc.) and then
several "sets" of booleans. Example - health risk factors....high
blood pressure, diabetes, depression....up to 20 risk factors, let's
say. The user can choose none, all, or any combo in between in these
sets of booleans. Well each of those fundamentally is just a bit
column, with a zero or one, attached to the record concerning the
individual person/event record. Well what if I have 8 or 9 "sets" of
these boolean questions? This results in 150-200+ columns in my
table. I know sql can handle up to 1024, and these data really belong
on this record with this individual person/event.
I just wanted to see what others thought from the design perspective.
Any suggestions?
On Jun 15, 1:05 am, CoreyB <unc27...@.yahoo.com> wrote:
> Got a situation where our main, core data in our main tables consist
> of 15-20 normal columns (dates, integers, varchar, etc.) and then
> several "sets" of booleans. Example - health risk factors....high
> blood pressure, diabetes, depression....up to 20 risk factors, let's
> say. The user can choose none, all, or any combo in between in these
> sets of booleans. Well each of those fundamentally is just a bit
> column, with a zero or one, attached to the record concerning the
> individual person/event record. Well what if I have 8 or 9 "sets" of
> these boolean questions? This results in 150-200+ columns in my
> table. I know sql can handle up to 1024, and these data really belong
> on this record with this individual person/event.
> I just wanted to see what others thought from the design perspective.
> Any suggestions?
That's quite easy. Let's say you have one column called 'Risk Factor'
that can take 20 possible values. You can store values like
'Diabetes,High Blood Pressure,Depression'. This is one solution and I
would prefer this. Another solution is you can have a varchar(20) with
values like 10001000 etc where you set a bit for particular risk
factor. You have to know the ordinal position for a particular risk
and if that bit is set or not.
|||On Jun 15, 10:09 am, SB <othell...@.yahoo.com> wrote:
> On Jun 15, 1:05 am, CoreyB <unc27...@.yahoo.com> wrote:
>
> That's quite easy. Let's say you have one column called 'Risk Factor'
> that can take 20 possible values. You can store values like
> 'Diabetes,High Blood Pressure,Depression'. This is one solution and I
> would prefer this. Another solution is you can have a varchar(20) with
> values like 10001000 etc where you set a bit for particular risk
> factor. You have to know the ordinal position for a particular risk
> and if that bit is set or not.
and of course if you want to be relational then you have a row for
each 'Risk Factor'. So instead of growing the table horizontally you
grow them vertically. So if a person has 'Diabetes' and 'High Blood
Pressure' there are 2 rows for that person. Then you can group by a
particular 'Risk Factor'. For example, find all patients where 'Risk
Factor' is 'Diabetes' and sum how much they spent etc on medication
etc.
|||I've seen that approach used in some other places in reading online,
mainly with surveys & questionarres. But some issues I have with it
are....
1 - The relational form of a tall skinny table, with each answer as a
row seems good if you may not have data for every question/element, so
you save space on the questions that aren't answered. We will likely
have data for each element. Which means we'll have billions of
rows.....per year. The size will start to become an issue after
several years.
2 - Storage....At first glance the wide tables seem like they'll be
larger. But a lot of the columns are bits, which are optimized for
storage, so if I have a wide table with 32 bit columns, they'll only
take up 4 bytes of storage. But if I store the same answers (1 or 0)
in another table as rows in a generic catch-all varchar column, and
then have two IDs of int or bigint tying them back to the question &
respondent, that's already 8 or 9 bytes per row minimum * 32 answers =
250+ bytes just for the one respondent. And there's way more than 32
- probably 100 or so.
3 - Data integrity. If everything has its own column in a wide table,
then you can make sure that a bit is a bit, and a date is a datetime,
and an integer is an int. But if everything is put in one generic
column, then you give up a little bit of ground on the data integrity,
and then I'm depending on the ETL process, or the developer to
validate all data types.
As crappy as it sounds, the wide table looks better to me in this
situation. Unless someone out here can talk me out of it.
On Jun 15, 12:29 am, SB <othell...@.yahoo.com> wrote:
> On Jun 15, 10:09 am, SB <othell...@.yahoo.com> wrote:
>
>
>
>
> and of course if you want to be relational then you have a row for
> each 'Risk Factor'. So instead of growing the table horizontally you
> grow them vertically. So if a person has 'Diabetes' and 'High Blood
> Pressure' there are 2 rows for that person. Then you can group by a
> particular 'Risk Factor'. For example, find all patients where 'Risk
> Factor' is 'Diabetes' and sum how much they spent etc on medication
> etc.- Hide quoted text -
> - Show quoted text -
|||SB wrote:
> On Jun 15, 1:05 am, CoreyB <unc27...@.yahoo.com> wrote:
> That's quite easy. Let's say you have one column called 'Risk Factor'
> that can take 20 possible values. You can store values like
> 'Diabetes,High Blood Pressure,Depression'. This is one solution and I
> would prefer this. Another solution is you can have a varchar(20) with
> values like 10001000 etc where you set a bit for particular risk
> factor. You have to know the ordinal position for a particular risk
> and if that bit is set or not.
>
Well, bit fields create a searching an filtering nightmares (ask me how
I know :-).)
I don't believe there is a bullet-proof solution.
I would consider even a de-normalized 1-to-1 relationship architecture
option.
For example, having a Patient table (PatientID + personal data), then
RiskFactors table (PatientID + 20+ risk factor bit fields), etc. Since
in most cases you search either for a singe patient or for group of
patients that meet certain criteria, massive multi-table joins will not
be required too often.
|||On Jun 15, 6:41 pm, CoreyB <unc27...@.yahoo.com> wrote:
> I've seen that approach used in some other places in reading online,
> mainly with surveys & questionarres. But some issues I have with it
> are....
> 1 - The relational form of a tall skinny table, with each answer as a
> row seems good if you may not have data for every question/element, so
> you save space on the questions that aren't answered. We will likely
> have data for each element. Which means we'll have billions of
> rows.....per year. The size will start to become an issue after
> several years.
> 2 - Storage....At first glance the wide tables seem like they'll be
> larger. But a lot of the columns are bits, which are optimized for
> storage, so if I have a wide table with 32 bit columns, they'll only
> take up 4 bytes of storage. But if I store the same answers (1 or 0)
> in another table as rows in a generic catch-all varchar column, and
> then have two IDs of int or bigint tying them back to the question &
> respondent, that's already 8 or 9 bytes per row minimum * 32 answers =
> 250+ bytes just for the one respondent. And there's way more than 32
> - probably 100 or so.
> 3 - Data integrity. If everything has its own column in a wide table,
> then you can make sure that a bit is a bit, and a date is a datetime,
> and an integer is an int. But if everything is put in one generic
> column, then you give up a little bit of ground on the data integrity,
> and then I'm depending on the ETL process, or the developer to
> validate all data types.
> As crappy as it sounds, the wide table looks better to me in this
> situation. Unless someone out here can talk me out of it.
> On Jun 15, 12:29 am, SB <othell...@.yahoo.com> wrote:
>
>
>
>
>
> - Show quoted text -
I think the first solution should work where you store the values as
coma (or any other delimiter) separated values. For example for
certain risk factor, for a customer values can be: diabetes,high blood
pressure.

No comments:

Post a Comment