Hello,
We are storing largish (about 1 million rows) datasets which are imported
from CSV files.
We do no manipulation of the statistical content, so records are stored as
follows:
Year Gender Age Value
1998 Male 18 100
1998 Male 19 150
1998 Female 18 45
1998 Female 19 60
So there were 100 males aged 18 in this particular survey.
The gender and age are being stored as varchar fields, and the value as an
integer field.
The problem is that some surveys have invalid data - such as a n.a. (not
available), n.p. (not published), and n.y.a (not yeat available). So we may
need to store:
1997 Female 19 n.a
I need to reflect this invalid data to the user when he is completing
dynamic cross-table reports of the data Eg the user can filter just too 1998,
or to all years - if all years were selected, I need to return 1997 Female 19
as n.a.
Can anybody offer some advice to me for which is the best way to store this?
I considered using NULL's, but I had two issues; ideally I'd like the
aggregation routine to give me a NULL if a NULL is aggreated, however it
treats it as a 0; and there would be no easy way for me to differentiate
between the different types of invalid data.
Sorry about the long post.
Any advice would be greatly appreciated.
Mark
NULL is the obvious way to record the unknown amounts. The additional
"status" code for invalid data could go in another column. For example:
CREATE TABLE Surveys (yr INTEGER NOT NULL CHECK (yr BETWEEN 1990 AND 2100),
gender CHAR(1) NOT NULL CHECK (gender IN ('M','F')), age INTEGER NOT NULL
CHECK (age BETWEEN 0 AND 150), pop_count INTEGER NULL, status CHAR(3) NOT
NULL CHECK (status IN ('NA','NP','NYA','OK')), CHECK ((pop_count IS NOT NULL
AND status = 'OK') OR (pop_count IS NULL AND status <> 'OK')), PRIMARY KEY
(yr,gender,age))
(I used "Pop_count" just because "Value" isn't a very informative name for a
column. Value also a reserved word.)
I'm not sure what you mean when you say that the "aggregation routine"
treats NULLs as zero. The SUM aggregate in SQL will ignore NULLs and return
a total without them. If all values in an aggregation are NULL then the
result is NULL. If you want to return a NULL sum if *any* value in the
aggregation is NULL then you can use CASE like this:
SELECT CASE WHEN COUNT(pop_count)=COUNT(*) THEN SUM(pop_count) END
FROM Surveys
David Portas
SQL Server MVP
|||I use Null as well for numeric fields, for character fields I use either
null or NA whichever the users prefer.. About the only way you can aggregate
null values is to count them...
Wayne Snyder, MCDBA, SQL Server MVP
Mariner, Charlotte, NC
www.mariner-usa.com
(Please respond only to the newsgroups.)
I support the Professional Association of SQL Server (PASS) and it's
community of SQL Server professionals.
www.sqlpass.org
"mallen" <mallen@.discussions.microsoft.com> wrote in message
news:2D445832-520C-4AD1-AC0F-9176B63DF483@.microsoft.com...
> Hello,
> We are storing largish (about 1 million rows) datasets which are imported
> from CSV files.
> We do no manipulation of the statistical content, so records are stored as
> follows:
> Year Gender Age Value
> 1998 Male 18 100
> 1998 Male 19 150
> 1998 Female 18 45
> 1998 Female 19 60
> So there were 100 males aged 18 in this particular survey.
> The gender and age are being stored as varchar fields, and the value as an
> integer field.
> The problem is that some surveys have invalid data - such as a n.a. (not
> available), n.p. (not published), and n.y.a (not yeat available). So we
may
> need to store:
> 1997 Female 19 n.a
> I need to reflect this invalid data to the user when he is completing
> dynamic cross-table reports of the data Eg the user can filter just too
1998,
> or to all years - if all years were selected, I need to return 1997 Female
19
> as n.a.
> Can anybody offer some advice to me for which is the best way to store
this?
> I considered using NULL's, but I had two issues; ideally I'd like the
> aggregation routine to give me a NULL if a NULL is aggreated, however it
> treats it as a 0; and there would be no easy way for me to differentiate
> between the different types of invalid data.
> Sorry about the long post.
> Any advice would be greatly appreciated.
> Mark
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment