Tuesday, February 14, 2012

Design Discussion - How would YOU do this?

I'm working with data collection software and our current SQL solution is
simply not performing... well. I'm not a DB Admin, but I'm hoping to inject
useful suggestions into our next design meeting - hopefully improving things
across the board.
The situation is this:
I need to collect data about entities participating in a scenario. There can
be any number of participants, but a typical run might include 500-1000
unique entities. Each entity streams updates (XML) to a central collector,
which then moves the data into SQL Server. While the actual
number may vary, we can safely assume that there will be at least one update
every second for each entity involved in the scenario.
The present solution is to examine the data as it comes into the collector
and, if the entity is new, add it to a master table that maintains a 'current
picture' of every entity in the scenario. If the entity is already in the
master, we simply update its record in the master table. Regardless of the
action taken, all messages go into a history table that basically represents
the entire scenario.
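Roughly, the collector's logic looks something like the following (the table,
column, and procedure names here are simplified placeholders, not our real
schema):

    CREATE PROCEDURE dbo.CollectUpdate
        @EntityId   int,
        @UpdateTime datetime,
        @Payload    nvarchar(4000)   -- stands in for however the XML payload is stored
    AS
    BEGIN
        IF EXISTS (SELECT * FROM dbo.EntityMaster WHERE EntityId = @EntityId)
            -- entity already known: refresh its 'current picture' row
            UPDATE dbo.EntityMaster
               SET LastUpdateTime = @UpdateTime, Payload = @Payload
             WHERE EntityId = @EntityId
        ELSE
            -- first message from this entity: add it to the master
            INSERT INTO dbo.EntityMaster (EntityId, LastUpdateTime, Payload)
            VALUES (@EntityId, @UpdateTime, @Payload)

        -- regardless of the action taken, every message also goes into the history
        INSERT INTO dbo.EntityHistory (EntityId, UpdateTime, Payload)
        VALUES (@EntityId, @UpdateTime, @Payload)
    END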
The problems (I've not experienced them yet) described to me are that after
the tables become large (again, 'large' has yet to be defined) inserts become
very slow, and the slowdown appears to be related to the size of the tables
involved. Furthermore, a decision was made to remove *ALL* indexes from the
tables to increase insert speed. This is (obviously) bad for the client
application that looks at this data, because queries now resort to table
scans on large tables.
What should I look to do to increase performance? Obviously it is important
that we capture all of the data and that it be inserted as quickly as
possible, but on the other hand, if it takes hours to execute queries, then
what good is it to have the data in the first place?
Has anyone seen case studies regarding a similar scenario?
I've briefly reviewed documentation about partitioned views and such after
previously posting a similar question; however, I'm not sure how well that
would work, as I can't know before a scenario starts what data I could use to
logically separate the tables.
Ideas? I'd love to hear them. As a developer for the client portion of this
software, I'm tired of suffering through table scans that can take an
inordinate amount of time to complete. (Indexing would help, but I get shot
down because it makes inserts slow - or so the admin is telling me.)
Thanks for reading.
Chris|||While you're busy preparing DDL and sample data, I'd just like to say
that removing indexes is no solution.
Based on what indexes you need to access your data efficiently, you should
consider how they are populated, so that they still allow for fast inserts
and updates.
Basically, what you need to do - in your case, where changes are propagated
to your tables at a high frequency - is allow your index pages to be
populated with as little repagination (page splitting) as possible.
See PAD_INDEX and FILLFACTOR under CREATE INDEX in Books Online.
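For example - the index, table, and column names below are just placeholders
for whatever you actually query on - something like this leaves free space in
both the leaf and non-leaf pages:

    CREATE NONCLUSTERED INDEX IX_EntityHistory_EntityId
        ON dbo.EntityHistory (EntityId, UpdateTime)
        WITH PAD_INDEX, FILLFACTOR = 70   -- leave ~30% of each page free to absorb new rows

A lower fill factor means fewer page splits for a while, at the price of a
larger index, and you have to rebuild the index periodically to restore the
free space.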
Another issue is the clustered index. IMHO, in a database where inserts are
frequent and it is of great importance that insert/update performance is at
its highest, clustered indexes should be created on columns where any newly
inserted value is larger than any previous value, so that there is no need to
repaginate the clustered index pages. I find an identity column a wonderful
candidate for a clustered index when it is based on a single column. If your
clustered index (for reasons not known to me) needs to be based on multiple
columns, try using columns where at least half of them have incremental
values.
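A minimal sketch of what I mean (made-up names again):

    CREATE TABLE dbo.ScenarioMessage (
        MessageId  int IDENTITY(1,1) NOT NULL,
        EntityId   int      NOT NULL,
        ReceivedAt datetime NOT NULL DEFAULT GETDATE()
    )

    -- every new MessageId is larger than all existing ones, so inserts always
    -- append to the end of the clustered index and never split existing pages
    CREATE UNIQUE CLUSTERED INDEX CIX_ScenarioMessage
        ON dbo.ScenarioMessage (MessageId)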
You haven't mentioned whether removing the indexes actually resulted in an
increase in insert/update performance. Maybe you simply need new hardware :)
I guess we could say more after we see some DDL.
ML|||Disk space is cheap. Add a BIGINT IDENTITY column on the history table and
put a unique clustered index with a 100% fill factor on it. There is no
need as far as performance goes to partition this table, because inserts
will perform nearly the same if you have a million rows or if you have a
billion rows. Add a second table to keep track of the last history row
processed. Add a third table for reporting that has indexes to improve
query performance. Finally add a stored procedure that periodically
processes the new rows in the history table to populate the reporting
table.
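To sketch the idea (all object names below are hypothetical; adjust them to
your real schema):

    -- 1. the append-only history table
    CREATE TABLE dbo.EntityHistory (
        HistoryId  bigint IDENTITY(1,1) NOT NULL,
        EntityId   int      NOT NULL,
        UpdateTime datetime NOT NULL,
        Payload    nvarchar(4000) NULL   -- stands in for the XML payload
    )
    CREATE UNIQUE CLUSTERED INDEX CIX_EntityHistory
        ON dbo.EntityHistory (HistoryId) WITH FILLFACTOR = 100

    -- 2. a single-row table remembering the last history row already processed
    CREATE TABLE dbo.HistoryWatermark (LastHistoryId bigint NOT NULL)
    INSERT INTO dbo.HistoryWatermark (LastHistoryId) VALUES (0)

    -- 3. the reporting table the client application queries, indexed for reads
    CREATE TABLE dbo.EntityReport (
        EntityId   int      NOT NULL,
        UpdateTime datetime NOT NULL,
        Payload    nvarchar(4000) NULL,
        CONSTRAINT PK_EntityReport PRIMARY KEY CLUSTERED (EntityId)
    )
    CREATE NONCLUSTERED INDEX IX_EntityReport_UpdateTime
        ON dbo.EntityReport (UpdateTime)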
Inserts into the history table with the clustered index will be nearly as
fast as inserts without any indexes because rows are only added to the end
of the table and B-tree index maintenance is minimal. A new index page is
added when all index pages at the same level are full (which only occurs
every 400 or so rows). An index page is only updated when the last leaf page
fills up. Note that there isn't any page splitting going on if there is only
a clustered index on an IDENTITY column. The index grows upward because both
leaf pages and index pages are always appended. There will be a few
more writes because of the BIGINT column and because of the occasional
addition of a new index page. If this is a problem, you could add a new
filegroup consisting of a separate set of mirrored disks just to store the
history table.
There will be a delay between the time that a row is inserted in the history
table and the time the data is available for reporting, but the lack of
indexes on the master table and the consequent table scans will almost
certainly cause query results to be delayed anyway. It takes on average 1.5
seconds on my development box (Dual 3GHz XEON, 4GB RAM, RAID-1) to scan a
typical table with 175,000 rows. Your table will have between 1.8 and 3.6
million rows inserted or updated every hour, so scans will definitely take a
lot more time. In any case, updates will perform abysmally without indexes
as the table grows because a scan is required to find the row to update.
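Even if you keep the current master table, a single narrow index on whatever
column identifies an entity (I'm assuming an EntityId column below) would at
least let those updates seek instead of scan:

    -- hypothetical: one narrow index so each update can find its row without a scan
    CREATE UNIQUE NONCLUSTERED INDEX IX_EntityMaster_EntityId
        ON dbo.EntityMaster (EntityId)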
You can make the periodic stored procedure populate the reporting table
using set-based operations to maximize performance. Set based operations
perform much better than row based operations because triggers are only
fired once, writes to the log are minimized, and index maintenance is
optimized.
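A rough, untuned sketch of such a procedure, reusing the hypothetical tables
from above:

    CREATE PROCEDURE dbo.LoadEntityReport
    AS
    BEGIN
        DECLARE @LastId bigint, @MaxId bigint

        SELECT @LastId = LastHistoryId FROM dbo.HistoryWatermark
        SELECT @MaxId  = MAX(HistoryId) FROM dbo.EntityHistory

        IF @MaxId IS NULL OR @MaxId <= @LastId
            RETURN   -- nothing new to process

        BEGIN TRANSACTION

        -- latest history row per entity within the new batch
        SELECT EntityId, MAX(HistoryId) AS MaxHistoryId
          INTO #Latest
          FROM dbo.EntityHistory
         WHERE HistoryId > @LastId AND HistoryId <= @MaxId
         GROUP BY EntityId

        -- refresh entities already present in the reporting table
        UPDATE r
           SET r.UpdateTime = h.UpdateTime,
               r.Payload    = h.Payload
          FROM dbo.EntityReport r
          JOIN #Latest l           ON l.EntityId = r.EntityId
          JOIN dbo.EntityHistory h ON h.HistoryId = l.MaxHistoryId

        -- add entities seen for the first time
        INSERT INTO dbo.EntityReport (EntityId, UpdateTime, Payload)
        SELECT h.EntityId, h.UpdateTime, h.Payload
          FROM #Latest l
          JOIN dbo.EntityHistory h ON h.HistoryId = l.MaxHistoryId
         WHERE NOT EXISTS (SELECT * FROM dbo.EntityReport r WHERE r.EntityId = l.EntityId)

        UPDATE dbo.HistoryWatermark SET LastHistoryId = @MaxId

        COMMIT TRANSACTION

        DROP TABLE #Latest
    END

You would schedule it from a job at whatever interval matches how fresh the
reporting data needs to be.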
I would hazard a guess that the delay incurred by caching the inserts and
updates will be far less than the delay caused by the table scans.
"Chris" <Chris@.discussions.microsoft.com> wrote in message
news:55221D59-F4D9-4BD8-8FE3-41C64128DD45@.microsoft.com...
|||You might consider a tool designed for this kind of problem. Kx and
StreamBase are DBMSs designed for capturing streaming data and doing work on
it while the data is streaming.|||Hi all,
What would the performance be if log shipping were used instead? The
destination server would be used for reporting purposes. Though log shipping
is generally used as a backup/standby mechanism, I feel we can reduce the
latency and boost performance because the reads and writes are done on
different machines. Of course, adding another machine is a problem, but we
need to do that for backup anyway.
Any thoughts on this?
Regards
R.D
"--CELKO--" wrote:

|||Apologies for not responding earlier, I've been off the net for a week or
so.
Thank you for the information - it really confirms what I had thought to be
the case with how SQL Server works. I could not understand why the number of
rows in a table would make a difference, as everything I have read indicated
that it would not matter.
Next week will be interesting as I push back and try to get some of these
things implemented. In the meantime, if anyone has additional input on the
scenario, I'd love to hear from you.
Chris
"Brian Selzer" wrote:

