Entry tags:
database design question
Dear LJ Brain Trust,
I'm too new to this to know if this is a "DB 101" question or more interesting, but Googling isn't getting me anywhere and I'm not getting a clear understanding from casual interrogation of coworkers, so I'm trying y'all.
Let's say I have a dataset that has half a dozen(ish) types of data -- posts, and users, and comments, and a few others. Logically, each of these types is a table, and I can join as necessary across them (to, say, find all posts by a given user). That's fine.
Now suppose I have, oh, 250 such datasets (same schemas, different data), and I'm going to load them all into a single database because I want to be able to do queries across datasets. There are two ways I might approach this:
1. Make 250 sets of tables. Each table is named for its dataset (e.g. Set1.Posts, Set1.Users, ... Set250.Posts, Set250.Users). If I want to do a query across all the users, everywhere, I, um... can I do the moral equivalent of "JOIN *.Users"? How would that work?
2. Make big tables. I have one Posts table, and one Users table, and so on; at data load I cons up values for a new "dataset" column to say where each row came from. If I only want to look at one site I use a WHERE on that column to restrict the SELECT; if I want everything, it's all there for the taking.
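To make option 2 concrete, here's a rough sketch of the shape I have in mind (table and column names are made up for illustration, not from any real schema):

```sql
-- Option 2: one shared table per entity type, plus a "dataset" column
-- saying which site each row came from (illustrative names).
CREATE TABLE posts (
    dataset  text    NOT NULL,  -- e.g. 'scifi', 'cooking', ...
    post_id  integer NOT NULL,
    user_id  integer,
    body     text,
    PRIMARY KEY (dataset, post_id)
);

-- One site only: restrict on the dataset column.
SELECT * FROM posts WHERE dataset = 'scifi';

-- Everything: just leave the WHERE off.
SELECT dataset, count(*) FROM posts GROUP BY dataset;
```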
Let me dispense with one possible concern: these datasets are updated from time to time, so at times I'll want to refresh the data from a dataset because an update came out. (An update is a new, complete copy, not a delta.) My database supports data partitioning, so deleting just the rows from that dataset before loading in a fresh copy is not a problem. (In option 1, you just drop the table.)
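Under option 2, a refresh is a delete-and-reload. Sketching it with the same hypothetical names (and assuming the export files already carry the dataset value), it's roughly:

```sql
-- Replace one dataset's rows with a fresh copy (illustrative names).
-- If the table is partitioned on "dataset", this DELETE is effectively
-- a cheap partition drop rather than a row-by-row scan.
BEGIN;
DELETE FROM posts WHERE dataset = 'scifi';
COPY posts FROM '/data/scifi/posts.csv' WITH (FORMAT csv);  -- PostgreSQL-style bulk load
COMMIT;
```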
It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows. Because my database is in the "big-data analytics" space, I believe it can handle that. I mean, this database can operate on petabytes of data; my li'l pile of datasets will be well under a terabyte. (In fact, #1 didn't even occur to me, until I mentioned this project in passing and somebody suggested it.)
Is that reasonable, or am I missing something big and obvious? What factors should I be considering that I haven't mentioned? And if I'm wrong and I should be doing #1, how do I write a JOIN that crosses all the datasets?
(All the Stack Exchange data, in case you're wondering.)

no subject
Thanks for the tip about UNION. And yeah, having to enumerate it would defeat the purpose; #1 only seemed plausible to me if in fact there were wildcarding, but so far I've seen no evidence of that (but I am not anything like a SQL expert).
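For anyone following along: there's no wildcard table reference in standard SQL, so the enumerated version under #1 would be a UNION ALL (not really a JOIN) spelled out across all 250 tables, something like this (schema names hypothetical):

```sql
-- No "JOIN *.Users" exists; you enumerate every table explicitly.
SELECT 'Set1' AS dataset, u.* FROM Set1.Users u
UNION ALL
SELECT 'Set2' AS dataset, u.* FROM Set2.Users u
UNION ALL
-- ... 247 more of these ...
SELECT 'Set250' AS dataset, u.* FROM Set250.Users u;
```

In practice you'd generate that text with a script, which is part of why enumerating defeats the purpose.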
a MSWord mail-merge
*shudder* :-)
ETA: Oops, didn't notice you were editing when I posted that reply (thus cutting off further edits). Thanks for the tip about table inheritance; I hadn't heard of it and will look to see whether my DB supports it.
no subject
Oh, and if I were going to implement a parent table with 250 child tables, I'm pretty confident I would be automating the generation of something, possibly in some other language, possibly in pl/pgsql, possibly doing something creative with trigger functions. I'm pretty sure you can create new tables in trigger functions, so....
ETA: No worries! Now that I think of it, heck, in PostgreSQL, you could have a trigger function ON INSERT on the "users" parent table, so you could just attempt to INSERT what should be "users_0014" right into "users", and the trigger function looks at the "dataset" column, sees "0014", checks and sees there's no "users_0014" child table and makes it, and inserts the data there instead.
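The trigger idea above might look roughly like this in PL/pgSQL. This is an untested sketch of the classic partition-by-trigger pattern, with made-up table and column names; it assumes a text "dataset" column like '0014':

```sql
-- Parent table; child tables (users_0014, ...) inherit from it.
CREATE TABLE users (
    dataset text NOT NULL,
    user_id integer NOT NULL,
    name    text
);

CREATE OR REPLACE FUNCTION users_insert_router() RETURNS trigger AS $$
DECLARE
    child text := 'users_' || NEW.dataset;   -- e.g. users_0014
BEGIN
    -- Create the child table on first sight of a new dataset.
    EXECUTE format('CREATE TABLE IF NOT EXISTS %I () INHERITS (users)', child);
    -- Route the row into the child instead of the parent.
    EXECUTE format('INSERT INTO %I VALUES ($1.*)', child) USING NEW;
    RETURN NULL;  -- suppress the insert into the parent itself
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+ syntax; older versions say EXECUTE PROCEDURE.
CREATE TRIGGER users_insert_trig
    BEFORE INSERT ON users
    FOR EACH ROW EXECUTE FUNCTION users_insert_router();
```

With that in place, an INSERT INTO users lands in the right child automatically, and SELECTs on the parent see all children's rows.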
no subject
Yeah, agreed. Since I control where the data gets downloaded, I can control file/directory names and then script it. :-)
Thanks for the tips on parent tables and trigger functions! I don't know what all that means yet, but I now have some vocabulary with which to go looking.
no subject
What it does is create a (temporary?) table #sitestats, then iteratively run approximately the same (dynamically generated) query over each table and stow the results in #sitestats. Then at the end, it queries #sitestats.
So basically, it's building #2 on the fly.
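Reconstructed from that description (this is not the actual query, just a hedged guess at its T-SQL shape, with a hypothetical per-site table-naming convention):

```sql
-- Rough shape: loop over per-site tables, stowing one row each in #sitestats.
CREATE TABLE #sitestats (site sysname, posts int);

DECLARE @tbl sysname, @sql nvarchar(max);
DECLARE c CURSOR FOR
    SELECT name FROM sys.tables WHERE name LIKE 'Posts[_]%';  -- hypothetical naming
OPEN c;
FETCH NEXT FROM c INTO @tbl;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run the per-site query dynamically.
    SET @sql = N'INSERT INTO #sitestats SELECT ''' + @tbl
             + N''', count(*) FROM ' + QUOTENAME(@tbl);
    EXEC sp_executesql @sql;
    FETCH NEXT FROM c INTO @tbl;
END;
CLOSE c; DEALLOCATE c;

-- And then at the end, query #sitestats.
SELECT * FROM #sitestats ORDER BY posts DESC;
```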
no subject
Thank you for translating that query for me. I assume the table is temporary; they let anybody (who cares to create an OpenID-based account) create and run queries against their DB, so I assume everything's sandboxed.