Monica ([personal profile] cellio) wrote 2014-10-20 11:17 pm

database design question

Dear LJ Brain Trust,

I'm too new to this to know if this is a "DB 101" question or more interesting, but Googling isn't getting me anywhere and I'm not getting a clear understanding from casual interrogation of coworkers, so I'm trying y'all.

Let's say I have a dataset that has half a dozen(ish) types of data -- posts, and users, and comments, and a few others. Logically, each of these types is a table, and I can join as necessary across them (to, say, find all posts by a given user). That's fine.

Now suppose I have, oh, 250 such datasets (same schemas, different data), and I'm going to load them all into a single database because I want to be able to do queries across datasets. There are two ways I might approach this:

1. Make 250 sets of tables. Each table is named for its dataset (e.g. Set1.Posts, Set1.Users, ... Set250.Posts, Set250.Users). If I want to do a query across all the posts, everywhere, I, um... can I do the moral equivalent of "JOIN *.Posts"? How would that work? (Rough sketch of both options after this list.)

2. Make big tables. I have one Posts table, and one Users table, and so on; at data load I cons up values for a new "dataset" column to say where each row came from. If I only want to look at one site I use a WHERE on that column to restrict the SELECT; if I want everything, it's all there for the taking.
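For concreteness, here's a rough sketch of what the two options look like in generic SQL; the table, schema, and column names are just illustrative, not my real schema.

    -- Option 1: per-dataset tables. There's no wildcard join; querying across
    -- all 250 sets means enumerating (or script-generating) a UNION ALL.
    SELECT 'Set1' AS dataset, p.* FROM Set1.Posts p
    UNION ALL
    SELECT 'Set2' AS dataset, p.* FROM Set2.Posts p
    -- ... 248 more branches ...
    ;

    -- Option 2: one big table with a "dataset" column filled in at load time.
    SELECT * FROM Posts WHERE dataset = 'Set1';  -- just one site
    SELECT * FROM Posts;                         -- everything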

Let me dispense with one possible concern: these datasets are updated from time to time, so at times I'll want to refresh the data from a dataset because an update came out. (An update is a new, complete copy, not a delta.) My database supports data partitioning, so deleting just the rows from that dataset before loading in a fresh copy is not a problem. (In option 1, you'd just drop and recreate that set's tables.)
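Concretely, the per-dataset refresh under option 2 would be something like this. Sketch only: the names are illustrative, the staging table is invented for the example, and with partitioning the DELETE could instead be a partition drop.

    -- Sketch: replace one dataset's rows with a fresh full copy.
    -- "staging_posts" is a hypothetical table holding the newly downloaded dump.
    BEGIN;
    DELETE FROM Posts WHERE dataset = 'Set42';   -- or drop that partition
    INSERT INTO Posts
    SELECT 'Set42' AS dataset, s.*
    FROM staging_posts s;
    COMMIT;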

It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows. Because my database is in the "big-data analytics" space, I believe it can handle that. I mean, this database can operate on petabytes of data; my li'l pile of datasets will be well under a terabyte. (In fact, #1 didn't even occur to me, until I mentioned this project in passing and somebody suggested it.)

Is that reasonable, or am I missing something big and obvious? What factors should I be considering that I haven't mentioned? And if I'm wrong and I should be doing #1, how do I write a JOIN that crosses all the datasets?

(All the Stack Exchange data, in case you're wondering.)

[identity profile] eub.livejournal.com 2014-10-21 07:43 am (UTC)
"It seems to me that there is everything to gain and (probably?) nothing to lose by doing #2, if my database can handle that many rows."

Exactly. In terms of expressiveness, #2 is strictly more powerful; any query against a list of #1's tables can be mechanically transformed into a WHERE clause for querying #2, but not the other way around. Or #2 can do a "GROUP BY instance" to compute over each of the subsites and, say, order them by trollitude.
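For instance, something like this (made-up column names, assuming your option-2 layout) is a one-liner against the big table but painful with 250 separate tables:

    -- Per-site aggregate over the single Comments table; "dataset" is the
    -- column added at load time, "flagged" is an invented stand-in column.
    SELECT dataset, COUNT(*) AS flagged_comments
    FROM Comments
    WHERE flagged = true
    GROUP BY dataset
    ORDER BY flagged_comments DESC;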

Sometimes performance trumps cleanliness, but it doesn't sound like the person suggesting #1 is particularly an expert in your database's performance characteristics. If your system needs you to manually ask for an index or something to make #2 fast, you'll learn. :)
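If it comes to that, it's usually a one-liner; the exact mechanism depends on the engine (a column-store analytics database may want a sort key or projection rather than a plain index), but the generic form is:

    -- Hypothetical: make single-dataset lookups cheap on the big table.
    CREATE INDEX posts_dataset_idx ON Posts (dataset);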

[identity profile] eub.livejournal.com 2014-10-22 06:39 am (UTC)
It can be inefficient (scanning the whole table to find one section) if the execution plan goes wrong, but that's good practice in telling your db not to be stupid. :)