Saving Wikipedia to SQLite
(1) By Andreas Kupries (andreas-kupries) on 2021-04-07 08:48:12 [link] [source]
Funny story about that....
In 2006, I acquired an iRex iLiad ebook reader, one of the first consumer products with an e-ink display. It ran Linux, and they released an SDK. It was an amazing piece of hardware, way ahead of its time.
During the 2006 holiday break, I was trying to figure out a project I could do with it, and struck upon the idea of writing an off-line Wikipedia app. This was inspired by some SF movie or TV show I had recently seen that showed someone using a tablet/ebook reader with different things (like dictionaries, encyclopedias, etc.) stored on rod-shaped data storage units, kind of like small pencils (anyone remember what show/movie that was from?).
The issue was that the iLiad only supported SD cards (like, actual, original SDs, not SDHC or SDXC), so it was limited to 4GB. That meant figuring out a way to squeeze all of Wikipedia into 4GB, in a format that would be easily accessible.
Since Wikipedia makes MySQL dumps available, I downloaded the latest (at work) and wrote some scripts to extract the dumps and load them into SQLite. I did some cool things, like using SQLite's type system to save articles either as text or as a zlib-compressed BLOB, depending on which was smaller, with automatic access functions to unpack the data. Starting with the formatted text, and only saving the current revision of actual articles, without discussions or meta-articles (and using a custom salt dictionary for zlib), I did manage to get it down to 4GB. Writing a front-end app for the data turned out to be way more trouble than it was worth, however, and never happened, but the database was pretty cool.
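For readers curious what the text-or-BLOB trick might look like, here is a minimal sketch in Python. It is not the original scripts, which are not shown in this thread. The table name `article`, the function names `pack`, `unpack`, `open_db`, and `save`, and the contents of `ZDICT` are all assumptions made for illustration. It relies on SQLite storing a value's type per row, so a column declared without a type can hold TEXT in one row and a BLOB in the next.

```python
import sqlite3
import zlib

# Hypothetical preset dictionary: byte strings expected to recur in articles.
# A real one would be built from a sample of the actual article text.
ZDICT = b"[[Category:]]{{cite web|url=|title=}}== References ==== External links =="

def pack(text):
    """Return the text unchanged, or zlib-compressed bytes if those are smaller."""
    raw = text.encode("utf-8")
    comp = zlib.compressobj(level=9, zdict=ZDICT)
    packed = comp.compress(raw) + comp.flush()
    return packed if len(packed) < len(raw) else text

def unpack(value):
    """Inverse of pack(): BLOBs arrive as bytes, TEXT arrives as str."""
    if isinstance(value, bytes):
        decomp = zlib.decompressobj(zdict=ZDICT)
        return (decomp.decompress(value) + decomp.flush()).decode("utf-8")
    return value

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    # Register unpack() as an SQL function so queries can decode rows themselves.
    db.create_function("unpack", 1, unpack)
    # "body" has no declared type, so each row keeps whatever type it was given.
    db.execute("CREATE TABLE IF NOT EXISTS article (title TEXT PRIMARY KEY, body)")
    return db

def save(db, title, text):
    db.execute("INSERT OR REPLACE INTO article VALUES (?, ?)", (title, pack(text)))
```

A query such as `SELECT unpack(body) FROM article WHERE title = ?` then returns plain text whichever way the row was stored, and `typeof(body)` reports which form a given row actually uses.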
During this process, at some point I got really frustrated with the SQLite APIs, the exact sequence in which they needed to be called, and the lack of good tutorials. At one moment, I actually shouted at the ceiling, "Someone should write a book about all this!" And in the most cartoonish fashion, for a moment I swear an actual lightbulb appeared above my head.
You see, I'd been in contact with editors at O'Reilly for a few years at that point in my career, having done numerous tech-reviews and proof-reads. I had started a book in 2004 ("Distributed Computing with Mac OS X: Building Clusters and Grids"), but the book was canceled (along with about 70% of their in-progress titles) when O'Reilly went through a big business shift and the CFO took over the CEO position from Tim, and effectively saved the company (wonderful lady; we talked at a few events and she's absolutely amazing). It took the company a year or so to recover, but they were looking for other book ideas, and my editor was really pushing me to try my hand at writing something again. We were working on an idea or two about networking, but the idea wouldn't gel, and I didn't want to do it without laser focus, so we were trying to come up with another book topic.
And suddenly I had one. The world needed an(other) SQLite book. I wrote up a proposal, told my editor it was coming his way, got everything polished and ready to go for the editors' review meeting. THAT morning, Tim O'Reilly emailed my editor asking if they had anyone looking at "this SQLite thing," as it was spiking on Google Trends. "Yeah, we have something in the works." That afternoon my idea got pitched and was approved. 2.5 years later, in August of 2010, "Using SQLite" was published. I still make about USD$60/month in royalties.
And it all started because I wanted Wikipedia in SQLite on an SD card.
I had wondered whether the book was worth your time. I am glad that it seems so.
I learned a lot from it, but the one specific I remember was not SQLite-specific. It was the more general observation that having a lot of tables with the same schemas (schemata?) might be a bad idea.
It took me a bunch of hours to fix it, but my code ended up much cleaner.
Thank you, Jay.
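To make the point about identically-shaped tables concrete, here is a small hypothetical sketch in Python. The `sales_2019`-style tables, their columns, and the helper `total_for` are invented for illustration and do not come from the book or from the poster's code. The idea is that whatever distinguished the tables (here, the year in the table name) becomes an ordinary column in a single table.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Anti-pattern (hypothetical): one table per year, all with the same columns.
for year in (2019, 2020, 2021):
    db.execute(f"CREATE TABLE sales_{year} (item TEXT, amount INTEGER)")
db.execute("INSERT INTO sales_2019 VALUES ('widget', 5)")
db.execute("INSERT INTO sales_2020 VALUES ('widget', 7)")
db.execute("INSERT INTO sales_2021 VALUES ('gadget', 3)")

# Consolidated design: one table, with the old table-name suffix as a column.
db.execute("CREATE TABLE sales (year INTEGER, item TEXT, amount INTEGER)")
for year in (2019, 2020, 2021):
    db.execute(f"INSERT INTO sales SELECT {year}, item, amount FROM sales_{year}")
    db.execute(f"DROP TABLE sales_{year}")
db.execute("CREATE INDEX sales_year ON sales (year)")

def total_for(item):
    """Sum across all years with one query, instead of one query per table."""
    row = db.execute("SELECT SUM(amount) FROM sales WHERE item = ?", (item,)).fetchone()
    return row[0]
```

With the single table, adding another year is an `INSERT` rather than a schema change, and cross-year questions no longer need a `UNION` over every table.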
I had wondered whether the book was worth your time.
That's a hard question to answer, because it depends on what you consider "worth." Overall, I think I've made about $10,000 from the book. A nice sum, but noticeably less than my real job pays me in a month. Not a great deal, especially when you consider the book represents well over 1000 hours of work, all fit in while working a full-time job at a game development company. So if you just want to look at dollars and hours, not really.
But that's not why you take on a project like that. While it doesn't mean as much these days, 10 or 15 years ago, to say you'd written "an animal book" was a bit of a pinnacle. It's freaking golden on a resume. It was also a nice author perk to have unlimited access to O'Reilly's catalog. At one point I broke their online book system because I had more than 200 titles in my account.
Perhaps more important, it really taught me how to write, and that's something I pass on to younger developers: learn to write approachable, readable tech documents. The best idea is worthless if you can't express it to someone else. It's also true that "They who control the design document control the team," and that can be the most junior person on the team, as long as they're willing to stand up and do the work, because chances are nobody else wants to.
I learned a lot from it, but the one specific I remember was not SQLite-specific.
Yes, we realized that most people who are new to SQLite are programmers looking for a database, not DBAs looking for something small and lightweight. As such, while the API chapters and such are only about SQLite, they're actually a pretty small part of the book. Much of the book covers how to use relational systems in general, approaching the topic from the mindset of a developer, not a database expert.
You might be interested to know that the academic book field matches the story JK tells here. Income from the books doesn't amount to much when divided by the time spent on the work. If you're furloughed or unemployed, it's income, which is not to be sneezed at. But if you have another source of income it's not much.
The advantage comes from experience, foot-in-the-door, and name profile. JK is now a "well-known technical author". If he wants to write another book, or move into documentation or teaching, Using SQLite looks terrific on his CV. If SQLite gets in the news (Big leak from a SQLite database? Someone publishes a ton of useful information as a SQLite database? News media picks up on "over a billion installations"?) JK can earn money from appearing on TV and perhaps consultancy.
When an academic writes a textbook they get a great buzz from being able to teach from a textbook which has exactly what they want in it. That lasts for a few years, until new discoveries or new ways of thinking about their subject change what they would have written. In contrast, SQLite is changing, but not much (on purpose).
(4) By ThanksRyan on 2021-04-08 03:39:06 in reply to 2 [link] [source]
Thank you for your efforts to bring an(other) SQLite book into the world! A great learning experience for you as well. It's been a few years since I've thumbed through your book, but I could tell you had passion writing it and wanted to educate the reader. Some authors clearly want to cure the reader's insomnia, but yours wanted to share the good news of SQLite with the world.
While we can't talk you into another SQLite book, maybe an article or two a month would be nice. This story you shared was quite enjoyable. Just recently on the forum there have been a couple of different people asking about using SQLite with lots of users, and then there's the story the OP shared. SQLite reaches far and wide.
Thank you for your story and thank you, drh and your team, for continuing to develop SQLite.