April 2012

39 posts

I am having fun!

“Companies die not from being eaten by their competitors, but from self-inflicted wounds. They don’t have discipline. Their best people get frustrated. They chase all these shiny objects that aren’t core to the business. They become complacent because of early success.” —Drew Houston, Dropbox (via sequoiacapital)
Blake Matheny: Tumblr Firehose - The Gory Details tumblr.mobocracy.net

mobocracy:

Back in December I started putting some thought into the tumblr firehose. While the initial launch was covered here, and the business stuff surrounding it was covered by places like techcrunch and AllThingsD, not much has been said about the technical details.

First, some back story. I knew in December that a product need for the firehose was upcoming and had simultaneously been spending a fair amount of time thinking about the general tumblr activity stream. In particular I had been toying quite a bit with trying to figure out a reasonable real-time processing model that would work in a heterogeneous environment like the one at Tumblr. I had also been quite closely following some of the exciting work being done at LinkedIn by Jay Kreps and others on Kafka and Databus, by Eric Sammer from Cloudera on Flume, and by Nathan Marz from Twitter on Storm.

I had talked with some of the engineers at twitter about their firehose and knew some of the challenges they had overcome in scaling it. I spent some time reading their fantastic documentation and after reviewing some of these systems came up with the system I actually wanted to build, much of it completely influenced by the great work being done by other people. My ‘ideal’ firehose, from the consumer/client side, had the following properties:

  • Usable via curl
  • Allows a client to ‘rewind’ the stream in case of missed events or maintenance
  • If a client disconnects, they should pick up the stream where they left off
  • Client concurrency/parallelism, e.g. multiple consumers getting unique views of the stream
  • Near real-time is good enough (sub 1s from an event emitted to consumed)

From an event emitter (or producer) perspective, we simply wanted an elastic backend that could grow and shrink based on latency and persistence requirements.

What we ended up with accomplishes all of these goals and was fairly simple to implement. We took the best of many worlds (a bit of Kafka, a bit of Finagle, some Flume influences) and created the whole thing in about 10 days. The internal name for this system is Parmesan, which is both a cheese and an Arrested Development character (Gene Parmesan, PI).

The system comprises 4 primary components.

  • A ZooKeeper cluster, used for coordinating Kafka as well as stream checkpoints
  • Kafka, which is used for message persistence and distribution
  • A thrift process, written with scala/finagle, which the tumblr application talks to
  • An HTTP process, written with scala/finagle, which consumers talk to

The Tumblr application makes a Thrift RPC call containing event data to parmesan. These RPC calls take about 5ms on average, and the client will retry unless it gets a success message back. Parmesan batches these events and uses Kafka to persist them to disk every 100ms. This functionality is all handled by the thrift side of the parmesan application. We also implemented a very simple custom message serialization format so that parmesan could completely avoid any kind of message serialization/deserialization overhead. This had a dramatic impact on GC time (the serialization change wasn’t made until it was needed) which in turn had a significant impact on average connection latency.
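The batching step described above can be sketched roughly as follows. This is only an illustration of time-based batching, not Parmesan's actual code: the `EventBatcher` class, its `sink` callback (standing in for the Kafka write) and the injectable `clock` are all hypothetical names, and the post does not say whether the RPC is acknowledged before or after the batch reaches disk.

```python
import time
from typing import Callable, List


class EventBatcher:
    """Buffers events and hands them to `sink` at most once per interval.

    `sink` is a stand-in for "persist this batch via Kafka"; it is an
    assumption made for this sketch, not an API from the original system.
    """

    def __init__(self,
                 sink: Callable[[List[bytes]], None],
                 flush_interval: float = 0.1,  # 100ms, as in the post
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink
        self._interval = flush_interval
        self._clock = clock
        self._buffer: List[bytes] = []
        self._last_flush = clock()

    def append(self, event: bytes) -> None:
        """Queue one event and flush if the interval has elapsed."""
        self._buffer.append(event)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush the buffer if it is non-empty and the interval has passed."""
        if not self._buffer:
            return False
        if self._clock() - self._last_flush < self._interval:
            return False
        batch, self._buffer = self._buffer, []
        self._sink(batch)
        self._last_flush = self._clock()
        return True
```

A real service would also call `maybe_flush()` from a timer, so that a quiet period does not leave events sitting in the buffer; the sketch only flushes when a new event arrives.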

On the client side, any standard HTTP client works. Besides a username and password, a request needs an application ID and, optionally, an offset. The offset determines where in the stream to start reading from, and is specified either as Oldest (7 days ago), Newest (from right now), or an offset in seconds from the current time in UTC. Up to 16 clients with the same application ID can connect, each viewing a unique partition of the activity stream. Stream partitioning allows you to parallelize your consumption without seeing duplicates. This is a great feature if, for instance, you took your app down for maintenance and want to quickly catch back up in the stream.
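To make the offset and partition rules above concrete, here is a small sketch. Everything named here is hypothetical: `resolve_start_time`, `PartitionSlots`, the exact spelling of the offset values, and the choice to treat a missing offset as Newest are assumptions for illustration. The post does not describe how the server actually hands out partitions.

```python
from typing import Dict, Optional, Set

RETENTION_SECONDS = 7 * 24 * 60 * 60  # "Oldest" means 7 days ago
MAX_CLIENTS_PER_APP = 16              # limit stated in the post


def resolve_start_time(offset: Optional[str], now: int) -> int:
    """Turn the offset parameter into a unix time to start reading from."""
    if offset is None or offset == "Newest":  # default is an assumption
        return now
    if offset == "Oldest":
        return now - RETENTION_SECONDS
    seconds = int(offset)  # otherwise: seconds back from the current time
    if not 0 <= seconds <= RETENTION_SECONDS:
        raise ValueError("offset must fall within the retained week")
    return now - seconds


class PartitionSlots:
    """Gives each connection of an application ID its own partition number."""

    def __init__(self, max_clients: int = MAX_CLIENTS_PER_APP) -> None:
        self._max = max_clients
        self._in_use: Dict[str, Set[int]] = {}

    def acquire(self, app_id: str) -> int:
        """Return the lowest free partition, or fail if all are taken."""
        used = self._in_use.setdefault(app_id, set())
        for partition in range(self._max):
            if partition not in used:
                used.add(partition)
                return partition
        raise RuntimeError("all partitions for this application ID are in use")

    def release(self, app_id: str, partition: int) -> None:
        """Free a partition when its client disconnects."""
        self._in_use.get(app_id, set()).discard(partition)
```

Because every connected client holds a distinct partition number, no two clients of the same application see the same events, which is what lets a consumer fan out to catch up after downtime.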

Kafka doesn’t easily (natively) support this style of rewinding, so we just persist stream offsets to ZooKeeper. That is, periodically clients with a specific application ID will say, “Hey, at this unixtime I saw a message which had this internal Kafka offset”. By periodically persisting this data to ZooKeeper, we can ‘fake’ this rewind functionality in a way that is useful, but imprecise (we basically have to estimate where in the Kafka log to start reading from).
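The time-to-offset estimate can be illustrated with a sorted list of checkpoints and a binary search. This is a sketch under assumptions: `CheckpointIndex` is a hypothetical in-memory stand-in for whatever is stored in ZooKeeper, and starting from the latest checkpoint at or before the requested time is one possible policy (it may replay a few older events rather than skip any), not necessarily the one Parmesan uses.

```python
import bisect
from typing import List, Optional


class CheckpointIndex:
    """Maps unix times to Kafka offsets using periodic checkpoints."""

    def __init__(self) -> None:
        self._times: List[int] = []    # kept sorted ascending
        self._offsets: List[int] = []  # offsets, parallel to _times

    def record(self, unixtime: int, kafka_offset: int) -> None:
        """Store "at this unixtime I saw this Kafka offset"."""
        i = bisect.bisect_right(self._times, unixtime)
        self._times.insert(i, unixtime)
        self._offsets.insert(i, kafka_offset)

    def offset_for(self, target_time: int) -> Optional[int]:
        """Estimate where to start reading for `target_time`.

        Returns the offset of the latest checkpoint at or before the
        target, or None when no checkpoint is that old.
        """
        i = bisect.bisect_right(self._times, target_time)
        if i == 0:
            return None
        return self._offsets[i - 1]
```

The estimate is only as fine-grained as the checkpoint interval: a request for a time between two checkpoints starts at the earlier one, which is why the rewind is useful but imprecise.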

We use 4 ‘queue class’ (tumblr speak for a box with 72GB of RAM and 2 mirrored disks) machines, capable of supporting roughly 100k messages per second each, to support the entire stream. Those 4 machines provide a message backlog of 1 week, allowing clients to drop into the stream anywhere in the past week.

As I mentioned on twitter, I’m quite proud of the software and the team behind it. Many thanks to Derek, Danielle and Wiktor for help and feedback.

If you’re interested in this kind of distributed systems work, we’re hiring.

I didn’t do much but sit back and cheer from the sidelines. Blake is the special unicorn!

HNterest | Pinterest for Hacker News hnterest.com

This makes me appreciate the Hacker News UI. Oooph

Get By Talib Kweli

michael:

camillionaire:

Talib Kweli - Get By

This was such a hot album

“Offer them what they secretly want and they of course immediately become panic-stricken.” —Jack Kerouac, On the Road

YES!

Terri is the Japanese version of Dick Dale. I think he still plays. I saw him on TV a while back and was quite impressed.

This is a great song all the way through but I love the opening 2 seconds.  The little scream makes it perfect for me. 

I have seen PE in concert twice. The first time they were an opening act and almost got booed off stage. It was actually my first concert, and it was a great night as I remember. I later saw them in Tokyo and that was quite memorable as well. They were much better received that night. They seemed like a very ‘dangerous’ group back in the day.

I liked Waylon even before he did the narration for ‘The Dukes of Hazzard’ TV show.

“Right now, there is a limit of 50 search instances. An extra large search instance can handle approximately 8 Million 1K documents. It appears that the assumption is that the documents are quite small (e.g. product documents). To put it in perspective, a rough rule of thumb for web documents is approximately 10k. Given this, it translates into roughly 800k web documents per server * 50 servers = 40 million web documents. This is not for building large-scale web search, yet. However, it should be more than enough for most enterprise e-commerce and site-search applications.” Jeff’s Search Engine Caffè: Amazon CloudSearch, Elastic Search as a Service

I love seeing new features roll out and we’ve had a couple of really productive weeks. We’ve launched the new Log in / Sign up pages, revamped publishing to Facebook to include Timeline and OpenGraph support, delivered an amazing new Android app and today we pushed out our Spotify integration. I love working with these amazing people. Here’s to keepin it neat and making it a double. We’ve only just begun ….

staff:

Listen to this! We just teamed up with Spotify so you can post tracks, playlists, and full albums from their very extensive library. Search for tracks or paste a Spotify link to embed your music — without the daily limit. :)

To celebrate, we’ve put together this playlist featuring musicians from the Tumblr Spotlight. Hit play and enjoy!

Don’t have Spotify yet? Get it here!

“Now there is the smell of spring” —Man sitting next to me on the subway
  • Reporter: How did you meet David?
  • Derek: I was in prison for hacking.
  • Derek: David was part of a "scared straight" program.
Job - Tumblr API Lead tumblr.theresumator.com

dashbuddy:

Fantastic to see this job up, I really wish I had the skills because I can’t imagine a more engaging job right now!
