The Core

Why We Are Here => Hardware & Technology => Topic started by: ukgimp on April 14, 2011, 04:17:33 PM

Title: Designing Data Intensive Applications
Post by: ukgimp on April 14, 2011, 04:17:33 PM
I have the seed of an idea, I would like to make sure it scales.

I could be looking at millions of members, each with profiles, that feed out into various outlets (FB, Twitter, etc.).

Is there a best practice or a way of calculating the best dev platform? Whilst it won't be instant, it could get that way within a year or two, so I have to think ahead.

I like PHP/MySQL, but will that scale? Has anyone studied large-scale LAMP web projects that can handle this sort of work?

I think 4M feeds out per day could be a conservative estimate, or they could be grouped into about 100K.
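
To put those daily figures in perspective, here is a rough back-of-envelope sketch in Python that turns them into per-second rates. The 5x peak factor is purely an assumption for illustration (real traffic is rarely spread evenly over the day), and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(events_per_day: int) -> float:
    """Average rate if the events were spread evenly across the day."""
    return events_per_day / SECONDS_PER_DAY


def peak_per_second(events_per_day: int, peak_factor: float) -> float:
    """Rate at busy times, assuming peaks are peak_factor times the average."""
    return average_per_second(events_per_day) * peak_factor


# 4M ungrouped feeds per day versus roughly 100K grouped ones.
# The 5.0 peak factor is an assumed value, not a measurement.
for label, per_day in (("ungrouped", 4_000_000), ("grouped", 100_000)):
    avg = average_per_second(per_day)
    peak = peak_per_second(per_day, 5.0)
    print(f"{label}: {avg:.1f}/s average, {peak:.1f}/s at an assumed 5x peak")
```

So 4M per day works out at roughly 46 feeds out per second on average, and grouping to 100K brings that down to a little over one per second.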

Any pointers appreciated.

Cheers

Rich



Title: Re: Designing Data Intensive Applications
Post by: PaulH on April 14, 2011, 04:48:04 PM
Facebook was LAMP for a long time - but all their cash went on servers.

Think the way forward is the cloud. Will be moving a lot of our stuff to cloud computing.

Know of, but never tried "NoSQL"
http://en.wikipedia.org/wiki/NoSQL_(concept)
Quote
Real-world NoSQL deployments include Digg's 3 TB for green badges (markers that indicate stories upvoted by others in a social network),[8] Facebook's 50 TB for inbox search, and eBay's 2 PB overall data.

Restrict usage while you scale - makes it more selective and creates demand, but crucially it allows you to grow at your own pace and not lose everything due to some massive outage - been reading too many books :)
Title: Re: Designing Data Intensive Applications
Post by: BoL on April 14, 2011, 05:38:50 PM
I think the problem people have with MySQL is that they don't optimize it for their needs, or don't design the table structure 'a better way'.

Changing key_buffer_size for MyISAM and innodb_buffer_pool_size for InnoDB can result in huge performance gains, as frequently used indexes (and, for InnoDB, row data too) stay in memory and reads avoid touching the disk.
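
As a rough illustration, the Python sketch below works out starting values for the two settings (full names `key_buffer_size` and `innodb_buffer_pool_size`) and prints them as a my.cnf fragment. The percentages are only the commonly quoted rules of thumb for a machine that mainly runs MySQL (about 25% of RAM for the MyISAM key buffer, up to about 80% for the InnoDB buffer pool on a dedicated server); the helper names and the 16 GB figure are assumptions, so measure on your own workload before trusting any of it.

```python
def suggest_buffers_mb(total_ram_mb: int, engine: str) -> dict:
    """Rule-of-thumb buffer sizes in MB for a host that mainly runs MySQL.

    The ratios are assumed starting points, not tuned values.
    """
    if engine == "innodb":
        # Buffer pool caches both data and indexes.
        return {"innodb_buffer_pool_size": total_ram_mb * 80 // 100}
    if engine == "myisam":
        # Key buffer caches MyISAM index blocks only.
        return {"key_buffer_size": total_ram_mb * 25 // 100}
    raise ValueError(f"unknown engine: {engine}")


def to_my_cnf(settings: dict) -> str:
    """Render the settings as a [mysqld] section using the M size suffix."""
    lines = ["[mysqld]"]
    lines += [f"{name} = {mb}M" for name, mb in settings.items()]
    return "\n".join(lines)


# Hypothetical dedicated database server with 16 GB of RAM.
print(to_my_cnf(suggest_buffers_mb(16384, "innodb")))
```

If you run a mix of both engines on one box, the two buffers compete for the same RAM, so the percentages cannot simply be added together.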

Quote
Facebook was LAMP for a long time

I saw a video that talked about how they optimize MySQL and what tracking tools they use to spot problems and bottlenecks ... and how they kill slow queries.
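
I don't know what tooling they actually use, but the general idea of killing slow queries can be sketched in Python like this. It assumes you have already fetched rows shaped like the output of MySQL's SHOW PROCESSLIST (columns Id, Command, Time, Info) and it only builds the KILL QUERY statements rather than running them. The function name, the sample rows and the 30 second threshold are all made up for illustration.

```python
def kill_statements(processlist: list, max_seconds: int) -> list:
    """Build KILL QUERY statements for queries running longer than max_seconds.

    Each row is a dict with the SHOW PROCESSLIST columns Id, Command and Time.
    Sleeping or otherwise idle connections are left alone.
    """
    statements = []
    for row in processlist:
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            # KILL QUERY stops the statement but keeps the connection open.
            statements.append(f"KILL QUERY {int(row['Id'])};")
    return statements


# Hypothetical snapshot of the process list.
sample = [
    {"Id": 11, "Command": "Sleep", "Time": 400, "Info": None},
    {"Id": 12, "Command": "Query", "Time": 95, "Info": "SELECT ... big join"},
    {"Id": 13, "Command": "Query", "Time": 2, "Info": "SELECT ... by primary key"},
]

for statement in kill_statements(sample, max_seconds=30):
    print(statement)
```

In practice you would log what was killed as well, since the queries that keep turning up are the ones that need a better index or a rewrite.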
Title: Re: Designing Data Intensive Applications
Post by: ukgimp on April 14, 2011, 05:54:34 PM
http://highscalability.com/blog/2010/11/4/facebook-at-13-million-queries-per-second-recommends-minimiz.html

Looks like good design and a steady mean response time can win the day
Title: Re: Designing Data Intensive Applications
Post by: Torben on April 14, 2011, 08:20:45 PM
PHP is good enough for Facebook so it's probably good enough for you

>Do NOT build a scalable system for the future at the cost of getting the system up and running.
PHP is a great choice because you can start simple and then grow into a more advanced solution.

An inexperienced programmer can get the worst out of any programming language. In fact the biggest problem with PHP is that there is so much bad code out there.

Make sure you get an experienced PHP programmer from the start.

NoSQL is the new black - for the right job that is. It depends entirely on your application.

Title: Re: Designing Data Intensive Applications
Post by: inbound on April 15, 2011, 03:46:20 AM
Although these books won't give you a definitive answer for your application, they do give you a good grounding in what to think about before you start:

The following 2 books are a good start when thinking of the challenges the back end may throw up:

High Performance MySQL - O'Reilly
Building Scalable Web Sites - O'Reilly

Don't forget the front end:

High Performance Web Sites - O'Reilly
Even Faster Web Sites - O'Reilly

Then there's the OS, CentOS is great if you want a RHEL clone:

Foundations of CentOS Linux - Apress
The Definitive Guide to CentOS - Apress

Finally, Apache can be tweaked in many ways:

The 4 books from the Apache Software Foundation

That's about 3,000 pages! The way I use books is to skim them for interesting sections, then read further to find out things I should research in more depth. Although you could probably find all that info elsewhere, a good set of books can save you a lot of time. There are loads of things that I would have missed if I was just using the web to research stuff.

Very important:

I would caution you about using cloud services when you are concerned about extreme performance - there are many hardware tweaks that will not be available to you if you use a cloud environment (the most obvious of those being the ability to utilise enterprise level Solid State Storage in a physical server - something that cloud servers don't have currently).

I would highly recommend looking at the advantages that decent Solid State Storage can offer (although consumer SSD products can be used creatively to get around the steadily diminishing issues of write latency and wear-out).