Welcome to Hadoop
Hadoop is a platform for distributed computation based on Google's MapReduce (paper).
Jobs are submitted to a JobTracker daemon, which runs once per cluster. The JobTracker does some preprocessing, then divides up the job and passes smaller tasks on to a number of TaskTracker daemons running on other machines (or on the same one). The TaskTrackers continually report their progress to the JobTracker and, when they are done, their results.
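To make the division of labour concrete, here is a minimal single-process sketch of the flow described above, using a word count as the job. It is not Hadoop's API: the function names (split_job, run_task, run_job), the round-robin assignment and the use of plain strings as tracker names are all assumptions made for illustration. A real cluster runs the tasks on separate machines and schedules them differently.

```python
from collections import Counter


def split_job(lines, num_tasks):
    """Divide the input into at most num_tasks chunks of roughly equal size."""
    size = max(1, -(-len(lines) // num_tasks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_task(chunk):
    """What a hypothetical TaskTracker does with one task: count words in its chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def run_job(lines, trackers):
    """Toy JobTracker: split the job, assign tasks round-robin, merge the results."""
    tasks = split_job(lines, len(trackers))
    assignments = {}
    total = Counter()
    for task_id, chunk in enumerate(tasks):
        assignments[task_id] = trackers[task_id % len(trackers)]
        # In this sketch the task runs inline; the tracker name is only recorded.
        total.update(run_task(chunk))
    return dict(total), assignments


if __name__ == "__main__":
    counts, assignments = run_job(["a b a", "b c", "a"], ["tracker-1", "tracker-2"])
    print(counts)
    print(assignments)
```

The per-chunk counting plays the role of the map step and the merge into one total plays the role of the reduce step. Everything else about real scheduling (heartbeats, retries, data locality) is left out.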
There is a distributed filesystem which is available to all the Hadoop nodes. This filesystem is run by a NameNode (another daemon, usually run on the same machine as the JobTracker). Files are broken up into Blocks, which are replicated and spread out among many DataNodes. The NameNode handles requests for files and works out which DataNodes hold the blocks needed for each request. The requested Blocks are then served by those DataNodes.
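The bookkeeping described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class ToyNameNode, its methods, the tiny block size, the replication factor of 2 and the round-robin placement are all hypothetical and do not reflect how Hadoop actually chooses DataNodes or stores blocks.

```python
def split_into_blocks(data, block_size):
    """Break a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it records which DataNodes hold each block."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.block_map = {}  # filename -> list of (block_id, [datanode names])
        # Stand-in for the DataNodes' own disks: datanode name -> {block_id: bytes}
        self.storage = {name: {} for name in self.datanodes}

    def put(self, filename, data, block_size=4):
        """Split a file into blocks and place each block on `replication` DataNodes."""
        entries = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            block_id = f"{filename}#{index}"
            holders = [
                self.datanodes[(index + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in holders:
                self.storage[node][block_id] = block
            entries.append((block_id, holders))
        self.block_map[filename] = entries

    def locate(self, filename):
        """Answer a file request with the block ids and the DataNodes holding them."""
        return self.block_map[filename]

    def read(self, filename):
        """Reassemble the file by fetching each block from its first listed holder."""
        return b"".join(
            self.storage[holders[0]][block_id]
            for block_id, holders in self.locate(filename)
        )


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3"])
    namenode.put("notes.txt", b"hello world")
    print(namenode.locate("notes.txt"))
    print(namenode.read("notes.txt"))
```

The point of the sketch is the separation of roles: the name node object holds only the mapping from files to block locations, while the block contents live with the data nodes.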
A description of some of the files the system uses can be found here.
Links
Official Links
- Hadoop's home page
- Hadoop's license
- Hadoop's Wiki; mainly development updates.
- Hadoop's API/Code documentation; these are just the javadocs, but useful if you don't feel like reading the actual source.
- Apache, Hadoop's project sponsor
- Lucene, Hadoop's parent project; less related to Hadoop, and more related to the free search engine Nutch.