How peer-to-peer file-sharing networks work
BitTorrent is a tremendously popular peer-to-peer file-sharing protocol designed to simplify and speed up the process of transferring large files over the Internet while drastically limiting the bandwidth consumption of any one server.
In a conventional file-transfer process, a file is stored on a server on a network such as the Internet. Other computers on the network send messages to the server, informing it that they would like to copy that file. When the two sides establish a connection, the other computers become clients to the server. As the number of clients increases, so do the demands on the server. And while each client might consume only a little bandwidth, the server can consume tremendous amounts. To reduce costs and prevent the server from crashing, the server’s owner will typically constrain the speed at which each client is allowed to download data or even limit the number of clients that can be served at one time.
In the original BitTorrent, one computer acted as a tracker to coordinate the peer-to-peer file-transfer process. The tracker maintained a list of which computers on the Internet were in the process of uploading or downloading pieces of the seed file. A trackerless BitTorrent system eliminates this central computer by distributing the tracker data amongst the swarm participants.
Napster and Gnutella
Peer-to-peer file sharing eliminates the need for a central server to host files. The original Napster, however, still relied on a central server to keep track of connected computers and the files available on them. That’s how the service ran afoul of copyright law and was eventually forced to shut down: Napster’s servers didn’t store copyrighted material, but the courts decided that the company was liable for contributory and vicarious copyright infringement because it knowingly facilitated infringement by its users.
While the Napster lawsuit was underway, another peer-to-peer network named Gnutella sprang up and completely eliminated the centralized server. When you launch a Gnutella client, it immediately searches the Internet for other computers running Gnutella clients. Each of these peers is called a node. When you initiate a file search, the Gnutella client queries each node to determine if it’s hosting the file you’re looking for. If these nodes don’t have the file you’re searching for, they’ll send queries to the nodes they’re connected to. The node that does have the file will send a response message back to the node that initiated the search, and the user can then decide whether or not to download it.
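The query-forwarding behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the overlay network, the file names, and the hop limit (`ttl`) are hypothetical, and the function models the search in memory rather than speaking the actual Gnutella wire protocol.

```python
from collections import deque

def flood_search(network, files, start, wanted, ttl=3):
    """Return the nodes hosting `wanted` that a flooded query reaches.

    network: dict mapping node -> list of directly connected nodes
    files:   dict mapping node -> set of file names that node hosts
    ttl:     assumed hop limit so the query does not spread forever
    """
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, hops = queue.popleft()
        if node != start and wanted in files.get(node, set()):
            hits.append(node)  # this node would answer the searcher
            continue           # a node that has the file does not forward
        if hops == ttl:
            continue           # hop limit reached, stop forwarding
        for neighbour in network.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return hits

# Hypothetical five-node overlay: A knows B and C, both know D, D knows E.
network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
files = {"D": {"song.ogg"}, "E": {"song.ogg"}}
found = flood_search(network, files, "A", "song.ogg")
```

Here the query from A reaches D by way of B, and D answers instead of passing the query on to E, so `found` contains only D.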
Gnutella has two significant shortcomings. First, it relies on file transfers between just two peers. Since the most common means of consumer Internet access (cable and DSL) use asymmetric connections, in which download speeds are much higher than upload speeds, the peer downloading the file is limited to whatever speed the uploading peer is capable of. Second, it depends on users to reciprocate, but it can’t force anyone who downloads files from other people’s computers to allow others to download files from their machines. Netiquette frowns on downloading without sharing, a practice known as leeching, but Gnutella can’t prevent it.
BitTorrent cleverly avoids the legal and practical problems associated with peer-to-peer file-sharing networks like Gnutella and the original Napster. It allows one peer to rely on several others for file transfers, rendering the process both faster and cheaper for all the peers involved, and it has a reward system that encourages user reciprocation.
Rather than establish a relationship between just two peers, the BitTorrent protocol simultaneously gathers pieces of a file from several peers that already have the file or that are in the process of obtaining it. It then downloads these pieces to your computer and reassembles them on your hard drive when all the pieces have been acquired.
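As a rough sketch of that idea, the Python below gathers each piece from whichever peer holds it and joins the pieces once all have arrived. The peer names and piece contents are made up for illustration, and the loop fetches pieces one after another, whereas a real client would request them from several peers at the same time.

```python
def download_from_swarm(peers, num_pieces):
    """Collect every piece from some peer that has it, then reassemble.

    peers: dict mapping a peer name to {piece_index: bytes} for the
           pieces that peer currently holds (hypothetical in-memory swarm)
    """
    collected = {}
    sources = {}
    for index in range(num_pieces):
        holders = [name for name, pieces in peers.items() if index in pieces]
        if not holders:
            raise LookupError(f"no peer in the swarm has piece {index}")
        # Prefer the holder we have asked least often, to spread the load.
        used = list(sources.values())
        chosen = min(holders, key=used.count)
        collected[index] = peers[chosen][index]
        sources[index] = chosen
    # Reassemble only after every piece has been acquired.
    data = b"".join(collected[i] for i in range(num_pieces))
    return data, sources

swarm = {
    "seed":   {0: b"Bit", 1: b"Tor", 2: b"rent"},  # has the whole file
    "peer_a": {0: b"Bit", 2: b"rent"},             # still downloading
    "peer_b": {1: b"Tor"},                         # still downloading
}
data, sources = download_from_swarm(swarm, 3)
```

In this toy swarm each of the three pieces ends up coming from a different peer, so no single machine carries the whole transfer.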
The BitTorrent protocol depends on at least one peer making the entire file available to the network; this is known as the “initial seed.” As other peers begin downloading this seed file, they simultaneously upload pieces of the file to other peers that are looking for it. Each peer is encouraged to continue making the file available after they’ve downloaded it in its entirety, in effect creating additional seeds. A BitTorrent client can facilitate this with a tit-for-tat scheme that rewards reciprocation by giving preference to peers that send data back.
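One simple way to picture the tit-for-tat preference is a ranking of peers by how much data each has recently sent us. The sketch below assumes a fixed number of upload slots and uses invented rate figures. The function name `choose_unchoked` is hypothetical, and real clients layer further refinements on top of this basic idea.

```python
def choose_unchoked(upload_rates, slots=3):
    """Pick the peers we will upload to, favoring those that send us the most.

    upload_rates: dict mapping peer name -> bytes per second recently
                  received from that peer (illustrative numbers only)
    slots:        assumed number of peers we upload to at once
    """
    # Highest rate first; ties are broken by name to keep the order stable.
    ranked = sorted(upload_rates, key=lambda peer: (-upload_rates[peer], peer))
    return ranked[:slots]

rates = {"peer_a": 120_000, "peer_b": 0, "peer_c": 45_000, "peer_d": 300_000}
favored = choose_unchoked(rates, slots=2)
```

With two slots, `peer_d` and `peer_a` are served first, while `peer_b`, which has sent nothing back, is left waiting.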
To share a file, the user first creates a smaller file, called a “torrent,” that contains metadata about the file and the “tracker” computer that will coordinate the file distribution. The metadata inside the torrent file varies according to the BitTorrent client that created it, but the file will have an “announce” section that specifies the tracker computer’s URL, and an “info” section containing file names, file sizes, and a hash code for each piece of the file (more on this later).
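To make the layout of a torrent file more concrete, here is a minimal sketch that builds the “announce” and “info” sections as a Python dictionary and serializes it with a small bencoding routine (bencoding is the encoding used by torrent files). The tracker URL, file name, and contents are hypothetical, the key names inside `info` follow the published protocol specification as best understood here, and SHA-1 is used as the piece hash.

```python
import hashlib

def bencode(value):
    """Encode ints, strings, bytes, lists, and dicts in bencoded form."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings written in sorted order.
        pairs = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        pairs.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")

content = b"hello world"  # stand-in for the shared file
torrent = {
    "announce": "http://tracker.example.com/announce",  # hypothetical tracker
    "info": {
        "name": "example.bin",
        "length": len(content),
        "piece length": 65536,
        # One 20-byte SHA-1 digest per piece; this file fits in one piece.
        "pieces": hashlib.sha1(content).digest(),
    },
}
encoded = bencode(torrent)
```

Writing `encoded` to disk would produce a small metadata file of the kind described above. A real client would also need a decoder to read such files back.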
Any peer that wants to download the file must first download the torrent file associated with it. The torrent file points the peer’s client to the appropriate tracker, which in turn tells the peer which other peers are currently uploading or downloading the file. All the peers actively engaged in sharing a particular file are referred to as a “swarm.” In general, the more peers in the swarm, the faster each peer will be able to download the file. In a conventional client-server relationship, a file in high demand can be slow to download because it strains the servers hosting the file. With BitTorrent, a file’s popularity actually tends to increase the speed at which it can be downloaded.
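The tracker’s bookkeeping role can be illustrated with a tiny in-memory model. This is a sketch only: the class name, method name, torrent identifier, and addresses are all hypothetical, and a real tracker answers requests over the network rather than through direct method calls.

```python
class Tracker:
    """Toy tracker that remembers which peers are in each swarm."""

    def __init__(self):
        self.swarms = {}  # torrent identifier -> set of peer addresses

    def announce(self, torrent_id, peer):
        """Register `peer` in the swarm and return the other known peers."""
        swarm = self.swarms.setdefault(torrent_id, set())
        others = sorted(swarm - {peer})
        swarm.add(peer)
        return others

tracker = Tracker()
first = tracker.announce("example-torrent", "10.0.0.1:6881")
second = tracker.announce("example-torrent", "10.0.0.2:6881")
third = tracker.announce("example-torrent", "10.0.0.3:6881")
```

The first peer to announce learns of no one, while each later arrival receives the addresses of everyone already in the swarm and can begin exchanging pieces with them.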
The peer that creates the torrent breaks the file into fixed-size chunks (the last one may be shorter), typically between 64KB and 4MB each, and creates a checksum for each chunk using a hashing algorithm. When another peer receives a chunk, it computes that chunk’s checksum and compares the result with the checksum recorded in the torrent file to verify the chunk’s integrity.
A “trackerless” BitTorrent system has no central computer coordinating the file sharing; instead, every peer acts as a tracker. In this case, the BitTorrent client employs a distributed hash table to keep track of the location of the initial seeds, checksums, and peers actively engaged in the swarm.
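The text does not say which distributed hash table design is used, so the following is only a sketch under the assumption of a Kademlia-style table. In that design, every node and every torrent key is a large number, and the nodes whose identifiers are “closest” to a key under XOR distance are responsible for remembering the peers in that swarm. The node names and helper functions here are hypothetical.

```python
import hashlib

def node_id(name):
    """Derive a 160-bit identifier from a name (illustrative only)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

def closest_nodes(key, nodes, count=2):
    """Return the `count` nodes whose identifiers are nearest to `key`.

    Distance is the XOR of the two identifiers, as in Kademlia-style tables.
    """
    return sorted(nodes, key=lambda name: node_id(name) ^ key)[:count]

nodes = [f"node-{i}" for i in range(8)]
torrent_key = node_id("example-torrent")  # stand-in for a torrent's key
responsible = closest_nodes(torrent_key, nodes)
```

A peer joining the swarm would ask the `responsible` nodes for the list of other peers and add itself to that list, which is the job a central tracker would otherwise perform.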