Parallel course: Researchers help ease transition to parallel programming

October 23, 2009 by Larry Hardesty Parallel course

Enlarge

An image of a Quad-Core AMD Opteron processor. Courtesy of AMD

(PhysOrg.com) -- In 1995, a good computer chip had a clock speed of about 100 megahertz. Seven years later, in 2002, a good computer chip had a clock speed of about three gigahertz -- a 30-fold increase. And now, seven years later, a good computer chip has a clock speed of... still about three gigahertz.

Four or five years ago, chip makers realized that they couldn't make chips go much faster, so they adopted a new strategy for increasing computers' power: putting multiple "cores," or processing units, on each chip. In theory, a chip with two cores working in parallel can accomplish twice as much as a chip with one core. But software developers tend to see their programs as lists of sequential instructions, and they've had trouble breaking up those instructions in ways that take advantage of parallel processing. Professor of Computer Science and Engineering Saman Amarasinghe and his colleagues at MIT's Computer Science and Lab are giving them a hand.

In the past, computer science researchers had hoped that sequential programs could be converted into parallel programs automatically. "I spent a good part of my life trying to do that," says Amarasinghe. But Amarasinghe has now come to the conclusion that "if you want to get parallel performance, you have to start writing parallel code." At a high level, he believes, programmers will need to specify which tasks performed by their programs can run concurrently.

"Just writing anything parallel doesn't mean that it's going to run fast," Amarasinghe says. "A lot of parallel programs will actually run slower, because you parallelize the wrong place." And determining when to parallelize, and which cores to assign which tasks, is something that computers can do automatically, Amarasinghe believes.

Amarasinghe divides his group's work on multicore computing into two categories: tools to ease programmers' transition to parallel programming and tools to improve programs' performance once that transition is complete.

Predictable parallelism

The first set of tools addresses one of the central difficulties of parallel programming: if several tasks are assigned to separate cores, there's no way to be sure which will be completed first. Most of the time, that's not a problem. But occasionally, different cores have to access the same resource — updating the same stored value, for instance, or sending data to the monitor. Depending on which core reaches the resource first, the program's behavior could be quite different. Even given the exact same inputs, a multicore program may not execute in quite the same way twice in a row.

That's a nightmare for developers trying to debug their programs. To find a bug, developers run their programs over and over, with slight variations, to localize problems to ever-smaller blocks of code. But those variations don't convey any useful information unless the rest of the program behaves in the exact same way.

Amarasinghe and his grad students Marek Olszewski and Jason Ansel designed a system to make multicore programs more predictable. When a core tries to access a shared resource, it's assigned a priority based not on the time of its request but on the number of instructions it's executed. At any point during a parallel program's execution, any two cores will have executed about the same number of instructions. But if one core has had to, say, access a little-used value in a distant part of the computer's memory, it might have fallen slightly behind. In those cases, the researchers' system gives it a chance to catch up. Once it's reached the same instruction count, if it hasn't issued a higher-priority request for the shared resource, the first core's request is granted.

That waiting around means, of course, that the program as a whole will run more slowly. But in experiments, the researchers determined that, on average, a software implementation of their system increased program execution time by only about 16 percent. For developers, that's a small price to pay for reliable debugging. Moreover, if the ability to count instructions were built into the cores themselves, the system would be much more efficient. Indeed, Intel — which has demonstrated eight-core chips at trade shows — is talking with Amarasinghe about funding further research on the system.

"A lot of other people have this alternative approach," says Krste Asanovic, an associate professor at the Parallel Computing Lab at the University of California, Berkeley. "You kind of just run it [the parallel program] however it runs and then try and record what it did so then you can go back and replay it." But with the MIT system, he says, "You don't have to worry about recording how it executed because when you execute it, it will always run the same way. So it's strictly more powerful than replay."

Maximizing multicore

Amarasinghe's lab has two projects that fit into his second category — tools that optimize the performance of parallel programs. The first deals with streaming applications, such as video broadcast over the Web. Before your computer can display an Internet video, it needs to perform a slew of decoding steps — splitting the data stream into parallel signals, performing different types of decompression and color correction, recombining the signals, performing motion compensation, equalizing the signal, and so on. Traditionally, Amarasinghe says, video software will take a chunk of incoming data, pass it through all those decoding steps, and then grab the next chunk.

That approach, says Amarasinghe, squanders the inherent parallelism of signal-processing circuits. As one chunk of data exits each decoding step, the next chunk of data could be entering it. StreamIt, a language developed by Amarasinghe's group, allows programmers to specify the order of the steps but automatically determines when to pass how much data to each step.

"If you have some kind of specification document, they're all written in block diagrams" that illustrate a signal-processing path as a series of sequential steps, Amarasinghe says. "When they write to [the programming language] C, you lose those blocks." StreamIt, by contrast, "is very close to the original thinking," Amarasinghe says.

"Other people had talked about stream programming, but StreamIt was really putting everything together in a language and compiler tool chain so people could actually write streaming programs," says Asanovic. "So I think the language design and compiler tool chain that followed from that was pretty influential."

The group's other parallel-computing project helps programs adapt on the fly to changing conditions. Often, Amarasinghe explains, a programmer has several ways of tackling a particular task. There are, for example, many common algorithms for sorting data, with names like quicksort, insertion sort, radix sort, and the like. But the algorithms perform differently under different circumstances. Quicksort is often the best choice, but not "when the data is very small," Amarasinghe says. With huge data sets, however, radix sort may work even better than quicksort.

That variability is compounded when the algorithms are being assigned to different cores. So Amarasinghe's group has designed another language that asks the developer to specify four or five different methods for performing a given computational task. When the program is running, the computer automatically identifies the most efficient method.

It might seem that that approach would lead to bloated programs. But Amarasinghe points out that when a typical program executes, it devotes much more memory to storing data than it does to storing instructions. "Something like a sort is, like, five lines of code that work on arrays with billions of elements. So making five lines of code 25 is not a big issue," he says.

It does require some added work on the programmer's part. But "there's no free lunch: you won't get nice multicore performance by doing nothing," Amarasinghe says. "Believe me, we've been trying for 50 years."

Provided by Massachusetts Institute of Technology (news : web)


   
Rate this story - 4.5 /5 (16 votes)

Rank Filter

Move the slider to adjust rank threshold, so that you can hide some of the comments.


Display comments: newest first

  • googleplex - Oct 23, 2009
    • Rank: not rated yet
    I heard from a buddy that the issue with increasing clock speed above 3GHz is the electronic signal behaves more like an EM wave than an electrical pulse. Already wiring corners must be rounded or the singal leaves the board. This implies that the wiring is more like a wave guide than an electrical wire. Think of cable TV (coax)versus phone lines (copper pairs.
    The article should make clarify the difference between hyperthreading and a true parallel processor.
  • tpb - Oct 23, 2009
    • Rank: not rated yet
    quote:
    it's assigned a priority based not on the time of its request but on the number of instructions it's executed.

    Unfortunately it's not that simple because the parallel code sections may be accessing level 1, 2, or 3 cache memory or external dram so that the number of instructions executed has little to do with the time to execute them.
  • sender - Oct 23, 2009
    • Rank: not rated yet
    The inherent problem is in the interpreter rather than being an active runtime environment that handles functional objects as i/o applications it treats them as linear sequential scripts. An intelligent interpreter seeks the in's out's and id's of functions and connects the code between as relational databases do.

    The second major hurdle is pushing the front side bus to keep up with CPU instructions reducing buffer and increasing hardware access time.
  • HyperAnomaly - Oct 23, 2009
    • Rank: 5 / 5 (1)
    This is my understanding of the ~3 GHz limitation: for about 40 years fundamental circuit design has relied on of Kirchhoff's current law (i.e. sum(I_i)=0 for a circuit node), but this is an approximation of Maxwell's laws where drho/dt=0. This translates to the approximation holding when f>>c/L where L is the size of a circuit element. Using L = 1cm we get 3e10 Hz, so this starts to break down around 3e9 Hz.
  • dirk_bruere - Oct 23, 2009
    • Rank: not rated yet
    Why did they pick an example of parallel programming for which video cards with hundreds of cores are already available? What programs need massive parallelism that are not catered for by current GPU cores?
  • Magus - Oct 23, 2009
    • Rank: 5 / 5 (1)
    What programs need massive parallelism that are not catered for by current GPU cores?

    When I work on video games, I like the AI path-finding and the AI decision making to run at the same time since path-finding can be tedious when traversing a 3d environment. Physics engines is also something to have on a separate core. Render engine to run on its own core.
  • NickJohnson - Oct 23, 2009
    • Rank: not rated yet
    I can see some problems with the implementation of a cycle count for each processor. The operating system needs to be able to change which processors run which threads over the entire system; this will choke if the processors are not running on the same threads at the same time. Intel would also have to change iret to not have a miscount due to interrupt handlers, which would break all existing OSes.
  • plasticpower - Oct 24, 2009
    • Rank: not rated yet
    What programs need massive parallelism that are not catered for by current GPU cores?

    When I work on video games, I like the AI path-finding and the AI decision making to run at the same time since path-finding can be tedious when traversing a 3d environment. Physics engines is also something to have on a separate core. Render engine to run on its own core.


    Exactly. Path finding absolutely MUST run on a separate core, or at the very least in a separate thread. That is, if you want your FPS to stay above 30 and not just stop completely while your algorithm iterates through the available options. Even with something like A* you still can be iterating through a ton of data, often recursively. You would definitely prefer that to run in parallel.
  • Buyck - Oct 24, 2009
    • Rank: not rated yet
    I must say the standaard clock speeds of today are still very low. Indeed around 3Ghz or a little bit more. But new technology's like more cores or new materials and technology's we will be able to reach a 1 Thz and more. Lets say the next 5 years we can achieve 10 Ghz or more with 15nm and beyond.
  • dhughes - Oct 24, 2009
    • Rank: not rated yet
    >in 2002, a good computer chip had a clock speed of about >three gigahertz

    If that's referring to the average home system, in 1993 or 1994 I bought a computer with an Intel 486DX2 66MHz, I remember the Pentium had just or was about to be released and I think it was around 100MHz, but it was too expensive for me (my 486 system cost $3,500!!).

    In 2002 I built an AMD system with a 1.8GHz Athlon, I don't think it was top of the line but it was certainly a very good system. I don't think there was a 3.0GHz CPU available to the average home user in 2002, I doubt 3.0GHz CPU existed in 2002, I don't think there was even a PII or even a PIII CPU at a 3GHz clock speed, even the P4 barely passed 3GHz just a few years ago.
  • Nik_2213 - Oct 24, 2009
    • Rank: not rated yet
    Funny thing: I've just grabbed my copy of 'Programming in Occam' off the shelf. That was a language designed to run on multiple Transputer cores, and pipeline each thread according to available resource. Slow, fast, one, four, sixteen, twenty, whatever mix you got, it could use without re-compiling...

    The (c) date ? 1987...

October 23, 2009 all stories

Comments: 11

4.5 /5 (16 votes)

  • hide
  • Related Stories

  • 'Not so fast, supercomputers,' say software programmers
    created May 22, 2007 | popularity not rated yet | comments 0
  • Stanford, tech giants team up to enable software for parallel computers
    created May 01, 2008 | popularity not rated yet | comments 0
  • MIT, IBM team up on first PlayStation 3 course
    created May 08, 2007 | popularity not rated yet | comments 0
  • Next-generation, high-performance processor unveiled
    created Apr 24, 2007 | popularity not rated yet | comments 0
  • New approach eliminates software deadlocks using discrete control theory
    created Dec 02, 2008 | popularity not rated yet | comments 0



  • hide
  • Relevant PhysicsForums posts

  • Computer 5V or 0V output to Sensaphone Express II
    created Feb 04, 2010
  • Ti-89 ROM Image
    created Jan 29, 2010
  • TV ads
    created Jan 29, 2010
  • Apple introduces latest iNonsense
    created Jan 27, 2010
  • cheap scientific calculator that does matrix operations
    created Jan 27, 2010
  • Power consumption: Residential vs. Commercial
    created Jan 22, 2010
  • More from Physics Forums - Computing & Technology

Other News

The power of 'random'

The power of 'random': 'Seemingly loopy' technique could dramatically improve communications networks

Technology / Computer Sciences

created 15 hours ago | popularity 4.8 / 5 (8) | comments 5 | with audio podcast

A radical new approach to the design of communications networks, called "network coding," promises to make Internet file sharing faster, streaming video more reliable, and cell-phone reception better -- among ...


'Revolutionary' water treatment units on their way to Afghanistan

Technology / Engineering

created 10 hours ago | popularity 4.2 / 5 (5) | comments 3 | with audio podcast

The United States Army has taken delivery of the first two units of a "revolutionary" waste-water treatment system that will clean putrid water within 24 hours and leave no toxic by-products, according to scientists at Sam ...


Imec and Holst Centre achieve breakthrough in battery-less radios

Imec achieves breakthrough in battery-less radios

Technology / Semiconductors

created 10 hours ago | popularity 4.9 / 5 (9) | comments 0 | with audio podcast

At today's International Solid State Circuit Conference, Imec and Holst Centre report a 2.4GHz/915MHz wake-up receiver which consumes only 51΅W power. This record low power achievement opens the door to battery-less ...


Handling emergencies online

Technology / Internet

created 6 hours ago | popularity not rated yet | comments 0

Online social networking sites could solve many problems plaguing information dissemination and communications when disaster strikes, according to a report from US researchers in a recent issue of the International Journal of ...


Android

Google developing a translator for smartphones

Technology / Software

created 16 hours ago | popularity 4.7 / 5 (7) | comments 3 | with audio podcast report

(PhysOrg.com) -- Google is developing a translator for its Android smartphones that aims to almost instantly translate from one spoken language to another during phone calls.