C-CWEL Source Matching Work

From CoolWiki
Revision as of 18:21, 7 March 2013 by Rebull (talk | contribs)

Big Picture Introduction -- an analogy that might be too fanciful?

Oldmap1.png
Oldmap2.png

When westerners first discovered the Americas, they had largely set out with the goal of finding gold or other treasure. They were coming at the new continent from the perspective of someone in a boat, with minimal information about what the landforms really were, except for what they could see with their own eyes. Their maps look strange to those of us used to seeing images of these landforms from space, but we have a whole lot more information now than they did then.

The first thing that these early Western explorers were able to attempt to map was the coasts, because that's what they had the most information about...and the most immediate need to know. They needed to know where coral reefs were that might damage their ships, and where the big rivers emptied into the sea so that they could take on more fresh water. They also could learn about more land faster when boating up the rivers rather than walking.

As more and more boats explored the coasts, the maps got better, but they still seem distorted compared to the landforms we know today. In some of these early maps of the Americas, when Europe and Africa were included, even the African coast on the Mediterranean side doesn't look all that realistic, compared to what we know now.

As the westerners pushed further and further inland from the coasts (in the Americas or for that matter Africa), their knowledge deepened about what the continent actually looked like, aided by improvements in technology (such as more accurate ways of measuring longitude). Their knowledge of the land started in clumps around the rivers, again because the rivers were what best enabled them to travel the furthest. But their knowledge expanded as fast as they could expand. And their goals changed too -- certainly some were still looking for treasure (or freedom from persecution, religious or otherwise), but more in the earliest years were just trying to survive (here I'm thinking of Jamestown or Roanoke). They explored to find more food to eat (critters or plants).

The Native Americans, of course, had a perfectly good understanding of what their land looked like, but even so, most likely, one tribe only knew the land near them -- my guess is that the Powhatan tribe (in VA) had no idea whatsoever what the Sioux tribe's lands looked like, even if trade routes were such that items could move from the Dakotas to Virginia. But the Native Americans were observing the land in a different way, having lived there for a while and having their own methods of exploration. Once westerners realized that they could learn from the Native Americans (here I'm thinking of Lewis and Clark) and paid attention to what was already known, their knowledge could expand even faster.

There were, at nearly every stage of these early maps, regions that were sort of hazy and unexplored, e.g., "here be dragons". Someone might have a vague idea of what was there, but maybe only on the edges; no one (at least no one known to the map maker) had any detailed knowledge of what was there.

SO NOW.... here is a perhaps tortured analogy... We would like to go exploring in a particular region, making note of where the big landforms are, and we have a specific goal of finding edible animals and plants to support our efforts in further explorations. Some people have spent some time exploring parts of this region before. It will make our knowledge expand faster if we pay attention to what is already known before setting off on our own journey. Others have mapped different parts of the region using different methods of exploration before -- some on foot and some on horseback, and still others in boats. Some people just ran through this region identifying the big trees and big rocks. Some people wrote down what they learned in Algonquin (what the Powhatan spoke) and we need to translate it before it makes sense in the language we speak. Some people cared a lot about their tiny camp next to the river and they know that region really well, but beyond the borders of their camp, "here be dragons." There is some information about the area around the camp that we can obtain from other people and from exploring on our own, but we can also come back to this very well-known region and check what we think we know about the rest of the area by testing it on the well-known region. The people who know the region right next to the river really well also know that this animal or that plant is good food and won't make us sick. When we go exploring out further from the camp, if we find a critter or a plant that we think is the same as the stuff that the people next to the river know is ok to eat, we can bring it back to that camp to compare it and see if it is, in fact, the same or something new. We can also, among the animals and plants we find, put them in groups of apparently similar things -- these are all 4-footed furry critters, and those have feathers and wings.

Explicitly drawing lines between that analogy and reality:

Analogy: We would like to go exploring in a particular region, making note of where the big landforms are, and we have a specific goal of finding edible animals and plants to support our efforts in further explorations.
Reality: We have a goal of exploring a region on the sky, specifically looking for young stars.

Analogy: Some people have spent some time exploring parts of this region before. It will make our knowledge expand faster if we pay attention to what is already known before setting off on our own journey.
Reality: We need to read and understand the literature.

Analogy: Other people have mapped different parts of the region using different methods of exploration before -- some on foot and some on horseback, and still others in boats.
Reality: Other people have used different wavelengths to explore this region before.

Analogy: Some people just ran through this region identifying the big trees and big rocks.
Reality: Some people just identified the bright young stars, or the things bright in the wavelengths they were using.

Analogy: Some people wrote what they learned down in Algonquin (what the Powhatan spoke) and we need to translate it before it makes sense in the language we speak.
Reality: Some people wrote down poorly constrained coordinates in epoch B1950 coordinates and we need to translate it to accurate J2000 coordinates. (Actually, does not seem to be the case for BRC 38! Lucky us! We still need to match objects from the literature, though.)

Analogy: Some people cared a lot about a tiny patch next to the river and know that region really well, but beyond the borders of their camp, "here be dragons."
Reality: The Choudhury et al. group spent time sorting out the 5'x5' patch with 4-band IRAC coverage, but not the serendipitous data; we will have to use their results but also the rest of the data where possible.

Analogy: There is some information about the area around the camp that we can obtain from other people and from exploring on our own...
Reality: We can comb the literature and use the 2MASS+WISE data to help guide us in exploring the region.

Analogy: ...but we can come back to this very well-known region and check what we think we know about the rest of the area by testing it on the well-known region.
Reality: We can use WISE to identify things with YSO-like colors in the region. Did we rediscover the YSOs that other people found, the Choudhury group using Spitzer data? If not, why not? Do the objects with YSO-like colors look like point sources in 2MASS or do they look like galaxies?

Analogy: The people who know the region right next to the river really well also know that this critter or that plant is good food and won't make us sick. When we go exploring out further from the camp, if we find a critter or a plant that we think is the same as the stuff that the people next to the river know is ok to eat, we can bring it back to that camp to compare it and see if it is, in fact, the same or something new.
Reality: We have a region of space that is very well studied with Spitzer, and some limited serendipitous Spitzer data nearby. We will use WISE over the whole region to find things that we think might be YSOs. We can check our hunch that some of the objects are in fact YSOs by comparing what we get to the Spitzer data where we can, and including that data in our analysis.

Analogy: We can also, among the animals and plants we find, put them in groups of apparently similar things -- these are all 4-footed furry critters, and those have feathers and wings.
Reality: Among the objects we find, we can put them in groups based on the shape of their SED, from 'really embedded' (class 0-I) down to 'not much of an IR excess' (class II-III).

We will not be able to get a comprehensive be-all-end-all understanding of the region (e.g., in the analogy, we will not go straight from Columbus or even Lewis and Clark to weather satellite views of the continent with a GPS in our car as we drive). We can, however, do the best that we can with the information we have, by learning from those who have gone before, learning as we go, and making intelligent guesses about what we don't know.

More specific introduction to source matching from the literature

Several people have studied BRC 38 before, but they have NOT found all the young stars! You tried to make sure we found all of these articles in the context of the proposal. Certainly investigator A working in BRC 38 in year X saw some of the same sources as investigator B working in that same region in year Y, as did investigator C in year Z. Now we actually have to do the work of figuring out which specific sources are which in all the papers -- are the sources called out in paper 1 the same as, or different from, the sources in paper 2? -- and so on, until all the papers are exhausted and we have a complete catalog of all the previously studied sources in the region.

The thing that makes this somewhat complicated is that, even though everyone is reporting in RA and Dec, not everyone is using exactly the same system (though they are all J2000, there are some variations within that; WISE, 2MASS, Spitzer, and Chandra should all be well-matched), and not everyone has the same coordinate accuracies (some are working off of photographic plates, and some are working off large-format CCDs; Ogura+ 2002 is the toughest, but CWAYS already did this for you). And, what does it mean to have "the same" coordinates -- is within an arcsecond ok? 5 arcseconds? an arcminute? This is where it gets tricky, and where you have to apply your brain! Spitzer, WISE, and 2MASS are all using exactly the same, high-accuracy coordinate system -- it's all tied to 2MASS's J2000 coordinates -- but even then the position of the same object will not be EXACTLY the same in each image, in each catalog, because there is a limit to the precision with which we can identify the coordinates. Where possible, we need to update the old coordinates by comparing what the old papers say to the 2MASS data. Then we need to fold in the objects with newer coordinates into our collection of sources.
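To make "within an arcsecond" concrete, here is a minimal sketch of how the separation between two catalog positions can be computed. The function name and the two positions are made up for illustration (they are not real BRC 38 sources), and the coordinates are assumed to already be in decimal degrees.

```python
import math

def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation between two sky positions, in arcseconds.

    Inputs are decimal degrees. The haversine form is used because it
    stays numerically stable for the very small angles we care about.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

# Made-up positions, for illustration only.
old_paper = (325.0000, 58.0000)   # position as typed in from an older paper
two_mass = (325.0000, 58.0005)    # candidate counterpart, 0.0005 deg further north

sep = angular_separation_arcsec(*old_paper, *two_mass)
print(f"separation = {sep:.2f} arcsec")
```

Note that a given offset in RA corresponds to a smaller angle on the sky at high declination (it scales with the cosine of the declination), which is why simply subtracting the RA values is not enough.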

Part of the challenge here is bookkeeping -- writing down coordinates correctly, keeping track of which sources are which, and getting the correct data matched to the correct source.

In 2011, I thought this would be a relatively simple project that could be done before the summer visit. However, that turned out not to be the case. So, I pushed the 2012 team really hard, and they got through a tremendous amount of work prior to their visit. BRC 27 was a far more complex problem than BRC 38, so it should be easier for us. I've pulled out and updated all of my best(?) explanations and descriptions here. IF IT DOESN'T MAKE SENSE, PLEASE ASK QUESTIONS. If this is done wrong, or only halfway done, it will make for a LOT of pain downstream. Trust me.


Venn Diagrams and Bookkeeping

One of the difficulties we will have during this project is keeping all the source lists straight. It happens every year, and I don't know how to make it easier (in part because I am not sure exactly why it's so confusing), except for warning you that it will happen! Here is a Venn diagram explaining, roughly, the various source lists we will have before we are done, at minimum. This Venn diagram is meant to be a "big picture" sort of thing; this page on the source matching is meant to address just, well, the previously identified sources.

Brcvenn1.png

The source lists include:

  • All "bright enough" sources seen in the WISE maps (a conceptual list only)
  • Sources in the WISE catalog of photometry (to which we will add photometry from 2MASS, IPHAS, and Spitzer (and Akari) in the places where we have that data)
  • Sources in this general direction studied by anyone else, ever (the majority of those reported are also YSOs, but not all of them) -- this is the list we are trying to assemble here.

Out of those sets, our ultimate scientific goals mean that we are striving to identify:

  • YSO candidates we select from IR excess
  • YSOs that others identify that do not appear to have an IR excess.

The Venn diagram is even trying to correctly represent the relative sizes of the circles in that "all bright enough sources" ought to be darn close to "sources in the catalog" and that there will be some "sources in this general direction..." not in the regions we care about, and some of those sources that do not have IR excesses.
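If it helps to think of the Venn diagram as something the computer can do, here is a rough sketch using Python sets. All of the source IDs and list names are hypothetical placeholders; the real lists will be much longer and will come from the catalogs and papers.

```python
# Hypothetical source IDs; in practice these would be catalog names or row numbers.
wise_catalog = {"src01", "src02", "src03", "src04", "src05", "src06"}
literature = {"src02", "src03", "src05", "lit_only_01"}   # studied by anyone else, ever
literature_ysos = {"src02", "src05", "lit_only_01"}       # subset the authors called YSOs
ir_excess = {"src02", "src04"}                            # our IR-excess-selected candidates

# YSO candidates we select from IR excess that nobody has reported before.
new_candidates = ir_excess - literature

# YSOs that others identified, present in our catalog, with no apparent IR excess.
lit_ysos_no_excess = (literature_ysos & wise_catalog) - ir_excess

# Literature sources with no counterpart in our catalog (need an explanation).
missing_from_catalog = literature - wise_catalog

print("new candidates:      ", sorted(new_candidates))
print("lit YSOs, no excess: ", sorted(lit_ysos_no_excess))
print("missing from catalog:", sorted(missing_from_catalog))
```

The set arithmetic only works once each real object has exactly one ID shared across all the lists, which is what the source matching on this page is for.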

NOW we are going to work on the list of "Sources in this general direction studied by anyone else, ever".


Brc38venn.png

For this diagram, I tried to spatially represent the concepts behind what we're doing now (on this page), but I admit the circles are not as carefully constructed/laid out as the first one!

Each of the 7 papers studying things in the region of BRC 38 (Ogura et al. 2002, Getman et al. 2007, Beltran et al. 2009, Choudhury et al. 2010, Chauhan et al. 2009, Barentsen et al. 2011, Nakano et al. 2012) looked in the same direction on the sky. Surely, then, they saw some of the same sources as each other, and as what we are seeing. For example, the Getman survey with Chandra and the Choudhury Spitzer data covered comparable, but not identical, regions. Ogura saw some of the same sources that Getman did, but not all of them -- even within the same area, they did not see the same sources, because one survey was X-ray driven, and one was Halpha driven. They *will* see different sources, not only because they're looking at different wavelengths, but also because each survey is not infinitely deep -- the sensitivity of each survey is limited, so it will not see every source in this direction. (For example, Beltran et al. is working in JHK but deeper than 2MASS.) The same holds for every other pair of papers, and for our survey.


The Goal

The goal here is to construct a list that is as clean as possible for each of the objects that these other folks studied, identifying which objects are truly the same between surveys, and identifying which of these objects are ones that those authors thought were actually young stars (as opposed to, e.g., background giants). We also want to carry along each of the relevant bits of information that these other authors provided -- the object is a lot easier to identify as clearly a young object or a contaminant if there is optical data, so if the other authors reported any optical measurements, we should keep track of those and tie them to the correct object in our analysis. We should also make note of any spectral types or other relevant information. The aim of this part of the project is thus:

  • Which objects from paper x are also seen in paper y?

and then, the next step we will take is

  • Which of these objects are seen in the WISE data? (for those that are missing, why are they missing?)
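The first question ("which objects from paper x are also seen in paper y?") can be sketched as a nearest-neighbor search within a matching radius. This is only a sketch under stated assumptions: the catalogs, source names, and the 1 arcsecond default radius are hypothetical, the separation uses a small-angle approximation, and RA wrap-around at 0/360 degrees is ignored (not an issue near BRC 38).

```python
import math

def sep_arcsec(ra1, dec1, ra2, dec2):
    """Small-angle separation in arcsec (inputs in decimal degrees).
    Adequate for offsets of a few arcsec away from the poles."""
    mean_dec = math.radians((dec1 + dec2) / 2.0)
    dra = (ra1 - ra2) * math.cos(mean_dec)
    return math.hypot(dra, dec1 - dec2) * 3600.0

def cross_match(paper_x, paper_y, radius_arcsec=1.0):
    """For each source in paper_x, return the name of the nearest paper_y
    source within radius_arcsec, or None if there is no such source."""
    matches = {}
    for name_x, (ra_x, dec_x) in paper_x.items():
        best_name, best_sep = None, radius_arcsec
        for name_y, (ra_y, dec_y) in paper_y.items():
            sep = sep_arcsec(ra_x, dec_x, ra_y, dec_y)
            if sep <= best_sep:
                best_name, best_sep = name_y, sep
        matches[name_x] = best_name
    return matches

# Made-up catalogs: name -> (RA, Dec) in decimal degrees.
paper_x = {"X1": (325.0000, 58.0000), "X2": (325.0100, 58.0100)}
paper_y = {"Y1": (325.0001, 58.0001), "Y2": (325.0500, 58.0500)}

matches = cross_match(paper_x, paper_y)
print(matches)
```

A source that comes back as None is not necessarily a mistake; as discussed above, the two surveys may simply not have detected the same object.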

The Challenges

This would be an easy task if:

  • everyone provided their original images, either as a figure or as a FITS file
  • everyone worked in the same coordinate system, by which I mean not just "J2000" vs "B1950" but "J2000 tied to 2MASS" as opposed to "J2000 tied to the pulsars seen by NRAO" or "J2000 as calibrated as best I can based on the HST Guide Stars I happen to see in my image".
  • the objects were all greater than 5 arcseconds apart from each other on the sky, such that each source that is detected was cleanly and uniquely detected in each survey.
  • and, of course, if we were guaranteed a match between surveys.

Working backwards up that list...

We've already talked above about how we are not guaranteed a match between surveys, because stars are different brightnesses at different bands, and because the surveys have limited sensitivity.

There are plenty of sources that are very close together. Even among just the YSO candidates, some are very close to each other, closer than 5 arcseconds.
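One way to know in advance which sources are going to cause trouble is to list every pair in a single source list that sits closer than 5 arcseconds. A minimal sketch, with a made-up candidate list and a hypothetical function name:

```python
import math
from itertools import combinations

def close_pairs(catalog, limit_arcsec=5.0):
    """Return (name_a, name_b, separation) for every pair of sources in
    one list that lie closer together than limit_arcsec."""
    pairs = []
    for (name_a, (ra_a, dec_a)), (name_b, (ra_b, dec_b)) in combinations(catalog.items(), 2):
        cos_dec = math.cos(math.radians((dec_a + dec_b) / 2.0))
        sep = math.hypot((ra_a - ra_b) * cos_dec, dec_a - dec_b) * 3600.0
        if sep < limit_arcsec:
            pairs.append((name_a, name_b, sep))
    return pairs

# Made-up YSO candidates: name -> (RA, Dec) in decimal degrees.
candidates = {
    "A": (325.0000, 58.0000),
    "B": (325.0000, 58.0010),   # 3.6 arcsec north of A
    "C": (325.0200, 58.0200),   # far from both
}

for name_a, name_b, sep in close_pairs(candidates):
    print(f"{name_a} and {name_b} are only {sep:.1f} arcsec apart")
```

Any pair flagged this way deserves a look by eye before trusting an automatic match involving either member.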

If we had fabulous coordinates for everything, we could let the computer match them all up and not worry about it. It turns out that for most of these literature surveys in BRC 38, they are pretty well-matched, and the computer's automatic merging gets us a long way. But it's always good to remember that just because the computer says it doesn't mean it's right.

If we had images, we could line them up by eye and identify the same objects in each frame. I don't necessarily mean "line them up in ds9" (which would be the ideal case). But also, you can identify the objects simply by comparison between images they publish and images to which you have access (IRAC, 2MASS, POSS). This is what we are going to have to retreat to, in the tough cases.

Here are some notes on the 7 BRC 38 papers, in roughly chronological order:

  • Ogura et al. 2002 - COORDINATES NEED UPDATING, but CWAYS did this for you. Probably good to at least take a quick look at these, just in case. There are finding charts.
  • Getman et al. 2007 - X-rays. Got all tables. Includes all relevant results from Nisini et al. Coordinates OK because they are from Chandra. Good to remember that Chandra's PSF changes the further you go off-axis -- objects on the edge of the Chandra field have much poorer resolution than those near the center.
  • Beltran et al. 2009 - NIR. Got all tables. Coordinates should be OK.
  • Choudhury et al. 2010 - IRAC, MIPS, optical. Got all tables. Coordinates should be OK.
  • Chauhan et al. 2009 - BVI, NIR, IRAC. Coordinates OK, though with some inconsistencies. NEED TO CHECK FOR BRC 38. Got all tables.
  • Barentsen et al. 2011 - optical (r, i, Ha). Coordinates OK. Got all tables. Would be nice to get a table of everything in the region we care about, not just what they are reporting on. Need to go to the main IPHAS archive.
  • Nakano et al. 2012 - optical (r, i, Ha); IR (AKARI). Coordinates OK. Got all tables, I think. Might as well go and get a table from AKARI of everything in the region, though I think WISE will be more powerful.


The mechanics of what we need to do

Links of interest:

For each of these papers, we need a machine-readable (read as "plain text file that the computer can parse into individual numbers rather than images of numbers or gobbledegook from Microsoft") version of the relevant data tables. This is either:

  • obtained from the journal itself, in which case the data table is typically much longer than we need
  • obtained by typing in the coordinates of the objects in our fields from these older papers and then getting updated coordinates.
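As an illustration of what "machine-readable" buys us, here is a small sketch that reads a plain-text table and converts sexagesimal coordinates (hours:minutes:seconds and degrees:arcminutes:arcseconds) into decimal degrees. The table layout, the row contents, and the function names are all assumptions made up for this example; real journal tables will have their own column layouts.

```python
def ra_to_degrees(ra_str):
    """'hh:mm:ss.s' or 'hh mm ss.s' (hours) -> decimal degrees."""
    h, m, s = (float(p) for p in ra_str.replace(":", " ").split())
    return 15.0 * (h + m / 60.0 + s / 3600.0)

def dec_to_degrees(dec_str):
    """'+dd:mm:ss' or '+dd mm ss' -> decimal degrees.
    The sign is read from the string so that '-00 30 00' stays negative."""
    text = dec_str.strip()
    sign = -1.0 if text.startswith("-") else 1.0
    d, m, s = (float(p) for p in text.lstrip("+-").replace(":", " ").split())
    return sign * (d + m / 60.0 + s / 3600.0)

def parse_table(text):
    """Parse lines of 'name RA Dec' into {name: (ra_deg, dec_deg)}.
    Blank lines and lines starting with '#' are skipped."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, ra_str, dec_str = line.split()
        rows[name] = (ra_to_degrees(ra_str), dec_to_degrees(dec_str))
    return rows

# Made-up rows for illustration only.
table_text = """\
# name  RA(J2000)    Dec(J2000)
obj1    21:40:00.0   +58:15:00
obj2    21:40:36.0   +58:16:30
"""

sources = parse_table(table_text)
for name, (ra, dec) in sources.items():
    print(f"{name}: RA = {ra:.4f} deg, Dec = {dec:.4f} deg")
```

Typing coordinates in by hand is exactly where bookkeeping errors creep in, so it is worth converting back and comparing against the paper for at least a few rows.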

What we need to do for each Ogura+ 2002 BRC 38 object is: type the Ogura+ 2002 coordinates into FinderChart. Get 2MASS images. Overlay a 2MASS catalog. Look at what comes back. Compare to the finding charts in Ogura+ 2002. Identify the object that Ogura was talking about. Make a note of the new coordinates.

The other papers all have pretty good coordinates.

Then we will have a set of UPDATED, HIGH QUALITY coordinates, one list per paper, and we can let the computer run through the lists, finding the matches between papers. We then can generate one file that purports to have one line per literature object, with all the relevant data on that line. The difficulty comes in that inevitably, a few sources during this process end up tied to the same object, or identified in other ways as duplicates or incorrect matches.
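Here is a rough sketch of that "one line per literature object" bookkeeping, including a check for the case where two sources from the same paper end up tied to the same object. The paper labels, source names, and object IDs are hypothetical, and the matching itself is assumed to have been done already.

```python
from collections import defaultdict

# Hypothetical matching result: (paper, source name in that paper) -> object ID.
matched = {
    ("paperA", "a1"): "obj001",
    ("paperA", "a2"): "obj002",
    ("paperA", "a3"): "obj002",   # two sources from one paper on one object: suspicious
    ("paperB", "b1"): "obj001",
    ("paperB", "b2"): "obj003",
}

def build_master(matched):
    """Group literature sources by object: {object_id: {paper: [source names]}}."""
    master = defaultdict(lambda: defaultdict(list))
    for (paper, name), obj in matched.items():
        master[obj][paper].append(name)
    return master

def find_conflicts(master):
    """Objects to which a single paper contributes more than one source."""
    conflicts = {}
    for obj, papers in master.items():
        doubled = {paper: names for paper, names in papers.items() if len(names) > 1}
        if doubled:
            conflicts[obj] = doubled
    return conflicts

master = build_master(matched)
for obj in sorted(master):
    print(obj, dict(master[obj]))      # one line per literature object

conflicts = find_conflicts(master)
print("needs a closer look:", conflicts)
```

A conflict does not say which of the two sources is the right match, only that a human needs to go look at the images.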

The approach above to get updated coordinates for targets works ON THE ASSUMPTION THAT THERE ARE NOT "TOO MANY" SOURCES NEARBY. Some of this gets into resolution issues, which we will talk about in the future. As I say, this should work in MOST cases but not ALL of them.
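One way to turn "not too many sources nearby" into something checkable is to count how many catalog sources fall within the search radius of the old position, and only accept the match automatically when there is exactly one. This is a sketch under assumptions: the 5 arcsecond radius, the catalog entries, and the function names are made up for illustration.

```python
import math

def candidates_within(position, catalog, radius_arcsec):
    """Return a sorted list of (separation_arcsec, name) for catalog
    sources within radius_arcsec of position (decimal degrees)."""
    ra0, dec0 = position
    cos_dec = math.cos(math.radians(dec0))
    found = []
    for name, (ra, dec) in catalog.items():
        sep = math.hypot((ra - ra0) * cos_dec, dec - dec0) * 3600.0
        if sep <= radius_arcsec:
            found.append((sep, name))
    return sorted(found)

def classify(position, catalog, radius_arcsec=5.0):
    """Label a position as 'no match', 'unique', or 'check by eye'."""
    found = candidates_within(position, catalog, radius_arcsec)
    if not found:
        return "no match", found
    if len(found) == 1:
        return "unique", found
    return "check by eye", found

# Made-up catalog: name -> (RA, Dec) in decimal degrees.
catalog = {
    "t1": (325.0000, 58.0000),
    "t2": (325.0000, 58.0008),   # close neighbor of t1
    "t3": (325.0300, 58.0300),   # isolated
}

print(classify((325.0000, 58.0003), catalog))   # two candidates nearby
print(classify((325.0300, 58.0301), catalog))   # one candidate nearby
print(classify((325.1000, 58.1000), catalog))   # nothing nearby
```

The "check by eye" cases are the ones to take back to the images and finding charts.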