Tuesday, December 23, 2014

Astronomers - Where are we going and who will we be when we get there?

Let’s talk about the US Bureau of Labor Statistics report on Physicists and Astronomers.

Here’s the link:
http://www.bls.gov/ooh/life-physical-and-social-science/physicists-and-astronomers.htm

    First note the median pay of a physicist or astronomer: $106,360 per year. Honestly I find this number laughably large. You can scroll down and find that the breakdown is actually physicists’ median = $106,840, while astronomers’ median = $96,460, which is still large, in my opinion. Let’s just focus on astronomers here. I think it is a bit misleading to lump physicists and astronomers together; I haven’t heard of many people switching between these two research fields — not without a great deal of effort.

    The “median”, by the way, is a kind of average of a set of numbers. It is defined as the number in the middle when you list the set in increasing order. You may be more familiar with the mean (the sum of the set divided by the number of items in it). The median is less susceptible to being dragged up or down by large outliers in the group. For example, the median of 1, 5, and 10 is 5. If the set were 1, 5, and 10000, the median would still be 5, because it is still the middle number. The mean, on the other hand, would change dramatically.
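    If you want to check this for yourself, a couple of lines of Python will do it:

```python
# Median vs. mean on the sets from the example above.
from statistics import mean, median

small = [1, 5, 10]
skewed = [1, 5, 10000]

print(median(small))          # 5
print(median(skewed))         # 5 - the outlier doesn't move the median
print(round(mean(small), 2))  # 5.33
print(round(mean(skewed), 2)) # 3335.33 - the outlier drags the mean way up
```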

    So I wonder - what is going on here? Surely they are not counting post-docs among this group of astronomers.

    Let’s go to the “Work Environment” tab. Here we find that in 2012 there were 2,700 astronomer jobs (and 20,600 physicist jobs). This immediately explains why the overall median salary is biased toward a higher number - the physicists make more money and there are more of them than astronomers. But still - 2,700 is a low number of jobs. I suspect that they are only counting faculty; let’s check the breakdown:

54% of astronomers work in colleges, universities, and professional schools.
21% work in research and development in the physical, engineering, or life sciences.
19% are employed by the federal government.


    This leaves about 6% unaccounted for. So, a little more than half of astronomers are actually faculty members. If anything, that number seems low to me (assuming post-docs and grad students are excluded). About 1/5 of astronomers are employed in research and development - presumably outside of colleges, universities, and government entities. I suppose this could include observatory staff astronomers, but my perception is that they make up more like 10% rather than 20% of the community. Is it possible that the post-docs are included in the 21% category? Post-docs are mentioned in the “How to Become One” tab:

“Many physics and astronomy Ph.D. holders who seek employment as full-time researchers begin their careers in a temporary postdoctoral research position, which typically lasts 2 to 3 years. During their postdoctoral appointment, they work with experienced scientists as they continue to learn about their specialties or develop a broader understanding of related areas of research. Their initial work may be carefully supervised by senior scientists, but as they gain experience, they usually do more complex tasks and have greater independence in their work.”


    All true, but let’s not keep post-docs down. In my opinion, a Ph.D. graduate is fully qualified to become a faculty member straight away. That’s not true for everyone, but for many I think it is. The fact that post-doc positions are common now is due to there not being enough permanent faculty spots to go around. There’s an oversupply of qualified people, and an under-supply of positions.
    As an aside, there is probably demand for a whole new set of faculty; there are plenty of undergraduate students who want to learn physics and astronomy. There just isn’t enough funding to support those faculty. And everyone loses a little bit. Write your senators, folks.

“Pay” tab.

“The median annual wage for astronomers was $96,460 in May 2012. The lowest 10 percent earned less than $51,270, and the top 10 percent earned more than $165,300.”


    If you are lucky enough to get a job offer in astronomy, use this information to negotiate your salary. Be a little careful when quoting these numbers, though, because the breakdown shows a stark contrast between university employees and federal employees.

    Federal employees are the highest paid with a median of $139,000 per year, but only make up 19% of astronomers. University and college faculty, while making up the majority of astronomers, are only making a median of ~$78,000 per year. Still useful info for that initial salary negotiation.

    So realistically if you get to be a tenured faculty member, you still aren’t going to make the overall median salary, because the median is being brought up by the federal employees — despite the median being robust against outliers (~20% is not an outlier). 

    Let’s take a look at the “Job Outlook” tab, and here’s the kicker. During the 10-year period from 2012-2022, the federal government expects the growth rate for astronomers to be 10%. With a population of 2,700, that means we can expect 270 new jobs in 10 years. That’s probably not counting the replacement of people who retire, so let’s be generous and throw those in there too. Call it 350 jobs over 10 years. That’s really generous, because the number actually projected on the website is 300 jobs. Also note that the projected employment in 2022 is 2,900, which is 200 greater than 2,700, not 300 as listed in the “Numeric” column. Hey Bureau of Labor Statistics, 200/2,700 is 7.4%, not 10% - can we get a little more consistency here? Lots of post-docs’ careers depend on these numbers. For being a bureau of statistics, they are remarkably inconsistent.
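    The inconsistency is easy to verify yourself; here’s the arithmetic spelled out (the numbers are the ones quoted from the BLS page):

```python
# Sanity-checking the BLS "Job Outlook" figures for astronomers.
employment_2012 = 2700
employment_2022 = 2900   # projected employment
listed_change = 300      # the "Numeric" column on the BLS page
listed_rate = 0.10       # the stated 10% growth rate

actual_change = employment_2022 - employment_2012
actual_rate = actual_change / employment_2012

print(actual_change)          # 200, not the listed 300
print(round(actual_rate, 3))  # 0.074, i.e. 7.4%, not the listed 10%
```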

    You know what? There are probably more than 300 post-docs this year looking for permanent jobs. Good luck. You might get hired sometime during the next 10 years.

    If you are interested, you might consider browsing some other occupations described by the Bureau of Labor Statistics for comparison. For example, “computer and information research scientists” numbered around 27,000 (ten times that of astronomers) in 2012, and are expected to grow by about 4,000 (15%) between 2012 and 2022. Hey, they also make a median salary of $102,000 per year (~20-30k more than the majority of astronomers). Turns out they have a lot of overlapping skills, too. Just something to think about. 


Information and quotes gathered from:
Bureau of Labor Statistics, U.S. Department of Labor, Occupational Outlook Handbook, 2014-15 Edition, Computer and Information Research Scientists, on the Internet at http://www.bls.gov/

Thursday, December 11, 2014

Visiting Yale

My current fellowship is sponsored by the Chilean national government. In particular I hold a FONDECYT Postdoctoral Fellowship. FONDECYT is an acronym for Fondo Nacional de Desarrollo Científico y Tecnológico, which translates to National Fund for Scientific and Technological Development. Naturally, I live and work in Concepción, Chile.

But, thanks to my swank fellowship, I get to be my own boss and decide when and where I go for collaboration. It happens that my current supervisor has close ties to Yale University, and so I’ve extended my network to Yale as well. It also happens that I am part of a two body academic orbit, so coming in close toward Yale is particularly rewarding on a personal level as well as an academic/professional level.

Right now it is early December in New Haven, CT, and I’m in for a nice snowy New England winter; I am staying through the month of February. On my to do list while I am here is to reduce and analyze some spectroscopy of a sample of Narrow-Line Seyfert 1 galaxies. These galaxies host active galactic nuclei (AGN), and are of interest to me because we expect them to have black holes that are smaller than those of other kinds of AGN. We can learn some really neat things about the connections between the AGN and their hosts by studying NLS1s.

I really like the department here. It is big enough that something is going on just about every day. There are lots of scientific talks, journal clubs, and other social events to attend. I actually have more space here as a visitor than I do at my home university - I guess a private institution has its benefits. So far I'm enjoying the environment, and I hope to get some good science done while I am here.

I’ll also be traveling to Seattle for the American Astronomical Society meeting in early January. I registered a bit late, and I am only able to present a poster on the last day of the conference, but it is better than nothing. Now I just need to make sure I have those data reduced and ready to show off before then!

In early March I’ll be headed back to Chile, just in time to travel to Puerto Varas for a science meeting. Puerto Varas is a beautiful lake town in the south of Chile, with great views of Volcan Osorno. I’m really looking forward to going back there, and I hope to present some good science results when I am there.

Bye for now!

Tuesday, December 2, 2014

Papers - Proposals to Publishing - Pt 1

I recently published my third first-author paper. It is always a great feeling when a paper gets accepted. As an academic researcher, papers are the primary product of my job. Yes, I have a “real job”. I get paid real money to perform a real service and I output a real product. And yes, this is a point I am defensive about. But to get back on track, this particular paper came out of my graduate work, so it is especially gratifying that it is published.

Publishing a paper is a lot of work. I suppose it is easier for some people or for some disciplines, but I have found it usually to take a significant amount of time and effort. Here is an outline of research from the proposal to the finished paper.

First you have to get an idea for your research project. Then you have to write proposals for funding, data, or both. This process itself can be quite intimidating, but it is often exciting. Writing a proposal is a good time to read papers/research from others and get ideas about what to investigate on your own. I often learn a lot at this stage. After you submit the proposal, it will usually take a few months before the results from the review committee are released. At that point, all you can do is wait - or, more realistically, busy yourself with all the other things you have to do.

Once time/funding/data have been granted you can start your actual research program. Going observing is one of my favorite experiences in astronomy, and maybe I can write up a how-to on observing runs in the future. But let’s say you go observing and collect a bunch of data. Then you get to go home and start analyzing, right? Wrong!

First you have to prepare your data. In astronomy we call this “reducing the raw data”. In other fields you might call it “cleaning the data”. You have to get rid of artifacts from the instrument(s) and perform calibrations so that you can interpret your data properly. The amount of time this takes is highly variable. It depends on several factors including what kind of data you have (imaging, spectroscopy, IFU), and also on how familiar you are with the instrument. It can help to have a pre-made pipeline for data reduction - or it can frustrate the bananas out of you trying to figure out how to run it.

After the data are cleaned you are ready for analysis! This can include an intermediate step, which is referred to by data scientists as “data wrangling”. You want to compare your measurements to some previous study, or mix data from two different sources. Oftentimes these data are in various formats or calibrated to different standards. In order to make everything uniform, you have to perform transformations and/or recalibrate the data.

The analysis will obviously vary depending on the research program, but it might include spectral or image modeling, statistical analysis of data, error and uncertainty estimations, lots and lots of plots, etc. And after many long hours of work you get some results! In some cases this might reduce down to a single point on a single plot, or the results might span many pages of plots.

After all that work, you are ready to begin writing your paper. Some ambitious sorts may have already begun writing as they perform the analysis, but we’ll get to all of that next time on . . . Arbitrary Notations!

Kyle D Hiner

Tuesday, October 28, 2014

A Tour of an Active Galactic Nucleus

In my last post I showed an image of an active galactic nucleus. AGN are some of the strangest objects in the universe. Let’s take a tour and see what we find.

Here on Earth, far, far away from even the nearest AGN, what we often see in the sky are single points of light - objects that look much like stars. But these objects have very strange spectra. Spreading the light out over all its colors and wavelengths (spectroscopy), we see that these objects are not like stars at all - their spectra are very different. Many also show strong radio emission - again, not like stars at all. This led to these objects being named quasars - a mashup of ‘quasi-stellar radio sources’.

Let’s see what happens if we get closer to the quasar. Astronomers have a couple ways of looking at objects in more detail. We can ‘zoom-in’ by getting better angular resolution; and we can ‘go deeper’ to see faint features in the object. If we do this, we often see that the quasars are just one part of an entire galaxy. It’s amazing - what we saw before - just one point of light - is brighter than the entire galaxy where it resides! And because astronomers like to classify and re-classify objects, we get to rename them. No longer are they ‘quasi-stellar’ - they appear in the center of entire galaxies. It’s a nucleus of a galaxy and it’s doing something, so a better name would be Active Galactic Nucleus (or Nuclei if plural). I know, we’re so creative :P

If we look a little more closely at the host galaxies, we can find entire swaths of gas that are ionized in a strange way. The atoms in the gas get ionized (lose electrons) when energetic photons (light particles) smash into them, and that happens around bright stars. But this gas is ionized in a way that stars can’t make happen. So what is ionizing this gas in the galaxies? It all seems to come from the active nucleus, so let’s get closer to that.

It is practically impossible to resolve the nucleus of AGN, but we can learn a lot about it using other methods. If we could zoom in, we would see some really funky stuff.

We’d see a large toroidal structure of dark, dusty clouds. Because the dust is shaped in a torus, like a doughnut or a fat bike tire, it blocks our vision of the very center of the AGN along some lines of sight, but not others. We think the torus isn’t exactly smooth - it’s more likely made up of lots of individual clouds that travel around the nucleus itself. There just may be fewer of these clouds around the polar regions of the AGN, allowing more light to escape along those lines of sight.

Within the torus we find the “broad line region”, where clouds are orbiting something very massive and very small at the center. These clouds can travel with velocities up to ten thousand kilometers per second. By comparison, the International Space Station travels at about 7.7 kilometers per second (thanks google), and the ISS flies all the way around the Earth in just 90 minutes. So these “broad line clouds” are traveling about 1000 times faster than the ISS.

Getting closer still to the center we find a very bright, very hot disk of material. This is the source of all the light energy that is shining from the AGN. It is so hot that it glows in the ultraviolet wavelength range of the light spectrum. Just think - your cooking pan gets hot, but doesn’t glow. You’ve probably seen videos on the internet of metal heated to the point that it glows a bright orange color. And surely you’ve seen fires with blue flames. Well, this gas is so hot that it glows in the ultraviolet.

But why is all this gas so hot and traveling with such high velocities anyway? Answer: Within the very center of the hot gas disk, there is a black hole. Not just any black hole - a supermassive one. As I mentioned in my last post, supermassive black holes can be up to a billion times more massive than our sun. And that black hole is pulling very hard on all the gas and dust in the disk surrounding it. The material of the disk is actually falling down onto the black hole itself, making it even more massive.

And if you zoom back out and think about all that gas and dust in the host galaxy that is experiencing the radiation from the active nucleus, you might think that supermassive black hole can have a big effect on the host galaxy. You might wonder if it affects the galaxy’s ability to form stars, or if it affects the shape of the galaxy in some way. And then you might have to get a PhD in astronomy to figure it all out . . . :D

That’s the tour of the AGN, see you next time!

Friday, October 10, 2014

Astro Notations: The M-σ Relation

This week I thought I would discuss the M-σ relation.

The M-σ relation is a puzzle of modern astronomy. You see, it shouldn’t exist, or rather - we don’t understand why it exists. The M-σ relation is a correlation between the mass of supermassive black holes (M) and the velocity dispersion of the stars in the host galaxy (σ).

What the heck does that mean? I hear you asking. Well I’ll tell you.

First, the stellar velocity dispersion. This is a measurement of the range of velocities that stars have while flying around the galaxy. An observer from Earth sees some of the stars flying toward her, while others are flying away, and many are traveling with velocities somewhere in-between directly toward and directly away. The range of those velocities is what we call the dispersion, and it is directly determined by the amount of mass within the orbits of the stars.

Now what about the mass of the black holes? Well, black holes are really, really massive. Some are so massive, they are dubbed supermassive, and those supermassive black holes rest at the center of nearly every galaxy in the universe. The supermassive ones have more mass than a million of our Suns, and some have more mass than a billion. A billion. And that means that they have a lot of gravity. Mass leads to gravity, and gravity binds galaxies together.

But as big as supermassive black holes are, they are nothing compared to the galaxies in which they reside. And all the stars and all the gas and dust in the galaxy (what we call baryonic matter) is still small compared to the dark matter in the halos of the galaxies.

But before this turns into a discussion about the largest structures of the universe, let’s scale back to the supermassive black holes. While those black holes are massive enough to influence the orbits of the stars that are very nearby, they are not massive enough to influence the orbits of stars that are further away in the bulge of the host galaxy. The galaxy is so much bigger than the black hole, that the black hole can only influence - can only pull on - the stars closest to it. And that is precisely why the M-σ relation should not exist.

We measure the velocity dispersion of the galaxy, which should not be influenced by the black hole, and we measure the mass of the black hole. When we examine the measurements from many different galaxies, we find that the more massive the black hole is, the bigger the velocity dispersion is in the host galaxies. But the galaxies are so much bigger than the black holes - how do they "know" about the mass of the black hole at the center? Why does that relationship even exist?
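For reference, the relation is usually written as a power law. Different studies fit somewhat different values, but roughly speaking the slope comes out between 4 and 5:

```latex
% A common parameterization of the M-sigma relation:
\log_{10}\!\left(\frac{M_{\mathrm{BH}}}{M_{\odot}}\right)
  = \alpha + \beta \,
    \log_{10}\!\left(\frac{\sigma}{200\ \mathrm{km\,s^{-1}}}\right),
\qquad \beta \approx 4\text{--}5
```

In words: every time the velocity dispersion doubles, the black hole mass goes up by a factor of roughly 20 or more. That steepness is part of what makes the relation so striking.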

There are a couple of explanations that might connect the supermassive black holes to the host galaxies. One involves smashing galaxies together and merging the black holes within them. The other adds a layer of feedback from black hole growth into the host galaxy. Neither of these ideas is definitively confirmed yet, but there is evidence for both. I’ll leave that for another day.

For now, check out this cool artwork of an active galactic nucleus. What is it? Where did it come from? What is it doing? Although I have discussed the Teacup AGN in a previous post, I'll keep discussing AGN in the future. I'll even discuss how we can use AGN to study the M-σ relation.

Sunday, September 28, 2014

But Wait . . . I’m Still an Astronomer

It’s been a few weeks since my last blog post; I am back in Chile, and there is a lot to discuss. But I’ll save you from the majority of it.

In September, I spent two weeks visiting Yale, which is a great place for astronomy. The department there is a good size, and there are always events to attend - talks, journal clubs, student/post-doc workshops, etc. This is pretty different from my previous experience at universities that had smaller departments. There is just a better sense of community once you get a department beyond some critical size.

The department also has a very positive attitude in regards to the academic culture. For me, the attitude of the community makes a world of difference. When I was a first-year graduate student I was surrounded by a lot of negativity. That continued into my second and third years. It has taken me a very long time to work out of that. Having a more positive attitude about academia and astronomy gives me more motivation to keep working on things. It's a feeling I get when I attend astronomy meetings/conferences. And I’m pleased to report that I’m working on revising a paper that, hopefully, will get accepted for publication soon.

Also in the past month, I’ve had an undergraduate from UdeC emailing me requesting to work with me. I was hesitant at first, because I haven’t been sure where I would be in the near future. I decided to give him some reading, and now I am actually excited about giving him a project. We met in-person last week, and he said he wanted to work on some data. I think I have a mental sketch of a decent undergraduate project where he will get some experience and learn something useful. But now that I’ve taken him on as a student, it means more work and responsibility for me. This is all a good thing, because it means I continue to progress in my career.

I also had a conversation with my current faculty sponsor. We decided that it might make sense for me to apply to some faculty positions in order to help solve the Two-Body Problem. This is going to be very interesting; I’ve never applied to a faculty position before. After thinking about it for a couple days, I’ve convinced myself that I would make a good candidate, and I hope that my application(s) will reflect that.

For a while I have been pretty conflicted about how and where I might continue in my professional career. But since attending the data science workshop in August (a very positive experience) and since visiting Yale for another two weeks (another positive experience), I am a little less worried about where I may end up. I feel a renewed attitude, and I expect that whatever job/career opportunities arise will be rewarding.

Monday, September 1, 2014

Data Science is Not Magic

We’re at the end of August and close to the end of the S2DS program that I’ve been attending in London. The program has been great, and I have learned much more in a month than I would have simply by investigating on my own. I have had a good team, and I feel like we accomplished something useful.

As I mentioned in previous posts, my team is focused on unifying offline shopping data (specifically from grocery stores) and online data from facebook. I won’t tell you the actual results that we found - for that you’ll have to sit in on our presentation on Tuesday afternoon. But, I will describe some of the methods that we’ve used.

We investigated whether users who were similar to one another in one data set could be recovered by looking at their habits in the other data set. This is referred to as a Look-Alike Engine. We went about it by creating similarity matrices that assign a score to each user-pair based on how similar they are to one another using the input data. For shopping habits that means comparing users who bought similar items. For the social network data that means comparing users who have liked similar content.

There are a couple of different metrics one can use to determine similarity between two sets. A simple metric is the Jaccard index, defined as the number of items the two sets have in common divided by the total number of distinct items across both sets (the size of the intersection over the size of the union). We decided that this scoring metric was not producing the results that we wanted. If two users each have missing data, then they are scored as similar. This somewhat depends on how the data sets are constructed and how the code runs, but it can very easily count a ‘0’ in one user’s data as the same as a ‘0’ in another user’s data. We might not want that in the end.
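Here is a minimal sketch of the Jaccard index in Python. Note it operates on sets rather than the binary vectors our actual code used, so the shared-zero problem above doesn’t arise here; the item names are made up for illustration:

```python
# Jaccard similarity between two users' item sets:
# |intersection| / |union|.
def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention choice: two empty baskets score as not similar
    return len(a & b) / len(a | b)

u1 = {"milk", "bread", "eggs"}
u2 = {"milk", "eggs", "beer"}
print(jaccard(u1, u2))  # 2 shared items / 4 distinct items = 0.5
```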

Another metric is the ‘cosine’ similarity score. This requires one to vectorize the data in a consistent way, because the cosine metric is defined as the dot product of the two users’ data vectors normalized by the product of their magnitudes. Recall that the dot product can be written: A · B = |A||B|cos(θ). So this method simply calculates the value of cos(θ). At first I found this concept to be slightly difficult to apply to “shopping habits” and “likes” from facebook. However, it is fairly straightforward once you organize the data into a vector.

For example, with the movie ‘likes’ from facebook, we created an array/vector where every element is a possible movie title. We populated the array/vector with a value of 1 if the user ‘likes’ the title, or a zero if the user has not ‘liked’ that title. Then we had an array with all possible dimensions (movie titles) for every user. The cosine metric is then simply the normalized dot product of the arrays. For our data this meant that if two users did not declare a ‘like’ for a given movie, then that movie contributed nothing to their similarity - we don’t actually know if they liked the movie or if they didn’t like it. Not declaring a preference is not the same as disliking a thing.
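A small sketch of that calculation in plain Python (the vectors here are made-up toy data, not our real arrays):

```python
import math

# Cosine similarity between two binary 'like' vectors.
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no declared likes matches no one
    return dot / (norm_u * norm_v)

# One element per movie title; 1 = 'liked', 0 = no declared preference.
alice = [1, 1, 0, 0]
bob   = [1, 0, 1, 0]
print(cosine(alice, bob))  # 1 / (sqrt(2) * sqrt(2)) ≈ 0.5
```

Notice that shared zeros add nothing to the dot product, which is exactly the behavior we wanted and the Jaccard implementation didn’t give us.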

Once we measured the similarity between users with a given metric, we organized the data into a large similarity matrix. This matrix simply contained all the similarity scores for every user pair, and allowed us to find the users who are most similar to one another.

Finally, we could test whether or not users who were similar in one data set were also similar in another. We selected the most similar users in one set to use as a control sample. Then in the other set we selected the most similar users and compared the two resulting selections of users to see how many we recalled and to what precision.

Recall and precision are two metrics by which we measured the ability of our method to find similar users. Recall is defined as the number of true positives in the set divided by the total number we could possibly find (the true positives plus false negatives). Precision is defined as the number of true positives divided by the number you’ve actually selected (true positives plus false positives). Depending on your application you might want to sacrifice some precision in order to obtain better recall, or vice versa.
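In code, the two definitions reduce to a few lines (the user IDs here are illustrative):

```python
def recall_precision(selected, relevant):
    """selected, relevant: sets of user IDs."""
    tp = len(selected & relevant)  # true positives
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(selected) if selected else 0.0
    return recall, precision

relevant = {"u1", "u2", "u3", "u4"}  # users we should have found
selected = {"u1", "u2", "u5"}        # users our engine actually picked

r, p = recall_precision(selected, relevant)
print(r)            # 0.5  - found 2 of the 4 relevant users
print(round(p, 3))  # 0.667 - 2 of our 3 picks were correct
```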

Lastly, in order to determine if our method was actually any good, we conducted a simple monte-carlo simulation where we randomly selected users from the experimental data set to compare to the control set. In the end, we were able to determine the recall and precision for random selections and for our selections using the similarity scores as a function of the number of users in the sample selection.
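The random baseline can be sketched like this (toy numbers, not our real samples; with 10 relevant users out of 100, random selection of 10 should recall about 10% of them on average):

```python
import random

# Monte-Carlo baseline: average recall achieved by picking users at random.
def random_baseline(all_users, relevant, n_select, n_trials=10000, seed=0):
    rng = random.Random(seed)
    total_tp = 0
    for _ in range(n_trials):
        picked = set(rng.sample(sorted(all_users), n_select))
        total_tp += len(picked & relevant)
    mean_tp = total_tp / n_trials
    return mean_tp / len(relevant)  # average recall over all trials

users = {f"u{i}" for i in range(100)}
relevant = {f"u{i}" for i in range(10)}
print(round(random_baseline(users, relevant, n_select=10), 2))
```

Any similarity-based selection worth keeping should beat this baseline by a clear margin.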

Now if there actually is a correlation between people’s shopping habits and the movies and TV shows that they like on facebook, then we expect to more accurately select users from our similarity matrix than we would just by selecting at random. As I said, I won’t tell you the result, but you can hazard a guess on your own.

Coming up, we will present our results to other participants and partner companies on Tuesday afternoon. Then on Wednesday afternoon there will be a job fair where we will be able to discuss opportunities with the partner companies.

As I said, I have found this experience very useful, and now I have a definite project that I can point to - with results - that shows what I can do as a data scientist. I am hoping that I will be able to develop more skills in data science and machine learning techniques as I move forward with my career.

Kyle D Hiner

Saturday, August 16, 2014

S2DS and python, etc. - Pt 2

Week 2 of S2DS is down, and this week we began working full-time on our projects. As I mentioned before, my team has some proprietary data to work with. We’ll be combining offline shopping habits with online social network preferences to try to build models about consumer demographics and behaviors. As of now we’ve just been looking at the data to get a feel for what is in there, and we’ve started to brainstorm some ideas about what we can do with them.

Our team only has three members. One is working with the offline data, while I have partnered with the other to work with our Facebook data. We are currently using a python package called GraphLab to do some modeling and analysis of the data. To begin, we wanted to see if we could create a movie recommendation engine based on the movies that users had ‘liked’ on their Facebook profiles.

The GraphLab tool makes implementing machine learning techniques so streamlined and black-boxed that it is almost trivial. Almost. And, actually, as a scientist, that is extraordinarily frustrating. I’m dying to get into the details about the algorithms and theory behind this stuff. That, as you might expect, is non-trivial. At a basic level, you don’t need to know all that stuff, but if you want to understand what is going on and how best to use the tools, you had better know it. As a scientist I can’t stand using a tool if I don’t know how it works. I can’t defend my work if I don’t know what it is that I actually did - that is fundamental to science. The good news is that everyone here understands that.

The bad news is that I still only get a couple weeks to dive into all of it. It turns out that industry/business works on much shorter timescales than academia. That’s what we are told, at least. For now, I am trying to learn as much as possible. But, as with any new endeavor, you can only learn so much in a short amount of time. I know I won’t be an expert after this program, but I will have a great deal more knowledge. (Aside: One thing I learned from working in the CA Senate - if you know one thing about anything that no one else knows, then you are the expert in the room.)

I’ve been getting more comfortable with programming in python now. After taking the Udacity courses and playing around on my own, I feel like I can understand many of the basic structures that python is using. Even though I don’t know a lot right now, I am more confident in my ability to quickly learn python and its various packages (pandas, numpy, scipy, graphlab, etc.) now that I have more experience with them - and that is going to be really important going forward in my career.

At the end of this week, we (my S2DS team and I) have created a basic recommendation function for movies based on people’s Facebook likes. Almost immediately I’ve learned a good lesson about testing code. As I mentioned above, we are using a python package that works a lot like a black-box. So, to figure out how it works, we have to test various inputs and see if their output makes sense. For example, we’ve tried generating fake users that are copies of our real users to see if we get consistent recommendations for the invented data. At first, we didn’t, which indicated we were doing something wrong. So, the value of testing code is very high. Our background in science has given us this attention to systematics and details, so that we understand what it is exactly that we are doing, and we can produce consistent results.
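The duplicate-user check can be sketched in a few lines. Note that `recommend` here is a toy stand-in, not GraphLab's actual API - the point is only the testing pattern: identical inputs must produce identical outputs.

```python
# Consistency check: a cloned user should get the same recommendations
# as the original user. `recommend` is a toy stand-in engine that
# suggests every catalog title the user hasn't liked yet.
def recommend(likes, catalog):
    return sorted(set(catalog) - set(likes))

catalog = ["Alien", "Brazil", "Clue", "Dune"]
real_user = ["Alien", "Dune"]
fake_user = list(real_user)  # exact copy of the real user

assert recommend(real_user, catalog) == recommend(fake_user, catalog)
print(recommend(real_user, catalog))  # ['Brazil', 'Clue']
```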

We are planning to expand on our work by increasing complexity. We would like to incorporate other variables like music and book preferences as well. Hopefully we’ll be able to take in a given set of music preferences and recommend both additional music and movies based on other users’ preferences. Collectively this is sometimes referred to as a look-alike algorithm.
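As a rough sketch of the look-alike idea (my own minimal version, under the assumption that we simply match a new user to the most similar existing user by music taste and hand over that user’s movies; all profiles are invented):

```python
# Hypothetical look-alike sketch: find the existing user whose music
# likes are closest to the new user's, then recommend their movies.

def jaccard(a, b):
    """Jaccard similarity between two sets of likes."""
    return len(a & b) / len(a | b) if a | b else 0.0

profiles = [
    {"music": {"Radiohead", "Bjork"}, "movies": {"Solaris", "Arrival"}},
    {"music": {"AC/DC", "Metallica"}, "movies": {"Mad Max"}},
]

def lookalike_movies(music_likes, profiles):
    """Return the movie likes of the closest music 'look-alike' user."""
    best = max(profiles, key=lambda p: jaccard(music_likes, p["music"]))
    return sorted(best["movies"])

print(lookalike_movies({"Radiohead", "Portishead"}, profiles))
# → ['Arrival', 'Solaris']
```

A real system would average over many neighbors rather than trusting a single closest match, but the cross-domain recommendation step is the same.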

A lot of the work I’ve just described is exploratory. Next week we are going to have to focus more on combining our data sets and doing some real hypothesis testing. In the end, we are hoping to develop one or more algorithms that will actually be useful. I think I will be a very strong candidate for data science jobs by the end of the program.
 
Kyle D Hiner
Studious data scientist

Sunday, August 10, 2014

Science to Data Science, Pt 1

I’ve traveled to London for a 5-week program on data science techniques. The program is called S2DS (Science to Data Science), and so far it has been really great. There are about 80 participants, who are divided into small (~4 person) teams. Each team partners with a company for the month in order to complete a project that is relevant to the business and serves to teach the participants valuable data science techniques. I actually have some open questions regarding intellectual property, but glossing over that . . .

We’ve just completed the first week, and we’ve covered a lot of material. It is going pretty fast, and I feel like I’ve been here a while already even though I still haven’t met all of the participants. The lectures have been rather superficial. One simply cannot realistically give a substantial introduction to any of these topics in the small time frames that we have. You can’t learn python in 2 hours, and the same goes for Hadoop, SQL, and machine learning. That said, the introductory lectures are valuable. I am still trying to grasp all the concepts, but hopefully I will get a much more detailed experience as the program progresses.

The thing I am most excited about in this program is expanding my skill set. I’ve always been somewhat uncomfortable with my level of competency with various tasks and programming. I’ve previously written about trying to learn more python, including through the Udacity online coursework. This workshop will really be a time that I can dive into some new things that are applicable both in and outside of astronomy.

So far we have had lectures on good coding practices, Hadoop, an introduction to databases, SQL, NoSQL, statistics, R, natural language processing, and machine learning techniques, and even some more business-oriented seminars on economics and marketing. Meanwhile, we’ve also met with our project mentors to outline the project scope for our group work over the next few weeks. I am most excited to dive deeper into the machine learning techniques, because I feel like those are the most useful across a wide variety of potential future jobs.

For my team’s project, we’ll be doing a lot of machine learning and trying to do some predictive analysis. The data set that we are getting access to is proprietary, so I’m not allowed to share information about it. But, generally, we’ll be working with customer data to predict purchasing habits and/or demographics of the customer. It is kind of crazy to think about the amount of information people are willingly (or unwittingly) handing over to various companies. Regardless, I am looking forward to learning the techniques of the project and potentially applying them in academia or other industries.

Kyle D Hiner
Data scientist?

Sunday, July 27, 2014

Tostada Granos y Bliss

Here is my first post about one of my most rewarding hobbies: coffee roasting.

I started drinking coffee in college. I tried coffee as a teenager, but never really liked the stuff that my parents would buy from the store. Mostly this was your regular low-quality brand. Sometimes they would buy Kona blends, and almost always they bought decaf. A lot of people like Kona, but I don’t, and it wasn’t until much later that I developed a taste for other coffees.

Because I had never really enjoyed coffee, I mostly stuck with mocha drinks if I went to coffee shops with friends. But, somewhere along the way I switched to just drinking regular black coffee. The more I paid attention to it, the more I enjoyed it. I think that is part of why I enjoy it - fully appreciating coffee takes time, patience, and focus.

A few years ago I started roasting my own coffee at home. I found an online store that fortunately had a physical store-front in my city. That store actually specialized in home-brew beer kits, but they also sold coffee for home roasting. I purchased their introductory kit online, and went to the store to get some green coffee beans.

The roasting kit is actually very simple: a stove-top whirly pop popcorn maker and a thermometer. That’s it. I punched a hole in the top of the popcorn maker with a corkscrew so I could insert the thermometer. The thermometer itself is actually pretty special - it is thin, more like a meat thermometer, and it reads up to 500 F. I’ve never seen one in a store, although they must exist; if you go looking for one you will probably find candy thermometers instead, but that isn’t what this is. It’s actually quite important for the process.

Anyways, I ended up getting some green coffee beans from that store and I dumped them into the popcorn popper pot and started roasting. The first thing I noticed was steam! Lots of steam! And then smoke! Lots of smoke! and then chaff! Lots of chaff! It was a bit of a mess. And it was delicious. The aroma of freshly roasted coffee is intoxicating.

Since I started roasting my own coffee I have learned a lot more about what goes into a good cup. I bought a book and started reading it, although I still haven’t actually finished it. I may write more about coffee roasting in the future, including my dreams of running my own café, but this time I just wanted to say . . .

I miss coffee roasting!!!!

Since living in Chile, I haven’t been able to roast my own coffee. That’s mostly because I don’t know anyone who sells green coffee beans. There is one decent coffee roaster here in Concepcion, and he wouldn’t sell me green beans. I was/am very disappointed. There are many things I miss about the US, and coffee roasting is probably number 3 on the list (the first two being a person and a cat).

You can probably already tell that I enjoy coffee from my chosen background image for this blog. Although, I have to say that that is a sorry-looking batch of roasted coffee. My coffee comes out much more evenly. You can see in the background image that some beans are barely roasted at all, while others are charred. tsk tsk.

I’ve included an image of some coffee that I roasted a long time ago (and subsequently ground up, soaked in hot water, and drank to my delight). Le sigh. I miss good coffee. Also, I am including an image of an antique coffee grinder that I saw in an historical German house outside the town of Frutillar in the south of Chile. Both are original photos, so . . . copyright me?

Until next time, chao!

Kyle D Hiner

Coffee-Roaster Extraordinaire



Tuesday, July 22, 2014

Proposaling!

Foreword: I actually wrote this post a few months ago while I was working on some proposals for telescope time. Also, yes, I was thinking of starting a blog a few months ago, but didn't commit to it until recently.

Proposals are an interesting beast. In astronomy, the proposals for funding and those for data are typically decoupled. In some cases (e.g., when applying for Hubble Space Telescope time), support funding comes with the granted time for data acquisition. Such cases are rare, and highly sought after.

I have a fellowship sponsored by the Chilean national government. This fellowship pays my salary and some other expenses, such as travel and publication costs associated with doing science. However, there is no guarantee that I will actually get time on telescopes to collect the data necessary for the project that the fellowship is intended to fund! Thus, in addition to the initial fellowship proposal, I have to write many observing proposals in order to get data. One would think that I am decently good at it by now.

In my experience, every proposal is a snarling jabberwocky that needs to be wrestled and tamed until it fits in a cage of predetermined size and material.

These times are often great for learning a lot about my science, as they require some level of reading. I need to know what I’m talking about right? They are also exciting times, because I think a lot about the projects that I am doing or want to do. But my main point for this post is this:

The amount of time you invest in a proposal needs to be carefully considered. I know people who work to the very last minute on proposals. The idea here is that every change you think of can only make the proposal better, and if that one change makes it good enough to get time, then it is worth making. But I think a little bit differently . . . in part, because that method leads to tremendous amounts of stress, late hours, and sweating bullets trying to get the proposal submitted in the last 10 minutes before the deadline and oh my god, did the website crash or is it just my internet - freak out! I have a typo!

No. Here’s the way I look at it:

The “goodness” of a proposal is a function of the amount of time/effort one spends on it. However, that function is not a power-law, not a quadratic function, not even linear. That function quickly closes in on an asymptote (see figure). And no matter how much you work on that proposal it’s never going above that asymptote. Now, somewhere on the “goodness” scale is a cut-off for how good your proposal has to be in order to be granted time, and many factors go into determining that critical value. As long as your proposal reaches that level, then it is “good enough” - you get the time, you get the data, you go on with your life. And it may in fact be that no matter how much time and effort you dump into your proposal, it will never meet that critical value.

Furthermore, it is nearly impossible to predict what that value is. You’ll go insane if you try to squeeze every last moment into proposaling. So, try not to go down the path of infinite time/effort. At some point there is nothing more you can do to make your proposal get time.

At least that’s what I try to do. Inevitably, I get sucked into the vortex of proposal deadlines.





Figure: Blue - your proposal "goodness". Green - the maximum "goodness" your proposal will ever have (determined by quality of proposed science). Red - critical value at which your proposal will make it past review and be granted time (determined by your writing ability and the mood of the review committee).

Kyle D Hiner
More blogs to come

Epilogue: Since the writing of that initial post I've found out the results of the applications I had submitted. One of my proposals was successful, while two others were denied time. I could probably do an analysis to see if I actually beat the odds, but that would require more research than I really care to do at this point. In any case, my research group and I were granted 3 nights of observing with the Magellan telescope at Las Campanas Observatory, Chile. That's a big deal. Each night costs about $30k to run the telescope. Effectively, I was given ~$90k to do some research. There's more to say on that topic, but this post is long enough already.

Thursday, July 17, 2014

Astro-Reports: The Teacup AGN

Recently I read an article regarding the “Teacup AGN”, so named because of its apparent shape in visual images. Although, if you ask me, it doesn’t look much like a teacup, even if it does have a loop.

This is a very interesting case that is reminiscent of Hanny’s Voorwerp galaxy. Both cases exhibit extended emission line regions, which are very large swaths of gas that are ionized. Ionization occurs when diffuse gas is illuminated by very energetic photons. The “ionizing photons” interact with the gas, give energy to electrons, and kick the electrons off of their atoms. Subsequently, an electron freed in this manner occasionally falls back down onto an atom. When it does so, it gives off energy by emitting another photon. These are the photons that we see when we observe the gas.

If you take a spectrum of the gas, you’ll be able to identify discrete bright spots - what we call emission lines. Some of those emission lines show up stronger than others. This is the result of various physical conditions such as the density of the gas, and the light that is causing it to be ionized. What that means is that if you observe the emission line intensities and their ratios, you can infer things about the ionizing source. In the Teacup AGN, that ionizing source is an active galactic nucleus. I can write more about them later.

In the paper I read recently, the investigators (Gagne et al.) measured the spectrum at discrete locations along the extended emission line region (the ionized gas). Then, they predicted properties of the ionizing source based on all those individual data points. Specifically, they calculated the luminosity (or brightness) of the active nucleus that would be needed to create the emission line strengths and ratios that they observed at each location in the galaxy. And because light travels at a finite speed, this allowed them to track the luminosity of the source over time.

By analogy, the analysis worked because each spot in the gas acted like a photograph of the active nucleus. The "photographs" from the outer edges of the galaxy showed a very bright source, and those from the inner regions showed a dim source. Because the "photos" from the inner regions are more recent than those from the outer regions, it can be inferred that the active nucleus dimmed over that time period.

The authors’ main result is that the illuminating source decreased in brightness by a factor of 100 over the course of 46,000 years. That is not very long relative to the lifetimes of stars and galaxies. The result suggests that active nuclei can turn on and off rather quickly.
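The light-travel arithmetic behind the "photograph" picture is simple enough to sketch (the numbers below are illustrative, not the actual geometry from the paper):

```python
# Back-of-the-envelope light echo: gas that sits one kiloparsec
# farther along the light path shows the nucleus as it was roughly
# 3,260 years earlier, since 1 kpc = 3,261.6 light-years.
KPC_IN_LIGHT_YEARS = 3261.6

def lookback_years(extra_path_kpc):
    """Extra light-travel time, in years, for a longer path in kpc."""
    return extra_path_kpc * KPC_IN_LIGHT_YEARS

# A ~14 kpc difference in light-path length spans roughly the
# 46,000-year window over which the nucleus faded.
print(round(lookback_years(14.1)))  # → 45989
```

So gas spread over galaxy-sized scales really does preserve tens of thousands of years of the nucleus’s history.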



Kyle D Hiner
Learning to crawl blog.

Friday, July 11, 2014

Python Pt 1


In the past couple of months, I have been trying to get more experience with the python programming language. I do not have a very strong programming background. I took a C/C++ course as an undergraduate, but at the graduate level, I’ve had to teach myself IDL. I never had a statistics course or a computational methods course. Therefore, I feel quite behind when it comes to scientific computing and data science. In order to rectify this, and to make myself a stronger candidate for positions outside academia, I am trying to develop my skill set.

Python is a completely free programming language that anyone with time and desire can learn. Python is available for nearly all operating systems (Linux/Unix, OS X, Windows). Fortunately, there are plenty of freely available tools to help one learn python. I have decided to work through The Python Tutorial, and I am using version 2.7.5. This is not the most recent release of python, but I don’t really want to dig around in my computer to see if I have the most recent release somewhere or not.

After spending some time with the python tutorial, I also learned about many other free resources for python online. I decided to take a look at the Udacity course Intro to Computer Science, which teaches the basics of programming using the python language. I was already familiar with a lot of the basic logical structures like loops and conditionals from my previous experience in programming (a C/C++ class and working knowledge of IDL), but the course was a good intro to the syntax of python, and showed off some of its features.

The main objective in the course is to build a web search engine, which I found very interesting. There are obviously many good search engines that already exist, which makes the task of building one very accessible to a general audience, if a bit less practical. Probably no one is going to substitute their homemade search engine in place of google, but it is nice to know that one could, given sufficient resources, and without an incredible amount of sophistication.

Actually building the search engine was broken down into parts - first building a web crawler that searches web pages for links to other web pages. Then one has to build up an index of pages and match that index with keywords on the page. Finally, to make the engine practical, one needs to build a ranking system that returns pages in an order that, hopefully, matches the user’s intentions. The ranking system implemented in the Intro to Computer Science course is called PageRank, and is the ranking system that Larry Page used in the early days of google. It basically ranks pages based on the number of other useful pages that link to that page, as a kind of “popularity” measure of web pages.
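A toy power-iteration version of that idea (my own minimal sketch, not the course’s actual implementation) fits in a few lines:

```python
# Minimal PageRank sketch: repeatedly share each page's rank among
# the pages it links to, with a damping factor modeling random jumps.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                if target in new:
                    new[target] += share
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
# "c" is linked by both "a" and "b", so it ends up ranked highest.
print(max(ranks, key=ranks.get))  # → c
```

The "popularity" interpretation falls straight out of the update rule: a page’s rank is fed by the ranks of the pages linking to it.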

It took me about two months to actually complete the free-access version of the Intro to Computer Science course on Udacity. I found the exercise quite useful, and I’d like to move to primarily using python for my research purposes, when I can. That is a big move, and opens another can of worms that I won’t get into - mainly because I don’t have experience doing it yet.

I enjoyed the course so much that I started working on another Udacity course: Intro to Data Science, which gives an overview of some other very interesting tools, including Pandas for Python, SQL, and APIs. I’ve only worked through the first two lessons so far, but I am really enjoying it. I will write another post on the topic after completing more of the course.

Kyle D Hiner, Ph.D.

Saturday, July 5, 2014

Hello World!

Hello World!

Let me introduce myself. My name is Kyle Hiner, and I am a post-doctoral scholar working in Chile. I earned my PhD from UC Riverside in 2012, and I study astronomy. I have many interests that go beyond astronomy, and I will definitely post about them as this blog progresses.

This is my first blog post, in which I will outline a set of goals to achieve/maintain during the course of this blog. The main purpose of this blog is for me to work on communicating various things to a general audience, and to participate in larger discussions that are occurring in the big wide world. Mostly I will try to stick with topics that I know about or am interested in. I may post on something in which I have an interest but no knowledge, or knowledge but no interest. Hopefully we'll keep the latter to a minimum.

Post frequency:
I will make an attempt to publish at least two posts per month.

Post length/depth:
I will post at least 3 paragraphs on every topic. I will make an attempt to include relevant data and citations when appropriate.

Post subject matter:
Astronomy, data science, independent learning, academia and higher education, other professional activities, US expat living/working, personal activities, and probably a bunch of other things that strike my fancy.


Well I believe that covers the very basics. I know you don't know anything about me yet, but stay tuned. There just might be more to come.

Kyle D Hiner, PhD
Newborn Blogger