This is the first post in an ongoing series in which the BSR team sketches out their creative process. Look out for more posts in the coming weeks!

 

As Alexis Fedorchak pointed out in her letter from the editor (1), several articles in the latest issue of Berkeley Science Review focus on meta-science – describing many sides of being a scientist and the process of science. The cover article by Anum Azam, “The first rule of data science” (2), explores a new scientific field rising to prominence at UC Berkeley, one that can be seen as a type of meta-science itself. Data science is inherently a collaborative endeavor, bridging many areas of research and diverse sets of skills.

Data science skillset. Design: Natalia Bilenko, modified from Drew Conway; Book: MTchemik; network: Qwertyus


 

I wanted to capture this spirit of collaboration when illustrating Azam’s article. The image I chose for the title page fit the bill perfectly: the visualization by data scientist Olivier H. Beauchesne shows scientific collaborations across the world (3), based on co-authorships in Elsevier’s Scopus database (4). Numerous connections, depicted as sweeping arcs, evoke a web that unites the world in the scientific process. When the decision was made to feature Azam’s article on the cover, I wanted to extend the theme of scientific collaboration and take it closer to home. The article describes many connections across the UC Berkeley campus, and I thought that BSR readers might want to see UC Berkeley collaborations (I certainly did!). This is how I found myself solving a data science problem to illustrate a data science article.

Finding a dataset

The first step was to find an appropriate dataset. Though co-authorships lingered on my mind, I did not have full access to the database used by Beauchesne. I went through a few alternatives: cross-linked websites on the berkeley.edu network, cross-listed classes, faculty appointments. While interesting, none of those sources of information captured collaboration explicitly. Finally, I made a fortuitous discovery: the PubMed database (5) allows searches by affiliation. Just as I was dreading writing a tedious script to parse the HTML returned by a PubMed search, it turned out that I could export those search results in XML format, a much easier structure to understand and pull information from. We were in business.

I retrieved all articles published in the past 20 years (1994-2014) that listed “University of California, Berkeley” (and several variants) in the affiliations. PubMed’s listings dropped rapidly for articles before 1994, likely due to indexing problems, and 20 years seemed like a good range. I then loaded the data into my favorite exploratory tool, IPython Notebook (6), created by UC Berkeley’s own Fernando Perez and his team of pythonistas. I parsed the data using a Python library called BeautifulSoup (7). It’s a library for working with web data that I picked up from the excellent visualization blog FlowingData (8). I used it to get a list of affiliations for each article, and only kept affiliations that included UC Berkeley. That list included 3826 articles.
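For readers who want to try something similar, here is a minimal sketch of the filtering step. It is not the code I ran: to keep it runnable with only the standard library, it swaps BeautifulSoup for Python's built-in xml.etree.ElementTree. The sample XML is made up and heavily simplified (a real PubMed export nests these elements more deeply), and the short list of name variants is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified stand-in for a PubMed XML export.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <Affiliation>Department of Physics, University of California, Berkeley, CA.</Affiliation>
    <Affiliation>Department of Chemistry, UC Berkeley, Berkeley, CA.</Affiliation>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <Affiliation>Department of Biology, Stanford University, Stanford, CA.</Affiliation>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Hypothetical list of spellings to match; the real search used several variants.
BERKELEY_VARIANTS = ("university of california, berkeley", "uc berkeley")


def berkeley_affiliations(xml_text):
    """Map each article ID to its affiliations that mention UC Berkeley."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        kept = [
            aff.text.strip()
            for aff in article.iter("Affiliation")
            if aff.text and any(v in aff.text.lower() for v in BERKELEY_VARIANTS)
        ]
        if kept:  # skip articles with no Berkeley affiliation at all
            result[pmid] = kept
    return result


affiliations_by_article = berkeley_affiliations(SAMPLE_XML)
```

Matching on lowercase text catches differences in capitalization, but not misspellings or unusual abbreviations, which is part of why manual cleaning was still needed.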

Cleaning the data

 

It was time to switch to manual cleaning. The affiliations were quite inconsistent: nonstandard department names, various ways of listing affiliations for multiple authors, even misspellings of “Department.” Next time I publish a paper, I might fill out that little field more carefully. I kept articles with multiple Berkeley affiliations, including a few major independent collaborators, such as Lawrence Berkeley National Lab and Howard Hughes Medical Institute. I then loaded the affiliations for the remaining 611 articles back into IPython Notebook and created a co-occurrence matrix characterizing co-authorships between programs. With the matrix in hand, it was time to make the first graphic draft.
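A co-occurrence matrix simply counts, for every pair of programs, how many articles list both. The sketch below shows one way to build it in plain Python. The three example articles are invented, and the function name is my own for illustration, not taken from the original notebook.

```python
from itertools import combinations

# Hypothetical cleaned data: one set of programs per article.
article_programs = [
    {"Physics", "Chemistry"},
    {"Physics", "Chemistry", "Bioengineering"},
    {"Bioengineering", "Molecular and Cell Biology"},
]


def cooccurrence_matrix(articles):
    """Return (program names, symmetric matrix of shared-article counts)."""
    programs = sorted(set().union(*articles))
    index = {name: i for i, name in enumerate(programs)}
    matrix = [[0] * len(programs) for _ in programs]
    for article in articles:
        # Every unordered pair of programs on the same article is one link.
        for a, b in combinations(sorted(article), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return programs, matrix


programs, matrix = cooccurrence_matrix(article_programs)
```

In this toy example, Physics and Chemistry share two articles, so their cell holds a 2. The diagonal stays at zero because a program is never paired with itself.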

Co-occurrence matrix characterizing co-authorships between UC Berkeley programs.


 

 

To display the data, I used D3.js (9), a powerful JavaScript library for interactive data visualization. D3 was created by Mike Bostock, graphic editor for the New York Times, while he was a graduate student at Stanford. It’s immensely flexible and has a very supportive community with many tutorials and resources. The idea of making a geographical map appealed to me, but the correspondence between departments and buildings at Berkeley is quite convoluted. I went with a minimal circular layout, borrowing the concept from Mike Bostock’s visualization of Uber rides by San Francisco neighborhood (10).
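The geometry behind a circular layout is simple: each program gets an evenly spaced position around a circle, and collaborations are drawn as lines between those positions. The actual graphic was built with D3 in JavaScript; the Python sketch below only illustrates the positioning idea, with a hypothetical function name and program list.

```python
import math


def circular_positions(names, radius=1.0):
    """Place each name at an evenly spaced (x, y) point on a circle."""
    step = 2 * math.pi / len(names)
    return {
        name: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, name in enumerate(names)
    }


# Hypothetical program list, in the order they should appear around the circle.
positions = circular_positions(["Physics", "Chemistry", "Bioengineering", "Statistics"])
```

The order of the input list determines which programs sit next to each other.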

 

The first draft ended up looking colorful and overwhelming. After further data cleaning and forgoing color, the graph looked more manageable. However, it was still impossible to tell what it represented. To get a glimpse of the meaning behind the lines, I included the department names in the graph, resulting in the least attractive draft of all.

 

Early drafts of the collaboration graphic.


 

Designing the graphic

 

Satisfied that the visualization was sensible, I focused on aesthetics. The number of programs and departments was too large to encode visually. I assigned each to one of six disciplines (physical sciences, engineering, biological sciences, social sciences, math and computer sciences, and health and medicine). This may have been the hardest part of the process – many programs don’t fit into a discipline neatly – but it was a design compromise that had to be made. I encoded each discipline with a color, and the penultimate draft was born.
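In code, this encoding amounts to two lookups: program to discipline, then discipline to color. The six discipline names below come from the list above. The hex colors and the program assignments are placeholders for illustration, not the ones used on the cover.

```python
# Placeholder colors, one per discipline.
DISCIPLINE_COLORS = {
    "physical sciences": "#1f77b4",
    "engineering": "#ff7f0e",
    "biological sciences": "#2ca02c",
    "social sciences": "#d62728",
    "math and computer sciences": "#9467bd",
    "health and medicine": "#8c564b",
}

# Hypothetical assignments; some programs could reasonably go elsewhere.
PROGRAM_DISCIPLINE = {
    "Physics": "physical sciences",
    "Chemistry": "physical sciences",
    "Bioengineering": "engineering",
    "Molecular and Cell Biology": "biological sciences",
    "Psychology": "social sciences",
    "Statistics": "math and computer sciences",
    "Public Health": "health and medicine",
}


def program_color(program):
    """Look up a program's discipline, then that discipline's color."""
    return DISCIPLINE_COLORS[PROGRAM_DISCIPLINE[program]]


program_colors = {name: program_color(name) for name in PROGRAM_DISCIPLINE}
```

Keeping the two mappings separate makes it easy to move a borderline program to another discipline without touching the colors.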

 

Penultimate draft of the collaboration graphic – the program order is random.


 

In the final graph, the programs are grouped by discipline. The number of interdisciplinary connections across Berkeley is truly striking, and the colorful collaboration network made for a beautiful magazine cover. The abstract visualization in print is enriched by the interactive version on the website: hover over a program’s sector to display its name. To see the interactive graphic in action, see here.

There are, of course, some caveats to this visualization, as with any project based on real-world data. For example, PubMed skews heavily toward biomedical research, so the mathematical sciences are underrepresented. Further, affiliation lists reflect faculty with appointments in multiple departments as well as work by authors from different departments. But those considerations aside, it provides a fascinating view of the strength of collaboration among scientists at UC Berkeley. Throughout my time in graduate school, I’ve felt that this feature is at the core of UC Berkeley’s academics. It is inspiring to see that hunch confirmed visually.

 

References

1. http://sciencereview.berkeley.edu/article/from-the-editor-spring-2014/

2. http://sciencereview.berkeley.edu/article/first-rule-data-science/

3. http://olihb.com/2011/01/23/map-of-scientific-collaboration-between-researchers/

4. http://www.elsevier.com/online-tools/scopus

5. http://www.ncbi.nlm.nih.gov

6. http://ipython.org/notebook.html

7. http://www.crummy.com/software/BeautifulSoup/bs4/doc/

8. http://flowingdata.com/

9. http://d3js.org/

10. http://bost.ocks.org/mike/uberdata/

 
