This is the first post in an ongoing series where the BSR team sketches out their creative process. Look out for more posts in the coming weeks!
As Alexis Fedorchak pointed out in her letter from the editor (1), several articles in the latest issue of Berkeley Science Review focus on meta-science – describing many sides of being a scientist and the process of science. The cover article by Anum Azam, “The first rule of data science” (2), explores a new scientific field rising to prominence at UC Berkeley, one that can be seen as a type of meta-science itself. Data science is inherently a collaborative endeavor, bridging many areas of research and diverse sets of skills.
I wanted to capture this spirit of collaboration when illustrating Azam’s article. The image I chose for the title page fit the bill perfectly – the visualization by data scientist Olivier H. Beauchesne shows scientific collaborations across the world (3), based on co-authorships in Elsevier’s Scopus database (4). Numerous connections, depicted as sweeping arcs, evoke a web that unites the world in the scientific process. When the decision was made to feature Azam’s article on the cover, I wanted to extend the theme of scientific collaboration and take it closer to home. The article describes many connections across the UC Berkeley campus, and I thought that BSR readers might want to see UC Berkeley collaborations (I certainly did!). This is how I found myself solving a data science problem to illustrate a data science article.
Finding a dataset
The first step was to find an appropriate dataset. Though co-authorships lingered on my mind, I did not have full access to the database used by Beauchesne. I went through a few alternatives: cross-linked websites on the berkeley.edu network, cross-listed classes, faculty appointments. While interesting, none of those sources of information captured collaboration explicitly. Finally, I made a fortuitous discovery – the PubMed database (5) allows searches by affiliation. Just as I was dreading writing a tedious script to parse the HTML returned by a PubMed search, it turned out that I could export those search results in XML format, a much easier structure to understand and pull information from. We were in business.
I retrieved all articles published in the past 20 years (1994-2014) that listed “University of California, Berkeley” (and several variants) in the affiliations. PubMed’s listings dropped rapidly for articles before 1994, likely due to indexing problems, and twenty years seemed like a good range. I then loaded the data in my favorite exploratory tool – IPython Notebook (6), created by UC Berkeley’s own Fernando Perez and his team of pythonistas. I parsed the data using a Python library called BeautifulSoup (7). It’s a library for working with web data that I picked up from the excellent visualization blog FlowingData (8). I used it to get a list of affiliations for each article, and only kept affiliations that included UC Berkeley. That list included 3826 articles.
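For readers who want to try something similar, here is a minimal sketch of the filtering step. It is not the code used for the cover: it swaps BeautifulSoup for Python's standard-library xml.etree.ElementTree so that it runs without extra installs, the sample XML is a made-up, heavily trimmed stand-in for a real PubMed export, and the function name and the list of Berkeley name variants are my own assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the general shape of a PubMed XML
# export. Real exports carry many more fields per article.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author><AffiliationInfo>
            <Affiliation>Department of Chemistry, University of California, Berkeley, CA</Affiliation>
          </AffiliationInfo></Author>
          <Author><AffiliationInfo>
            <Affiliation>Department of Physics, UC Berkeley, CA</Affiliation>
          </AffiliationInfo></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <AuthorList>
          <Author><AffiliationInfo>
            <Affiliation>Department of Biology, Stanford University, CA</Affiliation>
          </AffiliationInfo></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Assumed spellings of the campus name; a real search would need more variants.
BERKELEY_VARIANTS = ("university of california, berkeley", "uc berkeley")


def berkeley_affiliations(xml_text):
    """Map each article's PMID to its affiliations that mention UC Berkeley."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        # iter() finds Affiliation elements at any nesting depth.
        affiliations = [(el.text or "").strip() for el in article.iter("Affiliation")]
        kept = [a for a in affiliations
                if any(variant in a.lower() for variant in BERKELEY_VARIANTS)]
        if kept:  # drop articles with no Berkeley affiliation at all
            result[pmid] = kept
    return result


berkeley = berkeley_affiliations(SAMPLE_XML)
```

In this toy sample, only the first article survives the filter, with both of its affiliations kept.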
Cleaning the data
It was time to switch to manual cleaning. The affiliations were quite inconsistent: non-standard department names, various ways of listing affiliations for multiple authors, even misspellings of “Department.” Next time I publish a paper, I might fill out that little field more carefully. I kept articles with multiple Berkeley affiliations, including a few major independent collaborators, such as Lawrence Berkeley National Lab and Howard Hughes Medical Institute. I then loaded the affiliations for the remaining 611 articles back into IPython Notebook and created a co-occurrence matrix characterizing co-authorships between programs. With the matrix in hand, it was time to make the first graphic draft.
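A co-occurrence matrix of this kind can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the notebook code behind the cover: the program names and the three toy articles are invented, and the function name is hypothetical. Each cell counts how many articles list both programs.

```python
from itertools import combinations


def cooccurrence_matrix(articles):
    """Count, for every pair of programs, the articles that list both.

    `articles` is a list of lists of cleaned program names, one per article.
    Returns the sorted program names and a symmetric matrix of counts.
    """
    programs = sorted({p for progs in articles for p in progs})
    index = {name: i for i, name in enumerate(programs)}
    size = len(programs)
    matrix = [[0] * size for _ in range(size)]
    for progs in articles:
        # set() so a program listed twice on one article is counted once.
        for a, b in combinations(sorted(set(progs)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return programs, matrix


# Invented toy data: each inner list is one article's cleaned affiliations.
ARTICLES = [
    ["Chemistry", "Physics"],
    ["Chemistry", "Physics", "Bioengineering"],
    ["Bioengineering", "Chemistry"],
]

programs, matrix = cooccurrence_matrix(ARTICLES)
```

The diagonal stays at zero here because a program is never paired with itself; whether to count within-program papers is a design choice that depends on what the graphic should emphasize.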
The first draft ended up looking colorful and overwhelming. After further data cleaning and forgoing color, the graph looked more manageable. However, it was still impossible to know what it represented. To get a glimpse at the meaning behind the lines, I included the department names in the graph, resulting in the least attractive draft of all.
Designing the graphic
Satisfied that the visualization was sensible, I focused on aesthetics. The number of programs and departments was too large to encode visually. I assigned each to one of six disciplines (physical sciences, engineering, biological sciences, social sciences, math and computer sciences, and health and medicine). This may have been the hardest part of the process – many programs don’t fit into a discipline neatly – but it was a design compromise that had to be made. I encoded each discipline with a color, and the penultimate draft was born.
In the final graph, the programs are grouped by discipline. The number of interdisciplinary connections across Berkeley is truly striking, and the colorful collaboration network made for a beautiful magazine cover. The abstract visualization in print is enriched by the interactive version on the website – hover over each program’s sector, and its name is displayed. To see the interactive graphic in action, see here.
There are of course some caveats to this visualization, much like any project based on real-world data. For example, PubMed skews heavily toward biomedical research, so the mathematical sciences are underrepresented. Further, an article's affiliation list can reflect a single faculty member with multiple appointments as well as work by authors from different departments, and the two cases are not distinguished here. But those considerations aside, the graphic provides a fascinating view of the strength of collaboration among scientists at UC Berkeley. Throughout my time in graduate school, I’ve felt that this feature is at the core of UC Berkeley’s academics. It is inspiring to see that hunch confirmed visually.