Final design, presentation and a wrap up

Yesterday we presented the final result of the project. The slides are available on GitHub: http://htmlpreview.github.io/?https://github.com/sergejkaiser/reveal.js/blob/master/presentation2.html#/14

Our final design solves the visual overload with edge bundling and highlights nodes when they are clicked. A button lets the user pick the year, which loads the corresponding JSON file. Additionally, the user can set the bundling tension interactively. Changing the tension shifts the focus from local to global features of the data.

Final_edge_bundle
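To give an intuition for what the tension does, here is a minimal sketch in R. It assumes the behaviour of bundle-style curves such as D3's, where every control point of an edge is blended with the straight line between the edge's endpoints. The function `straighten()` and the sample path are made up for illustration and are not taken from our D3 code.

```r
# Sketch: blend the control points of a bundled edge with the straight
# chord between its endpoints. tension = 1 keeps the bundled path,
# tension = 0 collapses it onto the straight line.
straighten <- function(path, tension) {
  n <- nrow(path)
  t <- seq(0, 1, length.out = n)   # relative position along the edge
  chord <- cbind(path[1, 1] + t * (path[n, 1] - path[1, 1]),
                 path[1, 2] + t * (path[n, 2] - path[1, 2]))
  tension * path + (1 - tension) * chord
}

# hypothetical control points (x, y) of one edge routed via a group centre
path <- rbind(c(0, 0), c(1, 2), c(2, 2), c(3, 0))
bundled  <- straighten(path, 1)    # unchanged control points
straight <- straighten(path, 0)    # all points on the line from (0,0) to (3,0)
halfway  <- straighten(path, 0.5)  # in between
```

With a high tension the edges follow the group hierarchy closely (global structure), with a low tension individual connections become easier to trace (local structure).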

Moreover, we feel that certain questions about the visualisation need a response.

The visual design focuses on nodes and groups (defined by coreness). Implementing it in D3 and creating the necessary JSON files was a big struggle. We think that the radial design is a starting point; with additional filters it could provide a better understanding of the tapestry complex.

Certain extensions, such as data filters and color coding of the edges based on categorical information in the edges data set, were beyond our D3 abilities.

Another extension we thought of is using different levels of opacity to indicate the degree of missing information in the edges data set. Conceptually, encoding that information in the edge opacity is not ideal: we would need to decide whether to use the sender's or the receiver's level of missingness, or a mean of both.
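The three options can be compared in a few lines of R. This is only a sketch under assumptions: the `missing` share per actor, the tiny edge list and the minimum opacity of 0.15 are all hypothetical and not part of our data or design.

```r
# hypothetical share of missing fields per actor (0 = complete, 1 = nothing known)
actors <- data.frame(id = c(1, 2, 3), missing = c(0.0, 0.5, 1.0))
edges  <- data.frame(source = c(1, 1, 2), target = c(2, 3, 3))

# look up sender and receiver missingness for every edge
m_source <- actors$missing[match(edges$source, actors$id)]
m_target <- actors$missing[match(edges$target, actors$id)]

# opacity is 1 - missingness, with a lower bound so that edges never vanish
opacity <- function(m, min_alpha = 0.15) pmax(1 - m, min_alpha)

edges$alpha_sender   <- opacity(m_source)
edges$alpha_receiver <- opacity(m_target)
edges$alpha_mean     <- opacity((m_source + m_target) / 2)
edges
```

Already in this toy example the three encodings disagree for every edge, which is exactly why the choice would need a conceptual justification.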

Finally, the search problem of designing a visualisation for a complex data set with much missing information, a temporal dimension and uncertainty about the quality of the underlying data is not solved.

 

Tapes’try: Missingness

As we showed in our data overview, there is a lot of missingness in this data. Since the most interesting information for our visualisations is the group affiliation of actors and the places where these actors lived, we focused on the parish data and the group data (which also includes places):

  • Parish-data: That is church-affiliation (interesting for location), birth-year, marriage-year, death-year
    • Generally, missingness is nearly identically distributed for all variables of the parish data
  • Group-affiliation: To which group did the actor belong?
  • Places: Where did this actor live?
    • Missingness in place and group affiliation is also nearly identically distributed

To get an overview of what kind of data and how much data we have on actor level, we created an index whose value depends on which combination of these three data sources is available for the actor:

Table 1: Missingness on actor level
           NA     Pl     G      G,P    P      P,Pl   G,P,Pl
absolute   1924   258    1      1095   739    29     51
relative   0.47   0.06   0      0.27   0.18   0.01   0.01
NA – everything is missing, Pl – Place data available, G – Group affiliation data available, P – Parish data available
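For readers who want to build such an index themselves, here is a minimal sketch in R. The data frame `actors` and its three logical flags are hypothetical stand-ins for "the actor appears in the parish, group or place data"; only the category codes are the ones used in Table 1.

```r
# hypothetical flags: is an actor present in the parish (P), group (G) or place (Pl) data?
actors <- data.frame(
  id     = 1:6,
  parish = c(FALSE, TRUE,  TRUE,  FALSE, TRUE, FALSE),
  group  = c(FALSE, FALSE, TRUE,  FALSE, TRUE, TRUE),
  place  = c(FALSE, FALSE, FALSE, TRUE,  TRUE, FALSE)
)

# build the label from whichever sources are available, "NA" if none is
availability <- function(group, parish, place) {
  label <- paste0(ifelse(group, "G,", ""),
                  ifelse(parish, "P,", ""),
                  ifelse(place, "Pl,", ""))
  label <- sub(",$", "", label)          # drop the trailing comma
  ifelse(label == "", "NA", label)
}
actors$avail <- availability(actors$group, actors$parish, actors$place)

counts <- table(actors$avail)
counts                            # absolute frequencies
round(prop.table(counts), 2)      # relative frequencies
```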

Not only do we lack any data for about half of the actors; both group and parish data (G,P) are available for only about one fourth of them. This sounds pretty bad. But is it?

This does not tell us how the missingness is distributed over the network. Missingness could be high for frequently mentioned actors, or it could be high for rarely mentioned actors. Either way, we need to look at the tie level for a final evaluation of missingness:

 

Table 2: Absolute missingness on tie level (rows: source, columns: target)
Source \ Target   NA      Pl      G     G,P     P      P,Pl   G,P,Pl   Sum
NA                8019    3926    3     1505    2966   698    2271     19388
Pl                3932    4076    –     221     1287   640    1275     11431
G                 3       –       –     9       –      –      –        12
G,P               1517    223     9     13662   139    109    281      15940
P                 2966    1287    –     139     2510   507    1598     9007
P,Pl              698     640     –     109     507    326    586      2866
G,P,Pl            2271    1275    –     281     1598   586    880      6891
Sum               19406   11427   12    15926   9007   2866   6891     65535
NA – everything is missing, Pl – Place data available, G – Group affiliation data available, P – Parish data available; a dash marks an empty cell

 

This does not reveal much unless you are pretty good at calculating fractions in your head. Together with the relative distribution, however, this table should help:

Table 3: Relative missingness on tie level (rows: source, columns: target)
Source \ Target   NA     Pl     G     G,P    P      P,Pl   G,P,Pl   Sum
NA                0.12   0.06   0     0.02   0.05   0.01   0.03     0.3
Pl                0.06   0.06   –     0      0.02   0.01   0.02     0.17
G                 0      –      –     0      –      –      –        0
G,P               0.02   0      0     0.21   0      0      0        0.24
P                 0.05   0.02   –     0      0.04   0.01   0.02     0.14
P,Pl              0.01   0.01   –     0      0.01   0      0.01     0.04
G,P,Pl            0.03   0.02   –     0      0.02   0.01   0.01     0.11
Sum               0.3    0.17   0     0.24   0.14   0.04   0.11     1
NA – everything is missing, Pl – Place data available, G – Group affiliation data available, P – Parish data available; a dash marks an empty cell
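The tie-level tables are cross-tabulations of the source actor's and the target actor's availability category. A minimal sketch of how such a table can be produced in R follows; the named vector `avail` and the tiny `ties` data frame are hypothetical, only the category codes come from the tables above.

```r
# hypothetical availability category per actor ID
avail <- c("1" = "NA", "2" = "Pl", "3" = "G,P", "4" = "G,P,Pl")
ties  <- data.frame(Source = c(1, 1, 2, 3, 3, 4),
                    Target = c(2, 3, 3, 4, 1, 4))

# fix the order of the categories so that rows and columns line up
lev <- c("NA", "Pl", "G", "G,P", "P", "P,Pl", "G,P,Pl")
src <- factor(avail[as.character(ties$Source)], levels = lev)
tgt <- factor(avail[as.character(ties$Target)], levels = lev)

abs_tab <- table(Source = src, Target = tgt)
addmargins(abs_tab)                          # absolute counts with row and column sums
round(addmargins(prop.table(abs_tab)), 2)    # shares of all ties
```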

 

Generally, the distribution over network sources and targets is quite similar to the one over actors. However, only 12 % of ties now lack all information for both source and target. Especially interesting is that the proportion of complete missingness among source and target actors (30 %) is lower than on actor level (47 %), indicating that actors with more ties are indeed better researched.

However, the proportion of ties where there is complete information in the network is about the same as on actor level. Hence we can conclude that while there is more information on tie level, the additional information is only partially better distributed.

What do we learn from this exercise?

  • By merging place-data we should be able to locate both actors for about 70 % of the network.
  • Additionally, the last table shows that imputation should be possible at least to some extent, given that data is missing for both source and target in only 12 % of the ties.
  • Furthermore this overview should provide us with some foundation for any missingness-feature in our final design.

Multi-relations network plot

This post is about visualising the relations and their timing in our historical data set about the Antwerp-Brussels-Oudenaarde tapestry complex (more details about the data set are on the project website).

This plot is a first attempt to visualise the complex data. As a first step we therefore reduced the roles of each vertex to the following categories: “tapissier”, “mother”, “father”, “child”, “painter”, “legatee”, “erfgenaam”. We chose these categories because they appear many times in the data set, which makes them a good starting point.

Before we show the network visualisations, we take a first look at a basic histogram of how the edges in our network evolve over time. This step turned out to be very helpful in deciding how to plot the dynamic network.

hist_edges_time

Based on this graph, we decided to work with discrete time steps of 10 to 20 years to capture the evolution of the network over time. We used color coding to represent the different types of roles each person may take.
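As a small illustration of the binning, the following sketch assigns edge years to 20-year slices. The vector `years` is hypothetical (the real values come from the Year column of the edge list), and the start year 1595 is taken from the first slice we plot.

```r
# hypothetical edge years
years <- c(1596, 1601, 1614, 1616, 1630, 1650, 1655, 1672, 1699)

# 20-year slices starting in 1595; the interval (1595,1615] contains 1596..1615
breaks <- seq(1595, 1715, by = 20)
slice  <- cut(years, breaks = breaks, dig.lab = 4)
slice_counts <- table(slice)
slice_counts

# the same breaks can be handed to hist() to get the histogram per slice
h <- hist(years, breaks = breaks, plot = FALSE)
h$counts
```

Looking at the counts per slice first makes it easy to see which slices will produce sparse and which will produce very dense network pictures.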

 

 

The network visualisation starts with the years 1595-1615, and each picture is a step of 20 years into the future. The size of each node is the log of its degree (the number of edges connected to the node).

A drawback of the visualisation is that it is quite difficult to interpret beyond the basic insights we already saw in the histogram. In particular, the third and fourth network visualisations are too dense to be useful for visual analytics.

Hence, we need to look beyond classical node-link diagrams for our task.

For the interested reader, here are the relevant sections of the R code. For the plotting part we adapted the following great contribution.

The idea of plotting the dynamic network in discrete time slices, and its implementation, are inspired by the examples in “Statistical Analysis of Network Data with R”, Ch. 10, by Eric D. Kolaczyk and Gábor Csárdi.

# edges.to.network is the edges data frame with numerical columns Source and Target
# (the outgoing and the incoming node ID) and the edge attributes Year and Label.
# vids is the data frame with the relevant node information.
library(igraph)
library(plyr)
library(dplyr)

# load network structured data
edges.to.network = read.csv('Edges-to-Network.csv', header=T)

# reduce number of roles: first define the role vector, then subset the
# edges-to-network data frame
roles.inc = c('tapissier', 'mother', 'father', 'child', 'painter', 'legatee', 'erfgenaam')
edges.to.network$Rol1 = as.character(edges.to.network$Rol1)
edges.to.network$Rol2 = as.character(edges.to.network$Rol2)
etn.net = edges.to.network[edges.to.network$Rol1 %in% roles.inc,]
etn.net = etn.net[etn.net$Rol2 %in% roles.inc,]

# create edge attribute Time: years since the minimum year
etn.net$Time = etn.net$Year - min(etn.net$Year, na.rm=T)

# vertex data frame. ASSUMPTION: the listing does not show how the Status column
# was filled; here each actor simply gets the first role recorded for them.
roles = rbind(data.frame(id=etn.net$Source, Status=etn.net$Rol1),
              data.frame(id=etn.net$Target, Status=etn.net$Rol2))
vids = roles[!duplicated(roles$id),]
vids = vids[order(vids$id),]

g.week = graph_from_data_frame(etn.net[, c('Source', 'Target', 'Year', 'Label', 'Time')],
                               vertices=vids, directed=T)

# cut the network into eight slices of 20 years each
g.sl10 = lapply(1:8, function(i) {
  g = subgraph.edges(g.week,
                     E(g.week)[Time > 20*(i-1) & Time <= 20*i],
                     delete.vertices=FALSE)
  simplify(g)
})

# output for each frame will be a png of size 1600x900
png(file='try2%03d.png', width=1600, height=900)

# initial layout, taken from the first slice that is plotted
layout.old = layout_with_fr(g.sl10[[2]])

# time loop: the first number in seq is the starting value, the second the
# end value, the third the step size
for(time in seq(2,7,1)){
  gt = g.sl10[[time]] # use only the network present at t=time
  # color code the roles
  V(gt)$color[V(gt)$Status=='tapissier'] = '#66C2A5' # lime green
  V(gt)$color[V(gt)$Status=='mother'] = '#FC8D62' # orange
  V(gt)$color[V(gt)$Status=='father'] = '#8DA0CB' # lilac
  V(gt)$color[V(gt)$Status=='child'] = '#E78AC3' # very soft pink
  V(gt)$color[V(gt)$Status=='painter'] = '#A6D854' # moderate green
  V(gt)$color[V(gt)$Status=='legatee'] = '#FFD92F' # vivid yellow
  V(gt)$color[V(gt)$Status=='erfgenaam'] = '#E5C494' # very soft orange
  # with the new graph, we update the layout a little bit
  layout.new = layout_with_fr(gt, coords=layout.old, niter=10, start.temp=0.05, grid='nogrid')
  # plot the new graph (isolated vertices have degree 0, so their log size is -Inf)
  plot(gt, layout=layout.new, vertex.label='', vertex.size=log(degree(gt)),
       vertex.frame.color=V(gt)$color, edge.width=1.5, asp=9/16, margin=-0.15)
  # use the new layout in the next round
  layout.old = layout.new
}
dev.off()

tapes’try underestimated actors (#2.2)

Generally, any kind of variable that captures missingness will likely take the same value for about 80 % of the actors and thus only distinguish one fifth of the actors from the rest. We therefore decided to start with the underestimated-actor feature, since underestimated actors also tell a story about missingness.

Underestimated actors are less researched, so they will not have the most ties. Still, they have enough ties to be very important to their level-1 neighbours. Ideally they link actors (or even groups of actors) who would otherwise have no connection. These actors are thus victims of missingness in the sense that they are in the network, but there are too few ties (or too little data) linking them to others.

To simplify the feature I ignore all data other than source and target. I can still add the year, place or sex of the tie later on. So how can we find these underestimated actors? Some ideas:

Look at betweenness-centrality:
Betweenness-centrality captures how many shortest paths of the network go through this specific node (or actor). Generally we would expect the nodes with more ties to have a higher betweenness-centrality. Thus this measure naturally overestimates the actors with the most ties.

Calculate the proportion of in-degree/out-degree:
Actors with a higher proportion of incoming ties are more popular, and the higher the proportion, the more important these actors should be in their immediate ’neighbourhood’. In contrast to betweenness-centrality, we can expect this measure to overestimate actors with few ties. However, there is a caveat: in-degree and out-degree in our network are exactly the same for 99 % of the actors (why? We don’t know. Yet!)

Find the articulation points:
An articulation point is an actor whose ties bind two otherwise separate network components: removing that actor would split the network. However, as before, this measure will overestimate actors with few ties, since actors with few ties are more likely to be articulation points than actors with a lot of ties.
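The three ideas above can be tried out with igraph. The sketch below uses a made-up toy network (two triangles joined through actor "D"), not our data. It also uses the share of incoming ties among all ties of an actor rather than the raw in/out ratio, which is our own choice here to avoid dividing by zero.

```r
library(igraph)

# hypothetical toy network: two triangles joined through actor "D"
edges <- data.frame(
  from = c("A", "B", "C", "C", "D", "E", "F", "G"),
  to   = c("B", "C", "A", "D", "E", "F", "G", "E")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# 1. betweenness-centrality: shortest paths running through each actor
bet <- betweenness(g, directed = FALSE)

# 2. share of incoming ties among all ties of an actor
indeg    <- degree(g, mode = "in")
outdeg   <- degree(g, mode = "out")
in_share <- indeg / (indeg + outdeg)

# 3. articulation points: removing one of them disconnects the network
art <- articulation_points(g)

data.frame(betweenness  = bet,
           in_share     = in_share,
           articulation = V(g)$name %in% art$name)
```

In the toy network the actors C, D and E are the articulation points, and D, which has only two ties, has the highest betweenness.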

The easiest to implement for a start are the articulation points. In the following pictures I plotted the sub-network of all nodes that have ties to the articulation points:

The network on the left was laid out with the Fruchterman-Reingold algorithm; the network on the right-hand side is the same network, laid out with the Kamada-Kawai algorithm. In red we can see how the articulation points glue components together.
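A sub-network of this kind, with both layouts, can be sketched as follows. The toy graph is hypothetical and much smaller than our network; "ties to the articulation points" is interpreted here as being at most one step away from one of them.

```r
library(igraph)

# hypothetical toy network: a triangle and a five-cycle joined through actor "D"
g <- graph_from_literal(A-B-C-A, C-D-E, E-F-G-H-I-E)

art <- articulation_points(g)

# keep the articulation points and everyone at most one step away from one
d    <- distances(g, v = art)
keep <- unname(which(apply(d, 2, min) <= 1))
sub  <- induced_subgraph(g, keep)

# articulation points in red, their neighbours in grey
V(sub)$color <- ifelse(V(sub)$name %in% art$name, "red", "grey80")

set.seed(1)
lay_fr <- layout_with_fr(sub)   # Fruchterman-Reingold
lay_kk <- layout_with_kk(sub)   # Kamada-Kawai

# uncomment to draw both layouts side by side
# par(mfrow = c(1, 2))
# plot(sub, layout = lay_fr, vertex.label = NA)
# plot(sub, layout = lay_kk, vertex.label = NA)
```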

 

Summary statistics for the all-degree distribution of (sub-)networks
          full.netw   art.points†   bet.cent†
Min.      0           10            8
1st Qu.   0           22            22
Median    8           34            53
Mean      32          88            168
3rd Qu.   28          74            198
Max.      4060        2424          4060
† Subsamples (all articulation points, or all nodes with betweenness-centrality > 0)

 

Generally, articulation points and nodes with a positive betweenness-centrality have higher all-degrees than the full network. While the lower half of the distribution is somewhat similar for both subsamples, in the upper half the all-degrees of the betweenness subsample are far higher than those of the articulation points. This suggests that betweenness-centrality is positive for a bigger set of nodes. As a measure for underestimated actors it will therefore also flag the highest-degree nodes (its maximum, 4060, is the maximum of the full network). The articulation points seem more balanced in this sense: they cover nodes from around the median of the whole network up to high values, without covering all the highest values as well.
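A comparison like the one in the table can be computed along these lines. Again this is only a sketch on a hypothetical toy graph; the column names mirror the table, everything else is made up.

```r
library(igraph)

# hypothetical toy network with one isolated actor "J"
g <- graph_from_literal(A-B-C-A, C-D-E, E-F-G-H-I-E, J)

deg <- degree(g, mode = "all")
art <- articulation_points(g)
bet <- betweenness(g)

# all-degree of the full network and of the two subsamples
subsets <- list(full.netw  = deg,
                art.points = deg[art$name],
                bet.cent   = deg[bet > 0])
deg_summary <- sapply(subsets, summary)
round(deg_summary, 1)
```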

 

 

 

Five sheet design #1

Before programming any data visualisations, we started by thinking divergently about visualisation ideas and drawing sketches. Some of the sketches are on the first sheet below. In the next step, we eliminated many ideas that we did not find satisfying, e.g. a timeline with the five or ten most cited artists, a pie chart of missing values, etc.
In the next steps, we sketched three ideas in greater detail. The first sketch is a tree summarising information about actors and places (or occupations). The places or occupations are named in the middle of the tree, and each actor is in a circle in the center. Lines are drawn for every actor-place or actor-occupation connection. Further, we thought of possible refinements such as a timeline animation or a font size that varies with the number of citations.

Our second design focuses on how much information is missing for each actor. The big idea of the sketch is to create an overview of the distribution of missingness. For example, we would create categories illustrating how many actors have 25 % or 50 % missing records. Within each circle, a dot would refer to an actor with that particular degree of missing records. We might extend this design by relating missingness to places or occupations (think of shares within each circle for different places/occupations).

Our third sketch is a multi-dimensional network. The idea is that we have different network layers for every broad category of social relations (e.g. social/economic and artistic). These categories will be based on the detailed historical data in our data set. The visualisation's goal is to shed some light on the association between social relations and the degree of missing records.
Concretely, the network would visualise all actors and the connections between them. Each layer would represent connections of one type, e.g. economic ties. Further, we would use transparency to indicate how much information we are missing about a particular actor in the network. The position of each actor within a layer would reflect their importance. We thought of extensions like a timeline animation and a filter to display only important nodes with many connections.

In our next blog post, we will describe which of the three more detailed sketches we intend to implement, along with some thoughts about the next steps.

Tapes’ try #2

Over the last seven days we looked through the database and realised that some tables were duplicates and others were not interesting for visualisation. We exported those we deemed important and started to investigate them…

In general, our data set consists of information about the actors and their social structure in the Brussels Oudenarde tapestry complex between 1600 and 1700. By social structure we refer to family, baptism, workspace, collaborations and financial ties. Further, we have information about several hundred works of tapestry art, the tapestry artists and a classification of the artworks.

Finally we found five tables worth considering for visualisation. For each of them we created a summary table listing, per variable, the number of distinct values (# categories), the data type and the share of missing values.
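Such a summary can be generated with a small helper function. The sketch below is not the code we used; the function `describe()` and the miniature `nodes` data frame are hypothetical, and only `NA` values are counted as missing (empty strings would need extra handling).

```r
# hypothetical miniature of an exported table
nodes <- data.frame(
  Id      = 1:5,
  Gender  = c("m", "f", "m", "m", "f"),
  GebJaar = c(1610, NA, NA, 1642, NA),
  stringsAsFactors = FALSE
)

# one row per variable: distinct non-missing values, data type, share of NAs
describe <- function(df) {
  data.frame(
    variable   = names(df),
    categories = sapply(df, function(x) length(unique(x[!is.na(x)]))),
    data.type  = sapply(df, function(x) if (is.numeric(x)) "numerical" else "categorical"),
    missing    = round(sapply(df, function(x) mean(is.na(x))), 2),
    row.names  = NULL
  )
}
overview <- describe(nodes)
overview
```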

In the following we present the tables (in descending order of importance) and explain some of the (non-obvious) variables along the way. In addition we visualise some characteristics of the (few) most central variables.

 
Variable names   # categories   data type     missing (share)
ID_sender        2291           numerical     0.00
ID_target        2290           numerical     0.00
Rol1_Rol2        5377           categorical   0.00
Source           836            categorical   0.00
Year             122            numerical     0.00
Naam1            2291           categorical   0.00
Rol1             204            categorical   0.00
Rol2             201            categorical   0.00
Naam2            2290           categorical   0.00
TypeActe         20             categorical   0.00
City             14             categorical   0.00

Table 1 shows the summary statistics of the main table of the data set. There are no missing values and most variables are factors. Most important are definitely the connection types (here called “Rol1_Rol2”), which describe what role ID_sender has for ID_target and vice versa. In addition, Naam1 and Naam2 give the names.

One of the most central and most important variables is what we called here “Rol1-Rol2”. Thus we were interested to know how these combinations of roles are distributed. We expected them to have a heavy tail on the right hand side. But the extent was still somewhat surprising.

The graph indicates that around 3/4 of the labels account for only 1/4 of the cumulative relative frequency. In other words, many labels occur only rarely, whereas a small share of the labels occurs very frequently. This skewed distribution might help to visualise the network as a multidimensional social network, since that kind of visualisation is insightful only for a reasonable number of dimensions (layers).

Cum. distribution of  Label categories
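The cumulative curve can be reproduced with a few lines of R. The labels below are hypothetical (the role names exist in our data, the counts and the joined label format are made up), so this is a sketch of the computation rather than of our result.

```r
# hypothetical edge labels: one frequent label and several rare ones
labels <- c(rep("child_father", 12), rep("child_mother", 4),
            "legatee_tapissier", "painter_tapissier",
            "erfgenaam_father", "mother_painter")

freq <- sort(table(labels))                    # rarest labels first
cum  <- cumsum(as.vector(freq)) / sum(freq)    # cumulative relative frequency
share_of_labels <- seq_along(cum) / length(cum)

data.frame(label           = names(freq),
           count           = as.vector(freq),
           share_of_labels = round(share_of_labels, 2),
           cum_rel_freq    = round(cum, 2))

# plot(share_of_labels, cum, type = "s") draws the cumulative curve
```

In this toy example the four rarest labels (two thirds of all labels) account for only 20 % of the observations, which is the same kind of skew we see in the real data.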

 

Another central aspect is which roles actors have and how often each role is mentioned.

A second visualisation of the categories relates how many times each edge-label category is observed to how important the category is (as a percentage of the total count of categories).

Another visualisation we present here is the distribution of roles.
The graph shows that roles observed more than 700 times in the data set account for 50 percent of the total count of roles. Since only very few categories are observed in high numbers, the distribution is highly skewed.

 

The following table summarises other information on the actors.

Table 2: Nodes to Network

Variable names        # categories   data type     missing (share)
Id                    4097           numerical     0.00
Nodes                 3864           categorical   0.00
Label                 3864           categorical   0.00
First_Name            940            categorical   0.02
Surname               2014           categorical   0.00
Parishes              22             categorical   0.84
Parishes_1            16             categorical   0.95
Parishes_2            9              categorical   1.00
Parishes_4            10             categorical   0.98
Gender                2              categorical   0.00
GebJaar               128            numerical     0.84
MarJaar               62             numerical     0.95
MarJaar2              16             numerical     1.00
MarJaar3              2              numerical     1.00
BurJaar               55             numerical     0.98
Parishes_3_Parochie   2              categorical   1.00

Table 2 contains all the other information that is available for each actor. Label gives the actor's full name (Id is the ID number). The only (nearly) complete variable is gender. Parish and birth year have about 84 % missing values. The other variables consist mostly of missing data. FYI: yes, the variables Nodes and Label are exactly the same.

Table 3: Actors and places

Variable names    # categories   data type     missing (share)
IDWoonplaats      590            numerical     0.00
Persoon           413            numerical     0.00
Stad              31             numerical     0.01
Street            56             numerical     0.56
Huisnaam          10             numerical     0.95
Acte              370            numerical     0.00
Opmerkingen       99             categorical   0.73
Woning            2              binary        0.00
Production.Unit   2              binary        0.00
Dye.Works         2              binary        0.00
Extra.Muros       2              binary        0.00
Duur              14             numerical     0.01
Einde.huur        0              binary        1.00
Rollen            9              numerical     0.00
Pakhuis           2              binary        0.00

This table summarises the places where actors are known to have lived. The number of rows (the count of the first variable) already shows that for most actors the place of residence is missing. The source is given by ‘Acte’ and the actor ID by ‘Persoon’. Here ‘Rollen’ gives the roles. And even though most variables are numerical, they often refer to categories that were assigned a number.

Table 4: Actors and affiliated groups

Variable names   # categories   data type     missing (share)
IDmembership     2901           numerical     0.00
Acte             732            numerical     0.00
Persoon          1463           numerical     0.00
Role             108            numerical     0.01
Organisation     33             numerical     0.32
Plaats           15             numerical     0.02
Phase            2              categorical   0.52
Opmerkingen      123            categorical   0.94
Status           12             numerical     0.42

Table 4 summarises the table of actors who are known to be affiliated with some group. Actually, all variables are categorical here. The numeric variables are just categorical variables that were coded as numbers for some (unknown) reason. The actor is given by the ID number in ‘Persoon’. ‘Acte’ is the source, while ‘Organisation’ refers to the group.

Table 5: Actors and their work

Variable names        # categories   data type     missing (share)
IDeditie.cartoon      890            numerical     0.00
Source                115            categorical   0.00
ctor                  188            categorical   0.00
Rol                   25             categorical   0.00
Series                146            categorical   0.17
ProductionPlace       3              categorical   0.79
Location              15             categorical   0.50
Remarks               155            categorical   0.47
Fijn.werk             2              binary        0.00
Oud.werk              2              binary        0.00
Grof.werk             2              binary        0.00
Sijde.loecht          2              binary        0.00
Gold.silver           2              binary        0.00
Scenes.identified     2              binary        0.00
Borders.identified    2              binary        0.00
Reeksvariant          96             categorical   0.66
Rang                  3              numerical     0.00
Subjects.Identified   2              binary        0.00
Currency              2              categorical   0.93
Prijs1                11             numerical     0.79
Prijs2                7              numerical     0.79
Prijs3                2              numerical     0.78
Table 5 describes which actor made which tapestry and to which series it belongs. All the binary variables give details about the tapestries’ composition. The variable ‘Reeksvariant’ seems to indicate what kind of scene was created.

Tapes’ try #1

The data was delivered to us as a Microsoft Access database consisting of dozens of tables. About eight of them seem to contain most of the information. The biggest table describes, for all pairs of connected actors, who they were to each other (e.g. lover, wood-cutter, etc.) and in which year. Another table describes the works of the artists and what material they used. These two have among the lowest numbers of missing values. The other tables are rather sparse. E.g. the birth year is often missing, but you have some data about the parish (the community the person first lived in). Or you don’t. Missingness seems to be quite random.

Initially we wanted to give some idea of the missingness of the values and of the data in general by showing some summary statistics. However, as is also obvious from this short data description, we are still in the process of getting to know the various tables and understanding their structure.