Imagine you have thousands of people commenting on a Facebook page, many of them multiple times. It would be nice to construct a network from this activity; however, the computations are heavy due to the large number of people, so a sample is needed. A question arises immediately: how would you draw it? Network sampling is known to be a tough task. There are two quick answers: sample by the people who make comments, or sample by posts. Which way is better? An intuitive suggestion would be to sample by people, because one can then investigate the whole communication pattern as people talk to each other throughout the page. Still, it is possible (by chance) to lose an important person who is very active and ties the network together. When posts are sampled, it is possible that all typical topics are represented in the sample, and due to homophily effects people are likely to cluster within these conversations. The risk here is to lose an important post and thereby toss out many comments and dialogues.
Let's check both ways!
For the exercise I use the Euromaidan data for the first week of January 2014. I run permutations and draw 10 different sample sets (this is enough to see the pattern). I then generate 10 networks per permutation and check whether the mean density and betweenness correspond to the true values in the complete dataset. I also vary the sample size: I run 10 permutations for samples of 5%, 20%, 40%, 60%, and 80% of the total data. I draw samples by commentators and by posts separately.
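The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the original analysis code: it assumes the comments are stored as (commenter, post) pairs, and the function names are my own. Two people are tied if they commented on the same post (a co-commenting projection), and a sample keeps only comments whose person (or post) falls into a random subset of the chosen size.

```python
import random
from itertools import combinations

def project_network(comments):
    """Build a person-to-person edge set: two commenters are tied
    if they commented on the same post (co-commenting projection)."""
    by_post = {}
    for person, post in comments:
        by_post.setdefault(post, set()).add(person)
    edges = set()
    for people in by_post.values():
        edges.update(combinations(sorted(people), 2))
    return edges

def sample_comments(comments, frac, by="person", rng=random):
    """Keep only comments whose person (or post) falls into a random
    subset containing `frac` of all persons (or posts)."""
    idx = 0 if by == "person" else 1
    units = sorted({c[idx] for c in comments})
    keep = set(rng.sample(units, max(1, int(frac * len(units)))))
    return [c for c in comments if c[idx] in keep]
```

Running `project_network(sample_comments(comments, 0.2, by="person"))` ten times per permutation and averaging density and betweenness over the resulting networks reproduces the experiment at a given sample size.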
The first graph shows the results for betweenness centrality. It is evident that sampling by commentators works better: sampling 20% of the people is enough to get the same results as in the original data. The samples by posts work very poorly, which is quite logical: many posts are dropped by chance, so the different classes of conversations are not all selected, reducing the likelihood that a node sits on a path between different classes.
However, the story is not so bright for density. Both sampling schemes work pretty badly: one must select 80% of the people to get even close to the real value. Again, this makes sense. Since density is the ratio of actual connections to all possible connections, it is intuitive that any sample will, by chance, toss out a number of nodes along with their connections.
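A toy calculation shows why density is so fragile. This is an illustrative sketch with a made-up five-node graph, not the Euromaidan data: dropping a single well-connected node from the sample removes every tie attached to it, so the induced subgraph's density falls well below the true value.

```python
def density(edges, nodes):
    """Density = actual ties / possible ties among `nodes`."""
    n = len(nodes)
    possible = n * (n - 1) / 2
    return len(edges) / possible if possible else 0.0

# Toy graph: a hub tied to everyone, plus one extra tie.
nodes = ["hub", "a", "b", "c", "d"]
edges = {("hub", x) for x in "abcd"} | {("a", "b")}
print(density(edges, nodes))   # 5 ties of 10 possible = 0.5

# Drop the hub from the sample: the induced subgraph keeps only
# ties whose endpoints both survive.
kept = [n for n in nodes if n != "hub"]
sub = {e for e in edges if e[0] in kept and e[1] in kept}
print(density(sub, kept))      # 1 tie of 6 possible ~ 0.17
```

Losing one node out of five here cuts the measured density roughly in half of its true value, which mirrors why even large samples underestimate it.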
It is nice to have robust samples for betweenness, since the diffusion of information can then be studied. The fact that density is not reproduced is upsetting: it looks like some dyad and triad effects can be missed due to the wrong sampling.
Another thing to check is the weight of each individual object selected for a sample. For instance, this dataset contains 330 posts. Imagine you toss out only one of them: does it really hurt? How many edges will you lose? It turns out, a lot! The minimum number of edges lost is 1 and the maximum is… 360. So in the worst-case scenario, just by chance, you can lose hundreds of ties by failing to select particular posts. For samples by people, the maximum number of ties "killed" by not selecting one person in my data is 70. This explains why samples by people are more efficient for betweenness centrality, and why both sampling schemes are so fragile for density.
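This check can also be sketched in code. The sketch below assumes the same (commenter, post) pair format as before and counts, for each post, the co-commenting ties that exist only because of that post, i.e. the ties that vanish outright if the post is not sampled. The function name and toy data are my own.

```python
from itertools import combinations

def ties_lost_per_post(comments):
    """For each post, count co-commenting ties supported by that
    post alone (removing the post deletes them outright)."""
    by_post = {}
    for person, post in comments:
        by_post.setdefault(post, set)().add(person) if False else by_post.setdefault(post, set()).add(person)
    # Which posts support each person-to-person tie?
    support = {}
    for post, people in by_post.items():
        for pair in combinations(sorted(people), 2):
            support.setdefault(pair, set()).add(post)
    return {
        post: sum(1 for pair in combinations(sorted(people), 2)
                  if support[pair] == {post})
        for post, people in by_post.items()
    }

# Toy data: p1 has three commenters, p2 repeats two of them.
comments = [("a", "p1"), ("b", "p1"), ("c", "p1"),
            ("a", "p2"), ("b", "p2")]
print(ties_lost_per_post(comments))  # {'p1': 2, 'p2': 0}
```

In the toy data, dropping p1 kills the two ties involving "c", while dropping p2 kills nothing because its tie is duplicated on p1. Applying this to the real data yields the 1-to-360 range for posts and the maximum of 70 for people.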