We study a subset of the movie collaboration network from http://www.imdb.com, where only adult movies are included. We show that there are many benefits in using such a network, which can serve as a prototype for studying social interactions. We find that the strength of links, i.e., how many times two actors have collaborated with each other, is an important factor that can significantly influence the network topology. We see that when we link all actors in the same movie with each other, the network becomes small-world, lacking a proper modular structure. This occurs due to a large number of meaningless links, namely, links connecting actors that did not actually interact. On the other hand, by imposing a threshold on the minimum number of links two actors should have to be in our studied subset, the network topology becomes naturally fractal. We focus our analysis on the fractal and modular properties of this resulting network, and show that the renormalization group analysis can characterize the self-similar structure of these networks.
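
The link-strength threshold described in this abstract can be illustrated with a minimal sketch. The function name `collaboration_network`, the parameter `min_weight`, and the toy cast lists are assumptions made for illustration; they are not taken from the paper, and the paper's actual data processing may differ.

```python
from collections import Counter
from itertools import combinations

def collaboration_network(casts, min_weight=1):
    """Return {(actor_a, actor_b): weight} for actor pairs that appeared
    together in at least `min_weight` movies (hypothetical helper)."""
    weights = Counter()
    for cast in casts:
        # Link every pair of actors in the same movie; weight counts shared movies.
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}

# Toy cast lists (illustrative data, not from IMDb).
casts = [["A", "B", "C"], ["A", "B"], ["B", "C", "D"], ["A", "B", "D"]]

all_links = collaboration_network(casts, min_weight=1)     # every co-appearance
strong_links = collaboration_network(casts, min_weight=2)  # repeated collaborations only
```

Raising `min_weight` discards pairs that shared a single movie, which is the kind of link the abstract calls meaningless.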

Many large network data sets are noisy and contain links representing low-intensity relationships that are difficult to differentiate from random interactions. This is especially relevant for high-throughput data from systems biology, large-scale ecological data, but also for Web 2.0 data on human interactions. In these networks with missing and spurious links, it is possible to refine the data based on the principle of structural similarity, which assesses the shared neighborhood of two nodes. By using similarity measures to globally rank all possible links and choosing the top-ranked pairs, true links can be validated, missing links inferred, and spurious observations removed. While many similarity measures have been proposed to this end, there is no general consensus on which one to use. In this article, we first contribute a set of benchmarks for complex networks from three different settings (e-commerce, systems biology, and social networks) and thus enable a quantitative performance analysis of classic node similarity measures. Based on this, we then propose a new methodology for link assessment called z* that assesses the statistical significance of the number of common neighbors of two nodes by comparison with the expected value in a suitably chosen random graph model and which is a consistently top-performing algorithm for all benchmarks. In addition to a global ranking of links, we also use this method to identify the most similar neighbors of each single node in a local ranking, thereby showing the versatility of the method in two distinct scenarios and augmenting its applicability. Finally, we perform an exploratory analysis on an oceanographic plankton data set and find that the distribution of microbes follows similar biogeographic rules as those of macroorganisms, a result that rejects the global dispersal hypothesis for microbes.

Link prediction plays an important role in both finding missing links in networked systems and complementing our understanding of the evolution of networks. Much attention from the network science community is paid to figuring out how to efficiently predict the missing/future links based on the observed topology. Real-world information always contains noise, which is also the case in an observed network. This problem is rarely considered in existing methods. In this paper, we treat the existence of observed links as known information. By filtering out noise in this information, the underlying regularity of the connection information is retrieved and then used to predict missing or future links. Experiments on various empirical networks show that our method performs noticeably better than baseline algorithms.

Topological properties of networks have recently been widely applied to study the link-prediction problem. Common Neighbors, for example, is a natural yet efficient framework. Many variants of Common Neighbors have thus been proposed to further boost the discriminative resolution of candidate links. In this paper, we reexamine the role of network topology in predicting missing links from the perspective of information theory, and present a practical approach based on the mutual information of network structures. It not only improves the prediction accuracy substantially, but also has reasonable computational complexity.
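
For orientation, the baseline Common Neighbors index mentioned in this abstract can be sketched as follows. This is only the classic baseline, not the mutual-information approach the paper proposes; the function name `common_neighbors_scores` and the toy edge list are assumptions for illustration.

```python
def common_neighbors_scores(edges):
    """Score every non-adjacent node pair by its number of shared neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:  # only unobserved links are candidates
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# Toy undirected graph (illustrative data).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
scores = common_neighbors_scores(edges)

# Rank candidate links from most to least likely; ties broken by node labels.
ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
```

Pairs with equal scores, such as (2, 5) and (3, 5) here, cannot be separated by this index, which is the resolution problem that motivates the variants the abstract refers to.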

The effect of missing data on phylogenetic methods is a potentially important issue in our attempts to reconstruct the Tree of Life. If missing data are truly problematic, then it may be unwise to include species in an analysis that lack data for some characters (incomplete taxa) or to include characters that lack data for some species. Given the difficulty of obtaining data from all characters for all taxa (e.g., fossils), missing data might seriously impede efforts to reconstruct a comprehensive phylogeny that includes all species. Fortunately, recent simulations and empirical analyses suggest that missing data cells are not themselves problematic, and that incomplete taxa can be accurately placed as long as the overall number of characters in the analysis is large. However, these studies have so far only been conducted on parsimony, likelihood, and neighbor-joining methods. Although Bayesian phylogenetic methods have become widely used in recent years, the effects of missing data on Bayesian analysis have not been adequately studied. Here, we conduct simulations to test whether Bayesian analyses can accurately place incomplete taxa despite extensive missing data. In agreement with previous studies of other methods, we find that Bayesian analyses can accurately reconstruct the position of highly incomplete taxa (i.e., 95% missing data), as long as the overall number of characters in the analysis is large. These results suggest that highly incomplete taxa can be safely included in many Bayesian phylogenetic analyses.

The importance of lipids for cell function and health has been widely recognized, e.g., a disorder in the lipid composition of cells has been related to atherosclerosis-caused cardiovascular disease (CVD). Lipidomics analyses are characterized by a large, though not huge, number of mutually correlated measured variables, and their associations with outcomes are potentially of a complex nature. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe to conduct a permutation test on association scores resulting from partial least squares regression with multiply imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC) study, paying particular attention to the left-censored missing values typical for a wide range of data sets in life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state-of-the-art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD) patients: the patients who had a fatal CVD event and the ones who remained stable during a two-year follow-up.

Many complex systems present an intrinsic bipartite structure where elements of one set link to elements of the second set. In these complex systems, such as the system of actors and movies, elements of one set are qualitatively different from elements of the other set. The properties of these complex systems are typically investigated by constructing and analyzing a projected network on one of the two sets (for example the actor network or the movie network). Complex systems are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set, and this heterogeneity makes it very difficult to discriminate links of the projected network that merely reflect the system's heterogeneity from links that are relevant for unveiling the properties of the system. Here we introduce an unsupervised method to statistically validate each link of a projected network against a null hypothesis that takes into account system heterogeneity. We apply the method to a biological, an economic and a social complex system. The method we propose is able to detect network structures which are very informative about the organization and specialization of the investigated systems, and identifies those relationships between elements of the projected network that cannot be explained simply by system heterogeneity. We also show that our method applies to bipartite systems in which different relationships might have different qualitative nature, generating statistically validated networks in which such difference is preserved.
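
One way to test a projected link against a heterogeneity-aware null hypothesis is a hypergeometric tail probability on the number of shared partners. The abstract does not name the null model, so the hypergeometric choice below is an assumption, as are the function name `cooccurrence_p_value` and all parameter names.

```python
from math import comb

def cooccurrence_p_value(n_total, n_a, n_b, n_ab):
    """P(X >= n_ab) for X ~ Hypergeometric: the chance that elements with
    degrees n_a and n_b share at least n_ab of the n_total partners at random.
    Assumed null model; not confirmed by the abstract."""
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_total - n_a, n_b - x)
        for x in range(n_ab, upper + 1)
    )
    return tail / comb(n_total, n_b)

# Two actors with 3 movies each out of 10 movies, all 3 shared (toy numbers).
p_shared_all = cooccurrence_p_value(n_total=10, n_a=3, n_b=3, n_ab=3)
```

Because many pairs are tested at once, the resulting p-values would normally be compared against a threshold corrected for multiple testing before a link is kept.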

The field of social network analysis has received increasing attention during the past decades and has been used to tackle a variety of research questions, from prevention of sexually transmitted diseases to humanitarian relief operations. In particular, social network analyses are becoming an important component in studies of criminal networks and in criminal intelligence analysis. At the same time, intelligence analyses and assessments have become a vital component of modern approaches in policing, with policy implications for crime prevention, especially in the fight against organized crime. In this study, we have a unique opportunity to examine one specific Swedish street gang with three different datasets. These datasets are the most common information sources in studies of criminal networks: intelligence, surveillance and co-offending data. We use the data sources to build networks, and compare them by computing distance, centrality, and clustering measures. This study shows that different data sources about the same object of study have a fundamental impact on the results. The same individuals have different importance ranking depending on the dataset and measure. Consequently, the data source plays a vital role in grasping the complexity of the phenomenon under study. Researchers, policy makers, and practitioners should therefore pay greater attention to the biases affecting the sources of the analysis, and be cautious when drawing conclusions based on intelligence assessments and limited network data. This study contributes to strengthening social network analysis as a reliable tool for understanding and analyzing criminality and criminal networks.

Longitudinal data are common in clinical trials and observational studies, where missing outcomes due to dropouts are always encountered. In this context, under the assumption of missing at random, the weighted generalized estimating equation (WGEE) approach is widely adopted for marginal analysis. Model selection on marginal mean regression is a crucial aspect of data analysis, and identifying an appropriate correlation structure for model fitting may also be of interest and importance. However, the existing information criteria for model selection in WGEE have limitations, such as separate criteria for the selection of marginal mean and correlation structures, unsatisfactory selection performance in small‐sample setups, and so forth. In particular, few studies have developed joint information criteria for selection of both marginal mean and correlation structures. In this work, by embedding empirical likelihood into the WGEE framework, we propose two innovative information criteria, named a joint empirical Akaike information criterion and a joint empirical Bayesian information criterion, which can simultaneously select the variables for marginal mean regression and the correlation structure. In extensive simulation studies, these empirical‐likelihood‐based criteria exhibit robustness and flexibility, and outperform the other criteria, including the weighted quasi‐likelihood under the independence model criterion, the missing longitudinal information criterion, and the joint longitudinal information criterion. In addition, we provide a theoretical justification of our proposed criteria, and present two real data examples for further illustration.

Multiple imputation (MI) has emerged in the last two decades as a frequently used approach in dealing with incomplete data. Gaussian and log‐linear imputation models are fairly straightforward to implement for continuous and discrete data, respectively. However, in missing data settings that include a mix of continuous and discrete variables, the lack of flexible models for the joint distribution of different types of variables can make the specification of the imputation model a daunting task. The widespread availability of software packages that are capable of carrying out MI under the assumption of joint multivariate normality allows applied researchers to address this complication pragmatically by treating the discrete variables as continuous for imputation purposes and subsequently rounding the imputed values to the nearest observed category. In this article, we compare several rounding rules for binary variables based on simulated longitudinal data sets that have been used to illustrate other missing‐data techniques. Using a combination of conditional and marginal data generation mechanisms and imputation models, we study the statistical properties of multiple‐imputation‐based estimates for various population quantities under different rounding rules from bias and coverage standpoints. We conclude that a good rule should be driven by borrowing information from other variables in the system rather than relying on the marginal characteristics and should be relatively insensitive to imputation model specifications that may potentially be incompatible with the observed data. We also urge researchers to consider the applied context and specific nature of the problem, to avoid uncritical and possibly inappropriate use of rounding in imputation models.

In this paper, we evaluate the relative performance of competing approaches for estimating phylogenies from incomplete distance matrices. The direct approach proceeds with phylogenetic reconstruction while ignoring missing cells, whereas the indirect approach proceeds by estimating the missing distances prior to phylogenetic analysis. Two distinct indirect procedures based on the ultrametric inequality and the four-point condition are further compared. Using simulations, we show that more reliable results are obtained when such indirect methods are used. Expectedly, the phylogenies become less accurate as the percentage of missing cells increases, but combining different estimation methods greatly improves the accuracy. An application to bat phylogeny confirms the results obtained in the simulation study and illustrates the effect of missing distances in the construction of supertrees.
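
As an illustration of the indirect approach, a missing distance can be bounded with the ultrametric inequality d(i, j) <= max(d(i, k), d(k, j)) and estimated by the tightest such bound over all third taxa k. This is a minimal sketch under that assumption; the function name `fill_ultrametric` and the toy matrix are hypothetical, and the estimator used in the paper may differ in detail.

```python
def fill_ultrametric(dist):
    """Fill missing cells (None) of a symmetric distance matrix using
    d(i, j) = min over k of max(d(i, k), d(k, j)), based on observed cells only."""
    n = len(dist)
    filled = [row[:] for row in dist]
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] is None:
                bounds = [
                    max(dist[i][k], dist[k][j])
                    for k in range(n)
                    if k not in (i, j)
                    and dist[i][k] is not None
                    and dist[k][j] is not None
                ]
                if bounds:  # leave the cell missing if no third taxon helps
                    filled[i][j] = filled[j][i] = min(bounds)
    return filled

# Toy ultrametric distances for taxa A, B, C, D with d(A, C) missing.
dist = [
    [0, 2, None, 8],
    [2, 0, 8, 8],
    [None, 8, 0, 4],
    [8, 8, 4, 0],
]
filled = fill_ultrametric(dist)
```

The completed matrix can then be passed to any distance-based tree reconstruction method.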

The problem of missing data is often considered to be the most important obstacle in reconstructing the phylogeny of fossil taxa and in combining data from diverse characters and taxa for phylogenetic analysis. Empirical and theoretical studies show that including highly incomplete taxa can lead to multiple equally parsimonious trees, poorly resolved consensus trees, and decreased phylogenetic accuracy. However, the mechanisms that cause incomplete taxa to be problematic have remained unclear. It has been widely assumed that incomplete taxa are problematic because of the proportion or amount of missing data that they bear. In this study, I use simulations to show that the reduced accuracy associated with including incomplete taxa is caused by these taxa bearing too few complete characters rather than too many missing data cells. This seemingly subtle distinction has a number of important implications. First, the so-called missing data problem for incomplete taxa is, paradoxically, not directly related to their amount or proportion of missing data. Thus, the level of completeness alone should not guide the exclusion of taxa (contrary to common practice), and these results may explain why empirical studies have sometimes found little relationship between the completeness of a taxon and its impact on an analysis. These results also (1) suggest a more effective strategy for dealing with incomplete taxa, (2) call into question a justification of the controversial phylogenetic supertree approach, and (3) show the potential for the accurate phylogenetic placement of highly incomplete taxa, both when combining diverse data sets and when analyzing relationships of fossil taxa.

We consider the problem of estimating the marginal mean of an incompletely observed variable and develop a multiple imputation approach. Using fully observed predictors, we first establish two working models: one predicts the missing outcome variable, and the other predicts the probability of missingness. The predictive scores from the two models are used to measure the similarity between the incomplete and observed cases. Based on the predictive scores, we construct a set of kernel weights for the observed cases, with higher weights indicating more similarity. Missing data are imputed by sampling from the observed cases with probability proportional to their kernel weights. The proposed approach can produce reasonable estimates for the marginal mean and has a double robustness property, provided that one of the two working models is correctly specified. It also shows some robustness against misspecification of both models. We demonstrate these patterns in a simulation study. In a real‐data example, we analyze the total helicopter response time from injury in the Arizona emergency medical service data.
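
The imputation step described in this abstract can be sketched roughly as follows, assuming the two predictive scores have already been computed for every case. The Gaussian kernel, the `bandwidth` parameter, the function names, and the toy numbers are all assumptions for illustration; the abstract does not specify the kernel or how the scores are scaled.

```python
import math
import random

def kernel_weights(target, donors, bandwidth=1.0):
    """Normalized Gaussian-kernel weights of observed cases (donors) relative to
    one incomplete case, based on their pairs of predictive scores."""
    raw = [
        math.exp(-sum((t - d) ** 2 for t, d in zip(target, donor)) / (2 * bandwidth ** 2))
        for donor in donors
    ]
    total = sum(raw)
    return [r / total for r in raw]

def impute_once(target, donors, donor_outcomes, rng, bandwidth=1.0):
    """Draw one imputed value from the observed outcomes, with probability
    proportional to the kernel weights."""
    weights = kernel_weights(target, donors, bandwidth)
    return rng.choices(donor_outcomes, weights=weights, k=1)[0]

# Toy scores: (outcome-model score, missingness-model score) per case.
target_scores = (0.0, 0.0)
donor_scores = [(0.1, 0.0), (1.0, 1.0), (3.0, 3.0)]
donor_outcomes = [12.5, 20.0, 41.0]

weights = kernel_weights(target_scores, donor_scores)
imputed = impute_once(target_scores, donor_scores, donor_outcomes, random.Random(0))
```

Repeating `impute_once` several times per incomplete case would yield the multiple imputed data sets from which the marginal mean is estimated.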

The generalized estimating equation (GEE) has been a popular tool for marginal regression analysis with longitudinal data, and its extension, the weighted GEE approach, can further accommodate data that are missing at random (MAR). Model selection methodologies for GEE, however, have not been systematically developed to allow for missing data. We propose the missing longitudinal information criterion (MLIC) for selection of the mean model, and the MLIC for correlation (MLICC) for selection of the correlation structure in GEE when the outcome data are subject to dropout/monotone missingness and are MAR. Our simulation results reveal that the MLIC and MLICC are effective for variable selection in the mean model and selecting the correlation structure, respectively. We also demonstrate the remarkable drawbacks of naively treating incomplete data as if they were complete and applying the existing GEE model selection method. The utility of the proposed method is further illustrated by two real applications involving missing longitudinal outcome data.

Traditional measures of success for film, such as box-office revenue and critical acclaim, lack the ability to quantify long-lasting impact and depend on factors that are largely external to the craft itself. With the growing number of films that are being created and large-scale data becoming available through crowd-sourced online platforms, an endogenous measure of success that is not reliant on manual appraisal is of increasing importance. In this article we propose such a ranking method based on a combination of centrality indices. We apply the method to a network that contains several types of citations between more than 40,000 international feature films. From this network we derive a list of milestone films, which can be considered to constitute the foundations of cinema. In a comparison to various existing lists of ‘greatest’ films, such as personal favourite lists, voting lists, lists of individual experts, and lists deduced from expert polls, the selection of milestone films is more diverse in terms of genres, actors, and main creators. Our results shed light on the potential of a systematic quantitative investigation based on cinematic influences in identifying the most inspiring creations in world cinema. In a broader perspective, we introduce a novel research question to large-scale citation analysis, one of the most intriguing topics that have been at the forefront of scientific enquiries for the past fifty years and have led to the development of various network analytic methods. In doing so, we transfer widely studied approaches from citation analysis to the newly emerging field of quantification efforts in the arts. The specific contribution of this paper consists in modelling the multidimensional cinematic references as a growing multiplex network and in developing a methodology for the identification of central films in this network.

Missing data are a widely recognized nuisance factor in phylogenetic analyses, and the fear of missing data may deter systematists from including characters that are highly incomplete. In this paper, I used simulations to explore the consequences of including sets of characters that contain missing data. More specifically, I tested whether the benefits of increasing the number of characters outweigh the costs of adding missing data cells to a matrix. The results show that the addition of a set of characters with missing data is generally more likely to increase phylogenetic accuracy than decrease it, but the potential benefits of adding these characters quickly disappear as the proportion of missing data increases. Furthermore, despite the overall trend, adding characters with missing data does decrease accuracy in some cases. In these situations, the missing data entries are not themselves misleading, but their presence may mimic the effects of limited taxon sampling, which can positively mislead. Criteria are discussed for predicting whether adding characters with missing data may increase or decrease accuracy. The results of this study also suggest that accuracy can be increased to a surprising degree by (1) "filling the holes" in a data matrix as much as possible (even when relatively few taxa are missing data), and (2) adding fewer characters scored for all taxa rather than adding a larger number of characters known for fewer taxa. Missing data can also be eliminated from an analysis through the exclusion of incomplete taxa rather than incomplete characters, but this approach may reduce the usefulness of the analysis and (in some cases) the accuracy of the estimated trees.

Ecological networks are complexes of interacting species, but not all potential links among species are realized. Unobserved links are either missing or forbidden. Missing links exist, but require more sampling or alternative ways of detection to be verified. Forbidden links remain unobservable, irrespective of sampling effort. They are caused by linkage constraints. We studied one Arctic pollination network and two Mediterranean seed-dispersal networks. In the first, for example, we recorded flower-visit links for one full season, arranged data in an interaction matrix and got a connectance C of 15 per cent. Interaction accumulation curves documented our sampling of interactions through observation of visits to be robust. Then, we included data on pollen from the body surface of flower visitors as an additional link ‘currency’. This resulted in 98 new links, missing from the visitation data. Thus, the combined visit–pollen matrix got an increased C of 20 per cent. For the three networks, C ranged from 20 to 52 per cent, and thus the percentage of unobserved links (100 − C) was 48 to 80 per cent; these were assumed forbidden because of linkage constraints and not missing because of under-sampling. Phenological uncoupling (i.e. non-overlapping phenophases between interacting mutualists) is one kind of constraint, and it explained 22 to 28 per cent of all possible, but unobserved links. Increasing phenophase overlap between species increased link probability, but extensive overlaps were required to achieve a high probability. Other kinds of constraint, such as size mismatch and accessibility limitations, are briefly addressed.
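
The connectance arithmetic in this abstract can be reproduced on a toy interaction matrix. The 4 x 5 matrices below are invented so that the percentages match the 15 and 20 per cent quoted above; they are not the study's data, and the function name `connectance` is an assumption.

```python
def connectance(matrix):
    """Percentage of possible plant-animal links that are realized."""
    links = sum(1 for row in matrix for cell in row if cell)
    possible = len(matrix) * len(matrix[0])
    return 100.0 * links / possible

# Rows are plants, columns are animals; 1 marks an observed link (toy data).
visits = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
pollen = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]

# A link counts if it was seen in either 'currency'.
combined = [[int(v or p) for v, p in zip(rv, rp)] for rv, rp in zip(visits, pollen)]

c_visits = connectance(visits)      # visit data alone
c_combined = connectance(combined)  # visits plus pollen loads
unobserved = 100.0 - c_combined     # share of links still unobserved
```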

In recent years, discussion around memory modification interventions has gained attention. However, discussion around the use of memory interventions in the criminal justice system has been mostly absent. In this paper we start by highlighting the importance memory has for human well-being and personal identity, as well as its role within the criminal forensic setting; in particular, for claiming and accepting legal responsibility, for moral learning, and for retribution. We provide examples of memory interventions that are currently available for medical purposes, but that in the future could be used in the forensic setting to modify criminal offenders’ memories. In this section we contrast the cases of (1) dampening and (2) enhancing memories of criminal offenders. We then present, from a pragmatic approach, some pressing ethical issues associated with these types of memory interventions. The paper ends by highlighting how these pragmatic considerations can help establish ethically justified criteria regarding the possibility of interventions aimed at modifying criminal offenders’ memories.

We investigate the community structure of the global ownership network of transnational corporations. We find a pronounced organization in communities that cannot be explained by randomness. Despite the global character of this network, communities reflect first of all the geographical location of firms, while the industrial sector plays only a marginal role. We also analyze the meta-network in which the nodes are the communities and the links are obtained by aggregating the links among firms belonging to pairs of communities. We analyze the network centrality of the top 50 communities and we provide a quantitative assessment of the financial sector role in connecting the global economy.

Mirror-image doublets of Stylonychia mytilus include 2 sets of cortical structures, one with the normal "right-handed" (RH) arrangement, the other with a reversed "left-handed" (LH) arrangement. These sets, however, are incomplete, with certain structures, most notably cirri of the right marginal type, missing near the line of symmetry. When a mirror-image doublet is bisected longitudinally to separate the RH and LH components physically, each fragment undergoes a regeneration process that restores a complete set of cortical structures, including the previously missing cirri of the right marginal type. In the resulting LH cell, all ciliary structures are present in an arrangement that is globally reversed in relation to that found in RH cells; in particular, marginal cirri of the left-marginal type are formed at the cell's right margin, and marginal cirri of the right-marginal type are produced at the cell's left margin. Whereas the regenerated RH fragment always divides and initiates a clone of normal singlets, the LH fragment, though structurally nearly complete, in all cases eventually dies without dividing. The cause of death is starvation due to the formation of an abnormal oral apparatus. In the Discussion, we consider the nature and consequences of a reversal of global positional information.

