\section{Data Management Editing notes}
\subsection{A few notes by Maxim}
Dear All,
I created this file because I wanted to write down a few thoughts but didn't want to obfuscate
the latex source for this section or create a conflict at check-in.
Maxim -mxp-
\begin{itemize}
\item Do we need to define the file catalog as something complementary to ``true'' metadata? In the sense that it is a translation to path/location but does not necessarily carry other info. [awb] But it may contain a lot of metadata. For example, the Fermi Space Telescope Data Catalog contains a large set of metadata which enables users to do queries. In that sense, the Data Catalog is (just) a meta-database, i.e.\ a convenient way to organize metadata.
[mhk] I agree with Anders that one of the things we should discuss is the usefulness of file catalogs that allow the selection of a subset of files based on metadata. From my experience, I have the impression that this feature is more important than the translation of filename $\rightarrow$ location. I haven't used anything other than SAM from Fermilab over the years, but I feel that this is its best feature: querying the meta-database to get only the files that I want to process. And really, it should be separated out from location determination. (I think that it actually does have the capability to use several different backends for file delivery, but the delivery of a URI and filename is also part of the default information returned.) A minimal sketch of this kind of metadata-driven selection appears after this list.
\item Evolution of Metadata - do we need to mention event indexing?
[mhk] Not sure what you mean by event indexing? Is this a listing of the event or frame numbers available within a file? I know that indexing the range of run and event numbers in files has been very useful for me in the past, but the listing of all of the events hasn't been required. And with xrootd and sparse access protocols coming, I'm not sure it would be necessary. But I'm happy to hear about other use cases where it was useful to have the events indexed in the metadata.
\item File catalog - right now there is always a centralized DB which may become a bottleneck in some instances.
Is it possible or desirable to have a truly distributed DB for better availability and speed? Consistency is the main problem.
[mhk] I know of at least one instance where a one-time replication of a file catalog to a remote site (and a matching relocation of a large dataset) became necessary due to bandwidth limits. The use case was to have a remote site reprocess archived data with new versions of the reconstruction and then, once completed, transfer everything back to what you could call the Tier-0 location (this was CDF, not the LHC). Anyway, this is a very different situation than the truly distributed DB you asked about. I think that this model worked well, and I'm not sure how much difficulty there has been with the CMS/ATLAS/LHCb/etc.\ file catalogs becoming the bottleneck for processing data. Also, I thought that putting squids in front of the database would keep it from becoming a bottleneck?
\item Coordinated push to sites vs on-demand access via network (cf. xrootd) - to which extent can the latter dominate technology choices?
[mhk] I had the impression that there wasn't much push to go beyond Tier-1 data distribution, and that network speeds have really made delivering data to the jobs the most reasonable approach when weighing personpower against computing efficiency. A small sketch of the pre-placement vs.\ on-demand choice also appears after this list.
\item Do we need to take a look at Globus Online and its possible limitations with regard to policy/pricing/scalability?
\item I already covered the mutual dependency of three domains in my WMS section:
\begin{itemize}
\item Workflow mgt
\item Workload mgt
\item Data mgt
\end{itemize}
There needs to be a tie-in in this section as well. Data management is congruent with the data component of workflow, but is ultimately translated into workload and into what the latter does with the data.
\item Most systems deployed at scale in HEP are more or less experiment-specific.
\item Network performance monitoring and its role in data distribution.
\end{itemize}
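As an illustration of the metadata-driven file selection discussed in the first item (the SAM-style use case), here is a minimal sketch in Python, assuming a hypothetical SQLite-backed catalog with a \texttt{files} table whose columns (\texttt{filename}, \texttt{run}, \texttt{data\_tier}, \texttt{site}, \texttt{path}) are purely illustrative and not the schema of any real catalog:
\begin{verbatim}
# Minimal sketch of metadata-driven file selection against a hypothetical
# catalog; the 'files' table and its columns are illustrative only.
import sqlite3

def select_files(db_path, run_min, run_max, data_tier):
    """Return (filename, site, path) for files matching the metadata query."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT filename, site, path FROM files "
            "WHERE run BETWEEN ? AND ? AND data_tier = ?",
            (run_min, run_max, data_tier),
        )
        return cur.fetchall()
    finally:
        conn.close()

# Example: pick reconstructed files for a run range; resolving where the
# replicas actually live is kept separate from the metadata query itself.
# files = select_files("catalog.db", 2000, 2100, "reco")
\end{verbatim}
The point of the sketch is the separation argued for above: the query is over metadata only, and the returned locations (or URIs) are a distinct, second concern.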
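Similarly, for the push-vs.-on-demand question, the following toy function shows the two access patterns side by side; all site names, hosts, and paths are made up, and no specific catalog or xrootd client API is implied:
\begin{verbatim}
# Toy sketch contrasting coordinated pre-placement (the job reads a replica
# already staged at its own site) with on-demand access (the job is handed
# a remote root:// URI and streams the data over the network). Site names,
# hosts, and paths are all invented for illustration.
REPLICAS = {
    "run02045_reco_v12.root": [
        {"site": "site-A", "path": "/data/exp/reco/run02045_reco_v12.root"},
    ],
}

def input_uri(filename, local_site, redirector="root://xrootd.example.org/"):
    """Prefer a locally staged replica; otherwise fall back to streaming."""
    for replica in REPLICAS.get(filename, []):
        if replica["site"] == local_site:
            return replica["path"]            # pre-placed by the DM system
    return redirector + "/store/" + filename  # on-demand, cf. xrootd
\end{verbatim}
To what extent the second branch can simply become the default is exactly the technology question raised above.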
\subsection{Original ``Editing notes''}
\fixme{This is not a real section and will be deleted}
\begin{itemize}
\item Define data management and its major elements (one suggestion is below)
\item Reference that it is related to workflow management
\item List ways that DM can tend to be particularly parochial and how in some cases it must be specifically tailored to the computing facility and/or experiment.
\item List what elements are general.
\item Take care not to overlap with the systems group, but to point out where there are areas of overlap.
\end{itemize}
Possible partitioning of DM into smaller parts:
\begin{description}
\item[Distribution] issues:
\begin{itemize}
\item authentication/authorization
\item caching / purging
\item site hardware and software requirements
\item on-demand vs. scheduled
\end{itemize}
\item[Metadata] issues:
\begin{itemize}
\item What job produced what file and with what input parameters/data
\item Where is my file? Data catalogs ...
\item Fileset definitions
\item File popularity to drive cache purges
\item Important analysis summary info
\item Locating files/events/objects
\item Everything as metadata
\item Provenance tracking
\item Namespace issues
\end{itemize}
\end{description}
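To make the metadata items above a bit more concrete, here is a minimal sketch of what a single catalog record combining several of them might look like; every field name is hypothetical and meant only to show provenance, location, fileset, and popularity information living side by side:
\begin{verbatim}
# Sketch of one file-catalog record; all field names are hypothetical.
example_record = {
    "filename": "run02045_reco_v12.root",
    "fileset": "expX_reco_pass2",             # fileset / dataset membership
    "replicas": [                             # "where is my file?"
        {"site": "site-A", "path": "/data/exp/reco/run02045_reco_v12.root"},
        {"site": "site-B", "path": "/archive/reco/run02045_reco_v12.root"},
    ],
    "provenance": {                           # what job produced it, and how
        "job_id": 918273,
        "parents": ["run02045_raw.root"],
        "software_version": "reco-v12",
        "parameters": {"calib_tag": "v5"},
    },
    "run_range": [2045, 2045],                # locating files by run/event
    "access_count_30d": 17,                   # popularity, e.g. for cache purges
}
\end{verbatim}
Whether such records live in one central database or are distributed across sites ties back to the consistency and bottleneck questions in the notes above.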