
BP301: Q: What’s Your Second Most Valuable Asset and Nearly Doubles Every Year?


A: Data! But do you know where this data is duplicated, by whom and exactly how it’s scattered across laptops, desktops, file servers and IBM Domino databases?

Let us show you how to analyze local drives, network drives and server-based apps to get a grasp of what data is out there and what it means to your business. Learn how to collect, aggregate and analyze file sizes and types, as well as identify knowledge sharing patterns. This session will empower you to work towards reducing your data storage costs and increasing collaboration efficiency!

Published in: Software


  1. BP 301: What’s your second most valuable asset and nearly doubles every year?
     Henning Kunz, panagenda Consulting
     Florian Vogler, panagenda
  2. Introduction
     • Henning Kunz
       - Services and consulting in the collaboration space for about 20 years
       - More infrastructure than development
       - At panagenda, increasingly analytics as a basis for agile transformation projects
     • Florian Vogler
       - Client management guru for almost all his life
       - Development and infrastructure
       - panagenda's visionary figurehead
  3. Agenda
     • Speaking of the 2nd most valuable asset and introduction
     • Why are we doing this?
     • Where in the world are files?
     • Collecting BIG data: basics
     • Statistics: basics
     • Collecting from the file system
     • Collecting from IBM Notes & Domino
     • Sample reports
     • Possibilities are endless (this session is not)
  4. Before we start with the introduction
     • Answer to the 2nd most valuable asset
     • 1st most valuable asset?
  5. What can you expect from this session?
     • Thoughts on a company's file inventory
     • Some code snippets to gain inventory information
     • The demo is based on inventory information collected from our personal production notebooks (and a demo backend system) using the code snippets
       - Visualization is prepared using a visual analytics tool
     • Some ideas on how to use the outcome
  6. FILES ARE EVERYWHERE
  7. A file – from easy …
     • In the easiest sense, a file has
       - a potentially mind-boggling number of attributes, e.g.
         • folder structure
         • filename
         • size
       - Content (which may result in attributes, too)
  8. A file – … to complex
     • Content is king!
       - Zip files: header vs. files vs. file
         • Zipping the same files twice yields a different hash for each of the two zip files …
       - Office files (pptx, xlsx, …)
         • Contain a lot of information "inside"
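The zip remark on this slide can be reproduced with a small Python sketch (Python is used here for illustration only; the session's own snippets are PowerShell). It assumes the only difference between the two archives is the timestamp stored in each member's header; the file names and contents are made up.

```python
import hashlib
import io
import zipfile


def make_zip(files, date_time):
    """Build an in-memory zip; date_time is written into every member header."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def member_hashes(zip_bytes):
    """MD5 of each member's uncompressed content, keyed by member name."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {n: hashlib.md5(zf.read(n)).hexdigest() for n in zf.namelist()}


# Hypothetical sample content, zipped "twice" on two different days.
files = {"report.txt": b"quarterly numbers", "notes.txt": b"meeting notes"}
zip_a = make_zip(files, (2015, 1, 26, 9, 0, 0))
zip_b = make_zip(files, (2015, 1, 27, 9, 0, 0))

# The container hashes differ, the hashes of the contained files do not.
archives_differ = hashlib.md5(zip_a).hexdigest() != hashlib.md5(zip_b).hexdigest()
contents_match = member_hashes(zip_a) == member_hashes(zip_b)
print(archives_differ, contents_match)
```

Hashing the members rather than the container is one way to recognise that two archives carry the same files.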
  9. Why are we doing this (= why are files so important / interesting)?
     • Storage amount = storage (and backup!) cost
       - Increase free disk space, reduce cost
       - Beware of DAOS, Centera, … before you get too excited
     • Understand which (types of) files are created (rather: originated), updated, …
     • … and by whom
     • Identify knowledge / working-together clusters
     • Social Business
     Going further (not covered in this session)
     • Security & Compliance
     • Content
     • Beyond Windows (Linux, Mac, Mobile, …)
  10. Mostly for French and German attendees
     • Some of the use cases and examples covered could be a problem with regard to works council regulations
     • Rethink the use case without end user information
       - E.g. instead of "who has (created) PowerPoint files", ask "how many PowerPoint files do we have across how many users" (min/avg/max, without information about actual end users)
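A minimal Python sketch of the anonymised variant described on this slide. The record layout (owner, filename) and the sample values are assumptions for illustration; only aggregate figures leave the function, and users without any PowerPoint file are not counted.

```python
from collections import Counter
from statistics import mean


def pptx_summary(records):
    """records: iterable of (owner, filename) pairs. Owner names are used
    only for grouping and never appear in the returned summary."""
    per_owner = Counter(
        owner
        for owner, name in records
        if name.lower().endswith((".ppt", ".pptx"))
    )
    counts = list(per_owner.values())
    if not counts:
        return {"files": 0, "users": 0, "min": 0, "avg": 0, "max": 0}
    return {
        "files": sum(counts),
        "users": len(counts),
        "min": min(counts),
        "avg": mean(counts),
        "max": max(counts),
    }


# Hypothetical inventory rows.
sample = [
    ("u1", "a.pptx"), ("u1", "b.pptx"), ("u1", "c.ppt"),
    ("u2", "d.pptx"), ("u2", "x.docx"), ("u3", "y.txt"),
]
summary = pptx_summary(sample)
print(summary)
```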
  11. For everyone: Things to be aware of
     • The name of a file (or folder) can be a big problem on its own
       - 2015-01-27_money_transfers_to_carribean_account_789XA3_PW_richmaker.xls
       - Layoff_in_german_office_Q2_2015.docx
       - Increase_salary_of_mr_jones_to_200000.txt
     • The mere existence of a file (or folder) can create (at least an ethical) problem on its own
       - On someone's laptop you find confidential, unauthorized, inappropriate information
         • e.g. internal DWG (CAD) files, a copy of the minutes from the last meeting of the board of management, customer data, performance figures, …
       - And now?
  12. Where files are stored
     • "Local" file system
       - "Fixed" disks (C:, D:, …)
       - Local removable disks: A:, B:, USB sticks, CD-ROM, …
     • Network file system
       - Mounted / mapped / UNC / synched (offline files)
       - File server
     • NSFs (Email / Applications)
       - Local (with or without consistent ACL, with or without DB level encryption)
       - Server
       - Beware of reader fields, author fields, …
     • Connections Files, FileNet, Documentum, SharePoint, Dropbox, Teamdrive, …
  13. How to collect: WYSIWYG or AYCE
     • "WYSIWYG"
       - Local execution = in context of the current OS user
         • Other users have to log in, too (may never happen)
       - Network scanning in context of the current OS user
         • Shared network drives across departments/company
     • "AYCE"
       - Local execution as Admin (e.g. with SuRunAs)
         • Includes Windows profiles from all users
       - Batch network scanning
       - Root mount scanning
  14. What to collect
     • Simple file attributes
       - Name, "extension", size, created, last modified, … (dates and time zoning!)
     • Complex (but much more useful) file attributes
       - Office properties like Author, Subject, last printed, last whatever, …
       - Zip / Rar / 7z / gzip / …
       - Hash, e.g. MD5 (same vs. similar)
     • Very complex file attributes
       - Security (R/W/…): NSF & file system
       - Fingerprints ("Linux magic numbers")
     • Hilariously complex: content (also: similar instead of just same)
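The "magic numbers" fingerprint can be sketched in Python as a lookup on the first bytes of a file, independent of its extension. The four signatures below are well-known ones; the table is deliberately tiny and the labels are illustrative.

```python
# Leading byte signatures of a few common formats.
MAGIC_NUMBERS = [
    (b"PK\x03\x04", "zip container (also docx, xlsx, pptx)"),
    (b"%PDF", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\x1f\x8b", "gzip"),
]


def sniff(header: bytes) -> str:
    """Return a label for the first matching signature, else 'unknown'."""
    for magic, label in MAGIC_NUMBERS:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 8 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff(fh.read(8))


print(sniff(b"%PDF-1.4"))
print(sniff(b"PK\x03\x04\x14\x00"))
```

A renamed file (for example a zip saved as .txt) keeps its signature, which is what makes the fingerprint more telling than the "extension" attribute.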
  15. Mission impossible
     • "Impossible" file attributes
       - Not accessible
       - Not visible from the viewpoint of the scanner
       - Not used (e.g. multiuser PCs where a user doesn’t log on again)
       - Encrypted (e.g. Zip with password)
  16. Examples of what not to do
     • Do not harm human beings, animals, plants or goods with your findings
       - Be good, do good, be a hero!
     • Do not analyze for files with same filename
       - Approx. 60-70% of all files on a single machine
     • Do not just delete duplicates
     • Also: do not do nothing
  17. A VERY SHORT STATISTICS PITCH
  18. Frequency distribution
     • In statistics, a frequency distribution is a table that displays the frequency of various outcomes in a sample.
     • E.g. session survey feedback by 100 session participants:

       Answer                             Count
       Speaker skill was brilliant           15
       Speaker skill was good                60
       Speaker skill was ok                  12
       Speaker skill was somewhat poor        8
       Speaker skill was very poor            5
  19. Grouped data
     • A raw dataset can be organized by constructing a table showing the frequency distribution of the variable (whose values are given in the raw dataset). Such a frequency table is often referred to as grouped data.
     • E.g. time taken to answer a survey by 15 participants
     • Sorted in symmetric intervals (bins) or qualitative characteristics

       Time taken [s]: 10 11 9 10 14 20 11 9 14 10 9 13 12 21 24

       Interval        Count
       <5s                 0
       5s<=t<10s           3
       10s<=t<15s          9
       15s<=t<20s          0
       20s<=t<25s          3

       Interval             Count
       Fast    <10s             3
       Normal  10s<=t<20s       9
       Slow    >=20s            3
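The two groupings on this slide can be reproduced with a few lines of Python. The function names are illustrative; the data are the 15 values from the slide.

```python
from collections import Counter

# Raw data from the slide: time taken [s] by 15 participants.
times = [10, 11, 9, 10, 14, 20, 11, 9, 14, 10, 9, 13, 12, 21, 24]


def bin_label(t, width=5):
    """Symmetric bins of the given width, labelled like the slide's table."""
    lower = (t // width) * width
    return f"{lower}s<=t<{lower + width}s"


def speed_class(t):
    """Qualitative grouping with the slide's thresholds (10 s and 20 s)."""
    if t < 10:
        return "Fast"
    if t < 20:
        return "Normal"
    return "Slow"


symmetric = Counter(bin_label(t) for t in times)
qualitative = Counter(speed_class(t) for t in times)

# Empty bins simply do not appear; a Counter reports 0 for missing keys.
print(symmetric)
print(qualitative)
```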
  20. Histogram
     A histogram is a graphical representation of the distribution of data. To construct a histogram, the first step is to "bin" the range of values and then count how many values fall into each interval.
     E.g. time needed in [s] to rush from Dolphin Southern Hemisphere 1 to Swan Mockingbird 1-2 (sample of 50 participants)

     Collect/Measure (rush time [s]):
       197 187 186 179 156 179 181 173 188 188
       163 202 174 178 193 169 192 170 185 172
       192 169 179 174 164 181 161 137 204 167
       198 185 186 148 148 185 197 231 175 184
       176 175 176 187 210 180 174 180 204 158

     Bin and Count:
       Rushtime [s]   Count
       140                1
       150                2
       160                5
       170               10
       180               13
       190               11
       200                6
       210                1
       220                0
       230                1

     [Chart: histogram of Count (0 to 14) over Rushtime [s] (140 to 230)]
  21. SCAN FILESYSTEMS
  22. Local
     • Scan local Windows based drives (locally mounted hard disks, portable drives or mounted)
     • Using PowerShell
       - Script 1. Collect file system information with MD5 and SHA1 hashes
       - Needs PowerShell V4
       - Uses: Scripting.FileSystemObject, get-acl cmdlet, get-hash cmdlet
       - Run locally with ‘super user’ rights
     • 3 result files
       - Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
       - ACLs (Folder Path, IdentityReference, AccessControlType)
       - Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1)
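The "Files" result described on this slide can be approximated with the Python sketch below. It is not the session's PowerShell Script 1, only an illustration of the same idea under stated assumptions: the column names are taken from the slide, timestamps are normalised to UTC, unreadable files are skipped, and the demo file at the end is made up. The Folders and ACLs results are not covered.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def hash_file(path, chunk_size=1 << 20):
    """Compute MD5 and SHA1 in one pass, reading the file in chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def utc_iso(timestamp):
    """Store times in UTC so scans from different time zones stay comparable."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def scan_files(root):
    """Yield one record per readable file below root."""
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                st = os.stat(path)
                md5, sha1 = hash_file(path)
            except OSError:
                continue  # not accessible from the scanner's context
            yield {
                "FolderPath": folder,
                "FileName": name,
                # st_ctime is the creation time on Windows only;
                # on Unix it is the last metadata change.
                "CreationTime": utc_iso(st.st_ctime),
                "LastWriteTime": utc_iso(st.st_mtime),
                "Size": st.st_size,
                "Extension": os.path.splitext(name)[1].lower(),
                "MD5": md5,
                "SHA1": sha1,
            }


# Small demo on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "Demo.TXT"), "wb") as fh:
        fh.write(b"hello")
    records = list(scan_files(tmp))

print(records[0]["FileName"], records[0]["Size"], records[0]["MD5"])
```

Writing the records with `csv.DictWriter` would produce a file comparable to the slide's "Files" result.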
  23. A short note on PowerShell Execution Policy
     • There is something like execution security in PowerShell
     • Execution Policy is undefined by default
       - On Windows clients this is treated as Restricted: individual commands run from the console, but scripts do not
     • Policy types
       - Restricted, AllSigned, RemoteSigned, Unrestricted, Bypass, Undefined
     • Scope
       - LocalMachine, CurrentUser, Process
  24. A short note on PowerShell Execution Policy
     • To see the current settings: Get-ExecutionPolicy -List
     • To set: Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
     • RemoteSigned allows execution of "own" unsigned scripts
       - "own" means scripts written/edited/saved in PowerShell ISE on the local machine
       - We will not talk about signing PowerShell scripts in this session; it's not like "sign using current user's id"
     http://technet.microsoft.com/en-us/library/hh847748.aspx
  25. PowerShell Snippet
  26. Enhancement: Collecting Office attributes for .doc* files
     • Scan local Windows based drives (locally mounted hard disks, portable drives or mounted)
     • Using PowerShell
       - Script 2. Collect file system information with MD5 and SHA1 hashes and .doc* attributes
       - Uses: -ComObject Word.Application BuiltInDocumentProperties
     • 3 result files
       - Folders (Folder Path, LastWriteTime, Size, FileCount, Depth, FolderName)
       - ACLs (Folder Path, IdentityReference, AccessControlType)
       - Files (Folder Path, FileName, CreationTime, LastWriteTime, Size, Extension, MD5, SHA1, Created, Author, Title, Last print date)
  27. Snippet 2
     BuiltinDocumentProperties
        1 Title               11 Creation date          21 Company
        2 Subject             12 Last save time         22 Number of bytes
        3 Author              13 Total editing time     23 Number of lines
        4 Keywords            14 Number of pages        24 Number of paragraphs
        5 Comments            15 Number of words        25 Number of slides
        6 Template            16 Number of characters   26 Number of notes
        7 Last author         17 Security               27 Number of hidden Slides
        8 Revision number     18 Category               28 Number of multimedia clips
        9 Application name    19 Format                 29 Hyperlink base
       10 Last print date     20 Manager                30 Number of characters (with spaces)
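For .docx files, some of these properties can also be read without Word, because the file is a zip container whose `docProps/core.xml` part holds the core properties. The Python sketch below is an alternative illustration, not the session's COM-based Script 2. It covers only Title, Author, Created and Last print date, does not work for legacy .doc files, and raises `KeyError` if the part is missing. The in-memory demo document is made up.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the OOXML core properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Column name on the slide -> element in docProps/core.xml
FIELDS = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Created": "dcterms:created",
    "Last print date": "cp:lastPrinted",
}


def core_properties(docx):
    """docx: path or binary file-like object of an OOXML document."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        label: root.findtext(tag, default="", namespaces=NS)
        for label, tag in FIELDS.items()
    }


# Demo: a minimal stand-in container with a hypothetical core.xml.
core_xml = (
    '<cp:coreProperties xmlns:cp="%(cp)s" xmlns:dc="%(dc)s" '
    'xmlns:dcterms="%(dcterms)s">'
    "<dc:title>Plan</dc:title><dc:creator>Jane Doe</dc:creator>"
    "<dcterms:created>2015-01-27T09:00:00Z</dcterms:created>"
    "</cp:coreProperties>" % NS
)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("docProps/core.xml", core_xml)
demo.seek(0)

props = core_properties(demo)
print(props)
```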
  28. Collecting inventory from "Fileserver 2.0"
     • Scan SharePoint inventory
     • Using PowerShell
       - Script 3. Collect item information from SharePoint Server
       - Uses: SharePoint cmdlets
       - Result: Web Application, Site, Web, List, Item ID, Item URL, Item Title, Item Created, Item Modified, File Size, Author, Versions, Filename
  29. Snippet 3
  30. SCAN FILES IN NSF CONTAINERS
  31. IBM Notes & Domino
     • NSFs (Email / Applications)
       - Local (with or without consistent ACL, with or without DB level encryption)
       - Server
       - ACL, reader fields, author fields, document / field encryption, …
       - zip-file content
       - Fields in general (Subject, from, to, cc:, bcc:, created, modified, Body, …)
         • The Subject of a Notes document can be just as problematic as the name of a file (attachment)
         • Actually this may apply to pretty much any field
         • Note: Message Tracking ID – ATTNQ# (today's *00#.*)
  32. Fs_free_main.exe ConnectED 2015 Edition
     • Special stand-alone version to scan local file system and nsf files
     • Inspects zip file content (deliberately limited to filesystem)
     • Runs from command line with parameters
       - Uses local notes.ini and user.id / server.id
       - Therefore in security context of the used id-file (ACLs, Reader Fields, DB/Document Encryption)
       - Lists (unprotected) zip file content
       - Based on C-API
     • Result: Path, Size, Modified, md5, sha-1
  33. CHART TIME …. EXAMPLE RESULTS DEMO …
     Script 1: 16,728 folders, 127,000 files
     Script 2: 1,150 doc files
     Script 3: 1,316 SP files
     Fs_free_main: 1,200,000 records (250 MB)
  34. POSSIBILITIES ARE ENDLESS….
  35. Beyond the shown
     • Until now we just analyzed what's out there
     • How could we use that information?
     • Let's think about some interesting use cases
  36. File Server Migrations – File Consolidations
     • Use the analysis to understand your file inventory
     • With respect to
       - File types: which files fit into the target system (e.g. office files, pdf, jpg, png, wav versus xml, properties, files from non-office applications)
       - And their
         • Volume distribution
         • Count distribution
       - Uniqueness of local files
       - Time stamps (retention, usage hint)
     • And act/size based on that information
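A small Python sketch of the count and volume distribution per file type, split by whether the type fits the target system. The set of accepted extensions and the sample inventory are hypothetical; a real list depends on the target system.

```python
import os
from collections import defaultdict

# Hypothetical list of types the target system accepts.
TARGET_TYPES = {".docx", ".xlsx", ".pptx", ".pdf", ".jpg", ".png", ".wav"}


def distribution_by_extension(files):
    """files: iterable of (filename, size_in_bytes).
    Returns {extension: {"count": n, "bytes": total}}."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for name, size in files:
        ext = os.path.splitext(name)[1].lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["bytes"] += size
    return dict(stats)


def split_by_target(stats, accepted=TARGET_TYPES):
    """Separate the distribution into types that fit the target and the rest."""
    fits = {e: s for e, s in stats.items() if e in accepted}
    other = {e: s for e, s in stats.items() if e not in accepted}
    return fits, other


# Made-up inventory rows.
inventory = [
    ("plan.docx", 120_000), ("Budget.DOCX", 80_000),
    ("scan.pdf", 500_000), ("build.xml", 4_000), ("README", 1_000),
]
stats = distribution_by_extension(inventory)
fits, other = split_by_target(stats)
print(fits)
print(other)
```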
  37. Suggest Community Clusters
     • Based on analysis outcomes
       - Inventory overlap
       - Same authors, editors
       - Same access rights
       - Metadata
     • Think of it as a one-time functionality to rearrange your files world in the first step
     • Could be used in the context of an attachment like SwiftFile* in the second step
       - may require content analysis
     * http://www-01.ibm.com/support/docview.wss?uid=swg24034409
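One possible reading of "inventory overlap" is the share of identical file hashes between two scanned locations. The Python sketch below uses the Jaccard index for that. The measure, the threshold and the sample inventories are assumptions for illustration, not something the session prescribes.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_pairs(inventories, threshold=0.5):
    """inventories: {location: set of file hashes}.
    Returns (location1, location2, score) for pairs at or above the
    threshold, highest overlap first."""
    pairs = []
    for n1, n2 in combinations(sorted(inventories), 2):
        score = jaccard(inventories[n1], inventories[n2])
        if score >= threshold:
            pairs.append((n1, n2, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Made-up inventories; short strings stand in for MD5 hashes.
inventories = {
    "laptop-a": {"h1", "h2", "h3"},
    "laptop-b": {"h2", "h3", "h4"},
    "share-x": {"h9"},
}
candidates = overlap_pairs(inventories)
print(candidates)
```

Pairs with a high score are candidates for a shared community; the location names could be replaced by anonymised identifiers where end user information must not be shown.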
  38. Company File Locations
     • You do not have to store this file again….
     • As a hint for a so far unknown collaboration cluster / community
     • Used in the context of an attachment inside Notes
       - Shows all MD5-identical files found at formerly scanned locations inside the company
     • Biggest challenges
       - Real-time performance (needs ongoing periodic scanning of all sources)
       - Security trimming (the accounts & groups of all scanned sources have to be resolved/mapped)
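The lookup behind "you do not have to store this file again" can be sketched in Python as an index from MD5 to the locations where that content was seen in earlier scans. The class name and the sample paths are made up, and security trimming (hiding locations the asking user may not see) is deliberately left out.

```python
import hashlib
from collections import defaultdict


class LocationIndex:
    """Maps an MD5 hash to every location where that content was scanned."""

    def __init__(self):
        self._by_md5 = defaultdict(set)

    def add(self, md5, location):
        """Register one scan result (hash plus where it was found)."""
        self._by_md5[md5].add(location)

    def known_locations(self, data: bytes):
        """Return the sorted locations that already hold identical content."""
        digest = hashlib.md5(data).hexdigest()
        return sorted(self._by_md5.get(digest, ()))


# Made-up scan results for one attachment.
attachment = b"quarterly report"
index = LocationIndex()
index.add(hashlib.md5(attachment).hexdigest(), r"\\fileserver\sales\report.pdf")
index.add(hashlib.md5(attachment).hexdigest(), r"C:\Users\demo\report (1).pdf")

print(index.known_locations(attachment))
print(index.known_locations(b"something new"))
```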
  39. THANK YOU
     NOTE: POSSIBILITIES ARE ENDLESS, MORE SO BEYOND FILES
     florian.vogler@panagenda.com, henning.kunz@panagenda.com
     Come and visit us in the TechnOasis #PED G3 A-C!
     Download the latest slide deck and code snippets: www.panagenda.com/connected2015files
